In this post I’m doing some topic modelling. Topic modelling is a way of finding abstract topics in a collection of documents. I’m using the Sherlock Holmes stories and trying to find out how much each word contributes to telling what a story is about. One way to think about it: certain words play an important role in defining what a story is about, and the frequency of those words plays a vital role in our task.

I’m using the gutenbergr package to download The Adventures of Sherlock Holmes (Project Gutenberg ID 1661), a collection of 12 stories.

library(gutenbergr)
library(tidyverse)
library(tidytext)
library(stm)

sherlock_raw<-gutenberg_download(1661)

sherlock<-sherlock_raw %>% 
  mutate(story = ifelse(str_detect(text,"ADVENTURE"),text,NA), # story title lines contain "ADVENTURE"
         line = row_number()) %>% 
  fill(story) %>%  # carry each title down over that story's lines
  filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") # drop the front matter before story I

A few steps have been done to prepare the data, but it is still not in a tidy format, so let’s tidy it.

sherlock_tidy<-sherlock %>%
  unnest_tokens(word,text) %>%  # one lowercase word per row
  anti_join(stop_words) %>%     # remove common stop words
  filter(!word %in% c("holmes", "sherlock")) # unnest_tokens lowercases, so these two suffice

One way to get a sense of what a document is about is to analyse its most frequently used words. Here I’m finding the 5 most frequently used words in each story. The count of a word within a document is called its term frequency (tf).

top_5_words<-sherlock_tidy %>% 
  count(story, word) %>% 
  group_by(story) %>% 
  top_n(5) %>% 
  ungroup()
top_5_words
## # A tibble: 65 x 3
##    story                               word           n
##    <chr>                               <chr>      <int>
##  1 ADVENTURE I. A SCANDAL IN BOHEMIA   house         15
##  2 ADVENTURE I. A SCANDAL IN BOHEMIA   king          17
##  3 ADVENTURE I. A SCANDAL IN BOHEMIA   majesty       16
##  4 ADVENTURE I. A SCANDAL IN BOHEMIA   photograph    21
##  5 ADVENTURE I. A SCANDAL IN BOHEMIA   street        18
##  6 ADVENTURE II. THE RED-HEADED LEAGUE business      19
##  7 ADVENTURE II. THE RED-HEADED LEAGUE headed        19
##  8 ADVENTURE II. THE RED-HEADED LEAGUE red           29
##  9 ADVENTURE II. THE RED-HEADED LEAGUE time          19
## 10 ADVENTURE II. THE RED-HEADED LEAGUE wilson        20
## # ... with 55 more rows

We can visualize our results.

top_5_words %>% 
  mutate(word = reorder_within(word,n,story)) %>%
  ggplot(aes(word,n, fill=story))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~story, scales = "free_y")+
  scale_x_reordered() +
  coord_flip()

Going by raw word frequency, we will find lots of frequent words within each story, but many of them say little about what that particular story is about. A more sophisticated measure is the inverse document frequency (idf) of a word: the natural log of the number of documents divided by the number of documents containing that word, i.e. idf(word) = ln(n_documents / n_documents_containing_word).

Suppose a word is found in all the documents; then this ratio is 1 and the natural log of 1 is 0. So words that are present in most of the documents get a low value, which shows they are not suitable for describing any single document. But if a word appears in only one or two documents, the ratio will be greater than 1, its natural log greater than 0, and so the word gets a higher value.
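To make this concrete, here is the idf arithmetic for our collection of 12 stories as a tiny hand-rolled sketch (the helper below is my own illustration, not how tidytext computes it internally):

n_docs<-12
idf<-function(docs_with_word) log(n_docs/docs_with_word)

idf(12) # a word found in every story: log(1) = 0
idf(1)  # a word found in a single story: log(12), about 2.48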

The tidytext package has a bind_tf_idf() function which we can use to calculate this value: tf-idf is the product of term frequency (tf) and inverse document frequency (idf). We can find the top 10 words for each document, the ones most characteristic of it, and visualize the result.

sherlock_tf_idf<-sherlock_tidy %>% 
  count(story,word,sort = TRUE) %>% 
  bind_tf_idf(word,story,n) %>% 
  arrange(-tf_idf) %>% 
  group_by(story) %>% 
  top_n(10) %>% 
  ungroup()
sherlock_tf_idf
## # A tibble: 129 x 6
##    story                                word         n      tf   idf tf_idf
##    <chr>                                <chr>    <int>   <dbl> <dbl>  <dbl>
##  1 X. THE ADVENTURE OF THE NOBLE BACHE~ simon       39 0.0159   2.48 0.0394
##  2 XII. THE ADVENTURE OF THE COPPER BE~ rucastle    34 0.0118   2.48 0.0293
##  3 ADVENTURE IV. THE BOSCOMBE VALLEY M~ mccarthy    33 0.0118   2.48 0.0292
##  4 ADVENTURE III. A CASE OF IDENTITY    hosmer      23 0.0114   2.48 0.0283
##  5 ADVENTURE III. A CASE OF IDENTITY    angel       20 0.00991  2.48 0.0246
##  6 ADVENTURE III. A CASE OF IDENTITY    windiba~    20 0.00991  2.48 0.0246
##  7 XI. THE ADVENTURE OF THE BERYL CORO~ coronet     27 0.00990  2.48 0.0246
##  8 ADVENTURE VI. THE MAN WITH THE TWIS~ clair       23 0.00785  2.48 0.0195
##  9 VII. THE ADVENTURE OF THE BLUE CARB~ goose       26 0.0108   1.79 0.0194
## 10 XII. THE ADVENTURE OF THE COPPER BE~ hunter      21 0.00729  2.48 0.0181
## # ... with 119 more rows
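Notice the idf of 2.48 that dominates this table: it is ln(12/1), the value for a word appearing in exactly one of the twelve stories, which is why character names like ‘simon’ and ‘mccarthy’ rise to the top. Each tf_idf is simply the product of the two columns, e.g. 0.0159 × 2.48 ≈ 0.0394 for ‘simon’.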
sherlock_tf_idf %>% 
  mutate(word = reorder_within(word,tf_idf,story)) %>%
  ggplot(aes(word,tf_idf, fill=story))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~story, scales = "free_y", ncol = 3)+
  scale_x_reordered() +
  coord_flip()

So these words give some indication of what each story is about. Now we are going to do some topic modelling.

Topic Modelling

We are using the stm package for topic modelling. First we have to convert our tidy data into a document-feature matrix (dfm) with cast_dfm().

sherlock_dfm<-sherlock_tidy %>% 
  count(story,word,sort = TRUE) %>% 
  cast_dfm(story,word,n)

We train our model with 6 topics, and we use the tidy() function to explore the model’s output. First we look at the ‘beta’ values, the per-topic-per-word probabilities.

sherlock_model<-stm(sherlock_dfm, K = 6,init.type = "Spectral",verbose = FALSE)

top_10_beta<-tidy(sherlock_model) %>%
  arrange(-beta) %>% 
  group_by(topic) %>% 
  top_n(10,beta) %>% 
  ungroup()

top_10_beta
## # A tibble: 60 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 st     0.0132 
##  2     2 hat    0.0121 
##  3     2 goose  0.0108 
##  4     2 stone  0.00876
##  5     4 father 0.00856
##  6     6 time   0.00838
##  7     2 bird   0.00793
##  8     1 simon  0.00724
##  9     2 geese  0.00709
## 10     5 door   0.00687
## # ... with 50 more rows

If we isolate the word ‘st’ and look at its beta values, we can see the probability of that word belonging to each of the six topics.

tidy(sherlock_model) %>% 
  filter(term == "st") %>% 
  arrange(-beta)
## # A tibble: 6 x 3
##   topic term      beta
##   <int> <chr>    <dbl>
## 1     1 st    1.32e- 2
## 2     3 st    1.28e- 3
## 3     6 st    5.71e- 4
## 4     2 st    4.16e- 4
## 5     4 st    1.98e- 4
## 6     5 st    2.64e-30

Now let’s visualize this result.

top_10_beta %>% 
  mutate(term = reorder_within(term,beta,topic),
         topic = paste0("TOPIC", topic)) %>%
  ggplot(aes(term,beta, fill=topic))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic, scales = "free_y", ncol = 3)+
  scale_x_reordered() +
  coord_flip()

There is one more distribution, called ‘gamma’: the per-document-per-topic probability. Let’s calculate and visualize the result.

sherlock_gamma<-tidy(sherlock_model, matrix = "gamma", document_names = rownames(sherlock_dfm))
sherlock_gamma %>% 
  ggplot(aes(gamma, fill= topic))+
  geom_histogram(show.legend = FALSE)+
  facet_wrap(~topic)

It shows which document belongs to which topic: a story with gamma near 1 for a topic is associated almost entirely with that topic, while gamma near 0 means it is not.
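For example, we can pull each story’s single most probable topic out of the gamma table (a small sketch reusing the same dplyr verbs as above):

sherlock_gamma %>% 
  group_by(document) %>% 
  top_n(1,gamma) %>% 
  ungroup()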

Conclusion

Topic modelling is a vast subject; there is more to it, such as topic modelling using n-grams. For that, please refer to the great posts by Julia Silge.