Tidytext makes it so much easier

Recently I have been watching Tidytuesday screencast from David Robinson. In his screencast he selects never seen before dataset and analyses it using R. Today I’m going to follow in his footsetp. In this blog post I’m going to analyse a set of data and visualize the result using ggplot2. The dataset is collected from medium.com. I’m going to breakdown all the titles of the articles into indivisual words and try to see which word is used the most in all of them.

Read in data

medium_datasci <- read_csv("medium_datasci.csv")

Cleaning data

The dataset contains the first column as x1 which we don’t need so lets deselect it and save the dataframe as medium_processed.

medium_processed<- medium_datasci %>% 
  select(-x1)

There is a handy function called summarise_at which can be used as

medium_processed %>% 
  summarise_at(vars(starts_with("tag_")),sum)

## # A tibble: 1 x 8
##   tag_ai tag_artificial_~ tag_big_data tag_data tag_data_science
##    <dbl>            <dbl>        <dbl>    <dbl>            <dbl>
## 1  13763            29580         8686     9420            15424
## # ... with 3 more variables: tag_data_visualization <dbl>,
## #   tag_deep_learning <dbl>, tag_machine_learning <dbl>

It summarises only those columns which starts with tag_ and the summarise function is sum in this case.

More data processing

medium_processed %>% 
  select(starts_with("tag"))

## # A tibble: 78,388 x 8
##    tag_ai tag_artificial_~ tag_big_data tag_data tag_data_science
##     <dbl>            <dbl>        <dbl>    <dbl>            <dbl>
##  1      1                0            0        0                0
##  2      1                0            0        0                0
##  3      1                0            0        0                0
##  4      1                0            0        0                0
##  5      1                0            0        0                0
##  6      1                0            0        0                0
##  7      1                0            0        0                0
##  8      1                0            0        0                0
##  9      1                0            0        0                0
## 10      1                0            0        0                0
## # ... with 78,378 more rows, and 3 more variables:
## #   tag_data_visualization <dbl>, tag_deep_learning <dbl>,
## #   tag_machine_learning <dbl>

We can see value 1 or 0 is assigned to indicate whether certain tag like ai or big_data appear in the article. This gives us a good opportunity to gather the data. It gathers all those columns which starts with tag into 2 columns called tag and value.

medium_gathered<- medium_processed %>% 
  gather(tag,value,starts_with("tag"))

medium_gathered

## # A tibble: 627,104 x 14
##    title subtitle image author publication  year month   day reading_time
##    <chr> <chr>    <dbl> <chr>  <chr>       <dbl> <dbl> <dbl>        <dbl>
##  1 Onli~ Online ~     1 Emma ~ <NA>         2017     8     1            5
##  2 A.I.~ <NA>         0 Sanpa~ <NA>         2017     8     1            2
##  3 Futu~ From Ph~     1 Z      <NA>         2017     8     1            3
##  4 The ~ A true ~     1 Emiko~ MILLENNIAL~  2017     8     1            5
##  5 Os M~ mas per~     1 Giova~ NEW ORDER    2017     8     1            3
##  6 The ~ Origina~     1 Syed ~ Towards Da~  2017     8     1            6
##  7 Digi~ Health ~     1 Anna ~ <NA>         2017     8     1            5
##  8 Addr~ https:/~     0 The A~ <NA>         2017     8     1            0
##  9 Wir ~ <NA>         1 Angel~ <NA>         2017     8     1            4
## 10 <NA>  199973       0 The p~ <NA>         2017     8     1            3
## # ... with 627,094 more rows, and 5 more variables: claps <dbl>,
## #   url <chr>, author_url <chr>, tag <chr>, value <dbl>

We don’t need"tag_ so can be removed and filter out only value 1 which indicate the tag exists.

medium_filter<-medium_gathered %>% 
  mutate(tag = str_remove(tag,"tag_")) %>% 
  filter(value ==1)

## Warning: package 'bindrcpp' was built under R version 3.5.2

medium_filter

## # A tibble: 112,994 x 14
##    title subtitle image author publication  year month   day reading_time
##    <chr> <chr>    <dbl> <chr>  <chr>       <dbl> <dbl> <dbl>        <dbl>
##  1 Onli~ Online ~     1 Emma ~ <NA>         2017     8     1            5
##  2 A.I.~ <NA>         0 Sanpa~ <NA>         2017     8     1            2
##  3 Futu~ From Ph~     1 Z      <NA>         2017     8     1            3
##  4 The ~ A true ~     1 Emiko~ MILLENNIAL~  2017     8     1            5
##  5 Os M~ mas per~     1 Giova~ NEW ORDER    2017     8     1            3
##  6 The ~ Origina~     1 Syed ~ Towards Da~  2017     8     1            6
##  7 Digi~ Health ~     1 Anna ~ <NA>         2017     8     1            5
##  8 Addr~ https:/~     0 The A~ <NA>         2017     8     1            0
##  9 Wir ~ <NA>         1 Angel~ <NA>         2017     8     1            4
## 10 <NA>  199973       0 The p~ <NA>         2017     8     1            3
## # ... with 112,984 more rows, and 5 more variables: claps <dbl>,
## #   url <chr>, author_url <chr>, tag <chr>, value <dbl>

Here if you see, the data is organised into tidy format row-wise. It seems all good but we can’t do nothing with a sentence (title), unless we break sentences into words and arrange them row-wise again. Here comes the magic of tidytext.

medium_words<-medium_filter %>% 
  unnest_tokens(word,title)

medium_words

## # A tibble: 841,539 x 14
##    subtitle image author publication  year month   day reading_time claps
##    <chr>    <dbl> <chr>  <chr>       <dbl> <dbl> <dbl>        <dbl> <dbl>
##  1 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  2 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  3 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  4 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  5 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  6 Online ~     1 Emma ~ <NA>         2017     8     1            5    12
##  7 <NA>         0 Sanpa~ <NA>         2017     8     1            2    11
##  8 From Ph~     1 Z      <NA>         2017     8     1            3     1
##  9 From Ph~     1 Z      <NA>         2017     8     1            3     1
## 10 From Ph~     1 Z      <NA>         2017     8     1            3     1
## # ... with 841,529 more rows, and 5 more variables: url <chr>,
## #   author_url <chr>, tag <chr>, value <dbl>, word <chr>

Each sentence is broken into words and arranged into each row. If we want to see what word appears the most, we can use handy count function with sort = TRUE

Polish the result

medium_words %>% 
  count(word, sort = TRUE)

## # A tibble: 44,259 x 2
##    word         n
##    <chr>    <int>
##  1 the      27555
##  2 to       19378
##  3 data     18856
##  4 of       16417
##  5 and      16327
##  6 in       15526
##  7 ai       15163
##  8 learning 13735
##  9 a        13463
## 10 for      11374
## # ... with 44,249 more rows

But as you can see there are lots of words like ‘the’, ‘a’,‘in’. These are called stop words and we have to filter them out. Fot that we use ‘anti_join’ with tidytext’s built in stop_words dataset. We need to filter out numbers and few other words as a clean up step.

medium_words_sort<-medium_words %>% 
  anti_join(stop_words) %>% 
  filter(word != "de",
         str_detect(word,"[a-z]")) %>% 
  count(word, sort = TRUE)

medium_words_sort

## # A tibble: 42,727 x 2
##    word             n
##    <chr>        <int>
##  1 data         18856
##  2 ai           15163
##  3 learning     13735
##  4 machine       8732
##  5 intelligence  6441
##  6 artificial    6306
##  7 deep          4832
##  8 science       3837
##  9 future        2574
## 10 neural        2298
## # ... with 42,717 more rows

Plot the result

Our result shows the word ‘data’ is the hightest, followed by the word ‘ai’. We are almost done but what good if we can’t see the result visually. so now comes the fun part. Lets pick up top 10 and try to plot out findings.

medium_words_sort %>% 
  head(10) %>% 
  mutate(word = fct_reorder(word,n)) %>% 
  ggplot(aes(word,n,fill = word))+
  geom_col(show.legend = FALSE)+
  coord_flip()

we used fct_reorder so words are arranged by the order of their count (n).

Conclusion

In few steps how we are able to analyse the data and quickly visualize the result. I love how tidytext simplified the complex procedure and helped to produce the output elegantly. If you want to know more please visit Tidytuesday project and ofcourse out of all for the best screencast visit David Robinson’s screencast.

Tidytext makes it so much easier

Read in data

Cleaning data

More data processing

Polish the result

Plot the result

Conclusion

Nabin Dangol

Hypothesis Testing : t-statistics

Log Odds Ratio Vs tf-idf Vs Weighted Log Odds

Machine Learning With Tidyverse

Topic Modelling In R

Let's do some Purrr...

Tidytext makes it so much easier