Week 9: Text mining 2

Introduction

In this practical, we will learn word embeddings to represent text data, and we will also train a recurrent neural network for sentiment classification.

We use the following packages:

Code
library(magrittr)  # for pipes
library(tidyverse) # for tidy data and pipes
library(ggplot2)   # for visualization
library(wordcloud) # to create pretty word clouds
library(stringr)   # for regular expressions
library(text2vec)  # for word embedding
library(tidytext)  # for text mining
library(tensorflow)
library(keras)

Take-home exercises

Word embedding

In the first part of the practical, we will apply word embedding approaches. A key idea in working with text data is representing words as numeric quantities. There are a number of ways to go about this, as we reviewed in the lecture. Word embedding techniques such as word2vec and GloVe use neural network approaches to construct word vectors. With these vector representations of words we can measure how similar they are to each other, and also perform other tasks such as sentiment classification.
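
To make the idea of similarity concrete, here is a minimal sketch in base R with two made-up three-dimensional "word vectors" (the sim2() function used later in this practical performs the same cosine computation on whole matrices).

Code
# toy three-dimensional "embeddings" for two words (made-up numbers)
vec_king  <- c(0.5, 1.2, -0.3)
vec_queen <- c(0.4, 1.0, -0.2)

# cosine similarity: dot product divided by the product of the vector norms
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine_sim(vec_king, vec_queen)  # values near 1 indicate very similar directions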

Let’s start the word embedding part by installing the harrypotter package using devtools. The harrypotter package supplies the text of the first seven novels in the Harry Potter series. You can install and load this package with the following code:

Code
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package

1. Use the code below to load the first seven novels in the Harry Potter series.

Code
hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")

hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) %>%
  # name each list element
  set_names(hp_books) %>%
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") %>%
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) %>%
  # remove empty chapters
  filter(!is.na(value)) %>%
  # create a chapter id column
  group_by(book) %>%
  mutate(chapter = row_number(book))

head(hp_words)
## # A tibble: 6 × 3
## # Groups:   book [1]
##   book               value                                               chapter
##   <fct>              <chr>                                                 <int>
## 1 philosophers_stone "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of num…       1
## 2 philosophers_stone "THE VANISHING GLASS  Nearly ten years had passe…       2
## 3 philosophers_stone "THE LETTERS FROM NO ONE  The escape of the Braz…       3
## 4 philosophers_stone "THE KEEPER OF THE KEYS  BOOM. They knocked agai…       4
## 5 philosophers_stone "DIAGON ALLEY  Harry woke early the next morning…       5
## 6 philosophers_stone "THE JOURNEY FROM PLATFORM NINE AND THREE-QUARTERS…       6

2. Convert the hp_words object into a dataframe and use the unnest_tokens() function from the tidytext package to tokenize the dataframe.

Code
# tokenize the data frame
hp_words <- as.data.frame(hp_words) %>%
  unnest_tokens(word, value)

head(hp_words)
##                 book chapter  word
## 1 philosophers_stone       1   the
## 2 philosophers_stone       1   boy
## 3 philosophers_stone       1   who
## 4 philosophers_stone       1 lived
## 5 philosophers_stone       1    mr
## 6 philosophers_stone       1   and

3. Remove the stop words from the tokenized data frame.

Code
hp_words <- hp_words %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`

head(hp_words)
##                 book chapter    word
## 1 philosophers_stone       1     boy
## 2 philosophers_stone       1   lived
## 3 philosophers_stone       1 dursley
## 4 philosophers_stone       1  privet
## 5 philosophers_stone       1   drive
## 6 philosophers_stone       1   proud

4. Create a vocabulary of unique terms using the create_vocabulary() function from the text2vec package and remove the words that appear fewer than 5 times.

Code
hp_words_ls <- list(hp_words$word)
it <- itoken(hp_words_ls, progressbar = FALSE) # create index-tokens
hp_vocab <- create_vocabulary(it)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5)

hp_vocab
## Number of docs: 1 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##             term term_count doc_count
##    1:         10          5         1
##    2:      aaaah          5         1
##    3:      aargh          5         1
##    4:    abysmal          5         1
##    5:       acts          5         1
##   ---                                
## 8626:     looked       2344         1
## 8627: dumbledore       2874         1
## 8628:   hermione       4912         1
## 8629:        ron       5750         1
## 8630:      harry      16560         1

5. The next step is to create a token co-occurrence matrix (TCM). The definition of whether two words occur together is somewhat arbitrary; here we count two words as co-occurring when they appear within a fixed context window of each other. First create a vocab_vectorizer, then use a window of 5 context words to create the TCM.

Code
# maps words to indices
vectorizer <- vocab_vectorizer(hp_vocab)

# use a window of 5 for context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)

dim(hp_tcm)
## [1] 8630 8630
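
To see concretely what the context window means, below is a small, purely illustrative sketch that builds a TCM for a single toy sentence with a window of 2; the words and window size are arbitrary choices for illustration only.

Code
# toy example: weighted co-occurrence counts within a window of 2
toy_tokens <- list(c("harry", "looked", "at", "ron", "and", "hermione"))
toy_it     <- itoken(toy_tokens, progressbar = FALSE)
toy_vocab  <- create_vocabulary(toy_it)
toy_tcm    <- create_tcm(toy_it, vocab_vectorizer(toy_vocab), skip_grams_window = 2)
as.matrix(toy_tcm)  # by default, word pairs that are closer together receive larger weights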

6. Use GlobalVectors as given in the code below to fit the word vectors on our data set. Set the embedding size (the rank argument) to 50 and the maximum number of co-occurrences (x_max) to 10. Train the word vectors for 20 iterations. You can check the full input arguments of the fit_transform function here.

Code
glove <- GlobalVectors$new(rank = 50, x_max = 10)
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20, convergence_tol = 0.001)
## INFO  [15:00:04.348] epoch 1, loss 0.1201
## INFO  [15:00:04.499] epoch 2, loss 0.0783
## INFO  [15:00:04.641] epoch 3, loss 0.0679
## INFO  [15:00:04.782] epoch 4, loss 0.0618
## INFO  [15:00:04.921] epoch 5, loss 0.0575
## INFO  [15:00:05.069] epoch 6, loss 0.0543
## INFO  [15:00:05.226] epoch 7, loss 0.0518
## INFO  [15:00:05.374] epoch 8, loss 0.0498
## INFO  [15:00:05.517] epoch 9, loss 0.0482
## INFO  [15:00:05.661] epoch 10, loss 0.0468
## INFO  [15:00:05.806] epoch 11, loss 0.0456
## INFO  [15:00:05.952] epoch 12, loss 0.0446
## INFO  [15:00:06.098] epoch 13, loss 0.0437
## INFO  [15:00:06.242] epoch 14, loss 0.0429
## INFO  [15:00:06.386] epoch 15, loss 0.0422
## INFO  [15:00:06.531] epoch 16, loss 0.0416
## INFO  [15:00:06.671] epoch 17, loss 0.0410
## INFO  [15:00:06.814] epoch 18, loss 0.0405
## INFO  [15:00:06.960] epoch 19, loss 0.0400
## INFO  [15:00:07.105] epoch 20, loss 0.0396

7. The GloVe model learns two sets of word vectors: main and context. They are essentially equivalent, since the model is symmetric. In practice, learning two sets of word vectors leads to higher-quality embeddings (read more here). Best practice is to combine both the main word vectors and the context word vectors into one matrix. Extract the word vectors and save their sum for the next questions.

Code
dim(hp_wv_main)
## [1] 8630   50

hp_wv_context <- glove$components
dim(hp_wv_context)
## [1]   50 8630

# Either word-vector matrix could work, but the developers of the technique suggest that the sum/mean may work better
hp_word_vectors <- hp_wv_main + t(hp_wv_context)

dim(hp_word_vectors)
## [1] 8630   50

8. Find the most similar words to the words “harry”, “death”, and “love”. Use the sim2 function with the cosine similarity measure.

Code
harry <- hp_word_vectors["harry", , drop = F]

cos_sim_rom <- sim2(x = hp_word_vectors, y = harry, method = "cosine", norm = "l2")
head(sort(cos_sim_rom[,1], decreasing = T), 10)
##      harry        ron   hermione     moment       time     looked    quickly 
##  1.0000000  0.8667391  0.8623137  0.8209236  0.8201845  0.7690142  0.7681430 
##       left   happened dumbledore 
##  0.7657225  0.7393362  0.7185179

death <- hp_word_vectors["death", , drop = F]

cos_sim_rom <- sim2(x = hp_word_vectors, y = death, method = "cosine", norm = "l2")
head(sort(cos_sim_rom[,1], decreasing = T), 10)
##       death      eaters       eater   voldemort     escaped voldemort's 
##   1.0000000   0.9473222   0.8886765   0.6598957   0.6258171   0.5619242 
##       fight      yelled        lord      people 
##   0.5611892   0.5384620   0.5292734   0.5147428

love <- hp_word_vectors["love", , drop = F]

cos_sim_rom <- sim2(x = hp_word_vectors, y = love, method = "cosine", norm = "l2")
head(sort(cos_sim_rom[,1], decreasing = T), 10)
##        love ingredients      potion   horcruxes        hunt    bothered 
##   1.0000000   0.5519101   0.5482892   0.5233350   0.5223741   0.5120451 
##     potions     realise     lessons        easy 
##   0.5020931   0.4923215   0.4852735   0.4824427

9. Now you can play with the word vectors! For example, take the word vector of “harry”, subtract the word vector of “death” from it, and add the word vector of “love”. What are the top terms in your result?

Code
test <- hp_word_vectors["harry", , drop = F] -
  hp_word_vectors["death", , drop = F] +
  hp_word_vectors["love", , drop = F]

cos_sim_test <- sim2(x = hp_word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[,1], decreasing = T), 10)
##     harry  hermione       ron  thinking  muttered      time        er   weasley 
## 0.6582883 0.6350198 0.6141246 0.5998959 0.5833623 0.5522759 0.5499507 0.5453136 
##      love   quickly 
## 0.5417902 0.5401518

Lab exercises

Sentiment classification with RNN

For sentiment classification with pre-trained word vectors, we want to use GloVe pre-trained word vectors. These vectors were trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary, uncased) and come in 50d, 100d, 200d, and 300d versions. Download the glove.6B.300d.txt file manually from the GloVe website or use the code below for this purpose.

Code
# Download GloVe vectors if necessary
# if (!file.exists('data/glove.6B.zip')) {
#   download.file('https://nlp.stanford.edu/data/glove.6B.zip', destfile = 'data/glove.6B.zip')
#   unzip('data/glove.6B.zip', exdir = 'data')
# }

10. Use the code below to load the pre-trained word vectors from the file ‘glove.6B.300d.txt’ (if you have memory issues, load the file ‘glove.6B.50d.txt’ instead and adjust the number of dimensions in the code accordingly).

Code

# load glove vectors
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = F, encoding = 'UTF-8')
colnames(vectors) <- c('word', paste('dim',1:300,sep = '_'))

# convert vectors to dataframe
vectors <- as_tibble(vectors)
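
As an optional sanity check (assuming the file was read correctly), you can look up a single word and inspect the first few dimensions of its vector:

Code
# optional check: the first few dimensions of one word's GloVe vector
vectors %>%
  filter(word == "movie") %>%
  select(word, dim_1:dim_5)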

11. IMDB movie reviews is a labeled data set available with the text2vec package. This data set consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a dataframe.

Code
# load an example dataset from text2vec
data("movie_review")
as_tibble(movie_review)
## # A tibble: 5,000 × 3
##    id      sentiment review                                                     
##    <chr>       <int> <chr>                                                      
##  1 5814_8          1 "With all this stuff going down at the moment with MJ i've…
##  2 2381_9          1 "\\\"The Classic War of the Worlds\\\" by Timothy Hines is…
##  3 7759_3          0 "The film starts with a manager (Nicholas Bell) giving wel…
##  4 3630_4          0 "It must be assumed that those who praised this film (\\\"…
##  5 9495_8          1 "Superbly trashy and wondrously unpretentious 80's exploit…
##  6 8196_8          1 "I dont know why people think this is such a bad movie. It…
##  7 7166_2          0 "This movie could have been very good, but comes up way sh…
##  8 10633_1         0 "I watched this video at a friend's house. I'm glad I did …
##  9 319_1           0 "A friend of mine bought this film for 1, and even then it…
## 10 8713_10         1 "<br /><br />This movie is full of references. Like \\\"Ma…
## # ℹ 4,990 more rows
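
Optionally, you can take a quick look at the class balance of the outcome variable:

Code
# number of negative (0) and positive (1) reviews
table(movie_review$sentiment)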

12. To create a learning model using Keras, let’s first define the hyperparameters. Define the parameters of your Keras model with a maximum of 10,000 words, a maximum sequence length (maxlen) of 60, and a word embedding size of 300 (if you had memory problems, change the embedding dimension to a smaller value, e.g., 50).

Code
max_words <- 1e4
maxlen    <- 60
dim_size  <- 300

13. Use the text_tokenizer function from Keras to tokenize the IMDB review data, using a maximum of 10,000 words.

Code
# tokenize the input data and then fit the created object
word_seqs <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(movie_review$review)

14. Transform each text into a sequence of integers (word indices) and use the pad_sequences function to pad the sequences so they all have length maxlen.

Code
# apply tokenizer to the text and get indices instead of words
# later pad the sequence
x_train <- texts_to_sequences(word_seqs, movie_review$review) %>%
  pad_sequences(maxlen = maxlen)
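
Optionally, you can inspect the result of the padding; note that by default pad_sequences pads (and truncates) at the start of each sequence, so short reviews get leading zeros:

Code
# each review is now a row of maxlen (60) word indices; 0 is the padding value
dim(x_train)
x_train[1, ]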

15. Convert the tokenizer’s word index into a dataframe containing the max_words most frequent words.

Code
# unlist word indices
word_indices <- unlist(word_seqs$word_index)

# then place them into data.frame
dic <- data.frame(word = names(word_indices), key = word_indices, stringsAsFactors = FALSE) %>%
  arrange(key) %>%
  filter(row_number() <= max_words)

16. Use the code below to join the dataframe of word indices from the IMDB reviews with the GloVe pre-trained word vectors.

Code
# join the words with GloVe vectors and
# if a word does not exist in GloVe, then fill NA's with 0
word_embeds <- dic  %>%
  left_join(vectors) %>%
  select(starts_with("dim")) %>%
  replace(., is.na(.), 0) %>%
  as.matrix()
## Joining with `by = join_by(word)`

17. Extract the outcome variable from the sentiment column in the original dataframe and name it y_train.

Code
# the outcome variable
y_train <- as.matrix(movie_review$sentiment)

18. Use the Keras functional API to create a recurrent neural network model as below. Can you describe this model?

Code
# Use Keras Functional API
input <- layer_input(shape = list(maxlen), name = "input")

model <- input %>%
  layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
                  # put weights into list and do not allow training
                  weights = list(word_embeds), trainable = FALSE) %>%
  layer_spatial_dropout_1d(rate = 0.2) %>%
  bidirectional(
    layer_gru(units = 80, return_sequences = TRUE)
  )
max_pool <- model %>% layer_global_max_pooling_1d()
ave_pool <- model %>% layer_global_average_pooling_1d()

output <- layer_concatenate(list(ave_pool, max_pool)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)

# model summary
model
## Model: "model"
## ________________________________________________________________________________
##  Layer (type)       Output Shape         Para   Connected to         Trainable  
##                                          m #                                    
## ================================================================================
##  input (InputLayer  [(None, 60)]         0      []                   Y          
##  )                                                                              
##  embedding (Embedd  (None, 60, 300)      3000   ['input[0][0]']      N          
##  ing)                                    000                                    
##  spatial_dropout1d  (None, 60, 300)      0      ['embedding[0][0]'   Y          
##   (SpatialDropout1                              ]                               
##  D)                                                                             
##  bidirectional (Bi  (None, 60, 160)      1833   ['spatial_dropout1   Y          
##  directional)                            60     d[0][0]']                       
##  global_average_po  (None, 160)          0      ['bidirectional[0]   Y          
##  oling1d (GlobalAv                              [0]']                           
##  eragePooling1D)                                                                
##  global_max_poolin  (None, 160)          0      ['bidirectional[0]   Y          
##  g1d (GlobalMaxPoo                              [0]']                           
##  ling1D)                                                                        
##  concatenate (Conc  (None, 320)          0      ['global_average_p   Y          
##  atenate)                                       ooling1d[0][0]',                
##                                                  'global_max_pooli              
##                                                 ng1d[0][0]']                    
##  dense (Dense)      (None, 1)            321    ['concatenate[0][0   Y          
##                                                 ]']                             
## ================================================================================
## Total params: 3183681 (12.14 MB)
## Trainable params: 183681 (717.50 KB)
## Non-trainable params: 3000000 (11.44 MB)
## ________________________________________________________________________________

19. Compile the model with the ‘adam’ optimizer and the binary_crossentropy loss. You can choose accuracy or AUC as the metric.

Code
# instead of accuracy we can use "AUC" metrics from "tensorflow.keras"
model %>% compile(
  optimizer = "adam", # optimizer = optimizer_rmsprop(),
  loss = "binary_crossentropy",
  metrics = tensorflow::tf$keras$metrics$AUC() # metrics = c('accuracy')
)

20. Fit the model with 10 epochs (iterations), batch_size = 32, and validation_split = 0.2. Check the training performance versus the validation performance.

Code
history <- model %>% keras::fit(
  x_train, y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
## Epoch 1/10
## 125/125 - 8s - loss: 0.6879 - auc: 0.5685 - val_loss: 0.6835 - val_auc: 0.6857 - 8s/epoch - 68ms/step
## Epoch 2/10
## 125/125 - 6s - loss: 0.6388 - auc: 0.7006 - val_loss: 0.6707 - val_auc: 0.7247 - 6s/epoch - 48ms/step
## Epoch 3/10
## 125/125 - 6s - loss: 0.5849 - auc: 0.7884 - val_loss: 0.5976 - val_auc: 0.7562 - 6s/epoch - 48ms/step
## Epoch 4/10
## 125/125 - 6s - loss: 0.5343 - auc: 0.8284 - val_loss: 0.5723 - val_auc: 0.7813 - 6s/epoch - 48ms/step
## Epoch 5/10
## 125/125 - 6s - loss: 0.4721 - auc: 0.8829 - val_loss: 0.5506 - val_auc: 0.7945 - 6s/epoch - 48ms/step
## Epoch 6/10
## 125/125 - 6s - loss: 0.4241 - auc: 0.9042 - val_loss: 0.6003 - val_auc: 0.8075 - 6s/epoch - 48ms/step
## Epoch 7/10
## 125/125 - 6s - loss: 0.3677 - auc: 0.9368 - val_loss: 0.5321 - val_auc: 0.8190 - 6s/epoch - 48ms/step
## Epoch 8/10
## 125/125 - 6s - loss: 0.3179 - auc: 0.9556 - val_loss: 0.5123 - val_auc: 0.8265 - 6s/epoch - 48ms/step
## Epoch 9/10
## 125/125 - 6s - loss: 0.2677 - auc: 0.9729 - val_loss: 0.5080 - val_auc: 0.8292 - 6s/epoch - 48ms/step
## Epoch 10/10
## 125/125 - 6s - loss: 0.2207 - auc: 0.9842 - val_loss: 0.5098 - val_auc: 0.8364 - 6s/epoch - 49ms/step

plot(history)
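
As a final, optional sketch, this is one way you could score a new, made-up review with the fitted model, reusing the same tokenizer and padding; the output is the predicted probability of positive sentiment.

Code
# score a hypothetical new review with the trained model
new_review <- "A wonderful film with strong performances and a moving story."
x_new <- texts_to_sequences(word_seqs, new_review) %>%
  pad_sequences(maxlen = maxlen)
predict(model, x_new)  # predicted probability that the sentiment is positive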