Materials for the Applied Data Science profile course INFOMDA2: Battling the Curse of Dimensionality.
In this practical, we will learn word embeddings to represent text data, and we will also build and analyse a recurrent neural network for sentiment classification.
We use the following packages:
library(magrittr) # for pipes
library(tidyverse) # for tidy data and pipes
library(ggplot2) # for visualization
library(wordcloud) # to create pretty word clouds
library(stringr) # for regular expressions
library(text2vec) # for word embedding
library(tidytext) # for text mining
library(tensorflow) # for the deep learning backend
library(keras) # for building neural networks
In the first part of the practical, we will apply word embedding approaches. A key idea in working with text data is representing words as numeric quantities. There are a number of ways to go about this, as we reviewed in the lecture. Word embedding techniques such as word2vec and GloVe use neural network approaches to construct word vectors. With these vector representations of words we can see how similar they are to each other, and also perform other tasks such as sentiment classification.
Let’s start the word embedding part by installing the harrypotter package using devtools. The harrypotter package supplies the first seven novels in the Harry Potter series. You can install and load this package with the following code:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
1. Use the code below to load the first seven novels in the Harry Potter series.
hp_books <- c("philosophers_stone", "chamber_of_secrets",
"prisoner_of_azkaban", "goblet_of_fire",
"order_of_the_phoenix", "half_blood_prince",
"deathly_hallows")
hp_words <- list(
philosophers_stone,
chamber_of_secrets,
prisoner_of_azkaban,
goblet_of_fire,
order_of_the_phoenix,
half_blood_prince,
deathly_hallows
) %>%
# name each list element
set_names(hp_books) %>%
# convert each book to a data frame and merge into a single data frame
map_df(as_tibble, .id = "book") %>%
# convert book to a factor
mutate(book = factor(book, levels = hp_books)) %>%
# remove empty chapters
filter(!is.na(value)) %>%
# create a chapter id column
group_by(book) %>%
mutate(chapter = row_number(book))
head(hp_words)
2. Convert the hp_words object into a dataframe and use the unnest_tokens() function from the tidytext package to tokenize the dataframe.
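A minimal sketch, assuming the data frame from step 1 is still called hp_words and the text lives in the value column:
# tokenize the text into one word per row
hp_words <- hp_words %>%
  as.data.frame() %>%
  unnest_tokens(word, value)
head(hp_words)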
3. Remove the stop words from the tokenized data frame.
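One possible approach, using the stop_words lexicon that ships with tidytext and assuming the tokenized column is called word:
# drop rows whose word appears in the tidytext stop word list
hp_words <- hp_words %>%
  anti_join(stop_words, by = "word")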
4. Create a vocabulary of unique terms using the create_vocabulary() function from the text2vec package and remove the words that appear fewer than 5 times.
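A sketch with text2vec, assuming the tokenized, stop-word-free data frame is called hp_words; the iterator `it` is an assumed helper name that is reused in the next step:
# create an iterator over the word tokens (the whole corpus as one document)
it <- itoken(list(hp_words$word), progressbar = FALSE)
# build the vocabulary and prune terms occurring fewer than 5 times
hp_vocab <- create_vocabulary(it)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5)
hp_vocab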
5. The next step is to create a token co-occurrence matrix (TCM). The definition of whether two words occur together is arbitrary. First create a vocab_vectorizer, then use a window of 5 context words to create the TCM.
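A sketch, reusing the iterator `it` and the pruned vocabulary from the previous step:
# map the vocabulary to word indices
vectorizer <- vocab_vectorizer(hp_vocab)
# co-occurrence matrix with a symmetric window of 5 context words
hp_tcm <- create_tcm(it, vectorizer, skip_grams_window = 5)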
6. Use the GlobalVectors class as given in the code below to fit the word vectors on our data set. Choose the embedding size (the rank argument) equal to 50 and the maximum number of co-occurrences equal to 10. Train the word vectors for 20 iterations. You can check the full input arguments of the fit_transform function from here.
glove <- GlobalVectors$new(rank = 50, x_max = 10)
hp_wv_main <- glove$fit_transform(hp_tcm, n_iter = 20, convergence_tol = 0.001)
7. The GloVe model learns two sets of word vectors: main and context. Essentially they are the same since the model is symmetric, but in practice learning two sets of word vectors leads to higher-quality embeddings (read more here). Best practice is to combine both the main word vectors and the context word vectors into one matrix. Extract both sets of word vectors and save their sum for the following questions.
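A possible sketch: the fitted glove object stores the context vectors in its components field, transposed relative to the main vectors returned by fit_transform, so they need to be transposed before summing:
# extract the context word vectors (stored transposed)
hp_wv_context <- glove$components
dim(hp_wv_context)
# combine the main and context vectors into one embedding matrix
hp_word_vectors <- hp_wv_main + t(hp_wv_context)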
8. Find the most similar words to the words “harry”, “death”, and “love”. Use the sim2 function with the cosine similarity measure.
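One way to do this for “harry” (the same pattern works for “death” and “love”), assuming the summed embedding matrix from the previous step is called hp_word_vectors:
# cosine similarity between "harry" and every word in the vocabulary
harry <- hp_word_vectors["harry", , drop = FALSE]
cos_sim_harry <- sim2(x = hp_word_vectors, y = harry, method = "cosine", norm = "l2")
# top 10 most similar words
head(sort(cos_sim_harry[, 1], decreasing = TRUE), 10)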
9. Now you can play with word vectors! For example, add the word vectors of “harry” and “love” together and subtract them from the word vector of “death”. What are the top terms in your result?
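A sketch of the word-vector arithmetic, again assuming the matrix hp_word_vectors; you can flip the signs to try the other reading of the question, (harry + love) − death, and compare the results:
# following the question's wording: subtract (harry + love) from death
test_vec <- hp_word_vectors["death", , drop = FALSE] -
  (hp_word_vectors["harry", , drop = FALSE] +
     hp_word_vectors["love", , drop = FALSE])
cos_sim_test <- sim2(x = hp_word_vectors, y = test_vec, method = "cosine", norm = "l2")
head(sort(cos_sim_test[, 1], decreasing = TRUE), 10)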
For sentiment classification, we will use pre-trained GloVe word vectors. These word vectors were trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary, uncased) and are available as 50d, 100d, 200d, and 300d vectors. Download the glove.6B.300d.txt file manually from the website or use the code below for this purpose.
# Download GloVe vectors if necessary
# if (!file.exists('data/glove.6B.zip')) {
#   download.file('https://nlp.stanford.edu/data/glove.6B.zip', destfile = 'data/glove.6B.zip')
#   unzip('data/glove.6B.zip', exdir = 'data')
# }
10. Use the code below to load the pre-trained word vectors from the file ‘glove.6B.300d.txt’ (if you have memory issues load the file ‘glove.6B.50d.txt’ instead).
# load glove vectors
vectors <- data.table::fread('data/glove.6B.300d.txt', data.table = F, encoding = 'UTF-8')
colnames(vectors) <- c('word', paste('dim',1:300,sep = '_'))
# convert vectors to dataframe
vectors <- as_tibble(vectors)
11. IMDB movie reviews is a labeled data set available with the text2vec package. This data set consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. Load this data set and convert it to a dataframe.
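A minimal sketch; the data set is called movie_review in text2vec, and imdb_df is an assumed name reused in the following steps:
# load the labeled IMDB reviews shipped with text2vec
data("movie_review", package = "text2vec")
imdb_df <- as_tibble(movie_review)
head(imdb_df)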
12. To create a learning model using Keras, let’s first define the hyperparameters: a maximum of 10000 words, a maxlen of 60, and a word embedding size of 300 (if you have memory problems, change the embedding dimension to a smaller value, e.g., 50).
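For example, using the names that appear in the model code further down (max_words, maxlen, dim_size):
max_words <- 10000  # maximum vocabulary size
maxlen    <- 60     # maximum length (in words) of each review
dim_size  <- 300    # embedding dimension (use 50 with glove.6B.50d.txt)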
13. Use the text_tokenizer function from Keras and tokenize the imdb review data using a maximum of 10000 words.
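A possible sketch, assuming the reviews data frame from step 11 is called imdb_df; word_seqs is an assumed name for the fitted tokenizer:
# build a tokenizer and fit it on the review texts
word_seqs <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(imdb_df$review)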
14. Transform each text into a sequence of integers (word indices) and use the pad_sequences function to pad the sequences.
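Continuing the sketch with the tokenizer word_seqs from the previous step (x_train is an assumed name):
# map each review to a sequence of word indices and pad/truncate to maxlen
x_train <- texts_to_sequences(word_seqs, imdb_df$review) %>%
  pad_sequences(maxlen = maxlen)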
15. Convert the padded sequences into a dataframe.
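One possible sketch. Note that the join code in the next step also assumes a word dictionary called dic (one row per word in the tokenizer vocabulary); it is not constructed in the text, so one hedged way to build it from the tokenizer's word_index is shown here as well:
# padded sequences as a data frame
x_df <- as.data.frame(x_train)
# assumed helper: word dictionary used by the join in the next step
dic <- tibble(
  word = names(word_seqs$word_index),
  key  = unlist(word_seqs$word_index, use.names = FALSE)
) %>%
  arrange(key) %>%
  filter(key <= max_words)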
16. Use the code below to join the dataframe of sequences (word indices) from the IMDB reviews with GloVe pre-trained word vectors.
# join the words with GloVe vectors and
# if a word does not exist in GloVe, then fill NA's with 0
word_embeds <- dic %>%
left_join(vectors) %>%
select(starts_with("dim")) %>%
replace(., is.na(.), 0) %>%
as.matrix()
## Joining with `by = join_by(word)`
17. Extract the outcome variable from the sentiment column in the original dataframe and name it y_train.
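For example, assuming the reviews data frame is called imdb_df:
# binary outcome: 1 = positive review, 0 = negative review
y_train <- imdb_df$sentiment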
18. Use the Keras functional API and create a recurrent neural network model as below. Can you describe this model?
# Use Keras Functional API
input <- layer_input(shape = list(maxlen), name = "input")
model <- input %>%
layer_embedding(input_dim = max_words, output_dim = dim_size, input_length = maxlen,
# put weights into list and do not allow training
weights = list(word_embeds), trainable = FALSE) %>%
layer_spatial_dropout_1d(rate = 0.2) %>%
bidirectional(
layer_gru(units = 80, return_sequences = TRUE)
)
max_pool <- model %>% layer_global_max_pooling_1d()
ave_pool <- model %>% layer_global_average_pooling_1d()
output <- layer_concatenate(list(ave_pool, max_pool)) %>%
layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(input, output)
# model summary
model
19. Compile the model with an ‘adam’ optimizer, and the binary_crossentropy loss. You can choose accuracy or AUC for the metrics.
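A minimal sketch of the compilation step, here using accuracy as the metric:
model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)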
20. Fit the model with 10 epochs (iterations), batch_size = 32, and validation_split = 0.2. Check the training performance versus the validation performance.
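A possible sketch, assuming the padded sequences and outcome are called x_train and y_train as in the earlier sketches:
history <- model %>% fit(
  x = x_train,
  y = y_train,
  epochs = 10,
  batch_size = 32,
  validation_split = 0.2
)
# compare training versus validation performance across epochs
plot(history)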