In this practical, we will learn word embeddings to represent text data, and we will also analyse a recurrent neural network.

We use the following packages:

library(magrittr)  # for pipes
library(tidyverse) # for tidy data and pipes
library(ggplot2)   # for visualization
library(wordcloud) # to create pretty word clouds
library(stringr)   # for regular expressions
library(text2vec)  # for word embedding
library(tidytext)  # for text mining

Word embedding

In the first part of the practical, we will apply word embedding approaches. A key idea in working with text data concerns representing words as numeric quantities. There are a number of ways to go about this, as we reviewed in the lecture. Word embedding techniques such as word2vec and GloVe use neural network approaches to construct word vectors. With these vector representations of words we can see how similar they are to each other, and also perform other tasks such as sentiment classification.
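As a toy illustration of the "similarity" idea, cosine similarity between word vectors can be computed with text2vec's sim2() function. The three 4-dimensional vectors below are made up for demonstration, not learned embeddings; real vectors would come from word2vec or GloVe training.

```r
library(text2vec)

# hypothetical word vectors (rows = words); values are invented for illustration
word_vectors <- rbind(
  harry     = c(0.9, 0.1, 0.3, 0.5),
  hermione  = c(0.8, 0.2, 0.4, 0.4),
  quidditch = c(0.1, 0.9, 0.8, 0.2)
)

# pairwise cosine similarity between all word vectors
sim2(word_vectors, method = "cosine", norm = "l2")
# words with similar vectors (harry, hermione) score close to 1
```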

Let’s start the word embedding part by installing the harrypotter package using devtools. The harrypotter package supplies the first seven novels in the Harry Potter series. You can install and load this package with the following code:

library(harrypotter) # Not to be confused with the CRAN palettes package

  1. Use the code below to load the first seven novels in the Harry Potter series.

hp_books <- c("philosophers_stone", "chamber_of_secrets",
              "prisoner_of_azkaban", "goblet_of_fire",
              "order_of_the_phoenix", "half_blood_prince",
              "deathly_hallows")

hp_words <- list(
  philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
  goblet_of_fire, order_of_the_phoenix, half_blood_prince,
  deathly_hallows
) %>%
  # name each list element
  set_names(hp_books) %>%
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") %>%
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) %>%
  # remove empty chapters
  filter(!is.na(value)) %>%
  # create a chapter id column
  group_by(book) %>%
  mutate(chapter = row_number(book))


  2. Convert the hp_words object into a data frame and use the unnest_tokens() function from the tidytext package to tokenize the data frame.

# tokenize the data frame
hp_words <- as.data.frame(hp_words) %>%
  unnest_tokens(word, value)
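To see what unnest_tokens() does on its own, here is a toy example outside the practical's pipeline: it splits a text column into one row per token and lowercases by default.

```r
library(tibble)
library(tidytext)

# a one-row data frame with a made-up sentence
tibble(value = "The Boy Who Lived") %>%
  unnest_tokens(word, value)
# returns a tibble with one token per row: the, boy, who, lived
```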


  3. Remove the stop words from the tokenized data frame.

hp_words <- hp_words %>% 
  anti_join(stop_words)
## Joining, by = "word"


  4. Create a vocabulary of unique terms using the create_vocabulary() function from the text2vec package and remove the words that appear fewer than 5 times.

hp_words_ls <- list(hp_words$word)
it <- itoken(hp_words_ls, progressbar = FALSE) # create index-tokens
hp_vocab <- create_vocabulary(it) # build the full vocabulary
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5) # keep terms occurring at least 5 times
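As a quick sanity check (not part of the assignment itself), you can inspect the pruned vocabulary: create_vocabulary() returns a data frame with term, term_count, and doc_count columns, so after pruning every remaining term should occur at least 5 times.

```r
# assumes hp_vocab from the step above
head(hp_vocab)                 # first rows of the vocabulary data frame
min(hp_vocab$term_count) >= 5  # should be TRUE after pruning
nrow(hp_vocab)                 # number of unique terms kept
```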