In this practical, we will learn word embeddings to represent text data, and we will also analyse a recurrent neural network.
We use the following packages:
library(magrittr) # for pipes
library(tidyverse) # for tidy data and pipes
library(ggplot2) # for visualization
library(wordcloud) # to create pretty word clouds
library(stringr) # for regular expressions
library(text2vec) # for word embedding
library(tidytext) # for text mining
library(tensorflow) # for the deep learning backend
library(keras) # for building neural networks
In the first part of the practical, we will apply word embedding approaches. A key idea in working with text data is representing words as numeric quantities. There are a number of ways to go about this, as we reviewed in the lecture. Word embedding techniques such as word2vec and GloVe use neural network approaches to construct word vectors. With these vector representations of words we can see how similar they are to each other, and we can also perform other tasks such as sentiment classification.
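To make the idea of comparing word vectors concrete, here is a minimal sketch of cosine similarity using toy, hand-picked vectors. The words, dimensions, and values are purely illustrative (these are not learned embeddings); for real embedding matrices, text2vec provides sim2() to compute such similarities at scale.
# cosine similarity: 1 means same direction, 0 means unrelated
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# toy 3-dimensional "word vectors" (illustrative values only)
wizard <- c(0.9, 0.1, 0.4)
witch  <- c(0.8, 0.2, 0.5)
broom  <- c(0.1, 0.9, 0.2)
cosine_sim(wizard, witch) # high: similar vectors, "similar" words
cosine_sim(wizard, broom) # lower: dissimilar vectors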
Let’s start the word embedding part by installing the harrypotter package using devtools. The harrypotter package supplies the text of the seven novels in the Harry Potter series. You can install and load this package with the following code:
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter) # Not to be confused with the CRAN palettes package
<- c("philosophers_stone", "chamber_of_secrets",
hp_books "prisoner_of_azkaban", "goblet_of_fire",
"order_of_the_phoenix", "half_blood_prince",
"deathly_hallows")
hp_words <- list(
  philosophers_stone,
  chamber_of_secrets,
  prisoner_of_azkaban,
  goblet_of_fire,
  order_of_the_phoenix,
  half_blood_prince,
  deathly_hallows
) %>%
  # name each list element
  set_names(hp_books) %>%
  # convert each book to a data frame and merge into a single data frame
  map_df(as_tibble, .id = "book") %>%
  # convert book to a factor
  mutate(book = factor(book, levels = hp_books)) %>%
  # remove empty chapters
  filter(!is.na(value)) %>%
  # create a chapter id column
  group_by(book) %>%
  mutate(chapter = row_number(book))
head(hp_words)
Next, we use the unnest_tokens() function from the tidytext package to tokenize the data frame:
# tokenize the data frame
hp_words <- as.data.frame(hp_words) %>%
  unnest_tokens(word, value)
head(hp_words)
We then remove stop words with an anti_join() against the stop_words lexicon from tidytext:
hp_words <- hp_words %>%
  anti_join(stop_words) # remove stop words
## Joining, by = "word"
head(hp_words)
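As a quick sanity check, common stop words should no longer be present in the tokenized data; a look-up such as the following should now return FALSE (assuming the hp_words object from above):
# check that frequent stop words were removed by the anti_join
any(hp_words$word %in% c("the", "and", "of"))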
Finally, we use the text2vec package to create a vocabulary of index-tokens and to prune terms that occur fewer than five times:
hp_words_ls <- list(hp_words$word)
it <- itoken(hp_words_ls, progressbar = FALSE) # create index-tokens
hp_vocab <- create_vocabulary(it)
hp_vocab <- prune_vocabulary(hp_vocab, term_count_min = 5) # keep terms occurring at least 5 times
hp_vocab
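The pruned vocabulary is a data frame with term, term_count, and doc_count columns, so you can inspect the most frequent terms directly; a minimal sketch, assuming the hp_vocab object from above:
# show the most frequent terms that survived pruning
head(hp_vocab[order(-hp_vocab$term_count), ])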