Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.

Week 8: Text mining 1


In this practical, we create document-term matrices from the BBC News data set and apply LDA topic modeling. The functions used below come from the tm, topicmodels, tidytext, dplyr, tidyr, and ggplot2 packages.


Take-home exercises

Vector space model: document-term matrix

The data set used in this practical is the BBC News data set. You can use the provided “news_dataset.rda” for this purpose. The raw data set can also be downloaded from here.

This data set consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005. These areas are business, entertainment, politics, sport, and tech.

1. Use the code below to load the data set and inspect its first rows.
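
As a minimal sketch of loading and inspecting an .rda file: load() returns the names of the objects it restores, which helps when you do not know what the file contains. The toy data frame toy_news, its column names, and the temporary file are assumptions made for illustration only; in the practical you would point load() at the provided “news_dataset.rda” instead.

```r
# Toy stand-in for the news data (hypothetical object and column names).
toy_news <- data.frame(
  Content  = c("shares rose on the market", "the team won the match"),
  Category = c("business", "sport"),
  stringsAsFactors = FALSE
)

# Save it to a temporary .rda file, then remove it from the workspace.
tmp <- tempfile(fileext = ".rda")
save(toy_news, file = tmp)
rm(toy_news)

# load() restores the saved objects and returns their names.
loaded <- load(tmp)
print(loaded)

# Inspect the first rows of the restored data frame.
head(get(loaded[1]))
```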


2. Find out the names of the categories and the number of observations in each of them.
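
One possible way to do this in base R is with unique() and table(). The sketch below uses a toy data frame; the object name toy_news and the column name Category are assumptions and may differ from the names in the provided data.

```r
# Toy data with a hypothetical Category column.
toy_news <- data.frame(
  Category = c("business", "sport", "sport", "tech", "business", "sport"),
  stringsAsFactors = FALSE
)

# Names of the categories.
category_names <- unique(toy_news$Category)
print(category_names)

# Number of observations per category.
category_counts <- table(toy_news$Category)
print(category_counts)
```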

3. Convert the data set into a document-term matrix and use the findFreqTerms function to keep the terms whose frequency is higher than 10. It is also a good idea to apply some text preprocessing before this conversion, e.g., remove non-UTF-8 characters, convert the words to lowercase, and remove punctuation, numbers, stopwords, and extra whitespace.
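
The preprocessing and conversion steps can be sketched with the tm package as follows. This is an illustration on four toy sentences, not the solution for the full data set; the documents and the threshold of 2 are assumptions. Note that the lowfreq argument of findFreqTerms is an inclusive lower bound, so "higher than 10" would correspond to lowfreq = 11.

```r
library(tm)

# Toy documents (made up for illustration).
docs <- c(
  "Shares rose 3% as the market rallied.",
  "The market fell; shares dropped.",
  "The team played a great match!",
  "A late goal decided the match for the team."
)

corpus <- VCorpus(VectorSource(docs))

# Drop byte sequences that are not valid UTF-8.
corpus <- tm_map(corpus, content_transformer(
  function(x) iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Rows are documents, columns are terms.
dtm <- DocumentTermMatrix(corpus)

# Terms that occur at least twice across the toy corpus.
freq_terms <- findFreqTerms(dtm, lowfreq = 2)
print(freq_terms)
```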

4. Partition the original data into training and test sets with 80% for training and 20% for test.
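
A random 80/20 split can be sketched in base R by sampling row indices. The toy data frame and the seed are assumptions; setting a seed only makes the split reproducible.

```r
set.seed(123)

# Toy data with 10 rows (hypothetical columns).
toy_news <- data.frame(
  id       = 1:10,
  Category = rep(c("business", "sport"), each = 5)
)

n <- nrow(toy_news)

# Sample 80% of the row indices for training.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_set <- toy_news[train_idx, ]
test_set  <- toy_news[-train_idx, ]

dim(train_set)
dim(test_set)
```

With classes of very different sizes, a stratified split (sampling within each category) may be preferable; the simple version above does not guarantee the same class proportions in both sets.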

5. Create separate document-term matrices for the training and the test sets using the previous frequent terms as the input dictionary and convert them into data frames.
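
Passing the frequent terms as the dictionary control option makes the training and test matrices share the same columns, which a classifier needs at prediction time. The sketch below assumes toy documents and a hand-written dictionary; the helper to_dtm_df is hypothetical and only bundles the steps.

```r
library(tm)

train_docs <- c("shares rose as the market rallied", "the team won the match")
test_docs  <- c("the market fell sharply", "a new phone was released")

# Stand-in for the output of findFreqTerms on the full data.
dict <- c("market", "match", "shares", "team")

# Hypothetical helper: texts -> document-term data frame restricted to a dictionary.
to_dtm_df <- function(texts, dictionary) {
  corpus <- VCorpus(VectorSource(texts))
  dtm <- DocumentTermMatrix(corpus, control = list(dictionary = dictionary))
  as.data.frame(as.matrix(dtm))
}

train_dtm <- to_dtm_df(train_docs, dict)
test_dtm  <- to_dtm_df(test_docs, dict)

print(train_dtm)
print(test_dtm)
```

Terms outside the dictionary are ignored, so a test document that contains none of the dictionary terms ends up as a row of zeros.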

6. OPTIONAL: Use the cbind function to add the categories to the train_dtm data and name the column y.

7. OPTIONAL: Fit an SVM model with a linear kernel on the training data set. Predict the categories for the training and test data.
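
Questions 6 and 7 can be sketched with the svm function from the e1071 package. Using e1071 is an assumption, since the practical does not name an SVM package, and the small count tables below are made up for illustration.

```r
library(e1071)

# Toy document-term counts (hypothetical terms) and labels.
train_dtm <- data.frame(
  market = c(2, 1, 3, 0, 0, 0),
  shares = c(1, 2, 1, 0, 0, 1),
  goal   = c(0, 0, 0, 2, 1, 3),
  team   = c(0, 1, 0, 1, 2, 1)
)
train_y <- factor(c("business", "business", "business", "sport", "sport", "sport"))

# Question 6: add the categories as a column named y.
train_df <- cbind(train_dtm, y = train_y)

# Question 7: fit an SVM with a linear kernel.
svm_fit <- svm(y ~ ., data = train_df, kernel = "linear")

test_dtm <- data.frame(
  market = c(2, 0),
  shares = c(1, 0),
  goal   = c(0, 2),
  team   = c(0, 1)
)

pred_train <- predict(svm_fit, newdata = train_dtm)
pred_test  <- predict(svm_fit, newdata = test_dtm)

# Confusion table on the training data.
table(predicted = pred_train, actual = train_y)
```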

Lab exercises

Topic modeling

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

8. Use the LDA function from the topicmodels package to train an LDA model with 5 topics with the Gibbs sampling method.
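
A minimal sketch of fitting LDA with Gibbs sampling is shown below on four toy documents with k = 2; for the practical you would use the full document-term matrix and k = 5. The documents and the seed are assumptions.

```r
library(tm)
library(topicmodels)

# Toy documents with two obvious themes (made up for illustration).
docs <- c(
  "market shares profit bank market economy",
  "bank profit economy shares growth market",
  "team match goal player team coach",
  "player goal coach match season team"
)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# method = "Gibbs" selects Gibbs sampling; the seed makes the run reproducible.
lda_fit <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))

# Three most probable terms per topic, and the most likely topic per document.
terms(lda_fit, 3)
topics(lda_fit)
```

LDA() requires every document to contain at least one term, so rows of the document-term matrix that became empty after preprocessing have to be removed before fitting.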

9. The tidy() method is originally from the broom package (Robinson 2017), for tidying model objects. The tidytext package provides this method for extracting the per-topic-per-word probabilities, called “beta”, from the LDA model. Use this function and check the beta probabilities for each term and topic.
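
As an illustration of the output format, the sketch below fits a small model on toy documents (an assumption) and tidies it. The result has one row per topic-term combination, and the beta values within a topic sum to 1.

```r
library(tm)
library(topicmodels)
library(tidytext)

docs <- c(
  "market shares profit bank market economy",
  "bank profit economy shares growth market",
  "team match goal player team coach",
  "player goal coach match season team"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
lda_fit <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))

# One row per (topic, term) pair with its probability beta.
lda_topics <- tidy(lda_fit, matrix = "beta")
print(lda_topics)

# Sanity check: the betas of each topic sum to 1.
beta_sums <- tapply(lda_topics$beta, lda_topics$topic, sum)
print(beta_sums)
```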

10. Use the code below to plot the top 20 terms within each topic.

lda_top_terms <- lda_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% # We use dplyr’s slice_max() to find the top 20 terms within each topic.
  ungroup() %>%
  arrange(topic, -beta)

lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() # strips the suffix that reorder_within() appends to the terms

11. Use the code below to save the terms and topics in a wide format.

beta_wide <- lda_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>% 
  mutate(log_ratio21 = log2(topic2 / topic1)) %>% 
  mutate(log_ratio31 = log2(topic3 / topic1))%>% 
  mutate(log_ratio41 = log2(topic4 / topic1))%>% 
  mutate(log_ratio51 = log2(topic5 / topic1))
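
To see what these log ratios mean, consider a small worked example with made-up beta values. A log2 ratio of 1 means a term is twice as probable in topic 2 as in topic 1, a ratio of -1 means half as probable, and 0 means equally probable, so the scale is symmetric around zero.

```r
# Hypothetical beta values for three terms in two topics.
toy_beta <- data.frame(
  term   = c("market", "team", "year"),
  topic1 = c(0.004, 0.001, 0.002),
  topic2 = c(0.001, 0.008, 0.002)
)

# log2(topic2 / topic1): negative favours topic 1, positive favours topic 2.
toy_beta$log_ratio21 <- log2(toy_beta$topic2 / toy_beta$topic1)
print(toy_beta)
# "market" is 4 times more probable in topic 1 (ratio about -2),
# "team" is 8 times more probable in topic 2 (ratio about 3),
# "year" is equally probable in both (ratio 0).
```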


12. Use the log ratios to visualize the words with the greatest differences between topic 1 and other topics. Below you see this analysis for topics 1 and 2.

# topic 1 versus topic 2
lda_top_terms1 <- beta_wide %>%
  slice_max(log_ratio21, n = 10) %>%
  arrange(term, -log_ratio21)

lda_top_terms2 <- beta_wide %>%
  slice_max(-log_ratio21, n = 10) %>%
  arrange(term, -log_ratio21)

lda_top_terms12 <- rbind(lda_top_terms1, lda_top_terms2)

# Fix the factor level order so that ggplot draws the terms on the y axis sorted by log ratio.
lda_top_terms12$term <- factor(lda_top_terms12$term, levels = lda_top_terms12$term[order(lda_top_terms12$log_ratio21)])

# Words with the greatest difference in beta between topic 2 and topic 1
lda_top_terms12 %>%
  ggplot(aes(log_ratio21, term, fill = (log_ratio21 > 0))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered()

13. Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called “gamma”, with the matrix = "gamma" argument in the tidy() function. Call this function for your LDA model and save the probabilities in a variable named lda_documents.

14. Check the topic probabilities for the documents with index numbers 1, 1000, 2000, and 2225.
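
Questions 13 and 14 can be sketched as follows on a toy model with four documents, where documents 1 and 4 stand in for the indices used in the practical. The documents, the seed, and the name selected_docs are assumptions.

```r
library(tm)
library(topicmodels)
library(tidytext)
library(dplyr)

docs <- c(
  "market shares profit bank market economy",
  "bank profit economy shares growth market",
  "team match goal player team coach",
  "player goal coach match season team"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
lda_fit <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1234))

# One row per (document, topic) pair with its probability gamma.
lda_documents <- tidy(lda_fit, matrix = "gamma")

# Topic probabilities for selected documents.
selected_docs <- lda_documents %>%
  filter(document %in% c(1, 4)) %>%
  arrange(document, topic)
print(selected_docs)

# Sanity check: the gammas of each document sum to 1.
gamma_sums <- tapply(lda_documents$gamma, lda_documents$document, sum)
print(gamma_sums)
```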

15. Use the code below to visualise the topic probabilities for the example documents in question 14.

# reorder the documents by their dominant topics (topic 1, topic 2, etc.) before plotting
lda_documents[lda_documents$document %in% c(1, 1000, 2000, 2225),] %>%
  mutate(document = reorder(document, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ document) +
  labs(x = "topic", y = expression(gamma))

Alternative LDA implementations

The LDA() function in the topicmodels package is only one implementation of the latent Dirichlet allocation algorithm. For example, the mallet package (Mimno 2013) implements a wrapper around the MALLET Java package for text classification tools, and the tidytext package provides tidiers for this model output as well. The textmineR package has extensive functionality for topic modeling. You can fit Latent Dirichlet Allocation (LDA), Correlated Topic Models (CTM), and Latent Semantic Analysis (LSA) from within textmineR (