Materials for the Applied Data Science profile course INFOMDA2: Battling the curse of dimensionality.
In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets. We use the following packages:
library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)
The data can be generated by running the code below.
1. The code does not have comments. Add descriptive comments to the code below.
set.seed(123)
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")
sim_df <- sim_matrix %>%
  as_tibble() %>%
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))
sim_df_small <- sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))
sim_df_large <- sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))
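One possible annotation (a sketch of an answer; the code is identical to the block above, with comments added):

# fix the random seed so the simulated data are reproducible
set.seed(123)
# 2 x 2 covariance matrix: unit variances, covariance of 0.5
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
# draw 100 observations from a bivariate normal with mean vector (5, 5)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5), Sigma = sigma)
# name the two features
colnames(sim_matrix) <- c("x1", "x2")
# convert to a tibble and attach a random class label (A, B, or C) to each row
sim_df <- sim_matrix %>%
  as_tibble() %>%
  mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))
# shift each class by a small amount (0.5), so the classes overlap strongly
sim_df_small <- sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))
# shift each class by a large amount (2.5), so the classes are well separated
sim_df_large <- sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))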
2. Prepare two unsupervised datasets by removing the class feature.
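For example, with dplyr's select() (a minimal sketch; the names sim_df_small_u and sim_df_large_u are our own choice and are reused in the sketches below):

sim_df_small_u <- sim_df_small %>% select(-class)
sim_df_large_u <- sim_df_large %>% select(-class)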
3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the patchwork package to see how to do this!). What is the difference between the two datasets?
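A minimal sketch with ggplot2 and patchwork, assuming the unsupervised data frames from the previous step:

p_small <- ggplot(sim_df_small_u, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small differences")
p_large <- ggplot(sim_df_large_u, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large differences")
# patchwork overloads + to place the two plots side by side
p_small + p_large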
4. Run a hierarchical clustering on these datasets and display the result as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms? (Hint: functions you'll need are dist(), hclust(), and ggdendrogram().)
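One way to do this (a sketch building on the objects defined above; the y-limit of 10 is a guess and may need adjusting to cover both trees):

# complete-linkage clustering on Euclidean distances
hc_small <- hclust(dist(sim_df_small_u, method = "euclidean"), method = "complete")
hc_large <- hclust(dist(sim_df_large_u, method = "euclidean"), method = "complete")
# dendrograms with a shared y-scale so merge heights are comparable
d_small <- ggdendrogram(hc_small) + ylim(0, 10)
d_large <- ggdendrogram(hc_large) + ylim(0, 10)
d_small + d_large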
5. For the dataset with small differences, also run a complete agglomeration hierarchical clustering with Manhattan distance.
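A sketch, reusing the unsupervised small-difference data from above:

hc_small_man <- hclust(dist(sim_df_small_u, method = "manhattan"), method = "complete")
ggdendrogram(hc_small_man)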
6. Use the cutree() function to obtain the cluster assignments for three clusters and compare these cluster assignments to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What difference do you see?
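A sketch building on the hclust objects from the previous steps:

# 3-cluster assignments from the Euclidean and Manhattan trees
clus_euc <- factor(cutree(hc_small, k = 3))
clus_man <- factor(cutree(hc_small_man, k = 3))
p_euc <- ggplot(sim_df_small_u, aes(x = x1, y = x2, colour = clus_euc)) +
  geom_point() +
  ggtitle("Euclidean distance")
p_man <- ggplot(sim_df_small_u, aes(x = x1, y = x2, colour = clus_man)) +
  geom_point() +
  ggtitle("Manhattan distance")
p_euc + p_man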
7. Create k-means clusterings with 2, 3, 4, and 6 classes on the large difference data. Again, create coloured scatter plots for these clusterings.
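A sketch using purrr to loop over the numbers of clusters:

km_plots <- map(c(2, 3, 4, 6), function(k) {
  km <- kmeans(sim_df_large_u, centers = k)
  ggplot(sim_df_large_u, aes(x = x1, y = x2, colour = factor(km$cluster))) +
    geom_point() +
    labs(colour = "cluster", title = paste(k, "clusters"))
})
# arrange the four plots in a 2 x 2 grid with patchwork
wrap_plots(km_plots, nrow = 2)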
8. Do the same thing again a few times. Do you see the same results every time? Where do you see differences?
9. Find a way online to perform bootstrap stability assessment for the 3- and 6-cluster solutions.
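One option (our suggestion, not prescribed by the practical) is clusterboot() from the fpc package, which reports the mean Jaccard similarity of each cluster across bootstrap resamples; values close to 1 indicate stable clusters:

library(fpc)
boot_3 <- clusterboot(sim_df_large_u, B = 100, clustermethod = kmeansCBI,
                      krange = 3, count = FALSE)
boot_6 <- clusterboot(sim_df_large_u, B = 100, clustermethod = kmeansCBI,
                      krange = 6, count = FALSE)
# cluster-wise mean Jaccard similarities
boot_3$bootmean
boot_6$bootmean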
10. Create a function to perform k-medians clustering. Write this function from scratch: you may use base-R and tidyverse functions. Use Euclidean distance as your distance metric.
Input:
- dataset (as a data frame)
- K (number of clusters)
Output:
- a vector of cluster assignments
Tip: use the unsupervised version of sim_df_small with K = 3 as a test case.
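A minimal sketch of such a function (one possible implementation; empty clusters are not handled): start from random assignments, then alternate between computing the componentwise median of each cluster and reassigning every point to its nearest median in Euclidean distance, until the assignments stop changing.

kmedians <- function(dataset, K) {
  X <- as.matrix(dataset)
  n <- nrow(X)
  # random but balanced initial cluster assignments
  assignments <- sample(rep_len(1:K, n))
  repeat {
    # cluster centres: componentwise median per cluster
    centres <- t(sapply(1:K, function(k)
      apply(X[assignments == k, , drop = FALSE], 2, median)))
    # n x K matrix of Euclidean distances from each point to each centre
    dists <- sapply(1:K, function(k) sqrt(rowSums(sweep(X, 2, centres[k, ])^2)))
    # reassign each point to its nearest centre
    new_assignments <- max.col(-dists, ties.method = "first")
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
  }
  assignments
}

table(kmedians(sim_df_small_u, K = 3))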
11. Add an input parameter smart_init. If this is set to TRUE, initialize the cluster assignments using hierarchical clustering (from hclust). Using the unsupervised sim_df_small, look at the number of iterations needed when you use this method versus when you randomly initialize the cluster assignments.
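A sketch of the extension (one possible implementation; we also return the iteration count so the two initialization strategies can be compared, which goes slightly beyond the output spec of question 10):

kmedians <- function(dataset, K, smart_init = FALSE) {
  X <- as.matrix(dataset)
  n <- nrow(X)
  if (smart_init) {
    # initialize from a complete-linkage hierarchical clustering solution
    assignments <- cutree(hclust(dist(X), method = "complete"), k = K)
  } else {
    # random but balanced initial assignments
    assignments <- sample(rep_len(1:K, n))
  }
  iterations <- 0
  repeat {
    iterations <- iterations + 1
    centres <- t(sapply(1:K, function(k)
      apply(X[assignments == k, , drop = FALSE], 2, median)))
    dists <- sapply(1:K, function(k) sqrt(rowSums(sweep(X, 2, centres[k, ])^2)))
    new_assignments <- max.col(-dists, ties.method = "first")
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
  }
  list(assignments = assignments, iterations = iterations)
}

kmedians(sim_df_small_u, K = 3, smart_init = TRUE)$iterations
kmedians(sim_df_small_u, K = 3)$iterations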