# INFOMDA2 Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.

# 1 Introduction

In this practical, we will apply hierarchical and k-means clustering to two synthetic datasets. We use the following packages:

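``````
library(MASS)      # mvrnorm() for simulating multivariate normal data
library(tidyverse) # data manipulation and ggplot2
library(patchwork) # combining several ggplots into one frame
library(ggdendro)  # ggdendrogram() for plotting dendrograms
``````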

# 2 Take-home exercises

1. Generate the two datasets by running the code below.

``````
set.seed(123)

# 100 draws from a bivariate normal with mean (5, 5) and correlation 0.5
sigma      <- matrix(c(1, .5, .5, 1), 2, 2)
sim_matrix <- mvrnorm(n = 100, mu = c(5, 5),
                      Sigma = sigma)
colnames(sim_matrix) <- c("x1", "x2")

# Assign each observation at random to one of three classes
sim_df <-
  sim_matrix %>%
  as_tibble() %>%
  mutate(class = sample(c("A", "B", "C"), size = 100,
                        replace = TRUE))

# Small differences: shift the classes apart by 0.5
sim_df_small <-
  sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + .5,
                        class == "B" ~ x2 - .5,
                        class == "C" ~ x2 + .5),
         x1 = case_when(class == "A" ~ x1 - .5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + .5))

# Large differences: shift the classes apart by 2.5
sim_df_large <-
  sim_df %>%
  mutate(x2 = case_when(class == "A" ~ x2 + 2.5,
                        class == "B" ~ x2 - 2.5,
                        class == "C" ~ x2 + 2.5),
         x1 = case_when(class == "A" ~ x1 - 2.5,
                        class == "B" ~ x1 - 0,
                        class == "C" ~ x1 + 2.5))
``````

2. Prepare two unsupervised datasets by removing the class feature.

3. For each of these datasets, create a scatterplot. Combine the two plots into a single frame (look up the `patchwork` package to see how to do this). What is the difference between the two datasets?
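
One possible approach to exercises 2 and 3 is sketched below; it is not the official solution. So that the block runs on its own, it rebuilds the two datasets with `simulate_shifted()`, a hypothetical helper (not part of the practical) that is intended to mirror the data-generation code above. The names `df_small` and `df_large` are likewise chosen here for illustration.

```r
library(MASS)
library(tidyverse)
library(patchwork)

# Hypothetical helper: same seed and class shifts as the practical's
# data-generation code, with the shift size d as a parameter.
simulate_shifted <- function(d) {
  set.seed(123)
  xy  <- mvrnorm(n = 100, mu = c(5, 5), Sigma = matrix(c(1, .5, .5, 1), 2, 2))
  cls <- sample(c("A", "B", "C"), size = 100, replace = TRUE)
  tibble(x1 = xy[, 1] + d * unname(c(A = -1, B = 0, C = 1)[cls]),
         x2 = xy[, 2] + d * unname(c(A = 1, B = -1, C = 1)[cls]),
         class = cls)
}

# Exercise 2: drop the class column to obtain unsupervised data
df_small <- simulate_shifted(0.5) %>% dplyr::select(-class)
df_large <- simulate_shifted(2.5) %>% dplyr::select(-class)

# Exercise 3: one scatterplot per dataset
p_small <- ggplot(df_small, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Small differences")
p_large <- ggplot(df_large, aes(x = x1, y = x2)) +
  geom_point() +
  ggtitle("Large differences")

# patchwork overloads `+` to place ggplots side by side
p_small + p_large
```

`dplyr::select()` is written out in full because `MASS` also exports a `select()` function; which one wins depends on the order in which the packages are loaded.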

## 2.1 Hierarchical clustering

4. Run a hierarchical clustering on these datasets and display the results as dendrograms. Use Euclidean distances and the complete agglomeration method. Make sure the two plots have the same y-scale. What is the difference between the dendrograms? (Hint: functions you’ll need are `dist`, `hclust`, `ggdendrogram`, and `ylim`.)

5. For the dataset with small differences, also run a hierarchical clustering with complete agglomeration and Manhattan distance.

6. Use the `cutree()` function to obtain the cluster assignments for three clusters and compare them to the 3-cluster Euclidean solution. Do this comparison by creating two scatter plots with cluster assignment mapped to the colour aesthetic. What differences do you see?
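
Exercises 4 to 6 could be approached roughly as follows. This is a sketch rather than a reference solution: `simulate_shifted()` is a hypothetical helper, intended to mirror the practical's data-generation code, that lets the block run on its own, and all object names are illustrative.

```r
library(MASS)
library(tidyverse)
library(patchwork)
library(ggdendro)

# Hypothetical helper: same seed and class shifts as the practical's data
simulate_shifted <- function(d) {
  set.seed(123)
  xy  <- mvrnorm(n = 100, mu = c(5, 5), Sigma = matrix(c(1, .5, .5, 1), 2, 2))
  cls <- sample(c("A", "B", "C"), size = 100, replace = TRUE)
  tibble(x1 = xy[, 1] + d * unname(c(A = -1, B = 0, C = 1)[cls]),
         x2 = xy[, 2] + d * unname(c(A = 1, B = -1, C = 1)[cls]),
         class = cls)
}
df_small <- simulate_shifted(0.5) %>% dplyr::select(-class)
df_large <- simulate_shifted(2.5) %>% dplyr::select(-class)

# Exercise 4: Euclidean distance, complete linkage
hc_small <- hclust(dist(df_small, method = "euclidean"), method = "complete")
hc_large <- hclust(dist(df_large, method = "euclidean"), method = "complete")

# Use the largest merge height as a shared upper y-limit
y_max   <- max(c(hc_small$height, hc_large$height))
d_small <- ggdendrogram(hc_small, labels = FALSE) + ylim(0, y_max)
d_large <- ggdendrogram(hc_large, labels = FALSE) + ylim(0, y_max)
d_small + d_large

# Exercise 5: Manhattan distance on the small-difference data
hc_manh <- hclust(dist(df_small, method = "manhattan"), method = "complete")

# Exercise 6: three-cluster assignments from both trees
clus <- df_small %>%
  mutate(euclidean = factor(cutree(hc_small, k = 3)),
         manhattan = factor(cutree(hc_manh,  k = 3)))

p_euc <- ggplot(clus, aes(x1, x2, colour = euclidean)) + geom_point()
p_man <- ggplot(clus, aes(x1, x2, colour = manhattan)) + geom_point()
p_euc + p_man

# Cluster labels are arbitrary, so a cross-table helps match them up
table(clus$euclidean, clus$manhattan)
```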

# 3 Practical exercises

## 3.1 K-means clustering

7. Create k-means clusterings with 2, 3, 4, and 6 clusters on the large difference data. Again, create coloured scatter plots for these clusterings.

8. Do the same thing again a few times. Do you see the same results every time? Where do you see differences?

9. Find a way online to perform bootstrap stability assessment for the 3- and 6-cluster solutions.
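
A minimal sketch for exercises 7 to 9 follows. Several things in it are assumptions rather than part of the practical: `simulate_shifted()` and `plot_kmeans()` are hypothetical helpers, and the stability assessment uses `clusterboot()` from the `fpc` package, which is only one of several options and is not in the package list above. The `fpc` part is therefore skipped when that package is not installed.

```r
library(MASS)
library(tidyverse)
library(patchwork)

# Hypothetical helper: same seed and class shifts as the practical's data
simulate_shifted <- function(d) {
  set.seed(123)
  xy  <- mvrnorm(n = 100, mu = c(5, 5), Sigma = matrix(c(1, .5, .5, 1), 2, 2))
  cls <- sample(c("A", "B", "C"), size = 100, replace = TRUE)
  tibble(x1 = xy[, 1] + d * unname(c(A = -1, B = 0, C = 1)[cls]),
         x2 = xy[, 2] + d * unname(c(A = 1, B = -1, C = 1)[cls]),
         class = cls)
}
df_large <- simulate_shifted(2.5) %>% dplyr::select(-class)

# Hypothetical helper: fit k-means and colour the points by cluster
plot_kmeans <- function(df, k) {
  fit <- kmeans(df, centers = k)
  df %>%
    mutate(cluster = factor(fit$cluster)) %>%
    ggplot(aes(x1, x2, colour = cluster)) +
    geom_point() +
    ggtitle(paste("k =", k))
}

# Exercises 7 and 8: kmeans() starts from randomly chosen centres, so
# change or remove this seed and rerun to see how the solutions vary.
set.seed(45)
plots <- map(c(2, 3, 4, 6), ~ plot_kmeans(df_large, .x))
wrap_plots(plots, ncol = 2)

# Exercise 9 (assumes the fpc package): refit k-means on bootstrap
# resamples and report the mean Jaccard similarity per cluster.
if (requireNamespace("fpc", quietly = TRUE)) {
  boot_3 <- fpc::clusterboot(as.data.frame(df_large), B = 100,
                             bootmethod = "boot",
                             clustermethod = fpc::kmeansCBI,
                             krange = 3, count = FALSE)
  boot_6 <- fpc::clusterboot(as.data.frame(df_large), B = 100,
                             bootmethod = "boot",
                             clustermethod = fpc::kmeansCBI,
                             krange = 6, count = FALSE)
  print(boot_3$bootmean)
  print(boot_6$bootmean)
}
```

Values of `bootmean` close to 1 indicate clusters that are recovered in most bootstrap resamples; lower values point to unstable clusters.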

# 4 Challenge question

10. Create a function to perform k-medians clustering.

Write this function from scratch: you may use base-R and tidyverse functions. Use Euclidean distance as your distance metric.

Input:

- dataset (as a data frame)
- K (number of clusters)

Output:

- a vector of cluster assignments

Tip: use the unsupervised version of `sim_df_large` with `K = 3` as a test dataset.

11. Add an input parameter `smart_init`. If this is set to `TRUE`, initialize cluster assignments using hierarchical clustering (from `hclust`). Using the unsupervised `sim_df_small`, compare the number of iterations needed with this method against random initialization.
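
One way the challenge could be tackled is sketched below, using base R for the algorithm itself. It is an illustration under stated assumptions, not a reference solution: the `max_iter` argument and the `"iterations"` attribute on the result are additions made here so that iteration counts can be compared, and `simulate_shifted()` is a hypothetical helper that rebuilds the small-difference data. Because points are assigned by Euclidean distance while centres are component-wise medians, convergence is not guaranteed in general, hence the iteration cap.

```r
library(MASS)
library(tidyverse)

# Hypothetical helper: same seed and class shifts as the practical's data
simulate_shifted <- function(d) {
  set.seed(123)
  xy  <- mvrnorm(n = 100, mu = c(5, 5), Sigma = matrix(c(1, .5, .5, 1), 2, 2))
  cls <- sample(c("A", "B", "C"), size = 100, replace = TRUE)
  tibble(x1 = xy[, 1] + d * unname(c(A = -1, B = 0, C = 1)[cls]),
         x2 = xy[, 2] + d * unname(c(A = 1, B = -1, C = 1)[cls]),
         class = cls)
}

kmedians <- function(data, K, smart_init = FALSE, max_iter = 100) {
  x <- as.matrix(data)
  n <- nrow(x)

  # Initial assignments: cut a complete-linkage tree into K groups,
  # or shuffle balanced labels so that no cluster starts empty
  if (smart_init) {
    clus <- unname(cutree(hclust(dist(x), method = "complete"), k = K))
  } else {
    clus <- sample(rep_len(seq_len(K), n))
  }

  for (iter in seq_len(max_iter)) {
    # Component-wise median of each cluster (K x p matrix)
    centres <- do.call(rbind, lapply(seq_len(K), function(k) {
      apply(x[clus == k, , drop = FALSE], 2, median)
    }))

    # Euclidean distance from every point to every centre (n x K matrix)
    d <- sapply(seq_len(K), function(k) {
      sqrt(rowSums(sweep(x, 2, centres[k, ])^2))
    })
    d[is.na(d)] <- Inf  # a cluster that became empty attracts no points

    new_clus <- unname(apply(d, 1, which.min))
    if (all(new_clus == clus)) break  # assignments stable: stop
    clus <- new_clus
  }

  # Assumption for illustration: expose the iteration count as an attribute
  attr(clus, "iterations") <- iter
  clus
}

df_small <- simulate_shifted(0.5) %>% dplyr::select(-class)

set.seed(1)
random_fit <- kmedians(df_small, K = 3)
smart_fit  <- kmedians(df_small, K = 3, smart_init = TRUE)

# Iterations used by each initialization
attr(random_fit, "iterations")
attr(smart_fit, "iterations")
```

The random start depends on the seed, so a fair comparison would repeat the random initialization several times and look at the spread of iteration counts rather than at a single run.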