Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
In this practical, we will apply model-based clustering on a data set of bank note measurements.
We use the following packages:
library(mclust)
library(tidyverse)
library(patchwork)
The data is built into the mclust package and can be loaded as a tibble by running the following code:
df <- as_tibble(banknote)
1. Read the help file of the banknote data set to understand what it’s all about.
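The help page can be opened directly from the R console:

```r
# open the documentation for the banknote data set
?banknote
# equivalently: help("banknote", package = "mclust")
```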
2. Create a scatter plot of the left (x-axis) and right (y-axis) measurements in the data set. Map the Status column to colour. Jitter the points to avoid overplotting. Are the classes easy to distinguish based on these features?
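A minimal sketch of such a plot, using the Left and Right columns of the banknote data:

```r
library(mclust)
library(tidyverse)

df <- as_tibble(banknote)

# jittered scatter plot of the left vs. right margin widths,
# coloured by the true class label
ggplot(df, aes(x = Left, y = Right, colour = Status)) +
  geom_jitter() +
  theme_minimal()
```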
3. From now on, we will assume that we don’t have the labels. Remove the Status column from the data set.
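One way to drop the column, using dplyr:

```r
# remove the class labels so clustering is fully unsupervised
df <- df %>% select(-Status)
```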
4. Create density plots for all columns in the data set. Which single feature is likely to be best for clustering?
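A sketch for producing one density panel per feature, assuming Status has already been removed in the previous step:

```r
# pivot to long format so each feature gets its own facet
df %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(vars(feature), scales = "free") +
  theme_minimal()
```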
5. Use Mclust to perform model-based clustering with 2 clusters on the feature you chose. Assume equal variances. Name the model object fit_E_2. What are the means and variances of the clusters?
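A sketch, using the Diagonal feature as an example choice (substitute your own if you picked a different one); "E" is mclust's model name for equal (shared) variances in one dimension:

```r
fit_E_2 <- Mclust(df$Diagonal, G = 2, modelNames = "E")

# cluster means and the shared variance are in the parameters list
summary(fit_E_2, parameters = TRUE)
fit_E_2$parameters$mean
fit_E_2$parameters$variance
```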
6. Use the formula from the slides and the model’s log-likelihood (fit_E_2$loglik) to compute the BIC for this model. Compare it to the BIC stored in the model object (fit_E_2$bic). Explain how many parameters (m) you used and which parameters these are.
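A sketch of the computation, assuming the Diagonal feature from the previous step. Note that mclust stores the BIC with the opposite sign of the usual textbook formula, so that higher is better:

```r
n <- nrow(df)  # number of observations (200 for banknote)
m <- 4         # 2 means + 1 shared variance + 1 free mixing proportion

# textbook BIC: -2*loglik + m*log(n), lower is better
bic_textbook <- -2 * fit_E_2$loglik + m * log(n)

# mclust's convention: 2*loglik - m*log(n), higher is better
bic_mclust <- 2 * fit_E_2$loglik - m * log(n)
bic_mclust
fit_E_2$bic
```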
7. Plot the model-implied density using the plot() function. Afterwards, add rug marks of the original data to the plot using the rug() function from the base graphics system.
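A sketch, again assuming the Diagonal feature:

```r
# model-implied mixture density, with the raw data as rug marks
plot(fit_E_2, what = "density")
rug(df$Diagonal)
```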
8. Use Mclust to perform model-based clustering with 2 clusters on this feature again, but now assume unequal variances. Name the model object fit_V_2. What are the means and variances of the clusters? Plot the density again and note the differences.
9. How many parameters does this model have? Name them.
10. According to the deviance, which model fits better?
11. According to the BIC, which model is better?
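The two comparisons above can be sketched as follows, using the model objects named in the earlier exercises:

```r
# deviance = -2 * log-likelihood, lower is better
-2 * fit_E_2$loglik
-2 * fit_V_2$loglik

# mclust's BIC, higher is better in its sign convention
fit_E_2$bic
fit_V_2$bic
```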
We will now use all available information in the data set to cluster the observations.
12. Use Mclust with all 6 features to perform clustering. Allow all model types (shapes), and from 1 to 9 potential clusters. What is the optimal model based on the BIC?
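A sketch, assuming df contains only the 6 measurement features (Status removed). Leaving modelNames unspecified lets Mclust try all model shapes:

```r
fit_all <- Mclust(df, G = 1:9)

# the summary reports the selected model shape and number of clusters
summary(fit_all)
plot(fit_all, what = "BIC")
```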
13. How many mean parameters does this model have?
14. Run a 2-component VVV model on this data. Create a matrix of bivariate contour (“density”) plots using the plot() function. Which features provide good component separation? Which do not?
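A sketch; the object name fit_VVV_2 is chosen here for illustration:

```r
fit_VVV_2 <- Mclust(df, G = 2, modelNames = "VVV")

# for multivariate fits, the density plot shows pairwise contour panels
plot(fit_VVV_2, what = "density")
```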
15. Create a scatter plot just like the first scatter plot in this tutorial, but map the estimated class assignments to the colour aesthetic. Map the uncertainty (part of the fitted model list) to the size aesthetic, such that larger points indicate more uncertain class assignments. Jitter the points to avoid overplotting. What do you notice about the uncertainty?
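A sketch, assuming the 2-component VVV model from the previous exercise (the name fit_VVV_2 is chosen here for illustration); the classification and uncertainty vectors are part of the fitted model list:

```r
# 2-component VVV fit from the previous exercise
fit_VVV_2 <- Mclust(df, G = 2, modelNames = "VVV")

df %>%
  mutate(class       = factor(fit_VVV_2$classification),
         uncertainty = fit_VVV_2$uncertainty) %>%
  ggplot(aes(x = Left, y = Right, colour = class, size = uncertainty)) +
  geom_jitter() +
  theme_minimal()
```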
NB: this procedure is very technical and will not be tested in-depth in the exam. It is meant to give you a start in high-dimensional clustering and an example of how to explore new packages.
16. Install and load the package HDclassif. Read the introduction and section 4.2, parts “First results” and “PCA representation”, from the associated paper here.
This paper is from the Journal of Statistical Software, a very high-quality open journal describing statistical software packages. If a package has a JSS paper, always start there!
17. Run high-dimensional data clustering on the Crabs dataset using demo("hddc"). Choose the EM algorithm with random initialization and the AkBkQkDk model. Explain what happens in the plot window.