In this practical, we will apply model-based clustering on a data set of bank note measurements.
We use the following packages:
library(mclust)
library(tidyverse)
library(patchwork)
The data is built into the mclust
package and can be
loaded as a tibble
by running the following code:
df <- as_tibble(banknote)
1. Read the help file of the banknote
data set
to understand what it’s all about.
?banknote
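For a quick first impression of what the help file describes, you can also look at the tibble directly (an optional check; glimpse() comes with the tidyverse):
# six measurements (in mm) on 100 genuine and 100 counterfeit Swiss 1000-franc bills,
# plus the true label in the Status column
glimpse(df)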
2. Create a scatter plot of the left (x-axis) and right
(y-axis) measurements on the data set. Map the Status
column to colour. Jitter the points to avoid overplotting. Are the
classes easy to distinguish based on these features?
df %>%
  ggplot(aes(x = Left, y = Right, colour = Status)) +
  geom_point(position = position_jitter())
# the classes are not easy to distinguish: there is considerable overlap.
3. From now on, we will assume that we don’t have the labels.
Remove the Status
column from the data set.
df <- df %>% select(-Status)
4. Create density plots for all columns in the data set. Which single feature is likely to be best for clustering?
# big patchwork of density plots
df %>% ggplot(aes(x = Length)) + geom_density() +
  df %>% ggplot(aes(x = Left)) + geom_density() +
  df %>% ggplot(aes(x = Right)) + geom_density() +
  df %>% ggplot(aes(x = Bottom)) + geom_density() +
  df %>% ggplot(aes(x = Top)) + geom_density() +
  df %>% ggplot(aes(x = Diagonal)) + geom_density()
# the Diagonal feature looks good! Look at the two bumps in its density plot.
# alternative:
library(ggridges)
df %>%
  mutate_all(scale) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value, y = feature, fill = feature)) +
  geom_density_ridges() +
  scale_fill_viridis_d(guide = "none") +
  theme_minimal()
5. Use Mclust to perform model-based clustering with 2 clusters on the feature you chose. Assume equal variances. Name the model object fit_E_2. What are the means and variances of the clusters?
fit_E_2 <- Mclust(df %>% pull(Diagonal), G = 2, modelNames = "E")
# means:
fit_E_2$parameters$mean
##        1        2
## 139.4464 141.5221
# variances:
fit_E_2$parameters$variance$sigmasq
## [1] 0.244004
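The mixing proportions (estimated class probabilities) live in the same parameters list; inspecting them is optional but instructive:
# estimated mixing proportions of the two components
fit_E_2$parameters$pro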
6. Use the formula from the slides and the model's log-likelihood (fit_E_2$loglik) to compute the BIC for this model. Compare it to the BIC stored in the model object (fit_E_2$bic). Explain how many parameters (m) you used and which parameters these are.
# This model has 4 parameters: 1 class probability pi, 2 means, and 1 variance
- 2 * fit_E_2$loglik + log(nrow(df)) * 4
## [1] 569.4667
fit_E_2$bic
## [1] -569.4667
# note: mclust defines BIC as 2 * loglik - m * log(n), the negative of the textbook
# formula, so in mclust output a higher BIC is better!
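To make the sign convention explicit, here is a small check; it relies on mclust defining its BIC as 2 * loglik - m * log(n), the negative of the formula from the slides:
m <- 4  # 1 mixing proportion + 2 means + 1 shared variance
bic_slides <- -2 * fit_E_2$loglik + m * log(nrow(df))  # formula from the slides
bic_mclust <-  2 * fit_E_2$loglik - m * log(nrow(df))  # mclust's convention
all.equal(bic_slides, -bic_mclust)   # identical up to sign
all.equal(bic_mclust, fit_E_2$bic)   # matches the stored value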
7. Plot the model-implied density using the
plot()
function. Afterwards, add rug marks of the original
data to the plot using the rug()
function from the base
graphics system.
# plot the density using base R plots
plot(fit_E_2, "density", xlab = "Diagonal measurement")
# add the observations using rug marks
rug(df %>% pull(Diagonal))
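As an aside, the same curve can be computed by hand from the fitted parameters; this sketch assumes the two-component, equal-variance parametrisation of fit_E_2 and overlays the result on the plot you just made:
# evaluate the model-implied mixture density on a grid and add it as a dashed line
pars <- fit_E_2$parameters
x <- seq(min(df$Diagonal), max(df$Diagonal), length.out = 200)
dens <- pars$pro[1] * dnorm(x, pars$mean[1], sqrt(pars$variance$sigmasq)) +
  pars$pro[2] * dnorm(x, pars$mean[2], sqrt(pars$variance$sigmasq))
lines(x, dens, lty = 2)  # should coincide with the plotted density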
8. Use Mclust to perform model-based clustering with 2 clusters on this feature again, but now assume unequal variances. Name the model object fit_V_2. What are the means and variances of the clusters? Plot the density again and note the differences.
fit_V_2 <- Mclust(df %>% pull(Diagonal), G = 2, modelNames = "V")
fit_V_2$parameters$mean
##        1        2
## 139.4973 141.5604
fit_V_2$parameters$variance$sigmasq
## [1] 0.3589844 0.1500838
plot(fit_V_2, "density", xlab = "Diagonal measurement")
rug(df %>% pull(Diagonal))
# The left cluster now has a larger variance, because the data in that region are more spread out.
9. How many parameters does this model have? Name them.
# 5 parameters:
# 1 class probability (pi)
# 2 means
# 2 variances
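You can let mclust do this bookkeeping for you: nMclustParams() counts the free parameters for a given model name, dimension, and number of clusters (shown here purely as a check of the counts above):
nMclustParams("E", d = 1, G = 2)  # 4
nMclustParams("V", d = 1, G = 2)  # 5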
10. According to the deviance, which model fits better?
-2*fit_E_2$loglik
## [1] 548.2735
-2*fit_V_2$loglik
## [1] 537.0199
# The V model fits better (lower deviance)
11. According to the BIC, which model is better?
# remember, mclust's BIC has the opposite sign, so we flip it to compare on the usual scale
-fit_E_2$bic
## [1] 569.4667
-fit_V_2$bic
## [1] 563.5115
# The V model is better (lower BIC on this scale), but the gap is smaller than for the deviance because of the penalty for its extra parameter!
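mclust can also carry out this comparison directly; a minimal sketch using mclustBIC(), which fits the requested models and tabulates their BIC values (in mclust's sign convention, so higher is better):
mclustBIC(df %>% pull(Diagonal), G = 2, modelNames = c("E", "V"))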
We will now use all available information in the data set to cluster the observations.
12. Use Mclust with all 6 features to perform clustering. Allow all model types (shapes), and from 1 to 9 potential clusters. What is the optimal model based on the BIC?
fit <- Mclust(df)
summary(fit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVE (ellipsoidal, equal orientation) model with 3 components:
##
## log-likelihood n df BIC ICL
## -663.3814 200 53 -1607.574 -1607.71
##
## Clustering table:
## 1 2 3
## 18 98 84
# The optimal model is a VVE model with 3 components
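To see how this optimum was chosen, you can plot the BIC of every fitted model against the number of clusters; the plot method accepts what = "BIC":
# BIC (mclust convention, higher is better) for all model types and 1 to 9 clusters
plot(fit, "BIC")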
13. How many mean parameters does this model have?
fit$parameters$mean
## [,1] [,2] [,3]
## Length 215.023017 214.971360 214.78091
## Left 130.499684 129.929686 130.26435
## Right 130.304813 129.701143 130.17988
## Bottom 8.775842 8.301115 10.85719
## Top 11.173812 10.162379 11.10798
## Diagonal 138.731602 141.541673 139.62387
# 3 clusters * 6 variables = 18 mean parameters
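These 18 means are part of the 53 free parameters (df) reported by summary(fit); the decomposition below is my own bookkeeping for the VVE model, not package output:
# 18 means (3 clusters * 6 variables)
# 2 free mixing proportions (3 - 1)
# 33 covariance parameters: 3 volumes + 3 * 5 shape parameters + 15 shared orientation
fit$df  # 53 = 18 + 2 + 33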
14. Run a 2-component VVV model on this data. Create a matrix
of bivariate contour (“density”) plots using the plot()
function. Which features provide good component separation? Which do
not?
VVV2 <- Mclust(df, 2, "VVV")
plot(VVV2, "density")
# Diagonal-Top and Diagonal-Bottom separate the classes very well!
15. Create a scatter plot just like the first scatter plot in this tutorial, but map the estimated class assignments to the colour aesthetic. Map the uncertainty (part of the fitted model list) to the size aesthetic, such that larger points indicate more uncertain class assignments. Jitter the points to avoid overplotting. What do you notice about the uncertainty?
df %>%
  ggplot(aes(
    x = Left,
    y = Right,
    colour = as_factor(VVV2$classification),
    size = VVV2$uncertainty
  )) +
  geom_point(position = position_jitter()) +
  scale_size(range = c(1, 3)) +
  labs(colour = "Cluster", size = "Uncertainty")
# Only 3 points are slightly uncertain, and they do not lie near the border between the clusters in these two dimensions. The other dimensions give enough information to classify the points that do lie near that border.
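To back this up, you can list the largest uncertainties and, because the true labels still exist in the original banknote data, cross-tabulate them with the estimated clusters; this sanity check assumes the row order of df still matches banknote:
# the observations with the most uncertain assignments
sort(VVV2$uncertainty, decreasing = TRUE) %>% head()
# how well the 2-cluster solution recovers the true labels
table(banknote$Status, VVV2$classification)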
NB: this procedure is very technical and will not be tested in-depth in the exam. It is meant to give you a start in high-dimensional clustering and an example of how to explore new packages.
16. Install and load the package HDclassif. Read the introduction and section 4.2, parts "First results" and "PCA representation", from the associated paper (Bergé, Bouveyron & Girard, 2012, Journal of Statistical Software).
This paper is from the Journal of Statistical Software, a very high-quality open journal describing statistical software packages. If a package has a JSS paper, always start there!
install.packages("HDclassif")
library(HDclassif)
17. Run high-dimensional data clustering on the Crabs dataset using demo("hddc"). Choose the EM algorithm with random initialization and the AkBkQkDk model. Explain what happens in the plot window.
demo("hddc")
# We are performing Gaussian mixture modeling on a low-dimensional subspace of the
# crabs data.
# In the plot, we see the data projected onto the first principal components,
# and in the animation we see the clusters stabilize as the EM iterations proceed.
# The final solution is a 4-cluster solution with different orientations.
# If we had chosen the AjBQD model type, we would have assumed that all clusters have
# the same orientation.
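If you want to go beyond the demo, hddc() can also be called directly; the sketch below applies it to the crabs data from MASS with the same settings (argument names follow ?hddc; results will vary because of the random initialization):
# load the crabs data without attaching MASS (avoids masking dplyr::select)
data(crabs, package = "MASS")
# cluster the five morphological measurements into 4 groups
fit_hddc <- hddc(crabs[, 4:8], K = 4, model = "AkBkQkDk", algo = "EM", init = "random")
# compare the estimated clusters to the true species-by-sex groups
table(paste(crabs$sp, crabs$sex), fit_hddc$class)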