Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
David J. Hessen
In this practical, we will perform a singular value decomposition and a principal component analysis. You will not have to load any packages in advance, as we will solely use base R. The data for today's practical can be downloaded here:
In this exercise, the data of Example 1 from the lecture slides are used. These data have been stored in the file `Example1.dat`.
1. Use the function `read.table()` to import the data into R. Use the function `as.matrix()` to convert the data frame to a matrix. The two features are not centered. To center the two features, the function `scale()` with the argument `scale = FALSE` can be used. Give the centered data matrix a name, for instance, `C`.
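A minimal sketch of this step, assuming `Example1.dat` is in the working directory and has no header row (adjust the `header` argument to match the file):

```r
# Import the data and convert the data frame to a matrix
X <- as.matrix(read.table("Example1.dat", header = FALSE))

# Center the two features by subtracting the column means (no scaling)
C <- scale(X, center = TRUE, scale = FALSE)
```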
2. Calculate the sample size `N` and the covariance matrix `S` by executing the following R code, where the function `t()` is used to calculate the transpose of `C` and `%*%` is used for matrix multiplication.

```r
N <- dim(C)[1]
S <- t(C) %*% C / N
```
3. Use the function `svd()` to apply a singular value decomposition to the centered data matrix.
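For example, along these lines:

```r
# Singular value decomposition of the centered data matrix: C = U D V'
dec <- svd(C)

dec$u  # left singular vectors U
dec$d  # singular values, the diagonal of D (returned as a vector)
dec$v  # right singular vectors V
```

Note that `svd()` returns the singular values as a vector; use `diag(dec$d)` to obtain the diagonal matrix `D`.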
4. Inspect the three pieces of output, that is, `U`, `D`, and `V`. Are the three matrices the same as on the slides?
Note: A singular value decomposition is not unique. The singular values are unique, but the left and right singular vectors are only unique up to sign: a left singular vector and the corresponding right singular vector can both be multiplied by -1 without changing the product $UDV^{\top}$.
5. Use a single matrix product to calculate the principal component scores.
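One possibility, assuming the decomposition is stored in `dec` as in the sketch above; since $C = UDV^{\top}$ and $V$ is orthonormal, the scores $CV$ equal $UD$:

```r
# Principal component scores: U D (equivalently, C %*% dec$v)
scores <- dec$u %*% diag(dec$d)
```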
6. Plot the scores on the second principal component (y-axis) against the scores on the first principal component (x-axis) and let the range of the x-axis run from -18 to 18 and the range of the y-axis from -16 to 16.
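A sketch, assuming the scores are stored in `scores`:

```r
# Second principal component against the first, with the requested axis ranges
plot(scores[, 1], scores[, 2],
     xlab = "PC1", ylab = "PC2",
     xlim = c(-18, 18), ylim = c(-16, 16))
```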
7. Use the function `eigen()` to apply an eigendecomposition to the sample covariance matrix.
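For example:

```r
# Eigendecomposition of the sample covariance matrix
edec <- eigen(S)

edec$values   # eigenvalues
edec$vectors  # orthonormal eigenvectors, one per column
```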
Note that an alternative form of the eigendecomposition of the covariance matrix $S$ is given by

$$S = \sum_{j} \lambda_j \mathbf{v}_j \mathbf{v}_j^{\top},$$

where $\lambda_j$ is the $j$th eigenvalue and $\mathbf{v}_j$ the $j$th eigenvector of $S$. From this alternative form it is clear that the (orthonormal) eigenvectors are only defined up to sign, because $(-\mathbf{v}_j)(-\mathbf{v}_j)^{\top} = \mathbf{v}_j\mathbf{v}_j^{\top}$. This may lead to differences in sign in the output.
8. Check whether the eigenvalues are equal to the variances of the two principal components. Be aware that the base R function `var()` takes $N - 1$ in the denominator, to get an unbiased estimate of the variance.
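One way to check, assuming `scores` and `edec` from the previous steps; the factor $(N-1)/N$ converts the unbiased variances returned by `var()` to variances with $N$ in the denominator:

```r
# Variances of the principal component scores, with N in the denominator
apply(scores, 2, var) * (N - 1) / N

# Compare with the eigenvalues
edec$values
```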
9. Finally, calculate the percentage of total variance explained by each principal component.
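For example:

```r
# Percentage of total variance explained by each principal component
100 * edec$values / sum(edec$values)
```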
In this exercise, a PCA is used to determine the financial strength of insurance companies. Eight relevant features have been selected: (1) gross written premium, (2) net mathematical reserves, (3) gross claims paid, (4) net premium reserves, (5) net claim reserves, (6) net income, (7) share capital, and (8) gross written premium ceded in reinsurance.
To perform a principal component analysis, an eigendecomposition can be applied to the sample correlation matrix instead of the sample covariance matrix. Note that the sample correlation matrix is the sample covariance matrix of the standardized features. These two ways of doing a PCA will yield different results. If the features have the same scales (the same units), then the covariance matrix should be used. If the features have different scales, then it is in general better to use the correlation matrix, because otherwise the features with high absolute variances will dominate the results.
The means and standard deviations of the features can be found in the following table.
feature | mean | std.dev. |
---|---|---|
(1) GR_WRI_PRE | 143,232,114 | 228,320,220 |
(2) NET_M_RES | 43,188,939 | 151,739,600 |
(3) GR_CL_PAID | 65,569,646 | 125,986,400 |
(4) NET_PRE_RES | 41,186,950 | 65,331,440 |
(5) NET_CL_RES | 14,726,188 | 24,805,660 |
(6) NET_INCOME | -1,711,048 | 22,982,140 |
(7) SHARE_CAP | 33,006,176 | 38,670,760 |
(8) GR_WRI_PR_CED | 37,477,746 | 93,629,540 |
The sample correlation matrix is given below.
 | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
---|---|---|---|---|---|---|---|---|
(1) | 1.00 | 0.32 | 0.95 | 0.94 | 0.84 | 0.22 | 0.47 | 0.82 |
(2) | 0.32 | 1.00 | 0.06 | 0.21 | 0.01 | 0.30 | 0.10 | 0.01 |
(3) | 0.95 | 0.06 | 1.00 | 0.94 | 0.89 | 0.14 | 0.44 | 0.81 |
(4) | 0.94 | 0.21 | 0.94 | 1.00 | 0.88 | 0.19 | 0.50 | 0.68 |
(5) | 0.84 | 0.01 | 0.89 | 0.88 | 1.00 | -0.23 | 0.55 | 0.63 |
(6) | 0.22 | 0.30 | 0.14 | 0.19 | -0.23 | 1.00 | -0.15 | 0.21 |
(7) | 0.47 | 0.10 | 0.44 | 0.50 | 0.55 | -0.15 | 1.00 | 0.14 |
(8) | 0.82 | 0.01 | 0.81 | 0.68 | 0.63 | 0.21 | 0.14 | 1.00 |
```r
R <- matrix(c(1.00, 0.32, 0.95, 0.94, 0.84, 0.22, 0.47, 0.82,
              0.32, 1.00, 0.06, 0.21, 0.01, 0.30, 0.10, 0.01,
              0.95, 0.06, 1.00, 0.94, 0.89, 0.14, 0.44, 0.81,
              0.94, 0.21, 0.94, 1.00, 0.88, 0.19, 0.50, 0.68,
              0.84, 0.01, 0.89, 0.88, 1.00, -0.23, 0.55, 0.63,
              0.22, 0.30, 0.14, 0.19, -0.23, 1.00, -0.15, 0.21,
              0.47, 0.10, 0.44, 0.50, 0.55, -0.15, 1.00, 0.14,
              0.82, 0.01, 0.81, 0.68, 0.63, 0.21, 0.14, 1.00),
            nrow = 8)
```
10. Use R to apply a PCA to the sample correlation matrix.
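For example:

```r
# PCA via an eigendecomposition of the sample correlation matrix
edecR <- eigen(R)

edecR$values   # eigenvalues: the variances of the principal components
edecR$vectors  # eigenvectors: the loadings
```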
An alternative criterion for extracting fewer principal components than the number of original variables, when applying a PCA to the sample correlation matrix, is the eigenvalue-greater-than-one rule. This rule says that the number of extracted principal components should be equal to the number of eigenvalues greater than one. Since each of the standardized variables has a variance of one, the total variance equals the number of variables (here, 8). If a principal component has an eigenvalue greater than one, then its variance is greater than the variance of each of the original standardized variables, so this principal component explains more of the total variance than each of the original standardized variables.
11. How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
12. How much of the total variance does this number of extracted principal components explain?
13. Make a scree-plot. How many principal components should be extracted according to the scree-plot?
14. How much of the total variance does this number of extracted principal components explain?
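A sketch for these questions, assuming `edecR` from the previous step:

```r
evals <- edecR$values

# Eigenvalue-greater-than-one rule: number of eigenvalues above 1
sum(evals > 1)

# Cumulative percentage of total variance explained
100 * cumsum(evals) / sum(evals)

# Scree-plot: eigenvalues against component number
plot(evals, type = "b", xlab = "Component number", ylab = "Eigenvalue")
```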
15. Install the R package `keras`, following the steps outlined here. Note that the latest Python version you can use is 3.11. You may have to set the Python interpreter in RStudio; this can be done at Tools > Global Options > Python.
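A minimal sketch of the installation, assuming the default setup from the linked instructions (details may differ on your system):

```r
# Install the R package and the underlying Python libraries
install.packages("keras")
library(keras)
install_keras()
```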
In this assignment, you will apply a PCA to a simple, easy-to-understand dataset. You will use the `mtcars` dataset, which is built into R. This dataset consists of data on 32 car models, taken from an American motoring magazine (1974 Motor Trend magazine). For each car, you have 11 features, expressed in varying units (US units). They are as follows:
- `mpg`: fuel consumption (miles per (US) gallon); more powerful and heavier cars tend to consume more fuel.
- `cyl`: number of cylinders; more powerful cars often have more cylinders.
- `disp`: displacement (cu. in.); the combined volume of the engine's cylinders.
- `hp`: gross horsepower; this is a measure of the power generated by the car.
- `drat`: rear axle ratio; this describes how a turn of the drive shaft corresponds to a turn of the wheels. Higher values will decrease fuel efficiency.
- `wt`: weight (1000 lbs).
- `qsec`: 1/4 mile time; a measure of the car's speed and acceleration.
- `vs`: engine block; this denotes whether the vehicle's engine is shaped like a 'V', or is a more common straight shape.
- `am`: transmission; this denotes whether the car's transmission is automatic (0) or manual (1).
- `gear`: number of forward gears; sports cars tend to have more gears.
- `carb`: number of carburetors; associated with more powerful engines.

Note that the units used vary and occupy different scales.
First, the principal components will be computed. Because PCA works best with numerical data, you'll exclude the two categorical variables (`vs` and `am`; columns 8 and 9). You are left with a matrix of 9 columns and 32 rows, which you pass to the `prcomp()` function, assigning your output to `mtcars.pca`. You will also set two arguments, `center` and `scale`, to be `TRUE`. This is done to apply a principal component analysis to the standardized features. So, execute

```r
mtcars.pca <- prcomp(mtcars[, c(1:7, 10, 11)],
                     center = TRUE,
                     scale = TRUE)
```
16. Have a peek at the PCA object with `summary()`.
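For example:

```r
# Standard deviation and (cumulative) proportion of variance per component
summary(mtcars.pca)
```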
You obtain 9 principal components, which you call PC1-9. Each of these explains a percentage of the total variance in the dataset.
17. What is the percentage of total variance explained by PC1?
18. What is the percentage of total variance explained by PC1, PC2, and PC3 together?
The PCA object `mtcars.pca` contains the following information:

- `$center`: the means of the original features (used for centering)
- `$scale`: the standard deviations of the original features (used for scaling)
- `$sdev`: the standard deviations of the principal components
- `$rotation`: the matrix of eigenvectors (the loadings)
- `$x`: the principal component scores
19. Determine the eigenvalues. How many principal components should be extracted according to the eigenvalue-greater-than-one rule?
20. What is the value of the total variance? Why?
21. How much of the total variance is explained by the number of extracted principal components according to the eigenvalue-greater-than-one rule?
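A sketch for these questions; for `prcomp()` output, the eigenvalues are the squared standard deviations of the principal components:

```r
evals <- mtcars.pca$sdev^2

evals                             # eigenvalues
sum(evals > 1)                    # eigenvalue-greater-than-one rule
sum(evals)                        # total variance: 9 standardized features
100 * cumsum(evals) / sum(evals)  # cumulative percentage explained
```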
Next, a couple of plots will be produced to visualize the PCA solution. You will make a biplot, which includes both the position of each observation (car model) in terms of PC1 and PC2 and also will show you how the initial features map onto this. A biplot is a type of plot that will allow you to visualize how the observations relate to one another in the PCA (which observations are similar and which are different) and will simultaneously reveal how each feature contributes to each principal component.
22. Use the function `biplot()` with the argument `choices = c(1, 2)` to ask for a biplot for the first two principal components.
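For example:

```r
# Biplot of the car models and feature loadings for PC1 and PC2
biplot(mtcars.pca, choices = c(1, 2))
```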
In the biplot, the original features are shown as arrows originating from the center point. Here, you see that the variables `hp`, `cyl`, and `disp` all contribute to PC1, with higher values in those variables moving the observations to the right on this plot. This lets you see how the car models relate to the axes. You can also see which cars are similar to one another. For example, the Maserati Bora, Ferrari Dino and Ford Pantera L all cluster together at the top. This makes sense, as all of these are sports cars.
23. Make a biplot for the first and third principal components. In particular, which brand of car has negative values on the first principal component and positive values on the third principal component?
24. Use the function `screeplot()` with the argument `type = 'lines'` to produce a scree-plot. How many principal components should be extracted according to this plot? Why? Is this number in agreement with the number of principal components extracted according to the eigenvalue-greater-than-one rule?