Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
The data to be analyzed in this exercise can be found in the following file.
The data in this file constitute a contingency table of counts, the
classic 1949 Great Britain five-by-five son’s by father’s occupational
mobility table. Import the data into R. The warning message that might
show up when using the function read.table() can be ignored.
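A minimal import sketch, assuming a whitespace-separated text file without a header row. The temporary file and its counts are made up so that the snippet runs on its own; the real course file may have a different name and layout.

```r
# Stand-in for the course file: made-up counts, NOT the GB mobility data.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("10 20 30 40 50",
             "20 30 40 50 10",
             "30 40 50 10 20",
             "40 50 10 20 30",
             "50 10 20 30 40"), toy_file)

# Assumption: no header row, columns separated by whitespace.
# read.table() returns a data frame; as.matrix() turns it into a count matrix.
X <- as.matrix(read.table(toy_file, header = FALSE))

dim(X)  # rows = father's categories, columns = son's categories
```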
The rows of the data table correspond to five different categories of father’s occupation and the columns to the same five categories of son’s occupation. The cells on the main diagonal of the table refer to fathers and sons with the same occupational category. This group is important because it contains the sons who stayed in their father’s category; mobility shows up in the off-diagonal cells. The categories for both nominal variables are:
If the table is called X, then the row and column labels can be
assigned by executing
rownames(X) <- c('UN F','LN F','UM F','LM F','F F')
colnames(X) <- c('UN S','LN S','UM S','LM S','F S')
Obtain the correspondence table using the function prop.table(). Use
the function sum() to check whether the sum of all elements of the
correspondence table equals one. The matrix of row profiles can be
obtained by using the argument margin = 1 in the function prop.table()
and the matrix of column profiles by using the argument margin = 2. Use
the functions rowSums() and colSums() to check whether the sums of the
profiles are all equal to one. Install and load the R package ggpubr
and execute ggballoonplot(X, fill = 'value') to visualize the
correspondence table using a balloon plot. One of the R packages for
correspondence analysis is ca. Install and load this package.
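The profile computations described above can be sketched in base R as follows. The 3-by-3 count matrix is made up for illustration and is not the GB mobility table; the object names P, row_profiles and col_profiles are likewise only illustrative.

```r
# Made-up counts for illustration only (not the GB mobility data).
X <- matrix(c(20, 10,  5,
              10, 30, 10,
               5, 10, 40),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

# Correspondence table: every count divided by the grand total.
P <- prop.table(X)
sum(P)                                       # check: should equal 1

# Row profiles: every row divided by its own row total.
row_profiles <- prop.table(X, margin = 1)
rowSums(row_profiles)                        # check: each row sums to 1

# Column profiles: every column divided by its own column total.
col_profiles <- prop.table(X, margin = 2)
colSums(col_profiles)                        # check: each column sums to 1
```

An entry of row_profiles is a conditional proportion given the row category, and an entry of col_profiles is a conditional proportion given the column category.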
1. Apply a correspondence analysis to the GB mobility table. The
function to be used is ca().
2. Explore the arguments and values of the function ca() using ?ca.
Obtain the row and column standard coordinates.
3. Use the function summary() to determine the proportion of total
inertia explained by the first two extracted dimensions.
4. Use the function plot() to obtain a symmetric map.
5. Use the argument map = 'rowprincipal' to obtain an asymmetric map
with principal coordinates for rows and standard coordinates for
columns.
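To see what the quantities in items 2 to 5 mean, the following base R sketch computes a correspondence analysis by hand from the singular value decomposition of the standardized residuals. It is a sketch on made-up counts, not the implementation inside the ca package, and all object names are illustrative. The results should agree with the rowcoord, colcoord and sv components of a ca() fit up to the sign of each dimension.

```r
# Made-up counts for illustration only (not the GB mobility data).
X <- matrix(c(20, 10,  5,
              10, 30, 10,
               5, 10, 40),
            nrow = 3, byrow = TRUE)

P        <- X / sum(X)       # correspondence table
row_mass <- rowSums(P)       # row masses
col_mass <- colSums(P)       # column masses

# Standardized residuals: (P - expected) / sqrt(expected) under independence.
E <- row_mass %o% col_mass
S <- (P - E) / sqrt(E)

dec <- svd(S)
k   <- min(dim(X)) - 1       # maximum number of non-trivial dimensions
sv  <- dec$d[seq_len(k)]     # singular values

# Standard coordinates: singular vectors rescaled by the square root of the masses.
row_std <- dec$u[, seq_len(k), drop = FALSE] / sqrt(row_mass)
col_std <- dec$v[, seq_len(k), drop = FALSE] / sqrt(col_mass)

# Principal coordinates: standard coordinates scaled by the singular values.
row_pri <- sweep(row_std, 2, sv, `*`)
col_pri <- sweep(col_std, 2, sv, `*`)

total_inertia <- sum(sv^2)           # equals the chi-square statistic divided by n
explained     <- sv^2 / total_inertia

# With the ca package installed, the corresponding calls would be:
# fit <- ca::ca(X); fit$rowcoord; fit$colcoord; summary(fit)
# plot(fit)                         # symmetric map
# plot(fit, map = "rowprincipal")   # rows principal, columns standard
```

A symmetric map plots row_pri and col_pri together, whereas the 'rowprincipal' map plots row_pri together with col_std.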
For the lab exercises, you will use the file
This file contains a two-way contingency table that can be used to analyze the economic activity of the Polish population in relation to gender and level of education in the second quarter of 2011. The rows of the table refer to different levels of education, that is:
The columns refer to the levels:
Import the data into R and respond to the following items.
6. Give the rows 1 to 6 the labels E1 to E6, respectively. Give the columns 1 to 4 the labels A1F to A4F, and the columns 5 to 8 the labels A1M to A4M, respectively. Give a visualization of the correspondence matrix.
7. Give the proportion of full-time employed females with secondary level of education.
8. Give the matrices of row profiles and column profiles.
9. What is the conditional proportion of full-time employed females given tertiary level of education and what is the conditional proportion of full-time employed males given tertiary level of education?
10. What is the conditional proportion of females with the lowest level of education given economically inactive? What is the conditional proportion of males with the lowest level of education given economically inactive?
11. Apply a correspondence analysis to the data. How large is the total inertia?
12. Set the desired minimum proportion of explained inertia to 0.85. How many underlying dimensions are sufficient? What is the proportion of inertia explained by this number of dimensions?
13. Give the symmetric map for the final solution.
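For item 12, one way to pick the number of dimensions is to accumulate the principal inertias until the desired proportion is reached. The sketch below uses hypothetical principal inertias, not values from the lab data; with a fitted ca object called fit (an assumed name), they could be taken as fit$sv^2.

```r
# Hypothetical principal inertias (squared singular values), NOT from the lab data.
principal_inertias <- c(0.10, 0.05, 0.02, 0.01)
threshold <- 0.85

total_inertia <- sum(principal_inertias)
cum_explained <- cumsum(principal_inertias) / total_inertia

# Smallest number of dimensions whose cumulative proportion reaches the threshold.
n_dims <- which(cum_explained >= threshold)[1]

n_dims
cum_explained[n_dims]

# With the ca package, the solution could then be refitted with that many
# dimensions, e.g. ca::ca(X, nd = n_dims), and drawn with plot().
```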