Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
David J. Hessen
Load required packages.
install.packages("fastICA")
install.packages("BiocManager")
BiocManager::install("Biobase")
install.packages("NMF")
install.packages("svs")
library(fastICA)
library(NMF)
library(svs)
In this exercise, you will factor analyze the data of the places rated example given in the lecture slides. The data file is places.txt.
1. Import the data into R.
Note. Since the variable names are not in the first line, use the argument header=FALSE in the function read.table().
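A minimal sketch of this import, assuming places.txt is in the working directory:
places <- read.table("places.txt", header = FALSE)  # variable names are not in the file
head(places)                                        # quick check of the imported data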
2. Factor analyze the data using the R function factanal() with five factors, and include the argument scores='regression'.
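A possible call, assuming the imported data frame is named places as in the sketch above:
fa <- factanal(places, factors = 5, scores = "regression")  # five-factor model with regression scores
fa$loadings                                                 # inspect the loadings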
3. Calculate the correlations between the factor scores. Are the correlations as expected? Why or why not?
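A one-line sketch for (3.), using the fit object fa from above:
round(cor(fa$scores), 3)  # correlations between the estimated factor scores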
The R package fastICA can be used for independent component analysis (ICA).
4. Fit an ICA model with three components on the places data using the R function fastICA().
Note. The centered data can be obtained using $X, the matrix of weights (loadings) can be obtained using $A, and the independent component scores can be obtained using $S.
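A minimal sketch, assuming the places data from above; fastICA() uses random starting values, so a seed is set for reproducibility:
set.seed(123)
ica <- fastICA(as.matrix(places), n.comp = 3)  # ICA with three components
dim(ica$S)                                     # independent component scores
dim(ica$A)                                     # matrix of weights (loadings)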
5. Obtain the proportion of variance explained by the solution with three components.
Hint: The proportion of variance explained can be obtained by dividing the total variance of the reconstructed data (the component scores times the weights) by the total variance of the centered data, where total variance is simply the sum of the diagonal elements of the variance-covariance matrix.
x <- c(ica$X)                    # vectorized centered data
e <- c(ica$S %*% ica$A)          # vectorized reconstruction from the three components
(2*crossprod(x, e) - crossprod(e))/crossprod(x)  # proportion of variance explained
6. Calculate the correlations between the original features and the components. Are they as you expect?
Hint: Recall that the independent component scores are stored under $S.
7. Now calculate the correlations among the components. Are they as you expect?
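A sketch for (6.) and (7.), using the objects places and ica from the previous steps:
round(cor(places, ica$S), 2)  # correlations between the original features and the components
round(cor(ica$S), 3)          # correlations among the components themselves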
One arena in which an understanding of the hidden components that drive the system is crucial (and potentially lucrative) is financial markets. It is widely believed that the stock market, and indeed individual stock prices, are determined by fluctuations in underlying but unknown factors (signals). If one could determine and predict these components, one would have the opportunity to leverage this knowledge for financial gain. In this exercise, non-negative matrix factorization (NMF) is used to learn the components which drive the stock market; specifically, the closing prices over 5 recent years for 505 companies currently found on the S&P 500 index are used.
8. Load the stocks data into R with the function read.table().
Hint: Make sure to set header=TRUE and sep = "\t".
9. Remove the first column with the company names. Also remove the observations with missing values, and store the data in a matrix.
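A sketch of (8.) and (9.); the file name stocks.txt is an assumption, adjust it to the file you were given:
stocks <- read.table("stocks.txt", header = TRUE, sep = "\t")  # file name assumed
stocks <- stocks[complete.cases(stocks), -1]                   # drop company names, keep complete cases
X <- as.matrix(stocks)                                         # store the data as a matrix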
10. Run the function nmf() with 1, 2 and 3 components.
11. Calculate the proportion of variance explained by each solution using the function evar().
12. Calculate the correlations between the features according to the model of your choice. Are they highly correlated?
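One hedged way to inspect (12.), assuming the list fits from the sketch above and, for illustration, the two-component solution:
Xhat <- fitted(fits[[2]])        # model-based reconstruction of the data matrix
round(cor(Xhat)[1:5, 1:5], 2)    # corner of the model-implied feature correlation matrix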
In this exercise, probabilistic latent semantic analysis is applied to the data file author.txt. The data set consists of letter counts in 12 samples of texts from books by six different authors: documents 1 and 2 are by Pearl S. Buck, documents 3 and 4 by James Michener, documents 5 and 6 by Arthur C. Clarke, documents 7 and 8 by Ernest Hemingway, documents 9 and 10 by William Faulkner, and documents 11 and 12 by Victoria Holt.
13. Read the author.txt data into R using the function read.table(), remove the first column with document indices, and store the data as a matrix.
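A minimal sketch of this step; whether the file has a header row is an assumption, so check the file:
author <- read.table("author.txt", header = TRUE)  # header assumed
author <- as.matrix(author[, -1])                  # drop the document indices, store as a matrix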
14. Run the function fast_plsa() with 4, 5 and 6 components. Set the argument symmetric=TRUE.
Note. Convergence may take a while for 5 and 6 latent classes, so to speed up the process you can set tol = 1e-6.
The function fast_plsa() returns a list with three elements: prob0, prob1 and prob2.
15. Test how many classes are needed by calculating Pearson’s X²-statistic for each solution. Which solution is the best?
Pearson’s X²-statistic requires the estimated multinomial probabilities, which can be obtained as follows.
P <- lapply(plsa, function(x) x$prob1 %*% diag(x$prob0) %*% t(x$prob2))
Note. plsa is a list with three elements: the output of fast_plsa() for 4, 5, and 6 components.
From these probabilities, we can compute Pearson’s X²-statistic or the likelihood-ratio statistic G².
Xsq <- sapply(P, function(x) sum((author - sum(author)*x)^2 / (sum(author)*x)))
Gsq <- sapply(P, function(x) 2*sum(log((author/(sum(author)*x))^author)))
Lastly, we need the degrees of freedom, which is the number of cells minus the number of parameters estimated. The number of cells is the number of rows times the number of columns, and the number of parameters is equal to the number of components, times the sum of the number of rows and the number of columns minus one.
N <- nrow(author)
p <- ncol(author)
r <- 4:6                  # numbers of latent classes in the three fitted solutions
df <- N*p - r*(N+p-1)
Accordingly, we can calculate the p-values as follows.
1-pchisq(Xsq, df)
[1] 0.000000e+00 2.220446e-16 1.410538e-12
16. How many classes do you select, based on these results?
17. Given this number of latent classes, for which latent class does each document have the highest probability?
18. Which letter occurs the most for almost all classes?
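A hedged sketch for (17.) and (18.), assuming the 6-class solution plsa[[3]] is retained; as the reconstruction formula above suggests, prob1 holds the document probabilities per class and prob2 the letter probabilities per class:
apply(plsa[[3]]$prob1, 1, which.max)  # most probable latent class per document (17.)
apply(plsa[[3]]$prob2, 2, which.max)  # index of the most probable letter within each class (18.)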
19. Determine, for the finally selected number of classes, the proportion of explained variance by executing the following lines.
x <- c(author)                   # vectorized observed letter counts
e <- c(sum(author) * P[[3]])     # vectorized expected counts under the selected solution (here P[[3]])
(2 * crossprod(x, e) - crossprod(e))/crossprod(x)  # proportion of explained variance
Macroeconomic variables, both real and financial, have considerable influence, positive as well as negative, on the performance of the corporate sector of the economy. Consequently, the stock market is affected by that performance. The movement of stock prices, apart from firms’ fundamentals, also depends on the level of development achieved in the economy and its integration into the world economy.
Since macroeconomic variables are highly interdependent, using all of them as explanatory variables for the stock market may pose a severe multicollinearity problem, and it becomes difficult to delineate the separate effects of the different variables on stock market movements. Deriving basic factors from such macroeconomic variables and employing these factors in pricing models can provide valuable information about the contents of priced factors in different stock markets. Generating orthogonal factor realizations eliminates the multicollinearity problem in estimating factor regression coefficients and serves to find the factors that are rewarded by the market. In this assignment, such factors will be extracted from twelve macroeconomic variables in India. The variables are:
The standardized observations in the data file IndianSM.txt
are based
on monthly averages, for 149 months.
20. Read the data into R, apply factor analysis and determine how many common factors are required to explain at least … of the total variance.
Note. Set the number of starting values to 200.
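A possible starting point, assuming IndianSM.txt has the variable names in its first line; the number of factors shown is only an example to be varied:
indian <- read.table("IndianSM.txt", header = TRUE)
fa3 <- factanal(indian, factors = 3, nstart = 200)  # try increasing numbers of factors
fa3                                                 # the 'Cumulative Var' row shows explained variance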
21. Does the factor model with the number of factors chosen in (20.) fit the data? Why or why not?
22. Give the correlations between the ‘regression’ factor scores, given the number of common factors selected in (20.).
23. Carry out an independent component analysis and determine the number of independent components. How many independent components do you select? Why?
24. Give the correlations between the features (macro-economic variables) and the independent components. Use these correlations to interpret the independent components.
In this part of the assignment, non-negative matrix factorization (NMF) is used to once again learn the components which drive the stock market. Now, instead of the closing prices, the numbers of shares traded for 5 recent years for 505 companies currently found on the S&P 500 index, are used.
The data for this exercise can be found in the file volume_stocks.dat. The first feature in this data file gives an abbreviation of the company name. The remaining features give the numbers of shares traded on 1259 days within 5 recent years.
Import the data into R. Be aware that the data file is tab-delimited and that the first line in the data file contains the feature names. The first feature (the company name abbreviation) is not relevant for the analysis and should be removed. Certain companies have missing values; remove the companies with missing values and only use the complete cases (companies). Next, store the data in a matrix.
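A sketch of the import and preprocessing described above:
volume <- read.table("volume_stocks.dat", header = TRUE, sep = "\t")
volume <- volume[complete.cases(volume), -1]  # drop the company names, keep complete cases
V <- as.matrix(volume)                        # store the data as a matrix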
25. Load in the data, apply non-negative matrix factorization and determine the dimension using the proportion of explained variance.
26. How many dimensions do you select and why? What is the proportion of variance explained for this number of dimensions?
27. Give the reconstruction error for this number of dimensions.
28. Calculate the correlation matrix of the finally selected dimensions.
In this part of the assignment, probabilistic latent semantic analysis
is applied to the data file benthos.txt
. The data set consists of
abundances of 10 marine species near an oilfield in the North Sea at 13
sites (the columns in the data file). The first 11 columns give the data
for polluted sites. The last two columns give the data for unpolluted
reference sites. Import the data into R and store the data as a matrix.
29. Carry out a probabilistic latent semantic analysis and determine the number of latent classes based on the proportion of explained variance. How many classes do you select? Why?
30. Produce the three matrices prob0, prob1 and prob2 for the number of latent classes selected, and the corresponding matrix of estimated multinomial probabilities.