Dimension reduction 2

Author

David J. Hessen

Take home exercises

Load required packages.

install.packages("fastICA")
install.packages("BiocManager")
BiocManager::install("Biobase")
install.packages("NMF")
install.packages("svs")
library(fastICA)
library(NMF)
library(svs)

Factor analysis

In this exercise, you will factor analyze the data of the places rated example given in the lecture slides. The data file is places.txt.

1. Import the data into R.

Note. Since the variable names are not in the first line, use the argument header=FALSE in the function read.table().

places <- read.table("data/places.txt", header = FALSE)

2. Factor analyze the data using the R-function factanal() with five factors, and include the argument scores='regression'.

fa <- factanal(places, 5, scores = "regression")
fa

Call:
factanal(x = places, factors = 5, scores = "regression")

Uniquenesses:
   V1    V2    V3    V4    V5    V6    V7    V8    V9 
0.737 0.005 0.005 0.411 0.005 0.691 0.206 0.646 0.005 

Loadings:
   Factor1 Factor2 Factor3 Factor4 Factor5
V1          0.450  -0.150           0.183 
V2  0.250   0.926   0.239   0.122         
V3  0.939   0.240           0.105   0.212 
V4  0.125   0.102   0.130   0.104   0.731 
V5  0.332                   0.916   0.191 
V6  0.509           0.100   0.191         
V7  0.753   0.291           0.131   0.351 
V8  0.146   0.380           0.232   0.355 
V9                  0.982           0.166 

               Factor1 Factor2 Factor3 Factor4 Factor5
SS loadings      1.925   1.368   1.081   0.985   0.931
Proportion Var   0.214   0.152   0.120   0.109   0.103
Cumulative Var   0.214   0.366   0.486   0.595   0.699

Test of the hypothesis that 5 factors are sufficient.
The chi square statistic is 5.59 on 1 degree of freedom.
The p-value is 0.018 

3. Calculate the correlations between the factor scores. Are the correlations as expected? Why or why not?

cor(fa$scores)
             Factor1      Factor2      Factor3      Factor4     Factor5
Factor1  1.000000000  0.020154372 -0.015261872 -0.009825126  0.11800720
Factor2  0.020154372  1.000000000  0.011580702  0.008773717 -0.07897472
Factor3 -0.015261872  0.011580702  1.000000000 -0.007633115  0.07038196
Factor4 -0.009825126  0.008773717 -0.007633115  1.000000000  0.05945721
Factor5  0.118007196 -0.078974722  0.070381961  0.059457205  1.00000000
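
# The common factor model assumes orthogonal (uncorrelated) factors, so one
# might expect the correlations to be exactly zero. However, the 'regression'
# factor scores are only estimates of the factors, and such estimates are in
# general not exactly uncorrelated; hence the small but nonzero correlations.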

Independent component analysis

The R package fastICA can be used for independent component analysis (ICA).

4. Fit an ICA model with three components on the places data using the R-function fastICA().

Note. The centered data can be obtained using $X, the matrix of weights (loadings) can be obtained using $A, and the independent component scores can be obtained using $S.

ica <- fastICA(places, n.comp = 3)

5. Obtain the proportion of variance explained by the solution with three components.

Hint: The proportion of variance explained can be obtained by dividing the total variance of \(SA\) by the total variance of \(X\), where total variance is simply the sum of the diagonal elements of the variance-covariance matrix.

x <- c(ica$X)            # vectorized centered data
e <- c(ica$S %*% ica$A)  # vectorized reconstruction from the three components

(2*crossprod(x, e) - crossprod(e))/crossprod(x)
        [,1]
[1,] 0.93936
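
Note. This expression equals \(1 - (\boldsymbol{x}-\boldsymbol{e})^\top(\boldsymbol{x}-\boldsymbol{e})/\boldsymbol{x}^\top\boldsymbol{x}\), i.e. one minus the residual sum of squares as a fraction of the total sum of squares of the centered data, which is the proportion of variance explained.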

6. Calculate the correlations between the original features and the components. Are they as you expect?

Hint: Recall that the independent component scores are stored under $S.

cor(places, ica$S)
         [,1]       [,2]        [,3]
V1 0.03009906 0.34777028 -0.15565434
V2 0.19632006 0.95250721 -0.21795787
V3 0.25030341 0.23846593 -0.80946949
V4 0.21989379 0.04020787 -0.35184594
V5 0.95421909 0.01878782 -0.28712049
V6 0.25574122 0.08444300 -0.32483842
V7 0.19451303 0.21210067 -0.95750475
V8 0.36737596 0.34076873 -0.24547347
V9 0.08332856 0.34442761  0.02014598
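
# Each feature correlates strongly with at least one component (in this run,
# e.g., V2 with component 2, V5 with component 1, and V7 with component 3),
# which is what we expect if the components capture distinct groups of
# features. Keep in mind that the sign and order of ICA components vary
# between runs.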

7. Now calculate the correlations among the components. Are they as you expect?

cor(ica$S)
              [,1]         [,2]          [,3]
[1,]  1.000000e+00 5.331744e-16 -2.461807e-15
[2,]  5.331744e-16 1.000000e+00  6.155032e-16
[3,] -2.461807e-15 6.155032e-16  1.000000e+00
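
# The correlations are zero up to numerical precision. This is expected:
# fastICA whitens the data first, and the estimated components are
# uncorrelated by construction.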

Non-negative matrix factorization

One arena in which an understanding of the hidden components that drive a system is crucial (and potentially lucrative) is financial markets. It is widely believed that the stock market, and indeed individual stock prices, are determined by fluctuations in underlying but unknown factors (signals). If one could determine and predict these components, one could leverage this knowledge for financial gain. In this exercise, non-negative matrix factorization (NMF) is used to learn the components that drive the stock market; specifically, the closing prices over 5 recent years for 505 companies currently on the S&P 500 index are used.

8. Load the stocks data into R with the function read.table().

Hint: Make sure to set header=TRUE and sep = "\t".

stock <- read.table("data/close_stocks.dat", header = TRUE, sep = "\t")

9. Remove the first column with the company names. Also remove the observations with missing values, and store the data in a matrix.

stock <- stock[, -1]
stock <- na.omit(stock)
stock <- as.matrix(stock)

10. Run the function nmf() with 1, 2 and 3 components.

# Fit NMF models with 1, 2 and 3 components; a parallel cluster speeds this up
cl <- parallel::makeCluster(4)
parallel::clusterExport(cl, varlist = c("stock"))
NMF <- pbapply::pblapply(1:3, function(i) NMF::nmf(stock, i), cl = cl)
parallel::stopCluster(cl)

11. Calculate the proportion of variance explained by each solution using the function evar().

sapply(NMF, function(x) evar(x, stock))
[1] 0.9670928 0.9876895 0.9930594

12. Calculate the correlations between the features according to the model of your choice. Are they highly correlated?

cor(NMF[[3]]@fit@W)
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.6972977 0.7880911
[2,] 0.6972977 1.0000000 0.7758822
[3,] 0.7880911 0.7758822 1.0000000
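
# Yes, the dimensions are highly correlated (roughly 0.70 to 0.79). Unlike
# factor analysis and ICA, NMF imposes no orthogonality or independence
# constraints, so correlated dimensions are to be expected.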

Probabilistic latent semantic analysis

In this exercise, probabilistic latent semantic analysis is applied to the data file author.txt. The data set consists of letter counts in 12 samples of texts from books by six different authors: documents 1 and 2 are by Pearl S. Buck, documents 3 and 4 by James Michener, documents 5 and 6 by Arthur C. Clarke, documents 7 and 8 by Ernest Hemingway, documents 9 and 10 by William Faulkner, and documents 11 and 12 by Victoria Holt.

13. Read the author.txt data into R using the function read.table(), remove the first column with document indices, and store the data as a matrix.

author <- read.table("data/author.txt", header = TRUE)
author <- as.matrix(author[, -1])

14. Run the function fast_plsa() with 4, 5 and 6 components. Set the argument symmetric=TRUE.

Note. Convergence may take a while for 5 and 6 latent classes, so to speed up the process you can set tol = 1e-6.

r <- c(4,5,6)
plsa <- lapply(r, function(i) svs::fast_plsa(dat = author, k = i, symmetric = TRUE, tol = 1e-6))

The function fast_plsa() returns a list with three elements: prob0, the vector of class probabilities forming the diagonal of \(\boldsymbol{\Sigma}\); prob1 \(= \boldsymbol{U}\); and prob2 \(= \boldsymbol{V}\), so that \(\boldsymbol{V}^\top\) is obtained with t(prob2).

15. Test how many classes are needed by calculating Pearson’s \(\chi^2\)-statistic for each solution. Which solution is the best?

Pearson’s \(\chi^2\)-statistic requires the estimated multinomial probabilities \(\hat{\boldsymbol{P}}=\boldsymbol{U\Sigma V^\top}\), which can be obtained as follows.

P <- lapply(plsa, function(x) x$prob1 %*% diag(x$prob0) %*% t(x$prob2))

Note. plsa is a list with three elements: the output of fast_plsa() for 4, 5, and 6 components.

From these probabilities, we can compute Pearson’s \(\chi^2\)-statistic or the likelihood-ratio statistic.

Xsq <- sapply(P, function(x) sum((author - sum(author)*x)^2 / (sum(author)*x)))
Gsq <- sapply(P, function(x) 2*sum(log((author/(sum(author)*x))^author)))

Lastly, we need the degrees of freedom, which is the number of cells minus the number of estimated parameters. The number of cells is the number of rows times the number of columns, and the number of parameters equals the number of components times the number of rows plus the number of columns minus one, i.e. \(df = Np - k(N + p - 1)\).

N <- nrow(author)
p <- ncol(author)
df <- N*p - r*(N+p-1)

Accordingly, we can calculate the \(p\)-value as follows.

1-pchisq(Xsq, df)
[1] 0.000000e+00 2.220446e-16 1.365907e-12

16. How many classes do you select, based on these results?
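
# Strictly speaking, none of the solutions fits the data (all p-values are
# essentially zero), but the misfit is smallest for six latent classes. Since
# the documents were written by six authors, six classes are selected; the
# questions below therefore use plsa[[3]].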

17. Given this number of latent classes, for which latent class has document \(2\) the highest probability?

plsa[[3]]$prob1[2, ]
         1          2          3          4          5          6 
0.16813171 0.10931625 0.02214574 0.02356608 0.04860496 0.11562740 
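
# Document 2 has the highest probability for latent class 1.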

18. Which letter occurs the most for almost all classes?

apply(plsa[[3]]$prob2, 2, which.max)
 1  2  3  4  5  6 
 5  5  5 20  5  5 
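
# Letter 5 has the highest probability in all classes except class 4. If the
# columns are the letters a-z in alphabetical order, this is the letter 'e',
# the most frequent letter in English text.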

19. Determine the proportion of explained variance for the finally selected number of classes by executing the following lines.

x <- c(author)
e <- c(sum(author) * P[[3]])
(2 * crossprod(x, e) - crossprod(e))/crossprod(x)
          [,1]
[1,] 0.9985299

Lab exercises

Factor analysis and independent component analysis

Macroeconomic variables, both real and financial, have considerable influence, positive as well as negative, on the performance of the corporate sector of the economy. Consequently, the stock market is affected by such performance. The movement of stock prices, apart from firms’ fundamentals, also depends on the level of development achieved in the economy and its integration with the world economy.

Since macroeconomic variables are highly interdependent, using all of them as explanatory variables for the stock market may pose a severe multicollinearity problem, and it becomes difficult to delineate the separate effects of different variables on stock market movement. Deriving basic factors from such macroeconomic variables and employing these factors in pricing models can provide valuable information about the contents of priced factors in different stock markets. Generating orthogonal factor realizations eliminates the multicollinearity problem in estimating factor regression coefficients and serves to find the factors that are rewarded by the market. In this assignment, such factors will be extracted from twelve macroeconomic variables in India. The variables are:

  1. Money Supply (MS),
  2. Consumer Price Index (CPI),
  3. Gold Prices (GP),
  4. Crude Oil Prices (COP),
  5. Foreign Exchange Reserves (FER),
  6. Foreign Direct Investment (FDI),
  7. Foreign Institutional Investment (FII),
  8. Call Money Rate (CMR),
  9. Balance of Trade (BOT),
  10. Foreign Exchange Rate (ER),
  11. Repo Rate (Repo),
  12. Industrial Growth Rate (IGR).

The standardized observations in the data file IndianSM.txt are based on monthly averages for 149 months.

20. Read the data into R, apply factor analysis and determine how many common factors are required to explain at least \(80\%\) of the total variance.

Note. Set the number of starting values to 200.

ism <- read.table("data/IndianSM.txt", header = TRUE)
fa <- factanal(ism, 3, scores = "regression", 
               control = list(nstart = 200))
fa

Call:
factanal(x = ism, factors = 3, scores = "regression", control = list(nstart = 200))

Uniquenesses:
   MS   CPI    GP   COP   FER   FDI   FII   CMR   BOT    ER  Repo   IGR 
0.005 0.016 0.020 0.071 0.075 0.469 0.618 0.195 0.139 0.119 0.091 0.440 

Loadings:
     Factor1 Factor2 Factor3
MS    0.942   0.328         
CPI   0.919   0.372         
GP    0.896   0.408         
COP   0.936  -0.153   0.169 
FER   0.958                 
FDI   0.667  -0.284         
FII   0.596          -0.162 
CMR   0.300   0.163   0.830 
BOT  -0.849  -0.319  -0.196 
ER    0.157   0.910   0.168 
Repo -0.210           0.928 
IGR          -0.746         

               Factor1 Factor2 Factor3
SS loadings      6.012   2.037   1.694
Proportion Var   0.501   0.170   0.141
Cumulative Var   0.501   0.671   0.812

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 159.32 on 33 degrees of freedom.
The p-value is 1.78e-18 
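
One way to arrive at this number is to fit solutions with an increasing number of factors and inspect the cumulative proportion of variance explained, which equals the sum of all squared loadings divided by the number of variables. A minimal sketch (cumvar is an illustrative helper name; it assumes ism as read above):

# cumulative proportion of variance explained for 1 to 4 factors
cumvar <- sapply(1:4, function(k) {
  f <- factanal(ism, k, control = list(nstart = 200))
  sum(loadings(f)^2) / ncol(ism)
})
# for these data the three-factor solution reaches 0.812 (see output above)
cumvar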

21. Does the factor model with the number of factors chosen in (20.) fit the data? Why or why not?

# Although we explain more than 80% of the variance, the chi-square test
# reported by factanal() indicates that there is still substantial misfit,
# and that we should consider using more factors to describe the data.

22. Give the correlations between the ‘regression’ factor scores for the number of common factors selected in (20.).

cor(fa$scores)
             Factor1   Factor2      Factor3
Factor1  1.000000000 0.0253458 -0.002030194
Factor2  0.025345795 1.0000000  0.013662897
Factor3 -0.002030194 0.0136629  1.000000000

23. Carry out an independent component analysis and determine the number of independent components. How many independent components do you select? Why?

# Fit ICA solutions with 1 to 8 components
ica <- lapply(1:8, function(i) fastICA(ism, i))
# Proportion of variance explained by each solution
pve <- sapply(ica, function(i) {
  obs <- c(unlist(ism))
  fit <- c(i$S %*% i$A)
  (2*crossprod(obs, fit) - crossprod(fit))/crossprod(obs)
})

# The first three components explain a substantial proportion of the variance,
# but after that the added value of including more components is rather small,
# so three components are selected.
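
A scree-style plot of these proportions makes this elbow easier to see; a minimal sketch using base graphics, assuming pve as computed above:

plot(1:8, c(pve), type = "b",
     xlab = "Number of independent components",
     ylab = "Proportion of variance explained")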

24. Give the correlations between the features (macro-economic variables) and the independent components. Use these correlations to interpret the independent components.

cor(ism, ica[[3]]$S)
            [,1]       [,2]        [,3]
MS   -0.04801991 -0.9828971 -0.08066265
CPI  -0.01078008 -0.9780558 -0.09792728
GP    0.02657816 -0.9650020 -0.17568779
COP  -0.48020447 -0.8096347 -0.14291747
FER  -0.27995648 -0.9243639  0.03538658
FDI  -0.61845645 -0.5296874 -0.09786166
FII  -0.18952846 -0.6328702  0.28332579
CMR  -0.10760274 -0.2702622 -0.90384550
BOT   0.05022472  0.9153689  0.23547395
ER    0.75334499 -0.4768387 -0.32347373
Repo -0.00523520  0.2297328 -0.92043276
IGR  -0.78983250  0.3320805  0.19977293
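
# In this run, component 2 correlates strongly with the money/price-level
# variables (MS, CPI, GP, FER, COP, and BOT with opposite sign); component 3
# correlates strongly with the interest rates (CMR and Repo); and component 1
# correlates most strongly with ER, IGR and FDI. Signs are arbitrary in ICA,
# so only the pattern of large absolute correlations is interpretable.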

Non-negative matrix factorization

In this part of the assignment, non-negative matrix factorization (NMF) is used to once again learn the components which drive the stock market. Now, instead of the closing prices, the numbers of shares traded for 5 recent years for 505 companies currently found on the S&P 500 index, are used.

The data for this exercise can be found in the file volume_stocks.dat. The first feature in this data file gives an abbreviation of the company name. The remaining features are the numbers of shares traded on 1259 days within 5 recent years.

Import the data into R. Be aware that the data file is tab-delimited and that the first line in the data file contains the feature names. The feature name (the first column) is not relevant for the analysis and should be removed. Certain companies have missing values. Remove the companies with missing values and only use the complete cases (companies). Next, store the data in a matrix.

25. Load in the data, apply non-negative matrix factorization and determine the dimension using the proportion of explained variance.

vs <- read.table("data/volume_stocks.dat", header = TRUE, sep = "\t")
vs <- na.omit(vs[,-1]) |> as.matrix()

cl <- parallel::makeCluster(10)
parallel::clusterExport(cl, varlist = c("vs"))
NMF <- pbapply::pblapply(1:10, function(i) NMF::nmf(vs, i), cl = cl)
parallel::stopCluster(cl)

sapply(NMF, function(x) evar(x, vs))
 [1] 0.7759571 0.8126387 0.8302604 0.8481792 0.8670004 0.8779494 0.8821306
 [8] 0.8922032 0.8954649 0.9018260

26. How many dimensions do you select and why? What is the proportion of variance explained for this number of dimensions?

# I would select a single dimension only, as any additional component
# contributes very little to the overall proportion of variance explained,
# but for the sake of the later questions, let's continue with two
# components. Note that you might also treat the number of components as a
# hyperparameter in a later model selection procedure. That is, if your
# ultimate goal is to run a prediction model, you might do cross-validation
# with differing numbers of components, and choose the number that minimizes
# the prediction error.

mod <- NMF[[2]]
evar(mod, vs)
[1] 0.8126387

27. Give the reconstruction error for this number of dimensions.

sum((vs - mod@fit@W %*% mod@fit@H)^2)
[1] 1.065775e+19
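
Note. The NMF package also provides fitted(), which returns the reconstruction \(\boldsymbol{WH}\), so sum((vs - fitted(mod))^2) should give the same reconstruction error.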

28. Calculate the correlation matrix of the finally selected dimensions.

cor(mod@fit@W)
          [,1]      [,2]
[1,] 1.0000000 0.5671059
[2,] 0.5671059 1.0000000

Probabilistic latent semantic analysis

In this part of the assignment, probabilistic latent semantic analysis is applied to the data file benthos.txt. The data set consists of abundances of 10 marine species near an oilfield in the North Sea at 13 sites (the columns in the data file). The first 11 columns give the data for polluted sites. The last two columns give the data for unpolluted reference sites. Import the data into R and store the data as a matrix.

29. Carry out a probabilistic latent semantic analysis and determine the number of latent classes based on the proportion explained variance. How many classes do you select? Why?

benthos <- read.delim("data/benthos.txt", header = FALSE)
benthos <- as.matrix(benthos)

cl <- parallel::makeCluster(10)
parallel::clusterExport(cl, varlist = c("benthos"))
plsa <- pbapply::pblapply(1:12, function(i) svs::fast_plsa(benthos, i, symmetric = TRUE), cl=cl)
parallel::stopCluster(cl)

sapply(plsa, function(k) {
  x <- c(benthos)
  xhat <- sum(x) * c(k$prob1 %*% diag(k$prob0) %*% t(k$prob2))
  (2*crossprod(x, xhat) - crossprod(xhat))/crossprod(x)
})
 [1] 0.8978113 0.9939865 0.9969436 0.9993193 0.9994517 0.9996007 0.9997787
 [8] 0.9999402 0.9999665 0.9999962 0.9999997 0.9999997
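
# Two latent classes already explain about 99.4% of the variance, and adding
# further classes yields only marginal improvements, so two classes are
# selected. This is also in line with the polluted/unpolluted contrast in the
# data; question 30 below therefore uses plsa[[2]].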

30. Produce the three matrices \(\boldsymbol{U}\), \(\boldsymbol{\Sigma}\) and \(\boldsymbol{V}^\top\) for the number of latent classes selected and the corresponding matrix of estimated multinomial probabilities.

plsa[[2]]$prob1
                  1           2
 [1,]  8.808095e-01 0.048201813
 [2,]  7.706206e-02 0.311075639
 [3,]  2.152352e-03 0.267695164
 [4,]  8.018183e-04 0.145842139
 [5,]  7.229185e-03 0.088942595
 [6,]  3.864640e-03 0.056033508
 [7,]  1.960340e-02 0.009997768
 [8,]  6.961015e-03 0.024906913
 [9,]  1.516034e-03 0.026739617
[10,] 1.850665e-142 0.020564843
diag(plsa[[2]]$prob0)
          [,1]      [,2]
[1,] 0.6899232 0.0000000
[2,] 0.0000000 0.3100768
t(plsa[[2]]$prob2)
          V1         V2         V3         V4         V5         V6         V7
1 0.07196772 0.02614088 0.05590012 0.02376826 0.04990964 0.11548076 0.04405299
2 0.06388115 0.09313481 0.06437767 0.10135175 0.08137882 0.05740269 0.07164169
          V8         V9        V10          V11          V12         V13
1 0.04981460 0.09711325 0.09468017 3.690435e-01 7.732571e-31 0.002128124
2 0.08599704 0.05567194 0.14701724 7.642036e-08 7.638370e-02 0.101761413
plsa[[2]]$prob1 %*% diag(plsa[[2]]$prob0) %*% t(plsa[[2]]$prob2)
                V1           V2           V3           V4           V5
 [1,] 0.0446889139 0.0172775959 0.0349322004 0.0159585853 0.0315459455
 [2,] 0.0099881057 0.0103733643 0.0091817341 0.0110398038 0.0105031247
 [3,] 0.0054093907 0.0077695704 0.0054267452 0.0084481034 0.0068290481
 [4,] 0.0029286618 0.0042262272 0.0029422271 0.0045965035 0.0037077437
 [5,] 0.0021207251 0.0026989476 0.0020542802 0.0029137293 0.0024932775
 [6,] 0.0013018029 0.0016878878 0.0012675887 0.0018243286 0.0015470060
 [7,] 0.0011713882 0.0006422762 0.0009556157 0.0006356599 0.0009273003
 [8,] 0.0008389873 0.0008448285 0.0007656563 0.0008968938 0.0008681874
 [9,] 0.0006049344 0.0007995538 0.0005922453 0.0008652016 0.0007269418
[10,] 0.0004073496 0.0005938908 0.0004105158 0.0006462877 0.0005189267
                V6           V7           V8           V9          V10
 [1,] 0.0710345705 0.0278413813 0.0315572144 0.0598469320 0.0597336388
 [2,] 0.0116766646 0.0092525269 0.0109435328 0.0105331780 0.0192147413
 [3,] 0.0049362541 0.0060121101 0.0072122472 0.0047653170 0.0123439160
 [4,] 0.0026597622 0.0032641686 0.0039165375 0.0025713333 0.0067008278
 [5,] 0.0021590803 0.0021955263 0.0026201696 0.0020197386 0.0045268182
 [6,] 0.0013052607 0.0013622107 0.0016269927 0.0012262162 0.0028068249
 [7,] 0.0017398114 0.0008179043 0.0009403317 0.0014860283 0.0017362986
 [8,] 0.0009979281 0.0007648601 0.0009033978 0.0008963503 0.0015901299
 [9,] 0.0005967316 0.0006400822 0.0007651336 0.0005631700 0.0013179995
[10,] 0.0003660385 0.0004568361 0.0005483756 0.0003550021 0.0009374818
               V11          V12          V13
 [1,] 2.242644e-01 0.0011416508 0.0028141945
 [2,] 1.962091e-02 0.0073677674 0.0099287793
 [3,] 5.480202e-04 0.0063403091 0.0084499728
 [4,] 2.041558e-04 0.0034542433 0.0046030580
 [5,] 1.840637e-03 0.0021065885 0.0028170954
 [6,] 9.839837e-04 0.0013271430 0.0017737471
 [7,] 4.991254e-03 0.0002367952 0.0003442506
 [8,] 1.772356e-03 0.0005899155 0.0007961294
 [9,] 3.860006e-04 0.0006333228 0.0008459637
[10,] 4.873082e-10 0.0004870744 0.0006488999