Materials for the Applied Data Science profile course INFOMDA2: Battling the curse of dimensionality.
In this practical, we will deal with the curse of dimensionality by applying the “bet on sparsity” principle. We will use the following packages in the process:
library(tidyverse)
library(glmnet)
Create a practical folder with an .Rproj file (e.g., practical_01.Rproj) and a data folder inside. Download the prepared files below and put them in the data folder in your project directory.
The data file we will be working with contains gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at, and AFFX-r2-P1-cre-3_at. (NB: these IDs are specific to this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”.) The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?
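A minimal sketch of one possible answer, assuming the file was saved as data/gene_expressions.rds (the object name gene_df is our own choice):

gene_df <- read_rds("data/gene_expressions.rds")
dim(gene_df)   # number of rows and columns
nrow(gene_df)  # sample size: one row per tissue sample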
2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.
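One possible approach, assuming the first column of gene_df is the sample identifier, so the first six gene expression variables sit in columns 2 to 7:

gene_df %>%
  select(2:7) %>%                       # first 6 gene expression columns
  pivot_longer(everything(), names_to = "gene", values_to = "expression") %>%
  ggplot(aes(x = expression)) +
  geom_histogram(bins = 30) +
  facet_wrap(vars(gene))                # one histogram per gene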
3. We now only have the gene expression data, but the labels are in the file phenotypes.rds. Load that file, select() the relevant columns for classification into normal and tumor tissue, and join() it with the gene expression data, based on the tissue identifier in the sample column. Give the resulting dataset a good name!
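A sketch of the loading and joining step; the label column name disease is a placeholder assumption here, so check names(phenotypes) for the actual column names:

phenotypes <- read_rds("data/phenotypes.rds")
cancer_df <- phenotypes %>%
  select(sample, disease) %>%            # disease: assumed label column name
  inner_join(gene_df, by = "sample")     # keep samples present in both files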
4. Does this dataset suffer from class imbalance?
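A quick check, reusing the (assumed) disease column from exercise 3:

cancer_df %>% count(disease)             # absolute counts per class
prop.table(table(cancer_df$disease))     # relative class proportions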
5. Split the data into a training (80%) and a test set (20%). We will use the training set for model development in the next section.
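A minimal sketch using a random row split; the seed and the names train_df and test_df are our own choices:

set.seed(45)
train_idx <- sample(nrow(cancer_df), size = round(0.8 * nrow(cancer_df)))
train_df  <- cancer_df[train_idx, ]      # 80% for model development
test_df   <- cancer_df[-train_idx, ]     # 20% held out for evaluation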
In this section, we will perform class prediction with this dataset using filtering and logistic regression. For the model development parts, use the training dataset.
6. Use a correlation filter to find the IDs of the 10 genes that are most related to disease status.
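One way to implement such a filter, assuming the sample and disease columns from before and coding tumor as 1:

gene_cols <- setdiff(names(train_df), c("sample", "disease"))
y_num     <- as.numeric(train_df$disease == "tumor")           # tumor = 1, normal = 0
cors      <- cor(as.matrix(train_df[, gene_cols]), y_num)[, 1] # gene-outcome correlations
top10     <- names(sort(abs(cors), decreasing = TRUE))[1:10]   # 10 largest |correlations|
top10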
7. Perform logistic regression, predicting the outcome using the selected genes. Name the fitted object fit_lr.
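A sketch, reusing top10 from the previous exercise; selecting the genes into a small data frame first lets glm() handle the non-syntactic gene IDs:

lr_df  <- train_df %>%
  select(all_of(top10), disease) %>%
  mutate(disease = factor(disease))      # glm() needs a factor (or 0/1) outcome
fit_lr <- glm(disease ~ ., family = binomial, data = lr_df)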
8. Create a confusion matrix for the predictions of this model on the test set. What is the accuracy of this model?
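A sketch, using 0.5 as the classification threshold (our assumption); since "tumor" is the second factor level, the predicted probabilities refer to the tumor class:

prob_lr <- predict(fit_lr, newdata = test_df, type = "response")  # P(tumor)
pred_lr <- ifelse(prob_lr > 0.5, "tumor", "normal")
table(predicted = pred_lr, true = test_df$disease)                # confusion matrix
mean(pred_lr == test_df$disease)                                  # accuracy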
In this section, we will use the glmnet package to perform LASSO regression, which will automatically set certain coefficients to 0. The first step in performing LASSO regression is to prepare the data in the correct format. Read the help file of glmnet() to figure out what the x and y inputs should be.
9. Prepare your data for input into glmnet. Create x_train, y_train, x_test, and y_test.
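A sketch, reusing gene_cols from the correlation filter step; glmnet() expects a numeric predictor matrix x and, for binomial models, a two-level factor y:

x_train <- as.matrix(train_df[, gene_cols])
y_train <- factor(train_df$disease)
x_test  <- as.matrix(test_df[, gene_cols])
y_test  <- factor(test_df$disease)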
10. Use the glmnet() function to fit a LASSO regression. Use the plot() function on the fitted model and describe what you see.
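A minimal sketch:

fit_lasso <- glmnet(x_train, y_train, family = "binomial")
plot(fit_lasso)   # coefficient paths: each line traces one gene's coefficient
                  # along the regularization path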
The next step is finding the right penalization parameter λ. In other words, we need to tune this hyperparameter. We want to select the hyperparameter which yields the best out-of-sample prediction performance. We could do this by further splitting the training dataset into a train and development subset, or with cross-validation. In this case, we will use the built-in cross-validation function of the glmnet package: cv.glmnet.
11. Run cv.glmnet for your dataset. Run the plot() function on the resulting object. Explain in your own words what you see. NB: Do not forget to set family = "binomial" to ensure that you are running logistic regression.
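A sketch; with 54675 predictors the cross-validation can take a while, and setting a seed makes the folds reproducible:

set.seed(45)
fit_cv <- cv.glmnet(x_train, y_train, family = "binomial")
plot(fit_cv)      # CV deviance per lambda; lambda.min and lambda.1se are marked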
12. Inspect the nonzero coefficients of the model with the lowest out-of-sample deviance. Hint: use the coef() function, and make sure to use the right value for the s argument to that function. Do you see overlap between the correlation filter selections and the LASSO results?
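A sketch; s = "lambda.min" requests the coefficients at the lambda with the lowest cross-validated deviance:

cf      <- coef(fit_cv, s = "lambda.min")      # sparse coefficient matrix
nonzero <- rownames(cf)[cf[, 1] != 0]          # names of nonzero coefficients
nonzero                                        # note: includes "(Intercept)"
intersect(nonzero, top10)                      # overlap with the correlation filter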
13. Use the predict() function on the fitted cv.glmnet object to predict disease status for the test set based on the optimized lambda value. Create a confusion matrix and compare this with the logistic regression model we made earlier in terms of accuracy.
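A sketch; type = "class" returns predicted labels directly:

pred_lasso <- predict(fit_cv, newx = x_test, s = "lambda.min", type = "class")
table(predicted = as.vector(pred_lasso), true = y_test)   # confusion matrix
mean(pred_lasso == y_test)                                # accuracy; compare with fit_lr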