Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.

1 Introduction

In this practical, we will deal with the curse of dimensionality by applying the “bet on sparsity”. We will use the following packages in the process:


Create a practical folder with an .Rproj file (e.g., practical_01.Rproj) and a data folder inside. Download the prepared files below and put them in the data folder in your project directory.

2 Take home exercises

2.1 Gene expression data

The data file we will be working with is gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at, AFFX-r2-P1-cre-3_at. (NB: these IDs are specific for this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”). The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.

The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?

1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?

2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.

3. We now only have the gene expression data, but the labels are in the file phenotypes.rds. Load that file, select() the relevant columns for classification into normal and tumor tissue, and join() it with the gene expression data, based on the tissue identifier in the sample column. Give the resulting dataset a good name!

4. Does this dataset suffer from class imbalance?

5. Split the data into a training (80%) and a test set (20%). We will use the training set for model development in the next section.

3 Lab exercises

3.1 Correlation filter & logistic regression

In this section, we will perform class prediction with this dataset using filtering and logistic regression. For the model development parts, use the training dataset.

6. Use a correlation filter to find the IDs of the 10 genes that are most related to disease status.

7. Perform logistic regression, predicting the outcome using the selected genes. Name the fitted object fit_lr.

8. Create a confusion matrix for the predictions of this model on the test set. What is the accuracy of this model?

3.2 Regularized regression

In this section, we will use the glmnet package to perform LASSO regression, which will automatically set certain coefficients to 0. The first step in performing LASSO regression is to prepare the data in the correct format. Read the help file of glmnet() to figure out what the x and y inputs should be.

9. Prepare your data for input into glmnet. Create x_train, y_train, x_test, and y_test.

10. Use the glmnet function to fit a LASSO regression. Use the plot() function on the fitted model and describe what you see.

The next step is finding the right penalization parameter λ. In other words, we need to tune this hyperparameter. We want to select the hyperparameter which yields the best out-of-sample prediction performance. We could do this by further splitting the training dataset into a train and development subset, or with cross-validation. In this case, we will use the built-in cross-validation function of the glmnet package: cv.glmnet.

11. Run cv.glmnet for your dataset. Run the plot() function on the resulting object. Explain in your own words what you see. NB: Do not forget to set family = "binomial" to ensure that you are running logistic regression.

12. Inspect the nonzero coefficients of the model with the lowest out-of-sample deviance. Hint: use the coef() function, and make sure to use the right value for the s argument to that function. Do you see overlap between the correlation filter selections and the LASSO results?

13. Use the predict() function on the fitted cv.glmnet object to predict disease status for the test set based on the optimized lambda value. Create a confusion matrix and compare this with the logistic regression model we made earlier in terms of accuracy.