In this practical, we will deal with the curse of dimensionality by applying the “bet on sparsity”. We will use the following packages in the process:
library(tidyverse)
library(glmnet)
Create a practical folder with an .Rproj file (e.g.,
practical_01.Rproj
) and a data folder inside. Download the
prepared files below and put them in the data folder in your project
directory.
The data file we will be working with is gene expression data. Using
microarrays, the expression of many genes can be measured at the same
time. The data file contains expressions for 54675 genes with IDs such
as 1007_s_at, 202896_s_at, and AFFX-r2-P1-cre-3_at. (NB: these IDs are specific for this
type of chip and need to be converted to actual gene names before they
can be looked up in a database such as “GeneCards”). The values in the
data file are related to the amount of RNA belonging to each gene found
in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?
# read the data to a tibble
expr_dat <- read_rds("data/gene_expressions.rds")
# inspect the dimensions
dim(expr_dat)
## [1] 237 54676
# The file has 54676 columns (54675 gene expressions plus the sample identifier) and 237 rows
# the sample size is 237
2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.
expr_dat[, 1:7] %>%
  pivot_longer(-sample, names_to = "gene") %>%
ggplot(aes(x = value, fill = gene)) +
geom_histogram(colour = "black", bins = 35) +
theme_minimal() +
facet_wrap(~gene) +
labs(x = "Expression", y = "Count") +
scale_fill_viridis_d(guide = "none")
# values range from ~2.5 to ~12.5. Some gene expressions are skewed, some look roughly normal.
3. We now only have the gene expression data, but the labels are in the file phenotypes.rds. Load that file, select() the relevant columns for classification into normal and tumor tissue, and join() it with the gene expression data, based on the tissue identifier in the sample column. Give the resulting dataset a good name!
phen <- read_rds("data/phenotypes.rds")

cancer_df <- phen %>%
  mutate(disease = as_factor(disease)) %>% # it's always a good idea to ensure the correct measurement level!
  select(sample, disease) %>%
  right_join(expr_dat)
## Joining, by = "sample"
4. Does this dataset suffer from class imbalance?
prop.table(table(cancer_df$disease))
##
## normal tumor
## 0.4894515 0.5105485
# The two classes (normal & tumor) are nearly 50/50, so no class imbalance!
5. Split the data into a training (80%) and a test set (20%). We will use the training set for model development in the next section.
set.seed(45)
train_idx <- sample(237, 190)
cancer_df_train <- cancer_df[train_idx, ]
cancer_df_test <- cancer_df[-train_idx, ]
In this section, we will perform class prediction with this dataset using filtering and logistic regression. For the model development parts, use the training dataset.
6. Use a correlation filter to find the IDs of the 10 genes that are most related to disease status.
# There are many ways to go about this! You can use correlations or t-scores or standardised mean differences or other metrics.
# Here we use correlations
# get disease status as an indicator (for the cor function)
y <- as.numeric(cancer_df_train$disease == "tumor")

# get the gene expressions as a matrix (for the apply function)
X <- cancer_df_train %>%
  select(-disease, -sample) %>%
  as.matrix()

# use the apply function to get a correlation of every column (margin = 2) with the disease status
cors <- apply(X, 2, cor, y = y)

# select the 10 most strongly correlated genes (don't forget the abs() function!)
(cors_10 <- sort(abs(cors), decreasing = TRUE)[1:10])
## 209424_s_at 242138_at 209426_s_at 232575_at 209425_at 206858_s_at
## 0.7468092 0.7441597 0.7345152 0.7270552 0.7257923 0.7255545
## 207147_at 204934_s_at 236365_at 217111_at
## 0.7210385 0.7142122 0.7052019 0.7050075
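As the comment above notes, correlation is only one possible filter. A per-gene t-statistic gives a similar ranking; a small sketch reusing the X and y objects defined above (the variable name t_stats is ours):

```r
# alternative filter: absolute two-sample t-statistic per gene,
# comparing tumor (y == 1) against normal (y == 0) samples
t_stats <- apply(X, 2, function(gene) t.test(gene[y == 1], gene[y == 0])$statistic)

# the 10 genes with the largest absolute t-statistics
sort(abs(t_stats), decreasing = TRUE)[1:10]
```

The gene IDs selected this way need not be identical to the correlation-based top 10, but the two lists typically overlap substantially.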
7. Perform logistic regression, predicting the outcome using the selected genes. Name the fitted object fit_lr.
top10_form <- disease ~ `209424_s_at` + `242138_at` + `209426_s_at` + `232575_at` + `209425_at` +
  `206858_s_at` + `207147_at` + `204934_s_at` + `236365_at` + `217111_at`

fit_lr <- glm(formula = top10_form, family = binomial(), data = cancer_df_train)
summary(fit_lr)
##
## Call:
## glm(formula = top10_form, family = binomial(), data = cancer_df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.41353 -0.22586 0.00133 0.03044 2.94022
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.28706 5.32947 -4.557 5.19e-06 ***
## `209424_s_at` 0.47378 0.74468 0.636 0.5246
## `242138_at` 0.56413 0.48396 1.166 0.2438
## `209426_s_at` -0.08493 0.76242 -0.111 0.9113
## `232575_at` 0.45622 0.32990 1.383 0.1667
## `209425_at` 0.58400 1.07048 0.546 0.5854
## `206858_s_at` 0.82918 0.34501 2.403 0.0162 *
## `207147_at` 1.02348 0.58740 1.742 0.0814 .
## `204934_s_at` -0.03303 0.48325 -0.068 0.9455
## `236365_at` -0.15008 0.83584 -0.180 0.8575
## `217111_at` -0.05181 0.83299 -0.062 0.9504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 263.059 on 189 degrees of freedom
## Residual deviance: 59.631 on 179 degrees of freedom
## AIC: 81.631
##
## Number of Fisher Scoring iterations: 8
8. Create a confusion matrix for the predictions of this model on the test set. What is the accuracy of this model?
pred_lr <- predict(fit_lr, newdata = cancer_df_test, type = "response")
table(observed = cancer_df_test$disease, predicted = ifelse(pred_lr > 0.5, "tumor", "normal"))
## predicted
## observed normal tumor
## normal 22 3
## tumor 2 20
# accuracy: (22+20) / (22+20+2+3) = 89%
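Instead of reading the accuracy off the confusion matrix by hand, it can also be computed directly from the predictions; a small sketch using the pred_lr object from above (the name pred_class is ours):

```r
# accuracy = proportion of test cases classified correctly
pred_class <- ifelse(pred_lr > 0.5, "tumor", "normal")
mean(pred_class == cancer_df_test$disease)
```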
In this section, we will use the glmnet package to perform LASSO regression, which will automatically set certain coefficients to 0. The first step in performing LASSO regression is to prepare the data in the correct format. Read the help file of glmnet() to figure out what the x and y inputs should be.
9. Prepare your data for input into glmnet. Create x_train, y_train, x_test, and y_test.
x_train <- cancer_df_train %>% select(-sample, -disease) %>% as.matrix()
y_train <- cancer_df_train %>% pull(disease)

x_test <- cancer_df_test %>% select(-sample, -disease) %>% as.matrix()
y_test <- cancer_df_test %>% pull(disease)
10. Use the glmnet function to fit a LASSO regression. Use the plot() function on the fitted model and describe what you see.
fit_lasso <- glmnet(x_train, y_train, family = "binomial")
plot(fit_lasso)
# Each line is a coefficient, with its value on the y-axis.
# On the x-axis we see the L1 norm: the total "budget" for the parameters.
# As we decrease the budget (right-to-left), the coefficients
# shrink towards 0. Decreasing the budget further, some coefficients
# become exactly 0, which shows in the number of nonzero
# parameters (on top).
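By default, plot() shows the coefficient paths against the L1 norm; the glmnet plot method also accepts an xvar argument to plot against log(lambda) instead, which connects directly to the penalty parameter we tune next. A quick variation on the call above:

```r
# same coefficient paths, but with log(lambda) on the x-axis
# (large lambda = small budget, now on the right)
plot(fit_lasso, xvar = "lambda")
```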
The next step is finding the right penalization parameter \(\lambda\). In other words, we need to
tune this hyperparameter. We want to select the
hyperparameter which yields the best out-of-sample prediction
performance. We could do this by further splitting the training dataset
into a train and development subset, or with cross-validation. In this
case, we will use the built-in cross-validation function of the glmnet package: cv.glmnet.
11. Run cv.glmnet for your dataset. Run the plot() function on the resulting object. Explain in your own words what you see. NB: Do not forget to set family = "binomial" to ensure that you are running logistic regression.
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial")
plot(cv_fit)
# We see a plot with the log of the penalty (complexity) on the x-axis and
# the binomial deviance (lack of fit) on the y-axis. On top is the number
# of nonzero parameters. The cross-validation procedure has computed the
# out-of-sample deviance for a range of lambda values.
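The two vertical dotted lines in this plot correspond to two lambda values stored in the fitted object, which can be inspected directly:

```r
# lambda with the lowest cross-validated deviance
cv_fit$lambda.min

# largest lambda within one standard error of that minimum
# (a more conservative, sparser choice)
cv_fit$lambda.1se
```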
12. Inspect the nonzero coefficients of the model with the
lowest out-of-sample deviance. Hint: use the coef()
function, and make sure to use the right value for the s
argument to that function. Do you see overlap between the correlation
filter selections and the LASSO results?
coefs_1se <- coef(cv_fit, s = "lambda.min")
nonzero_idx <- which(coefs_1se[, 1] != 0)
coefs_1se[nonzero_idx, ]
## (Intercept) 1552666_a_at 1553561_at 1558489_at 1559861_at
## -13.815760253 0.109047140 0.041621016 0.007357958 0.135971034
## 1561232_at 1561555_at 1566137_s_at 1570425_s_at 201877_s_at
## 0.092398285 0.321915500 0.143523318 0.240307568 -0.041703816
## 203393_at 204430_s_at 204735_at 205083_at 205202_at
## 0.081759546 -0.101113381 -0.010136251 -0.246611320 -0.432237537
## 206023_at 206364_at 206858_s_at 207147_at 209244_s_at
## 0.179125395 0.189693770 0.307651970 0.025149174 -0.092640259
## 209426_s_at 211214_s_at 211330_s_at 211333_s_at 213877_x_at
## 0.237303890 0.096199709 -0.022820180 0.063545546 -0.029311397
## 216258_s_at 216548_x_at 217170_at 218487_at 218931_at
## 0.212913674 0.032348352 0.070664896 -0.026417998 0.102875468
## 220541_at 223652_at 224460_s_at 227653_at 229703_at
## 0.135141208 -0.116867888 0.220905057 0.597606843 0.117741168
## 232575_at 233327_at 238239_at 238463_at 241110_at
## 0.105691542 0.149565741 0.764672092 -0.610048991 0.239376191
## 241235_at 241310_at 242138_at 243551_at
## 0.275540645 0.049418164 0.227762711 0.056388035
# using this train-test split, the following genes are in both selections
intersect(names(cors_10), names(coefs_1se[nonzero_idx,]))
## [1] "242138_at" "209426_s_at" "232575_at" "206858_s_at" "207147_at"
# so there is some overlap but not complete overlap!
13. Use the predict() function on the fitted cv.glmnet object to predict disease status for the test set
based on the optimized lambda value. Create a confusion matrix and
compare this with the logistic regression model we made earlier in terms
of accuracy.
pred_glmnet <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
table(observed = cancer_df_test$disease, predicted = ifelse(pred_glmnet > 0.5, "tumor", "normal"))
## predicted
## observed normal tumor
## normal 23 2
## tumor 2 20
# accuracy: (23+20) / (23+20+2+2) = 91%
# So this model performs slightly better on the test set than the logistic regression model
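As before, the accuracy can also be computed directly from the predictions rather than by hand; a small sketch using the pred_glmnet object from above (the name pred_class_lasso is ours):

```r
# accuracy of the LASSO model on the test set
pred_class_lasso <- ifelse(pred_glmnet > 0.5, "tumor", "normal")
mean(pred_class_lasso == cancer_df_test$disease)
```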