In this practical, we will deal with the curse of dimensionality by applying the “bet on sparsity”. We will use the following packages in the process:
library(tidyverse)
library(glmnet)
Create a practical folder with an .Rproj file (e.g.,
practical_01.Rproj
) and a data folder inside. Download the
prepared files below and put them in the data folder in your project
directory.
The data file we will be working with is gene expression data. Using
microarrays, the expression of many genes can be measured at the same
time. The data file contains expressions for 54675 genes with IDs such
as 1007_s_at, 202896_s_at, and AFFX-r2-P1-cre-3_at. (NB: these IDs are specific for this
type of chip and need to be converted to actual gene names before they
can be looked up in a database such as “GeneCards”). The values in the
data file are related to the amount of RNA belonging to each gene found
in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?
# read the data to a tibble
expr_dat <- read_rds("data/gene_expressions.rds")
# inspect the dimensions
dim(expr_dat)
## [1] 237 54676
# The file has 54676 columns (54675 gene expressions plus the sample identifier) and 237 rows
# the sample size is 237
2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.
expr_dat[, 1:7] %>%
  pivot_longer(-sample, names_to = "gene") %>%
ggplot(aes(x = value, fill = gene)) +
geom_histogram(colour = "black", bins = 35) +
theme_minimal() +
facet_wrap(~gene) +
labs(x = "Expression", y = "Count") +
scale_fill_viridis_d(guide = "none")
# values range from ~2.5 to ~12.5. Some gene expressions are skewed, some look roughly normal.
3. We now only have the gene expression data, but the labels are in the file phenotypes.rds. Load that file, select() the relevant columns for classification into normal and tumor tissue, and join() it with the gene expression data, based on the tissue identifier in the sample column. Give the resulting dataset a good name!
phen <- read_rds("data/phenotypes.rds")

cancer_df <- phen %>%
  mutate(disease = as_factor(disease)) %>% # it's always a good idea to ensure the correct measurement level!
  select(sample, disease) %>%
  right_join(expr_dat)
## Joining, by = "sample"
4. Does this dataset suffer from class imbalance?
prop.table(table(cancer_df$disease))
##
## normal tumor
## 0.4894515 0.5105485
# The two classes (normal & tumor) are nearly 50/50, so no class imbalance!
5. Split the data into a training (80%) and a test set (20%). We will use the training set for model development in the next section.
set.seed(45)
train_idx <- sample(237, 190)
cancer_df_train <- cancer_df[train_idx, ]
cancer_df_test <- cancer_df[-train_idx, ]
In this section, we will perform class prediction with this dataset using filtering and logistic regression. For the model development parts, use the training dataset.
6. Use a correlation filter to find the IDs of the 10 genes that are most related to disease status.
# There are many ways to go about this! You can use correlations or t-scores or standardised mean differences or other metrics.
# Here we use correlations
# get disease status as an indicator (for the cor function)
y <- as.numeric(cancer_df_train$disease == "tumor")

# get the gene expressions as a matrix (for the apply function)
X <- cancer_df_train %>%
  select(-disease, -sample) %>%
  as.matrix()

# use the apply function to get a correlation of every column (margin = 2) with the disease status
cors <- apply(X, 2, cor, y = y)

# select the 10 most strongly correlated genes (don't forget the abs() function!)
(cors_10 <- sort(abs(cors), decreasing = TRUE)[1:10])
## 209424_s_at 242138_at 209426_s_at 232575_at 209425_at 206858_s_at
## 0.7468092 0.7441597 0.7345152 0.7270552 0.7257923 0.7255545
## 207147_at 204934_s_at 236365_at 217111_at
## 0.7210385 0.7142122 0.7052019 0.7050075
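As the comment above notes, correlation is only one possible filter. A per-gene t-statistic gives a similar ranking; a small sketch reusing the X and y objects defined above (the variable name t_stats is ours):

```r
# alternative filter: absolute two-sample t-statistic per gene,
# comparing tumor (y == 1) against normal (y == 0) samples
t_stats <- apply(X, 2, function(gene) t.test(gene[y == 1], gene[y == 0])$statistic)

# the 10 genes with the largest absolute t-statistics
sort(abs(t_stats), decreasing = TRUE)[1:10]
```

The gene IDs selected this way need not be identical to the correlation-based top 10, but the two lists typically overlap substantially.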
7. Perform logistic regression, predicting the outcome using the selected genes. Name the fitted object fit_lr.
top10_form <- disease ~ `209424_s_at` + `242138_at` + `209426_s_at` + `232575_at` + `209425_at` +
  `206858_s_at` + `207147_at` + `204934_s_at` + `236365_at` + `217111_at`

fit_lr <- glm(formula = top10_form, family = binomial(), data = cancer_df_train)
summary(fit_lr)
##
## Call:
## glm(formula = top10_form, family = binomial(), data = cancer_df_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.41353 -0.22586 0.00133 0.03044 2.94022
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.28706 5.32947 -4.557 5.19e-06 ***
## `209424_s_at` 0.47378 0.74468 0.636 0.5246
## `242138_at` 0.56413 0.48396 1.166 0.2438
## `209426_s_at` -0.08493 0.76242 -0.111 0.9113
## `232575_at` 0.45622 0.32990 1.383 0.1667
## `209425_at` 0.58400 1.07048 0.546 0.5854
## `206858_s_at` 0.82918 0.34501 2.403 0.0162 *
## `207147_at` 1.02348 0.58740 1.742 0.0814 .
## `204934_s_at` -0.03303 0.48325 -0.068 0.9455
## `236365_at` -0.15008 0.83584 -0.180 0.8575
## `217111_at` -0.05181 0.83299 -0.062 0.9504
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 263.059 on 189 degrees of freedom
## Residual deviance: 59.631 on 179 degrees of freedom
## AIC: 81.631
##
## Number of Fisher Scoring iterations: 8
8. Create a confusion matrix for the predictions of this model on the test set. What is the accuracy of this model?
pred_lr <- predict(fit_lr, newdata = cancer_df_test, type = "response")
table(observed = cancer_df_test$disease, predicted = ifelse(pred_lr > 0.5, "tumor", "normal"))
## predicted
## observed normal tumor
## normal 22 3
## tumor 2 20
# accuracy: (22+20) / (22+20+2+3) = 89%
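Instead of reading the accuracy off the confusion matrix by hand, it can also be computed directly from the predictions; a small sketch using the pred_lr object from above (the name pred_class is ours):

```r
# accuracy = proportion of test cases classified correctly
pred_class <- ifelse(pred_lr > 0.5, "tumor", "normal")
mean(pred_class == cancer_df_test$disease)
```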
In this section, we will use the glmnet package to perform LASSO regression, which will automatically set certain coefficients to 0. The first step in performing LASSO regression is to prepare the data in the correct format. Read the help file of glmnet() to figure out what the x and y inputs should be.
9. Prepare your data for input into glmnet. Create x_train, y_train, x_test, and y_test.
x_train <- cancer_df_train %>% select(-sample, -disease) %>% as.matrix()
y_train <- cancer_df_train %>% pull(disease)

x_test <- cancer_df_test %>% select(-sample, -disease) %>% as.matrix()
y_test <- cancer_df_test %>% pull(disease)
10. Use the glmnet function to fit a LASSO regression. Use the plot() function on the fitted model and describe what you see.
fit_lasso <- glmnet(x_train, y_train, family = "binomial")
plot(fit_lasso)
# Each line is a coefficient, with its value on the y-axis.
# On the x-axis we see the L1 norm: the total "budget" for the parameters.
# As we decrease the budget (right-to-left), the coefficients
# shrink towards 0. Decreasing the budget further, some coefficients
# become exactly 0, which shows in the number of nonzero
# parameters (on top).
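By default, plot() shows the coefficient paths against the L1 norm; the glmnet plot method also accepts an xvar argument to plot against log(lambda) instead, which connects directly to the penalty parameter we tune next. A quick variation on the call above:

```r
# same coefficient paths, but with log(lambda) on the x-axis
# (large lambda = small budget, now on the right)
plot(fit_lasso, xvar = "lambda")
```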
The next step is finding the right penalization parameter \(\lambda\). In other words, we need to
tune this hyperparameter. We want to select the
hyperparameter which yields the best out-of-sample prediction
performance. We could do this by further splitting the training dataset
into a train and development subset, or with cross-validation. In this
case, we will use the built-in cross-validation function of the glmnet package: cv.glmnet.
11. Run cv.glmnet for your dataset. Run the plot() function on the resulting object. Explain in your own words what you see. NB: Do not forget to set family = "binomial" to ensure that you are running logistic regression.
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial")
plot(cv_fit)
# We see a plot with the log of the penalty (complexity) on the x-axis and
# the binomial deviance (lack of fit) on the y-axis. On top is the number
# of nonzero parameters. The cross-validation procedure has computed the
# out-of-sample deviance for a range of lambda values.
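The two vertical dotted lines in this plot correspond to two lambda values stored in the fitted object, which can be inspected directly:

```r
# lambda with the lowest cross-validated deviance
cv_fit$lambda.min

# largest lambda within one standard error of that minimum
# (a more conservative, sparser choice)
cv_fit$lambda.1se
```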
12. Inspect the nonzero coefficients of the model with the
lowest out-of-sample deviance. Hint: use the coef()
function, and make sure to use the right value for the s
argument to that function. Do you see overlap between the correlation
filter selections and the LASSO results?
coefs_1se <- coef(cv_fit, s = "lambda.min")
nonzero_idx <- which(coefs_1se[, 1] != 0)
coefs_1se[nonzero_idx, ]
## (Intercept) 1552666_a_at 1553561_at 1558489_at 1559861_at
## -13.815760253 0.109047140 0.041621016 0.007357958 0.135971034
## 1561232_at 1561555_at 1566137_s_at 1570425_s_at 201877_s_at
## 0.092398285 0.321915500 0.143523318 0.240307568 -0.041703816
## 203393_at 204430_s_at 204735_at 205083_at 205202_at
## 0.081759546 -0.101113381 -0.010136251 -0.246611320 -0.432237537
## 206023_at 206364_at 206858_s_at 207147_at 209244_s_at
## 0.179125395 0.189693770 0.307651970 0.025149174 -0.092640259
## 209426_s_at 211214_s_at 211330_s_at 211333_s_at 213877_x_at
## 0.237303890 0.096199709 -0.022820180 0.063545546 -0.029311397
## 216258_s_at 216548_x_at 217170_at 218487_at 218931_at
## 0.212913674 0.032348352 0.070664896 -0.026417998 0.102875468
## 220541_at 223652_at 224460_s_at 227653_at 229703_at
## 0.135141208 -0.116867888 0.220905057 0.597606843 0.117741168
## 232575_at 233327_at 238239_at 238463_at 241110_at
## 0.105691542 0.149565741 0.764672092 -0.610048991 0.239376191
## 241235_at 241310_at 242138_at 243551_at
## 0.275540645 0.049418164 0.227762711 0.056388035
# using this train-test split, the following genes are in both selections
intersect(names(cors_10), names(coefs_1se[nonzero_idx,]))
## [1] "242138_at" "209426_s_at" "232575_at" "206858_s_at" "207147_at"
# so there is some overlap but not complete overlap!
13. Use the predict() function on the fitted cv.glmnet object to predict disease status for the test set
based on the optimized lambda value. Create a confusion matrix and
compare this with the logistic regression model we made earlier in terms
of accuracy.
pred_glmnet <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
table(observed = cancer_df_test$disease, predicted = ifelse(pred_glmnet > 0.5, "tumor", "normal"))
## predicted
## observed normal tumor
## normal 23 2
## tumor 2 20
# accuracy: (23+20) / (23+20+2+2) = 91%
# So this model performs slightly better on the test set than the logistic regression model
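As before, the accuracy can also be computed directly from the predictions rather than by hand; a small sketch using the pred_glmnet object from above (the name pred_class_lasso is ours):

```r
# accuracy of the LASSO model on the test set
pred_class_lasso <- ifelse(pred_glmnet > 0.5, "tumor", "normal")
mean(pred_class_lasso == cancer_df_test$disease)
```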