In this practical, we will deal with the curse of dimensionality by applying the “bet on sparsity” principle. We will use the following packages in the process:
library(tidyverse)
library(glmnet)
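To make the “bet on sparsity” concrete before turning to the real data, here is a minimal sketch on simulated data. The sizes, coefficients, seed, and penalty value are assumptions chosen for illustration and are not part of the practical's data. It fits a lasso path with glmnet() on a problem with more predictors than observations, where only a few predictors truly matter.

```r
library(glmnet)

# Simulated data (illustrative assumption): 50 observations, 200 predictors
set.seed(45)
n <- 50
p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Sparse truth: only the first 5 predictors have a non-zero effect
beta <- c(rep(2, 5), rep(0, p - 5))
y <- as.numeric(x %*% beta + rnorm(n))

# Lasso path (alpha = 1); each lambda corresponds to a different penalty strength
fit <- glmnet(x, y, alpha = 1)

# Number of non-zero coefficients at each lambda along the path
n_nonzero <- fit$df

# Coefficients at one arbitrarily chosen penalty value (s = 0.5 is an assumption)
coefs <- as.matrix(coef(fit, s = 0.5))
selected <- which(coefs[-1, 1] != 0)  # drop the intercept row
```

Printing n_nonzero shows how the penalty controls how many predictors stay in the model, and selected holds the indices of the predictors kept at the chosen penalty.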
Create a practical folder with an .Rproj file (e.g., practical_01.Rproj) and a data folder inside. Download the prepared files below and put them in the data folder in your project directory.
The data file we will be working with is gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expressions for 54675 genes with IDs such as 1007_s_at, 202896_s_at, AFFX-r2-P1-cre-3_at. (NB: these IDs are specific for this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”). The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.
The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?
1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?
# read the data to a tibble
expr_dat <- read_rds("data/gene_expressions.rds")
# inspect the dimensions
dim(expr_dat)
## [1] 237 54676
# The file has 237 rows and 54676 columns:
# one sample identifier column plus 54675 gene expression columns
# the sample size is 237
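The dimension check above can be split into a sample size and a number of predictors. The sketch below uses a small made-up tibble (toy_dat and its values are hypothetical) with the layout the output suggests: one sample identifier column followed by gene columns.

```r
library(tibble)

# Hypothetical stand-in for expr_dat: 3 samples, 5 gene columns
toy_dat <- tibble(
  sample = c("s1", "s2", "s3"),
  gene_1 = c(5.1, 4.8, 5.3),
  gene_2 = c(7.2, 7.9, 7.4),
  gene_3 = c(2.3, 2.1, 2.8),
  gene_4 = c(9.0, 8.7, 9.4),
  gene_5 = c(4.4, 4.6, 4.1)
)

n_samples <- nrow(toy_dat)     # one row per tissue sample
n_genes <- ncol(toy_dat) - 1   # exclude the sample identifier column

# TRUE when there are more predictors than observations
high_dimensional <- n_genes > n_samples
```

Applied to expr_dat, the same two lines would give 237 samples and 54675 genes, so there are far more predictors than observations.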
2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.
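# columns 1:7 hold the sample identifier plus the first 6 genes
expr_dat[, 1:7] %>%
  # reshape to long format: one row per sample-gene combination
  pivot_longer(-sample, names_to = "gene") %>%
  ggplot(aes(x = value, fill = gene)) +
  geom_histogram(colour = "black", bins = 35) +
  theme_minimal() +
  # one panel per gene
  facet_wrap(~gene) +
  labs(x = "Expression", y = "Count") +
  scale_fill_viridis_d(guide = "none")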
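The histograms can be complemented with a numeric summary per gene. The following is a minimal sketch on simulated values (toy_expr, the gene names, and the distributions are assumptions for illustration), using the same pivot_longer() reshaping as the plot.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Simulated stand-in for the first few columns of expr_dat (hypothetical values)
set.seed(1)
toy_expr <- tibble(
  sample = paste0("s", 1:20),
  gene_a = rnorm(20, mean = 5, sd = 1),
  gene_b = rnorm(20, mean = 9, sd = 0.3),
  gene_c = rexp(20, rate = 1)
)

# Long format, then location and spread for each gene
gene_summary <- toy_expr %>%
  pivot_longer(-sample, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(
    mean = mean(value),
    sd   = sd(value),
    min  = min(value),
    max  = max(value)
  )
```

Clear differences in mean and spread between genes in such a table would suggest that the variables are not on a common scale.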