Introduction

In this practical, we will tackle the curse of dimensionality by applying the “bet on sparsity” principle: we assume that only a small number of features are truly related to the outcome. We will use the following packages in the process:

library(tidyverse)
library(glmnet)
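
To make the “bet on sparsity” concrete, here is a minimal sketch on simulated data (not the practical's data). It assumes a setting with many more features than observations in which only the first five features affect the outcome, and fits a lasso with cross-validated penalty using glmnet. All object names, the seed, and the simulated dimensions are illustrative assumptions.

```r
library(glmnet)

set.seed(45)

# simulated data: far more features (p) than observations (n)
n <- 50
p <- 500
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# assumption for illustration: only the first 5 features affect the outcome
beta <- c(rep(2, 5), rep(0, p - 5))
y <- as.numeric(x %*% beta + rnorm(n))

# lasso (alpha = 1) with the penalty strength chosen by cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)

# coefficients at the selected penalty, dropping the intercept
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]

# number of features the lasso keeps (non-zero coefficients)
n_selected <- sum(coefs != 0)
n_selected
```

The lasso sets most coefficients to exactly zero, so the fitted model uses only a small subset of the 500 features even though an ordinary least squares fit would not be identifiable here.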

Create a practical folder with an .Rproj file (e.g., practical_01.Rproj) and a data folder inside. Download the prepared files below and put them in the data folder in your project directory.

Take-home exercises

Gene expression data

The data file we will be working with contains gene expression data. Using microarrays, the expression of many genes can be measured at the same time. The data file contains expression values for 54675 genes with IDs such as 1007_s_at, 202896_s_at, AFFX-r2-P1-cre-3_at. (NB: these IDs are specific to this type of chip and need to be converted to actual gene names before they can be looked up in a database such as “GeneCards”.) The values in the data file are related to the amount of RNA belonging to each gene found in the tissue sample.

The goal of the study for which this data was collected is one of exploratory cancer classification: are there differences in gene expression between tissue samples of human prostates with and without prostate cancer?

1. Read the data file gene_expressions.rds using read_rds(). What are the dimensions of the data? What is the sample size?

# read the data to a tibble
expr_dat <- read_rds("data/gene_expressions.rds")

# inspect the dimensions
dim(expr_dat)
## [1]   237 54676
# The file has 237 rows and 54676 columns:
# 54675 gene columns plus the sample column

# the sample size is 237
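
The same check can be written so that it compares the number of features with the number of observations directly. The sketch below uses a small simulated tibble as a stand-in for expr_dat, since the real file may not be at hand; the names toy_dat, n_samples and n_genes, and the toy dimensions, are hypothetical.

```r
library(tidyverse)

set.seed(1)

# hypothetical stand-in for expr_dat: 10 samples and 25 "genes"
n_samples <- 10
n_genes <- 25
toy_mat <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
colnames(toy_mat) <- paste0("gene_", seq_len(n_genes))

# add an identifier column, mirroring the sample column in the real data
toy_dat <- as_tibble(toy_mat) %>%
  add_column(sample = paste0("S", seq_len(n_samples)), .before = 1)

# observations are rows; features are all columns except the identifier
n <- nrow(toy_dat)
p <- ncol(toy_dat) - 1

# TRUE when there are more features than observations
p > n
```

Applied to the real data, the same two lines would give n = 237 and p = 54675, so the number of features is far larger than the sample size.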

2. As always, visualisation is a good idea. Create histograms of the first 6 variables. Describe what you notice.

# select 7 columns: the sample column plus 6 gene columns
expr_dat[, 1:7] %>%
  pivot_longer(-sample, names_to = "gene") %>% 
  ggplot(aes(x = value, fill = gene)) +
  geom_histogram(colour = "black", bins = 35) +
  theme_minimal() +
  facet_wrap(~gene) +
  labs(x = "Expression", y = "Count") +
  scale_fill_viridis_d(guide = "none")
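
A numeric summary per gene can complement the histograms when describing the distributions. The sketch below uses the same pivot_longer() reshaping on a small simulated tibble; toy_dat and its gene_a, gene_b and gene_c columns are hypothetical stand-ins, not values from the real data.

```r
library(tidyverse)

set.seed(2)

# hypothetical stand-in for the first columns of expr_dat
toy_dat <- tibble(
  sample = paste0("S", 1:20),
  gene_a = rnorm(20, mean = 5),
  gene_b = rexp(20),
  gene_c = rnorm(20, mean = 8, sd = 2)
)

# reshape to long format, then summarise each gene's distribution
gene_summary <- toy_dat %>%
  pivot_longer(-sample, names_to = "gene") %>%
  group_by(gene) %>%
  summarise(
    mean = mean(value),
    sd   = sd(value),
    min  = min(value),
    max  = max(value)
  )

gene_summary
```

Replacing toy_dat with the selected columns of the real data would give one row per gene, which makes differences in location and spread between genes easier to compare than by eye alone.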