Introduction

In this practical, we will create a feed-forward neural network as well as a convolutional neural network to analyze the famous MNIST dataset.

library(tidyverse)
library(keras)

Take-home exercises: deep feed-forward neural network

In this section, we will develop a deep feed-forward neural network for MNIST.

Data preparation

1. Load the built-in MNIST dataset by running the following code. Then, describe the structure and contents of the mnist object.

mnist <- dataset_mnist()
str(mnist)
## List of 2
##  $ train:List of 2
##   ..$ x: int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:60000(1d)] 5 0 4 1 9 2 1 3 1 4 ...
##  $ test :List of 2
##   ..$ x: int [1:10000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ y: int [1:10000(1d)] 7 2 1 0 4 1 4 9 5 9 ...
# mnist is a list of 2 objects:
# - train (N=60000)
# - test (N=10000)

range(mnist$train$x)
## [1]   0 255
table(mnist$train$y)
## 
##    0    1    2    3    4    5    6    7    8    9 
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
# in each object, there is an X with integers between 0-255 in a 28*28 matrix for each observation
# and a y which is a single integer (0-9) labelling each matrix

Plotting is very important when working with image data. We have defined a convenient plotting function for you.

2. Use the plot_img() function below to plot the first training image. The img parameter has to be a matrix with dimensions (28, 28). NB: indexing in 3-dimensional arrays works the same as indexing in matrices, but you need an extra comma x[,,].

plot_img <- function(img, col = gray.colors(255, start = 1, end = 0), ...) {
  image(t(img), asp = 1, ylim = c(1.1, -0.1), col = col, bty = "n", axes = FALSE, ...)
}
# get first training image (observation number is in first dimension)
im <- mnist$train$x[1,,]
plot_img(im)
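
Since plot_img() passes any extra arguments on to image(), you can optionally add the corresponding label as a plot title to check that image and label match. A small optional check (not required for the exercise):

# optional: show the corresponding label (here a 5) as the plot title
plot_img(im, main = mnist$train$y[1])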

It is usually a good idea to normalize your features to a manageable, standard range before feeding them into a neural network.

3. As a preprocessing step, ensure the brightness values of the images in the training and test set are in the range [0, 1].

# remember that the range of training image values is 0 to 255
mnist$train$x <- mnist$train$x / 255
mnist$test$x <- mnist$test$x / 255
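
A quick sanity check confirms the rescaling worked:

# brightness values should now lie between 0 and 1
range(mnist$train$x)
## [1] 0 1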

Multinomial logistic regression

The simplest model is a multinomial logistic regression model, where we have no hidden layers and 10 outputs (one for each of the digits 0-9). That model is shown below.

4. Display a summary of the multinomial model using the summary() function. Describe why this model has 7850 parameters.

multinom  <- 
  # initialize a sequential model
  keras_model_sequential(input_shape = c(28, 28)) %>% 
  # flatten 28*28 matrix into single vector
  layer_flatten() %>%
  # softmax outcome == probability for each of 10 outputs
  layer_dense(10, activation = "softmax")

multinom$compile(
  loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
  optimizer = "adam", # we use this optimizer because it works well
  metrics = list("accuracy") # we want to know training accuracy in the end
)
summary(multinom)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## flatten (Flatten)                   (None, 784)                     0           
## ________________________________________________________________________________
## dense (Dense)                       (None, 10)                      7850        
## ================================================================================
## Total params: 7,850
## Trainable params: 7,850
## Non-trainable params: 0
## ________________________________________________________________________________
# parameters: 784*10 weights + 10 biases at output nodes = 7850
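
This count can also be verified with a small arithmetic check (not part of the exercise):

# 28*28 = 784 inputs, each connected to 10 output units, plus one bias per output unit
28 * 28 * 10 + 10
## [1] 7850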

5. Train the model for 5 epochs using the code below. What accuracy do we obtain in the validation set? (NB: the multinom object is changed “in-place”, which means you don’t have to assign it to another variable)

multinom %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 5, validation_split = 0.2, verbose = 1)
# about 92.69% accuracy on the validation set

6. Train the model for another 5 epochs. What accuracy do we obtain in the validation set?

multinom %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 5, validation_split = 0.2, verbose = 1)
# about 92.98% accuracy on the validation set. Slightly improved but not much!
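
If you want to see how the accuracy develops over epochs, note that fit() returns a history object which you can store and plot. For example (this trains for another 5 epochs):

history <- multinom %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 5, validation_split = 0.2, verbose = 1)
plot(history) # shows loss and accuracy per epoch for training and validation data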

Deep feed-forward neural networks

7. Create and compile a feed-forward neural network with the following properties. Ensure that the model has 50890 parameters.

  • sequential model
  • flatten layer
  • dense layer with 64 hidden units and “relu” activation function
  • dense output layer with 10 units and softmax activation function

You may reuse code from the multinomial model.

ffnn  <- 
  # initialize a sequential model
  keras_model_sequential(input_shape = c(28, 28)) %>% 
  # flatten 28*28 matrix into single vector
  layer_flatten() %>%
  # this is the hidden layer!
  layer_dense(64, activation = "relu") %>%
  # softmax outcome == probability for each of 10 outputs
  layer_dense(10, activation = "softmax")

ffnn$compile(
  loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
  optimizer = "adam", # we use this optimizer because it works well
  metrics = list("accuracy") # we want to know training accuracy in the end
)

summary(ffnn)
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## flatten_1 (Flatten)                 (None, 784)                     0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 64)                      50240       
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 10)                      650         
## ================================================================================
## Total params: 50,890
## Trainable params: 50,890
## Non-trainable params: 0
## ________________________________________________________________________________
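
As before, the parameter count can be verified by hand:

# hidden layer: 784*64 weights + 64 biases; output layer: 64*10 weights + 10 biases
784 * 64 + 64 + 64 * 10 + 10
## [1] 50890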

8. Train the model for 10 epochs. What do you see in terms of validation accuracy, also compared to the multinomial model?

ffnn %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
# in my case, higher accuracy! 97.28%

9. Create predictions for the test data using the two trained models (using the function below). Create a confusion matrix and compute test accuracy for these two models. Write down any observations you have.

# predicted class = position of the highest predicted probability, minus 1 (labels run from 0 to 9)
class_predict <- function(model, x) predict(model, x = x) %>% apply(1, which.max) - 1
pred_multinom <- class_predict(multinom, x = mnist$test$x)
pred_ffnn     <- class_predict(ffnn, x = mnist$test$x)

(ctab_multinom <- table(pred = pred_multinom, true = mnist$test$y))
##     true
## pred    0    1    2    3    4    5    6    7    8    9
##    0  964    0    6    3    2   10   14    1   11   11
##    1    0 1110    8    0    1    3    3    6    4    5
##    2    1    4  932   20    6    4    8   24    7    1
##    3    2    1   12  912    1   22    1    6   16   10
##    4    0    0    8    0  915    6    7    8    9   23
##    5    6    1    5   31    0  795   14    1   23    9
##    6    4    4   11    2   10   12  908    0    7    0
##    7    2    2    8   10    4    5    1  941    6   14
##    8    1   13   39   24   11   29    2    2  882    9
##    9    0    0    3    8   32    6    0   39    9  927
(ctab_ffnn     <- table(pred = pred_ffnn, true = mnist$test$y))
##     true
## pred    0    1    2    3    4    5    6    7    8    9
##    0  969    0    5    0    1    2    5    1    3    4
##    1    0 1121    1    1    0    0    3    0    1    3
##    2    2    4  996    3    3    0    1    5    8    1
##    3    1    4    5  987    1    9    2    7   20    6
##    4    2    0    2    0  935    1    5    0    4    3
##    5    1    1    1    4    2  868    9    0   14    6
##    6    1    2    3    0    5    5  932    0    2    1
##    7    1    1   15    7   10    1    0 1010   11   10
##    8    2    2    3    0    2    5    1    0  910    0
##    9    1    0    1    8   23    1    0    5    1  975
sum(diag(ctab_multinom)) / sum(ctab_multinom)
## [1] 0.9286
sum(diag(ctab_ffnn)) / sum(ctab_ffnn)
## [1] 0.9703
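
Besides overall accuracy, the confusion matrices also allow a per-digit comparison. A short sketch (the column sums are the true counts per digit):

# per-digit accuracy for both models
diag(ctab_multinom) / colSums(ctab_multinom)
diag(ctab_ffnn) / colSums(ctab_ffnn)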

10. OPTIONAL: if you have time, create and estimate (10 epochs) a deep feed-forward model with the following properties. Compare this model to the previous models on the test data.

  • sequential model
  • flatten layer
  • dense layer with 128 hidden units and “relu” activation function
  • dense layer with 64 hidden units and “relu” activation function
  • dense output layer with 10 units and softmax activation function
dffnn  <- 
  keras_model_sequential(input_shape = c(28, 28)) %>% # initialize a sequential model
  layer_flatten() %>% # flatten 28*28 matrix into single vector
  layer_dense(128, activation = "relu") %>% # this is the hidden layer!
  layer_dense(64, activation = "relu") %>% # this is the hidden layer!
  layer_dense(10, activation = "softmax") # softmax outcome == probability for each of 10 outputs

dffnn$compile(
  loss = "sparse_categorical_crossentropy", # loss function for multinomial outcome
  optimizer = "adam", # we use this optimizer because it works well
  metrics = list("accuracy") # we want to know training accuracy in the end
)

summary(dffnn)
## Model: "sequential_2"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## flatten_2 (Flatten)                 (None, 784)                     0           
## ________________________________________________________________________________
## dense_5 (Dense)                     (None, 128)                     100480      
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 64)                      8256        
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 10)                      650         
## ================================================================================
## Total params: 109,386
## Trainable params: 109,386
## Non-trainable params: 0
## ________________________________________________________________________________
dffnn %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)

pred_dffnn <- class_predict(dffnn, x = mnist$test$x)
(ctab_dffnn <- table(pred = pred_dffnn, true = mnist$test$y))
##     true
## pred    0    1    2    3    4    5    6    7    8    9
##    0  956    0    3    1    1    1    2    0    0    1
##    1    0 1127    4    0    1    0    2    3    0    2
##    2    2    2 1009    4    3    0    0   13    3    0
##    3    1    1    3  973    1    3    1    5    3    1
##    4    0    0    3    0  950    0    5    2    1    6
##    5    7    0    0   14    1  880   14    0   15    5
##    6    6    0    3    0    6    2  931    0    0    0
##    7    0    0    3    4    2    0    0  988    3    1
##    8    5    4    4    8    2    2    3    4  945    2
##    9    3    1    0    6   15    4    0   13    4  991
sum(diag(ctab_dffnn)) / sum(ctab_dffnn)
## [1] 0.975
# this one does even better!

Lab exercises: convolutional neural network

Convolution layers in Keras need a specific form of data input: for each example, a (width, height, channels) array (tensor). For a 28*28 colour image, that shape is usually (28, 28, 3), where the three channels indicate red, green, and blue. MNIST has no colour information, but we still need the channel dimension to enter the data into a convolution layer, so each image gets shape (28, 28, 1). The training data mnist$train$x should thus have shape (60000, 28, 28, 1).

11. Add a "channel" dimension to the training and test data using the following code. Plot an image using the first channel of the 314th training example (this is a 9).

# add channel dimension to input (required for convolution layers)
dim(mnist$train$x) <- c(dim(mnist$train$x), 1)
dim(mnist$test$x)  <- c(dim(mnist$test$x), 1)
plot_img(mnist$train$x[314,,,1])
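
A quick check that the reshaping worked as intended:

# the training data should now be a 4D array: (observations, width, height, channels)
dim(mnist$train$x)
## [1] 60000    28    28     1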

12. Create and compile a convolutional neural network using the following code. Describe the different layers in your own words.

cnn <- 
  keras_model_sequential(input_shape = c(28, 28, 1)) %>% 
  layer_conv_2d(filters = 6, kernel_size = c(5, 5)) %>% 
  layer_max_pooling_2d(pool_size = c(4, 4)) %>%
  layer_flatten() %>% 
  layer_dense(units = 32, activation = "relu") %>% 
  layer_dense(10, activation = "softmax")

cnn %>% 
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam", 
    metrics = c("accuracy")
  )
summary(cnn)
## Model: "sequential_3"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## conv2d (Conv2D)                     (None, 24, 24, 6)               156         
## ________________________________________________________________________________
## max_pooling2d (MaxPooling2D)        (None, 6, 6, 6)                 0           
## ________________________________________________________________________________
## flatten_3 (Flatten)                 (None, 216)                     0           
## ________________________________________________________________________________
## dense_7 (Dense)                     (None, 32)                      6944        
## ________________________________________________________________________________
## dense_6 (Dense)                     (None, 10)                      330         
## ================================================================================
## Total params: 7,430
## Trainable params: 7,430
## Non-trainable params: 0
## ________________________________________________________________________________
# First, there is the input layer, which receives the images with their single channel: shape (28, 28, 1)
# then, there is a 2d convolution layer with 6 filters and a kernel size of 5 (in each direction)
# then, we max-pool the resulting 6 feature maps, reducing their size by a factor of 4 in each direction
# afterwards, we flatten
# then comes a dense hidden layer with 32 units and a relu activation function
# lastly, the output layer is the same as before
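
The output shapes and parameter counts in this summary can be reconstructed by hand. A small arithmetic check (assuming the Keras defaults of "valid" padding and stride 1):

28 - 5 + 1          # conv output width/height: 5x5 kernel, no padding
## [1] 24
(5 * 5 * 1 + 1) * 6 # conv parameters: 5*5*1 kernel weights + 1 bias, for each of 6 filters
## [1] 156
24 %/% 4            # max pooling with pool size 4: 24x24 maps become 6x6
## [1] 6
6 * 6 * 6           # flattened length: 6x6 values for each of the 6 feature maps
## [1] 216
216 * 32 + 32       # dense hidden layer parameters
## [1] 6944
32 * 10 + 10        # output layer parameters
## [1] 330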

13. Fit this model on the training data (10 epochs) and compare it to the previous models.

cnn %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)

pred_cnn <- class_predict(cnn, x = mnist$test$x)
(ctab_cnn <- table(pred = pred_cnn, true = mnist$test$y))
##     true
## pred    0    1    2    3    4    5    6    7    8    9
##    0  970    0    1    0    0    1    4    1    2    0
##    1    0 1133    6    0    2    0    3    5    1    5
##    2    3    1 1001    2    1    0    0    7    1    0
##    3    0    0    5 1001    0    7    0    1    2    0
##    4    0    0    2    0  968    0    3    2    0    6
##    5    0    1    0    2    0  874    1    1    1    1
##    6    3    0    0    0    0    3  944    0    0    0
##    7    1    0    9    1    0    0    0 1004    1    6
##    8    1    0    8    3    0    5    3    2  961    3
##    9    2    0    0    1   11    2    0    5    5  988
sum(diag(ctab_cnn)) / sum(ctab_cnn)
## [1] 0.9844
# performs a bit better than the deep feed-forward model (mind you - with way fewer parameters!)
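
To back this comparison up, you can ask the models for their parameter counts directly, for example via the underlying Keras method:

# number of parameters per model (also visible in the summaries above)
cnn$count_params()
dffnn$count_params()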

14. Create another CNN which has better performance within 10 epochs. Compare your validation or test accuracy to that of your peers.

Here are some things you could do:

  • Reduce the convolution filter size & the pooling size and add a second convolutional & pooling layer with double the number of filters
  • Add a dropout layer after the flatten layer
  • Look up on the internet what works well and implement it!
cnn_2 <- 
  keras_model_sequential(input_shape = c(28, 28, 1)) %>% 
  layer_conv_2d(filters = 6, kernel_size = c(3, 3)) %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 12, kernel_size = c(3, 3)) %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>% 
  layer_dropout(rate = 0.2) %>% 
  layer_dense(units = 32, activation = "relu") %>% 
  layer_dense(10, activation = "softmax")

cnn_2 %>% 
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam", 
    metrics = c("accuracy")
  )

cnn_2 %>% fit(x = mnist$train$x, y = mnist$train$y, epochs = 10, validation_split = 0.2, verbose = 1)
pred_cnn_2 <- class_predict(cnn_2, x = mnist$test$x)
(ctab_cnn_2 <- table(pred = pred_cnn_2, true = mnist$test$y))
##     true
## pred    0    1    2    3    4    5    6    7    8    9
##    0  966    0    1    0    0    0    5    1    1    0
##    1    0 1131    0    0    0    0    2    2    1    3
##    2    3    1 1020    1    0    1    1    8    2    0
##    3    0    2    1 1003    0   10    0    2    1    1
##    4    2    0    1    0  970    0    3    0    1    2
##    5    0    0    0    3    0  879    4    0    2    3
##    6    3    1    1    0    1    1  941    0    0    0
##    7    1    0    6    1    1    1    0 1013    5    6
##    8    3    0    2    2    1    0    2    0  955    1
##    9    2    0    0    0    9    0    0    2    6  993
sum(diag(ctab_cnn_2)) / sum(ctab_cnn_2)
## [1] 0.9871
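
Finally, it can be useful to collect the test accuracies of all models in one place. A short sketch using the confusion matrices computed above:

# overall test accuracy per model
acc <- function(ctab) sum(diag(ctab)) / sum(ctab)
sapply(list(multinom = ctab_multinom, ffnn = ctab_ffnn, dffnn = ctab_dffnn,
            cnn = ctab_cnn, cnn_2 = ctab_cnn_2), acc)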