Materials for the Applied Data Science profile course INFOMDA2: Battling the curse of dimensionality.

1 Partial least squares

1. Download the corn data and store it in your assignment folder.

2. Pick a property (Moisture, Oil, Starch, or Protein) to predict.

3. Split your data into a training (80%) and test (20%) set.

4. Use the function plsr from the package pls to estimate a partial least squares model that predicts the property from the NIR spectroscopy measurements in the training data. Make sure that the features are on the same scale. Use leave-one-out cross-validation (built into plsr, validation = "LOO") to estimate out-of-sample performance.

5. Find out which component best predicts the property you chose. Explain how you did this.

6. Create a plot with the wavelength on the x-axis and the strength of the loading for this component on the y-axis. Explain which wavelengths are most important for predicting the property you are interested in.

7. Pick the number of components to include in the model using the “one standard error” rule (selectNcomp() with method = "onesigma"). Create predictions for the test set using the resulting model.

8. Compare your PLS predictions to those of a LASSO linear regression model in which lambda is selected by cross-validation with the one standard error rule (lambda.1se in cv.glmnet).
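
The steps above can be sketched roughly as follows. This is a minimal illustration, not a model answer. It assumes the gasoline data that ships with pls (NIR spectra predicting octane) as a stand-in for the corn data, whose file format is not described here. All object names (train_idx, fit, best_comp, and so on) are hypothetical. Taking "the component that best predicts" to mean the one with the largest gain in training R2 is only one possible reading of step 5.

```r
library(pls)
library(glmnet)

set.seed(1)
data(gasoline)  # stand-in for the corn data (assumption)

# Step 3: 80/20 train/test split
n <- nrow(gasoline)
train_idx <- sample(n, size = round(0.8 * n))
train <- gasoline[train_idx, ]
test  <- gasoline[-train_idx, ]

# Step 4: PLS with scaled features and leave-one-out cross-validation
fit <- plsr(octane ~ NIR, ncomp = 10, data = train,
            scale = TRUE, validation = "LOO")

# Step 5 (one possible reading): component with the largest gain in training R2
cum_r2 <- drop(R2(fit, estimate = "train", intercept = FALSE)$val)
best_comp <- which.max(diff(c(0, cum_r2)))

# Step 6: loadings of that component against wavelength
# (the gasoline spectra run from 900 to 1700 nm)
wavelength <- seq(900, 1700, length.out = ncol(gasoline$NIR))
plot(wavelength, loadings(fit)[, best_comp], type = "l",
     xlab = "Wavelength (nm)", ylab = "Loading")

# Step 7: number of components by the one-sigma heuristic;
# selectNcomp() can return 0 (intercept only), so use at least 1 here
k <- max(1, selectNcomp(fit, method = "onesigma", plot = FALSE))
pls_pred <- drop(predict(fit, newdata = test, ncomp = k))

# Step 8: LASSO with lambda chosen by the one standard error rule
x_train <- unclass(train$NIR)
x_test  <- unclass(test$NIR)
cv_fit <- cv.glmnet(x_train, train$octane, alpha = 1)
lasso_pred <- drop(predict(cv_fit, newx = x_test, s = "lambda.1se"))

# Compare test-set root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_pls   <- rmse(test$octane, pls_pred)
rmse_lasso <- rmse(test$octane, lasso_pred)
c(PLS = rmse_pls, LASSO = rmse_lasso)
```

For the corn data, the response and the spectra matrix would need to be swapped in, and the wavelength axis adjusted to the range of that instrument.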

2 Hand-in format

A zipped folder with: