Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
1. Download the corn data and store it in your assignment folder.
2. Pick a property (Moisture, Oil, Starch, or Protein) to predict.
3. Split your data into a training (80%) and test (20%) set.
4. Use the function plsr from the package pls to estimate a partial least squares model, predicting the property using the NIR spectroscopy measurements in the training data. Make sure that the features are on the same scale. Use leave-one-out cross-validation (built into plsr) to estimate out-of-sample performance.
5. Find out which component best predicts the property you chose. Explain how you did this.
6. Create a plot with the wavelength on the x-axis and the strength of the loading for this component on the y-axis. Explain which wavelengths are most important for predicting the property you are interested in.
7. Pick the number of components included in the model based on the “one standard deviation” rule (selectNcomp()). Create predictions for the test set using the resulting model.
8. Compare your PLS predictions to a LASSO linear regression model where lambda is selected based on cross-validation with the one standard deviation rule (using cv.glmnet).
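The steps above can be sketched roughly as follows. This is a minimal illustration, not a model answer. It runs on simulated stand-in data rather than the corn data, so the object names (X, y, dat, wavelength), the wavelength grid, and the maximum of 10 components are all assumptions. Ranking components by the squared correlation between their scores and the outcome is only one possible reading of step 5.

```r
# Sketch only: simulated stand-in data, NOT the corn data set.
library(pls)
library(glmnet)

set.seed(45)

# Hypothetical stand-in for the spectra: 80 samples, 100 "wavelengths".
n <- 80
p <- 100
wavelength <- seq(1100, by = 2, length.out = p)  # assumed grid (nm)
latent <- rnorm(n)
X <- outer(latent, sin(seq(0, pi, length.out = p))) +
  matrix(rnorm(n * p, sd = 0.5), n, p)
y <- 2 * latent + rnorm(n, sd = 0.3)
dat <- data.frame(y = y, X = I(X))  # matrix column, as pls expects

# Step 3: 80% training, 20% test.
train_idx <- sample(n, size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Step 4: PLS with standardized features and leave-one-out CV.
fit <- plsr(y ~ X, data = train, ncomp = 10,
            scale = TRUE, validation = "LOO")

# Step 5 (one possible approach): squared correlation between
# each component's scores and the outcome in the training data.
sc <- unclass(scores(fit))
r2_per_comp <- as.numeric(cor(sc, train$y))^2
best_comp <- which.max(r2_per_comp)

# Step 6: loadings of that component across wavelengths.
plot(wavelength, loadings(fit)[, best_comp], type = "l",
     xlab = "Wavelength (nm)",
     ylab = paste("Loading, component", best_comp))

# Step 7: number of components via the "onesigma" heuristic.
# selectNcomp() can return 0; max(1, .) is a guard added for this sketch.
ncomp_sel <- max(1, selectNcomp(fit, method = "onesigma", plot = FALSE))
pls_pred <- drop(predict(fit, newdata = test, ncomp = ncomp_sel))

# Step 8: LASSO (alpha = 1) with lambda chosen by the 1-SE rule.
# glmnet standardizes the features by default.
cv_fit <- cv.glmnet(X[train_idx, ], y[train_idx], alpha = 1)
lasso_pred <- as.numeric(
  predict(cv_fit, newx = X[-train_idx, ], s = "lambda.1se")
)

# Compare test-set root mean squared error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(pls = rmse(test$y, pls_pred), lasso = rmse(test$y, lasso_pred))
```

With the real data, the simulation block would be replaced by loading the corn file and selecting the chosen property as the outcome. Note that selectNcomp() defaults to method = "randomization", so the "onesigma" method has to be requested explicitly.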
A zipped folder with:
- data/ subfolder
- .Rmd file with your answers and clean, commented code chunks
- .html or .pdf file from this .Rmd
The .Rmd should run without error upon unzipping!