Materials for Applied Data Science profile course INFOMDA2 *Battling the curse of dimensionality*.

The ever-growing influx of data allows us to develop, interpret and apply an increasing set of learning techniques. However, with this increase in data comes a challenge: how to make sense of the data and identify the components that really matter in our modeling efforts. This course gives a detailed and modern overview of statistical learning with a specific focus on high-dimensional data.

In this course we emphasize the tools that are useful in solving and interpreting modern-day analysis problems. Many of these tools are essential building blocks that are often encountered in statistical learning. We also consider the state of the art in handling machine learning problems. We will not only discuss the theoretical underpinnings of different techniques, but also focus on the skills and experience needed to rapidly apply these techniques to new problems.

During this course, participants will actively learn how to apply the main statistical methods in data analysis and how to use machine learning algorithms and visualization techniques, especially on high-dimensional data problems. The course is strongly practical: rather than concentrating on the mathematics and background of the techniques discussed, you will gain hands-on experience in using them on real data and interpreting the results.

The course INFOMDA1 (or equivalent) serves as a sufficient entry requirement for this course. For information about the contents of the INFOMDA1 course, refer to its course website.

At the end of this course, students are able to apply and interpret the theories, principles, methods and techniques related to contemporary data science and to understand and explain different approaches to data analysis:

- apply data visualization and dimension reduction techniques to high-dimensional data sets.
- implement, understand, and explain methods and techniques that are associated with advanced data modeling, including regularized regression, principal components, correspondence analysis, neural networks, clustering, time series, text mining and deep learning.
- evaluate the performance of these techniques with appropriate performance measures.
- select appropriate techniques to solve specific data science problems.
- motivate and explain the choice for techniques to investigate data problems.
- interpret and evaluate the results of (high-dimensional) data analyses and explain these techniques in simple terminology to a broad audience.
- understand and explain the principles of high-dimensional data analysis and visualization.
- construct appropriate visualizations for each data analysis technique in R.

Freely available sections from the following books:

- **ISLR**: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). *An introduction to statistical learning* (2nd ed.). Springer. statlearning.com
- **SLS**: Hastie, T., Tibshirani, R., & Wainwright, M. (2015). *Statistical learning with sparsity.* CRC Press. web.stanford.edu/~hastie/StatLearnSparsity
- **ESL**: Hastie, T., Tibshirani, R., & Friedman, J. (2001). *The Elements of Statistical Learning: Data Mining, Inference and Prediction.* New York: Springer Verlag. web.stanford.edu/~hastie/ElemStatLearn
- **R4DS**: Wickham, H., & Grolemund, G. (2016). *R for data science: import, tidy, transform, visualize, and model data.* O’Reilly Media, Inc. r4ds.had.co.nz
- **MBCC**: Bouveyron, C., Celeux, G., Murphy, T., & Raftery, A. (2019). *Model-based Clustering and Classification for Data Science: With Applications in R.* Cambridge University Press. cambridge.org
- **FPP3**: Hyndman, R. J., & Athanasopoulos, G. (2021). *Forecasting: Principles and Practice* (3rd ed.). OTexts. otexts.com/fpp3
- **TTMR**: Silge, J., & Robinson, D. (2021). *Text mining with R: A tidy approach.* O’Reilly Media, Inc. tidytextmining.com
- **SLP3**: Jurafsky, D., & Martin, J. H. (2021). *Speech and language processing* (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/
- Some freely available articles & chapters.

In this course, we will exclusively use R & RStudio for data analysis.
First, install the latest version of R for your system (see `https://cran.r-project.org/`).
Then, install the latest (desktop open source) version of the RStudio integrated development environment (`link`).

We will make extensive use of the `tidyverse` suite of packages, which can be installed from within `R` using the command `install.packages("tidyverse")`.
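The setup step above can be sketched as a small script that installs `tidyverse` only when it is missing, which avoids re-downloading it every time the script runs. This is a minimal sketch, not part of the official course materials: `missing_packages` is a hypothetical helper name, and restricting installation to interactive sessions is an assumption made here so that scripts never trigger downloads unexpectedly.

```r
# Return the subset of `pkgs` that is not installed yet.
# NOTE: `missing_packages` is a hypothetical helper, not a course-provided function.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages("tidyverse")

# Assumption: only download packages when working interactively (e.g. in RStudio).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Show which R version is in use, which is handy when asking for help.
cat("Running", R.version.string, "\n")
```

After installation, `library(tidyverse)` attaches the core packages for the current session.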

- There will be a lecture and a lab session each week. Both are in-person.
- The required readings should be read before the lecture. These are *not* optional.
- There are some take-home exercises to be done before each lab session; these will be discussed during the lab session.
- There are some additional exercises to be done during the lab session; the answers to these will be made available after the session.
- Hand-in of practicals and assignments is done on Blackboard.

- INFOMDA2 is a fully offline course, with in-person lectures and lab sessions.
- We find it important for interactive and collaborative learning that the course is offline, hence there is no Teams environment for this course.
- If you miss a session, e.g., due to sickness, you should catch up in the regular way:
  - Read the readings
  - Go through the lecture slides
  - Do the practicals
  - Ask your peers if you have questions
  - (after the above) ask the lecturer for further explanation

- Of course, we realize that a new corona wave may occur during the course.
  - If (and only if) this leads to considerable problems, we will reconsider the in-person course policy and adjust to a hybrid / online setting.

- To develop the necessary skills for completing the assignments and the exam, 9 `R` practicals must be completed and handed in. These exercises are not graded, but students must complete them to pass the course.
- There are two graded assignments. These each count for **5%** of your grade.
- **90%** of your grade will be determined by a final exam featuring both knowledge questions and practical data analysis skills in `R`. Some example questions will be made available in due time so you can prepare.
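As an illustration of how these weights combine, the sketch below computes a weighted final grade in `R`. The weights come from this syllabus; the scores are made up, and the 1-10 scale is an assumption. Rounding rules and pass thresholds are not specified here, so they are left out, and the requirement to hand in all practicals still applies separately.

```r
# Weights from the syllabus: two assignments at 5% each, final exam at 90%.
weights <- c(assignment_1 = 0.05, assignment_2 = 0.05, exam = 0.90)

# Hypothetical scores on an assumed 1-10 scale (made up for illustration only).
scores <- c(assignment_1 = 8.0, assignment_2 = 7.0, exam = 6.5)

# Weighted sum; indexing by name guards against a different element order.
final_grade <- sum(weights * scores[names(weights)])
print(final_grade)
```

Because the exam carries 90% of the weight, the assignments can shift the final grade only slightly in either direction.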

You can find the up-to-date class schedule with locations on mytimetable.uu.nl.

| Day | Date | Time | Location | Description |
|---|---|---|---|---|
| Wednesday | 16-11-2022 | 13:15 - 15:00 | BBG 223 | Lecture 1 |
| Friday | 18-11-2022 | 11:00 - 12:45 | BST LIE | Lab 1 |
| Wednesday | 23-11-2022 | 13:15 - 15:00 | BBG 223 | Lecture 2 |
| Friday | 25-11-2022 | 11:00 - 12:45 | BBG 209 | Lab 2 |
| Wednesday | 30-11-2022 | 13:15 - 15:00 | BBG 223 | Lecture 3 |
| Friday | 02-12-2022 | 11:00 | | Deadline A1 |
| Friday | 02-12-2022 | 11:00 - 12:45 | BBG 209 | Lab 3 |
| Wednesday | 07-12-2022 | 13:15 - 15:00 | BBG 223 | Lecture 4 |
| Friday | 09-12-2022 | 11:00 - 12:45 | BBG 209 | Lab 4 |
| Wednesday | 14-12-2022 | 13:15 - 15:00 | BBG 223 | Lecture 5 |
| Friday | 16-12-2022 | 11:00 - 12:45 | BBG 209 | Lab 5 |
| Wednesday | 21-12-2022 | 13:15 - 15:00 | BBG 203 | Lecture 6 |
| Friday | 23-12-2022 | 11:00 - 12:45 | BBG 209 | Lab 6 |
| Break | | | | |
| Wednesday | 11-01-2023 | 13:15 - 15:00 | BBG 223 | Lecture 7 |
| Friday | 13-01-2023 | 11:00 - 12:45 | BBG 209 | Lab 7 |
| Wednesday | 18-01-2023 | 13:15 - 15:00 | BBG 223 | Lecture 8 |
| Friday | 20-01-2023 | 11:00 | | Deadline A2 |
| Friday | 20-01-2023 | 11:00 - 12:45 | BBG 209 | Lab 8 |
| Wednesday | 25-01-2023 | 13:15 - 15:00 | BBG 223 | Lecture 9 |
| Friday | 27-01-2023 | 11:00 - 12:45 | BBG 209 | Lab 9 |
| Friday | 03-02-2023 | 14:00 - 16:00 | BBG 201 | Exam |
| Friday | 03-03-2023 | TBD | TBD | Resit |

- This syllabus
- ISLR section 6.2 shrinkage methods
- ISLR section 6.4 considerations in high dimensions
- SLS chapter 1
- SLS sections 2.1 - 2.4.2
- SLS chapter 4 until (not including) example 4.1.

- Review the tidyverse style guide

- ISLR sections 12.1-12.2 (pp. 497-510)
- ESL section 3.4 (pp. 66-67; skip subsections 3.4.2-3.4.4)
- ESL section 14.5 (pp. 534-536; skip subsections 14.5.2-14.5.5)
- ESL section 14.6 (pp. 553-554; skip subsection 14.6.1)
- ESL section 14.7 (pp. 557-570; skip subsection 14.7.3)

- ISLR section 6.3 dimension reduction methods
- Abdi, H. and Bera, M. (2018). *Correspondence analysis.* In R. Alhajj and J. Rokne (Eds.), Encyclopedia of Social Network Analysis and Mining (2nd Edition). New York: Springer Verlag. Freely available from the website.

Partial least squares (link).
Hand in on Blackboard **before** practical 3 (`02-12-2022 @ 11:00`).

- ISLR chapter 10, up to and including section 10.3.3.

- ISLR section 12.4
- Tan, Steinbach, Karpatne, & Kumar (2019). Introduction to Data Mining (second edition), section 7.5 (cluster evaluation), available here
- SLS sections 8.5.1 and 8.5.2.

- The remainder of Introduction to Data Mining Chapter 7 Cluster analysis

- Mixture models: latent profile and latent class analysis by Daniel Oberski (2016) link
- MBCC sections 2.1 and 2.2

- MBCC sections 2.3, 2.4, 2.8
- MBCC chapter 8 (not freely available)

- FPP3 sections 1.1, 1.7, 2.1 - 2.3, 2.8, 2.9, 3.2, 3.6, 5.1, 5.7, 5.8, 9.1, 9.3

- TTMR sections TBD
- SLP3 sections TBD

Comparing clustering methods (link).
Hand in on Blackboard **before** practical 8 (`20-01-2023 @ 11:00`).

- TTMR sections TBD
- SLP3 sections TBD

`03-02-2023 | 14:00 - 16:00`

Target date: `03-03-2023`, to be confirmed.