INFOMDA2

Materials for the Applied Data Science profile course INFOMDA2: Battling the curse of dimensionality.

1 Comparing cluster methods

In this assignment, you will design, implement, and critically compare two different clustering analyses for a single dataset.

1. Create an assignment folder with your assignment .Rmd file in the root and the following subdirectories: raw_data/, processed_data/.

2. Find a clustering dataset with 10-100 columns (attributes) in the UCI machine learning repository. Download the dataset into the raw_data/ subdirectory of your assignment folder. In one or two paragraphs, explain what the data is about. It’s easiest to use this link.

3. Preprocess the data into a tidy dataset (a data frame or tibble). This can include transforming variables (e.g., feet to meters), giving each variable the correct measurement level (character, factor, ordered factor, numeric), and selecting only the columns you need. Save the tidy dataset as an .rds file in the processed_data/ subdirectory. In one or two paragraphs, explain which features you chose.

4. Choose two different clustering methods. These can be any methods of your choice, including combinations of methods such as PCA + K-means. Describe these methods and explain why you chose them for this dataset.

5. Apply these methods to your dataset. Make sure to apply the knowledge you obtained in the clustering weeks.

6. Decide and describe how you will compare these methods on your dataset, and then implement this comparison.

7. Write a short conclusion where you critically compare the relative strengths and weaknesses of the methods you chose.
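
The steps above could be sketched in R roughly as follows. This is a minimal illustration under several assumptions, not a model answer: the built-in `mtcars` data stands in for a UCI dataset, the folder is created in a temporary directory, the file names (`mtcars.csv`, `tidy.rds`), the choice of k = 3, and the two methods (PCA + K-means versus complete-linkage hierarchical clustering) are all arbitrary, and average silhouette width is only one of several possible comparison criteria.

```r
# Hypothetical sketch: mtcars is a stand-in for a UCI dataset; all names are illustrative.
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(42)  # K-means uses random starts, so fix the seed for reproducibility

# Step 1: folder layout (placed in a temp dir here to avoid side effects)
root <- file.path(tempdir(), "assignment")
dir.create(file.path(root, "raw_data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "processed_data"), recursive = TRUE, showWarnings = FALSE)

# Step 2: stand-in for downloading the raw data
raw <- mtcars
write.csv(raw, file.path(root, "raw_data", "mtcars.csv"))

# Step 3: tidy the data (here: one measurement-level fix) and save as .rds
tidy <- raw
tidy$am <- factor(tidy$am, labels = c("automatic", "manual"))
rds_path <- file.path(root, "processed_data", "tidy.rds")
saveRDS(tidy, rds_path)

# Steps 4-5: cluster the standardised numeric columns with two methods
tidy <- readRDS(rds_path)
x <- scale(tidy[sapply(tidy, is.numeric)])
k <- 3  # assumed number of clusters; in practice this choice needs justification

# Method A: PCA followed by K-means on the first two principal components
pca <- prcomp(x)
km <- kmeans(pca$x[, 1:2], centers = k, nstart = 25)

# Method B: hierarchical clustering with complete linkage, cut into k clusters
d <- dist(x)
hc_labels <- cutree(hclust(d, method = "complete"), k = k)

# Step 6: compare both partitions on the same distance matrix
sil_km <- mean(silhouette(km$cluster, d)[, "sil_width"])
sil_hc <- mean(silhouette(hc_labels, d)[, "sil_width"])
agreement <- table(kmeans = km$cluster, hclust = hc_labels)

print(c(pca_kmeans = sil_km, hclust = sil_hc))
print(agreement)
```

Both silhouette widths are computed on the same distance matrix so the two methods are judged on a common scale; the cross-table shows how far the two partitions agree. A real analysis would also motivate the number of clusters and check how stable the solutions are.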

2 Hand-in format

A zipped folder with: