Materials for Applied Data Science profile course INFOMDA2 Battling the curse of dimensionality.
In this assignment, you will design, implement, and critically compare two different clustering analyses for a single dataset.
1. Create an assignment folder with your assignment .Rmd file in the root and the following subdirectories: raw_data/, processed_data/.
2. Find a clustering dataset with 10-100 columns (attributes) in the UCI machine learning repository. Download the dataset into the raw_data/ subdirectory of your assignment folder. In one or two paragraphs, explain what the data is about. It’s easiest to use this link.
3. Preprocess the data into a tidy dataset (a data frame or tibble). This can include things like transforming variables (e.g., feet to meters), giving each variable the correct measurement level (character, factor, ordered factor, numeric), and selecting only the columns you need. Save the tidy dataset as an .rds file in the processed_data/ subdirectory. In one or two paragraphs, explain which features you chose.
4. Choose two different clustering methods. This can be any method of your choice, even combinations of methods like PCA + K-means. Describe these methods and why you chose them for this dataset.
5. Apply these methods to your dataset. Make sure to apply the knowledge you obtained in the clustering weeks.
6. Decide and describe how you will compare these methods on your dataset, and then implement this comparison.
7. Write a short conclusion where you critically compare the relative strengths and weaknesses of the methods you chose.
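Steps 1 to 6 can be sketched end to end as follows. This is a minimal sketch in base R, not a model answer. Everything specific in it is an assumption made for illustration: the data are simulated rather than downloaded from the UCI repository, a temporary directory stands in for the assignment folder, the file names simulated.csv and tidy.rds are hypothetical, and the two methods (PCA + K-means and Ward hierarchical clustering) and the two comparison measures (adjusted Rand index and within-cluster sum of squares) are only one possible choice.

```r
set.seed(123)

# Step 1: folder layout. A temporary directory stands in for the assignment
# folder (assumption); in the assignment this is the folder holding the .Rmd.
root <- file.path(tempdir(), "assignment")
dir.create(file.path(root, "raw_data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "processed_data"), showWarnings = FALSE)

# Step 2 (stand-in): simulate 150 rows x 12 numeric columns in three groups.
# In the assignment this would be a downloaded UCI dataset instead.
raw <- do.call(rbind, lapply(c(-3, 0, 3), function(mu) {
  matrix(rnorm(50 * 12, mean = mu), ncol = 12)
}))
raw <- as.data.frame(raw)
names(raw) <- paste0("x", 1:12)
raw$id <- seq_len(nrow(raw))  # identifier column, not a feature
write.csv(raw, file.path(root, "raw_data", "simulated.csv"), row.names = FALSE)

# Step 3: read the raw file, keep only the feature columns, save as .rds.
dat <- read.csv(file.path(root, "raw_data", "simulated.csv"))
tidy <- dat[, setdiff(names(dat), "id")]
rds_path <- file.path(root, "processed_data", "tidy.rds")
saveRDS(tidy, rds_path)

# Steps 4-5: two clustering analyses on the standardised features.
X <- scale(readRDS(rds_path))

# Method A: K-means on the first two principal component scores.
pca <- prcomp(X)
km <- kmeans(pca$x[, 1:2], centers = 3, nstart = 25)

# Method B: Ward hierarchical clustering on all features, cut at 3 clusters.
hc <- hclust(dist(X), method = "ward.D2")
hc_labels <- cutree(hc, k = 3)

# Step 6: compare the two solutions.
# Adjusted Rand index: chance-corrected agreement between two partitions.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  sum_pairs <- function(x) sum(choose(x, 2))
  expected <- sum_pairs(rowSums(tab)) * sum_pairs(colSums(tab)) /
    choose(sum(tab), 2)
  maximum <- (sum_pairs(rowSums(tab)) + sum_pairs(colSums(tab))) / 2
  (sum_pairs(tab) - expected) / (maximum - expected)
}

# Total within-cluster sum of squares, computed in the same feature space
# for both label vectors so the numbers are comparable.
within_ss <- function(X, labels) {
  sum(sapply(split(as.data.frame(X), labels), function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

ari <- adjusted_rand(km$cluster, hc_labels)
wss <- c(pca_kmeans = within_ss(X, km$cluster),
         ward       = within_ss(X, hc_labels))

print(table(kmeans = km$cluster, ward = hc_labels))
print(ari)
print(wss)
```

The adjusted Rand index only says how much the two solutions agree, not which one is better, so it is paired here with an internal measure. On real data the number of clusters would also need to be chosen and justified rather than fixed at 3.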
A zipped folder with:
- Your raw data in a raw_data/ subfolder
- An .rds file in a processed_data/ subfolder
- An .Rmd file with your answers and clean, commented code chunks
- An .html or .pdf report from this .Rmd

The .Rmd should run without error upon unzipping!
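Before zipping, the folder layout listed above can be checked with a few lines of base R. This is a rough sketch: check_handin() is a hypothetical helper, not part of the course materials, and the demo folder and file names are made up so the example runs on its own.

```r
# Returns a named logical vector, one entry per expected hand-in component.
check_handin <- function(root) {
  has_file <- function(dir, pattern) {
    length(list.files(dir, pattern = pattern, ignore.case = TRUE)) > 0
  }
  c(
    raw_data       = has_file(file.path(root, "raw_data"), "."),
    processed_rds  = has_file(file.path(root, "processed_data"), "\\.rds$"),
    rmd_in_root    = has_file(root, "\\.Rmd$"),
    report_in_root = has_file(root, "\\.(html|pdf)$")
  )
}

# Demo on a throwaway folder with placeholder files (all names hypothetical).
demo <- file.path(tempdir(), "handin_demo")
dir.create(file.path(demo, "raw_data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo, "processed_data"), showWarnings = FALSE)
file.create(file.path(demo, "raw_data", "data.csv"))
saveRDS(data.frame(x = 1), file.path(demo, "processed_data", "tidy.rds"))
file.create(file.path(demo, "assignment.Rmd"),
            file.path(demo, "assignment.html"))

result <- check_handin(demo)
print(result)
```

This only checks that the files are present. Whether the .Rmd actually runs without error has to be checked separately, for example by knitting it in a fresh R session with rmarkdown::render() if the rmarkdown package is installed.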