Week 5 - Classification & Clustering Methods
Important
- Due date: Lab 5 is due Sunday, October 13, at 5pm ET
- Quiz 1 will be held in class on October 9 (approximately 30 minutes)
Prepare
📖 Read Introduction to Modern Statistics, Chapter 9: Logistic regression
📖 Read Classification with the Tidymodels Framework in R
📖 Check out Tidymodels Tutorial - Classification
📖 Look at Classification: ROC and AUC
Participate
Perform
Podcast
Study
Short Answer Questions
Instructions: Answer the following questions in 2-3 sentences each.
- What is the key difference between eager learners and lazy learners in classification?
- Provide an example of a situation where accuracy might be a misleading metric for evaluating a binary classifier.
- Explain how adjusting the classification threshold in a binary classifier affects the trade-off between sensitivity and specificity.
- What is the main assumption made by Naive Bayes classification, and why is it considered “naive”?
- How does the choice of ‘k’ affect the performance of a k-Nearest Neighbors (k-NN) classifier?
- What is the concept of a “margin” in Support Vector Machine (SVM) classification, and why is maximizing it desirable?
- Describe two common kernel functions used in SVM classification and their applications.
- What is the curse of dimensionality, and how does it impact k-NN classification?
- What is the fundamental difference between classification and clustering in machine learning?
- Briefly explain the steps involved in the k-means clustering algorithm.
Short Answer Key
- Eager learners (e.g., logistic regression, decision trees) learn a model from the training data before making predictions, while lazy learners (e.g., k-NN) memorize the training data and classify new instances based on similarity to stored instances.
- In a highly imbalanced dataset (e.g., fraud detection), a model that always predicts the majority class achieves high accuracy while never identifying the minority class, which is usually the class of greatest interest (see the accuracy sketch after this list).
- Lowering the classification threshold increases sensitivity (recall) but decreases specificity, yielding more true positives but also more false positives; raising the threshold has the opposite effect (see the threshold sketch after this list).
- Naive Bayes assumes that all features are conditionally independent given the class label. This assumption is rarely exactly true in real-world data, hence the term “naive” (see the Naive Bayes sketch after this list).
- A small ‘k’ makes k-NN sensitive to noise (overfitting), while a large ‘k’ oversmooths the decision boundary (underfitting). The optimal ‘k’ depends on the dataset and is typically chosen by cross-validation (see the k-NN tuning sketch after this list).
- The margin in SVM is the distance between the separating hyperplane and the nearest data points (support vectors) of each class. Maximizing the margin aims to improve the classifier’s generalization ability and reduce overfitting.
- The Radial Basis Function (RBF) kernel measures similarity as a decreasing function of the distance between points and is a common default for flexible, non-linear decision boundaries. The polynomial kernel implicitly maps the data to polynomial combinations of the features up to a chosen degree, capturing interactions between features (see the SVM sketch after this list).
- The curse of dimensionality refers to the phenomenon where distances between points become less meaningful as the number of dimensions increases: nearest and farthest neighbors end up at nearly the same distance, so k-NN struggles to find meaningful nearest neighbors in high-dimensional spaces (see the distance-ratio sketch after this list).
- Classification is a supervised learning task where the goal is to predict predefined labels for data points. Clustering is an unsupervised learning task where the goal is to discover natural groupings or structures in data without predefined labels.
- The k-means algorithm initializes ‘k’ centroids, often at random. It then iteratively assigns each data point to its nearest centroid and recomputes each centroid as the mean of the points assigned to it, repeating until convergence (the centroids no longer change significantly); see the k-means sketch after this list.
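The R sketches below illustrate several of the answers above. They are minimal illustrations under stated assumptions, not reference implementations. First, accuracy on imbalanced data: the labels here are simulated (the 2% “fraud” rate and class names are made up), and the “classifier” simply always predicts the majority class.

```r
library(yardstick)

# Simulated, hypothetical labels: roughly 2% "fraud", 98% "ok"
set.seed(42)
truth <- factor(sample(c("fraud", "ok"), 1000, replace = TRUE, prob = c(0.02, 0.98)),
                levels = c("fraud", "ok"))

# A "model" that always predicts the majority class
always_ok <- factor(rep("ok", 1000), levels = c("fraud", "ok"))

accuracy_vec(truth, always_ok)  # ~0.98, despite the model being useless
sens_vec(truth, always_ok)      # 0: not a single fraud case is detected
```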
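Second, the threshold trade-off, assuming a small hand-made tibble of hypothetical predicted probabilities for the positive class; sensitivity and specificity are recomputed at three thresholds.

```r
library(dplyr)
library(yardstick)

# Hypothetical predicted probabilities for the "pos" class
preds <- tibble(
  truth     = factor(c("pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"),
                     levels = c("pos", "neg")),
  .pred_pos = c(0.90, 0.70, 0.55, 0.40, 0.60, 0.45, 0.30, 0.20)
)

# Classify at a given threshold, then compute sensitivity and specificity
metrics_at <- function(thr, df) {
  est <- factor(ifelse(df$.pred_pos >= thr, "pos", "neg"), levels = c("pos", "neg"))
  tibble(threshold   = thr,
         sensitivity = sens_vec(df$truth, est),
         specificity = spec_vec(df$truth, est))
}

bind_rows(lapply(c(0.3, 0.5, 0.7), metrics_at, df = preds))
```

With these toy numbers, the 0.3 threshold pushes sensitivity to 1 while specificity falls to 0.25; at 0.7 the pattern reverses.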
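Third, a minimal Naive Bayes fit with tidymodels. naive_Bayes() comes from the discrim extension package, the engine additionally requires the naivebayes package, and two_class_dat is example data from modeldata (attached by tidymodels).

```r
library(tidymodels)
library(discrim)  # provides naive_Bayes()

nb_fit <- naive_Bayes() |>
  set_mode("classification") |>
  set_engine("naivebayes") |>  # requires the naivebayes package
  fit(Class ~ A + B, data = two_class_dat)

# Class probabilities come from per-feature conditional densities multiplied
# together under the conditional-independence ("naive") assumption
predict(nb_fit, new_data = head(two_class_dat), type = "prob")
```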
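Fourth, choosing ‘k’ by cross-validation with tidymodels; the grid values are arbitrary, and the engine requires the kknn package.

```r
library(tidymodels)

set.seed(501)
folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

knn_spec <- nearest_neighbor(neighbors = tune()) |>
  set_mode("classification") |>
  set_engine("kknn")  # requires the kknn package

knn_res <- workflow() |>
  add_formula(Class ~ A + B) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds,
            grid = tibble(neighbors = c(1, 5, 15, 31, 61)),
            metrics = metric_set(roc_auc))

show_best(knn_res, metric = "roc_auc")  # small k overfits; very large k oversmooths
```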
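Fifth, an SVM with an RBF kernel via the kernlab engine; the cost value is arbitrary, and svm_poly() would swap in a polynomial kernel.

```r
library(tidymodels)

# RBF kernel for a flexible non-linear boundary; svm_poly(degree = 2)
# would use a polynomial kernel instead
svm_fit <- svm_rbf(cost = 1) |>
  set_mode("classification") |>
  set_engine("kernlab") |>  # requires the kernlab package
  fit(Class ~ A + B, data = two_class_dat)

predict(svm_fit, new_data = head(two_class_dat))
```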
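Sixth, a base-R demonstration of the curse of dimensionality: as the number of dimensions grows, the nearest and farthest neighbors of a point sit at nearly the same distance, so the nearest-to-farthest distance ratio approaches 1.

```r
set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  X <- matrix(runif(100 * d), ncol = d)  # 100 random points in d dimensions
  dists <- as.matrix(dist(X))[1, -1]     # distances from point 1 to the rest
  cat(sprintf("d = %4d   nearest/farthest distance ratio: %.3f\n",
              d, min(dists) / max(dists)))
}
```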
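Finally, k-means in base R on two standardized iris measurements; centers = 3 and nstart = 25 are illustrative choices, not prescriptions.

```r
set.seed(123)
dat <- scale(iris[, c("Petal.Length", "Petal.Width")])  # standardize features first

# nstart = 25: rerun with 25 random initializations and keep the best solution
km <- kmeans(dat, centers = 3, nstart = 25)

km$centers                       # final centroids after convergence
table(km$cluster, iris$Species)  # clusters recover the species reasonably well
```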
Essay Questions
- Compare and contrast eager learning and lazy learning algorithms for classification. Discuss their strengths, weaknesses, and situations where one might be preferred over the other.
- Explain the concepts of precision and recall in the context of binary classification. Discuss how these metrics are affected by changes in the classification threshold and provide examples of applications where each metric might be prioritized.
- Describe the Naive Bayes classification algorithm in detail, including its underlying assumptions and the calculation of conditional probabilities. Discuss its advantages, limitations, and potential applications.
- Explain the workings of Support Vector Machines (SVMs) for classification. Describe the concepts of support vectors, margins, and kernel functions. Compare and contrast linear, polynomial, and radial basis function (RBF) kernels in SVM.
Back to course schedule