Machine Learning: Exam Notes
1. Introduction
Machine Learning (ML) enables computers to learn patterns from data without being explicitly programmed. The term was coined by Arthur Samuel in 1959: "Field of study that gives computers the capability to learn without being explicitly programmed."
ML vs. Traditional Programming:
• Traditional: Input + Program Logic →
Output
• ML: Input + Output → Model (logic
inferred during training), then Prediction on new inputs
2. Key Terminology
• Model (Hypothesis): Representation
learned from data via an ML algorithm.
• Feature: Measurable property of data
(e.g., color, size), represented as feature vectors.
• Target (Label): The value to predict
(e.g., fruit type).
• Training: Learning model parameters from
labeled data.
• Prediction: Applying the trained model to
new inputs.
3. Types of Learning
1. Supervised Learning: Learns from labeled
data. Tasks: Classification, Regression.
2. Unsupervised Learning: Discovers
patterns in unlabeled data. Tasks: Clustering, Association.
3. Semi‑Supervised Learning: Combines a small labeled dataset with a large unlabeled one.
4. Supervised Learning Algorithms
4.1 Classification
• Logistic Regression: Maps a linear combination of features to a probability via the sigmoid function; trained by minimizing cross‑entropy loss (see the sketch after this list).
• k‑Nearest Neighbors (kNN): Classifies
based on majority vote of k nearest samples.
• Support Vector Machine (SVM): Finds
hyperplane maximizing margin; kernel trick for non-linear separation.
• Naive Bayes: Probabilistic model with
feature independence assumption (variants: Gaussian, Multinomial, Bernoulli).
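A minimal sketch comparing the four classifiers above, assuming scikit-learn is available; the synthetic dataset and parameter values are illustrative, not prescribed:

# Compare the four classifiers on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# kNN and SVM are distance/margin based, so features are scaled first.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # sigmoid + cross-entropy loss
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),          # majority vote of neighbors
    "SVM (RBF kernel)": SVC(kernel="rbf"),                     # max-margin + kernel trick
    "Gaussian Naive Bayes": GaussianNB(),                      # feature-independence assumption
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))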
4.2 Regression
• Linear Regression: Models a linear relationship between features and target by minimizing the sum of squared errors (see the sketch after this list).
• Multiple Linear Regression: Extends to
multiple features.
• Polynomial Regression: Models nonlinear
relationships via feature expansion.
• Decision Tree Regression: Recursive
partitioning; prone to overfitting.
• Random Forest Regression: Ensemble of
trees via bagging; reduces variance.
• XGBoost: Gradient-boosted decision trees with built-in regularization, optimized for speed and scalability.
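A minimal sketch fitting several of the regressors above on a synthetic non-linear problem, assuming scikit-learn; the data and settings are illustrative:

# Fit several regression models on a synthetic 1-D problem (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # non-linear target with noise

models = {
    "Linear Regression": LinearRegression(),                          # ordinary least squares
    "Polynomial (deg 2)": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "Decision Tree": DecisionTreeRegressor(max_depth=4),              # recursive partitioning
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),  # bagged trees
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "training R^2:", round(model.score(X, y), 3))
# xgboost.XGBRegressor could be added the same way if the xgboost package is installed.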
5. Unsupervised Learning Algorithms
5.1 Clustering
• k-Means: Partitions data into k clusters
by minimizing within-cluster sum of squares.
• Hierarchical Clustering: Agglomerative or
divisive merging/splitting; various linkage criteria.
• DBSCAN: Density-based clustering;
parameters ε (radius) and MinPts; identifies core, border, and noise points.
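A minimal sketch running the three clustering methods above on a toy dataset, assuming scikit-learn; the parameter values are illustrative:

# Cluster a two-moons dataset with k-Means, hierarchical clustering, and DBSCAN.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
agglo = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = radius, min_samples = MinPts

# DBSCAN marks noise points with label -1; the other two assign every point to a cluster.
print("k-Means labels:", set(kmeans.labels_))
print("Agglomerative labels:", set(agglo.labels_))
print("DBSCAN labels:", set(dbscan.labels_))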
6. Model Evaluation & Optimization
• Cross‑Validation (k‑Fold): Splits data into k folds; trains on k−1 folds, validates on the held-out fold, and averages performance across the k runs.
• Hyperparameter Tuning: Grid Search exhaustively evaluates every combination in a parameter grid using CV and selects the best-performing one.
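A minimal sketch combining k-fold cross-validation and grid search, assuming scikit-learn; the estimator and parameter grid are illustrative:

# k-fold cross-validation and grid search over hyperparameters (illustrative values).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# 5-fold CV: train on 4 folds, validate on the held-out fold, average the scores.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=cv, scoring="r2")
print("Mean CV R^2:", round(scores.mean(), 3))

# Grid search evaluates every parameter combination with the same CV scheme.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=cv, scoring="r2")
search.fit(X, y)
print("Best parameters:", search.best_params_)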
7. Sample Exam Questions & Answers
7.1 5‑Mark Questions
Q1. Define Machine Learning and
differentiate it from traditional programming.
A. ML enables computers to learn patterns
from data without explicit programming logic, unlike traditional programming
where logic must be manually coded.
Q2. List and briefly describe three types
of learning in ML.
A. 1. Supervised Learning: trains on
labeled data (classification/regression).
2. Unsupervised Learning: discovers structure in unlabeled data
(clustering/association).
3. Semi‑Supervised Learning: uses both labeled and unlabeled data.
Q3. What is overfitting, and how does
Random Forest mitigate it?
A. Overfitting occurs when a model captures noise in the training data, reducing its ability to generalize to new data. Random Forest mitigates overfitting by averaging many decision trees, each trained on a bootstrapped sample (and a random subset of features at each split), which reduces variance.
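A minimal sketch of this variance-reduction effect, assuming scikit-learn; the dataset and model settings are illustrative:

# Compare an unpruned decision tree with a random forest on noisy data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)  # unpruned: fits the noise
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# A large train/test gap indicates overfitting; averaging bootstrapped trees shrinks it.
print("Tree   train/test R^2:", round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
print("Forest train/test R^2:", round(forest.score(X_tr, y_tr), 3), round(forest.score(X_te, y_te), 3))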
7.2 10‑Mark Questions
Q1. Explain the working of k‑Nearest
Neighbors algorithm. What are its advantages and disadvantages?
A. The kNN algorithm classifies a sample based on the majority class among its k nearest neighbors, found with a distance metric such as Euclidean distance. Advantages: simple, no explicit training phase (lazy learner). Disadvantages: high computational cost at prediction time on large datasets, sensitive to noise and irrelevant features, requires feature scaling.
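A from-scratch sketch of the distance-and-vote step, using only NumPy; the toy data and the helper name knn_predict are illustrative:

# Classify a query point by majority vote among its k nearest training samples.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the query point to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected class: 0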
Q2. Describe the concept of kernel trick in
SVMs. Give examples of common kernels.
A. The kernel trick lets an SVM separate data that is not linearly separable in its original space: samples are implicitly mapped into a higher-dimensional space by computing inner products there through a kernel function, without ever constructing the mapping explicitly. Common kernels: Linear, Polynomial, Radial Basis Function (RBF), Sigmoid.
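A minimal sketch comparing these kernels on a non-linearly separable dataset, assuming scikit-learn; the data and settings are illustrative:

# Train an SVM with each common kernel on the two-moons dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", round(clf.score(X, y), 3))
# The RBF kernel typically separates the two moons well; the linear kernel cannot.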
7.3 20‑Mark Question
Q1. Discuss the steps involved in building,
evaluating, and tuning a supervised ML model for a regression problem.
Illustrate with a pipeline from data preprocessing to hyperparameter
optimization.
In-depth Answer: Regression Model Pipeline
1. Problem Definition & Data Collection
Begin by defining the regression objective
(e.g., predict CO₂ emissions) and gather a dataset with features and target.
2. Data Preprocessing
• Handle missing values: impute or remove.
• Encode categorical features: one-hot encoding.
• Scale features: standardize or normalize.
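A sketch of how these preprocessing steps can be combined into a single transformer, assuming scikit-learn; the column names (engine_size, cylinders, fuel_type) are hypothetical, tied to the CO₂ illustration above:

# Impute, encode, and scale in one step via a ColumnTransformer (column names are hypothetical).
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["engine_size", "cylinders"]  # illustrative names only
categorical_features = ["fuel_type"]

numeric_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                         ("scale", StandardScaler())])
categorical_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric_pipe, numeric_features),
                                ("cat", categorical_pipe, categorical_features)])
# preprocess.fit_transform(df) would impute, encode, and scale a DataFrame df with these columns.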
3. Feature Engineering & Selection
• Polynomial Features: expand for
non-linear relationships.
• Feature Importance: use tree-based models to drop uninformative features.
4. Train/Test Split
• Hold-out test set (20–30%) for final
evaluation.
• Use remaining data for training and validation.
5. Model Training
Train candidate models: Linear Regression,
Decision Tree, Random Forest, XGBoost.
6. Model Evaluation
Use MSE, MAE, R² on validation data.
7. Cross-Validation
Apply k-fold CV: split data into k folds,
train on k-1 and validate, repeat, average.
8. Hyperparameter Tuning
Use GridSearchCV to search combinations
(e.g., n_estimators, max_depth) via CV to find the best model.
9. Final Testing & Deployment
Evaluate tuned model on the held-out test
set, deploy, and monitor for drift.
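A compact end-to-end sketch of steps 4–9, assuming scikit-learn; the synthetic dataset, chosen estimator, and parameter grid are illustrative:

# Split, train, cross-validate, tune, and test a regression model (illustrative only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

X, y = make_regression(n_samples=500, n_features=10, noise=15.0, random_state=0)

# Step 4: hold out a test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 5-8: train and tune with cross-validation on the training data only.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Step 9: evaluate the tuned model once on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
print("MSE:", round(mean_squared_error(y_test, y_pred), 2))
print("MAE:", round(mean_absolute_error(y_test, y_pred), 2))
print("R^2:", round(r2_score(y_test, y_pred), 3))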