Chapter 1 — Deep Learning Landscape & Motivation
Unit I · Introduction to Deep Learning
1. Introduction to Deep Learning
Deep Learning (DL) is a sub-field of Machine Learning (ML), which is itself a sub-field of Artificial Intelligence (AI). The key idea is representation learning: instead of hand-crafting features, a deep neural network automatically discovers useful hierarchical representations directly from raw data.
Three factors drove the DL revolution after 2012: (1) large labelled datasets (ImageNet, Common Crawl), (2) GPU-accelerated computing, and (3) algorithmic advances (ReLU, dropout, batch normalization, residual connections).
Figure 1.1 — The AI ⊃ ML ⊃ DL ⊃ GenAI nested hierarchy
- ML: learns patterns from data to approximate y = f(x) by minimising a loss.
- DL: ML with neural networks (many layers) that learn representations + decisions end-to-end.
- GenAI: DL models that generate new content (text / images / code) instead of only predicting labels — LLMs, diffusion models, GANs.
Learning paradigms within ML: Supervised (labelled pairs \((x, y)\) — classification, regression), Unsupervised (unlabelled data — clustering, dimensionality reduction), Reinforcement Learning (agent learns via reward signals from environment).
Exam-ready points
- Deep learning = learning representations through many layers of non-linear transformations.
- AlexNet (2012, Krizhevsky et al.) winning ImageNet by a large margin is considered the watershed moment.
- Key formula: output of layer \(\ell\): \(\mathbf{a}^{(\ell)} = \sigma\!\left(W^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right)\)
- GenAI extends DL by training models to generate: LLMs (next-token prediction), diffusion models (iterative denoising), GANs (adversarial generation).
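The layer formula above can be sketched in a few lines of NumPy (the sizes, random weights, and the choice of a linear output layer are arbitrary illustrative assumptions):

```python
import numpy as np

def relu(z):
    """Element-wise ReLU non-linearity: sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

def layer_forward(W, b, a_prev, sigma=relu):
    """One dense layer: a = sigma(W @ a_prev + b)."""
    return sigma(W @ a_prev + b)

# Tiny two-layer forward pass on a random input (shapes chosen for illustration).
rng = np.random.default_rng(0)
a0 = rng.normal(size=3)                         # input vector a^(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: 4 -> 2
a1 = layer_forward(W1, b1, a0)                  # hidden activations (>= 0 after ReLU)
a2 = layer_forward(W2, b2, a1, sigma=lambda z: z)  # linear output layer
```

Stacking such calls, each with its own non-linearity, is exactly the "many layers of non-linear transformations" in the exam point above.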
2. Probability Basics & Bayesian Learning
Probability is the language of ML. Four concepts you must know cold:
- Likelihood \(L(\theta) = P(\mathcal{D}|\theta)\): how probable the observed data is under parameters \(\theta\).
- Prior \(P(\theta)\): our belief about parameters before seeing any data.
- Posterior \(P(\theta|\mathcal{D})\): updated belief after seeing the data.
- Bayes' rule: \(P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})}\) — multiply likelihood × prior, normalise by evidence.
MLE (Maximum Likelihood Estimation): \(\hat\theta = \arg\max_\theta \log P(\mathcal{D}|\theta)\) — ignores prior. MAP (Maximum A Posteriori): \(\hat\theta_{\text{MAP}} = \arg\max_\theta [\log P(\mathcal{D}|\theta) + \log P(\theta)]\) — adds prior as regulariser.
- MLE: maximise data likelihood — can overfit if prior is ignored.
- MAP with Gaussian prior \(\equiv\) L2-regularised MLE.
- Full Bayesian inference is intractable for large networks → variational inference or MC-Dropout as approximations.
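The MLE-vs-MAP contrast can be made concrete with a Bernoulli coin and a Beta prior (the counts and the Beta(2, 2) prior below are made-up illustrative values; the closed-form MAP estimate assumes alpha, beta > 1):

```python
# MLE vs MAP for a Bernoulli coin with a Beta(alpha, beta) prior.

def bernoulli_mle(heads, flips):
    """MLE: argmax log P(D|theta) = heads / flips."""
    return heads / flips

def bernoulli_map(heads, flips, alpha, beta):
    """MAP: (heads + alpha - 1) / (flips + alpha + beta - 2).
    The log-prior term acts as a regulariser pulling the estimate
    toward the prior mean (here 0.5)."""
    return (heads + alpha - 1) / (flips + alpha + beta - 2)

heads, flips = 7, 10
mle = bernoulli_mle(heads, flips)            # 0.7
map_ = bernoulli_map(heads, flips, 2, 2)     # (7 + 1) / (10 + 2) = 0.666...
```

Note how the MAP estimate sits between the raw data frequency (0.7) and the prior mean (0.5) — the same shrinkage effect that a Gaussian prior produces as L2 regularisation.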
3. Shallow Machine Learning overview
Shallow ML models include Linear/Logistic Regression, Support Vector Machines (SVM), Decision Trees / Random Forests, and k-NN. They rely on hand-crafted features and a single (or few) layers of learned transformation.
Their strengths: interpretability, efficiency on small data, strong theoretical guarantees. Their weakness: performance plateaus on complex high-dimensional inputs (images, speech, text) because manual feature engineering is expensive and incomplete.
Exam-ready points
- Feature engineering (HOG, SIFT, TF-IDF) is the bottleneck in shallow pipelines.
- SVMs maximise the margin: \(\min \frac{1}{2}\|w\|^2\) subject to \(y_i(w^\top x_i + b) \ge 1\).
- Random Forests use bagging + random feature subsets → low variance, moderate bias.
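To see what "a single layer of learned transformation" looks like in practice — and why k-NN has no training step but slow inference — here is a minimal pure-Python k-NN sketch (the toy dataset is invented for illustration):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). There is no training step: all the work
    happens at prediction time, scanning the whole training set."""
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: class 0 near the origin, class 1 near (5, 5).
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
pred = knn_predict(X, y, (0.5, 0.5))
```

Because every prediction compares against every training point, inference cost grows linearly with the dataset — one reason shallow methods plateau on large raw-data problems.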
4. Deep vs Shallow: key differences
| Aspect | Shallow ML | Deep Learning |
|---|---|---|
| Feature extraction | Manual | Automatic (learned) |
| Data requirement | Small–medium | Large (millions) |
| Compute | CPU, seconds–minutes | GPU/TPU, hours–days |
| Interpretability | High (LR, trees) | Low (black box) |
| Performance on raw data | Plateaus | Scales with data & compute |
5. ML Algorithm Quick Reference
Supervised learning
| Algorithm | Task | Key idea |
|---|---|---|
| Linear Regression | Regression | Best-fit hyperplane; minimises MSE |
| Logistic Regression | Classification | Linear + sigmoid/softmax; outputs probability |
| SVM | Classification | Max-margin hyperplane; kernels for non-linearity |
| k-NN | Both | Predicts from k nearest points; no training, slow inference |
| Naive Bayes | Classification | Bayes' rule + conditional independence; strong text baseline |
| Decision Tree | Both | If–else splits on Gini/Entropy; interpretable, prone to overfit |
| Random Forest | Both | Bagging many trees (bootstrap + random features) → ↓ variance |
| Gradient Boosting / XGBoost | Both | Sequential trees correcting errors; often best on tabular data |
Unsupervised learning
| Algorithm | Type | Key idea |
|---|---|---|
| K-Means | Clustering | Assigns points to nearest centroid; iterates; needs K preset |
| PCA | Dim. reduction | Projects onto directions of max variance (linear) |
| t-SNE / UMAP | Visualisation | Preserve neighbourhood structure; for 2D/3D plots only |
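The K-Means row above can be sketched in a few lines of NumPy (Lloyd's algorithm; the toy blobs and seeding the centroids with one point from each blob are illustrative simplifications — practical implementations use smarter initialisation such as k-means++):

```python
import numpy as np

def kmeans(X, init, n_iters=20):
    """Plain Lloyd's algorithm: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its
    assigned points."""
    centroids = np.array(init, dtype=float)
    k = len(centroids)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: keep the old centroid if a cluster ends up empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated toy blobs; one point from each blob seeds the centroids.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centroids, labels = kmeans(X, init=X[[0, 20]])
```

Note that K is fixed up front by the caller, exactly as the table's "needs K preset" warns.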
6. Bias, Variance & Overfitting
The bias-variance tradeoff governs generalisation. Total error ≈ Bias² + Variance + Irreducible noise.
- High bias (underfitting): model too simple — high train error and high test error. Fix: more capacity, more features, less regularisation.
- High variance (overfitting): memorises noise — low train error but high test error. Fix: more data, regularisation (L1/L2/dropout), early stopping, simpler model.
- Split data first → fit preprocessors on train only → transform val/test.
- Baseline first → tune with CV → evaluate once on test set.
- Bagging (Random Forest): parallel training → reduces variance.
- Boosting (XGBoost): sequential training → reduces bias.
- Cross-validation (k-fold): reliable estimate of generalisation when data is limited.
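The k-fold splitting idea can be sketched as a plain-Python index splitter (contiguous, unshuffled folds — a simplification for clarity; real splitters usually shuffle first):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size and yield
    (train, val) index lists; every index lands in exactly one
    validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

splits = list(kfold_indices(n=10, k=3))
# Fold sizes are 4, 3, 3; averaging a model's score over the 3 validation
# folds gives the cross-validated estimate of generalisation.
```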
Worked Example — Counting parameters in a simple DNN
Network: Input (784) → Dense(128, ReLU) → Dense(64, ReLU) → Dense(10, Softmax)
Layer 1: 784 × 128 + 128 = 100,480 params
Layer 2: 128 × 64 + 64 = 8,256 params
Layer 3: 64 × 10 + 10 = 650 params
Total = 109,386 params
Even this small network has ~110K parameters — deep models for ImageNet have tens of millions.
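The per-layer counting rule (fan_in × fan_out weights + fan_out biases) generalises to any stack of dense layers; a small helper function (hypothetical, just to check the arithmetic) reproduces the total above:

```python
def dense_params(layer_sizes):
    """Trainable parameters of a stack of fully connected layers with
    biases: each layer contributes fan_in * fan_out weights plus
    fan_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

total = dense_params([784, 128, 64, 10])   # matches the worked example: 109,386
```

The same helper can be used to check your answer to the parameter-counting exercise below.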
Exercises
- Explain the AI ⊃ ML ⊃ DL hierarchy with an example at each level.
- Why did deep learning become practical only after ~2012? Cite at least three enabling factors.
- Compute the total trainable parameters for a network: 32 inputs → 64 → 32 → 2 (all Dense layers with bias).
Viva Questions
- What is representation learning and why is it important in DL?
- Differentiate MLE and MAP estimation.
- What prior distribution corresponds to L1 regularisation?
- Name two benchmark datasets that accelerated DL research.
- Why do shallow models struggle with raw image data?