Chapter 1 — Deep Learning Landscape & Motivation
Unit I · Introduction to Deep Learning
1. Introduction to Deep Learning
Deep Learning (DL) is a sub-field of Machine Learning (ML), which is itself a sub-field of Artificial Intelligence (AI). The key idea is representation learning: instead of hand-crafting features, a deep neural network automatically discovers useful hierarchical representations directly from raw data.
Three factors drove the DL revolution after 2012: (1) large labelled datasets (ImageNet, Common Crawl), (2) GPU-accelerated computing, and (3) algorithmic advances (ReLU, dropout, batch normalization, residual connections).
Figure 1.1 — The AI ⊃ ML ⊃ DL ⊃ GenAI nested hierarchy
- ML: learns patterns from data to approximate y = f(x) by minimising a loss.
- DL: ML with neural networks (many layers) that learn representations + decisions end-to-end.
- GenAI: DL models that generate new content (text / images / code) instead of only predicting labels — LLMs, diffusion models, GANs.
Learning paradigms within ML: Supervised (labelled pairs \((x, y)\) — classification, regression), Unsupervised (unlabelled data — clustering, dimensionality reduction), Reinforcement Learning (agent learns via reward signals from environment).
Exam-ready points
- Deep learning = learning representations through many layers of non-linear transformations.
- AlexNet (2012, Krizhevsky et al.) winning ImageNet by a large margin is considered the watershed moment.
- Key formula: output of layer \(\ell\): \(\mathbf{a}^{(\ell)} = \sigma\!\left(W^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right)\)
- GenAI extends DL by training models to generate: LLMs (next-token prediction), diffusion models (iterative denoising), GANs (adversarial generation).
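The layer formula above can be sketched in a few lines of NumPy (the sizes, random weights, and the choice of a linear output layer are arbitrary illustrative assumptions):

```python
import numpy as np

def relu(z):
    """Element-wise ReLU non-linearity: sigma(z) = max(0, z)."""
    return np.maximum(0.0, z)

def layer_forward(W, b, a_prev, sigma=relu):
    """One dense layer: a = sigma(W @ a_prev + b)."""
    return sigma(W @ a_prev + b)

# Tiny two-layer forward pass on a random input (shapes chosen for illustration).
rng = np.random.default_rng(0)
a0 = rng.normal(size=3)                         # input vector a^(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: 4 -> 2
a1 = layer_forward(W1, b1, a0)                  # hidden activations (>= 0 after ReLU)
a2 = layer_forward(W2, b2, a1, sigma=lambda z: z)  # linear output layer
```

Stacking such calls, each with its own non-linearity, is exactly the "many layers of non-linear transformations" in the exam point above.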
2. Probability Basics & Bayesian Learning
Probability is the language of ML. Four concepts you must know cold:
- Likelihood \(L(\theta) = P(\mathcal{D}|\theta)\): how probable the observed data is under parameters \(\theta\).
- Prior \(P(\theta)\): our belief about parameters before seeing any data.
- Posterior \(P(\theta|\mathcal{D})\): updated belief after seeing the data.
- Bayes' rule: \(P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})}\) — multiply likelihood × prior, normalise by evidence.
MLE (Maximum Likelihood Estimation): \(\hat\theta = \arg\max_\theta \log P(\mathcal{D}|\theta)\) — ignores prior. MAP (Maximum A Posteriori): \(\hat\theta_{\text{MAP}} = \arg\max_\theta [\log P(\mathcal{D}|\theta) + \log P(\theta)]\) — adds prior as regulariser.
- MLE: maximise data likelihood — can overfit if prior is ignored.
- MAP with Gaussian prior \(\equiv\) L2-regularised MLE.
- Full Bayesian inference is intractable for large networks → variational inference or MC-Dropout as approximations.
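The MLE-vs-MAP contrast can be made concrete with a Bernoulli coin and a Beta prior (the counts and the Beta(2, 2) prior below are made-up illustrative values; the closed-form MAP estimate assumes alpha, beta > 1):

```python
# MLE vs MAP for a Bernoulli coin with a Beta(alpha, beta) prior.

def bernoulli_mle(heads, flips):
    """MLE: argmax log P(D|theta) = heads / flips."""
    return heads / flips

def bernoulli_map(heads, flips, alpha, beta):
    """MAP: (heads + alpha - 1) / (flips + alpha + beta - 2).
    The log-prior term acts as a regulariser pulling the estimate
    toward the prior mean (here 0.5)."""
    return (heads + alpha - 1) / (flips + alpha + beta - 2)

heads, flips = 7, 10
mle = bernoulli_mle(heads, flips)            # 0.7
map_ = bernoulli_map(heads, flips, 2, 2)     # (7 + 1) / (10 + 2) = 0.666...
```

Note how the MAP estimate sits between the raw data frequency (0.7) and the prior mean (0.5) — the same shrinkage effect that a Gaussian prior produces as L2 regularisation.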
3. Shallow Machine Learning overview
Shallow ML models include Linear/Logistic Regression, Support Vector Machines (SVM), Decision Trees / Random Forests, and k-NN. They rely on hand-crafted features and a single (or few) layers of learned transformation.
Their strengths: interpretability, efficiency on small data, strong theoretical guarantees. Their weakness: performance plateaus on complex high-dimensional inputs (images, speech, text) because manual feature engineering is expensive and incomplete.
Exam-ready points
- Feature engineering (HOG, SIFT, TF-IDF) is the bottleneck in shallow pipelines.
- SVMs maximise the margin: \(\min \frac{1}{2}\|w\|^2\) subject to \(y_i(w^\top x_i + b) \ge 1\).
- Random Forests use bagging + random feature subsets → low variance, moderate bias.
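To see what "a single layer of learned transformation" looks like in practice — and why k-NN has no training step but slow inference — here is a minimal pure-Python k-NN sketch (the toy dataset is invented for illustration):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). There is no training step: all the work
    happens at prediction time, scanning the whole training set."""
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: class 0 near the origin, class 1 near (5, 5).
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
y = [0, 0, 0, 1, 1, 1]
pred = knn_predict(X, y, (0.5, 0.5))
```

Because every prediction compares against every training point, inference cost grows linearly with the dataset — one reason shallow methods plateau on large raw-data problems.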
4. Deep vs Shallow: key differences
| Aspect | Shallow ML | Deep Learning |
|---|---|---|
| Feature extraction | Manual | Automatic (learned) |
| Data requirement | Small–medium | Large (millions) |
| Compute | CPU, seconds–minutes | GPU/TPU, hours–days |
| Interpretability | High (LR, trees) | Low (black box) |
| Performance on raw data | Plateaus | Scales with data & compute |
5. ML Algorithm Quick Reference
Supervised learning
| Algorithm | Task | Key idea |
|---|---|---|
| Linear Regression | Regression | Best-fit hyperplane; minimises MSE |
| Logistic Regression | Classification | Linear + sigmoid/softmax; outputs probability |
| SVM | Classification | Max-margin hyperplane; kernels for non-linearity |
| k-NN | Both | Predicts from k nearest points; no training, slow inference |
| Naive Bayes | Classification | Bayes' rule + conditional independence; strong text baseline |
| Decision Tree | Both | If–else splits on Gini/Entropy; interpretable, prone to overfit |
| Random Forest | Both | Bagging many trees (bootstrap + random features) → ↓ variance |
| Gradient Boosting / XGBoost | Both | Sequential trees correcting errors; often best on tabular data |
Unsupervised learning
| Algorithm | Type | Key idea |
|---|---|---|
| K-Means | Clustering | Assigns points to nearest centroid; iterates; needs K preset |
| PCA | Dim. reduction | Projects onto directions of max variance (linear) |
| t-SNE / UMAP | Visualisation | Preserve neighbourhood structure; for 2D/3D plots only |
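The K-Means row above can be sketched in a few lines of NumPy (Lloyd's algorithm; the toy blobs and seeding the centroids with one point from each blob are illustrative simplifications — practical implementations use smarter initialisation such as k-means++):

```python
import numpy as np

def kmeans(X, init, n_iters=20):
    """Plain Lloyd's algorithm: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its
    assigned points."""
    centroids = np.array(init, dtype=float)
    k = len(centroids)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: keep the old centroid if a cluster ends up empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated toy blobs; one point from each blob seeds the centroids.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centroids, labels = kmeans(X, init=X[[0, 20]])
```

Note that K is fixed up front by the caller, exactly as the table's "needs K preset" warns.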
6. Bias, Variance & Overfitting
The bias-variance tradeoff governs generalisation. Total error ≈ Bias² + Variance + Irreducible noise.
- High bias (underfitting): model too simple — high train error and high test error. Fix: more capacity, more features, less regularisation.
- High variance (overfitting): memorises noise — low train error but high test error. Fix: more data, regularisation (L1/L2/dropout), early stopping, simpler model.
- Split data first → fit preprocessors on train only → transform val/test.
- Baseline first → tune with CV → evaluate once on test set.
- Bagging (Random Forest): parallel training → reduces variance.
- Boosting (XGBoost): sequential training → reduces bias.
- Cross-validation (k-fold): reliable estimate of generalisation when data is limited.
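The k-fold splitting idea can be sketched as a plain-Python index splitter (contiguous, unshuffled folds — a simplification for clarity; real splitters usually shuffle first):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size and yield
    (train, val) index lists; every index lands in exactly one
    validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

splits = list(kfold_indices(n=10, k=3))
# Fold sizes are 4, 3, 3; averaging a model's score over the 3 validation
# folds gives the cross-validated estimate of generalisation.
```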
Worked Example — Counting parameters in a simple DNN
Network: Input (784) → Dense(128, ReLU) → Dense(64, ReLU) → Dense(10, Softmax)
Layer 1: 784 × 128 + 128 = 100,480 params
Layer 2: 128 × 64 + 64 = 8,256 params
Layer 3: 64 × 10 + 10 = 650 params
Total = 109,386 params
Even this small network has ~110K parameters — deep models for ImageNet have tens of millions.
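The per-layer counting rule (fan_in × fan_out weights + fan_out biases) generalises to any stack of dense layers; a small helper function (hypothetical, just to check the arithmetic) reproduces the total above:

```python
def dense_params(layer_sizes):
    """Trainable parameters of a stack of fully connected layers with
    biases: each layer contributes fan_in * fan_out weights plus
    fan_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

total = dense_params([784, 128, 64, 10])   # matches the worked example: 109,386
```

The same helper can be used to check your answer to the parameter-counting exercise below.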
Exercises
- Explain the AI ⊃ ML ⊃ DL hierarchy with an example at each level.
- Why did deep learning become practical only after ~2012? Cite at least three enabling factors.
- Compute the total trainable parameters for a network: 32 inputs → 64 → 32 → 2 (all Dense layers with bias).
Viva Questions
- What is representation learning and why is it important in DL?
- Differentiate MLE and MAP estimation.
- What prior distribution corresponds to L1 regularisation?
- Name two benchmark datasets that accelerated DL research.
- Why do shallow models struggle with raw image data?