Unit I — Introduction to Deep Learning

Chapter 1 — Deep Learning Landscape & Motivation

Objectives
  • Understand where deep learning fits in the AI/ML landscape
  • Compare shallow vs. deep approaches
  • Appreciate the data and compute drivers of the DL revolution

1. Introduction to Deep Learning

Deep Learning (DL) is a sub-field of Machine Learning (ML), which is itself a sub-field of Artificial Intelligence (AI). The key idea is representation learning: instead of hand-crafting features, a deep neural network automatically discovers useful hierarchical representations directly from raw data.

Three factors drove the DL revolution after 2012: (1) large labelled datasets (ImageNet, Common Crawl), (2) GPU-accelerated computing, and (3) algorithmic advances (ReLU, dropout, batch normalization, residual connections).

[Figure: nested circles AI ⊃ ML ⊃ DL ⊃ GenAI, with examples at each tier — AI: rules, search (A*), planning · ML: LR, Trees, SVM, k-NN · DL: CNN, LSTM/GRU, Transformers · GenAI: GPT-like LLMs, diffusion models]

Figure 1.1 — The AI ⊃ ML ⊃ DL ⊃ GenAI nested hierarchy

AI ⊃ ML ⊃ DL ⊃ GenAI — four tiers
AI: any approach that makes machines act "intelligent" (rules, search, planning, ML).
ML: learns patterns from data to approximate y = f(x) by minimising a loss.
DL: ML with neural networks (many layers) that learn representations + decisions end-to-end.
GenAI: DL models that generate new content (text / images / code) instead of only predicting labels — LLMs, diffusion models, GANs.

Learning paradigms within ML: Supervised (labelled pairs \((x, y)\) — classification, regression), Unsupervised (unlabelled data — clustering, dimensionality reduction), Reinforcement Learning (agent learns via reward signals from environment).

Exam-ready points
  • Deep learning = learning representations through many layers of non-linear transformations.
  • AlexNet (2012, Krizhevsky et al.) winning ImageNet by a large margin is considered the watershed moment.
  • Key formula: output of layer \(\ell\): \(\mathbf{a}^{(\ell)} = \sigma\!\left(W^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right)\)
  • GenAI extends DL by training models to generate: LLMs (next-token prediction), diffusion models (iterative denoising), GANs (adversarial generation).
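
The layer formula above can be sketched in plain Python. The weights and inputs below are hand-picked toy values (not a trained network), chosen so the effect of the ReLU non-linearity is visible:

```python
# One dense layer: a^(l) = sigma(W a^(l-1) + b), in plain Python.
def relu(z):
    return [max(0.0, v) for v in z]

def dense_layer(W, b, a_prev, activation=relu):
    z = [sum(w * a for w, a in zip(row, a_prev)) + b_i   # W a^(l-1) + b
         for row, b_i in zip(W, b)]
    return activation(z)                                  # apply sigma element-wise

W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]  # 3 units, 2 inputs (toy values)
b = [0.0, 1.0, 0.5]
print(dense_layer(W, b, [2.0, 3.0]))  # → [0.0, 3.5, 0.0]
```

Note how ReLU zeroes the two negative pre-activations; stacking such layers is what "many layers of non-linear transformations" means concretely.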

2. Probability Basics & Bayesian Learning

Probability is the language of ML. Four concepts you must know cold:

  • Likelihood \(L(\theta) = P(\mathcal{D}|\theta)\): how probable the observed data is under parameters \(\theta\).
  • Prior \(P(\theta)\): our belief about parameters before seeing any data.
  • Posterior \(P(\theta|\mathcal{D})\): updated belief after seeing the data.
  • Bayes' rule: \(P(\theta|\mathcal{D}) = \frac{P(\mathcal{D}|\theta)\,P(\theta)}{P(\mathcal{D})}\) — multiply likelihood × prior, normalise by evidence.

MLE (Maximum Likelihood Estimation): \(\hat\theta = \arg\max_\theta \log P(\mathcal{D}|\theta)\) — ignores prior. MAP (Maximum A Posteriori): \(\hat\theta_{\text{MAP}} = \arg\max_\theta [\log P(\mathcal{D}|\theta) + \log P(\theta)]\) — adds prior as regulariser.
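
A one-coin sketch makes the MLE-vs-MAP difference concrete. The numbers are illustrative; the Beta prior is the standard conjugate choice for a Bernoulli likelihood:

```python
# Coin-flip sketch: 7 heads in 10 tosses (illustrative numbers).
h, n = 7, 10
mle = h / n                                  # argmax of log P(D|theta)
alpha = beta = 2                             # Beta(2,2) prior, peaked at 0.5
map_est = (h + alpha - 1) / (n + alpha + beta - 2)  # mode of the posterior
print(mle, round(map_est, 3))  # MAP is pulled toward the prior's peak at 0.5
```

With more data (say 700 heads in 1000 tosses) the likelihood dominates and MLE and MAP converge — the prior matters most when data is scarce.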

Where this shows up in DL
  • Naive Bayes uses Bayes' rule directly.
  • Cross-entropy loss is the negative log-likelihood for classification.
  • L2 weight decay = Gaussian prior (MAP); L1 = Laplace prior.
  • MC-Dropout ≈ Bayesian posterior approximation at inference time.
Exam-ready points
  • MLE: maximise data likelihood — can overfit if prior is ignored.
  • MAP with Gaussian prior \(\equiv\) L2-regularised MLE.
  • Full Bayesian inference is intractable for large networks → variational inference or MC-Dropout as approximations.

3. Shallow Machine Learning overview

Shallow ML models include Linear/Logistic Regression, Support Vector Machines (SVM), Decision Trees / Random Forests, and k-NN. They rely on hand-crafted features and a single (or few) layers of learned transformation.

Their strengths: interpretability, efficiency on small data, strong theoretical guarantees. Their weakness: performance plateaus on complex high-dimensional inputs (images, speech, text) because manual feature engineering is expensive and incomplete.

Exam-ready points
  • Feature engineering (HOG, SIFT, TF-IDF) is the bottleneck in shallow pipelines.
  • SVMs maximise the margin: \(\min \frac{1}{2}\|w\|^2\) subject to \(y_i(w^\top x_i + b) \ge 1\).
  • Random Forests use bagging + random feature subsets → low variance, moderate bias.
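
The max-margin constraint can be checked numerically on a toy set. The points and the hyperplane \((w, b)\) below are hypothetical, chosen so the constraint holds with equality on the support vectors:

```python
import math

# Toy margin check for the constraint y_i (w . x_i + b) >= 1.
w, b = [1.0, 1.0], -3.0
data = [([1.0, 1.0], -1), ([0.0, 1.0], -1), ([2.0, 3.0], 1), ([3.0, 2.0], 1)]

margins = [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in data]
width = 2 / math.sqrt(w[0] ** 2 + w[1] ** 2)   # geometric margin = 2 / ||w||
print(min(margins), round(width, 3))  # support vectors hit the constraint at 1
```

Minimising \(\frac{1}{2}\|w\|^2\) is equivalent to maximising this \(2/\|w\|\) margin width.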

4. Deep vs Shallow: key differences

Aspect                    Shallow ML             Deep Learning
Feature extraction        Manual                 Automatic (learned)
Data requirement          Small–medium           Large (millions)
Compute                   CPU, seconds–minutes   GPU/TPU, hours–days
Interpretability          High (LR, trees)       Low (black box)
Performance on raw data   Plateaus               Scales with data & compute

5. ML Algorithm Quick Reference

Supervised learning

Algorithm                     Task            Key idea
Linear Regression             Regression      Best-fit hyperplane; minimises MSE
Logistic Regression           Classification  Linear + sigmoid/softmax; outputs probability
SVM                           Classification  Max-margin hyperplane; kernels for non-linearity
k-NN                          Both            Predicts from K nearest points; no training, slow inference
Naive Bayes                   Classification  Bayes' rule + conditional independence; strong text baseline
Decision Tree                 Both            If–else splits on Gini/Entropy; interpretable, prone to overfit
Random Forest                 Both            Bagging many trees (bootstrap + random features) → ↓ variance
Gradient Boosting / XGBoost   Both            Sequential trees correcting errors; often best on tabular data

Unsupervised learning

Algorithm     Type            Key idea
K-Means       Clustering      Assigns points to nearest centroid; iterates; needs K preset
PCA           Dim. reduction  Projects onto directions of max variance (linear)
t-SNE / UMAP  Visualisation   Preserve neighbourhood structure; for 2D/3D plots only

Parameters vs Hyperparameters
Parameters: learned from data (weights, biases). Hyperparameters: chosen before training (K in k-NN, depth, learning rate, batch size, number of layers).

6. Bias, Variance & Overfitting

The bias-variance tradeoff governs generalisation. Total error ≈ Bias² + Variance + Irreducible noise.

  • High bias (underfitting): model too simple — high train error and high test error. Fix: more capacity, more features, less regularisation.
  • High variance (overfitting): memorises noise — low train error but high test error. Fix: more data, regularisation (L1/L2/dropout), early stopping, simpler model.
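
The underfit/overfit pattern shows up in a small polynomial-regression sketch (assumes NumPy; seed, noise level, and degrees are illustrative choices). Degree 1 underfits with high train and test error, while a high degree drives train error down but typically hurts test error:

```python
import numpy as np

# Fit polynomials of rising degree to noisy sin data; compare train/test MSE.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=x.size)
x_tr, y_tr = x[::2], y[::2]     # even indices -> train
x_te, y_te = x[1::2], y[1::2]   # odd indices  -> held-out test

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

results = {}
for deg in (1, 4, 10):
    coeffs = np.polyfit(x_tr, y_tr, deg)   # least-squares polynomial fit
    results[deg] = (mse(np.polyval(coeffs, x_tr), y_tr),
                    mse(np.polyval(coeffs, x_te), y_te))
    print(f"degree {deg:2d}: train MSE {results[deg][0]:.3f}, "
          f"test MSE {results[deg][1]:.3f}")
```

The train–test gap, not train error alone, is what diagnoses overfitting.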
Practical checklist
1. Identify problem type (classification/regression, supervised/unsupervised).
2. Split data first → fit preprocessors on train only → transform val/test.
3. Baseline first → tune with CV → evaluate once on test set.
Exam-ready points
  • Bagging (Random Forest): parallel training → reduces variance.
  • Boosting (XGBoost): sequential training → reduces bias.
  • Cross-validation (k-fold): reliable estimate of generalisation when data is limited.
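
The k-fold splitting logic can be sketched in a few lines of plain Python (a minimal index splitter; real projects would use `sklearn.model_selection.KFold`, usually with shuffling):

```python
# Split range(n) into k folds; each fold serves once as the validation set.
def kfold_indices(n, k):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    # remaining indices form the training set for that fold
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

for train_idx, val_idx in kfold_indices(10, 3):
    print(len(train_idx), val_idx)
```

Averaging the k validation scores gives the generalisation estimate; the final model is then refit on all the data.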

Worked Example — Counting parameters in a simple DNN

Network: Input (784) → Dense(128, ReLU) → Dense(64, ReLU) → Dense(10, Softmax)

Layer 1: 784 × 128 + 128 = 100,480 params
Layer 2: 128 ×  64 +  64 =   8,256 params
Layer 3:  64 ×  10 +  10 =     650 params
Total                    = 109,386 params

Even this small network has ~110K parameters — deep models for ImageNet have tens of millions.
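
The tally above generalises to any stack of Dense layers; a quick sketch (the helper name is our own):

```python
# Each Dense layer contributes (fan_in x fan_out) weights + fan_out biases.
def dense_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(dense_params([784, 128, 64, 10]))  # → 109386, matching the tally above
```

The same helper answers Exercise 3 below by passing `[32, 64, 32, 2]`.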

Exercises

  1. Explain the AI ⊃ ML ⊃ DL hierarchy with an example at each level.
  2. Why did deep learning become practical only after ~2012? Cite at least three enabling factors.
  3. Compute the total trainable parameters for a network: 32 inputs → 64 → 32 → 2 (all Dense layers with bias).

Viva Questions

  1. What is representation learning and why is it important in DL?
  2. Differentiate MLE and MAP estimation.
  3. What prior distribution corresponds to L1 regularisation?
  4. Name two benchmark datasets that accelerated DL research.
  5. Why do shallow models struggle with raw image data?