Unit I — Introduction to Deep Learning

Chapter 3 — Loss Functions & Optimization Basics

Objectives
Identify appropriate loss functions for regression and classification · Understand gradient descent and its variants · Build a mini training pipeline

0. The ML Training Pipeline & Critical Traps

Every training run follows this sequence. The order matters — violating it causes subtle but serious bugs:

Data → Split FIRST → Preprocess (fit on train only) → Train / Cross-validate → Test

Figure 3.1 — Always split before any preprocessing; fit preprocessors on train only

⚠ Data Leakage
Using information at training time that would not be available at prediction time → artificially inflated scores. Classic trap: fitting a scaler or imputer on the full dataset before splitting. Fix: fit on train only, then transform val/test with the same fitted object.
⚠ Imbalanced Classes
If 99% of samples are class A, a trivial "always predict A" classifier gets 99% accuracy. Fix: use Precision / Recall / F1 or PR-AUC, adjust decision threshold, use class weights, or careful resampling.
Normalisation vs Standardisation
Standardise (zero-mean, unit-variance): best for LR / SVM / KNN — sensitive to scale.
Normalise (scale to [0,1] or unit-norm): best for distances, embeddings, CNNs with sigmoid output.
Always apply the same fitted transform to val and test sets.
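The split-then-fit rule can be sketched in NumPy (toy random data, illustrative only — a real pipeline would use a library transformer, but the principle is identical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # toy features

# 1. Split FIRST (simple 80/20 index split)
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# 2. "Fit" the standardiser on the training set only
mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# 3. Transform both splits with the SAME fitted statistics
X_train_s = (X_train - mu) / sigma
X_test_s  = (X_test  - mu) / sigma   # test statistics never enter the fit
```

Note that `X_test_s` will not have exactly zero mean — and that is correct: only the training set defines the transform.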

1. Loss Functions

The loss (cost) function measures how far the model's predictions are from the ground truth. Choosing the right loss is crucial:

Task · Loss · Formula
Regression · MSE · \(\frac{1}{N}\sum_i(y_i-\hat{y}_i)^2\)
Regression (robust) · MAE · \(\frac{1}{N}\sum_i|y_i-\hat{y}_i|\)
Regression (robust) · Huber · \(\frac{1}{2}r^2\) if \(|r|\le\delta\), else \(\delta(|r|-\frac{\delta}{2})\), where \(r=y_i-\hat{y}_i\)
Binary classification · Binary Cross-Entropy · \(-[y\log\hat{p}+(1-y)\log(1-\hat{p})]\)
Multi-class · Categorical Cross-Entropy · \(-\sum_k y_k \log\hat{p}_k\)
Exam-ready points
  • MSE heavily penalises large errors (squared); MAE is more robust to outliers.
  • Cross-entropy loss + softmax output = maximum likelihood under categorical distribution.
  • Huber loss: MSE for small errors, MAE for large — combines the best of both.
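The losses in the table above can be written directly in NumPy (a minimal sketch with toy arrays; the `eps` clip in BCE is a standard numerical guard, not part of the formula):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    r = y - y_hat
    quad = 0.5 * r ** 2                        # MSE-like for small residuals
    lin  = delta * (np.abs(r) - 0.5 * delta)   # MAE-like for large residuals
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))

def bce(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)               # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y     = np.array([3.0, -0.5, 2.0])
y_hat = np.array([2.5,  0.0, 2.0])
print(f"MSE={mse(y, y_hat):.4f}")              # → MSE=0.1667
```

Squaring the residuals is what makes MSE outlier-sensitive: the residual 0.5 contributes 0.25 to MSE but only 0.5 to MAE.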

2. Gradient Descent

Gradient descent iteratively moves parameters in the direction of steepest descent of the loss:

\[\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)\]

Three variants differ in how many samples are used per update:

  • Batch GD: gradient over entire dataset — stable but slow per update.
  • Stochastic GD (SGD): gradient over one sample — noisy but fast; natural regulariser.
  • Mini-batch GD: gradient over a batch of \(B\) samples (typically 32–512) — best of both; enables GPU parallelism.
Learning rate \(\eta\)
Too large → divergence or oscillation. Too small → very slow convergence. Use schedules (step decay, cosine annealing) or adaptive optimisers.
Exam-ready points
  • Gradient points uphill; we subtract it to go downhill.
  • SGD with momentum: \(v \leftarrow \beta v - \eta g\), \(\theta \leftarrow \theta + v\) — dampens oscillations.
  • Epoch = one full pass through the training set.
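The momentum update above can be illustrated on a one-dimensional quadratic (a minimal sketch, not a full optimiser — the loss \(\mathcal{L}(\theta)=(\theta-3)^2\) and the constants are chosen only for the demo):

```python
def grad(theta):
    return 2 * (theta - 3.0)      # gradient of L(theta) = (theta - 3)^2

theta, v = 0.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(200):
    g = grad(theta)
    v = beta * v - eta * g        # velocity accumulates past gradients
    theta = theta + v             # move along the velocity, not the raw gradient
print(theta)                      # converges towards the minimum at 3
```

Because \(v\) averages recent gradients, oscillating components cancel while the consistent downhill direction accumulates — exactly the damping effect described above.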

3. Batch Optimization & Training Pipeline

A standard mini-batch training loop:

for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:          # 1. fetch mini-batch
        y_pred = model(X_batch)                  # 2. forward pass
        loss   = criterion(y_pred, y_batch)      # 3. compute loss
        optimizer.zero_grad()                    # 4. clear old gradients
        loss.backward()                          # 5. backprop
        optimizer.step()                         # 6. update weights

Common adaptive optimisers:

  • AdaGrad: scales \(\eta\) by inverse sqrt of sum of squared gradients — good for sparse features.
  • RMSProp: exponential moving average of squared gradients — fixes AdaGrad's decaying lr.
  • Adam: combines momentum + RMSProp; update: \(\theta \leftarrow \theta - \frac{\eta}{\sqrt{\hat{v}}+\epsilon}\hat{m}\). Default choice for most DL tasks.
Exam-ready points
  • Adam hyperparameters: \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\) (defaults).
  • Always call optimizer.zero_grad() before backward() to avoid gradient accumulation.
  • Validation loss monitored each epoch to detect overfitting early.
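The Adam update with its bias-corrected moments can be sketched in NumPy (a single-parameter toy problem; `adam_step` is an illustrative helper, not a library API):

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for scalar parameter theta with gradient g."""
    m = b1 * m + (1 - b1) * g          # EMA of gradients (momentum term)
    v = b2 * v + (1 - b2) * g ** 2     # EMA of squared gradients (RMSProp term)
    m_hat = m / (1 - b1 ** t)          # bias correction: m, v start at zero
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimise L(theta) = theta^2 starting from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 3001):               # note: t starts at 1 for bias correction
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t)
```

Without the \(1/(1-\beta^t)\) corrections the first updates would be tiny, since both moving averages are initialised at zero — which is the answer to Exercise 3.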

4. Evaluation Metrics Reference

Classification metrics (choose based on class balance and cost of errors):

Precision = TP / (TP + FP) (minimise false alarms) · Recall = TP / (TP + FN) (catch all true cases) · F1 = 2·P·R / (P + R)

Figure 3.2 — Precision, Recall and F1 at a glance

Metric · Formula · When to use
Accuracy · (TP+TN)/(TP+TN+FP+FN) · Balanced classes
Precision · TP/(TP+FP) · False alarms are costly (e.g., spam filter)
Recall (Sensitivity) · TP/(TP+FN) · Missing true cases is costly (e.g., disease detection)
F1 Score · 2·P·R/(P+R) · Imbalanced classes; balance of P & R needed
ROC-AUC · Area under TPR vs FPR curve · Threshold-free ranking quality
PR-AUC · Area under Precision vs Recall curve · Heavy class imbalance (ROC can be misleading)
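The confusion-matrix formulas above can be computed by hand on a small imbalanced toy example (labels chosen purely for illustration):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # 3 positives, 7 negatives
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives  = 2
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives = 1
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives = 1
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives  = 6

precision = tp / (tp + fp)                   # 2/3
recall    = tp / (tp + fn)                   # 2/3
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / len(y_true)          # 0.8
```

Note accuracy (0.8) looks better than precision and recall (both 2/3) — a first taste of why accuracy flatters on imbalanced data.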

Regression metrics:

Metric · Formula · Note
MAE · \(\frac{1}{N}\sum|y_i-\hat y_i|\) · Robust; same units as target
RMSE · \(\sqrt{\frac{1}{N}\sum(y_i-\hat y_i)^2}\) · Penalises large errors more; same units as target
R² · \(1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar y)^2}\) · Variance explained; compare against baselines, not in isolation
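The regression metrics can be computed directly from their formulas (toy arrays, illustrative only):

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

mae  = np.mean(np.abs(y - y_hat))             # 0.25, same units as y
rmse = np.sqrt(np.mean((y - y_hat) ** 2))     # larger residuals weigh more

ss_res = np.sum((y - y_hat) ** 2)             # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)          # total sum of squares
r2 = 1 - ss_res / ss_tot                      # fraction of variance explained
```

Here RMSE (≈0.354) exceeds MAE (0.25) because squaring up-weights the two non-zero residuals; RMSE ≥ MAE always holds.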
Key insight — Threshold & tradeoffs
For probabilistic classifiers, adjusting the decision threshold shifts the precision↔recall tradeoff. The ROC curve shows all possible thresholds at once. Always report metric and threshold together.
Exam-ready points
  • High train accuracy + low test accuracy → overfitting (high variance). Fix: more data, regularisation, early stopping.
  • Accuracy is misleading on imbalanced datasets — prefer F1 or PR-AUC.
  • ROC-AUC = 0.5 → random classifier; = 1.0 → perfect classifier.
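The threshold↔tradeoff point can be made concrete by sweeping a decision threshold over toy predicted probabilities (numbers chosen for illustration):

```python
import numpy as np

p      = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.90])  # predicted P(class=1)
y_true = np.array([0,    0,    1,    1,    1,    1])

results = {}
for thresh in (0.3, 0.5, 0.7):
    y_pred = (p >= thresh).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall    = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    results[thresh] = (precision, recall)
    print(f"threshold={thresh}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold from 0.3 to 0.7 moves precision from 0.80 up to 1.00 while recall falls from 1.00 to 0.50 — the same model, three different operating points, which is why metric and threshold must be reported together.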

Worked Example — MSE gradient for linear regression

# Closed-form gradient of MSE for y = wx + b
# L = (1/N) * sum((y - wx - b)^2)
# dL/dw = (-2/N) * sum((y_i - y_hat_i) * x_i)
# dL/db = (-2/N) * sum((y_i - y_hat_i))
import numpy as np
X = np.array([1,2,3,4,5], dtype=float)
y = np.array([2,4,5,4,5], dtype=float)
w, b, lr = 0.0, 0.0, 0.01
for _ in range(10_000):                   # enough iterations to converge at lr=0.01
    y_hat = w * X + b
    res   = y - y_hat
    w += lr * (2/len(X)) * (res @ X)
    b += lr * (2/len(X)) * res.sum()
print(f"w={w:.3f}, b={b:.3f}")   # → w≈0.6, b≈2.2

Exercises

  1. Derive \(\partial \mathcal{L}_{\text{BCE}} / \partial z\) where \(z = w^\top x + b\) and \(\hat{p} = \sigma(z)\).
  2. Compare Batch GD, SGD, and Mini-batch GD on a table covering: convergence noise, GPU usage, memory cost.
  3. Why does Adam use bias correction terms \(\hat{m} = m/(1-\beta_1^t)\)?

Viva Questions

  1. What loss function should you use for a 5-class image classification problem?
  2. What is a saddle point and why is it problematic for gradient descent?
  3. Explain the role of momentum in SGD with momentum.
  4. What happens if the learning rate is set too high?
  5. Differentiate RMSProp and Adam.
Interview drill — model answers
  1. High train accuracy + low test accuracy?
    Overfitting (high variance). Fix: more data, regularisation (L2/dropout), simpler model, early stopping.
  2. Why do we need activation functions?
    Without non-linearity, stacking layers collapses to a single linear transform; activations let networks model complex functions.
  3. Bagging vs boosting?
    Bagging (Random Forest) trains many models in parallel → reduces variance. Boosting (XGBoost) trains sequentially, each correcting the last → reduces bias.
  4. Likelihood vs prior (one-liner)?
    Likelihood = "how well θ explains the observed data"; prior = "what we believed about θ before seeing any data."
  5. Parameters vs hyperparameters?
    Parameters are learned from data (weights). Hyperparameters are chosen before training (learning rate, batch size, number of layers).