Chapter 3 — Loss Functions & Optimization Basics
Unit I · Introduction to Deep Learning
0. The ML Training Pipeline & Critical Traps
Every training run follows the sequence shown in Figure 3.1. The order matters — violating it causes subtle but serious bugs:
Figure 3.1 — Always split before any preprocessing; fit preprocessors on train only
- Normalise (scale to [0,1] or unit-norm): best for distances, embeddings, CNNs with sigmoid output.
- Always apply the same fitted transform to the val and test sets.
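A minimal sketch of the fit-on-train-only rule, using made-up numbers for a single feature column (min-max scaling fitted on the train split, then reused unchanged on the val split):

```python
import numpy as np

# Toy feature column: first 6 rows are "train", last 2 are "val".
data = np.array([10., 20., 30., 40., 50., 60., 70., 80.])
train, val = data[:6], data[6:]

# Fit min-max scaling on the TRAIN split only.
lo, hi = train.min(), train.max()
scale = lambda x: (x - lo) / (hi - lo)

train_scaled = scale(train)  # lies in [0, 1] by construction
val_scaled = scale(val)      # may exceed 1 -- expected and correct
print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
print(val_scaled)                              # [1.2 1.4]
```

Fitting the scaler on the full dataset instead would leak val/test statistics into training — the bug the figure warns about.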
1. Loss Functions
The loss (cost) function measures how far the model's predictions are from the ground truth. Choosing the right loss is crucial:
| Task | Loss | Formula |
|---|---|---|
| Regression | MSE | \(\frac{1}{N}\sum(y_i-\hat{y}_i)^2\) |
| Regression (robust) | MAE / Huber | MAE: \(\frac{1}{N}\sum|y_i-\hat{y}_i|\); Huber: \(\frac{1}{2}e^2\) if \(|e|\le\delta\), else \(\delta(|e|-\frac{\delta}{2})\) |
| Binary classification | Binary Cross-Entropy | \(-[y\log\hat{p}+(1-y)\log(1-\hat{p})]\) |
| Multi-class | Categorical Cross-Entropy | \(-\sum_k y_k \log\hat{p}_k\) |
- MSE heavily penalises large errors (squared); MAE is more robust to outliers.
- Cross-entropy loss + softmax output = maximum likelihood under categorical distribution.
- Huber loss: MSE for small errors, MAE for large — combines the best of both.
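The losses in the table are one-liners in NumPy; a small sketch with made-up values to make the formulas concrete:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # squaring penalises large errors
mae = np.mean(np.abs(y_true - y_pred))  # linear penalty, robust to outliers

# Binary cross-entropy for a single confident, correct prediction
y, p = 1.0, 0.9
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse, mae, bce)  # 0.375 0.5 ≈0.105
```

Note how the single error of size 1 contributes 1.0 to the MSE sum but only 1.0 to the MAE sum at linear (not squared) weight — this is the robustness trade-off in the bullets above.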
2. Gradient Descent
Gradient descent iteratively moves parameters in the direction of steepest descent of the loss:
\[\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)\]
Three variants differ in how many samples are used per update:
- Batch GD: gradient over entire dataset — stable but slow per update.
- Stochastic GD (SGD): gradient over one sample — noisy but fast; natural regulariser.
- Mini-batch GD: gradient over a batch of \(B\) samples (typically 32–512) — best of both; enables GPU parallelism.
- The gradient points in the direction of steepest ascent; we subtract it to move downhill.
- SGD with momentum: \(v \leftarrow \beta v - \eta g\), \(\theta \leftarrow \theta + v\) — dampens oscillations.
- Epoch = one full pass through the training set.
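The momentum update above can be sketched on a toy loss \(L(\theta)=\theta^2\) (learning rate and starting point chosen purely for illustration):

```python
# Minimise L(theta) = theta^2 with SGD + momentum:
#   v <- beta*v - eta*g,  theta <- theta + v
theta, v = 5.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(200):
    g = 2 * theta           # gradient of theta^2 points uphill
    v = beta * v - eta * g  # velocity accumulates past gradients
    theta = theta + v       # net effect: damped descent toward 0
print(theta)  # close to 0
```

The velocity term averages recent gradients, so oscillating components cancel while the consistent downhill direction accumulates — this is the "dampens oscillations" claim in action.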
3. Batch Optimization & Training Pipeline
A standard mini-batch training loop:
```python
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:    # 1. fetch mini-batch
        y_pred = model(X_batch)            # 2. forward pass
        loss = criterion(y_pred, y_batch)  # 3. compute loss
        optimizer.zero_grad()              # 4. clear old gradients
        loss.backward()                    # 5. backprop
        optimizer.step()                   # 6. update weights
```
Common adaptive optimisers:
- AdaGrad: scales \(\eta\) by inverse sqrt of sum of squared gradients — good for sparse features.
- RMSProp: exponential moving average of squared gradients — fixes AdaGrad's decaying lr.
- Adam: combines momentum + RMSProp; update: \(\theta \leftarrow \theta - \frac{\eta}{\sqrt{\hat{v}}+\epsilon}\hat{m}\). Default choice for most DL tasks.
- Adam hyperparameters: \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\) (defaults).
- Always call `optimizer.zero_grad()` before `backward()` to avoid gradient accumulation.
- Monitor validation loss each epoch to detect overfitting early.
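A single Adam update for one scalar parameter, sketched with the default hyperparameters above and a made-up gradient, to show where the bias-correction terms enter:

```python
import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, m, v, t = 1.0, 0.0, 0.0, 0

g = 0.5                               # pretend gradient for this step
t += 1
m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
v = beta2 * v + (1 - beta2) * g ** 2  # second moment (RMSProp-style)
m_hat = m / (1 - beta1 ** t)          # bias correction: m, v start at 0,
v_hat = v / (1 - beta2 ** t)          # so early estimates are scaled up
theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # ≈ 0.999: the first step has size ≈ eta
```

Without the corrections, \(m\) and \(v\) would be biased toward zero for small \(t\) (they are initialised at 0), making early steps far too small — which previews exercise 3 below.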
4. Evaluation Metrics Reference
Classification metrics (choose based on class balance and cost of errors):
Figure 3.2 — Precision, Recall and F1 at a glance
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | False alarms are costly (e.g., spam filter) |
| Recall (Sensitivity) | TP/(TP+FN) | Missing true cases is costly (e.g., disease detection) |
| F1 Score | 2·P·R/(P+R) | Imbalanced classes; balance of P & R needed |
| ROC-AUC | Area under TPR vs FPR curve | Threshold-free ranking quality |
| PR-AUC | Area under Precision vs Recall | Heavy class imbalance (ROC can be misleading) |
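The classification formulas above computed from hypothetical confusion counts (TP/FP/FN/TN values are made up):

```python
# Hypothetical confusion counts for a binary classifier.
TP, FP, FN, TN = 40, 10, 20, 30

accuracy  = (TP + TN) / (TP + TN + FP + FN)       # 0.70
precision = TP / (TP + FP)                        # 0.80: few false alarms
recall    = TP / (TP + FN)                        # ≈0.67: misses a third
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Here accuracy alone would hide the fact that a third of the true positives are missed — exactly why the table recommends recall when missing true cases is costly.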
Regression metrics:
| Metric | Formula | Note |
|---|---|---|
| MAE | \(\frac{1}{N}\sum|y_i-\hat y_i|\) | Robust; same units as target |
| RMSE | \(\sqrt{\frac{1}{N}\sum(y_i-\hat y_i)^2}\) | Penalises large errors more; same units as target |
| R² | \(1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar y)^2}\) | Variance explained; compare against baselines, not in isolation |
- High train accuracy + low test accuracy → overfitting (high variance). Fix: more data, regularisation, early stopping.
- Accuracy is misleading on imbalanced datasets — prefer F1 or PR-AUC.
- ROC-AUC = 0.5 → random classifier; = 1.0 → perfect classifier.
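The regression metrics above are equally quick to check numerically; a sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae  = np.mean(np.abs(y_true - y_pred))            # same units as target
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # weights large errors more
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                           # fraction of variance explained
print(mae, rmse, r2)
```

RMSE exceeds MAE here because of the single error of size 2 — the squared penalty at work, mirroring the MSE-vs-MAE contrast in Section 1.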
Worked Example — MSE gradient for linear regression
```python
import numpy as np

# Closed-form gradient of MSE for y = wx + b
# L = (1/N) * sum((y - wx - b)^2)
# dL/dw = (-2/N) * sum((y_i - y_hat_i) * x_i)
# dL/db = (-2/N) * sum((y_i - y_hat_i))
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    y_hat = w * X + b
    res = y - y_hat
    w += lr * (2 / len(X)) * (res @ X)  # w -= lr * dL/dw
    b += lr * (2 / len(X)) * res.sum()  # b -= lr * dL/db
print(f"w={w:.3f}, b={b:.3f}")  # → w≈0.6, b≈2.2
```
Exercises
- Derive \(\partial \mathcal{L}_{\text{BCE}} / \partial z\) where \(z = w^\top x + b\) and \(\hat{p} = \sigma(z)\).
- Compare Batch GD, SGD, and Mini-batch GD on a table covering: convergence noise, GPU usage, memory cost.
- Why does Adam use bias correction terms \(\hat{m} = m/(1-\beta_1^t)\)?
Viva Questions
- What loss function should you use for a 5-class image classification problem?
- What is a saddle point and why is it problematic for gradient descent?
- Explain the role of momentum in SGD with momentum.
- What happens if the learning rate is set too high?
- Differentiate RMSProp and Adam.
- High train accuracy + low test accuracy?
  Overfitting (high variance). Fix: more data, regularisation (L2/dropout), simpler model, early stopping.
- Why do we need activation functions?
  Without non-linearity, stacking layers collapses to a single linear transform; activations let networks model complex functions.
- Bagging vs boosting?
  Bagging (Random Forest) trains many models in parallel → reduces variance. Boosting (XGBoost) trains sequentially, each correcting the last → reduces bias.
- Likelihood vs prior (one-liner)?
  Likelihood = "how well θ explains the observed data"; prior = "what we believed about θ before seeing any data."
- Parameters vs hyperparameters?
  Parameters are learned from data (weights). Hyperparameters are chosen before training (learning rate, batch size, number of layers).