Chapter 3 — Loss Functions & Optimization Basics
Unit I · Introduction to Deep Learning
0. The ML Training Pipeline & Critical Traps
Every training run follows the sequence shown in Figure 3.1. The order matters — violating it causes subtle but serious bugs:
Figure 3.1 — Always split before any preprocessing; fit preprocessors on train only
- Normalise (scale to [0,1] or unit-norm): best for distances, embeddings, CNNs with sigmoid output.
- Always apply the same fitted transform to the val and test sets.
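A minimal sketch of the fit-on-train-only rule, using made-up numbers for a single feature column (min-max scaling fitted on the train split, then reused unchanged on the val split):

```python
import numpy as np

# Toy feature column: first 6 rows are "train", last 2 are "val".
data = np.array([10., 20., 30., 40., 50., 60., 70., 80.])
train, val = data[:6], data[6:]

# Fit min-max scaling on the TRAIN split only.
lo, hi = train.min(), train.max()
scale = lambda x: (x - lo) / (hi - lo)

train_scaled = scale(train)  # lies in [0, 1] by construction
val_scaled = scale(val)      # may exceed 1 -- expected and correct
print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
print(val_scaled)                              # [1.2 1.4]
```

Fitting the scaler on the full dataset instead would leak val/test statistics into training — the bug the figure warns about.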
1. Loss Functions
The loss (cost) function measures how far the model's predictions are from the ground truth. Choosing the right loss is crucial:
| Task | Loss | Formula |
|---|---|---|
| Regression | MSE | \(\frac{1}{N}\sum(y_i-\hat{y}_i)^2\) |
| Regression (robust) | MAE / Huber | MAE: \(\frac{1}{N}\sum|y_i-\hat{y}_i|\); Huber: \(\frac{1}{2}e^2\) if \(|e|\le\delta\), else \(\delta(|e|-\frac{\delta}{2})\) |
| Binary classification | Binary Cross-Entropy | \(-[y\log\hat{p}+(1-y)\log(1-\hat{p})]\) |
| Multi-class | Categorical Cross-Entropy | \(-\sum_k y_k \log\hat{p}_k\) |
- MSE heavily penalises large errors (squared); MAE is more robust to outliers.
- Cross-entropy loss + softmax output = maximum likelihood under categorical distribution.
- Huber loss: MSE for small errors, MAE for large — combines the best of both.
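The losses in the table are one-liners in NumPy; a small sketch with made-up values to make the formulas concrete:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # squaring penalises large errors
mae = np.mean(np.abs(y_true - y_pred))  # linear penalty, robust to outliers

# Binary cross-entropy for a single confident, correct prediction
y, p = 1.0, 0.9
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse, mae, bce)  # 0.375 0.5 ≈0.105
```

Note how the single error of size 1 contributes 1.0 to the MSE sum but only 1.0 to the MAE sum at linear (not squared) weight — this is the robustness trade-off in the bullets above.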
2. Gradient Descent
Gradient descent iteratively moves parameters in the direction of steepest descent of the loss:
\[\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)\]
Three variants differ in how many samples are used per update:
- Batch GD: gradient over entire dataset — stable but slow per update.
- Stochastic GD (SGD): gradient over one sample — noisy but fast; natural regulariser.
- Mini-batch GD: gradient over a batch of \(B\) samples (typically 32–512) — best of both; enables GPU parallelism.
- The gradient points in the direction of steepest ascent; we subtract it to move downhill.
- SGD with momentum: \(v \leftarrow \beta v - \eta g\), \(\theta \leftarrow \theta + v\) — dampens oscillations.
- Epoch = one full pass through the training set.
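The momentum update above can be sketched on a toy loss \(L(\theta)=\theta^2\) (learning rate and starting point chosen purely for illustration):

```python
# Minimise L(theta) = theta^2 with SGD + momentum:
#   v <- beta*v - eta*g,  theta <- theta + v
theta, v = 5.0, 0.0
eta, beta = 0.1, 0.9
for _ in range(200):
    g = 2 * theta           # gradient of theta^2 points uphill
    v = beta * v - eta * g  # velocity accumulates past gradients
    theta = theta + v       # net effect: damped descent toward 0
print(theta)  # close to 0
```

The velocity term averages recent gradients, so oscillating components cancel while the consistent downhill direction accumulates — this is the "dampens oscillations" claim in action.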
3. Batch Optimization & Training Pipeline
A standard mini-batch training loop:
```python
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:    # 1. fetch mini-batch
        y_pred = model(X_batch)            # 2. forward pass
        loss = criterion(y_pred, y_batch)  # 3. compute loss
        optimizer.zero_grad()              # 4. clear old gradients
        loss.backward()                    # 5. backprop
        optimizer.step()                   # 6. update weights
```
Common adaptive optimisers:
- AdaGrad: scales \(\eta\) by inverse sqrt of sum of squared gradients — good for sparse features.
- RMSProp: exponential moving average of squared gradients — fixes AdaGrad's decaying lr.
- Adam: combines momentum + RMSProp; update: \(\theta \leftarrow \theta - \frac{\eta}{\sqrt{\hat{v}}+\epsilon}\hat{m}\). Default choice for most DL tasks.
- Adam hyperparameters: \(\beta_1=0.9\), \(\beta_2=0.999\), \(\epsilon=10^{-8}\) (defaults).
- Always call `optimizer.zero_grad()` before `backward()` to avoid gradient accumulation.
- Monitor validation loss each epoch to detect overfitting early.
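A single Adam update for one scalar parameter, sketched with the default hyperparameters above and a made-up gradient, to show where the bias-correction terms enter:

```python
import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta, m, v, t = 1.0, 0.0, 0.0, 0

g = 0.5                               # pretend gradient for this step
t += 1
m = beta1 * m + (1 - beta1) * g       # first moment (momentum)
v = beta2 * v + (1 - beta2) * g ** 2  # second moment (RMSProp-style)
m_hat = m / (1 - beta1 ** t)          # bias correction: m, v start at 0,
v_hat = v / (1 - beta2 ** t)          # so early estimates are scaled up
theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # ≈ 0.999: the first step has size ≈ eta
```

Without the corrections, \(m\) and \(v\) would be biased toward zero for small \(t\) (they are initialised at 0), making early steps far too small — which previews exercise 3 below.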
4. Evaluation Metrics Reference
Classification metrics (choose based on class balance and cost of errors):
Figure 3.2 — Precision, Recall and F1 at a glance
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | False alarms are costly (e.g., spam filter) |
| Recall (Sensitivity) | TP/(TP+FN) | Missing true cases is costly (e.g., disease detection) |
| F1 Score | 2·P·R/(P+R) | Imbalanced classes; balance of P & R needed |
| ROC-AUC | Area under TPR vs FPR curve | Threshold-free ranking quality |
| PR-AUC | Area under Precision vs Recall | Heavy class imbalance (ROC can be misleading) |
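The classification formulas above computed from hypothetical confusion counts (TP/FP/FN/TN values are made up):

```python
# Hypothetical confusion counts for a binary classifier.
TP, FP, FN, TN = 40, 10, 20, 30

accuracy  = (TP + TN) / (TP + TN + FP + FN)       # 0.70
precision = TP / (TP + FP)                        # 0.80: few false alarms
recall    = TP / (TP + FN)                        # ≈0.67: misses a third
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Here accuracy alone would hide the fact that a third of the true positives are missed — exactly why the table recommends recall when missing true cases is costly.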
Regression metrics:
| Metric | Formula | Note |
|---|---|---|
| MAE | \(\frac{1}{N}\sum|y_i-\hat y_i|\) | Robust; same units as target |
| RMSE | \(\sqrt{\frac{1}{N}\sum(y_i-\hat y_i)^2}\) | Penalises large errors more; same units as target |
| R² | \(1 - \frac{\sum(y_i-\hat y_i)^2}{\sum(y_i-\bar y)^2}\) | Variance explained; compare against baselines, not in isolation |
- High train accuracy + low test accuracy → overfitting (high variance). Fix: more data, regularisation, early stopping.
- Accuracy is misleading on imbalanced datasets — prefer F1 or PR-AUC.
- ROC-AUC = 0.5 → random classifier; = 1.0 → perfect classifier.
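The regression metrics above are equally quick to check numerically; a sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae  = np.mean(np.abs(y_true - y_pred))            # same units as target
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # weights large errors more
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                           # fraction of variance explained
print(mae, rmse, r2)
```

RMSE exceeds MAE here because of the single error of size 2 — the squared penalty at work, mirroring the MSE-vs-MAE contrast in Section 1.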
Worked Example — MSE gradient for linear regression
```python
import numpy as np

# Closed-form gradient of MSE for y = wx + b
# L = (1/N) * sum((y - wx - b)^2)
# dL/dw = (-2/N) * sum((y_i - y_hat_i) * x_i)
# dL/db = (-2/N) * sum((y_i - y_hat_i))
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
w, b, lr = 0.0, 0.0, 0.01
for _ in range(1000):
    y_hat = w * X + b
    res = y - y_hat
    w += lr * (2 / len(X)) * (res @ X)  # w -= lr * dL/dw
    b += lr * (2 / len(X)) * res.sum()  # b -= lr * dL/db
print(f"w={w:.3f}, b={b:.3f}")  # → w≈0.6, b≈2.2
```
Exercises
- Derive \(\partial \mathcal{L}_{\text{BCE}} / \partial z\) where \(z = w^\top x + b\) and \(\hat{p} = \sigma(z)\).
- Compare Batch GD, SGD, and Mini-batch GD on a table covering: convergence noise, GPU usage, memory cost.
- Why does Adam use bias correction terms \(\hat{m} = m/(1-\beta_1^t)\)?
Viva Questions
- What loss function should you use for a 5-class image classification problem?
- What is a saddle point and why is it problematic for gradient descent?
- Explain the role of momentum in SGD with momentum.
- What happens if the learning rate is set too high?
- Differentiate RMSProp and Adam.
- High train accuracy + low test accuracy?
  Overfitting (high variance). Fix: more data, regularisation (L2/dropout), simpler model, early stopping.
- Why do we need activation functions?
  Without non-linearity, stacking layers collapses to a single linear transform; activations let networks model complex functions.
- Bagging vs boosting?
  Bagging (Random Forest) trains many models in parallel → reduces variance. Boosting (XGBoost) trains sequentially, each correcting the last → reduces bias.
- Likelihood vs prior (one-liner)?
  Likelihood = "how well θ explains the observed data"; prior = "what we believed about θ before seeing any data."
- Parameters vs hyperparameters?
  Parameters are learned from data (weights). Hyperparameters are chosen before training (learning rate, batch size, number of layers).