Chapter 8 — Training Deep Networks: Regularization, Dropout, BatchNorm
Unit III · Neural Networks in Practice
1. Optimiser Variants & Learning Rate Schedules
Beyond vanilla SGD, modern optimisers adapt the learning rate per parameter:
- SGD + Momentum: \(v_t = \beta v_{t-1} + \eta \nabla_\theta \mathcal{L}\); \(\theta \leftarrow \theta - v_t\). Typically \(\beta=0.9\).
- Nesterov Momentum: evaluates the gradient at the look-ahead position \(\theta - \beta v_{t-1}\); often converges faster in practice.
- Adam: default choice; combines first moment (momentum) and second moment (RMSProp).
- AdamW: Adam + decoupled weight decay — often better for transformers and large models.
LR Schedules: Step decay, cosine annealing, warm-up + decay (used in transformers).
from tensorflow import keras

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=10000)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)
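The momentum update rule above can be sketched in plain Python on a toy quadratic loss (the loss function, starting point, and step counts here are illustrative, not from the chapter):

```python
# Sketch of SGD + Momentum on L(theta) = 0.5 * theta^2, so dL/dtheta = theta.
def sgd_momentum(theta=5.0, lr=0.1, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        grad = theta                 # gradient of the toy loss
        v = beta * v + lr * grad     # v_t = beta * v_{t-1} + eta * grad
        theta = theta - v            # theta <- theta - v_t
    return theta
```

With \(\beta = 0.9\) the velocity accumulates past gradients, so the iterate overshoots and oscillates briefly before settling near the minimum at 0 — the characteristic momentum behaviour.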
2. Regularization (L1, L2, Early Stopping)
L2 (weight decay): adds \(\frac{\lambda}{2}\|W\|_2^2\) to loss — penalises large weights, encourages small distributed weights. Gradient update: \(W \leftarrow W(1-\lambda\eta) - \eta \nabla_W \mathcal{L}\).
L1: adds \(\lambda\|W\|_1\) — promotes sparsity (many weights → exactly 0).
Early Stopping: monitor validation loss; stop when it stops improving for \(p\) epochs. Implicit regulariser — prevents overfitting without changing the loss function.
from tensorflow.keras import layers, regularizers

layers.Dense(64, activation='relu',
             kernel_regularizer=regularizers.l2(0.01))
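The patience logic behind early stopping can be sketched as a minimal illustration (in Keras the same behaviour comes from the built-in `keras.callbacks.EarlyStopping` callback with `monitor='val_loss'`, `patience=p`, and `restore_best_weights=True`):

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-indexed epoch at which training stops, or
    len(val_losses) if patience is never exhausted."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0     # improvement: reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

# Validation loss stops improving after epoch 3, so with patience=3
# training halts at epoch 6 (3 epochs of no improvement).
losses = [1.0, 0.8, 0.6, 0.7, 0.65, 0.61, 0.62, 0.60]
```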
Exam-ready points
- L2 → Gaussian prior (MAP interpretation); L1 → Laplace prior.
- Data augmentation is also a powerful regulariser for image tasks.
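The L2 ↔ Gaussian-prior point can be made precise with a one-line MAP argument (assuming a zero-mean Gaussian prior with variance \(\sigma^2\)):
\[\hat{W} = \arg\max_W \left[\log p(D \mid W) + \log p(W)\right], \qquad \log p(W) = -\frac{1}{2\sigma^2}\|W\|_2^2 + \text{const}\]
so maximising the posterior is equivalent to minimising \(\mathcal{L}(W) + \frac{\lambda}{2}\|W\|_2^2\) with \(\lambda \propto 1/\sigma^2\); a Laplace prior \(p(W) \propto e^{-\lambda\|W\|_1}\) yields the L1 penalty in the same way.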
3. Dropout
Dropout (Srivastava et al., 2014): during each training step, independently zero each neuron's output with probability \(p\). At test time, scale activations by \(1-p\) (or, equivalently, use inverted dropout during training so no test-time scaling is needed).
Intuition: forces the network not to rely on any single neuron — approximately trains an ensemble of \(2^n\) sub-networks. Prevents co-adaptation of neurons.
layers.Dropout(rate=0.5) # drop 50% of units during training
- After conv layers: rate 0.1–0.25 (or use SpatialDropout2D).
- NOT applied during inference (Keras: model(x, training=False)).
- Inverted dropout (Keras default): multiply by \(1/(1-p)\) during training → no scaling needed at test time.
- Typical rates: 0.3 for smaller networks, 0.5 for fully connected layers.
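Inverted dropout as described above can be sketched in NumPy (the array size and rate are illustrative):

```python
import numpy as np

def inverted_dropout(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return a                        # inference: identity, no scaling
    mask = rng.random(a.shape) >= p     # keep each unit with prob 1 - p
    return a * mask / (1.0 - p)         # scale kept units by 1/(1-p)

a = np.ones(10000)
out = inverted_dropout(a, p=0.5)
# Scaling by 1/(1-p) keeps E[out] equal to a, so activations have the
# same expected magnitude at train and test time.
```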
4. Batch Normalization
BatchNorm (Ioffe & Szegedy, 2015) normalises the pre-activation of each layer over the mini-batch, then applies learnable scale \(\gamma\) and shift \(\beta\):
\[\hat{z} = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \tilde{z} = \gamma \hat{z} + \beta\]
Benefits: allows higher learning rates, reduces sensitivity to initialisation, acts as mild regulariser, accelerates convergence.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128), layers.BatchNormalization(), layers.Activation('relu'),
    layers.Dense(64), layers.BatchNormalization(), layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])
Exam-ready points
- During training: uses batch mean/variance; during inference: uses running (exponential moving) statistics.
- LayerNorm (used in Transformers) normalises over feature dimension instead of batch dimension.
- BatchNorm + ReLU + He init is the standard recipe for training deep networks.
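The train-vs-inference distinction can be sketched in NumPy, matching the equations above (the `running` dict, momentum value, and batch shape are illustrative):

```python
import numpy as np

def batchnorm(z, gamma, beta, running, training=True,
              momentum=0.9, eps=1e-5):
    if training:
        mu, var = z.mean(axis=0), z.var(axis=0)   # batch statistics
        # exponential moving averages, used later at inference time
        running['mean'] = momentum * running['mean'] + (1 - momentum) * mu
        running['var']  = momentum * running['var']  + (1 - momentum) * var
    else:
        mu, var = running['mean'], running['var']  # running statistics
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta                    # learnable scale/shift

rng = np.random.default_rng(0)
z = rng.normal(5.0, 2.0, size=(64, 3))             # shifted, scaled batch
running = {'mean': np.zeros(3), 'var': np.ones(3)}
out = batchnorm(z, gamma=np.ones(3), beta=np.zeros(3), running=running)
# In training mode the output is approximately zero-mean, unit-variance
# per feature, regardless of the input's mean and scale.
```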
Worked Example — Regularised network comparison
# Compare: no reg / L2 / Dropout / BatchNorm
# `models` maps a variant name to a compiled keras.Model
for name, model in models.items():
    h = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)
    val_acc = max(h.history['val_accuracy'])
    print(f"{name}: best val_acc = {val_acc:.4f}")
Expected: all regularised models outperform the no-reg baseline on val set, though train accuracy may be lower.
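One hypothetical way to build the `models` dict that the comparison loop iterates over (layer sizes, input shape, and rates here are illustrative assumptions, not from the chapter):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build(extra=None, reg=None):
    # One hidden layer; `extra` slots in Dropout or BatchNormalization,
    # `reg` slots in a kernel regularizer. Input shape 784 is illustrative.
    hidden = [layers.Dense(64, activation='relu', kernel_regularizer=reg)]
    if extra is not None:
        hidden.append(extra)
    model = keras.Sequential(
        [layers.Input(shape=(784,)), *hidden,
         layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

models = {
    'no_reg':    build(),
    'l2':        build(reg=regularizers.l2(0.01)),
    'dropout':   build(extra=layers.Dropout(0.5)),
    'batchnorm': build(extra=layers.BatchNormalization()),
}
```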
Viva Questions
- Explain the difference between L1 and L2 regularisation in terms of weight distributions.
- How does Dropout act as an ensemble method?
- Why is Dropout disabled during inference?
- What problem does Batch Normalization solve (internal covariate shift)?
- Compare BatchNorm and LayerNorm — when is each preferred?