Chapter 8 — Training Deep Networks: Regularization, Dropout, BatchNorm
Unit III · Neural Networks in Practice
1. Optimiser Variants & Learning Rate Schedules
Beyond vanilla SGD, modern optimisers adapt the learning rate per parameter:
- SGD + Momentum: \(v_t = \beta v_{t-1} + \eta \nabla_\theta \mathcal{L}\); \(\theta \leftarrow \theta - v_t\). Typically \(\beta=0.9\).
- Nesterov Momentum: evaluates the gradient at the look-ahead position \(\theta - \beta v_{t-1}\); often converges faster in practice.
- Adam: default choice; combines first moment (momentum) and second moment (RMSProp).
- AdamW: Adam + decoupled weight decay — often better for transformers and large models.
LR Schedules: Step decay, cosine annealing, warm-up + decay (used in transformers).
from tensorflow import keras

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=10000)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)
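The momentum update rule above can be sketched in plain Python on a toy quadratic loss (the loss function, starting point, and step counts here are illustrative, not from the chapter):

```python
# Sketch of SGD + Momentum on L(theta) = 0.5 * theta^2, so dL/dtheta = theta.
def sgd_momentum(theta=5.0, lr=0.1, beta=0.9, steps=100):
    v = 0.0
    for _ in range(steps):
        grad = theta                 # gradient of the toy loss
        v = beta * v + lr * grad     # v_t = beta * v_{t-1} + eta * grad
        theta = theta - v            # theta <- theta - v_t
    return theta
```

With \(\beta = 0.9\) the velocity accumulates past gradients, so the iterate overshoots and oscillates briefly before settling near the minimum at 0 — the characteristic momentum behaviour.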
2. Regularization (L1, L2, Early Stopping)
L2 (weight decay): adds \(\frac{\lambda}{2}\|W\|_2^2\) to loss — penalises large weights, encourages small distributed weights. Gradient update: \(W \leftarrow W(1-\lambda\eta) - \eta \nabla_W \mathcal{L}\).
L1: adds \(\lambda\|W\|_1\) — promotes sparsity (many weights → exactly 0).
Early Stopping: monitor validation loss; stop when it stops improving for \(p\) epochs. Implicit regulariser — prevents overfitting without changing the loss function.
from tensorflow.keras import layers, regularizers

layers.Dense(64, activation='relu',
             kernel_regularizer=regularizers.l2(0.01))
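The patience logic behind early stopping can be sketched as a minimal illustration (in Keras the same behaviour comes from the built-in `keras.callbacks.EarlyStopping` callback with `monitor='val_loss'`, `patience=p`, and `restore_best_weights=True`):

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-indexed epoch at which training stops, or
    len(val_losses) if patience is never exhausted."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0     # improvement: reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

# Validation loss stops improving after epoch 3, so with patience=3
# training halts at epoch 6 (3 epochs of no improvement).
losses = [1.0, 0.8, 0.6, 0.7, 0.65, 0.61, 0.62, 0.60]
```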
Exam-ready points
- L2 → Gaussian prior (MAP interpretation); L1 → Laplace prior.
- Data augmentation is also a powerful regulariser for image tasks.
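The L2 ↔ Gaussian-prior point can be made precise with a one-line MAP argument (assuming a zero-mean Gaussian prior with variance \(\sigma^2\)):
\[\hat{W} = \arg\max_W \left[\log p(D \mid W) + \log p(W)\right], \qquad \log p(W) = -\frac{1}{2\sigma^2}\|W\|_2^2 + \text{const}\]
so maximising the posterior is equivalent to minimising \(\mathcal{L}(W) + \frac{\lambda}{2}\|W\|_2^2\) with \(\lambda \propto 1/\sigma^2\); a Laplace prior \(p(W) \propto e^{-\lambda\|W\|_1}\) yields the L1 penalty in the same way.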
3. Dropout
Dropout (Srivastava et al., 2014): during each training step, independently zero each neuron's output with probability \(p\). At test time, scale activations by \(1-p\) (or, equivalently, use inverted dropout during training so no test-time scaling is needed).
Intuition: forces the network not to rely on any single neuron — approximately trains an ensemble of \(2^n\) sub-networks. Prevents co-adaptation of neurons.
layers.Dropout(rate=0.5) # drop 50% of units during training
- After conv layers: rate 0.1–0.25 (or use SpatialDropout2D).
- NOT applied during inference (Keras: model(x, training=False)).
- Inverted dropout (Keras default): multiply by \(1/(1-p)\) during training → no scaling needed at test time.
- Typical rates: 0.3 for smaller networks, 0.5 for fully connected layers.
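Inverted dropout as described above can be sketched in NumPy (the array size and rate are illustrative):

```python
import numpy as np

def inverted_dropout(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return a                        # inference: identity, no scaling
    mask = rng.random(a.shape) >= p     # keep each unit with prob 1 - p
    return a * mask / (1.0 - p)         # scale kept units by 1/(1-p)

a = np.ones(10000)
out = inverted_dropout(a, p=0.5)
# Scaling by 1/(1-p) keeps E[out] equal to a, so activations have the
# same expected magnitude at train and test time.
```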
4. Batch Normalization
BatchNorm (Ioffe & Szegedy, 2015) normalises the pre-activation of each layer over the mini-batch, then applies learnable scale \(\gamma\) and shift \(\beta\):
\[\hat{z} = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \tilde{z} = \gamma \hat{z} + \beta\]
Benefits: allows higher learning rates, reduces sensitivity to initialisation, acts as mild regulariser, accelerates convergence.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128), layers.BatchNormalization(), layers.Activation('relu'),
    layers.Dense(64), layers.BatchNormalization(), layers.Activation('relu'),
    layers.Dense(10, activation='softmax')
])
Exam-ready points
- During training: uses batch mean/variance; during inference: uses running (exponential moving) statistics.
- LayerNorm (used in Transformers) normalises over feature dimension instead of batch dimension.
- BatchNorm + ReLU + He init is the standard recipe for training deep networks.
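The train-vs-inference distinction can be sketched in NumPy, matching the equations above (the `running` dict, momentum value, and batch shape are illustrative):

```python
import numpy as np

def batchnorm(z, gamma, beta, running, training=True,
              momentum=0.9, eps=1e-5):
    if training:
        mu, var = z.mean(axis=0), z.var(axis=0)   # batch statistics
        # exponential moving averages, used later at inference time
        running['mean'] = momentum * running['mean'] + (1 - momentum) * mu
        running['var']  = momentum * running['var']  + (1 - momentum) * var
    else:
        mu, var = running['mean'], running['var']  # running statistics
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta                    # learnable scale/shift

rng = np.random.default_rng(0)
z = rng.normal(5.0, 2.0, size=(64, 3))             # shifted, scaled batch
running = {'mean': np.zeros(3), 'var': np.ones(3)}
out = batchnorm(z, gamma=np.ones(3), beta=np.zeros(3), running=running)
# In training mode the output is approximately zero-mean, unit-variance
# per feature, regardless of the input's mean and scale.
```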
Worked Example — Regularised network comparison
# Compare: no reg / L2 / Dropout / BatchNorm
# `models` maps a variant name to a compiled keras.Model
for name, model in models.items():
    h = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)
    val_acc = max(h.history['val_accuracy'])
    print(f"{name}: best val_acc = {val_acc:.4f}")
Expected: all regularised models outperform the no-reg baseline on val set, though train accuracy may be lower.
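One hypothetical way to build the `models` dict that the comparison loop iterates over (layer sizes, input shape, and rates here are illustrative assumptions, not from the chapter):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build(extra=None, reg=None):
    # One hidden layer; `extra` slots in Dropout or BatchNormalization,
    # `reg` slots in a kernel regularizer. Input shape 784 is illustrative.
    hidden = [layers.Dense(64, activation='relu', kernel_regularizer=reg)]
    if extra is not None:
        hidden.append(extra)
    model = keras.Sequential(
        [layers.Input(shape=(784,)), *hidden,
         layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

models = {
    'no_reg':    build(),
    'l2':        build(reg=regularizers.l2(0.01)),
    'dropout':   build(extra=layers.Dropout(0.5)),
    'batchnorm': build(extra=layers.BatchNormalization()),
}
```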
Viva Questions
- Explain the difference between L1 and L2 regularisation in terms of weight distributions.
- How does Dropout act as an ensemble method?
- Why is Dropout disabled during inference?
- What problem does Batch Normalization solve (internal covariate shift)?
- Compare BatchNorm and LayerNorm — when is each preferred?