Unit II — Introduction to Neural Networks

Chapter 6 — MLP & Backpropagation (incl. BPTT overview)


Objectives
Understand forward pass through an MLP · Derive backpropagation using the chain rule · Know BPTT and its gradient issues

1. Single-layer vs Multi-layer Networks

A Multi-Layer Perceptron (MLP) has: an input layer, one or more hidden layers, and an output layer. Adding hidden layers with non-linear activations allows the network to learn hierarchical representations and solve non-linearly separable problems.

Forward pass for layer \(\ell\):

\[z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \quad a^{(\ell)} = \sigma\!\left(z^{(\ell)}\right)\]

Exam-ready points
  • Input layer: no computation, just passes features.
  • Hidden layers: intermediate representations.
  • Output layer: produces final prediction (sigmoid for binary, softmax for multi-class, linear for regression).
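The layer-by-layer forward pass above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the `forward` helper, and the seeded random weights are all made up for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def forward(x, params):
    """Apply z = W a + b, a = act(z) for each layer in turn."""
    a = x
    for W, b, act in params:
        z = W @ a + b
        a = act(z)
    return a

rng = np.random.default_rng(0)
# Example network: x(3) -> hidden(4, ReLU) -> output(2, softmax)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)

probs = forward(rng.standard_normal(3), [(W1, b1, relu), (W2, b2, softmax)])
print(probs, probs.sum())            # softmax outputs sum to 1
```

Note how the output activation (softmax here) is the only task-specific piece; swapping it for sigmoid or the identity gives the binary and regression cases from the bullet above.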

2. Backpropagation

Backpropagation (Rumelhart, Hinton & Williams, 1986) efficiently computes \(\partial \mathcal{L}/\partial W^{(\ell)}\) for all layers using the chain rule in reverse order.

Define the error signal \(\delta^{(\ell)} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}}\). Then:

\[\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'\!\left(z^{(L)}\right)\]

\[\delta^{(\ell)} = \left(W^{(\ell+1)\top} \delta^{(\ell+1)}\right) \odot \sigma'\!\left(z^{(\ell)}\right)\]

\[\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \delta^{(\ell)} a^{(\ell-1)\top}, \quad \frac{\partial \mathcal{L}}{\partial b^{(\ell)}} = \delta^{(\ell)}\]
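The \(\delta\) recursion can be sanity-checked numerically: the analytic gradient it produces should agree with a central finite difference of the loss. The sketch below does this for a tiny one-hidden-layer tanh network; the seed, shapes, and squared-error loss are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
w2, b2 = rng.standard_normal(4), rng.standard_normal()

def loss_fn(W1_):
    a1 = np.tanh(W1_ @ x + b1)           # hidden layer, tanh activation
    return 0.5 * (w2 @ a1 + b2) ** 2     # squared error against target 0

# Analytic gradient via the delta recursion
z1 = W1 @ x + b1
a1 = np.tanh(z1)
y = w2 @ a1 + b2
d2 = y                                   # dL/dz2 for linear output and 0.5*y^2 loss
d1 = (w2 * d2) * (1 - a1 ** 2)           # delta^(1) = (W2^T delta^(2)) ⊙ tanh'(z1)
dW1 = np.outer(d1, x)                    # dL/dW1 = delta^(1) a^(0)^T

# Central finite difference on one weight entry
eps = 1e-6
E = np.zeros_like(W1); E[0, 0] = eps
num = (loss_fn(W1 + E) - loss_fn(W1 - E)) / (2 * eps)
print(abs(num - dW1[0, 0]))              # difference should be tiny
```

This "gradient check" is the standard way to debug a hand-written backprop before trusting it.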

Vanishing & Exploding Gradients
With sigmoid, \(\sigma'(z) \le 0.25\) (and \(\tanh'(z) \le 1\), with equality only at \(z=0\)); multiplied across many layers, these small factors shrink the gradient exponentially, so it vanishes. Mitigations: ReLU activations, batch normalisation, residual connections, and gradient clipping (for the exploding case).
Exam-ready points
  • Backprop is just the chain rule applied systematically from output to input.
  • Time complexity of one backprop pass ≈ 2× forward pass.
  • Computational graph enables automatic differentiation (autograd in PyTorch/TF).
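The \(\sigma'(z) \le 0.25\) bound makes the vanishing effect easy to quantify: even in the best case, a chain of \(L\) sigmoid layers scales the gradient by at most \(0.25^L\). A small numeric demonstration (the pre-activation values are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# sigma'(z) = sigma(z)(1 - sigma(z)) peaks at 0.25 (at z = 0), so a chain
# of L sigmoid layers scales the gradient by at most 0.25**L:
for L in (5, 10, 20):
    print(L, 0.25 ** L)

# With realistic pre-activations the per-layer factors are smaller still:
z = np.array([1.5, -2.0, 0.5, 3.0, -1.0])   # made-up pre-activations
dsig = sigmoid(z) * (1 - sigmoid(z))
print(np.prod(dsig))                         # product over 5 "layers"
```

By 20 layers the best-case factor is already below \(10^{-12}\), which is why deep sigmoid networks were so hard to train before ReLU and residual connections.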

3. Backpropagation Through Time (BPTT) overview

For recurrent networks processing a sequence \((x_1, \ldots, x_T)\), the shared weights appear at every timestep. BPTT "unrolls" the RNN for \(T\) steps and applies backprop on the resulting (deep) computational graph:

\[\frac{\partial \mathcal{L}}{\partial W} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W}\]

The contribution of the loss at time \(T\) to the gradient at an earlier time \(t\) involves a product of Jacobians \(\prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}\). For long sequences, these products either vanish or explode → LSTM/GRU were designed to fix this.

Exam-ready points
  • Truncated BPTT: backpropagate only \(k\) steps back — reduces compute and gradient issues.
  • Gradient clipping: if \(\|\nabla\|_2 > \tau\), scale it to \(\tau\) — standard fix for exploding gradients in RNNs.
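The clipping rule in the last bullet is a one-liner in practice. A minimal sketch, where the helper name `clip_by_norm` and the example gradient are my own illustration (deep-learning frameworks provide equivalents):

```python
import numpy as np

def clip_by_norm(grad, tau):
    """If ||grad||_2 > tau, rescale grad so its norm equals tau; else return it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > tau:
        return grad * (tau / norm)
    return grad

g = np.array([3.0, 4.0])                 # ||g||_2 = 5
clipped = clip_by_norm(g, 1.0)
print(clipped, np.linalg.norm(clipped))  # direction preserved, norm now 1.0
```

Note that clipping rescales the whole gradient vector, preserving its direction; it only caps the step size, which is why it tames exploding gradients without biasing the descent direction.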

4. Architectural Design Issues

  • Width vs Depth: deeper is generally better (compositional features); wider increases capacity within a layer.
  • Skip connections (ResNets): \(a^{(\ell)} = \sigma\!\left(z^{(\ell)} + a^{(\ell-2)}\right)\) — the identity shortcut acts as a gradient highway and allows very deep networks (100+ layers).
  • Weight initialisation: Xavier (Glorot) for sigmoid/tanh; He for ReLU. Avoids vanishing/exploding gradients at initialisation.
  • Output layer: match activation to task — sigmoid (binary), softmax (multi-class), linear (regression).
Exam-ready points
  • He init: \(W \sim \mathcal{N}\!\left(0,\; 2/n_{\text{in}}\right)\), i.e. standard deviation \(\sqrt{2/n_{\text{in}}}\).
  • Xavier init: \(W \sim \mathcal{N}\!\left(0,\; 2/(n_{\text{in}}+n_{\text{out}})\right)\), i.e. standard deviation \(\sqrt{2/(n_{\text{in}}+n_{\text{out}})}\).
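The two schemes differ only in the variance passed to the normal sampler. A hedged sketch (the helper names `he_init`/`xavier_init` and the layer sizes are my own, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # Variance 2/n_in keeps activation variance roughly constant under ReLU.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def xavier_init(n_in, n_out):
    # Variance 2/(n_in + n_out) balances forward and backward signal (tanh/sigmoid).
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

W = he_init(512, 256)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

Note that `rng.normal` takes the standard deviation, so each helper passes the square root of the target variance.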

Worked Example — Manual backprop (1 hidden layer, 1 sample)

import numpy as np
# Network: x(2) -> h(2,ReLU) -> y(1,sigmoid)
x = np.array([0.5, 0.8])
y_true = 1.0
W1 = np.array([[0.2,-0.4],[0.5,0.1]]); b1 = np.array([0.1, -0.1])
W2 = np.array([[0.3, -0.2]]); b2 = np.array([0.0])
# Forward
z1 = W1 @ x + b1; a1 = np.maximum(0, z1)
z2 = W2 @ a1 + b2; a2 = 1/(1+np.exp(-z2))
loss = -(y_true*np.log(a2) + (1-y_true)*np.log(1-a2))
# Backward (sigmoid + binary cross-entropy: dL/dz2 simplifies to a2 - y)
d2 = a2 - y_true                            # delta^(2) = dL/dz2, shape (1,)
dW2 = np.outer(d2, a1); db2 = d2            # dL/dW2 matches W2's shape (1, 2)
d1 = (W2.T @ d2) * (z1 > 0)                 # delta^(1) = (W2^T delta^(2)) ⊙ ReLU'(z1)
dW1 = np.outer(d1, x); db1 = d1             # dL/dW1 = delta^(1) x^T
print(f"loss={loss[0]:.4f}, dW1={dW1.round(4)}")

Viva Questions

  1. State the chain rule and explain its role in backpropagation.
  2. What causes the vanishing gradient problem in deep sigmoid networks?
  3. How do residual/skip connections help train very deep networks?
  4. What is the difference between backprop and BPTT?
  5. Why does He initialisation use \(\sqrt{2/n_{\text{in}}}\) specifically for ReLU?