Chapter 2 — Shallow ML Refresher: Linear Models
Unit I · Introduction to Deep Learning
1. Linear Classifiers
A linear classifier partitions the input space with a hyperplane: \(\hat{y} = \text{sign}(w^\top x + b)\). For probabilistic output, logistic regression maps the linear score through the sigmoid: \(P(y=1|x) = \sigma(w^\top x + b) = \frac{1}{1+e^{-(w^\top x+b)}}\).
Training minimises cross-entropy loss with gradient descent:
\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\]
For multi-class, the softmax function generalises the sigmoid: \(\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\).
Exam-ready points
- Linear regression loss: MSE \(= \frac{1}{N}\sum(y_i - \hat{y}_i)^2\).
- Logistic regression: binary cross-entropy loss, sigmoid output.
- Gradient update: \(w \leftarrow w - \eta \nabla_w \mathcal{L}\).
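The sigmoid is the two-class special case of the softmax: applying softmax to the score pair \((z, 0)\) recovers \(\sigma(z)\). A minimal numerical check (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = 1.7
# Two-class softmax over scores [z, 0] reduces to sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]
p_sigmoid = sigmoid(z)
print(p_softmax, p_sigmoid)  # both ≈ 0.8455
```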
2. Linear Separability
A dataset is linearly separable if a hyperplane exists that correctly classifies all points. XOR is the classic non-linearly separable problem — no single line can separate the four XOR points.
The perceptron convergence theorem guarantees convergence in a finite number of updates if the data is linearly separable; on non-separable data the perceptron algorithm never converges. In that case we need either: (a) the kernel trick (SVM), (b) hand-designed feature transforms, or (c) multi-layer networks.
- Minsky & Papert (1969) showed a single-layer perceptron cannot solve XOR — this led to the first "AI winter".
- Multi-layer networks overcome linear separability limits via non-linear activations.
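The non-separability of XOR can be illustrated numerically by brute force: no choice of \((w_1, w_2, b)\) on a coarse grid classifies all four points correctly. This is a sketch, not a proof (the exercise below asks for the full enumeration argument):

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

# Search a coarse grid of (w1, w2, b); check whether any linear
# rule sign(w.x + b) reproduces the XOR labels exactly.
vals = np.linspace(-2, 2, 9)
found = any(
    np.array_equal((X @ np.array([w1, w2]) + b > 0).astype(int), y)
    for w1, w2, b in itertools.product(vals, vals, vals)
)
print(found)  # False: no hyperplane on this grid separates XOR
```

The contradiction is easy to see analytically: the four constraints require \(w_2 + b > 0\) and \(w_1 + b > 0\) but also \(b \le 0\) and \(w_1 + w_2 + b \le 0\), which cannot all hold.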
3. Feature Transforms
A feature transform \(\phi: \mathbb{R}^d \to \mathbb{R}^D\) maps the input to a higher-dimensional (or different) space where a linear classifier succeeds. Examples: polynomial features \((x_1, x_2) \to (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\), radial basis functions, Fourier features.
SVMs use the kernel trick: \(K(x,x') = \phi(x)^\top \phi(x')\) computed implicitly, avoiding explicit high-dimensional mapping. Common kernels: RBF \(K = e^{-\gamma\|x-x'\|^2}\), polynomial \(K = (x^\top x' + c)^d\).
Exam-ready points
- Feature transforms are the shallow equivalent of learned hidden layers in deep networks.
- RBF SVM is a universal classifier but requires careful \(\gamma\) and \(C\) tuning.
- Deep networks learn \(\phi\) end-to-end rather than designing it manually.
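The kernel trick can be verified numerically for the degree-2 polynomial kernel with \(c = 1\): the implicit value \((x^\top x' + 1)^2\) equals the dot product of explicit feature maps. The feature-map form below is an assumed standard expansion for 2-D input:

```python
import numpy as np

def phi(x, c=1.0):
    # Explicit degree-2 polynomial feature map for 2-D input,
    # chosen so that phi(x).phi(x') = (x.x' + c)^2
    x1, x2 = x
    return np.array([c,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
K_implicit = (x @ xp + 1.0) ** 2   # kernel computed in input space
K_explicit = phi(x) @ phi(xp)      # same value via explicit mapping
print(K_implicit, K_explicit)      # both 0.25
```

The implicit form needs only a 2-D dot product, while the explicit map already has 6 dimensions; for high-degree kernels or RBF (infinite-dimensional \(\phi\)) this gap is what makes the trick essential.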
Worked Example — Logistic Regression gradient step
import numpy as np
# Single sample gradient update
x = np.array([1.0, 2.0]); y = 1; w = np.array([0.1, -0.2]); b = 0.0; lr = 0.1
z = w @ x + b # linear score
p = 1 / (1 + np.exp(-z)) # sigmoid → p ≈ 0.426 (z = -0.3)
loss = -y * np.log(p) # binary cross-entropy
dw = (p - y) * x # gradient w.r.t. w
db = (p - y) # gradient w.r.t. b
w -= lr * dw; b -= lr * db # update
print(f"loss={loss:.4f}, w={w}, b={b:.4f}")
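Repeating that single-sample step drives the loss down. A minimal sketch (same toy data as above) showing the loss after 50 iterations:

```python
import numpy as np

x = np.array([1.0, 2.0]); y = 1
w = np.array([0.1, -0.2]); b = 0.0; lr = 0.1
losses = []
for _ in range(50):
    p = 1 / (1 + np.exp(-(w @ x + b)))  # sigmoid of linear score
    losses.append(-np.log(p))           # since y = 1, loss = -log(p)
    w -= lr * (p - y) * x               # gradient step on w
    b -= lr * (p - y)                   # gradient step on b
print(losses[0], losses[-1])  # final loss is well below the initial one
```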
Exercises
- Derive the gradient of binary cross-entropy loss with respect to \(w\).
- Show that XOR is not linearly separable by exhaustive enumeration.
- Using polynomial features of degree 2 on inputs \((x_1, x_2)\), write out all basis functions.
Viva Questions
- What is the decision boundary of a logistic regression classifier?
- Why is XOR a problem for single-layer perceptrons?
- Explain the kernel trick in SVMs.
- What is the difference between L1 and L2 regularisation effects on weights?
- How does the softmax function generalise the sigmoid to multi-class problems?