Chapter 2 — Shallow ML Refresher: Linear Models
Unit I · Introduction to Deep Learning
1. Linear Classifiers
A linear classifier partitions the input space with a hyperplane: \(\hat{y} = \text{sign}(w^\top x + b)\). For probabilistic output, logistic regression maps the linear score through the sigmoid: \(P(y=1|x) = \sigma(w^\top x + b) = \frac{1}{1+e^{-(w^\top x+b)}}\).
Training minimises cross-entropy loss with gradient descent:
\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\]
For multi-class, the softmax function generalises the sigmoid: \(\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\).
Exam-ready points
- Linear regression loss: MSE \(= \frac{1}{N}\sum(y_i - \hat{y}_i)^2\).
- Logistic regression: binary cross-entropy loss, sigmoid output.
- Gradient update: \(w \leftarrow w - \eta \nabla_w \mathcal{L}\).
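The sigmoid is the two-class special case of the softmax: applying softmax to the score pair \((z, 0)\) recovers \(\sigma(z)\). A minimal numerical check (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = 1.7
# Two-class softmax over scores [z, 0] reduces to sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]
p_sigmoid = sigmoid(z)
print(p_softmax, p_sigmoid)  # both ≈ 0.8455
```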
2. Linear Separability
A dataset is linearly separable if a hyperplane exists that correctly classifies all points. XOR is the classic non-linearly separable problem — no single line can separate the four XOR points.
The perceptron convergence theorem guarantees convergence in a finite number of updates if the data is linearly separable; on non-separable data the perceptron algorithm never converges. In that case we need either: (a) the kernel trick (SVM), (b) hand-designed feature transforms, or (c) multi-layer networks.
- Minsky & Papert (1969) showed a single-layer perceptron cannot solve XOR — this led to the first "AI winter".
- Multi-layer networks overcome linear separability limits via non-linear activations.
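The non-separability of XOR can be illustrated numerically by brute force: no choice of \((w_1, w_2, b)\) on a coarse grid classifies all four points correctly. This is a sketch, not a proof (the exercise below asks for the full enumeration argument):

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

# Search a coarse grid of (w1, w2, b); check whether any linear
# rule sign(w.x + b) reproduces the XOR labels exactly.
vals = np.linspace(-2, 2, 9)
found = any(
    np.array_equal((X @ np.array([w1, w2]) + b > 0).astype(int), y)
    for w1, w2, b in itertools.product(vals, vals, vals)
)
print(found)  # False: no hyperplane on this grid separates XOR
```

The contradiction is easy to see analytically: the four constraints require \(w_2 + b > 0\) and \(w_1 + b > 0\) but also \(b \le 0\) and \(w_1 + w_2 + b \le 0\), which cannot all hold.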
3. Feature Transforms
A feature transform \(\phi: \mathbb{R}^d \to \mathbb{R}^D\) maps the input to a higher-dimensional (or different) space where a linear classifier succeeds. Examples: polynomial features \((x_1, x_2) \to (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\), radial basis functions, Fourier features.
SVMs use the kernel trick: \(K(x,x') = \phi(x)^\top \phi(x')\) computed implicitly, avoiding explicit high-dimensional mapping. Common kernels: RBF \(K = e^{-\gamma\|x-x'\|^2}\), polynomial \(K = (x^\top x' + c)^d\).
Exam-ready points
- Feature transforms are the shallow equivalent of learned hidden layers in deep networks.
- RBF SVM is a universal classifier but requires careful \(\gamma\) and \(C\) tuning.
- Deep networks learn \(\phi\) end-to-end rather than designing it manually.
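The kernel trick can be verified numerically for the degree-2 polynomial kernel with \(c = 1\): the implicit value \((x^\top x' + 1)^2\) equals the dot product of explicit feature maps. The feature-map form below is an assumed standard expansion for 2-D input:

```python
import numpy as np

def phi(x, c=1.0):
    # Explicit degree-2 polynomial feature map for 2-D input,
    # chosen so that phi(x).phi(x') = (x.x' + c)^2
    x1, x2 = x
    return np.array([c,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
K_implicit = (x @ xp + 1.0) ** 2   # kernel computed in input space
K_explicit = phi(x) @ phi(xp)      # same value via explicit mapping
print(K_implicit, K_explicit)      # both 0.25
```

The implicit form needs only a 2-D dot product, while the explicit map already has 6 dimensions; for high-degree kernels or RBF (infinite-dimensional \(\phi\)) this gap is what makes the trick essential.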
Worked Example — Logistic Regression gradient step
import numpy as np
# Single sample gradient update
x = np.array([1.0, 2.0]); y = 1; w = np.array([0.1, -0.2]); b = 0.0; lr = 0.1
z = w @ x + b # linear score
p = 1 / (1 + np.exp(-z)) # sigmoid → p ≈ 0.426 (z = -0.3)
loss = -y * np.log(p) # binary cross-entropy
dw = (p - y) * x # gradient w.r.t. w
db = (p - y) # gradient w.r.t. b
w -= lr * dw; b -= lr * db # update
print(f"loss={loss:.4f}, w={w}, b={b:.4f}")
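Repeating that single-sample step drives the loss down. A minimal sketch (same toy data as above) showing the loss after 50 iterations:

```python
import numpy as np

x = np.array([1.0, 2.0]); y = 1
w = np.array([0.1, -0.2]); b = 0.0; lr = 0.1
losses = []
for _ in range(50):
    p = 1 / (1 + np.exp(-(w @ x + b)))  # sigmoid of linear score
    losses.append(-np.log(p))           # since y = 1, loss = -log(p)
    w -= lr * (p - y) * x               # gradient step on w
    b -= lr * (p - y)                   # gradient step on b
print(losses[0], losses[-1])  # final loss is well below the initial one
```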
Exercises
- Derive the gradient of binary cross-entropy loss with respect to \(w\).
- Show that XOR is not linearly separable by exhaustive enumeration.
- Using polynomial features of degree 2 on inputs \((x_1, x_2)\), write out all basis functions.
Viva Questions
- What is the decision boundary of a logistic regression classifier?
- Why is XOR a problem for single-layer perceptrons?
- Explain the kernel trick in SVMs.
- What is the difference between L1 and L2 regularisation effects on weights?
- How does the softmax function generalise the sigmoid to multi-class problems?