Unit I — Introduction to Deep Learning

Chapter 2 — Shallow ML Refresher: Linear Models


Objectives
Recall linear and logistic regression · Understand linear separability and its limits · Apply feature transforms to handle non-linearity

1. Linear Classifiers

A linear classifier partitions the input space with a hyperplane: \(\hat{y} = \text{sign}(w^\top x + b)\). For probabilistic output, logistic regression maps the linear score through the sigmoid: \(P(y=1|x) = \sigma(w^\top x + b) = \frac{1}{1+e^{-(w^\top x+b)}}\).

Training minimises cross-entropy loss with gradient descent:

\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \left[y_i \log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\right]\]

For multi-class, the softmax function generalises the sigmoid: \(\hat{p}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\).
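As a minimal NumPy sketch (not part of the original text), the softmax can be computed with the standard max-subtraction trick for numerical stability; note that with two classes and scores \((z, 0)\) it reduces exactly to the sigmoid:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the result unchanged since softmax is shift-invariant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # class scores (logits)
p = softmax(z)
print(p, p.sum())               # probabilities, summing to 1

# With scores (z, 0), softmax recovers the sigmoid of z:
sig = 1 / (1 + np.exp(-1.5))
print(np.isclose(softmax(np.array([1.5, 0.0]))[0], sig))  # True
```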

Exam-ready points
  • Linear regression loss: MSE \(= \frac{1}{N}\sum(y_i - \hat{y}_i)^2\).
  • Logistic regression: binary cross-entropy loss, sigmoid output.
  • Gradient update: \(w \leftarrow w - \eta \nabla_w \mathcal{L}\).

2. Linear Separability

A dataset is linearly separable if a hyperplane exists that correctly classifies all points. XOR is the classic non-linearly separable problem — no single line can separate the four XOR points.

The perceptron convergence theorem guarantees convergence in a finite number of updates if the data is linearly separable; on non-separable data the algorithm cycles and never converges. When data is not separable, we need either: (a) the kernel trick (SVM), (b) feature transforms, or (c) multi-layer networks.

XOR truth table
(0,0)→0 · (0,1)→1 · (1,0)→1 · (1,1)→0 — no linear boundary can separate 0s from 1s.
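The impossibility can be verified numerically (a brute-force sketch, not a proof in the original text): sample many random hyperplanes and check that none classifies all four XOR points. Algebraically, the four sign constraints force \(w_1 + w_2 + 2b\) to be both positive and negative, a contradiction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

def separates(w, b):
    # A hyperplane (w, b) separates XOR iff sign(w.x + b) matches every label.
    pred = (X @ w + b > 0).astype(int)
    return np.array_equal(pred, y)

# Try many random hyperplanes; none can classify all four points.
found = any(separates(rng.normal(size=2), rng.normal())
            for _ in range(100_000))
print(found)   # False
```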
Exam-ready points
  • Minsky & Papert (1969) showed a single-layer perceptron cannot solve XOR — this led to the first "AI winter".
  • Multi-layer networks overcome linear separability limits via non-linear activations.

3. Feature Transforms

A feature transform \(\phi: \mathbb{R}^d \to \mathbb{R}^D\) maps the input to a higher-dimensional (or different) space where a linear classifier succeeds. Examples: polynomial features \((x_1, x_2) \to (x_1, x_2, x_1^2, x_2^2, x_1 x_2)\), radial basis functions, Fourier features.
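For instance, adding the single cross term \(x_1 x_2\) already makes XOR linearly separable. In the sketch below the weights are hand-picked for illustration, not learned:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

# phi(x1, x2) = (x1, x2, x1*x2): the cross term lifts XOR into a
# 3-D space where a single hyperplane suffices.
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b = np.array([1.0, 1.0, -2.0]), -0.5   # hand-picked separating hyperplane
pred = (phi @ w + b > 0).astype(int)
print(pred)   # [0 1 1 0] — matches the XOR labels
```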

SVMs use the kernel trick: \(K(x,x') = \phi(x)^\top \phi(x')\) computed implicitly, avoiding explicit high-dimensional mapping. Common kernels: RBF \(K = e^{-\gamma\|x-x'\|^2}\), polynomial \(K = (x^\top x' + c)^d\).
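The kernel trick can be checked directly for the homogeneous degree-2 polynomial kernel \(K(x,x') = (x^\top x')^2\), whose explicit feature map in 2-D is \(\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\) (a small verification sketch, not from the original text):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

implicit = (x @ xp) ** 2       # kernel value, no explicit mapping needed
explicit = phi(x) @ phi(xp)    # same value via the 3-D feature map
print(implicit, explicit)      # both equal (1*3 + 2*(-1))^2 = 1.0
```

The point of the trick is that `implicit` costs one inner product in the original space, while `explicit` would require materialising a feature vector whose dimension grows rapidly with the polynomial degree.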

Exam-ready points
  • Feature transforms are the shallow equivalent of learned hidden layers in deep networks.
  • RBF SVM is a universal classifier but requires careful \(\gamma\) and \(C\) tuning.
  • Deep networks learn \(\phi\) end-to-end rather than designing it manually.

Worked Example — Logistic Regression gradient step

import numpy as np
# Single sample gradient update
x = np.array([1.0, 2.0]); y = 1; w = np.array([0.1, -0.2]); b = 0.0; lr = 0.1
z = w @ x + b                    # linear score
p = 1 / (1 + np.exp(-z))         # sigmoid  →  p ≈ 0.426
loss = -y * np.log(p)            # binary cross-entropy
dw = (p - y) * x                 # gradient w.r.t. w
db = (p - y)                     # gradient w.r.t. b
w -= lr * dw; b -= lr * db       # update
print(f"loss={loss:.4f}, w={w}, b={b:.4f}")

Exercises

  1. Derive the gradient of binary cross-entropy loss with respect to \(w\).
  2. Show that XOR is not linearly separable by exhaustive enumeration.
  3. Using polynomial features of degree 2 on inputs \((x_1, x_2)\), write out all basis functions.

Viva Questions

  1. What is the decision boundary of a logistic regression classifier?
  2. Why is XOR a problem for single-layer perceptrons?
  3. Explain the kernel trick in SVMs.
  4. What is the difference between L1 and L2 regularisation effects on weights?
  5. How does the softmax function generalise the sigmoid to multi-class problems?