Chapter 9 — CNNs: Convolution, Pooling, Architectures
Unit III · Neural Networks in Practice
1. Convolution Layer
A convolutional layer applies learned filters (kernels) across the spatial dimensions of the input. (In deep-learning convention the kernel is not flipped, so this is strictly cross-correlation, but it is universally called convolution.) For a 2D input \(I\) and kernel \(K\) of size \(f\times f\):
\[(I * K)[i,j] = \sum_{m=0}^{f-1}\sum_{n=0}^{f-1} I[i+m, j+n]\cdot K[m,n]\]
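The sum above can be sketched in a few lines of NumPy (a minimal stride-1, no-padding cross-correlation; `conv2d_valid` is an illustrative helper, not a library function):

```python
import numpy as np

def conv2d_valid(I, K):
    """Stride-1, no-padding 2D cross-correlation, mirroring the formula above."""
    n, f = I.shape[0], K.shape[0]
    out = n - f + 1  # output size for p = 0, s = 1
    O = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            O[i, j] = np.sum(I[i:i+f, j:j+f] * K)
    return O

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.ones((3, 3))                 # unnormalised 3x3 summing kernel
print(conv2d_valid(I, K).shape)     # (2, 2): 4 - 3 + 1 = 2
```

In practice frameworks vectorise this (im2col, FFT); the loop form is only meant to mirror the formula term by term.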
Key parameters: filters (number of output channels), kernel_size, stride (step size), padding (same vs valid). Output spatial size:
\[\text{out} = \left\lfloor\frac{n - f + 2p}{s}\right\rfloor + 1\]
Local connectivity: each output depends on a small region — efficient, exploits spatial structure.
from tensorflow.keras import layers  # assumed import for the Keras snippets below

layers.Conv2D(32, kernel_size=3, strides=1, padding='same', activation='relu')
Exam-ready points
- A conv layer with 32 filters of size 3×3 on a 28×28×1 input: 32×(3×3×1+1) = 320 parameters.
- "same" padding: output has same spatial size as input. "valid": no padding, output shrinks.
- 1×1 convolutions: channel mixing without spatial processing — used in Inception/ResNet.
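The parameter and output-size counts above can be checked with two small helpers (illustrative names, not a library API):

```python
def conv_params(f, c_in, filters):
    # Each filter has f*f*c_in weights plus one bias.
    return filters * (f * f * c_in + 1)

def conv_out(n, f, p, s):
    # floor((n - f + 2p) / s) + 1
    return (n - f + 2 * p) // s + 1

print(conv_params(3, 1, 32))   # 320, matching the 28x28x1 example above
print(conv_out(28, 3, 1, 1))   # 28: 'same' padding (p=1 for f=3) keeps size at stride 1
print(conv_out(28, 3, 0, 1))   # 26: 'valid' padding shrinks the map
```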
2. Pooling and Flattening
Max Pooling: takes the maximum value in each pool window — retains the most prominent feature activation and provides a degree of local translation invariance.
Average Pooling: takes the mean — smoother representation.
Global Average Pooling (GAP): reduces each channel's spatial map to a single number — replaces Flatten + Dense in modern architectures, reducing overfitting.
layers.MaxPooling2D(pool_size=2, strides=2) # halves spatial dimensions
layers.GlobalAveragePooling2D() # modern alternative to Flatten
Exam-ready points
- Pooling has no learnable parameters.
- After GAP on a 7×7×512 feature map → 512-dim vector (no 7×7×512 = 25,088 flattening needed).
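Both points can be checked on toy arrays in NumPy (a shape-level sketch, not the Keras implementation; the reshape trick assumes the spatial size divides evenly by the pool size):

```python
import numpy as np

x = np.random.rand(7, 7, 512)           # one channels-last feature map

# Global Average Pooling: mean over the two spatial axes -> one value per channel
gap = x.mean(axis=(0, 1))
print(gap.shape)                        # (512,) instead of a 25,088-dim flatten

# 2x2 max pooling with stride 2 on an 8x8 map: halves each spatial dimension
y = np.random.rand(8, 8)
pooled = y.reshape(4, 2, 4, 2).max(axis=(1, 3))
print(pooled.shape)                     # (4, 4)
```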
3. CNN Architecture Patterns
| Architecture | Year | Key innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First successful CNN (digits) | 5 |
| AlexNet | 2012 | ReLU, Dropout, GPU training | 8 |
| VGGNet | 2014 | All 3×3 kernels, uniform design | 16/19 |
| GoogLeNet | 2014 | Inception modules, GAP | 22 |
| ResNet | 2015 | Skip connections → 152 layers | 50/101/152 |
| MobileNet | 2017 | Depthwise separable conv → mobile | 28 |
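MobileNet's efficiency can be seen by counting weights: a standard \(f\times f\) convolution uses \(f^2 \cdot C_{in} \cdot C_{out}\) weights, while a depthwise separable convolution uses \(f^2 \cdot C_{in}\) (depthwise) plus \(C_{in} \cdot C_{out}\) (pointwise 1×1). A quick check, ignoring biases for simplicity:

```python
def standard_conv(f, c_in, c_out):
    # One f x f x c_in filter per output channel.
    return f * f * c_in * c_out

def depthwise_separable(f, c_in, c_out):
    # Depthwise: one f x f filter per input channel; pointwise: 1x1 conv.
    return f * f * c_in + c_in * c_out

f, c_in, c_out = 3, 128, 256
print(standard_conv(f, c_in, c_out))        # 294912
print(depthwise_separable(f, c_in, c_out))  # 33920, roughly 8.7x fewer
```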
4. Practical CNN Checklist
- Input: normalise to [0,1] or zero-mean/unit-var; use data augmentation (flip, crop, rotate).
- Architecture: Conv → BN → ReLU blocks; reduce spatial size gradually; increase channels gradually.
- Regularisation: Dropout after FC layers; L2 weight decay; data augmentation.
- Optimiser: Adam or SGD with momentum + cosine LR decay.
- Evaluation: confusion matrix, per-class precision/recall for imbalanced datasets.
data_aug = keras.Sequential([
layers.RandomFlip('horizontal'),
layers.RandomRotation(0.1),
layers.RandomZoom(0.1),
])
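The normalisation step from the checklist can be sketched in NumPy for uint8, channels-last images (statistics must come from the training set; the variable names are illustrative):

```python
import numpy as np

x = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Scale to [0, 1]
x01 = x.astype(np.float32) / 255.0

# Zero-mean / unit-variance per channel, with a small epsilon for stability
mean = x01.mean(axis=(0, 1, 2), keepdims=True)
std = x01.std(axis=(0, 1, 2), keepdims=True)
x_std = (x01 - mean) / (std + 1e-7)
```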
Worked Example — Small CNN for CIFAR-10
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
layers.Conv2D(32, 3, padding='same', activation='relu', input_shape=(32,32,3)),
layers.Conv2D(32, 3, padding='same', activation='relu', input_shape=(32,32,3)),
layers.BatchNormalization(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(2), layers.Dropout(0.2),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.BatchNormalization(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(2), layers.Dropout(0.3),
layers.GlobalAveragePooling2D(),
layers.Dense(128, activation='relu'), layers.Dropout(0.4),
layers.Dense(10, activation='softmax')
])
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
# Achieves ~80% val accuracy on CIFAR-10 in 30 epochs
Viva Questions
- How many parameters does a Conv2D layer with 64 filters (3×3) applied to a 32-channel input have?
- What is the benefit of parameter sharing in CNNs?
- Explain how ResNet's skip connections solve the vanishing gradient problem.
- What is depthwise separable convolution (used in MobileNet) and why is it efficient?
- Compare Global Average Pooling vs Flatten before the output Dense layer.