Chapter 14 — Vision Tasks: Segmentation, Detection, Captioning
Unit IV · Deep Learning for NLP & Computer Vision
1. Image Segmentation
Segmentation assigns a class label to every pixel. Two types:
- Semantic segmentation: all pixels of the same class get the same label — no instance distinction (two cats → both labelled "cat").
- Instance segmentation: distinguishes individual objects of the same class (cat-1, cat-2). Requires Mask R-CNN or similar.
FCN (Fully Convolutional Network, 2015): replace FC layers with conv layers; use upsampling (transposed conv or bilinear interpolation) to restore spatial resolution.
U-Net (2015): encoder–decoder with skip connections — state-of-the-art for medical image segmentation. Skip connections preserve fine spatial details lost in downsampling.
```
# U-Net skip connection pattern
# Encoder: Conv→Pool→Conv→Pool→...→bottleneck
# Decoder: Upsample→Concat(skip)→Conv→Upsample→Concat(skip)→Conv→...→output
```
Exam-ready points
- Evaluation: IoU (Intersection over Union) = TP / (TP + FP + FN) per class; mean IoU (mIoU) averaged over classes.
- Transposed convolution (deconv) learns upsampling; bilinear interpolation is a fixed upsampling.
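The per-class IoU formula above can be checked on a tiny example. A minimal sketch, assuming two small label masks (the masks and class values here are illustrative):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes present."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:                      # skip classes absent from both masks
            ious.append(tp / denom)
    return float(np.mean(ious))

# Toy 2x4 masks with classes {0: background, 1: cat}
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred   = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 0]])
print(mean_iou(pred, target, num_classes=2))  # 0.6 for both classes here
```

Each class scores IoU = 3/5 on these masks (3 true-positive pixels, 1 false positive, 1 false negative), so the mIoU is 0.6.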
2. Object Detection
Object detection localises (bounding box) and classifies multiple objects in an image simultaneously.
| Approach | Examples | Speed | Accuracy |
|---|---|---|---|
| Two-stage | R-CNN, Fast R-CNN, Faster R-CNN | Slower | Higher |
| One-stage | YOLO, SSD, RetinaNet | Faster (real-time) | Slightly lower |
YOLO (You Only Look Once): divides the image into an \(S\times S\) grid; each cell predicts \(B\) bounding boxes + confidence + class probabilities in a single forward pass. \(S=7, B=2\) for YOLO v1; modern YOLOv8 achieves a state-of-the-art speed/accuracy tradeoff.
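The YOLO v1 output size follows directly from these numbers: each cell predicts \(B\) boxes (4 coordinates + 1 confidence each) plus \(C\) class probabilities. A quick check with the paper's PASCAL VOC settings (\(S=7, B=2, C=20\)):

```python
def yolo_v1_output_size(S=7, B=2, C=20):
    # Each grid cell predicts: B * (x, y, w, h, confidence) + C class probabilities
    per_cell = B * 5 + C
    return S * S * per_cell

print(yolo_v1_output_size())  # 7 * 7 * 30 = 1470
```

This matches the 7×7×30 output tensor of YOLO v1 — a useful figure to remember for exams.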
Faster R-CNN: introduces Region Proposal Network (RPN) sharing convolutional features with detection head — end-to-end trainable, ~5 FPS.
Exam-ready points
- Bounding box output: \((x_c, y_c, w, h)\) — centre coordinates + width/height (normalised).
- NMS (Non-Maximum Suppression): remove overlapping boxes with IoU > threshold; keep highest confidence.
- mAP (mean Average Precision): standard detection metric at IoU threshold 0.5 or 0.5:0.95.
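The NMS procedure above can be sketched in a few lines of greedy filtering. The boxes, scores, and 0.5 threshold below are illustrative; the IoU helper follows the same corner-coordinate convention as the worked example at the end of the chapter:

```python
def iou(a, b):
    # a, b: [x1, y1, x2, y2]
    xA, yA = max(a[0], b[0]), max(a[1], b[1])
    xB, yB = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [[0, 0, 4, 4], [0.5, 0.5, 4.5, 4.5], [10, 10, 14, 14]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # → [0, 2]: box 1 overlaps box 0 (IoU ≈ 0.62) and is suppressed
```

Note the greedy structure: suppression decisions are made per surviving top-scoring box, so a box is only ever compared against boxes already kept.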
3. Automatic Image Captioning
Image captioning generates a natural language description of an image. The dominant approach: encoder–decoder with CNN encoder + LSTM decoder (Vinyals et al., 2015 "Show and Tell"):
- CNN (e.g., InceptionV3, ResNet) encodes image → feature vector.
- Feature vector initialises the LSTM hidden state.
- LSTM decodes one word at a time using previous word as input.
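The decode loop can be sketched with a stand-in next-word function. The `next_word` mapping below is a hypothetical toy; in a real model an LSTM step followed by an argmax over the vocabulary produces each word:

```python
def greedy_decode(next_word, start="<start>", end="<end>", max_len=20):
    """Greedy decoding: feed the previous word back in until <end> or max_len."""
    caption, word = [], start
    while len(caption) < max_len:
        word = next_word(word)          # real model: LSTM step + argmax over vocab
        if word == end:
            break
        caption.append(word)
    return caption

# Hypothetical stand-in for a trained decoder: a fixed word-to-word map
toy = {"<start>": "two", "two": "cats", "cats": "sitting",
       "sitting": "outdoors", "outdoors": "<end>"}
print(greedy_decode(toy.get))  # → ['two', 'cats', 'sitting', 'outdoors']
```

The same loop structure underlies beam search; greedy decoding simply keeps only the single most probable word at each step.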
Attention mechanism (Xu et al., 2015 "Show, Attend and Tell"): at each decoding step, the LSTM attends to different spatial regions of the CNN feature maps — produces better captions and interpretable attention maps.
Exam-ready points
- BLEU score (Bilingual Evaluation Understudy): measures n-gram overlap between generated and reference captions.
- Teacher forcing: during training, feed the ground-truth word as input at each step (not the predicted word).
- Modern approach: Vision Transformer (ViT) encoder + Transformer decoder.
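The n-gram overlap behind BLEU can be illustrated with modified (clipped) n-gram precision; the example sentences are illustrative, and a full BLEU score additionally combines precisions for n = 1..4 with a brevity penalty:

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: candidate n-gram counts capped by reference counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "a cat sits on the mat".split()
ref  = "a cat is on the mat".split()
print(modified_precision(cand, ref, n=1))  # 5/6: all unigrams except "sits" match
print(modified_precision(cand, ref, n=2))  # 3/5 bigrams match
```

Clipping is what stops a degenerate caption like "the the the the" from scoring well: each candidate n-gram can only be credited as many times as it appears in the reference.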
Worked Example — Compute IoU for two boxes
```python
def iou(boxA, boxB):
    # boxA, boxB: [x1, y1, x2, y2]
    xA = max(boxA[0], boxB[0]); yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2]); yB = min(boxA[3], boxB[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    aA = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    aB = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    return inter / (aA + aB - inter + 1e-6)

print(iou([0, 0, 4, 4], [2, 2, 6, 6]))  # ≈ 0.143 (intersection 4, union 28)
```
Exercises
- Differentiate semantic segmentation, instance segmentation, and panoptic segmentation.
- Describe the YOLO grid prediction scheme. How many output values per grid cell in YOLO v1?
- Explain the encoder–decoder architecture for image captioning.
Viva Questions
- What is IoU and how is it used in object detection evaluation?
- What is Non-Maximum Suppression (NMS)?
- How does Faster R-CNN differ from Fast R-CNN?
- What is the role of attention in image captioning?
- What is the BLEU score and what are its limitations?