Unit IV — Deep Learning for NLP & Computer Vision

Chapter 14 — Vision Tasks: Segmentation, Detection, Captioning


Objectives
Distinguish classification, detection, and segmentation · Understand YOLO and R-CNN families · Apply CNN-RNN for image captioning

1. Image Segmentation

Segmentation assigns a class label to every pixel. Two types:

  • Semantic segmentation: all pixels of the same class get the same label — no instance distinction (two cats → both labelled "cat").
  • Instance segmentation: distinguishes individual objects of the same class (cat-1, cat-2). Requires Mask R-CNN or similar.

FCN (Fully Convolutional Network, 2015): replace FC layers with conv layers; use upsampling (transposed conv or bilinear interpolation) to restore spatial resolution.

U-Net (2015): encoder–decoder with skip connections — state-of-the-art for medical image segmentation. Skip connections preserve fine spatial details lost in downsampling.

# U-Net skip connection pattern
# Encoder: Conv→Pool→Conv→Pool→...→bottleneck
# Decoder: Upsample→Concat(skip)→Conv→Upsample→Concat(skip)→Conv→...→output

Exam-ready points
  • Evaluation: IoU (Intersection over Union) = TP / (TP + FP + FN) per class; mean IoU (mIoU) averaged over classes.
  • Transposed convolution (deconv) learns upsampling; bilinear interpolation is a fixed upsampling.
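The per-class IoU formula above can be checked numerically. A minimal sketch in pure Python — the TP/FP/FN counts below are made-up illustration values, not from a real model:

```python
# Per-class segmentation IoU = TP / (TP + FP + FN); mIoU averages over classes.
def class_iou(tp, fp, fn):
    return tp / (tp + fp + fn)

counts = {                    # class -> (TP, FP, FN), hypothetical pixel counts
    "road": (900, 50, 50),
    "car":  (300, 100, 100),
}
ious = {c: class_iou(*v) for c, v in counts.items()}
miou = sum(ious.values()) / len(ious)
print(ious["road"])    # 0.9
print(round(miou, 2))  # 0.75
```

Note how the frequent "road" class scores well while the rarer "car" class drags mIoU down — which is exactly why mIoU is preferred over plain pixel accuracy.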

2. Object Detection

Object detection localises (bounding box) and classifies multiple objects in an image simultaneously.

Approach     Examples                           Speed                 Accuracy
Two-stage    R-CNN, Fast R-CNN, Faster R-CNN    Slower                Higher
One-stage    YOLO, SSD, RetinaNet               Faster (real-time)    Slightly lower

YOLO (You Only Look Once): divides the image into an \(S\times S\) grid; each cell predicts \(B\) bounding boxes (each with a confidence score) plus class probabilities, all in a single forward pass. \(S=7, B=2\) for YOLO v1; modern YOLOv8 achieves a state-of-the-art speed/accuracy tradeoff.
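The size of the YOLO v1 output tensor follows directly from the grid scheme — a quick sanity check, assuming the original PASCAL VOC setup with \(C=20\) classes:

```python
# YOLO v1 output size: each of the S*S cells predicts
# B boxes * (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20          # YOLO v1 on PASCAL VOC (C = 20 classes)
per_cell = B * 5 + C        # 30 values per grid cell
total = S * S * per_cell    # full output tensor is 7 x 7 x 30
print(per_cell, total)      # 30 1470
```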

Faster R-CNN: introduces Region Proposal Network (RPN) sharing convolutional features with detection head — end-to-end trainable, ~5 FPS.

Exam-ready points
  • Bounding box output: \((x_c, y_c, w, h)\) — centre coordinates + width/height (normalised).
  • NMS (Non-Maximum Suppression): remove overlapping boxes with IoU > threshold; keep highest confidence.
  • mAP (mean Average Precision): standard detection metric at IoU threshold 0.5 or 0.5:0.95.
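The NMS procedure from the exam points can be sketched in a few lines of pure Python. This is a minimal greedy version (boxes as `[x1, y1, x2, y2]` corner coordinates with the data made up for illustration), not a production implementation:

```python
# Greedy Non-Maximum Suppression: keep the highest-confidence box,
# discard any remaining box that overlaps it with IoU > thresh, repeat.
def iou(a, b):
    xA, yA = max(a[0], b[0]), max(a[1], b[1])
    xB, yB = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xB - xA) * max(0, yB - yA)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def nms(boxes, scores, thresh=0.5):
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # drop remaining boxes that overlap box i too strongly
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with IoU 0.81 (> 0.5) and is suppressed; box 2 is disjoint and survives.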

3. Automatic Image Captioning

Image captioning generates a natural language description of an image. The dominant approach: encoder–decoder with CNN encoder + LSTM decoder (Vinyals et al., 2015 "Show and Tell"):

  1. CNN (e.g., InceptionV3, ResNet) encodes image → feature vector.
  2. Feature vector initialises the LSTM hidden state.
  3. LSTM decodes one word at a time using previous word as input.
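Step 3 — decoding one word at a time, feeding each prediction back as the next input — can be sketched as a greedy loop. Here `next_word` and the bigram table are hypothetical stand-ins for a trained LSTM step, used only to make the control flow concrete:

```python
# Toy greedy decoding loop. next_word() stands in for one LSTM step;
# the bigram table is a made-up substitute for learned weights.
BIGRAMS = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}

def next_word(prev):          # hypothetical stand-in for the LSTM step
    return BIGRAMS[prev]

def greedy_caption(max_len=10):
    words, prev = [], "<start>"
    for _ in range(max_len):
        prev = next_word(prev)
        if prev == "<end>":
            break
        words.append(prev)    # prediction is fed back as the next input
    return " ".join(words)

print(greedy_caption())  # a dog runs
```

In practice, beam search (keeping the top-k partial captions) usually replaces this greedy choice.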

Attention mechanism (Xu et al., 2015 "Show, Attend and Tell"): at each decoding step, the LSTM attends to different spatial regions of the CNN feature maps — produces better captions and interpretable attention maps.
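The attention step can be made concrete with toy numbers: score each spatial region, softmax the scores into weights, and take the weighted sum of region features as the context vector. The features and scores below are invented for illustration; real scores come from a learned alignment model:

```python
import math

# Toy attention over 3 spatial regions, each with a 2-d feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 1.0, 0.0]                     # hypothetical alignment scores

exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]      # softmax: non-negative, sums to 1
context = [sum(w * f[d] for w, f in zip(weights, features))
           for d in range(2)]                # attention-weighted context vector
print(round(sum(weights), 6))  # 1.0
```

The weights are exactly what gets visualised as an "attention map" over the image.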

Exam-ready points
  • BLEU score (Bilingual Evaluation Understudy): measures n-gram overlap between generated and reference captions.
  • Teacher forcing: during training, feed the ground-truth word as input at each step (not the predicted word).
  • Modern approach: Vision Transformer (ViT) encoder + Transformer decoder.
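One ingredient of BLEU — the clipped (modified) n-gram precision — is easy to compute by hand for unigrams. A sketch (full BLEU also combines higher-order n-grams and a brevity penalty, omitted here):

```python
from collections import Counter

# Clipped unigram precision: each candidate word counts at most as many
# times as it appears in the reference, which penalises degenerate
# repetition like "the the the the".
def clipped_unigram_precision(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / sum(cand.values())

ref = "a cat sits on the mat"
print(clipped_unigram_precision("a cat on the mat", ref))  # 1.0
print(clipped_unigram_precision("the the the the", ref))   # 0.25
```

The second case shows the clipping at work: "the" occurs once in the reference, so only one of the four repetitions counts.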

Worked Example — Compute IoU for two boxes

def iou(boxA, boxB):
    # boxA, boxB: [x1,y1,x2,y2]
    xA = max(boxA[0], boxB[0]); yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2]); yB = min(boxA[3], boxB[3])
    inter = max(0, xB-xA) * max(0, yB-yA)
    aA = (boxA[2]-boxA[0]) * (boxA[3]-boxA[1])
    aB = (boxB[2]-boxB[0]) * (boxB[3]-boxB[1])
    return inter / (aA + aB - inter + 1e-6)
print(iou([0,0,4,4], [2,2,6,6]))  # → 0.143

Exercises

  1. Differentiate semantic segmentation, instance segmentation, and panoptic segmentation.
  2. Describe the YOLO grid prediction scheme. How many output values per grid cell in YOLO v1?
  3. Explain the encoder–decoder architecture for image captioning.

Viva Questions

  1. What is IoU and how is it used in object detection evaluation?
  2. What is Non-Maximum Suppression (NMS)?
  3. How does Faster R-CNN differ from Fast R-CNN?
  4. What is the role of attention in image captioning?
  5. What is the BLEU score and what are its limitations?