Unit IV — Deep Learning for NLP & Computer Vision

Chapter 16 — Attention Models + Video-to-Text (LSTM)


Objectives
  • Understand the attention mechanism and self-attention.
  • Apply LSTM-based video captioning (S2VT).
  • Know how Transformers extend attention into a full architecture.

1. Attention Mechanism in Vision Tasks

The attention mechanism allows a decoder to selectively focus on different parts of the encoder output at each decoding step. Originally proposed for machine translation (Bahdanau et al., 2015), it was adapted to vision in "Show, Attend and Tell" (Xu et al., 2015).

Given encoder feature vectors \(\{h_1, \ldots, h_L\}\) (spatial locations) and decoder state \(s_t\):

\[e_{ti} = f_{\text{att}}(s_{t-1}, h_i) \quad \text{(alignment score)}\]

\[\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{L} \exp(e_{tj})} \quad \text{(attention weights, softmax over locations)}\]

\[c_t = \sum_i \alpha_{ti} h_i \quad \text{(context vector)}\]

The context vector \(c_t\) is concatenated with the decoder input at step \(t\).

Hard vs Soft Attention
Soft (deterministic): weighted sum of all locations — differentiable, trained with backprop.
Hard (stochastic): attend to one location — trained with REINFORCE; interpretable but higher variance.
Exam-ready points
  • Attention adds O(n²) complexity over sequence length n — manageable for moderate sequences.
  • Self-attention: queries, keys, values all come from the same sequence — core of Transformers.
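The alignment–softmax–context equations above can be sketched directly in NumPy using Bahdanau-style additive scoring. All dimensions and weight matrices below are illustrative (random, untrained), chosen only to make the shapes concrete:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L, d_h, d_s, d_a = 6, 16, 32, 24   # locations, encoder dim, decoder dim, attention dim

h = rng.standard_normal((L, d_h))      # encoder features h_1 .. h_L
s_prev = rng.standard_normal(d_s)      # previous decoder state s_{t-1}

# Additive (Bahdanau) alignment: e_ti = v^T tanh(W_s s_{t-1} + W_h h_i)
W_s = rng.standard_normal((d_s, d_a)) * 0.1
W_h = rng.standard_normal((d_h, d_a)) * 0.1
v   = rng.standard_normal(d_a)

e = np.tanh(s_prev @ W_s + h @ W_h) @ v   # (L,) alignment scores
alpha = softmax(e)                        # attention weights, sum to 1
c = alpha @ h                             # context vector, shape (d_h,)
print(alpha.round(3), c.shape)
```

The context vector `c` is what gets concatenated with the decoder input at step \(t\).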

2. Video-to-Text with LSTM

Video captioning (S2VT — Sequence to Sequence Video to Text, Venugopalan et al., 2015) extends image captioning to video by encoding temporal information:

  1. Frame-level features: extract CNN features for each frame \(f_t\) (e.g., every 8th frame → 16 frames).
  2. Temporal encoder: feed the sequence of frame features through an LSTM → captures motion and temporal patterns.
  3. Text decoder: second LSTM generates the caption word by word, conditioned on the encoded video representation.
# Encoder: process the frame sequence
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

cnn = ResNet50(weights='imagenet', include_top=False, pooling='avg')
frame_features = []
for frame in video_frames:                   # each frame: (224, 224, 3) array
    feat = cnn.predict(preprocess_input(frame[None].astype('float32')))
    frame_features.append(feat[0])           # (2048,)
frame_seq = np.array(frame_features)[None]   # (1, T, 2048)

T = frame_seq.shape[1]
encoder = keras.Sequential([layers.LSTM(512, input_shape=(T, 2048))])
video_encoding = encoder.predict(frame_seq)  # (1, 512)

# Decoder: condition LSTM on video_encoding, then generate words
# ... standard LSTM language model initialised with video_encoding
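The elided decoder step can be illustrated with a pure-NumPy toy: a plain tanh RNN cell stands in for the LSTM, and the vocabulary and all weights are illustrative (random, untrained). The key idea, as in S2VT-style models, is that the decoder state is initialised from the video encoding, then words are generated greedily one at a time:

```python
import numpy as np

# Toy greedy decoder: a tanh RNN cell stands in for the LSTM;
# vocabulary and weights are illustrative, not from S2VT.
rng = np.random.default_rng(0)
vocab = ["<bos>", "a", "man", "rides", "a", "bike", "<eos>"]
V, H, E = len(vocab), 512, 64

W_init = rng.standard_normal((512, H)) * 0.01   # video encoding -> initial state
W_emb  = rng.standard_normal((V, E)) * 0.01     # word embeddings
W_xh   = rng.standard_normal((E, H)) * 0.01
W_hh   = rng.standard_normal((H, H)) * 0.01
W_out  = rng.standard_normal((H, V)) * 0.01

video_encoding = rng.standard_normal(512)       # stands in for the encoder output
h = np.tanh(video_encoding @ W_init)            # condition decoder on the video
token = 0                                       # start from <bos>
caption = []
for _ in range(10):                             # generate word by word
    x = W_emb[token]
    h = np.tanh(x @ W_xh + h @ W_hh)
    token = int(np.argmax(h @ W_out))           # greedy word choice
    if vocab[token] == "<eos>":
        break
    caption.append(vocab[token])
print(caption)
```

With untrained weights the output words are arbitrary; training would fit all the `W_*` matrices by backpropagating a cross-entropy loss through the caption.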
Exam-ready points
  • Optical flow features (motion vectors) can augment CNN appearance features.
  • Two-stream networks: spatial stream (RGB frames) + temporal stream (optical flow).
  • METEOR, CIDEr, ROUGE-L: evaluation metrics for video captioning beyond BLEU.

3. Modern Note — Transformers & Beyond

Scaled Dot-Product Attention (Vaswani et al., "Attention is All You Need", 2017):

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]

Multi-Head Attention: run \(h\) attention heads in parallel with different projections; concatenate outputs. Captures different types of relationships simultaneously.
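Multi-head attention can be sketched in a few lines of NumPy: project the input, split the projection into \(h\) heads of width \(d_k = d_{\text{model}}/h\), attend per head, then concatenate and project back. Dimensions and weights below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Self-attention over X: (seq, d_model), with h heads."""
    seq, d_model = X.shape
    dk = d_model // h
    # Project, then split the width into h heads of size dk
    Q = (X @ Wq).reshape(seq, h, dk).transpose(1, 0, 2)  # (h, seq, dk)
    K = (X @ Wk).reshape(seq, h, dk).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, h, dk).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dk)      # (h, seq, seq)
    heads = softmax(scores) @ V                          # (h, seq, dk)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                                   # (seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads, seq = 8, 2, 5
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)   # (5, 8)
```

Each head sees a different learned projection of the sequence, which is what lets different heads specialise in different relationships.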

Transformer applications in vision: ViT (Vision Transformer, 2020) splits an image into 16×16 patches treated as tokens → state-of-the-art on ImageNet given enough data. CLIP, DALL-E, Stable Diffusion all build on Transformer attention.

Positional Encoding
Self-attention is permutation-invariant → positional encodings are added to token embeddings to inject order information. Sinusoidal: \(PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{\text{model}}})\), \(PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{\text{model}}})\).
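The sinusoidal encoding is easy to compute directly; a minimal sketch (sequence length and model width chosen arbitrarily):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)          # (50, 16)
print(pe[0].round(2))    # position 0: sin terms are 0, cos terms are 1
```

Each position gets a unique pattern across frequencies, and nearby positions get similar encodings, which is why the model can learn relative-position relationships from them.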
Exam-ready points
  • Transformer has no recurrence → fully parallelisable; can scale to billions of parameters.
  • Scaling factor \(\sqrt{d_k}\) prevents dot products from growing too large → keeps softmax gradients from vanishing.
  • BERT: bidirectional Transformer encoder pre-trained with masked language modelling.
  • GPT: autoregressive Transformer decoder pre-trained with next-token prediction.
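The effect of the \(\sqrt{d_k}\) scaling factor can be checked empirically: the dot product of two independent standard-normal vectors of dimension \(d_k\) has standard deviation \(\sqrt{d_k}\), so dividing by \(\sqrt{d_k}\) keeps the score scale near 1 regardless of dimension. A quick simulation (sample sizes and dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
for dk in (4, 64, 1024):
    q = rng.standard_normal((100_000, dk))
    k = rng.standard_normal((100_000, dk))
    dots = (q * k).sum(axis=-1)
    # raw std grows like sqrt(dk); scaled std stays near 1
    print(f"d_k={dk:4d}  raw std={dots.std():6.2f}  "
          f"scaled std={(dots / np.sqrt(dk)).std():.2f}")
```

Without the scaling, large \(d_k\) pushes the softmax into its saturated region, where gradients vanish.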

4. GenAI & LLM Basics

Modern generative AI is almost entirely built on Transformer attention. Key terms:

  • LLM: Large language model — a Transformer trained with next-token prediction on massive text corpora (GPT, LLaMA, Gemini).
  • Context window: the maximum number of tokens (input + output) the model can attend to at once. GPT-4 Turbo: 128K tokens.
  • Hallucination: confident-sounding output that is factually wrong or unsupported. A common failure mode of LLMs.
  • RAG: Retrieval-Augmented Generation — fetch relevant documents first, then feed them as context to the LLM for grounded answers.
  • Fine-tuning: continue training a pre-trained LLM on task-specific data to specialise its behaviour.
  • RLHF: Reinforcement Learning from Human Feedback — aligns LLM outputs with human preferences (used in ChatGPT, Claude).
  • Prompt engineering: crafting inputs (system prompt, few-shot examples, chain-of-thought) to steer LLM behaviour without retraining.
Practical caution
A model can be fluent and still be wrong. For factual tasks, always add retrieval (RAG), citations, or a verification step. Fluency ≠ accuracy.
Exam-ready points
  • BERT (encoder-only): bidirectional; pre-trained with masked LM → good for classification, NER, Q&A.
  • GPT (decoder-only): autoregressive; pre-trained with next-token prediction → good for generation.
  • T5 / BART (encoder-decoder): seq2seq; good for translation, summarisation, question answering.
  • Diffusion models (DALL-E 3, Stable Diffusion): iterative denoising from Gaussian noise → high-quality image generation.

Worked Example — Scaled dot-product attention (NumPy)

import numpy as np
def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)
def attention(Q, K, V):
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)    # (seq_q, seq_k)
    weights = softmax(scores)          # attention weights
    return weights @ V, weights        # context + weights

np.random.seed(0)
Q = np.random.randn(3, 4)   # 3 queries,  dim=4
K = np.random.randn(5, 4)   # 5 keys,     dim=4
V = np.random.randn(5, 4)   # 5 values,   dim=4
ctx, w = attention(Q, K, V)
print("Context shape:", ctx.shape)   # (3, 4)
print("Weights (row sums=1):", w.sum(-1).round(4))

Exercises

  1. Derive the scaled dot-product attention formula and explain why we scale by \(\sqrt{d_k}\).
  2. Compare Bahdanau attention (additive) and Luong attention (multiplicative).
  3. Describe the S2VT architecture for video captioning.

Viva Questions

  1. What is the role of attention weights \(\alpha_{ti}\) in the captioning model?
  2. Explain multi-head attention and its advantage over single-head attention.
  3. What is positional encoding and why is it needed in Transformers?
  4. How does ViT (Vision Transformer) process an image?
  5. What is the difference between BERT and GPT in terms of architecture and pre-training task?