Chapter 16 — Attention Models + Video-to-Text (LSTM)
Unit IV · Deep Learning for NLP & Computer Vision
1. Attention Mechanism in Vision Tasks
The attention mechanism allows a decoder to selectively focus on different parts of the encoder output at each decoding step. Originally proposed for machine translation (Bahdanau et al., 2015); adapted for vision in "Show, Attend and Tell".
Given encoder feature vectors \(\{h_1, \ldots, h_L\}\) (spatial locations) and decoder state \(s_t\):
\[e_{ti} = f_{\text{att}}(s_{t-1}, h_i) \quad \text{(alignment score)}\]
\[\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j=1}^{L} \exp(e_{tj})} \quad \text{(attention weights, softmax over locations)}\]
\[c_t = \sum_i \alpha_{ti} h_i \quad \text{(context vector)}\]
The context vector \(c_t\) is concatenated with the decoder input at step \(t\).
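The alignment, softmax, and context steps above can be sketched directly in NumPy. This is a minimal illustration of an additive (Bahdanau-style) score \(f_{\text{att}}\); the matrices `W_s`, `W_h` and vector `v` are illustrative random parameters, not learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_h, d_s, d_a = 5, 8, 8, 16   # locations, encoder dim, decoder dim, attention dim

h = rng.standard_normal((L, d_h))     # encoder features h_1..h_L
s_prev = rng.standard_normal(d_s)     # decoder state s_{t-1}

# Additive score: e_ti = v^T tanh(W_s s_{t-1} + W_h h_i)
W_s = rng.standard_normal((d_a, d_s))
W_h = rng.standard_normal((d_a, d_h))
v = rng.standard_normal(d_a)

e = np.tanh(W_s @ s_prev + h @ W_h.T) @ v          # (L,) alignment scores
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax over the L locations
c = alpha @ h                                      # (d_h,) context vector

print(alpha.round(3), alpha.sum())  # weights sum to 1
print(c.shape)                      # (8,)
```

In a full decoder, `c` would then be concatenated with the word embedding fed to the LSTM at step \(t\).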
- Soft (deterministic): weighted average over all locations — fully differentiable, trained end-to-end with backpropagation.
- Hard (stochastic): attend to one sampled location — trained with REINFORCE; interpretable but higher-variance gradients.
- Attention adds O(n²) complexity over sequence length n — manageable for moderate sequences.
- Self-attention: queries, keys, values all come from the same sequence — core of Transformers.
2. Video-to-Text with LSTM
Video captioning (S2VT — Sequence to Sequence Video to Text, Venugopalan et al., 2015) extends image captioning to video by encoding temporal information:
- Frame-level features: sample frames (e.g., every 8th frame → ~16 frames per clip) and extract a CNN feature vector for each sampled frame \(f_t\).
- Temporal encoder: feed the sequence of frame features through an LSTM → captures motion and temporal patterns.
- Text decoder: second LSTM generates the caption word by word, conditioned on the encoded video representation.
# Encoder: process frame sequence (assumes `video_frames` is a list of HxWx3 arrays)
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

cnn = ResNet50(weights='imagenet', include_top=False, pooling='avg')
frame_features = []
for frame in video_frames:
    feat = cnn.predict(preprocess_input(frame[None].astype('float32')), verbose=0)
    frame_features.append(feat[0])           # drop batch dim -> (2048,)
# Shape: (num_frames, 2048)
frame_seq = np.array(frame_features)[None]   # (1, T, 2048)
T = frame_seq.shape[1]                       # number of sampled frames

encoder = keras.Sequential([layers.LSTM(512, input_shape=(T, 2048))])
video_encoding = encoder.predict(frame_seq)  # (1, 512)
# Decoder: condition LSTM on video_encoding, then generate words
# ... standard LSTM language model initialised with video_encoding
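The decoder step elided above can be sketched as follows: an LSTM language model whose initial hidden and cell states are set to the 512-d video encoding. Vocabulary size, embedding dimension, and the dummy token IDs are illustrative assumptions, not values from S2VT.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, hidden = 10000, 256, 512  # illustrative sizes

# Decoder: an LSTM conditioned on the video encoding via its initial state.
word_in = keras.Input(shape=(None,))   # caption tokens generated so far
enc_in = keras.Input(shape=(hidden,))  # video_encoding from the encoder
x = layers.Embedding(vocab_size, embed_dim)(word_in)
x = layers.LSTM(hidden)(x, initial_state=[enc_in, enc_in])  # h0 = c0 = encoding
logits = layers.Dense(vocab_size)(x)   # scores for the next word
decoder = keras.Model([word_in, enc_in], logits)

# One greedy decoding step (dummy inputs)
tokens = np.array([[1, 42, 7]])        # <start> + two generated word IDs
enc = np.random.randn(1, hidden).astype('float32')
next_word = decoder.predict([tokens, enc], verbose=0).argmax(-1)
print(next_word.shape)  # (1,)
```

At inference time this step runs in a loop: the argmax (or a beam-search candidate) is appended to `tokens` and fed back in until an end-of-sentence token is produced.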
Exam-ready points
- Optical flow features (motion vectors) can augment CNN appearance features.
- Two-stream networks: spatial stream (RGB frames) + temporal stream (optical flow).
- METEOR, CIDEr, ROUGE-L: evaluation metrics for video captioning beyond BLEU.
3. Modern Note — Transformers & Beyond
Scaled Dot-Product Attention (Vaswani et al., "Attention is All You Need", 2017):
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
Multi-Head Attention: run \(h\) attention heads in parallel with different projections; concatenate outputs. Captures different types of relationships simultaneously.
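Multi-head attention can be sketched in NumPy by running the scaled dot-product formula once per head with separate random projection matrices (a self-attention setting where Q, K, V all come from the same sequence X); the weights here are illustrative, not trained.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: (seq, d_model); one projection triple per head, concat, output proj."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv        # (seq, d_k) each
        dk = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(dk))      # (seq, seq) weights for this head
        heads.append(A @ V)                     # (seq, d_k)
    return np.concatenate(heads, -1) @ W_o      # (seq, d_model)

rng = np.random.default_rng(0)
seq, d_model, h = 6, 32, 4
d_k = d_model // h
X = rng.standard_normal((seq, d_model))
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 32)
```

Each head sees a different \(d_k\)-dimensional projection of the same sequence, so different heads can specialise in different relationships (e.g., syntactic vs positional).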
Transformer applications in vision: ViT (Vision Transformer, 2020) splits an image into 16×16 patches treated as tokens → state-of-the-art on ImageNet given enough data. CLIP, DALL-E, and Stable Diffusion all build on Transformer attention.
- Transformer has no recurrence → fully parallelisable; can scale to billions of parameters.
- Scaling factor \(\sqrt{d_k}\) prevents dot products from growing too large → keeps softmax gradients from vanishing.
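The effect of the \(\sqrt{d_k}\) factor is easy to verify empirically: for unit-variance components, the dot product \(q \cdot k\) has standard deviation \(\approx \sqrt{d_k}\), so unscaled scores grow with dimension and push the softmax into its saturated (near-zero-gradient) region. A quick check:

```python
import numpy as np

rng = np.random.default_rng(0)
for dk in (4, 64, 1024):
    q = rng.standard_normal((10000, dk))
    k = rng.standard_normal((10000, dk))
    dots = (q * k).sum(-1)                  # 10000 raw dot products
    # Raw std grows like sqrt(dk); scaled std stays near 1.
    print(dk, round(float(dots.std()), 1),
          round(float((dots / np.sqrt(dk)).std()), 2))
```

The scaled column stays near 1 for every \(d_k\), which keeps the softmax inputs in a range where gradients remain usable.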
- BERT: bidirectional Transformer encoder pre-trained with masked language modelling.
- GPT: autoregressive Transformer decoder pre-trained with next-token prediction.
4. GenAI & LLM Basics
Modern generative AI is almost entirely built on Transformer attention. Key terms:
| Term | Definition |
|---|---|
| LLM | Large language model — Transformer trained with next-token prediction on massive text corpora (GPT, LLaMA, Gemini). |
| Context window | Maximum tokens (input + output) the model can attend to at once. GPT-4 Turbo: 128K tokens. |
| Hallucination | Confident-sounding output that is factually wrong or unsupported. Common failure mode of LLMs. |
| RAG | Retrieval-Augmented Generation — fetch relevant documents first, then feed them as context to the LLM for grounded answers. |
| Fine-tuning | Continue training a pre-trained LLM on task-specific data to specialise its behaviour. |
| RLHF | Reinforcement Learning from Human Feedback — aligns LLM outputs with human preferences (used in ChatGPT, Claude). |
| Prompt engineering | Crafting inputs (system prompt, few-shot examples, chain-of-thought) to steer LLM behaviour without retraining. |
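The RAG entry in the table can be illustrated end-to-end with a toy pipeline: retrieve the most relevant document by vector similarity, then assemble a grounded prompt. Bag-of-words vectors stand in for real embeddings here, and no actual LLM is called — both are illustrative simplifications.

```python
import numpy as np

# Toy RAG: retrieve the best-matching document, then build a grounded prompt.
docs = [
    "The context window is the maximum number of tokens the model attends to.",
    "RLHF aligns model outputs with human preferences.",
    "S2VT encodes video frames with an LSTM before decoding a caption.",
]
query = "what limits how many tokens a model can attend to"

vocab = sorted({w for text in docs + [query] for w in text.lower().split()})

def vec(text):
    """Normalised bag-of-words vector (stand-in for a learned embedding)."""
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        v[vocab.index(w)] += 1
    return v / (np.linalg.norm(v) + 1e-9)

scores = [vec(query) @ vec(d) for d in docs]       # cosine similarity
best = docs[int(np.argmax(scores))]
prompt = f"Answer using only this context:\n{best}\n\nQuestion: {query}"
print(prompt)
```

A production system would swap the bag-of-words vectors for dense embeddings and a vector index, but the retrieve-then-prompt structure is the same.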
- BERT (encoder-only): bidirectional; pre-trained with masked LM → good for classification, NER, Q&A.
- GPT (decoder-only): autoregressive; pre-trained with next-token prediction → good for generation.
- T5 / BART (encoder-decoder): seq2seq; good for translation, summarisation, question answering.
- Diffusion models (DALL-E 3, Stable Diffusion): iterative denoising from Gaussian noise → high-quality image generation.
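The "iterative denoising" in the last bullet inverts a fixed forward process that gradually corrupts data with Gaussian noise. A minimal sketch of that forward process, assuming a linear noise schedule (the schedule and clip length are illustrative choices, not a specific model's values):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

x0 = rng.standard_normal(64)         # stand-in "image"

def q_sample(x0, t):
    """Closed-form forward step: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

print(float(alpha_bar[0]))    # near 1: x_1 is almost the data
print(float(alpha_bar[-1]))   # near 0: x_T is almost pure Gaussian noise
```

Training teaches a network to predict `eps` from `x_t`; generation then runs the learned denoiser backwards from pure noise to a sample.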
Worked Example — Scaled dot-product attention (NumPy)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    dk = K.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)  # (seq_q, seq_k)
    weights = softmax(scores)       # attention weights
    return weights @ V, weights     # context + weights

np.random.seed(0)
Q = np.random.randn(3, 4)  # 3 queries, dim=4
K = np.random.randn(5, 4)  # 5 keys, dim=4
V = np.random.randn(5, 4)  # 5 values, dim=4
ctx, w = attention(Q, K, V)
print("Context shape:", ctx.shape)  # (3, 4)
print("Weights (row sums=1):", w.sum(-1).round(4))
Exercises
- Derive the scaled dot-product attention formula and explain why we scale by \(\sqrt{d_k}\).
- Compare Bahdanau attention (additive) and Luong attention (multiplicative).
- Describe the S2VT architecture for video captioning.
Viva Questions
- What is the role of attention weights \(\alpha_{ti}\) in the captioning model?
- Explain multi-head attention and its advantage over single-head attention.
- What is positional encoding and why is it needed in Transformers?
- How does ViT (Vision Transformer) process an image?
- What is the difference between BERT and GPT in terms of architecture and pre-training task?