Unit IV — Deep Learning for NLP & Computer Vision

Chapter 13 — Word Embeddings: CBOW, Skip-gram, GloVe


Objectives
Understand dense word representations · Train Word2Vec (CBOW and Skip-gram) · Use GloVe and evaluate embedding quality

1. CBOW (Continuous Bag of Words)

Word2Vec (Mikolov et al., 2013) learns dense word representations by training a shallow neural network on a self-supervised task. CBOW predicts the centre word from its context:

Context: words within a window of size \(k\) around the target. Input: average (or sum) of context word one-hots. Output: softmax over vocabulary → predict target word.

Architecture: one-hot(context) → Embedding matrix \(W_{V\times d}\) → average → linear → softmax

Negative Sampling
Softmax over 50K–1M vocab is expensive. Negative sampling approximates it: for each positive pair, sample \(k\) negative (noise) words and train a binary classifier. Typical \(k\): 5–20 for small datasets, 2–5 for large.
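The per-pair objective can be sketched in numpy. This is an illustrative sketch of the loss only (not gensim's actual implementation); the vector names `v_c`, `u_pos`, and `u_negs` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling loss for one (centre, context) pair with k noise words.
# v_c: centre-word vector, u_pos: true context-word vector,
# u_negs: k sampled noise-word vectors (names are hypothetical).
def neg_sampling_loss(v_c, u_pos, u_negs):
    pos = np.log(sigmoid(u_pos @ v_c))          # pull the true pair together
    neg = np.log(sigmoid(-u_negs @ v_c)).sum()  # push noise pairs apart
    return -(pos + neg)                         # minimise this

rng = np.random.default_rng(0)
d, k = 100, 5
loss = neg_sampling_loss(rng.normal(size=d), rng.normal(size=d),
                         rng.normal(size=(k, d)))
```

Each positive pair thus requires only \(k+1\) dot products instead of a full softmax over the vocabulary.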
Exam-ready points
  • After training, the embedding matrix \(W\) is the learned representation.
  • CBOW is faster; better for common words.
  • Typical embedding dimension: 100–300.
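The CBOW forward pass described above can be sketched with toy sizes (random, untrained weights — a shape-level illustration, not a trained model):

```python
import numpy as np

V, d = 6, 4                       # toy vocabulary size, embedding dimension
rng = np.random.default_rng(42)
W_in = rng.normal(size=(V, d))    # input embedding matrix W (V x d)
W_out = rng.normal(size=(d, V))   # output projection back to the vocabulary

context_ids = [1, 2, 4, 5]                # indices of the context words
h = W_in[context_ids].mean(axis=0)        # average of context embeddings
logits = h @ W_out                        # linear layer
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over vocabulary
predicted = int(np.argmax(probs))               # most likely centre word
```

After training, the rows of `W_in` are the word vectors that get reused downstream.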

2. Skip-gram

Skip-gram is the reverse of CBOW: given a centre word, predict each context word. More difficult task → richer representations, especially for rare words.

Training objective: maximise \(\sum_{t} \sum_{-k \le j \le k, j\ne 0} \log P(w_{t+j} | w_t)\)

Famous analogy: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)

This shows embeddings encode semantic relationships as vector arithmetic.

# Using gensim
from gensim.models import Word2Vec

sentences = [["deep", "learning", "is", "powerful"],
             ["neural", "networks", "learn"]]
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1)                # sg=1: skip-gram, sg=0: CBOW
vec_deep = model.wv['deep']                        # 100-d vector for 'deep'
similar = model.wv.most_similar('deep', topn=5)    # nearest neighbours
Exam-ready points
  • Skip-gram: slower training, better for infrequent words.
  • CBOW: faster, better for frequent words / larger corpora.
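The analogy arithmetic above can be demonstrated with hand-made toy vectors. These 3-d vectors are chosen so the analogy holds exactly; real embeddings only satisfy it approximately (and gensim's `most_similar(positive=..., negative=...)` does the same ranking on a trained model).

```python
import numpy as np

# Toy vectors constructed by hand for illustration (not trained embeddings).
emb = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'man':   np.array([0.9, 0.1, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'queen': np.array([0.1, 0.8, 0.9]),
    'apple': np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb['king'] - emb['man'] + emb['woman']
# Rank the remaining words by cosine similarity to the query vector,
# excluding the analogy's input words (as gensim does).
best = max((w for w in emb if w not in {'king', 'man', 'woman'}),
           key=lambda w: cosine(query, emb[w]))   # → 'queen'
```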

3. GloVe (Global Vectors)

GloVe (Pennington et al., 2014) factorises a word–word co-occurrence matrix \(X\) built from the entire corpus. Objective:

\[J = \sum_{i,j} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2\]

where \(f(X_{ij})\) is a weighting function that down-weights very frequent co-occurrences.
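The weighting function from the GloVe paper, with its reported defaults \(x_{\max} = 100\) and \(\alpha = 0.75\), is simple to write down:

```python
import numpy as np

# f(x) from the GloVe paper: grows sublinearly up to x_max, then is capped
# at 1 so very frequent co-occurrences do not dominate the loss.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

w_rare = glove_weight(1.0)       # rare pair, heavily down-weighted (≈ 0.03)
w_cap = glove_weight(5000.0)     # frequent pair, capped at 1.0
```

Note also that pairs with \(X_{ij} = 0\) contribute nothing, so the sum runs only over observed co-occurrences.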

GloVe captures global (corpus-wide) co-occurrence statistics, whereas Word2Vec trains only on local context windows. Pre-trained GloVe vectors (trained on 6B, 42B, and 840B tokens) are widely available.

# Load pre-trained GloVe into a Keras Embedding layer
import numpy as np

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        vals = line.split()
        word = vals[0]
        embeddings_index[word] = np.array(vals[1:], dtype='float32')

# Build the embedding matrix for your vocabulary
# (vocab_size and word_index come from your tokenizer, e.g. Keras Tokenizer)
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in word_index.items():
    vec = embeddings_index.get(word)   # None for OOV words -> row stays zero
    if vec is not None:
        embedding_matrix[i] = vec

4. Evaluation & Applications

Intrinsic evaluation: word similarity benchmarks (WordSim-353, SimLex-999); analogy tasks (capital-of-country, male-female).
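A minimal sketch of intrinsic evaluation, using hypothetical toy data (a two-pair "benchmark" with human ratings in the style of WordSim-353, and toy 2-d vectors):

```python
import numpy as np

# Hypothetical toy benchmark: human similarity ratings on a 0-10 scale.
human = {('cat', 'dog'): 8.0, ('cat', 'car'): 2.0}
emb = {'cat': np.array([1.0, 0.2]),
       'dog': np.array([0.9, 0.3]),
       'car': np.array([0.1, 1.0])}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

model_scores = {pair: cosine(emb[pair[0]], emb[pair[1]]) for pair in human}
# A good embedding ranks pairs in the same order as the human ratings;
# real benchmarks report the Spearman rank correlation over many pairs.
same_order = model_scores[('cat', 'dog')] > model_scores[('cat', 'car')]
```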

Extrinsic evaluation: performance on downstream tasks (sentiment analysis, NER, QA) when using embeddings as features — the gold standard.

Applications of word embeddings: text classification, NER, machine translation, question answering, information retrieval, chatbots.

Exam-ready points
  • Embeddings are often initialised with pre-trained GloVe/Word2Vec vectors, then fine-tuned (or kept frozen) during task training.
  • OOV (out-of-vocabulary) words: use <UNK> token or subword models (BPE, fastText).
  • FastText (Facebook): extends Word2Vec with character n-grams — handles morphology and OOV.
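fastText's n-gram idea can be sketched as follows. The real tool uses n = 3 to 6 with `<` and `>` boundary markers and hashes n-grams into buckets; this sketch only extracts the n-grams:

```python
# fastText-style character n-grams (extraction only; hashing omitted).
def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                      # boundary markers as in fastText
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams("where")             # includes '<wh', 'her', 're>', ...
```

An OOV word still gets a vector: sum or average the vectors of its character n-grams.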

Worked Example — Using GloVe in Keras text classifier

from tensorflow import keras
from tensorflow.keras import layers

# vocab_size, maxlen, embedding_matrix come from the GloVe-loading step above
embedding_layer = layers.Embedding(
    input_dim=vocab_size, output_dim=100, input_length=maxlen,
    weights=[embedding_matrix], trainable=False)   # frozen pre-trained GloVe

model = keras.Sequential([
    embedding_layer,
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

Viva Questions

  1. Explain CBOW and Skip-gram models with their training objectives.
  2. What is negative sampling and why is it used in Word2Vec?
  3. How does GloVe differ from Word2Vec in its training approach?
  4. Demonstrate the word analogy king - man + woman = ? using vector arithmetic.
  5. What is fastText and how does it handle out-of-vocabulary words?