Chapter 13 — Word Embeddings: CBOW, Skip-gram, GloVe
Unit IV · Deep Learning for NLP & Computer Vision
1. CBOW (Continuous Bag of Words)
Word2Vec (Mikolov et al., 2013) learns dense word representations by training a shallow neural network on a self-supervised task. CBOW predicts the centre word from its context:
Context: words within a window of size \(k\) around the target. Input: average (or sum) of context word one-hots. Output: softmax over vocabulary → predict target word.
Architecture: one-hot(context) → Embedding matrix \(W_{V\times d}\) → average → linear → softmax
- After training, the embedding matrix \(W\) is the learned representation.
- CBOW is faster; better for common words.
- Typical embedding dimension: 100–300.
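The CBOW architecture above can be sketched as a single numpy forward pass. This is a minimal illustration with a hypothetical 10-word vocabulary; `W` and `W_out` are randomly initialised stand-ins for the trained matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # vocabulary size, embedding dimension
W = rng.normal(size=(V, d))       # input embedding matrix W_{V x d}
W_out = rng.normal(size=(d, V))   # output linear layer

context_ids = [2, 5, 7, 1]        # indices of the context words (window of size k)
h = W[context_ids].mean(axis=0)   # average the context embeddings
scores = h @ W_out                # linear layer
probs = np.exp(scores - scores.max())
probs /= probs.sum()              # softmax over the vocabulary
predicted = int(np.argmax(probs)) # predicted centre-word index
```

During training, the cross-entropy loss between `probs` and the true centre word is backpropagated into both `W` and `W_out`; after training, `W` is kept as the embedding table.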
2. Skip-gram
Skip-gram is the reverse of CBOW: given a centre word, predict each context word. More difficult task → richer representations, especially for rare words.
Training objective: maximise \(\sum_{t} \sum_{-k \le j \le k, j\ne 0} \log P(w_{t+j} | w_t)\)
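Each conditional is a softmax over the output embeddings (as in Mikolov et al., 2013), with \(v_w\) the input and \(v'_w\) the output vector of word \(w\):

```latex
P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^\top v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_w}^\top v_{w_I}\right)}
```

Because the denominator sums over the whole vocabulary, practical implementations approximate it with hierarchical softmax or negative sampling.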
Famous analogy: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)
This shows embeddings encode semantic relationships as vector arithmetic.
# Using gensim
from gensim.models import Word2Vec
sentences = [["deep","learning","is","powerful"],["neural","networks","learn"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1) # sg=1: skip-gram
vec_deep = model.wv['deep']
similar = model.wv.most_similar('deep', topn=5)
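With vectors trained on a large corpus, the analogy can be queried via gensim's `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. The arithmetic underneath is just nearest-neighbour search by cosine similarity, sketched here with hand-made 2-D toy vectors (hypothetical values chosen only to make the mechanics visible, not real embeddings):

```python
import numpy as np

# Hypothetical toy vectors constructed so the analogy holds exactly
vocab = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'king':  np.array([3.0, 0.0]),
    'queen': np.array([3.0, 1.0]),
    'apple': np.array([0.2, 5.0]),  # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vocab['king'] - vocab['man'] + vocab['woman']
# Nearest word to king - man + woman, excluding the query words themselves
best = max((w for w in vocab if w not in {'king', 'man', 'woman'}),
           key=lambda w: cosine(vocab[w], target))
print(best)  # queen
```

Excluding the query words from the search is important: in real embedding spaces, `king` itself is usually the nearest vector to `king - man + woman`.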
Exam-ready points
- Skip-gram: slower training, better for infrequent words.
- CBOW: faster, better for frequent words / larger corpora.
3. GloVe (Global Vectors)
GloVe (Pennington et al., 2014) factorises a word–word co-occurrence matrix \(X\) built from the entire corpus. Objective:
\[J = \sum_{i,j} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2\]
where \(f(X_{ij})\) is a weighting function that down-weights very frequent co-occurrences.
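The weighting function proposed in the GloVe paper is

```latex
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

with \(x_{\max} = 100\) and \(\alpha = 3/4\) in the original paper, so rare co-occurrences contribute little to the loss and very frequent ones are capped rather than dominating it.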
GloVe captures global (corpus-wide) co-occurrence statistics, whereas Word2Vec learns only from local context windows. Pre-trained GloVe vectors (trained on 6B, 42B, and 840B tokens) are widely available.
# Load pre-trained GloVe into Keras Embedding layer
import numpy as np
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        vals = line.split()
        word = vals[0]
        embeddings_index[word] = np.array(vals[1:], dtype='float32')
# Build embedding matrix for your vocabulary
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
4. Evaluation & Applications
Intrinsic evaluation: word similarity benchmarks (WordSim-353, SimLex-999); analogy tasks (capital-of-country, male-female).
Extrinsic evaluation: performance on downstream tasks (sentiment analysis, NER, QA) when using embeddings as features — the gold standard.
Applications of word embeddings: text classification, NER, machine translation, question answering, information retrieval, chatbots.
Exam-ready points
- Embeddings are often initialised with pre-trained GloVe/Word2Vec vectors and then fine-tuned during task training.
- OOV (out-of-vocabulary) words: use an <UNK> token or subword models (BPE, fastText).
- fastText (Facebook): extends Word2Vec with character n-grams, which handles morphology and OOV words.
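fastText's OOV handling works because a word's vector is the sum of its character n-gram vectors; an unseen word still shares n-grams with morphologically related words seen in training. A minimal sketch of the n-gram extraction, using the paper's `<` and `>` boundary markers (fastText defaults to n = 3..6; a single n is shown for brevity):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with fastText-style boundary markers."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # fastText also keeps the whole word as one unit
    return grams

# 'where' and an unseen relative like 'wherever' share many n-grams,
# so the unseen word's vector can be composed from shared subword vectors.
print(char_ngrams('where', 3, 3))
```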
Worked Example — Using GloVe in Keras text classifier
from tensorflow import keras
from tensorflow.keras import layers

embedding_layer = layers.Embedding(
    input_dim=vocab_size, output_dim=100, input_length=maxlen,
    weights=[embedding_matrix], trainable=False)  # frozen pre-trained weights
model = keras.Sequential([
    embedding_layer,
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
Viva Questions
- Explain CBOW and Skip-gram models with their training objectives.
- What is negative sampling and why is it used in Word2Vec?
- How does GloVe differ from Word2Vec in its training approach?
- Demonstrate the word analogy king - man + woman = ? using vector arithmetic.
- What is fastText and how does it handle out-of-vocabulary words?