Unit IV — Deep Learning for NLP & Computer Vision
Chapter 12 — NLP Foundations: Vector Space Semantics
Objectives
Understand text preprocessing pipelines · Apply bag-of-words and TF-IDF representations · Interpret vector space semantics
1. Introduction to NLP
Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language. A typical text preprocessing pipeline consists of the following steps:
- Tokenisation: split text into tokens (words or subwords).
- Lowercasing & cleaning: remove punctuation, special characters.
- Stop-word removal: discard very frequent words (the, is, a…) that carry little semantic weight.
- Stemming / Lemmatisation: reduce words to root form (running → run).
- Vectorisation: convert tokens to numerical representation.
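The pipeline steps above can be sketched in plain Python. This is a minimal illustration only: the stop-word list and the suffix-stripping rule are toy simplifications (real systems use libraries such as NLTK or spaCy with proper stemmers/lemmatisers).

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "i"}  # tiny illustrative list

def preprocess(text):
    # Lowercasing & cleaning: keep only letters and spaces
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    # Tokenisation: split on whitespace
    tokens = cleaned.split()
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a trailing "ing" and de-double the final
    # consonant (running -> runn -> run); real stemmers are far more careful
    stemmed = []
    for t in tokens:
        if t.endswith("ing") and len(t) > 5:
            t = t[:-3]
            if len(t) > 2 and t[-1] == t[-2]:
                t = t[:-1]
        stemmed.append(t)
    return stemmed

print(preprocess("The model is running quickly!"))  # ['model', 'run', 'quickly']
```

Each step discards information the later vectorisation step does not need, which keeps the vocabulary (and hence vector dimensionality) small.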
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ["I love deep learning", "Deep learning is powerful"]
cv = CountVectorizer()
X_bow = cv.fit_transform(corpus).toarray() # bag-of-words
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus).toarray()
Exam-ready points
- Vocabulary size grows with corpus → sparse, high-dimensional BoW vectors.
- n-grams capture word order to some extent: bigrams = pairs of adjacent words.
2. Vector Space Model of Semantics
In the vector space model, each document (or word) is represented as a vector in a high-dimensional space. Semantic similarity is measured by cosine similarity:
\[\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \in [-1, 1]\]
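The formula can be checked directly with NumPy; the 3-dimensional vectors here are made up for illustration:

```python
import numpy as np

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 0.0])

# cos(u, v) = (u . v) / (||u|| ||v||)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos, 2))  # 0.8
```

Because the norms are divided out, cosine similarity depends only on the vectors' directions, not their lengths.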
TF-IDF (Term Frequency – Inverse Document Frequency) weights each term:
\[\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\frac{N}{\text{DF}(t)}\]
where TF = term frequency in the document, N = total number of documents, and DF = number of documents containing the term. Terms that occur often in a document but rarely across the corpus receive high weight.
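Plugging numbers into the formula makes the weighting concrete. This is an illustrative calculation with invented counts, using the textbook formula above (sklearn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly):

```python
import math

# Suppose term t appears 3 times in document d, the corpus has N = 10
# documents, and t occurs in DF = 2 of them.
tf, N, df = 3, 10, 2
tfidf = tf * math.log(N / df)   # TF-IDF(t, d) = TF * log(N / DF)
print(round(tfidf, 3))          # 4.828
```

If the term instead appeared in all 10 documents (DF = N), log(N/DF) = 0 and the weight vanishes, which is exactly how stop-word-like terms get suppressed.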
Limitations of BoW / TF-IDF
Exam-ready points
• Does not capture word order or syntax.
• "not good" and "good" are treated similarly.
• Synonyms ("car", "automobile") appear as different dimensions.
→ Dense word embeddings (Ch 13) address these limitations.
- Distributional hypothesis: words with similar contexts have similar meanings (Firth, 1957).
- Co-occurrence matrix → SVD → dense semantic vectors (LSA).
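The LSA pipeline (TF-IDF matrix → truncated SVD → dense semantic vectors) can be sketched with scikit-learn. The three toy documents and the choice of 2 components are illustrative only:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["deep learning uses neural networks",
        "neural networks have many layers",
        "cats and dogs are pets"]

X = TfidfVectorizer().fit_transform(docs)         # sparse term-document weights
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)                          # dense 2-D document vectors

print(Z.shape)  # (3, 2) -- one dense vector per document
```

TruncatedSVD works directly on the sparse matrix, so the full dense term-document matrix never needs to be materialised.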
Worked Example — TF-IDF + Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["deep learning uses neural networks",
"neural networks have many layers",
"cats and dogs are pets"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
sim = cosine_similarity(X)
print(sim.round(2))
# sim[0,1] ≈ 0.3 (related), sim[0,2] ≈ 0.0 (unrelated)
Exercises
- Manually compute TF-IDF for the word "learning" in the sentence "Deep learning uses learning" given a corpus of 10 documents where 3 contain "learning".
- What is cosine similarity and why is it preferred over Euclidean distance for text comparison?
- List the steps of a text preprocessing pipeline and justify each step.
Viva Questions
- What is TF-IDF and how is it computed?
- What is the distributional hypothesis?
- Why does BoW fail to capture semantic meaning?
- What is Latent Semantic Analysis (LSA)?
- How does cosine similarity differ from Euclidean distance for document comparison?