Unit IV — Deep Learning for NLP & Computer Vision
Chapter 12 — NLP Foundations: Vector Space Semantics
Objectives
Understand text preprocessing pipelines · Apply bag-of-words and TF-IDF representations · Interpret vector space semantics
1. Introduction to NLP
Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language. A typical text preprocessing pipeline consists of the following steps:
- Tokenisation: split text into tokens (words or subwords).
- Lowercasing & cleaning: remove punctuation, special characters.
- Stop-word removal: discard very frequent words (the, is, a…) that carry little semantic weight.
- Stemming / Lemmatisation: reduce words to root form (running → run).
- Vectorisation: convert tokens to numerical representation.
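The pipeline steps above can be sketched in plain Python. This is a minimal illustration only: the stop-word list and the suffix-stripping rule are toy simplifications (real systems use libraries such as NLTK or spaCy with proper stemmers/lemmatisers).

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "and", "i"}  # tiny illustrative list

def preprocess(text):
    # Lowercasing & cleaning: keep only letters and spaces
    cleaned = re.sub(r"[^a-z\s]", "", text.lower())
    # Tokenisation: split on whitespace
    tokens = cleaned.split()
    # Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stemming: strip a trailing "ing" and de-double the final
    # consonant (running -> runn -> run); real stemmers are far more careful
    stemmed = []
    for t in tokens:
        if t.endswith("ing") and len(t) > 5:
            t = t[:-3]
            if len(t) > 2 and t[-1] == t[-2]:
                t = t[:-1]
        stemmed.append(t)
    return stemmed

print(preprocess("The model is running quickly!"))  # ['model', 'run', 'quickly']
```

Each step discards information the later vectorisation step does not need, which keeps the vocabulary (and hence vector dimensionality) small.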
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ["I love deep learning", "Deep learning is powerful"]
cv = CountVectorizer()
X_bow = cv.fit_transform(corpus).toarray() # bag-of-words
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus).toarray()
Exam-ready points
- Vocabulary size grows with corpus → sparse, high-dimensional BoW vectors.
- n-grams capture word order to some extent: bigrams = pairs of adjacent words.
2. Vector Space Model of Semantics
In the vector space model, each document (or word) is represented as a vector in a high-dimensional space. Semantic similarity is measured by cosine similarity:
\[\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \in [-1, 1]\]
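The formula can be checked directly with NumPy; the 3-dimensional vectors here are made up for illustration:

```python
import numpy as np

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 0.0])

# cos(u, v) = (u . v) / (||u|| ||v||)
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos, 2))  # 0.8
```

Because the norms are divided out, cosine similarity depends only on the vectors' directions, not their lengths.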
TF-IDF (Term Frequency – Inverse Document Frequency) weights each term:
\[\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \log\frac{N}{\text{DF}(t)}\]
where TF = term frequency in the document, N = total number of documents, and DF = number of documents containing the term. Terms that occur often in a document but rarely across the corpus receive high weight.
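Plugging numbers into the formula makes the weighting concrete. This is an illustrative calculation with invented counts, using the textbook formula above (sklearn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly):

```python
import math

# Suppose term t appears 3 times in document d, the corpus has N = 10
# documents, and t occurs in DF = 2 of them.
tf, N, df = 3, 10, 2
tfidf = tf * math.log(N / df)   # TF-IDF(t, d) = TF * log(N / DF)
print(round(tfidf, 3))          # 4.828
```

If the term instead appeared in all 10 documents (DF = N), log(N/DF) = 0 and the weight vanishes, which is exactly how stop-word-like terms get suppressed.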
Limitations of BoW / TF-IDF
Exam-ready points
• Does not capture word order or syntax.
• "not good" and "good" are treated similarly.
• Synonyms ("car", "automobile") appear as different dimensions.
→ Dense word embeddings (Ch 13) address these limitations.
- Distributional hypothesis: words with similar contexts have similar meanings (Firth, 1957).
- Co-occurrence matrix → SVD → dense semantic vectors (LSA).
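The LSA pipeline (TF-IDF matrix → truncated SVD → dense semantic vectors) can be sketched with scikit-learn. The three toy documents and the choice of 2 components are illustrative only:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["deep learning uses neural networks",
        "neural networks have many layers",
        "cats and dogs are pets"]

X = TfidfVectorizer().fit_transform(docs)         # sparse term-document weights
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)                          # dense 2-D document vectors

print(Z.shape)  # (3, 2) -- one dense vector per document
```

TruncatedSVD works directly on the sparse matrix, so the full dense term-document matrix never needs to be materialised.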
Worked Example — TF-IDF + Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["deep learning uses neural networks",
"neural networks have many layers",
"cats and dogs are pets"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
sim = cosine_similarity(X)
print(sim.round(2))
# sim[0,1] ≈ 0.3 (related), sim[0,2] ≈ 0.0 (unrelated)
Exercises
- Manually compute TF-IDF for the word "learning" in the sentence "Deep learning uses learning" given a corpus of 10 documents where 3 contain "learning".
- What is cosine similarity and why is it preferred over Euclidean distance for text comparison?
- List the steps of a text preprocessing pipeline and justify each step.
Viva Questions
- What is TF-IDF and how is it computed?
- What is the distributional hypothesis?
- Why does BoW fail to capture semantic meaning?
- What is Latent Semantic Analysis (LSA)?
- How does cosine similarity differ from Euclidean distance for document comparison?