Exp 12 — Spam detection with TF-IDF

Record-ready template. Fill placeholders with your dataset, code, outputs, plots, screenshots, and viva answers.
AIML355 • Fundamentals of Deep Learning Lab

Submission checklist
Aim ✓ • Environment ✓ • Dataset ✓ • Procedure ✓ • Code ✓ • Output ✓ • Discussion ✓ • Viva ✓

1) Aim

To build an NLP model for spam detection using TF-IDF in Python.

Learning outcomes
  • Prepare labeled SMS/email dataset and clean text.
  • Build TF-IDF features and train classifier.
  • Evaluate using precision/recall/F1 and confusion matrix.
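
As a minimal sketch of the evaluation outcome above, the snippet below computes a confusion matrix and precision/recall/F1 with scikit-learn. The labels in `y_true` and `y_pred` are hypothetical values chosen only for illustration; they are not results from any dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = spam, 0 = ham).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here a false positive is a legitimate message flagged as spam, and a false negative is a spam message that gets through.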

2) Requirements / Environment

Software
  • Python 3.10+ (recommended)
  • scikit-learn (TF-IDF features and linear classifiers)
  • NumPy, Pandas, Matplotlib
  • Jupyter/Colab optional
Hardware
  • CPU is OK for small runs; GPU optional
  • RAM: 4–8 GB+ recommended
Reproducibility
Record library versions and random seed in your final report.
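
One possible way to capture this information is sketched below. It assumes scikit-learn is installed alongside NumPy and Pandas, and the seed value `SEED = 42` is only an example choice.

```python
import platform
import random

import numpy as np
import pandas as pd
import sklearn

SEED = 42  # example value; any fixed integer works

# Seed the Python and NumPy random number generators.
random.seed(SEED)
np.random.seed(SEED)

# Collect the versions worth recording in the report.
versions = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
print(f"seed: {SEED}")
```

For scikit-learn functions such as train_test_split, pass the same seed through the random_state argument so that the split is repeatable.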

3) Dataset

  • Source: Use SMS Spam Collection dataset (UCI / Kaggle) and cite it.
  • Features/Labels: [Describe X (message text) and y (spam/ham label); mention class balance]
  • Split: [Train/Validation/Test or K-fold]
  • Preprocessing: [Lowercasing, punctuation/stop-word handling, tokenization, etc.]
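
The following is a hedged sketch of loading and cleaning, not a required implementation. It assumes the file layout of the UCI copy (tab-separated, no header row, label first, message second); other copies may use a different layout. The in-memory `sample` rows and the `clean_text` helper are hypothetical stand-ins so the snippet runs without the download.

```python
import csv
import io
import re

import pandas as pd

# Hypothetical stand-in for the downloaded file; real rows come from the dataset.
sample = io.StringIO(
    "ham\tAre we still meeting at 5?\n"
    "spam\tWIN a FREE prize!!! Call 0900 123 456 now\n"
    "ham\tOk, see you then.\n"
)

# Assumed layout: tab-separated, no header, label then message.
# QUOTE_NONE keeps quote characters inside messages from confusing the parser.
df = pd.read_csv(sample, sep="\t", header=None, names=["label", "text"],
                 quoting=csv.QUOTE_NONE)


def clean_text(text: str) -> str:
    """Lowercase, replace digit runs with a placeholder token, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " num ", text)
    return re.sub(r"\s+", " ", text).strip()


df["text"] = df["text"].map(clean_text)
df["label"] = df["label"].map({"ham": 0, "spam": 1})  # 1 = spam

print(df)
```

To use the real data, replace `sample` with the path to your downloaded file and adjust the column names if your copy differs.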

4) Procedure / Steps

  1. Load dataset and perform preprocessing.
  2. Split into train and test sets, then fit TF-IDF features on the training text only.
  3. Choose a classifier (Linear SVM or logistic regression) and justify key choices.
  4. Train the classifier; tune n-grams and max_features using validation data or cross-validation.
  5. Evaluate on test set and compute required metrics.
  6. Summarize observations and limitations.
Model hint
TF-IDF + Linear SVM/LogReg baseline; tune n-grams and max_features.
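
The tuning suggested in the hint could be sketched with a scikit-learn Pipeline and GridSearchCV, as below. The toy corpus, the grid values, and the choice of F1 as the scoring metric are all assumptions made for illustration; substitute your own data and ranges.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only to make the sketch runnable (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free cash call now", "claim your free reward",
    "urgent winner call today", "free entry win cash", "call now to claim prize",
    "are we meeting today", "see you at lunch", "please send the notes",
    "i will call you later", "thanks for the help", "what time is the lecture",
]
labels = [1] * 6 + [0] * 6

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so held-out text never shapes the vocabulary.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Example grid; the values are illustrative, not recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_features": [20, None],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On a corpus this small the scores are not meaningful; the point is the structure, which carries over unchanged to the full dataset.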

5) Code (Skeleton)

Paste your complete runnable code below (or attach notebook link in the final submission).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

# TODO: load spam dataset
# df = pd.read_csv(...)
# X = df['text']
# y = df['label']

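# Split the raw text first so the test set never influences the TF-IDF vocabulary or IDF weights.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the vectorizer on the training text only, then reuse it to transform the test text.
vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

clf = LinearSVC()
clf.fit(X_train_vec, y_train)
pred = clf.predict(X_test_vec)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))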

6) Results / Output

  • Metrics: [Write your final values: accuracy, precision, recall, F1]
  • Plots: [Attach the confusion matrix plot and any charts comparing settings]
  • Screenshots: [Paste screenshots of outputs, confusion matrix, sample predictions]

7) Observations / Discussion

  • [Observation 1: what changed when you tuned n-grams/max_features?]
  • [Observation 2: evidence of overfitting/underfitting?]
  • [Observation 3: what improved performance (text cleaning, regularization strength, class weighting)?]

8) Conclusion

Write 3–6 lines summarizing what you implemented, key result, and what you learned.

9) Viva Questions

  1. Why is precision important for spam detection?
  2. What is a confusion matrix?
  3. What is the difference between training and validation data?
  4. Explain overfitting and two ways to reduce it.
  5. Why do we normalize/scale inputs?
  6. What do batch size and epoch mean?
  7. How do you choose a loss function for a task?

10) Post-lab Assignment

  • Try LinearSVC vs LogisticRegression.
  • Tune n-grams and max_features and compare.
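
A possible starting point for the classifier comparison is sketched below, using cross-validated F1 on the same TF-IDF features. The toy corpus is hypothetical, and `max_iter=1000` and `cv=3` are illustrative settings rather than prescribed ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only to make the sketch runnable (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free cash call now", "claim your free reward",
    "urgent winner call today", "free entry win cash", "call now to claim prize",
    "are we meeting today", "see you at lunch", "please send the notes",
    "i will call you later", "thanks for the help", "what time is the lecture",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Evaluate each classifier with identical features and folds.
scores = {}
for name, clf in candidates.items():
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores[name] = cross_val_score(model, texts, labels, cv=3, scoring="f1").mean()
    print(f"{name}: mean F1 = {scores[name]:.3f}")
```

When reporting the comparison, keep the vectorizer settings and the folds identical for both classifiers so that any difference comes from the model alone.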