NLP Concepts
Natural Language Processing (NLP) is the field of machine learning focused on understanding and generating human language. Before any model can work with text, that text must be transformed into numerical representations. This section covers the essential preprocessing pipeline and vectorization methods.
Text Preprocessing Pipeline
Text preprocessing converts raw text into a clean, numerical format that models can consume. The steps are typically applied in order.
| Step | What It Does | Example |
|---|---|---|
| 1. Tokenization | Split text into individual words/tokens | "I love ML" → ["I", "love", "ML"] |
| 2. Lowercasing | Convert to lowercase | "ML" → "ml" |
| 3. Stop Word Removal | Remove common words (the, is, a, an) | ["I", "love", "ml"] → ["love", "ml"] |
| 4. Stemming | Reduce words to root form (crude, rule-based) | "running", "runs", "ran" → "run" |
| 5. Lemmatization | Reduce to dictionary base form (uses language rules) | "better" → "good", "ran" → "run" |
| 6. Vectorization | Convert text to numbers | TF-IDF, Word2Vec, or embeddings |
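Steps 1–4 of the table above can be sketched in a few lines of plain Python. The tiny stop-word list and naive suffix-stripping stemmer here are illustrative assumptions, not a real library; in practice you would use something like NLTK or spaCy.

```python
import re

# Illustrative stand-ins (assumptions): a tiny stop-word list and a
# naive suffix-stripping "stemmer" in the spirit of step 4.
STOP_WORDS = {"the", "is", "a", "an", "i"}

def naive_stem(word):
    # Crude rule-based suffix stripping; may produce non-words.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)              # 1. tokenization
    tokens = [t.lower() for t in tokens]                 # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop word removal
    return [naive_stem(t) for t in tokens]               # 4. stemming

print(preprocess("I love running models"))  # ['love', 'runn', 'model']
```

Note that the crude stemmer yields `'runn'` rather than `'run'`, which is exactly the kind of non-word output the next section contrasts with lemmatization.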
Stemming vs Lemmatization
| | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based suffix stripping | Dictionary/grammar-based lookup |
| Speed | Faster | Slower |
| Accuracy | May produce non-words ("studi" from "studies") | Always produces real words |
| Use case | When speed matters and approximate roots are acceptable | When linguistic accuracy matters |
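The contrast in the table can be seen with a toy comparison. The crude stemmer and the tiny hand-made lemma dictionary below are assumptions for illustration; a real lemmatizer (e.g. NLTK's WordNetLemmatizer) uses a full dictionary plus part-of-speech information.

```python
# Toy comparison (illustrative only): crude suffix stripping vs a tiny
# hand-made lemma lookup table.
def crude_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + ("i" if suffix == "ies" else "")
    return word

LEMMAS = {"studies": "study", "better": "good", "ran": "run", "running": "run"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["studies", "better", "running"]:
    print(w, "->", crude_stem(w), "|", lemmatize(w))
# studies -> studi | study
# better  -> better | good
# running -> runn  | run
```

The stemmer happily emits non-words like `studi` and `runn`; the lemmatizer always maps to real dictionary forms, but only because it carries a (much larger, in practice) lookup table.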
Text Vectorization Methods
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Bag of Words (BoW) | Count occurrences of each word | Simple, easy to implement | Loses word order, high-dimensional, sparse |
| TF-IDF | Term Frequency × Inverse Document Frequency | Downweights common words, upweights distinctive words | Still sparse, loses word order |
| Word2Vec | Dense vector embeddings in continuous space | Captures semantic meaning; similar words are close | Requires training or pre-trained vectors |
| Pre-trained Embeddings (GloVe, fastText) | Pre-computed vectors from large corpora | Quick to use, no training needed | May not fit domain-specific vocabulary |
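Bag of Words, the simplest method in the table, fits in a few lines of plain Python. The two-document corpus is a made-up example:

```python
from collections import Counter

# Minimal bag-of-words sketch: build a vocabulary, then represent each
# document as a vector of word counts. Word order is lost, and most
# entries are zero once the vocabulary grows (sparsity).
docs = ["i love ml", "i love nlp and ml"]  # toy corpus (assumption)
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)                # ['and', 'i', 'love', 'ml', 'nlp']
print(bow_vector(docs[0]))  # [0, 1, 1, 1, 0]
```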
TF-IDF Explained
```python
# TF-IDF = Term Frequency × Inverse Document Frequency
# TF(t, d)     = count of term t in document d / total terms in d
# IDF(t)       = log(total documents / documents containing t)
# TF-IDF(t, d) = TF(t, d) × IDF(t)
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat", "the dog ran"]  # any list of raw text strings
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)  # sparse matrix: (n_documents, n_features)
```
TF-IDF upweights words that are distinctive to a document and downweights words that appear everywhere.
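The formulas above can also be implemented directly in plain Python. This is a sketch of the textbook definition; note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math

# Hand-rolled TF-IDF following the textbook formulas (not sklearn's
# smoothed, normalized variant).
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", docs[0], docs))            # 0.0: "the" appears in every doc
print(round(tfidf("cat", docs[0], docs), 3))  # 0.135: "cat" is distinctive
```

"the" scores exactly zero because it appears in all three documents (IDF = log(3/3) = 0), which is the downweighting behavior described above.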
Word2Vec
Word2Vec learns dense vector representations where semantically similar words end up close in vector space.
Two training modes:
- Skip-gram: predicts surrounding context words from the center word; slower, but works better for rare words.
- CBOW (Continuous Bag of Words): predicts the center word from its surrounding context words; faster, and works better for frequent words.

The famous analogy "King - Man + Woman ≈ Queen" shows that the model learns semantic relationships, not just raw co-occurrence counts: arithmetic on the vectors corresponds to arithmetic on meaning.
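The vector arithmetic behind the analogy can be demonstrated with hand-crafted toy vectors. These 3-d vectors are an assumption deliberately constructed so the analogy holds (dimensions loosely "royalty, gender, other"); real Word2Vec embeddings are learned from data and typically have 100–300 dimensions.

```python
# HAND-CRAFTED toy vectors (an assumption for illustration; real
# Word2Vec vectors are learned, not designed).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "apple": [0.0, 0.0, 0.9],
}

def analogy(a, b, c):
    # Compute a - b + c, then return the nearest other word
    # by squared Euclidean distance.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    def dist(w):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[w]))
    return min((w for w in vectors if w not in (a, b, c)), key=dist)

print(analogy("king", "man", "woman"))  # queen
```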
Dealing with Large Vocabularies
When you have a rich vocabulary with many low-frequency words, the feature space becomes very sparse.
Solution: Apply stemming + stop word removal to reduce vocabulary size and consolidate word variants before vectorization.
TF-IDF does not reduce vocabulary size — it only reweights terms. If vocabulary sparsity is the problem, preprocessing steps like stemming and stop word removal are the right approach.
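A quick sketch of the effect, using a toy corpus and the same kind of crude stop list and stemmer as above (both assumptions for illustration):

```python
# Measure how stop word removal + crude stemming shrink the vocabulary
# before vectorization.
STOP_WORDS = {"the", "a", "is", "are"}

def stem(w):
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

corpus = "the models model a modeling task the tasks are modeled".split()

raw_vocab = set(corpus)
reduced_vocab = {stem(w) for w in corpus if w not in STOP_WORDS}

print(len(raw_vocab), "->", len(reduced_vocab))  # 9 -> 2
```

Nine distinct surface forms collapse to two features (`model`, `task`), which directly shrinks the columns of any BoW or TF-IDF matrix built afterward.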
Flashcards
What is the correct order of a basic NLP preprocessing pipeline?
1) Tokenization, 2) Lowercasing, 3) Stop word removal, 4) Stemming or Lemmatization, 5) Vectorization (BoW, TF-IDF, or embeddings). Not all steps are always needed; the pipeline depends on the task.