
NLP Concepts

Natural Language Processing (NLP) is the field of machine learning focused on understanding and generating human language. Before any model can work with text, that text must be transformed into numerical representations. This section covers the essential preprocessing pipeline and vectorization methods.

Text Preprocessing Pipeline

Text preprocessing converts raw text into a clean, numerical format that models can consume. The steps are typically applied in order.

| Step | What It Does | Example |
|---|---|---|
| 1. Tokenization | Split text into individual words/tokens | "I love ML" → ["I", "love", "ML"] |
| 2. Lowercasing | Convert to lowercase | "ML" → "ml" |
| 3. Stop Word Removal | Remove common words (the, is, a, an) | ["I", "love", "ml"] → ["love", "ml"] |
| 4. Stemming | Reduce words to root form (crude, rule-based) | "running", "runs", "ran" → "run" |
| 5. Lemmatization | Reduce to dictionary base form (uses language rules) | "better" → "good", "ran" → "run" |
| 6. Vectorization | Convert text to numbers | TF-IDF, Word2Vec, or embeddings |
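The first four steps can be sketched in a few lines of plain Python. This is a deliberately minimal illustration: the stop-word list and suffix rules are toy assumptions, and a real pipeline would use a library such as NLTK or spaCy.

```python
import re

# Toy stop-word list (assumption: illustrative only, not a real list)
STOP_WORDS = {"the", "is", "a", "an", "i"}

def crude_stem(word):
    # Step 4: rule-based suffix stripping. Crude by design -- it can
    # produce non-words (e.g. "running" -> "runn", not "run").
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)              # 1. tokenization
    tokens = [t.lower() for t in tokens]                 # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop word removal
    return [crude_stem(t) for t in tokens]               # 4. stemming

print(preprocess("I love running the ML models"))
# → ['love', 'runn', 'ml', 'model']
```

Note how the crude stemmer mangles "running" into "runn"; this is exactly the weakness the stemming-vs-lemmatization comparison below addresses.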

Stemming vs Lemmatization

| | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based suffix stripping | Dictionary/grammar-based lookup |
| Speed | Faster | Slower |
| Accuracy | May produce non-words ("studi" from "studies") | Always produces real words |
| Use case | When speed matters and approximate roots are acceptable | When linguistic accuracy matters |
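The contrast can be made concrete with two minimal, hypothetical implementations. Real code would use NLTK's `PorterStemmer` and `WordNetLemmatizer`; these toy versions exist only to show why rule-stripping yields "studi" while a dictionary lookup yields "study".

```python
def rule_stem(word):
    # Stemming: blind suffix rules, no dictionary
    if word.endswith("ies"):
        return word[:-3] + "i"   # "studies" -> "studi" (a non-word)
    if word.endswith("s"):
        return word[:-1]
    return word

# Lemmatization: dictionary lookup (tiny hypothetical table)
LEMMA_TABLE = {"studies": "study", "better": "good", "ran": "run"}

def lemmatize(word):
    return LEMMA_TABLE.get(word, word)  # fall back to the word itself

print(rule_stem("studies"))   # → studi  (fast, but not a real word)
print(lemmatize("studies"))   # → study  (real word, needs the lookup table)
print(lemmatize("better"))    # → good
```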

Text Vectorization Methods

| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Bag of Words (BoW) | Count occurrences of each word | Simple, easy to implement | Loses word order, high-dimensional, sparse |
| TF-IDF | Term Frequency × Inverse Document Frequency | Downweights common words, upweights distinctive words | Still sparse, loses word order |
| Word2Vec | Dense vector embeddings in continuous space | Captures semantic meaning; similar words are close | Requires training or pre-trained vectors |
| Pre-trained Embeddings (GloVe, fastText) | Pre-computed vectors from large corpora | Quick to use, no training needed | May not fit domain-specific vocabulary |
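Bag of Words is simple enough to sketch from scratch: build a shared vocabulary, then map each document to a vector of word counts. The two toy documents below are assumptions for illustration.

```python
from collections import Counter

# Two pre-tokenized toy documents
docs = [["love", "ml"], ["ml", "is", "fun", "ml"]]

# Shared vocabulary across the corpus, sorted for a stable column order
vocab = sorted({w for doc in docs for w in doc})  # ['fun', 'is', 'love', 'ml']

def bow_vector(doc):
    # One count per vocabulary word; word order in `doc` is discarded
    counts = Counter(doc)
    return [counts[w] for w in vocab]

print(vocab)
print([bow_vector(d) for d in docs])  # → [[0, 0, 1, 1], [1, 1, 0, 2]]
```

Note the cons from the table showing up directly: the vectors are mostly zeros (sparse), and "ml is fun" and "fun is ml" would produce identical vectors (word order lost).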

TF-IDF Explained

# TF-IDF = Term Frequency × Inverse Document Frequency
# TF(t, d) = count of term t in document d / total terms in d
# IDF(t) = log(total documents / documents containing t)
# TF-IDF(t, d) = TF(t, d) × IDF(t)

from sklearn.feature_extraction.text import TfidfVectorizer

# documents: a list of raw text strings
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)  # sparse document-term matrix

TF-IDF upweights words that are distinctive to a document and downweights words that appear everywhere.
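The formulas above can be implemented directly. This is an illustrative from-scratch sketch on a toy corpus; sklearn's `TfidfVectorizer` applies smoothing and normalization, so its numbers will differ.

```python
import math

# Toy pre-tokenized corpus (assumption: illustrative only)
docs = [["ml", "is", "fun"], ["ml", "is", "hard"], ["cats", "are", "fun"]]

def tf(term, doc):
    # TF(t, d) = count of term t in document d / total terms in d
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF(t) = log(total documents / documents containing t)
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "ml" appears in 2 of 3 documents, "hard" in only 1,
# so "hard" is more distinctive of docs[1] and scores higher.
print(round(tfidf("ml", docs[1], docs), 3))
print(round(tfidf("hard", docs[1], docs), 3))
```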

Word2Vec

Word2Vec learns dense vector representations where semantically similar words end up close in vector space.

# Famous example: King - Man + Woman ≈ Queen
# Two training modes:
# - Skip-gram: predicts surrounding context words from the center word
# - CBOW: predicts the center word from surrounding context words
Key Insight

Word2Vec captures semantic relationships. "King - Man + Woman ≈ Queen" demonstrates that the model learns meaning, not just co-occurrence. Skip-gram works better with rare words; CBOW is faster and works better with frequent words.
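The analogy arithmetic can be demonstrated with cosine similarity over word vectors. The 3-dimensional embeddings below are hand-constructed toys so that the analogy holds by design; real Word2Vec vectors are learned from data and typically have 100–300 dimensions.

```python
import math

# Hand-made toy embeddings (assumption: not learned vectors)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "apple": [0.1, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# king - man + woman, component-wise
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest neighbour by cosine similarity, excluding the input words
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # → queen
```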

Dealing with Large Vocabularies

When you have a rich vocabulary with many low-frequency words, the feature space becomes very sparse.

Solution: Apply stemming + stop word removal to reduce vocabulary size and consolidate word variants before vectorization.

note

TF-IDF does not reduce vocabulary size — it only reweights terms. If vocabulary sparsity is the problem, preprocessing steps like stemming and stop word removal are the right approach.
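The effect of this preprocessing on vocabulary size can be sketched on a toy token list. The stop-word set and suffix rules here are assumptions; a real pipeline would use NLTK or spaCy, and a better stemmer would avoid non-words like "runn".

```python
STOP_WORDS = {"the", "a", "is"}

def crude_stem(w):
    # Crude rule-based stripping; can yield non-words (e.g. "runn")
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

raw = ["the", "runner", "runs", "running", "is", "a", "run"]

# Stop word removal + stemming consolidates variants before vectorization
reduced = {crude_stem(w) for w in raw if w not in STOP_WORDS}

print(len(set(raw)), "->", len(reduced))  # → 7 -> 3
```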

Flashcards

Question

What is the correct order of a basic NLP preprocessing pipeline?

Answer

1) Tokenization, 2) Lowercasing, 3) Stop word removal, 4) Stemming or Lemmatization, 5) Vectorization (BoW, TF-IDF, or embeddings). Not all steps are always needed — the pipeline depends on the task.