NLP Concepts
Natural Language Processing (NLP) is the field of machine learning focused on understanding and generating human language. Before any model can work with text, that text must be transformed into numerical representations. This section covers the essential preprocessing pipeline and vectorization methods.
Text Preprocessing Pipeline
Text preprocessing converts raw text into a clean, numerical format that models can consume. The steps are typically applied in order.
| Step | What It Does | Example |
|---|---|---|
| 1. Tokenization | Split text into individual words/tokens | "I love ML" → ["I", "love", "ML"] |
| 2. Lowercasing | Convert to lowercase | "ML" → "ml" |
| 3. Stop Word Removal | Remove common words (the, is, a, an) | ["I", "love", "ml"] → ["love", "ml"] |
| 4. Stemming | Reduce words to root form (crude, rule-based) | "running", "runs", "ran" → "run" |
| 5. Lemmatization | Reduce to dictionary base form (uses language rules) | "better" → "good", "ran" → "run" |
| 6. Vectorization | Convert text to numbers | TF-IDF, Word2Vec, or embeddings |
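Steps 1–4 of the table above can be sketched in a few lines of plain Python. The tiny stop-word list and naive suffix-stripping stemmer here are illustrative assumptions, not a real library; in practice you would use something like NLTK or spaCy.

```python
import re

# Illustrative stand-ins (assumptions): a tiny stop-word list and a
# naive suffix-stripping "stemmer" in the spirit of step 4.
STOP_WORDS = {"the", "is", "a", "an", "i"}

def naive_stem(word):
    # Crude rule-based suffix stripping; may produce non-words.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-zA-Z]+", text)              # 1. tokenization
    tokens = [t.lower() for t in tokens]                 # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop word removal
    return [naive_stem(t) for t in tokens]               # 4. stemming

print(preprocess("I love running models"))  # ['love', 'runn', 'model']
```

Note that the crude stemmer yields `'runn'` rather than `'run'`, which is exactly the kind of non-word output the next section contrasts with lemmatization.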
Stemming vs Lemmatization
| | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based suffix stripping | Dictionary/grammar-based lookup |
| Speed | Faster | Slower |
| Accuracy | May produce non-words ("studi" from "studies") | Always produces real words |
| Use case | When speed matters and approximate roots are acceptable | When linguistic accuracy matters |
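The contrast in the table can be seen with a toy comparison. The crude stemmer and the tiny hand-made lemma dictionary below are assumptions for illustration; a real lemmatizer (e.g. NLTK's WordNetLemmatizer) uses a full dictionary plus part-of-speech information.

```python
# Toy comparison (illustrative only): crude suffix stripping vs a tiny
# hand-made lemma lookup table.
def crude_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + ("i" if suffix == "ies" else "")
    return word

LEMMAS = {"studies": "study", "better": "good", "ran": "run", "running": "run"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["studies", "better", "running"]:
    print(w, "->", crude_stem(w), "|", lemmatize(w))
# studies -> studi | study
# better  -> better | good
# running -> runn  | run
```

The stemmer happily emits non-words like `studi` and `runn`; the lemmatizer always maps to real dictionary forms, but only because it carries a (much larger, in practice) lookup table.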
Text Vectorization Methods
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Bag of Words (BoW) | Count occurrences of each word | Simple, easy to implement | Loses word order, high-dimensional, sparse |
| TF-IDF | Term Frequency × Inverse Document Frequency | Downweights common words, upweights distinctive words | Still sparse, loses word order |
| Word2Vec | Dense vector embeddings in continuous space | Captures semantic meaning; similar words are close | Requires training or pre-trained vectors |
| Pre-trained Embeddings (GloVe, fastText) | Pre-computed vectors from large corpora | Quick to use, no training needed | May not fit domain-specific vocabulary |
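Bag of Words, the simplest method in the table, fits in a few lines of plain Python. The two-document corpus is a made-up example:

```python
from collections import Counter

# Minimal bag-of-words sketch: build a vocabulary, then represent each
# document as a vector of word counts. Word order is lost, and most
# entries are zero once the vocabulary grows (sparsity).
docs = ["i love ml", "i love nlp and ml"]  # toy corpus (assumption)
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)                # ['and', 'i', 'love', 'ml', 'nlp']
print(bow_vector(docs[0]))  # [0, 1, 1, 1, 0]
```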
TF-IDF Explained
```python
# TF-IDF = Term Frequency × Inverse Document Frequency
# TF(t, d)     = count of term t in document d / total terms in d
# IDF(t)       = log(total documents / documents containing t)
# TF-IDF(t, d) = TF(t, d) × IDF(t)
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat", "the dog ran"]  # any list of raw text strings
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)  # sparse matrix: (n_documents, n_features)
```
TF-IDF upweights words that are distinctive to a document and downweights words that appear everywhere.
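The formulas above can also be implemented directly in plain Python. This is a sketch of the textbook definition; note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math

# Hand-rolled TF-IDF following the textbook formulas (not sklearn's
# smoothed, normalized variant).
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", docs[0], docs))            # 0.0: "the" appears in every doc
print(round(tfidf("cat", docs[0], docs), 3))  # 0.135: "cat" is distinctive
```

"the" scores exactly zero because it appears in all three documents (IDF = log(3/3) = 0), which is the downweighting behavior described above.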
Word2Vec
Word2Vec learns dense vector representations where semantically similar words end up close in vector space.
Two training modes:
- Skip-gram: predicts surrounding context words from the center word; slower, but works better for rare words.
- CBOW (Continuous Bag of Words): predicts the center word from its surrounding context words; faster, and works better for frequent words.

The famous analogy "King - Man + Woman ≈ Queen" shows that the model learns semantic relationships, not just raw co-occurrence counts: arithmetic on the vectors corresponds to arithmetic on meaning.
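The vector arithmetic behind the analogy can be demonstrated with hand-crafted toy vectors. These 3-d vectors are an assumption deliberately constructed so the analogy holds (dimensions loosely "royalty, gender, other"); real Word2Vec embeddings are learned from data and typically have 100–300 dimensions.

```python
# HAND-CRAFTED toy vectors (an assumption for illustration; real
# Word2Vec vectors are learned, not designed).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "apple": [0.0, 0.0, 0.9],
}

def analogy(a, b, c):
    # Compute a - b + c, then return the nearest other word
    # by squared Euclidean distance.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    def dist(w):
        return sum((t - v) ** 2 for t, v in zip(target, vectors[w]))
    return min((w for w in vectors if w not in (a, b, c)), key=dist)

print(analogy("king", "man", "woman"))  # queen
```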
Dealing with Large Vocabularies
When you have a rich vocabulary with many low-frequency words, the feature space becomes very sparse.
Solution: Apply stemming + stop word removal to reduce vocabulary size and consolidate word variants before vectorization.
TF-IDF does not reduce vocabulary size — it only reweights terms. If vocabulary sparsity is the problem, preprocessing steps like stemming and stop word removal are the right approach.
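A quick sketch of the effect, using a toy corpus and the same kind of crude stop list and stemmer as above (both assumptions for illustration):

```python
# Measure how stop word removal + crude stemming shrink the vocabulary
# before vectorization.
STOP_WORDS = {"the", "a", "is", "are"}

def stem(w):
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

corpus = "the models model a modeling task the tasks are modeled".split()

raw_vocab = set(corpus)
reduced_vocab = {stem(w) for w in corpus if w not in STOP_WORDS}

print(len(raw_vocab), "->", len(reduced_vocab))  # 9 -> 2
```

Nine distinct surface forms collapse to two features (`model`, `task`), which directly shrinks the columns of any BoW or TF-IDF matrix built afterward.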
Flashcards
What is the correct order of a basic NLP preprocessing pipeline?
1) Tokenization, 2) Lowercasing, 3) Stop word removal, 4) Stemming or Lemmatization, 5) Vectorization (BoW, TF-IDF, or embeddings). Not all steps are always needed; the pipeline depends on the task.