# Feature Engineering
Feature engineering transforms raw data into features that better represent the underlying patterns for your model. Choosing the right encoding for categorical variables, reducing dimensionality intelligently, and detecting multicollinearity can make the difference between a mediocre model and an excellent one.
## Quick Reference
### Categorical Encoding
| Method | How It Works | When to Use |
|---|---|---|
| One-Hot Encoding | Create binary column for each category. [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Low-cardinality nominal data (< ~20 categories). Default for nominal categories |
| Ordinal Encoding | Assign integers based on order. [Low, Med, High] → [1, 2, 3] | Truly ordered categories only. Never for nominal categories |
| Label Encoding | Assign arbitrary integers to categories | Tree-based models (XGBoost, Random Forest) which don't assume order |
| Similarity Encoding | Encode based on string similarity between category values | High-cardinality with typos/misspellings (e.g., "New York", "new york", "NY") |
| Target Encoding | Replace category with mean of target variable for that category | High-cardinality when one-hot creates too many features. Risk of data leakage |
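As a sketch of the first two rows, here is how one-hot and ordinal encoding look with pandas (the column names and category values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],  # nominal: no inherent order
    "size": ["Low", "High", "Med", "Low"],      # ordinal: Low < Med < High
})

# One-hot: one binary column per category, safe for nominal data
onehot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integers that respect the real ordering of the categories
size_order = {"Low": 1, "Med": 2, "High": 3}
df["size_encoded"] = df["size"].map(size_order)

print(onehot.columns.tolist())      # ['color_Blue', 'color_Green', 'color_Red']
print(df["size_encoded"].tolist())  # [1, 3, 2, 1]
```

Applying the ordinal map to `color` instead would invent an order ("Blue < Green < Red") that linear models would treat as meaningful, which is exactly the mistake the table warns against.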
### Dimensionality Reduction
| Method | What It Does | Preserves Info? |
|---|---|---|
| PCA | Linear projection to principal components (max variance directions) | Yes — captures maximum variance in fewer dimensions. Must scale first |
| Remove correlated features | Drop features with high pairwise correlation | No — discards entire features. Use when interpretability is needed |
| L1 / Lasso | Regularization that drives coefficients to zero → automatic feature selection | Partially — keeps most important features. Produces sparse models |
| RFE | Iteratively removes least important features based on model | Partially — maintains interpretability |
| Autoencoder | Neural network compresses data through bottleneck layer | Yes — non-linear compression. Better than PCA for complex relationships |
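A minimal PCA sketch with scikit-learn on synthetic data, where a third feature is nearly a linear combination of the first two (the data and parameters are illustrative); note the scaling step the table calls out:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Third feature is almost x0 + x1, so the data is effectively 2-dimensional
X = np.column_stack([X, X[:, 0] + X[:, 1] + 0.01 * rng.normal(size=200)])

# Scale first: PCA maximizes variance, so an unscaled large-range
# feature would dominate the principal components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X_reduced.shape)                         # (200, 2)
print(pca.explained_variance_ratio_.sum())     # close to 1.0
```

Two components capture nearly all the variance here precisely because the third feature adds almost no independent information, which is the situation PCA is built to exploit.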
### Multicollinearity Detection
| Method | What It Tells You | Threshold |
|---|---|---|
| Correlation Matrix / Heatmap | Pairwise correlation between features | |r| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much variance is inflated by correlation with others | VIF > 5-10 = problematic |
| PCA eigenvalues | Near-zero eigenvalue = features are linearly dependent | Eigenvalue ≈ 0 = collinear features |
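A quick way to compute per-feature VIF is `variance_inflation_factor` from statsmodels. In the illustrative data below, `x3` is nearly collinear with `x1`, so both blow past the 5-10 threshold, while the independent `x2` stays near 1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for feature i = 1 / (1 - R^2) from regressing feature i on the rest
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # x1 and x3 large, x2 near 1
```

A correlation heatmap would flag the same `x1`/`x3` pair, but VIF also catches a feature that is a combination of several others, which no single pairwise correlation reveals.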
## Flashcards
When should you use one-hot encoding vs label encoding?
One-hot encoding for nominal categories with low cardinality (< ~20 values) — it is the default for non-ordered categories. Label encoding for tree-based models (XGBoost, Random Forest) where arbitrary integer ordering does not matter because trees split on thresholds.
Which regularization performs feature selection: L1 or L2?
"Feature selection via regularization" = L1/Lasso (drives weights to zero, producing sparse models). "Reduce overfitting without eliminating features" = L2/Ridge (shrinks all weights but keeps every feature). If you need to explain which features matter, L1 is your tool.
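To see the contrast concretely, a small scikit-learn sketch on synthetic data where only two of ten features drive the target (the alpha values and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first two features matter; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 zeroes out the irrelevant coefficients; L2 only shrinks them
print((lasso.coef_ == 0).sum())  # most of the 8 noise features are exactly 0
print((ridge.coef_ == 0).sum())  # typically 0: Ridge keeps every feature
```

Inspecting `lasso.coef_` directly tells you which features survived, which is why L1 doubles as an interpretable feature-selection step.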