Feature Engineering

Feature engineering transforms raw data into features that better represent the underlying patterns for your model. Choosing the right encoding for categorical variables, reducing dimensionality intelligently, and detecting multicollinearity can make the difference between a mediocre model and an excellent one.

Quick Reference

Categorical Encoding

| Method | How It Works | When to Use |
|---|---|---|
| One-Hot Encoding | Create a binary column for each category: [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Low-cardinality nominal data (< ~20 categories); the default for nominal categories |
| Ordinal Encoding | Assign integers based on order: [Low, Med, High] → [1, 2, 3] | Truly ordered categories only; never for nominal categories |
| Label Encoding | Assign arbitrary integers to categories | Tree-based models (XGBoost, Random Forest), which don't assume order |
| Similarity Encoding | Encode based on string similarity between category values | High-cardinality data with typos/misspellings (e.g., "New York", "new york", "NY") |
| Target Encoding | Replace each category with the mean of the target variable for that category | High-cardinality data when one-hot creates too many features; risk of data leakage |
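The first two rows can be sketched in a few lines. This is a minimal illustration with a hypothetical toy DataFrame (column names `color` and `size` are made up); note that scikit-learn's `OrdinalEncoder` assigns 0-based integers rather than the 1-based ones in the table.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: one nominal column and one ordered column.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],  # nominal — no order
    "size": ["Low", "High", "Med", "Low"],      # ordered: Low < Med < High
})

# One-hot: one binary column per color value.
onehot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integers that respect the true order (0-based in sklearn).
enc = OrdinalEncoder(categories=[["Low", "Med", "High"]])
df["size_ord"] = enc.fit_transform(df[["size"]])

print(sorted(onehot.columns.tolist()))  # three binary columns
print(df["size_ord"].tolist())          # [0.0, 2.0, 1.0, 0.0]
```

Passing an explicit `categories` list is what makes the encoding ordinal rather than arbitrary; without it, `OrdinalEncoder` would assign integers alphabetically.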

Dimensionality Reduction

| Method | What It Does | Preserves Info? |
|---|---|---|
| PCA | Linear projection onto principal components (directions of maximum variance) | Yes — captures maximum variance in fewer dimensions; must scale features first |
| Remove correlated features | Drop features with high pairwise correlation | No — discards entire features; use when interpretability is needed |
| L1 / Lasso | Regularization that drives coefficients to zero → automatic feature selection | Partially — keeps the most important features; produces sparse models |
| RFE | Iteratively removes the least important features based on a model | Partially — maintains interpretability |
| Autoencoder | Neural network compresses data through a bottleneck layer | Yes — non-linear compression; better than PCA for complex relationships |
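The "must scale first" caveat for PCA is easy to enforce with a pipeline. A minimal sketch on synthetic data (the 2-latent-factor setup here is an assumption for illustration, not from the source):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 3 observed features driven by 2 latent factors,
# so 2 principal components should capture nearly all the variance.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))

# Scale first — PCA directions are distorted by unequal feature scales.
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pca.fit_transform(X)

explained = pca.named_steps["pca"].explained_variance_ratio_.sum()
print(X_reduced.shape)           # (200, 2)
print(f"{explained:.3f}")        # close to 1.0 — little information lost
```

Bundling the scaler into the pipeline also prevents a subtle leak: at prediction time the same training-set scaling is reapplied automatically.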

Multicollinearity Detection

| Method | What It Tells You | Threshold |
|---|---|---|
| Correlation matrix / heatmap | Pairwise correlation between features | \|r\| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much a coefficient's variance is inflated by correlation with other features | VIF > 5–10 = problematic |
| PCA eigenvalues | A near-zero eigenvalue means features are linearly dependent | Eigenvalue ≈ 0 = collinear features |
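The VIF row can be demonstrated with statsmodels. A sketch on synthetic data where `x2` is deliberately built as a near-copy of `x1` (the column names and noise levels are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for feature i = 1 / (1 - R^2) from regressing feature i on the rest.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
for col, v in vifs.items():
    print(f"{col}: VIF = {v:.1f}")
# x1 and x2 come out far above the 5–10 danger zone; x3 stays near 1.
```

The correlation-matrix check would also flag this pair, but VIF additionally catches the case where a feature is a linear combination of several others, which no single pairwise correlation reveals.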

Flashcards

Question

When should you use one-hot encoding vs label encoding?

Answer

One-hot encoding for nominal categories with low cardinality (< ~20 values) — it is the default for non-ordered categories. Label encoding for tree-based models (XGBoost, Random Forest) where arbitrary integer ordering does not matter because trees split on thresholds.

Key Insight

"Feature selection via regularization" = L1/Lasso (drives weights to zero, producing sparse models). "Reduce overfitting without eliminating features" = L2/Ridge (shrinks all weights but keeps every feature). If you need to explain which features matter, L1 is your tool.
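The sparsity contrast above is easy to see empirically. A minimal sketch on synthetic data where only 3 of 10 features carry signal (the coefficients and alpha values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
# Only the first 3 features actually drive the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: zeroes out the noise features
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights, keeps all

lasso_nonzero = int(np.sum(lasso.coef_ != 0))
ridge_nonzero = int(np.sum(ridge.coef_ != 0))
print(f"Lasso keeps {lasso_nonzero} features; Ridge keeps {ridge_nonzero}")
```

Reading off which coefficients Lasso left non-zero is exactly the "explain which features matter" use case: here it recovers the three true signal features and discards the seven noise ones.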