# Feature Engineering
Feature engineering transforms raw data into features that better represent the underlying patterns for your model. Choosing the right encoding for categorical variables, reducing dimensionality intelligently, and detecting multicollinearity can make the difference between a mediocre model and an excellent one.
## Quick Reference
### Categorical Encoding
| Method | How It Works | When to Use |
|---|---|---|
| One-Hot Encoding | Create binary column for each category. [Red, Blue, Green] → [1,0,0], [0,1,0], [0,0,1] | Low-cardinality nominal data (< ~20 categories). Default for nominal categories |
| Ordinal Encoding | Assign integers based on order. [Low, Med, High] → [1, 2, 3] | Truly ordered categories only. Never for nominal categories |
| Label Encoding | Assign arbitrary integers to categories | Tree-based models (XGBoost, Random Forest) which don't assume order |
| Similarity Encoding | Encode based on string similarity between category values | High-cardinality with typos/misspellings (e.g., "New York", "new york", "NY") |
| Target Encoding | Replace category with mean of target variable for that category | High-cardinality when one-hot creates too many features. Risk of data leakage |
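As a sketch of the first two rows, here is how one-hot and ordinal encoding look with pandas (the column names and category values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],  # nominal: no inherent order
    "size": ["Low", "High", "Med", "Low"],      # ordinal: Low < Med < High
})

# One-hot: one binary column per category, safe for nominal data
onehot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integers that respect the real ordering of the categories
size_order = {"Low": 1, "Med": 2, "High": 3}
df["size_encoded"] = df["size"].map(size_order)

print(onehot.columns.tolist())      # ['color_Blue', 'color_Green', 'color_Red']
print(df["size_encoded"].tolist())  # [1, 3, 2, 1]
```

Applying the ordinal map to `color` instead would invent an order ("Blue < Green < Red") that linear models would treat as meaningful, which is exactly the mistake the table warns against.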
### Dimensionality Reduction
| Method | What It Does | Preserves Info? |
|---|---|---|
| PCA | Linear projection to principal components (max variance directions) | Yes — captures maximum variance in fewer dimensions. Must scale first |
| Remove correlated features | Drop features with high pairwise correlation | No — discards entire features. Use when interpretability is needed |
| L1 / Lasso | Regularization that drives coefficients to zero → automatic feature selection | Partially — keeps most important features. Produces sparse models |
| RFE | Iteratively removes least important features based on model | Partially — maintains interpretability |
| Autoencoder | Neural network compresses data through bottleneck layer | Yes — non-linear compression. Better than PCA for complex relationships |
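A minimal PCA sketch with scikit-learn on synthetic data, where a third feature is nearly a linear combination of the first two (the data and parameters are illustrative); note the scaling step the table calls out:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Third feature is almost x0 + x1, so the data is effectively 2-dimensional
X = np.column_stack([X, X[:, 0] + X[:, 1] + 0.01 * rng.normal(size=200)])

# Scale first: PCA maximizes variance, so an unscaled large-range
# feature would dominate the principal components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X_reduced.shape)                         # (200, 2)
print(pca.explained_variance_ratio_.sum())     # close to 1.0
```

Two components capture nearly all the variance here precisely because the third feature adds almost no independent information, which is the situation PCA is built to exploit.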
### Multicollinearity Detection
| Method | What It Tells You | Threshold |
|---|---|---|
| Correlation Matrix / Heatmap | Pairwise correlation between features | |r| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much variance is inflated by correlation with others | VIF > 5-10 = problematic |
| PCA eigenvalues | Near-zero eigenvalue = features are linearly dependent | Eigenvalue ≈ 0 = collinear features |
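A quick way to compute per-feature VIF is `variance_inflation_factor` from statsmodels. In the illustrative data below, `x3` is nearly collinear with `x1`, so both blow past the 5-10 threshold, while the independent `x2` stays near 1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF for feature i = 1 / (1 - R^2) from regressing feature i on the rest
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # x1 and x3 large, x2 near 1
```

A correlation heatmap would flag the same `x1`/`x3` pair, but VIF also catches a feature that is a combination of several others, which no single pairwise correlation reveals.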
## Flashcards
When should you use one-hot encoding vs label encoding?
One-hot encoding for nominal categories with low cardinality (< ~20 values) — it is the default for non-ordered categories. Label encoding for tree-based models (XGBoost, Random Forest) where arbitrary integer ordering does not matter because trees split on thresholds.
Which regularization performs feature selection: L1 or L2?
"Feature selection via regularization" = L1/Lasso (drives weights to zero, producing sparse models). "Reduce overfitting without eliminating features" = L2/Ridge (shrinks all weights but keeps every feature). If you need to explain which features matter, L1 is your tool.
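To see the contrast concretely, a small scikit-learn sketch on synthetic data where only two of ten features drive the target (the alpha values and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Only the first two features matter; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 zeroes out the irrelevant coefficients; L2 only shrinks them
print((lasso.coef_ == 0).sum())  # most of the 8 noise features are exactly 0
print((ridge.coef_ == 0).sum())  # typically 0: Ridge keeps every feature
```

Inspecting `lasso.coef_` directly tells you which features survived, which is why L1 doubles as an interpretable feature-selection step.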