Data Preparation
Data preparation is often estimated to consume 60-80% of an ML project's time. Getting it right (handling missing values, scaling features appropriately, and augmenting limited datasets) directly determines model quality. Even a well-chosen algorithm will perform poorly on badly prepared data.
Quick Reference
Missing Data Strategies
| Method | How It Works | When to Use |
|---|---|---|
| Mean/Median imputation | Replace missing with column mean or median | Numeric features. Median for skewed data |
| Mode imputation | Replace with most frequent value | Categorical features |
| Forward fill | Use previous time step's value | Time-series data (sensor readings, stock prices) |
| Linear interpolation | Estimate between two known points | Time-series with gradual changes |
| Leave as NaN | Keep missing values encoded as NaN and let the algorithm handle them | Algorithms that handle missing values natively, such as DeepAR and XGBoost |
| Drop rows/columns | Remove rows or columns with missing data | When missing percentage is very high (>50%) or data is abundant |
| Multiple imputation | Create several imputed datasets, fit on each, and pool the results | When the missing-data mechanism matters (research-grade analysis) |
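
The simpler strategies in the table can be sketched with pandas. This is a minimal illustration, not a recommended pipeline: the DataFrame and its column names (`income`, `color`, `sensor`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 400.0],   # skewed numeric column
    "color": ["red", "blue", None, "red", "red"],  # categorical column
    "sensor": [1.0, np.nan, np.nan, 4.0, 5.0],     # time-ordered readings
})

# Median imputation: barely affected by the 400.0 outlier.
income_median = df["income"].fillna(df["income"].median())

# Mean imputation, for comparison: pulled upward by the outlier.
income_mean = df["income"].fillna(df["income"].mean())

# Mode imputation: fill with the most frequent category.
color_mode = df["color"].fillna(df["color"].mode()[0])

# Forward fill: carry the previous time step's value forward.
sensor_ffill = df["sensor"].ffill()

# Linear interpolation: estimate evenly between the known neighbors.
sensor_interp = df["sensor"].interpolate(method="linear")
```

On this toy column the median fill is 33.5 while the mean fill is 124.25, which shows why the median is preferred for skewed data.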
Data Scaling / Normalization
| Method | Formula | When to Use |
|---|---|---|
| Standard Scaler (Z-score) | (x - mean) / std | Data roughly normal. Best for PCA. Less sensitive to outliers than Min-Max, though extreme values still shift the mean and std |
| Min-Max Scaler | (x - min) / (max - min) | Need values in [0, 1]. Neural networks. Sensitive to outliers |
| Log Transform | log(x) | Right-skewed data (mode < median < mean). Makes the distribution closer to normal. Requires x > 0; use log(1 + x) when zeros are present |
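
The three formulas above can be written directly in NumPy. The sketch below uses a made-up feature vector with one large outlier to show how each method responds to it; in practice a library scaler would usually be used instead.

```python
import numpy as np

# Made-up feature values with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standard (z-score) scaling: result has zero mean and unit standard deviation.
z = (x - x.mean()) / x.std()

# Min-max scaling: the minimum maps to 0 and the maximum to 1.
mm = (x - x.min()) / (x.max() - x.min())

# Log transform: compresses the long right tail. Only valid for x > 0;
# np.log1p(x) computes log(1 + x) and is an option when zeros are present.
logged = np.log(x)
```

With min-max scaling the outlier takes the value 1 and squeezes the other four points below 0.04, which is the outlier sensitivity noted in the table.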
Data Augmentation (Images)
| Technique | What It Does |
|---|---|
| Rotation | Rotate image by random degrees |
| Flipping | Horizontal/vertical flip |
| Cropping | Random crops of the image |
| Scaling/Zooming | Resize image randomly |
| Color jittering | Randomly adjust brightness, contrast, saturation |
| Translation | Shift image horizontally/vertically |
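
A few of these techniques can be sketched with plain NumPy array operations. This is an illustration under simplifying assumptions: the image is random noise standing in for real data, the helper names are hypothetical, and rotation is limited to multiples of 90 degrees. Deep learning frameworks ship their own augmentation utilities, which would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real image: height x width x RGB, values in [0, 1].
image = rng.random((32, 32, 3))

def random_flip(img, rng):
    """Flip left-right with probability 0.5."""
    return img[:, ::-1, :] if rng.random() < 0.5 else img

def random_crop(img, size, rng):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))   # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size, :]

def random_brightness(img, max_delta, rng):
    """Shift brightness by a random amount, keeping values in [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(img + delta, 0.0, 1.0)

def random_rotate90(img, rng):
    """Rotate by a random multiple of 90 degrees (a simplification)."""
    return np.rot90(img, k=int(rng.integers(0, 4)), axes=(0, 1))

# Chain several augmentations to produce one new training sample.
augmented = random_brightness(
    random_crop(random_flip(image, rng), 28, rng), 0.2, rng
)
```

Each call draws fresh random parameters, so applying the chain repeatedly to the same image yields different training samples.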
Flashcards
When should you use median imputation instead of mean imputation?
Use median imputation when the data is skewed. The mean is pulled by outliers in skewed distributions, while the median is robust and represents the central tendency more accurately.
Split BEFORE Scale: this is a critical rule. Fit your scaler on the training set ONLY, then apply that same fitted transformation to the validation and test sets. Scaling before splitting lets the test data's statistics influence the scaler, which leaks information into training and gives overly optimistic performance estimates.
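
The rule can be sketched with scikit-learn's `train_test_split` and `StandardScaler`. The data here is synthetic and the split ratio is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y = rng.integers(0, 2, size=200)                     # synthetic labels

# 1. Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Fit the scaler on the training set only.
scaler = StandardScaler().fit(X_train)

# 3. Apply the same learned mean and std to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaled test set will generally not have exactly zero mean or unit variance. That is expected: it was transformed with the training set's statistics, just as unseen production data would be.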