
Data Preparation

Data preparation typically consumes 60-80% of an ML project's time. Getting this right — handling missing values, scaling features appropriately, and augmenting limited datasets — directly determines model quality. A perfectly chosen algorithm will fail on poorly prepared data.

Quick Reference

Missing Data Strategies

| Method | How It Works | When to Use |
|---|---|---|
| Mean/median imputation | Replace missing values with the column mean or median | Numeric features; prefer the median for skewed data |
| Mode imputation | Replace with the most frequent value | Categorical features |
| Forward fill | Carry the previous time step's value forward | Time-series data (sensor readings, stock prices) |
| Linear interpolation | Estimate between two known points | Time-series with gradual changes |
| Keep as NaN | Leave missing values in place and let the algorithm handle them | Algorithms such as DeepAR and XGBoost handle missing values natively |
| Drop rows/columns | Remove rows or columns with missing data | When the missing percentage is very high (>50%) or data is abundant |
| Multiple imputation | Create several imputed datasets and pool the results | When the missing-data mechanism matters (research-grade analysis) |
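
A minimal NumPy sketch of two of these strategies (median imputation and forward fill); in practice you would typically reach for pandas `fillna()`/`interpolate()` or scikit-learn's `SimpleImputer`, but the logic is the same:

```python
import numpy as np

def impute_median(x):
    """Replace NaNs with the median of the observed values (robust to skew)."""
    out = x.copy()
    out[np.isnan(out)] = np.nanmedian(x)
    return out

def forward_fill(x):
    """Carry the last observed value forward (time-series).

    Note: a NaN in the first position has nothing to fill from and stays NaN.
    """
    out = x.copy()
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    return out

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
print(impute_median(x))  # NaNs replaced by 3.0, the median of [1, 3, 5]
print(forward_fill(x))   # each NaN takes the previous value: [1, 1, 3, 3, 5]
```
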

Data Scaling / Normalization

| Method | Formula | When to Use |
|---|---|---|
| Standard scaler (z-score) | (x − mean) / std | Roughly normal data; best for PCA; handles outliers better than min-max |
| Min-max scaler | (x − min) / (max − min) | When values must lie in [0, 1]; neural networks; sensitive to outliers |
| Log transform | log(x) | Right-skewed data (mode < median < mean); makes the distribution more normal |
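
The three formulas are one line each in NumPy. A small sketch with a deliberate outlier shows why min-max is sensitive to it:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # note the outlier at 100

z = (x - x.mean()) / x.std()              # standard (z-score) scaling
mm = (x - x.min()) / (x.max() - x.min())  # min-max scaling into [0, 1]
lg = np.log(x)                            # log transform (requires x > 0)

print(mm)  # the outlier squeezes the first four values near 0
```

After z-score scaling the data has mean 0 and standard deviation 1; after min-max scaling the outlier maps to 1 and compresses everything else toward 0.
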

Data Augmentation (Images)

| Technique | What It Does |
|---|---|
| Rotation | Rotate the image by a random angle |
| Flipping | Flip horizontally or vertically |
| Cropping | Take random crops of the image |
| Scaling/zooming | Resize the image randomly |
| Color jittering | Randomly adjust brightness, contrast, and saturation |
| Translation | Shift the image horizontally or vertically |
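
Several of these techniques are simple array operations. A toy NumPy sketch on a random H × W × C image (real pipelines would use a library such as torchvision or albumentations):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))  # toy 32x32 RGB image with values in [0, 1]

flipped = img[:, ::-1, :]                  # horizontal flip (reverse width axis)
top, left = rng.integers(0, 9, size=2)     # random offsets for a 24x24 crop
cropped = img[top:top + 24, left:left + 24, :]
shifted = np.roll(img, shift=4, axis=1)    # translate 4 px right (wraps around)
brightened = np.clip(img * 1.2, 0.0, 1.0)  # crude brightness jitter, clipped to [0, 1]
```

Each transform yields a new training example with the same label, which is why augmentation effectively multiplies a limited dataset.
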

Flashcards

Question

When should you use median imputation instead of mean imputation?

Answer

Use median imputation when the data is skewed. The mean is pulled by outliers in skewed distributions, while the median is robust and represents the central tendency more accurately.

Key Insight

Split BEFORE Scale — this is a critical rule. Fit your scaler on the training set ONLY, then apply the same transformation to validation and test sets. Scaling before splitting leaks information from test data into training, giving you overly optimistic performance estimates.
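
A minimal NumPy sketch of the rule (scikit-learn's `StandardScaler` enforces the same separation via `fit` on train and `transform` on everything else):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # toy feature matrix

# 1. Split FIRST
X_train, X_test = X[:80], X[80:]

# 2. Fit scaler statistics on the TRAINING set only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# 3. Apply the SAME transformation to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

The scaled training set has mean 0 and std 1 by construction; the scaled test set generally does not, and that small mismatch is expected: it is the sign that no test-set information leaked into the scaling.
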