
Validation Strategies

How you split and validate your data directly impacts whether your performance estimates are trustworthy. A model that looks great on a poorly designed validation split may fail in production. This section covers the main validation strategies and critical rules for special cases.

Validation Approaches​

| Strategy | How It Works | When to Use |
|---|---|---|
| Simple Train/Val/Test Split | Randomly split data (e.g., 70/15/15 or 80/10/10) | Large datasets where a single split provides enough samples in each set |
| k-Fold Cross-Validation | Split into k folds; train on k-1 folds, validate on 1; rotate k times and average results | Small datasets where you need a more reliable performance estimate |
| Stratified k-Fold | Same as k-fold but maintains class proportions in each fold | Imbalanced classification — always use stratified for imbalanced data |
| Time Series Split | Train on past data, validate on future data; never random split | Any time-series data — chronological split only |

Simple Train/Val/Test Split​

The most straightforward approach: randomly shuffle and divide the data.

from sklearn.model_selection import train_test_split

# Two-step split: first train+val vs test, then train vs val
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42  # 0.176 of 0.85 ≈ 0.15
)

Typical ratios: 70/15/15, 80/10/10, or 60/20/20 depending on dataset size.
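The 0.176 in the second split above is not arbitrary: after the test set is removed, only part of the data remains, so the validation fraction of that remainder must be scaled up. A quick sketch of the arithmetic (variable names are illustrative):

```python
# Derive the second-split fraction from the desired overall ratios.
test_frac = 0.15  # share of the full dataset reserved for test
val_frac = 0.15   # share of the full dataset desired for validation

# After removing the test set, (1 - test_frac) of the data remains,
# so validation must take this fraction of the remainder:
val_of_remainder = val_frac / (1 - test_frac)
print(round(val_of_remainder, 3))  # ≈ 0.176
```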

k-Fold Cross-Validation​

When data is limited, k-fold gives a more robust performance estimate by using every sample for both training and validation.

from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Common values: k=5 or k=10. Higher k gives a less biased estimate but higher variance and longer compute time.
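The table above mentions stratified k-fold for imbalanced data; a minimal sketch showing the stratification property on synthetic labels (the ~90/10 imbalance and seeds are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (rng.random(200) < 0.1).astype(int)

# StratifiedKFold keeps the class ratio roughly constant in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # each validation fold's minority share stays close to the overall share
    print(f"fold minority share: {y[val_idx].mean():.3f}")
```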

Critical Rules​

Three Rules You Must Follow

1. Time series = ALWAYS chronological split. Never randomly split time series data. Train on the past, test on the future. Random splitting leaks future information into training.

2. Imbalanced data = ALWAYS stratified split. Use stratify=y in scikit-learn or StratifiedKFold. This preserves class ratios across all folds.

3. Scale AFTER splitting. Fit the scaler on the training set only, then apply the same transformation to validation and test sets. Scaling before splitting causes data leakage.

# Time series: chronological split
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # train_idx always precedes test_idx chronologically

# Imbalanced data: stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Correct scaling order
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_val_scaled = scaler.transform(X_val)          # apply same transform
X_test_scaled = scaler.transform(X_test)        # apply same transform
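When scaling is combined with cross-validation, refitting the scaler inside every fold by hand is error-prone; scikit-learn's Pipeline handles it automatically. A minimal sketch (the synthetic data and model choice are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic binary problem with signal in the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# The pipeline re-fits the scaler on each fold's training portion only,
# so no validation-fold statistics leak into preprocessing.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.3f}")
```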

Flashcards​

Question

When should you use k-fold cross-validation instead of a simple train/test split?

Answer

When your dataset is small and a single split may not be representative. K-fold uses every sample for both training and validation (across different folds), giving a more reliable and less biased performance estimate.

Pro Tip

The test set should be touched only once — at the very end. If you repeatedly evaluate on the test set and adjust your model, you are effectively tuning on test data, which defeats its purpose as an unbiased performance estimate.
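One way to honor this in practice is to do all tuning with cross-validation on the training set and score the test set exactly once at the end. A sketch of that workflow (the synthetic data, model, and parameter grid are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with signal in the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# All tuning happens via cross-validation on the training set...
search = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# ...and the test set is scored exactly once, at the very end.
final_score = search.score(X_test, y_test)
print(f"Test accuracy: {final_score:.3f}")
```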