# Class Imbalance
Class imbalance occurs when one class significantly outnumbers another — for example, 99% legitimate transactions and 1% fraud. Standard models trained on imbalanced data tend to predict the majority class and ignore the minority class entirely. This section covers practical techniques to address this problem.
## Techniques for Handling Imbalanced Data
| Technique | How It Works | Effort Level |
|---|---|---|
| Class Weights / Cost Function | Assign higher penalty for misclassifying the minority class in the loss function | Lowest effort |
| SMOTE | Generate synthetic minority samples by interpolating between existing minority neighbors | Moderate |
| Random Oversampling | Duplicate minority class samples | Low |
| Random Undersampling | Remove majority class samples | Low |
| Stratified Sampling | Maintain class proportions when splitting train/val/test | Essential (always do this) |
| Change Metric | Use AUC, F1, Precision, or Recall instead of Accuracy | Essential (always do this) |
| Collect More Data | Get more minority class samples | Highest effort |
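As a concrete illustration of the "Change Metric" row above, a majority-class baseline looks deceptively strong on an imbalanced split. A minimal sketch, assuming scikit-learn is available; the dataset is synthetic:

```python
# Synthetic 99/1 imbalanced dataset illustrating the accuracy trap
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 990 + [1] * 10)  # 99% negative, 1% positive

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

print(accuracy_score(y, preds))              # 0.99 -- looks great
print(f1_score(y, preds, zero_division=0))   # 0.0  -- catches zero positives
```

The baseline scores 99% accuracy while finding zero fraud cases, which is exactly why F1 or AUC should be the headline metric here.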
## Recommended Escalation Path
Start with the least effort and escalate as needed:
1. Change your metric — switch from accuracy to AUC or F1
2. Apply class weights — the fastest code change
3. Use SMOTE — generates meaningful synthetic samples
4. Collect more minority data — the best solution, but the most expensive
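Steps 1 and 2 together amount to a one-line model change plus a different evaluation call. A minimal sketch assuming scikit-learn; the dataset and the amount of class separation are illustrative:

```python
# Steps 1-2: apply class weights, evaluate with AUC instead of accuracy
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority class
X[y == 1] += 1.0                           # give the minority class some signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' penalizes minority-class mistakes more heavily
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Rank-based metric on predicted probabilities, not accuracy
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC: {auc:.3f}")
```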
## SMOTE vs Simple Duplication
SMOTE and random oversampling are not the same thing. Random oversampling duplicates existing minority samples, which causes the model to memorize those exact points. SMOTE creates new synthetic points by interpolating between neighboring minority samples, producing more diverse training data.
```python
# SMOTE creates synthetic samples by interpolating between minority neighbors
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```

```python
# XGBoost class-weight approach
import xgboost as xgb

# For a dataset with 950 negative and 50 positive samples:
model = xgb.XGBClassifier(
    scale_pos_weight=950 / 50,  # = 19
    eval_metric='auc',          # never use accuracy for imbalanced data
)
```
## Class Weights in Practice
| Framework | Parameter | How to Set |
|---|---|---|
| XGBoost | `scale_pos_weight` | `count(negative) / count(positive)` |
| Scikit-learn | `class_weight='balanced'` | Automatically adjusts weights inversely proportional to class frequency |
| PyTorch | `weight` parameter in loss function | Pass a tensor of class weights to `CrossEntropyLoss` |
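The PyTorch row can be sketched as follows, a minimal example assuming `torch` is installed; the counts (950 negative, 50 positive) mirror the XGBoost example earlier, and the inverse-frequency formula matches scikit-learn's `'balanced'` behavior:

```python
# Pass per-class weights to CrossEntropyLoss so minority-class errors
# contribute more to the loss
import torch
import torch.nn as nn

counts = torch.tensor([950.0, 50.0])
# Inverse-frequency weights: n_samples / (n_classes * count_per_class)
weights = counts.sum() / (len(counts) * counts)  # -> [0.526, 10.0]
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)           # batch of 8, 2 classes
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
print(loss.item())
```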
## Stratified Sampling
Always use stratified sampling when splitting imbalanced data into train/validation/test sets. Without it, some splits may end up with very few (or zero) minority class samples, making evaluation unreliable.
```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified train/test split preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
```
## Flashcards
**Q: Why is accuracy a poor metric for imbalanced datasets?**

A: On a 99/1 split, a model that always predicts the majority class achieves 99% accuracy while catching zero minority cases. Use AUC, F1, Precision, or Recall instead.