
Class Imbalance

Class imbalance occurs when one class significantly outnumbers another — for example, 99% legitimate transactions and 1% fraud. Standard models trained on imbalanced data tend to predict the majority class and ignore the minority class entirely. This section covers practical techniques to address this problem.
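The failure mode is easy to reproduce: a model that always predicts the majority class scores extremely well on accuracy while catching zero minority cases. A minimal sketch with made-up labels (990 legitimate, 10 fraud):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 99% legitimate (0), 1% fraud (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches no fraud
```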

Techniques for Handling Imbalanced Data

| Technique | How It Works | Effort Level |
| --- | --- | --- |
| Class Weights / Cost Function | Assign a higher penalty for misclassifying the minority class in the loss function | Lowest effort |
| SMOTE | Generate synthetic minority samples by interpolating between existing minority neighbors | Moderate |
| Random Oversampling | Duplicate minority class samples | Low |
| Random Undersampling | Remove majority class samples | Low |
| Stratified Sampling | Maintain class proportions when splitting train/val/test | Essential (always do this) |
| Change Metric | Use AUC, F1, Precision, or Recall instead of Accuracy | Essential (always do this) |
| Collect More Data | Get more minority class samples | Highest effort |

Start with the lowest-effort technique and escalate as needed:

  1. Change your metric — switch from accuracy to AUC or F1
  2. Apply class weights — fastest code change
  3. Use SMOTE — generates meaningful synthetic samples
  4. Collect more minority data — best solution but most expensive
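Steps 1 and 2 together are often a one-argument change in scikit-learn. A minimal sketch on a synthetic imbalanced dataset (the dataset and model choice here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical imbalanced dataset: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

# Step 2: class weights, set inversely proportional to class frequency
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)

# Step 1: evaluate with F1 rather than accuracy
print(f1_score(y, clf.predict(X)))
```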

SMOTE vs Simple Duplication

Common Misconception

SMOTE and random oversampling are not the same thing. Random oversampling duplicates existing minority samples, which causes the model to memorize those exact points. SMOTE creates new synthetic points by interpolating between neighboring minority samples, producing more diverse training data.

```python
# SMOTE creates synthetic samples between neighbors
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```

```python
# XGBoost class-weight approach
import xgboost as xgb

# For a dataset with 950 negative and 50 positive samples:
model = xgb.XGBClassifier(
    scale_pos_weight=950 / 50,  # = 19
    eval_metric='auc'  # never use accuracy for imbalanced data
)
```

Class Weights in Practice

| Framework | Parameter | How to Set |
| --- | --- | --- |
| XGBoost | `scale_pos_weight` | count(negative) / count(positive) |
| Scikit-learn | `class_weight='balanced'` | Automatically adjusts weights inversely proportional to class frequency |
| PyTorch | `weight` parameter in the loss function | Pass a tensor of class weights to `CrossEntropyLoss` |
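The PyTorch row amounts to passing a weight tensor when constructing the loss. A minimal sketch, reusing the hypothetical 950/50 counts from the XGBoost example and random tensors in place of a real model:

```python
import torch
import torch.nn as nn

# Hypothetical counts: 950 negative, 50 positive
counts = torch.tensor([950.0, 50.0])

# Weights inversely proportional to class frequency,
# normalized the same way as scikit-learn's 'balanced'
weights = counts.sum() / (len(counts) * counts)  # tensor([0.5263, 10.0])

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)           # fake batch: 8 samples, 2 classes
targets = torch.randint(0, 2, (8,))  # fake labels
loss = criterion(logits, targets)    # minority errors now cost ~19x more
```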

Stratified Sampling

Key Insight

Always use stratified sampling when splitting imbalanced data into train/validation/test sets. Without it, some splits may end up with very few (or zero) minority class samples, making evaluation unreliable.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_fold_train, X_fold_val = X[train_idx], X[val_idx]
```
Flashcards

Question: Why is accuracy a poor metric for imbalanced datasets?

Answer: On a 99/1 split, a model that always predicts the majority class achieves 99% accuracy while catching zero minority cases. Use AUC, F1, Precision, or Recall instead.