Regularization

Regularization is any technique that constrains or penalizes a model to prevent it from memorizing the training data. Without regularization, complex models tend to overfit — performing well on training data but poorly on unseen data.

Regularization Techniques

| Technique | How It Works | Effect | When to Use |
| --- | --- | --- | --- |
| L1 (Lasso) | Adds a \|weights\| penalty to the loss function | Drives some weights to exactly zero — automatic feature selection | When you want to identify which features matter most |
| L2 (Ridge) | Adds a weights² penalty to the loss function | Shrinks all weights toward zero (never exactly zero) | When you want to keep all features but reduce their impact |
| Elastic Net | Combination of L1 + L2 penalties | Feature selection + weight shrinkage | When you are unsure which to pick, or want both effects |
| Dropout | Randomly deactivates neurons during training (e.g., 20-50%) | Forces the network not to rely on specific neurons | Neural networks only — increase the dropout rate to reduce overfitting |
| Early Stopping | Monitor validation loss; stop training when it starts increasing | Prevents training past the optimal point | Any iterative training process |
| Data Augmentation | Create synthetic training examples (rotations, flips, crops) | Increases effective dataset size | Image data with limited training samples |
| Reduce Model Complexity | Fewer layers, fewer neurons, shallower trees | Less capacity to memorize training data | When the model is clearly too complex for the available data |

L1 vs L2: A Closer Look

```python
# L1 Regularization (Lasso) -- "lam" because "lambda" is a reserved word in Python
loss = original_loss + lam * sum(abs(weights))
# Result: some weights become exactly 0 -> sparse model

# L2 Regularization (Ridge)
loss = original_loss + lam * sum(weights ** 2)
# Result: all weights shrink toward 0, but none become exactly 0

# Elastic Net: lam controls overall strength, alpha in [0, 1] mixes the two penalties
loss = original_loss + lam * (alpha * sum(abs(weights)) + (1 - alpha) * sum(weights ** 2))
# Result: combines feature selection (L1) with weight shrinkage (L2)
```
Key Insight

  • L1 (Lasso): "Which features matter?" — produces sparse models with some features completely eliminated.
  • L2 (Ridge): "Keep everything but tone it down" — all features retained with smaller coefficients.
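The sparsity difference is easy to see numerically. The sketch below (a minimal NumPy illustration, with made-up toy data and hyperparameters) fits a linear model where only the first of three features matters. L1 is optimized with proximal gradient descent, whose soft-thresholding step can set weights to exactly zero; L2 uses plain gradient descent, which only shrinks them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only feature 0 actually influences y.
n = 200
X = rng.standard_normal((n, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(n)

lr, lam, steps = 0.1, 0.5, 500

# L1 via proximal gradient descent: the soft-threshold step
# zeroes any weight whose data-driven pull is weaker than the penalty.
w_l1 = np.zeros(3)
for _ in range(steps):
    grad = X.T @ (X @ w_l1 - y) / n
    w_l1 -= lr * grad
    w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr * lam, 0.0)

# L2 via plain gradient descent: the penalty adds 2 * lam * w to the
# gradient, shrinking weights but never pinning them to zero.
w_l2 = np.zeros(3)
for _ in range(steps):
    grad = X.T @ (X @ w_l2 - y) / n + 2 * lam * w_l2
    w_l2 -= lr * grad

print("L1 weights:", w_l1)  # irrelevant features end at exactly 0.0
print("L2 weights:", w_l2)  # irrelevant features small but nonzero
```

Note that both penalties also bias the useful weight below its true value of 3.0; that shrinkage is the price paid for variance reduction.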

Dropout in Practice

Dropout works by randomly setting a fraction of neuron outputs to zero during each training step. This prevents co-adaptation of neurons and acts as an implicit ensemble of many sub-networks.

  • Typical dropout rates: 20-50% of neurons per layer
  • Disabled during inference — all neurons are active at prediction time
  • Outputs are scaled during training to compensate for dropped neurons
Caution

Dropout is applied only during training. If you forget to disable it during inference, predictions will be noisy and degraded. Most frameworks handle this automatically with model.eval() or equivalent.
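The mechanics above can be sketched in a few lines of NumPy. This is "inverted" dropout (the variant most frameworks use): survivors are scaled up by 1/keep during training so the expected activation is unchanged, and the function is a no-op at inference time. The function name and the `training` flag are illustrative, not any particular framework's API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero out `rate` of the activations during
    training, scale survivors by 1/keep so the expected value of each
    activation is unchanged, and pass inputs through at inference."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep  # True = neuron survives this step
    return x * mask / keep

rng = np.random.default_rng(0)
acts = np.ones(100_000)

train_out = dropout(acts, rate=0.3, training=True, rng=rng)   # ~30% zeros
eval_out = dropout(acts, rate=0.3, training=False, rng=rng)   # unchanged
```

Because of the 1/keep scaling, the mean of `train_out` stays near 1.0 even though roughly 30% of its entries are zero.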

Early Stopping

Early stopping monitors validation loss during training and halts when performance on the validation set starts degrading.

| Metric to Watch | Interpretation |
| --- | --- |
| Training loss keeps decreasing, validation loss starts increasing | Overfitting is beginning — stop here |
| Both training and validation loss decreasing | Model is still learning — continue training |
| Training loss flat, validation loss flat | Model has converged — training can stop |
Note

Always monitor validation loss, not training loss. Training loss will continue to decrease as the model memorizes the data — it does not indicate generalization ability.
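A minimal sketch of the stopping rule, with a "patience" parameter (common in practice, e.g. in Keras's EarlyStopping callback) that tolerates a few bad epochs before halting. The function name and the synthetic loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, scanning until
    `patience` consecutive epochs pass without improvement."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Validation loss bottoms out at epoch 3, then climbs: stop and keep epoch 3.
losses = [1.00, 0.80, 0.60, 0.50, 0.55, 0.60, 0.70, 0.80]
print(early_stopping_epoch(losses))  # -> 3
```

Production versions also save a checkpoint at the best epoch so the model's weights can be restored after training halts.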

Flashcards

Question

What is the key difference between L1 and L2 regularization?

Answer

L1 (Lasso) drives some weights to exactly zero, performing automatic feature selection. L2 (Ridge) shrinks all weights toward zero but never eliminates them entirely. L1 = sparse models, L2 = small-weight models.