# Regularization
Regularization is any technique that constrains or penalizes a model to prevent it from memorizing the training data. Without regularization, complex models tend to overfit — performing well on training data but poorly on unseen data.
## Regularization Techniques
| Technique | How It Works | Effect | When to Use |
|---|---|---|---|
| L1 (Lasso) | Adds a sum-of-absolute-weights penalty to the loss function | Drives some weights to exactly zero — automatic feature selection | When you want to identify which features matter most |
| L2 (Ridge) | Adds weights² penalty to loss function | Shrinks all weights toward zero (never exactly zero) | When you want to keep all features but reduce their impact |
| Elastic Net | Combination of L1 + L2 penalties | Feature selection + weight shrinkage | When you are unsure which to pick, or want both effects |
| Dropout | Randomly deactivates neurons during training (e.g., 20-50%) | Forces network to not rely on specific neurons | Neural networks only — increase dropout rate to reduce overfitting |
| Early Stopping | Monitor validation loss; stop training when it starts increasing | Prevents training past the optimal point | Any iterative training process |
| Data Augmentation | Create synthetic training examples (rotations, flips, crops) | Increases effective dataset size | Image data with limited training samples |
| Reduce Model Complexity | Fewer layers, fewer neurons, shallower trees | Less capacity to memorize training data | When model is clearly too complex for the available data |
## L1 vs L2: A Closer Look
```python
# lam: regularization strength ("lambda" is a reserved word in Python)

# L1 Regularization (Lasso)
loss = original_loss + lam * sum(abs(weights))
# Result: some weights become exactly 0 → sparse model

# L2 Regularization (Ridge)
loss = original_loss + lam * sum(weights ** 2)
# Result: all weights shrink toward 0, but none become exactly 0

# Elastic Net (lam sets overall strength, alpha mixes the two penalties)
loss = original_loss + lam * (alpha * sum(abs(weights))
                              + (1 - alpha) * sum(weights ** 2))
# Result: combines feature selection (L1) with weight shrinkage (L2)
```
L1 (Lasso): "Which features matter?" — produces sparse models with some features completely eliminated. L2 (Ridge): "Keep everything but tone it down" — all features retained with smaller coefficients.
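To see why L1 produces exact zeros while L2 only shrinks, here is a minimal NumPy sketch (function names are illustrative, not from any library): one proximal-gradient step for the L1 penalty (soft-thresholding) versus one gradient step for the L2 penalty, applied to the same weight vector.

```python
import numpy as np

def l1_prox_step(w, lr, lam):
    """Soft-thresholding: the proximal operator of the L1 penalty.
    Any weight whose magnitude falls below lr * lam snaps to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

def l2_grad_step(w, lr, lam):
    """Gradient of lam * w**2 is 2 * lam * w: multiplicative shrinkage.
    Weights approach 0 but never reach it exactly."""
    return w * (1 - 2 * lr * lam)

w = np.array([0.05, -0.3, 1.2, -0.02])
w_l1 = l1_prox_step(w, lr=0.1, lam=1.0)  # threshold = lr * lam = 0.1
w_l2 = l2_grad_step(w, lr=0.1, lam=1.0)  # every weight scaled by 0.8

print(w_l1)  # [ 0.  -0.2  1.1  0. ]  -> two weights exactly zero
print(w_l2)  # [ 0.04 -0.24  0.96 -0.016] -> all shrunk, none zero
```

The L1 step zeroes out the two weights smaller than the threshold in a single update, while the L2 step shrinks every weight by the same factor, which is exactly the sparse-vs-small distinction described above.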
## Dropout in Practice
Dropout works by randomly setting a fraction of neuron outputs to zero during each training step. This prevents co-adaptation of neurons and acts as an implicit ensemble of many sub-networks.
- Typical dropout rates: 20-50% of neurons per layer
- Disabled during inference — all neurons are active at prediction time
- Outputs are scaled during training to compensate for dropped neurons
Dropout is applied only during training. If you forget to disable it during inference, predictions will be noisy and degraded. Most frameworks handle this automatically with model.eval() or equivalent.
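The behavior above can be sketched in a few lines of NumPy. This is the "inverted dropout" variant (the scaling-during-training approach the bullets describe), not tied to any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training):
    """Inverted dropout sketch: zero out `rate` of the activations and
    scale the survivors by 1 / (1 - rate) so the expected output is
    unchanged. At inference (training=False) this is a no-op."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones(10_000)
dropped = dropout(a, rate=0.5, training=True)

print(np.mean(dropped == 0))  # roughly half the units are zeroed
print(dropped.mean())         # but the mean stays near 1.0
print(dropout(a, rate=0.5, training=False) is a)  # inference: unchanged
```

The `1 / (1 - rate)` scaling is what lets inference use all neurons with no adjustment, which is why `model.eval()` in a framework simply turns the mask off.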
## Early Stopping
Early stopping monitors validation loss during training and halts when performance on the validation set starts degrading.
| Metric to Watch | Interpretation |
|---|---|
| Training loss keeps decreasing, validation loss starts increasing | Overfitting is beginning — stop here |
| Both training and validation loss decreasing | Model is still learning — continue training |
| Training loss flat, validation loss flat | Model has converged — training can stop |
Always monitor validation loss, not training loss. Training loss will continue to decrease as the model memorizes the data — it does not indicate generalization ability.
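In practice, early stopping usually waits a few epochs ("patience") before halting, to avoid stopping on a noisy blip in validation loss. A minimal framework-free sketch (the function name and patience default are illustrative):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch index of the best validation loss, scanning
    until the loss has failed to improve for `patience` consecutive
    epochs. The caller would restore the checkpoint from that epoch."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break  # patience exhausted: stop training
    return best_epoch

# Validation loss bottoms out at epoch 3, then drifts up: overfitting.
losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.50, 0.55, 0.60]
print(early_stopping(losses))  # → 3
```

Saving a checkpoint whenever the best loss improves, then restoring it at the returned epoch, is the usual way to "stop at the optimal point" rather than at the point where patience ran out.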
## Flashcards
What is the key difference between L1 and L2 regularization?
L1 (Lasso) drives some weights to exactly zero, performing automatic feature selection. L2 (Ridge) shrinks all weights toward zero but never eliminates them entirely. L1 = sparse models, L2 = small-weight models.