# Regularization
Regularization is any technique that constrains or penalizes a model to prevent it from memorizing the training data. Without regularization, complex models tend to overfit — performing well on training data but poorly on unseen data.
## Regularization Techniques
| Technique | How It Works | Effect | When to Use |
|---|---|---|---|
| L1 (Lasso) | Adds a sum-of-absolute-weights penalty to the loss function | Drives some weights to exactly zero — automatic feature selection | When you want to identify which features matter most |
| L2 (Ridge) | Adds weights² penalty to loss function | Shrinks all weights toward zero (never exactly zero) | When you want to keep all features but reduce their impact |
| Elastic Net | Combination of L1 + L2 penalties | Feature selection + weight shrinkage | When you are unsure which to pick, or want both effects |
| Dropout | Randomly deactivates neurons during training (e.g., 20-50%) | Forces network to not rely on specific neurons | Neural networks only — increase dropout rate to reduce overfitting |
| Early Stopping | Monitor validation loss; stop training when it starts increasing | Prevents training past the optimal point | Any iterative training process |
| Data Augmentation | Create synthetic training examples (rotations, flips, crops) | Increases effective dataset size | Image data with limited training samples |
| Reduce Model Complexity | Fewer layers, fewer neurons, shallower trees | Less capacity to memorize training data | When model is clearly too complex for the available data |
## L1 vs L2: A Closer Look
```python
# lam: regularization strength ("lambda" is a reserved word in Python)

# L1 Regularization (Lasso)
loss = original_loss + lam * sum(abs(weights))
# Result: some weights become exactly 0 → sparse model

# L2 Regularization (Ridge)
loss = original_loss + lam * sum(weights ** 2)
# Result: all weights shrink toward 0, but none become exactly 0

# Elastic Net (lam sets overall strength, alpha mixes the two penalties)
loss = original_loss + lam * (alpha * sum(abs(weights))
                              + (1 - alpha) * sum(weights ** 2))
# Result: combines feature selection (L1) with weight shrinkage (L2)
```
L1 (Lasso): "Which features matter?" — produces sparse models with some features completely eliminated. L2 (Ridge): "Keep everything but tone it down" — all features retained with smaller coefficients.
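To see why L1 produces exact zeros while L2 only shrinks, here is a minimal NumPy sketch (function names are illustrative, not from any library): one proximal-gradient step for the L1 penalty (soft-thresholding) versus one gradient step for the L2 penalty, applied to the same weight vector.

```python
import numpy as np

def l1_prox_step(w, lr, lam):
    """Soft-thresholding: the proximal operator of the L1 penalty.
    Any weight whose magnitude falls below lr * lam snaps to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

def l2_grad_step(w, lr, lam):
    """Gradient of lam * w**2 is 2 * lam * w: multiplicative shrinkage.
    Weights approach 0 but never reach it exactly."""
    return w * (1 - 2 * lr * lam)

w = np.array([0.05, -0.3, 1.2, -0.02])
w_l1 = l1_prox_step(w, lr=0.1, lam=1.0)  # threshold = lr * lam = 0.1
w_l2 = l2_grad_step(w, lr=0.1, lam=1.0)  # every weight scaled by 0.8

print(w_l1)  # [ 0.  -0.2  1.1  0. ]  -> two weights exactly zero
print(w_l2)  # [ 0.04 -0.24  0.96 -0.016] -> all shrunk, none zero
```

The L1 step zeroes out the two weights smaller than the threshold in a single update, while the L2 step shrinks every weight by the same factor, which is exactly the sparse-vs-small distinction described above.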
## Dropout in Practice
Dropout works by randomly setting a fraction of neuron outputs to zero during each training step. This prevents co-adaptation of neurons and acts as an implicit ensemble of many sub-networks.
- Typical dropout rates: 20-50% of neurons per layer
- Disabled during inference — all neurons are active at prediction time
- Outputs are scaled during training to compensate for dropped neurons
Dropout is applied only during training. If you forget to disable it during inference, predictions will be noisy and degraded. Most frameworks handle this automatically with model.eval() or equivalent.
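The behavior above can be sketched in a few lines of NumPy. This is the "inverted dropout" variant (the scaling-during-training approach the bullets describe), not tied to any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate, training):
    """Inverted dropout sketch: zero out `rate` of the activations and
    scale the survivors by 1 / (1 - rate) so the expected output is
    unchanged. At inference (training=False) this is a no-op."""
    if not training:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones(10_000)
dropped = dropout(a, rate=0.5, training=True)

print(np.mean(dropped == 0))  # roughly half the units are zeroed
print(dropped.mean())         # but the mean stays near 1.0
print(dropout(a, rate=0.5, training=False) is a)  # inference: unchanged
```

The `1 / (1 - rate)` scaling is what lets inference use all neurons with no adjustment, which is why `model.eval()` in a framework simply turns the mask off.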
## Early Stopping
Early stopping monitors validation loss during training and halts when performance on the validation set starts degrading.
| Metric to Watch | Interpretation |
|---|---|
| Training loss keeps decreasing, validation loss starts increasing | Overfitting is beginning — stop here |
| Both training and validation loss decreasing | Model is still learning — continue training |
| Training loss flat, validation loss flat | Model has converged — training can stop |
Always monitor validation loss, not training loss. Training loss will continue to decrease as the model memorizes the data — it does not indicate generalization ability.
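In practice, early stopping usually waits a few epochs ("patience") before halting, to avoid stopping on a noisy blip in validation loss. A minimal framework-free sketch (the function name and patience default are illustrative):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch index of the best validation loss, scanning
    until the loss has failed to improve for `patience` consecutive
    epochs. The caller would restore the checkpoint from that epoch."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break  # patience exhausted: stop training
    return best_epoch

# Validation loss bottoms out at epoch 3, then drifts up: overfitting.
losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.50, 0.55, 0.60]
print(early_stopping(losses))  # → 3
```

Saving a checkpoint whenever the best loss improves, then restoring it at the returned epoch, is the usual way to "stop at the optimal point" rather than at the point where patience ran out.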
## Flashcards
What is the key difference between L1 and L2 regularization?
L1 (Lasso) drives some weights to exactly zero, performing automatic feature selection. L2 (Ridge) shrinks all weights toward zero but never eliminates them entirely. L1 = sparse models, L2 = small-weight models.