Training Concepts
Training is the process of iteratively adjusting model weights to minimize a loss function. Understanding gradient descent, learning rates, loss functions, and distributed training strategies is essential for building models that converge efficiently and generalize well.
Gradient Descent Variants
| Variant | Data Per Update | Best For | Trade-off |
|---|---|---|---|
| Batch Gradient Descent | Entire dataset | Small datasets | Most stable convergence but slowest |
| Stochastic GD (SGD) | One sample | Very large datasets | Fast but noisy updates |
| Mini-batch GD | Small batch (32, 64, 128, 256) | Most practical scenarios | Best balance of speed and stability |
Mini-batch gradient descent is the default in practice. The batch_size hyperparameter controls how many samples are used per update.
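As a concrete illustration, here is a minimal mini-batch gradient descent loop for linear regression in NumPy. The data, `lr`, `batch_size`, and epoch count are illustrative assumptions, not values from any particular framework:

```python
import numpy as np

# Synthetic linear-regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size, epochs = 0.1, 64, 20   # hypothetical hyperparameters
for _ in range(epochs):
    perm = rng.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # MSE gradient computed on the mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad                              # one update per mini-batch
```

Each pass over the data performs many weight updates (one per batch) rather than a single update on the full dataset, which is where the speed/stability balance comes from.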
Learning Rate Effects
The learning rate controls how large each weight update step is during training.
| Learning Rate | Effect | What You Observe |
|---|---|---|
| Too high | Overshoots minimum, oscillates, diverges | Loss jumps around or increases |
| Too low | Very slow convergence, may get stuck in local minima | Loss decreases extremely slowly |
| Just right | Smooth, steady decrease in loss | Loss curve decreases consistently, then flattens |
"Training loss oscillates wildly" means the learning rate is too high — decrease it. "Training loss barely decreases" means it is too low — increase it.
Loss Functions
Each problem type has a corresponding loss function that measures how far off the model's predictions are.
| Loss Function | Problem Type | What It Measures |
|---|---|---|
| Binary Cross-Entropy (Log Loss) | Binary classification | Penalizes confident wrong predictions heavily |
| Categorical Cross-Entropy | Multi-class classification | Same principle, extended across multiple classes |
| MSE (Mean Squared Error) | Regression | Average squared difference; penalizes large errors more |
| MAE (Mean Absolute Error) | Regression | Average absolute difference; more robust to outliers |
| Hinge Loss | SVM classification | Penalizes misclassifications and low-confidence correct predictions |
```python
import numpy as np

def binary_cross_entropy(y, p):
    # Penalizes confident wrong predictions heavily
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse(y, y_pred):
    # Average squared difference; large errors dominate
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    # Average absolute difference; more robust to outliers
    return np.mean(np.abs(y - y_pred))
```
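To see why cross-entropy "penalizes confident wrong predictions heavily", compare a mildly wrong prediction with a confidently wrong one. The `eps` clipping here is an added safeguard against `log(0)`, not part of the formula itself:

```python
import numpy as np

def bce(y, p, eps=1e-12):
    # Binary cross-entropy with clipping to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0])                 # true class is 1
mild  = bce(y, np.array([0.4]))     # slightly wrong: loss = -ln(0.4) ~= 0.92
wrong = bce(y, np.array([0.01]))    # confidently wrong: loss = -ln(0.01) ~= 4.61
```

The confidently wrong prediction costs roughly five times as much as the mildly wrong one, because the loss grows like −log(p) as the predicted probability of the true class approaches zero.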
Distributed Training
When datasets are too large or models are too big for a single machine, distributed training becomes necessary.
| Strategy | How It Works | When to Use |
|---|---|---|
| Data Parallelism | Same model on each GPU, different data batches. Gradients averaged across GPUs | Large datasets where the model fits in a single GPU's memory |
| Model Parallelism | Model split across multiple GPUs (different layers on different GPUs) | Model too large for single GPU memory (e.g., massive transformers) |
Data parallelism is the more common approach. Frameworks like Horovod simplify the implementation by handling gradient synchronization across GPUs.
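The core of data parallelism can be simulated on a single machine: each "GPU" computes a gradient on its own shard of the data, and an all-reduce step averages the gradients. The worker count and shard layout below are illustrative assumptions, not Horovod API calls:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
y = X @ np.ones(4)
w = rng.normal(size=4)

def mse_grad(Xs, ys, w):
    # MSE gradient on one shard of the data
    return 2 * Xs.T @ (Xs @ w - ys) / len(Xs)

shards = np.array_split(np.arange(len(X)), 4)        # 4 simulated workers
local_grads = [mse_grad(X[s], y[s], w) for s in shards]  # per-"GPU" gradients
avg_grad = np.mean(local_grads, axis=0)              # all-reduce: average

# With equal shard sizes, the averaged gradient matches the full-batch gradient
full_grad = mse_grad(X, y, w)
```

This is why data parallelism preserves the training dynamics of single-machine training: as long as shards are equal-sized, averaging per-worker gradients reproduces the gradient over the combined batch.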
Flashcards
What are the three main variants of gradient descent?
Batch (full dataset per update, stable but slow), Stochastic/SGD (one sample per update, fast but noisy), and Mini-batch (small batch per update, best balance). Mini-batch is the default in practice.
Categorical cross-entropy and binary cross-entropy are not interchangeable. Use binary cross-entropy for two-class problems with a sigmoid output, and categorical cross-entropy for multi-class problems with a softmax output.
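A small numerical check of that relationship: binary cross-entropy over a sigmoid output p is identical to categorical cross-entropy over the two-class distribution [1 − p, p]. The values below are arbitrary:

```python
import numpy as np

p = 0.7   # sigmoid output: predicted probability of class 1
y = 1     # true class

# Binary cross-entropy on the scalar probability
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Categorical cross-entropy on the equivalent two-class distribution
probs = np.array([1 - p, p])     # softmax-style distribution over 2 classes
onehot = np.array([0, 1])        # one-hot encoding of the true class
cce = -np.sum(onehot * np.log(probs))
```

The two losses coincide in the binary case; the practical distinction is the output layer they pair with (single sigmoid unit vs. softmax over all classes).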