Training Concepts

Training is the process of iteratively adjusting model weights to minimize a loss function. Understanding gradient descent, learning rates, loss functions, and distributed training strategies is essential for building models that converge efficiently and generalize well.

Gradient Descent Variants

| Variant | Data Per Update | Best For | Trade-off |
| --- | --- | --- | --- |
| Batch Gradient Descent | Entire dataset | Small datasets | Most stable convergence but slowest |
| Stochastic GD (SGD) | One sample | Very large datasets | Fast but noisy updates |
| Mini-batch GD | Small batch (32, 64, 128, 256) | Most practical scenarios | Best balance of speed and stability |

Mini-batch gradient descent is the default in practice. The `batch_size` hyperparameter controls how many samples contribute to each weight update.
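The loop structure is the same regardless of framework: shuffle, slice out a batch, compute the gradient on that batch, update. A minimal NumPy sketch on linear regression (the data, `batch_size`, and learning rate here are illustrative assumptions, not recommendations):

```python
import numpy as np

# Mini-batch gradient descent on synthetic linear-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
batch_size, lr = 64, 0.1
for epoch in range(20):
    perm = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 / len(idx) * Xb.T @ (Xb @ w - yb)  # MSE gradient on the batch
        w -= lr * grad                              # one update per mini-batch
```

Each epoch performs `len(X) / batch_size` updates instead of one (batch GD) or `len(X)` (SGD), which is where the speed/stability balance comes from.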

Learning Rate Effects

The learning rate controls how large each weight update step is during training.

| Learning Rate | Effect | What You Observe |
| --- | --- | --- |
| Too high | Overshoots the minimum, oscillates, or diverges | Loss jumps around or increases |
| Too low | Very slow convergence; may get stuck in local minima | Loss decreases extremely slowly |
| Just right | Smooth, steady descent | Loss curve decreases consistently, then flattens |
Key Insight

"Training loss oscillates wildly" means the learning rate is too high — decrease it. "Training loss barely decreases" means it is too low — increase it.
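All three regimes can be seen on the simplest possible loss, f(w) = w², whose gradient is 2w. The step sizes below are illustrative, chosen only to make each regime visible:

```python
def descend(lr, steps=20, w=1.0):
    """Run `steps` gradient descent updates on f(w) = w**2."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2*w
    return w

print(descend(1.1))    # too high: |w| grows each step -- divergence
print(descend(0.001))  # too low: w barely moves from its start at 1.0
print(descend(0.1))    # reasonable: w shrinks smoothly toward the minimum at 0
```

Each update multiplies `w` by `(1 - 2 * lr)`, so the iterates diverge whenever that factor has magnitude greater than 1, crawl when it is close to 1, and converge quickly in between.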

Loss Functions

Each problem type has a corresponding loss function that measures how far off the model's predictions are.

| Loss Function | Problem Type | What It Measures |
| --- | --- | --- |
| Binary Cross-Entropy (Log Loss) | Binary classification | Penalizes confident wrong predictions heavily |
| Categorical Cross-Entropy | Multi-class classification | Same principle, extended across multiple classes |
| MSE (Mean Squared Error) | Regression | Average squared difference; penalizes large errors more |
| MAE (Mean Absolute Error) | Regression | Average absolute difference; more robust to outliers |
| Hinge Loss | SVM classification | Penalizes misclassifications and low-confidence correct predictions |
```python
# Loss function implementations
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))
```
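To see why cross-entropy "penalizes confident wrong predictions heavily", evaluate the per-sample binary cross-entropy at a few points (a small numeric check; the probability values are arbitrary):

```python
import numpy as np

def bce_one(y, p):
    """Binary cross-entropy for a single true label y and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce_one(1, 0.9))   # confident and right: small loss (~0.105)
print(bce_one(1, 0.5))   # unsure: moderate loss (~0.693)
print(bce_one(1, 0.01))  # confident and wrong: large loss (~4.605)
```

The loss grows without bound as the predicted probability of the true class approaches zero, which is exactly the pressure that discourages confident mistakes.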

Distributed Training

When datasets are too large or models are too big for a single machine, distributed training becomes necessary.

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Data Parallelism | Same model on each GPU, different data batches; gradients averaged across GPUs | Large datasets where the model fits in a single GPU's memory |
| Model Parallelism | Model split across multiple GPUs (different layers on different GPUs) | Model too large for a single GPU's memory (e.g., massive transformers) |
Note

Data parallelism is the more common approach. Frameworks like Horovod simplify the implementation by handling gradient synchronization across GPUs.
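The core of data parallelism, gradient averaging across shards, can be simulated on a single machine. In this toy sketch each "worker" is just a slice of the data; the `np.mean(grads, axis=0)` line stands in for the all-reduce step a framework would perform across GPUs (data, worker count, and learning rate are illustrative assumptions):

```python
import numpy as np

# Toy data-parallel simulation: every worker holds the same weights,
# computes a gradient on its own shard, then all apply the averaged update.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)                                    # replicated weights
n_workers = 4
for step in range(200):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [2 / len(Xs) * Xs.T @ (Xs @ w - ys)    # local MSE gradient per worker
             for Xs, ys in shards]
    w -= 0.05 * np.mean(grads, axis=0)             # all-reduce: averaged gradient
```

With equal-sized shards, the averaged gradient is mathematically identical to the full-batch gradient, which is why data parallelism changes throughput but not the optimization trajectory.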

Flashcards

Question

What are the three main variants of gradient descent?

Answer

Batch (full dataset per update, stable but slow), Stochastic/SGD (one sample per update, fast but noisy), and Mini-batch (small batch per update, best balance). Mini-batch is the default in practice.

Common Misconception

Categorical cross-entropy and binary cross-entropy are not interchangeable. Use binary cross-entropy for two-class problems with a sigmoid output, and categorical cross-entropy for multi-class problems with a softmax output.
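The difference shows up concretely in the shapes each loss expects: binary cross-entropy consumes one sigmoid probability per sample, while categorical cross-entropy consumes a full softmax vector per sample against one-hot labels. A small NumPy sketch with illustrative values:

```python
import numpy as np

# Binary case: one sigmoid output per sample (probability of class 1).
y_true = np.array([1, 0, 1])
p_sigmoid = np.array([0.9, 0.2, 0.7])                    # shape (3,)
bce = -np.mean(y_true * np.log(p_sigmoid)
               + (1 - y_true) * np.log(1 - p_sigmoid))

# Multi-class case: one softmax vector per sample, one-hot labels.
Y_onehot = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])   # shape (3, 3)
P_softmax = np.array([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1],
                      [0.3, 0.5, 0.2]])                  # rows sum to 1
cce = -np.mean(np.sum(Y_onehot * np.log(P_softmax), axis=1))
```

Feeding a single sigmoid output into a categorical cross-entropy (or a softmax vector into a binary one) fails on shape alone, which makes the pairing rule easy to enforce in practice.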