Training Concepts

Training is the process of iteratively adjusting model weights to minimize a loss function. Understanding gradient descent, learning rates, loss functions, and distributed training strategies is essential for building models that converge efficiently and generalize well.

Gradient Descent Variants

| Variant | Data Per Update | Best For | Trade-off |
| --- | --- | --- | --- |
| Batch Gradient Descent | Entire dataset | Small datasets | Most stable convergence but slowest |
| Stochastic GD (SGD) | One sample | Very large datasets | Fast but noisy updates |
| Mini-batch GD | Small batch (32, 64, 128, 256) | Most practical scenarios | Best balance of speed and stability |

Mini-batch gradient descent is the default in practice. The `batch_size` hyperparameter controls how many samples contribute to each weight update.
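The loop structure is the same regardless of framework: shuffle, slice out a batch, compute the gradient on that batch, update. A minimal NumPy sketch on linear regression (the data, `batch_size`, and learning rate here are illustrative assumptions, not recommendations):

```python
import numpy as np

# Mini-batch gradient descent on synthetic linear-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
batch_size, lr = 64, 0.1
for epoch in range(20):
    perm = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 / len(idx) * Xb.T @ (Xb @ w - yb)  # MSE gradient on the batch
        w -= lr * grad                              # one update per mini-batch
```

Each epoch performs `len(X) / batch_size` updates instead of one (batch GD) or `len(X)` (SGD), which is where the speed/stability balance comes from.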

Learning Rate Effects

The learning rate controls how large each weight update step is during training.

| Learning Rate | Effect | What You Observe |
| --- | --- | --- |
| Too high | Overshoots the minimum, oscillates, or diverges | Loss jumps around or increases |
| Too low | Very slow convergence; may get stuck in local minima | Loss decreases extremely slowly |
| Just right | Smooth, steady descent | Loss curve decreases consistently, then flattens |
Key Insight

"Training loss oscillates wildly" means the learning rate is too high — decrease it. "Training loss barely decreases" means it is too low — increase it.
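All three regimes can be seen on the simplest possible loss, f(w) = w², whose gradient is 2w. The step sizes below are illustrative, chosen only to make each regime visible:

```python
def descend(lr, steps=20, w=1.0):
    """Run `steps` gradient descent updates on f(w) = w**2."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2*w
    return w

print(descend(1.1))    # too high: |w| grows each step -- divergence
print(descend(0.001))  # too low: w barely moves from its start at 1.0
print(descend(0.1))    # reasonable: w shrinks smoothly toward the minimum at 0
```

Each update multiplies `w` by `(1 - 2 * lr)`, so the iterates diverge whenever that factor has magnitude greater than 1, crawl when it is close to 1, and converge quickly in between.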

Loss Functions

Each problem type has a corresponding loss function that measures how far off the model's predictions are.

| Loss Function | Problem Type | What It Measures |
| --- | --- | --- |
| Binary Cross-Entropy (Log Loss) | Binary classification | Penalizes confident wrong predictions heavily |
| Categorical Cross-Entropy | Multi-class classification | Same principle, extended across multiple classes |
| MSE (Mean Squared Error) | Regression | Average squared difference; penalizes large errors more |
| MAE (Mean Absolute Error) | Regression | Average absolute difference; more robust to outliers |
| Hinge Loss | SVM classification | Penalizes misclassifications and low-confidence correct predictions |
```python
# Loss function implementations
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))
```
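To see why cross-entropy "penalizes confident wrong predictions heavily", evaluate the per-sample binary cross-entropy at a few points (a small numeric check; the probability values are arbitrary):

```python
import numpy as np

def bce_one(y, p):
    """Binary cross-entropy for a single true label y and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce_one(1, 0.9))   # confident and right: small loss (~0.105)
print(bce_one(1, 0.5))   # unsure: moderate loss (~0.693)
print(bce_one(1, 0.01))  # confident and wrong: large loss (~4.605)
```

The loss grows without bound as the predicted probability of the true class approaches zero, which is exactly the pressure that discourages confident mistakes.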

Distributed Training

When datasets are too large or models are too big for a single machine, distributed training becomes necessary.

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Data Parallelism | Same model on each GPU, different data batches; gradients averaged across GPUs | Large datasets where the model fits in a single GPU's memory |
| Model Parallelism | Model split across multiple GPUs (different layers on different GPUs) | Model too large for a single GPU's memory (e.g., massive transformers) |
Note

Data parallelism is the more common approach. Frameworks like Horovod simplify the implementation by handling gradient synchronization across GPUs.
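The core of data parallelism, gradient averaging across shards, can be simulated on a single machine. In this toy sketch each "worker" is just a slice of the data; the `np.mean(grads, axis=0)` line stands in for the all-reduce step a framework would perform across GPUs (data, worker count, and learning rate are illustrative assumptions):

```python
import numpy as np

# Toy data-parallel simulation: every worker holds the same weights,
# computes a gradient on its own shard, then all apply the averaged update.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)                                    # replicated weights
n_workers = 4
for step in range(200):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [2 / len(Xs) * Xs.T @ (Xs @ w - ys)    # local MSE gradient per worker
             for Xs, ys in shards]
    w -= 0.05 * np.mean(grads, axis=0)             # all-reduce: averaged gradient
```

With equal-sized shards, the averaged gradient is mathematically identical to the full-batch gradient, which is why data parallelism changes throughput but not the optimization trajectory.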

Flashcards

Question

What are the three main variants of gradient descent?

Answer

Batch (full dataset per update, stable but slow), Stochastic/SGD (one sample per update, fast but noisy), and Mini-batch (small batch per update, best balance). Mini-batch is the default in practice.

Common Misconception

Categorical cross-entropy and binary cross-entropy are not interchangeable. Use binary cross-entropy for two-class problems with a sigmoid output, and categorical cross-entropy for multi-class problems with a softmax output.
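The difference shows up concretely in the shapes each loss expects: binary cross-entropy consumes one sigmoid probability per sample, while categorical cross-entropy consumes a full softmax vector per sample against one-hot labels. A small NumPy sketch with illustrative values:

```python
import numpy as np

# Binary case: one sigmoid output per sample (probability of class 1).
y_true = np.array([1, 0, 1])
p_sigmoid = np.array([0.9, 0.2, 0.7])                    # shape (3,)
bce = -np.mean(y_true * np.log(p_sigmoid)
               + (1 - y_true) * np.log(1 - p_sigmoid))

# Multi-class case: one softmax vector per sample, one-hot labels.
Y_onehot = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])   # shape (3, 3)
P_softmax = np.array([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1],
                      [0.3, 0.5, 0.2]])                  # rows sum to 1
cce = -np.mean(np.sum(Y_onehot * np.log(P_softmax), axis=1))
```

Feeding a single sigmoid output into a categorical cross-entropy (or a softmax vector into a binary one) fails on shape alone, which makes the pairing rule easy to enforce in practice.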