
Hyperparameter Tuning

Hyperparameters are the configuration knobs you set before training begins — they are not learned from data. Choosing the right values can dramatically improve model performance, and knowing which strategy to use for searching the hyperparameter space saves both time and compute.

Tuning Strategies

| Strategy | How It Works | Best When | Trade-off |
| --- | --- | --- | --- |
| Grid Search | Tries every combination of the specified values | Small parameter space, few hyperparameters | Exhaustive but slow: O(n^k) combinations for k hyperparameters with n values each |
| Random Search | Samples random combinations from the parameter ranges | Large parameter space, good starting point | Often finds good solutions faster than grid search |
| Bayesian Optimization | Uses previous results to intelligently select the next combinations | Expensive evaluations, limited compute budget | Most efficient with few trials; learns which regions are promising |
| Hyperband | Early-stops poorly performing configurations and allocates more resources to promising ones | Many parallel jobs, fast results wanted | Fastest strategy; stops bad configs early |
Key Insight

Bayesian optimization is the most sample-efficient strategy because it learns from past trials. Hyperband is the fastest when you can run many parallel jobs because it quickly discards unpromising configurations.
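To make the cost difference concrete, the sketch below enumerates a small grid and samples the same space randomly. The search space and its values are illustrative, not from any particular library:

```python
import random
from itertools import product

# Hypothetical search space: two XGBoost-style hyperparameters.
search_space = {
    "max_depth": list(range(3, 11)),   # 8 candidate depths
    "eta": [0.01, 0.03, 0.1, 0.3],     # 4 candidate learning rates
}

def grid_search(search_space):
    """Enumerate every combination: O(n^k) configurations."""
    names = list(search_space)
    return [dict(zip(names, combo)) for combo in product(*search_space.values())]

def random_search(search_space, n_trials, seed=0):
    """Sample n_trials random configurations from the space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(n_trials)
    ]

grid = grid_search(search_space)              # 8 * 4 = 32 configurations
trials = random_search(search_space, n_trials=5)  # only 5 training runs
```

With just two hyperparameters the grid already requires 32 training runs; random search lets you cap the budget at a fixed number of trials regardless of how many hyperparameters you add.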

Key Hyperparameters by Algorithm

XGBoost

| Hyperparameter | What It Controls | Tuning Guidance |
| --- | --- | --- |
| `max_depth` | Maximum tree depth | Higher = more complex, higher overfitting risk. Start with 3-10 |
| `eta` (`learning_rate`) | Step size per boosting round | Lower = more rounds needed but better generalization. Range: 0.01-0.3 |
| `num_round` | Number of boosting rounds | More rounds with a lower `eta` generally performs better (use early stopping) |
| `min_child_weight` | Minimum sum of instance weight in a child node | Higher = more conservative, prevents overfitting |
| `scale_pos_weight` | Balance of positive/negative classes | Set to count(negative) / count(positive) for imbalanced data |
| `alpha` (L1), `lambda` (L2) | Regularization strength | Higher = more regularization = simpler model |
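As a sketch, here is how these parameters might be assembled for an imbalanced binary classification problem. The label counts and parameter values are hypothetical; only the `scale_pos_weight` formula comes from the guidance above:

```python
# Hypothetical imbalanced dataset: 900 negatives, 100 positives.
labels = [0] * 900 + [1] * 100

n_neg = labels.count(0)
n_pos = labels.count(1)

params = {
    "max_depth": 6,                     # moderate tree complexity
    "eta": 0.1,                         # lower eta needs more rounds
    "min_child_weight": 5,              # more conservative splits
    "alpha": 0.1,                       # L1 regularization
    "lambda": 1.0,                      # L2 regularization
    "scale_pos_weight": n_neg / n_pos,  # count(negative) / count(positive) = 9.0
    "objective": "binary:logistic",
}
```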

Neural Networks

| Hyperparameter | What It Controls | Tuning Guidance |
| --- | --- | --- |
| `learning_rate` | Step size for gradient descent | Start with 0.001; use logarithmic scaling for the search range |
| `batch_size` | Samples per gradient update | Larger = faster training, less gradient noise. Smaller = better generalization |
| `epochs` | Full passes through the dataset | Use early stopping to prevent overfitting |
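Early stopping amounts to watching the validation loss and halting once it stops improving. A minimal, framework-agnostic sketch (the loss values are synthetic, standing in for a real training loop):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss fails to improve for `patience` epochs."""
    best, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

# Synthetic per-epoch validation losses: improvement stalls after epoch 2.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
best_epoch, best_loss = train_with_early_stopping(losses)
# best_loss == 0.6, reached at epoch 2; training halts after 3 bad epochs
```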

K-Means

| Hyperparameter | What It Controls | Tuning Guidance |
| --- | --- | --- |
| `k` | Number of clusters | Use the elbow method: plot k vs. SSE and pick the "elbow" point |
| `init_method` | How initial centroids are placed | k-means++ initialization is better than random |
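One way to automate the elbow pick: given SSE for each candidate k, choose the k where the improvement drops off most sharply (the largest second difference). The SSE values below are synthetic, standing in for the inertia you would measure by running k-means at each k:

```python
def find_elbow(ks, sses):
    """Pick the k where the SSE improvement flattens most abruptly."""
    # Second difference: improvement lost between consecutive steps in k.
    curvature = [
        (sses[i - 1] - sses[i]) - (sses[i] - sses[i + 1])
        for i in range(1, len(sses) - 1)
    ]
    return ks[1 + curvature.index(max(curvature))]

ks = [1, 2, 3, 4, 5, 6]
sses = [1000, 700, 300, 280, 270, 265]  # synthetic SSE per k

k = find_elbow(ks, sses)  # SSE flattens sharply after k=3
```

This is a heuristic; eyeballing the plot remains common, and noisy SSE curves may not have a single clean elbow.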

Hyperparameter Scaling Types

When defining search ranges, the scaling type determines how values are sampled.

| Scaling | Use When | Examples |
| --- | --- | --- |
| Linear | The parameter's effect is roughly proportional to its value | `num_round`, `max_depth`, `batch_size` |
| Logarithmic | Searching across orders of magnitude | `learning_rate` (0.0001 to 0.1), regularization strength |
| Reverse Logarithmic | Values very close to 1 matter | Momentum (0.9 to 0.999) |
```python
# Example: defining hyperparameter ranges
hyperparameter_ranges = {
    "learning_rate": (0.0001, 0.1),  # logarithmic scale
    "max_depth": (3, 10),            # linear (integer) scale
    "momentum": (0.9, 0.999),        # reverse logarithmic scale
}
```
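To show what each scaling type actually does, here is a sketch of how a tuner might draw one value from each range above. The sampler functions are illustrative, not from any particular library:

```python
import math
import random

rng = random.Random(0)  # seeded for repeatability

def sample_linear(lo, hi):
    """Uniform over the raw range."""
    return rng.uniform(lo, hi)

def sample_log(lo, hi):
    """Uniform in log-space: each order of magnitude is equally likely."""
    return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

def sample_reverse_log(lo, hi):
    """Uniform in log-space of (1 - x): resolves values packed near 1.0."""
    return 1 - 10 ** rng.uniform(math.log10(1 - hi), math.log10(1 - lo))

lr = sample_log(0.0001, 0.1)
depth = round(sample_linear(3, 10))
momentum = sample_reverse_log(0.9, 0.999)
```

With linear sampling, a `learning_rate` range of 0.0001 to 0.1 would almost never produce values near 0.0001; log-space sampling gives 0.0001-0.001 the same probability mass as 0.01-0.1.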

Flashcards

Question

When should you use Bayesian optimization over grid search?

Answer

When each training run is expensive and your compute budget is limited. Bayesian optimization learns from previous trials and intelligently selects the next set of hyperparameters to try, making it the most sample-efficient strategy.

note

When tuning momentum (values like 0.9, 0.99, 0.999), use reverse logarithmic scaling because the meaningful differences are concentrated very close to 1.0.