Hyperparameter Tuning
Hyperparameters are the configuration knobs you set before training begins — they are not learned from data. Choosing the right values can dramatically improve model performance, and knowing which strategy to use for searching the hyperparameter space saves both time and compute.
Tuning Strategies
| Strategy | How It Works | Best When | Trade-off |
|---|---|---|---|
| Grid Search | Try every combination of specified values | Small parameter space, few hyperparameters | Exhaustive but slow — O(n^k) combinations |
| Random Search | Sample random combinations from parameter ranges | Large parameter space, good starting point | Often finds good solutions faster than grid search |
| Bayesian Optimization | Uses previous results to intelligently select next combinations | Expensive evaluations, limited compute budget | Most efficient with fewer trials — learns which regions are promising |
| Hyperband | Early-stops poorly performing configurations, allocates more resources to promising ones | Many parallel jobs, want fast results | Fast with large budgets, but may discard configurations that start slowly and improve later |
Bayesian optimization is the most sample-efficient strategy — it learns from past trials. Hyperband is the fastest when you can run many parallel jobs because it quickly discards unpromising configurations.
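To make the grid-vs-random contrast concrete, here is a minimal sketch in plain Python (the search space and parameter names are made-up examples) showing how each strategy generates candidate configurations:

```python
import itertools
import random

# Hypothetical search space for two hyperparameters
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 6, 10],
}

# Grid search: every combination -- 3 x 3 = 9 trials, O(n^k) growth
grid_trials = [dict(zip(space, combo))
               for combo in itertools.product(*space.values())]

# Random search: sample a fixed budget of combinations, here 4 trials
random.seed(0)
random_trials = [{name: random.choice(values) for name, values in space.items()}
                 for _ in range(4)]

# Grid cost is fixed by the space; random cost is whatever budget you choose
```

Note that adding one more value to each list grows the grid from 9 to 16 trials, while the random-search budget stays at whatever you set it to.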
Key Hyperparameters by Algorithm
XGBoost
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| max_depth | Maximum tree depth | Higher = more complex, risk overfitting. Start with 3-10 |
| eta (learning_rate) | Step size per boosting round | Lower = more rounds needed but better generalization. Range: 0.01-0.3 |
| num_round | Number of boosting rounds | More rounds + lower eta = better (use early stopping) |
| min_child_weight | Min sum of instance weight in a child node | Higher = more conservative, prevents overfitting |
| scale_pos_weight | Balance positive/negative classes | Set to count(negative) / count(positive) for imbalanced data |
| alpha (L1), lambda (L2) | Regularization strength | Higher = more regularization = simpler model |
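Pulling the table together, here is a sketch of assembling an XGBoost parameter dict for an imbalanced binary-classification problem. The parameter names match the native xgboost API; the class counts are made-up examples:

```python
# Made-up class counts for an imbalanced dataset
n_negative, n_positive = 9_000, 1_000

params = {
    "max_depth": 6,                 # start in the 3-10 range
    "eta": 0.1,                     # lower eta -> more rounds, better generalization
    "min_child_weight": 5,          # higher = more conservative splits
    "alpha": 0.0,                   # L1 regularization strength
    "lambda": 1.0,                  # L2 regularization strength
    "scale_pos_weight": n_negative / n_positive,  # count(neg) / count(pos) = 9.0
    "objective": "binary:logistic",
}
```

In the real library this dict would be passed to `xgb.train(params, dtrain, num_boost_round=..., early_stopping_rounds=...)`, letting early stopping pick the effective number of rounds.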
Neural Networks
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| learning_rate | Step size for gradient descent | Start with 0.001. Use logarithmic scaling for search range |
| batch_size | Samples per gradient update | Larger = faster training, less noise. Smaller = better generalization |
| epochs | Full passes through dataset | Use early stopping to prevent overfitting |
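Early stopping is simple enough to sketch directly. The function below is a minimal pure-Python illustration of the idea (not a framework API): stop once validation loss has failed to improve for `patience` consecutive epochs, and keep the epoch with the best loss. The loss curve is synthetic:

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, last_epoch_trained) given per-epoch validation losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, last_epoch

# Synthetic loss curve: improves until epoch 2, then starts overfitting
losses = [1.00, 0.80, 0.70, 0.72, 0.74, 0.75, 0.76, 0.77]
print(early_stopping(losses))  # (2, 5): best at epoch 2, stopped after epoch 5
```

Training halts at epoch 5 even though 8 epochs were budgeted, saving the wasted passes.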
K-Means
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| k | Number of clusters | Use the elbow method: plot k vs SSE, pick the "elbow" point |
| init_method | How initial centroids are placed | k-means++ is better than random initialization |
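The elbow method can be sketched end to end on toy 1-D data. The code below is a minimal pure-Python Lloyd's k-means with a deterministic init (in practice you would use a library such as scikit-learn, and k-means++ initialization); the point is that SSE drops sharply up to the true cluster count and only marginally after:

```python
def kmeans_sse(data, k, iters=20):
    """Run 1-D Lloyd's k-means with a deterministic init and return the SSE."""
    centroids = sorted(data)[:k]  # simple deterministic init (k-means++ is better)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        # Update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((x - centroids[i]) ** 2 for i, c in enumerate(clusters) for x in c)

# Two obvious clusters around 1.0 and 5.0
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3)}
# SSE collapses from k=1 to k=2, then barely improves: the elbow is at k=2
```

Plotting `k` against `sse[k]` would show the characteristic bend at k=2, the true number of clusters in this toy data.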
Hyperparameter Scaling Types
When defining search ranges, the scaling type determines how values are sampled.
| Scaling | Use When | Examples |
|---|---|---|
| Linear | Parameter effect is proportional | num_round, max_depth, batch_size |
| Logarithmic | Search across orders of magnitude | learning_rate (0.0001 to 0.1), regularization strength |
| Reverse Logarithmic | Values very close to 1 matter | Momentum (0.9 to 0.999) |
```python
# Example: defining hyperparameter ranges
hyperparameter_ranges = {
    "learning_rate": (0.0001, 0.1),  # logarithmic scale
    "max_depth": (3, 10),            # linear (integer) scale
    "momentum": (0.9, 0.999),        # reverse logarithmic scale
}
```
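To see what each scaling type actually does to sampling, here is a sketch that draws one value from each of the ranges above (plain Python; the helper names are made up for illustration):

```python
import math
import random

random.seed(42)

def sample_log(lo, hi):
    """Sample uniformly in log10 space -- every order of magnitude equally likely."""
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def sample_reverse_log(lo, hi):
    """Sample 1 - x on a log scale, so resolution concentrates near 1.0."""
    return 1 - 10 ** random.uniform(math.log10(1 - hi), math.log10(1 - lo))

learning_rate = sample_log(0.0001, 0.1)    # logarithmic: 1e-4 to 1e-1
max_depth = random.randint(3, 10)          # linear (integer)
momentum = sample_reverse_log(0.9, 0.999)  # reverse logarithmic
```

With linear sampling, half of all learning-rate draws would land above 0.05, wasting most of the budget on one order of magnitude; log sampling spreads the budget evenly across 1e-4, 1e-3, 1e-2, and 1e-1.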
Flashcards
When should you use Bayesian optimization over grid search?
When each training run is expensive and your compute budget is limited. Bayesian optimization learns from previous trials and intelligently selects the next set of hyperparameters to try, making it the most sample-efficient strategy.
When should you use reverse logarithmic scaling?
When tuning values very close to 1, such as momentum (0.9 to 0.999), because the meaningful differences are concentrated very close to 1.0.