Hyperparameter Tuning
Hyperparameters are the configuration knobs you set before training begins — they are not learned from data. Choosing the right values can dramatically improve model performance, and knowing which strategy to use for searching the hyperparameter space saves both time and compute.
Tuning Strategies
| Strategy | How It Works | Best When | Trade-off |
|---|---|---|---|
| Grid Search | Try every combination of specified values | Small parameter space, few hyperparameters | Exhaustive but slow — O(n^k) combinations |
| Random Search | Sample random combinations from parameter ranges | Large parameter space, good starting point | Often finds good solutions faster than grid search |
| Bayesian Optimization | Uses previous results to intelligently select next combinations | Expensive evaluations, limited compute budget | Most efficient with fewer trials — learns which regions are promising |
| Hyperband | Early-stops poorly performing configurations, allocates more resources to promising ones | Many parallel jobs, want fast results | Fast with large budgets, but may discard configurations that start slowly and improve later |
Bayesian optimization is the most sample-efficient strategy — it learns from past trials. Hyperband is the fastest when you can run many parallel jobs because it quickly discards unpromising configurations.
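To make the grid-vs-random contrast concrete, here is a minimal sketch in plain Python (the search space and parameter names are made-up examples) showing how each strategy generates candidate configurations:

```python
import itertools
import random

# Hypothetical search space for two hyperparameters
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 6, 10],
}

# Grid search: every combination -- 3 x 3 = 9 trials, O(n^k) growth
grid_trials = [dict(zip(space, combo))
               for combo in itertools.product(*space.values())]

# Random search: sample a fixed budget of combinations, here 4 trials
random.seed(0)
random_trials = [{name: random.choice(values) for name, values in space.items()}
                 for _ in range(4)]

# Grid cost is fixed by the space; random cost is whatever budget you choose
```

Note that adding one more value to each list grows the grid from 9 to 16 trials, while the random-search budget stays at whatever you set it to.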
Key Hyperparameters by Algorithm
XGBoost
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| max_depth | Maximum tree depth | Higher = more complex, risk overfitting. Start with 3-10 |
| eta (learning_rate) | Step size per boosting round | Lower = more rounds needed but better generalization. Range: 0.01-0.3 |
| num_round | Number of boosting rounds | More rounds + lower eta = better (use early stopping) |
| min_child_weight | Min sum of instance weight in a child node | Higher = more conservative, prevents overfitting |
| scale_pos_weight | Balance positive/negative classes | Set to count(negative) / count(positive) for imbalanced data |
| alpha (L1), lambda (L2) | Regularization strength | Higher = more regularization = simpler model |
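Pulling the table together, here is a sketch of assembling an XGBoost parameter dict for an imbalanced binary-classification problem. The parameter names match the native xgboost API; the class counts are made-up examples:

```python
# Made-up class counts for an imbalanced dataset
n_negative, n_positive = 9_000, 1_000

params = {
    "max_depth": 6,                 # start in the 3-10 range
    "eta": 0.1,                     # lower eta -> more rounds, better generalization
    "min_child_weight": 5,          # higher = more conservative splits
    "alpha": 0.0,                   # L1 regularization strength
    "lambda": 1.0,                  # L2 regularization strength
    "scale_pos_weight": n_negative / n_positive,  # count(neg) / count(pos) = 9.0
    "objective": "binary:logistic",
}
```

In the real library this dict would be passed to `xgb.train(params, dtrain, num_boost_round=..., early_stopping_rounds=...)`, letting early stopping pick the effective number of rounds.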
Neural Networks
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| learning_rate | Step size for gradient descent | Start with 0.001. Use logarithmic scaling for search range |
| batch_size | Samples per gradient update | Larger = faster training, less noise. Smaller = better generalization |
| epochs | Full passes through dataset | Use early stopping to prevent overfitting |
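Early stopping is simple enough to sketch directly. The function below is a minimal pure-Python illustration of the idea (not a framework API): stop once validation loss has failed to improve for `patience` consecutive epochs, and keep the epoch with the best loss. The loss curve is synthetic:

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, last_epoch_trained) given per-epoch validation losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, last_epoch

# Synthetic loss curve: improves until epoch 2, then starts overfitting
losses = [1.00, 0.80, 0.70, 0.72, 0.74, 0.75, 0.76, 0.77]
print(early_stopping(losses))  # (2, 5): best at epoch 2, stopped after epoch 5
```

Training halts at epoch 5 even though 8 epochs were budgeted, saving the wasted passes.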
K-Means
| Hyperparameter | What It Controls | Tuning Guidance |
|---|---|---|
| k | Number of clusters | Use the elbow method: plot k vs SSE, pick the "elbow" point |
| init_method | How initial centroids are placed | k-means++ is better than random initialization |
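The elbow method can be sketched end to end on toy 1-D data. The code below is a minimal pure-Python Lloyd's k-means with a deterministic init (in practice you would use a library such as scikit-learn, and k-means++ initialization); the point is that SSE drops sharply up to the true cluster count and only marginally after:

```python
def kmeans_sse(data, k, iters=20):
    """Run 1-D Lloyd's k-means with a deterministic init and return the SSE."""
    centroids = sorted(data)[:k]  # simple deterministic init (k-means++ is better)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: (x - centroids[i]) ** 2)
            clusters[nearest].append(x)
        # Update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((x - centroids[i]) ** 2 for i, c in enumerate(clusters) for x in c)

# Two obvious clusters around 1.0 and 5.0
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3)}
# SSE collapses from k=1 to k=2, then barely improves: the elbow is at k=2
```

Plotting `k` against `sse[k]` would show the characteristic bend at k=2, the true number of clusters in this toy data.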
Hyperparameter Scaling Types
When defining search ranges, the scaling type determines how values are sampled.
| Scaling | Use When | Examples |
|---|---|---|
| Linear | Parameter effect is proportional | num_round, max_depth, batch_size |
| Logarithmic | Search across orders of magnitude | learning_rate (0.0001 to 0.1), regularization strength |
| Reverse Logarithmic | Values very close to 1 matter | Momentum (0.9 to 0.999) |
```python
# Example: defining hyperparameter ranges
hyperparameter_ranges = {
    "learning_rate": (0.0001, 0.1),  # logarithmic scale
    "max_depth": (3, 10),            # linear (integer) scale
    "momentum": (0.9, 0.999),        # reverse logarithmic scale
}
```
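To see what each scaling type actually does to sampling, here is a sketch that draws one value from each of the ranges above (plain Python; the helper names are made up for illustration):

```python
import math
import random

random.seed(42)

def sample_log(lo, hi):
    """Sample uniformly in log10 space -- every order of magnitude equally likely."""
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def sample_reverse_log(lo, hi):
    """Sample 1 - x on a log scale, so resolution concentrates near 1.0."""
    return 1 - 10 ** random.uniform(math.log10(1 - hi), math.log10(1 - lo))

learning_rate = sample_log(0.0001, 0.1)    # logarithmic: 1e-4 to 1e-1
max_depth = random.randint(3, 10)          # linear (integer)
momentum = sample_reverse_log(0.9, 0.999)  # reverse logarithmic
```

With linear sampling, half of all learning-rate draws would land above 0.05, wasting most of the budget on one order of magnitude; log sampling spreads the budget evenly across 1e-4, 1e-3, 1e-2, and 1e-1.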
Flashcards
When should you use Bayesian optimization over grid search?
When each training run is expensive and your compute budget is limited. Bayesian optimization learns from previous trials and intelligently selects the next set of hyperparameters to try, making it the most sample-efficient strategy.
When should you use reverse logarithmic scaling?
When tuning values very close to 1, such as momentum (0.9 to 0.999), because the meaningful differences are concentrated very close to 1.0.