# Evaluation Metrics
Choosing the right evaluation metric is just as important as choosing the right model. The wrong metric can give you a false sense of confidence. This section covers the essential metrics for classification, regression, and forecasting problems.
## Classification Metrics

### Confusion Matrix

The confusion matrix is the foundation for understanding classification metrics.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | TP (True Positive) | FN (False Negative): a "miss" (Type II error) |
| **Actual Negative** | FP (False Positive): a "false alarm" (Type I error) | TN (True Negative) |
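The four cells can be computed directly from label arrays. A minimal sketch (the labels here are synthetic, with 1 = positive and 0 = negative):

```python
import numpy as np

# Synthetic example labels: 1 = positive class, 0 = negative class.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # correctly flagged positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # misses (Type II errors)
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false alarms (Type I errors)
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # correctly flagged negatives

print(tp, fn, fp, tn)  # 3 1 1 3
```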
### Metric Reference
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes only |
| Precision | TP / (TP + FP) | When false positives are costly |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall |
| AUC-ROC | Area under ROC curve (TPR vs FPR) | Comparing classifiers; threshold-independent |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters |
| Specificity | TN / (TN + FP) | When correctly identifying negatives matters |
**Quick selection guide:**

- FP costly (wrongly accusing someone) → Precision
- FN costly (missing a disease) → Recall
- Need balance → F1 Score
- Compare models overall → AUC-ROC
- Imbalanced data → Never use accuracy; use AUC, F1, or Precision/Recall
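The formulas in the reference table translate directly into code. A minimal sketch from raw confusion-matrix counts (the function names are illustrative, not from a library):

```python
# Core classification metrics from confusion-matrix counts.
def precision(tp, fp):
    # Of everything predicted positive, what fraction was right?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, what fraction did we catch?
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 80 true positives, 20 false positives, 10 false negatives.
print(precision(80, 20))      # 0.8
print(recall(80, 10))         # 0.888...
print(f1_score(80, 20, 10))   # 0.8421...
```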
### Precision vs Recall: Real-World Examples
| Scenario | Priority | Why |
|---|---|---|
| Spam filter | Precision | Do not lose real emails (minimize false positives) |
| Cancer screening | Recall | Do not miss any cancer cases (minimize false negatives) |
| Fraud detection | Recall | Catch all fraudulent transactions |
| Content moderation | Precision | Do not wrongly remove legitimate content |
Accuracy is misleading on imbalanced data. A model that always predicts the majority class achieves 99% accuracy on a 99/1 split — but it catches zero minority cases. Always use AUC, F1, or Precision/Recall for imbalanced datasets.
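The 99/1 failure mode is easy to reproduce. A sketch with synthetic labels:

```python
import numpy as np

# 1000 samples, 1% positive class (synthetic data).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1                      # 10 positives out of 1000
y_pred = np.zeros(1000, dtype=int)   # model always predicts the majority class

accuracy = float((y_true == y_pred).mean())
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
recall = tp / (tp + fn)

print(accuracy)  # 0.99 — looks great
print(recall)    # 0.0  — catches zero minority cases
```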
## Regression Metrics
| Metric | Formula | Key Detail |
|---|---|---|
| MSE | mean((y - y_pred)²) | Penalizes large errors more (squared). Always positive |
| RMSE | sqrt(MSE) | Same units as the target variable. Most commonly used |
| MAE | mean(|y - y_pred|) | Robust to outliers (unlike MSE) |
| MAPE | mean(|y - y_pred| / |y|) * 100 | Percentage error. Interpretable but undefined when y=0 |
| R² (R-squared) | 1 - (SS_res / SS_tot) | Proportion of variance explained. 1.0 = perfect, 0 = baseline |
```python
# Regression metric formulas
import numpy as np

def rmse(y, y_pred):
    # Root mean squared error: same units as the target variable.
    return np.sqrt(np.mean((y - y_pred) ** 2))

def mae(y, y_pred):
    # Mean absolute error: robust to outliers.
    return np.mean(np.abs(y - y_pred))

def r_squared(y, y_pred):
    # Proportion of variance explained; 1.0 = perfect fit.
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - (ss_res / ss_tot)
```
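The table also lists MAPE, which the snippet above omits. A sketch with a guard for the undefined y = 0 case (masking out zero targets is my addition, not a universal convention):

```python
import numpy as np

def mape(y, y_pred):
    # Mean absolute percentage error; undefined where y == 0,
    # so those points are masked out before averaging.
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y != 0
    return float(np.mean(np.abs((y[mask] - y_pred[mask]) / y[mask])) * 100)

y = np.array([100.0, 200.0, 0.0, 400.0])
y_pred = np.array([110.0, 190.0, 5.0, 400.0])
print(mape(y, y_pred))  # 5.0 — averaged over the three nonzero targets
```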
Never use accuracy, precision, recall, or F1 for regression problems. These are classification-only metrics. For regression, use RMSE, MSE, MAE, MAPE, or R².
## Forecasting Metrics
| Metric | What It Measures | Practical Guidance |
|---|---|---|
| Weighted Quantile Loss (wQL) | Accuracy of probabilistic forecasts at specific quantiles | Use higher quantile (P75, P90) when underforecasting is more costly. Lower quantile (P10, P25) when overforecasting is costly |
| Coverage Score | Whether prediction intervals are well-calibrated | Coverage at quantile q should approximately equal q. If 90% interval covers only 70%, the model underestimates uncertainty |
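Both forecasting metrics can be sketched in a few lines; the function names here are illustrative, not taken from any particular forecasting library. The quantile (pinball) loss charges q per unit of under-forecast and 1 − q per unit of over-forecast, which is why a high quantile like P90 discourages forecasting low:

```python
import numpy as np

def quantile_loss(y, y_pred, q):
    # Pinball loss: under-forecasts (y above prediction) cost q per unit,
    # over-forecasts cost 1 - q per unit.
    diff = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def coverage(y, lower, upper):
    # Fraction of actuals inside their prediction interval;
    # for a well-calibrated q-level band this should be close to q.
    y = np.asarray(y, dtype=float)
    return float(np.mean((np.asarray(lower) <= y) & (y <= np.asarray(upper))))

# At q = 0.9, missing low by 2 units costs 9x more than missing high by 2.
print(quantile_loss([10.0], [8.0], 0.9))   # 1.8
print(quantile_loss([10.0], [12.0], 0.9))  # 0.2
```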
## Flashcards
**Q:** When should you use precision as your primary metric?

**A:** When false positives are costly. Examples: spam filters (don't lose real emails), content moderation (don't remove legitimate content), loan approvals (don't wrongly deny applications). Precision = TP / (TP + FP).