
Evaluation Metrics

Choosing the right evaluation metric is just as important as choosing the right model. The wrong metric can give you a false sense of confidence. This section covers the essential metrics for classification, regression, and forecasting problems.

Classification Metrics

Confusion Matrix

The confusion matrix is the foundation for understanding classification metrics.

```text
                  Predicted Positive      Predicted Negative
Actual Positive   TP (True Positive)      FN (False Negative)  ← "Missed" (Type II error)
Actual Negative   FP (False Positive)     TN (True Negative)
                  ↑ "False alarm" (Type I error)
```
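The four counts can be computed directly from label arrays. A minimal NumPy sketch, using hypothetical labels:

```python
import numpy as np

# Hypothetical binary labels and predictions
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # both say positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed (Type II error)
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarm (Type I error)
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # both say negative
```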

Metric Reference

| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes only |
| Precision | TP / (TP + FP) | When false positives are costly |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance between precision and recall |
| AUC-ROC | Area under ROC curve (TPR vs FPR) | Comparing classifiers; threshold-independent |
| Log Loss | -mean(y × log(p) + (1 - y) × log(1 - p)) | When probability calibration matters |
| Specificity | TN / (TN + FP) | When correctly identifying negatives matters |
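The count-based formulas in the table translate directly into code. A sketch, using hypothetical counts:

```python
# Classification metrics from confusion-matrix counts
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical counts: tp=80, fp=20, fn=10, tn=890
# precision(80, 20) → 0.8; recall(80, 10) ≈ 0.889
```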
Metric Selection Guide
  • FP costly (wrongly accusing someone) → Precision
  • FN costly (missing a disease) → Recall
  • Need balance → F1 Score
  • Compare models overall → AUC-ROC
  • Imbalanced data → Never use accuracy; use AUC, F1, or Precision/Recall

Precision vs Recall: Real-World Examples

| Scenario | Priority | Why |
|---|---|---|
| Spam filter | Precision | Do not lose real emails (minimize false positives) |
| Cancer screening | Recall | Do not miss any cancer cases (minimize false negatives) |
| Fraud detection | Recall | Catch all fraudulent transactions |
| Content moderation | Precision | Do not wrongly remove legitimate content |
Common Misconception

Accuracy is misleading on imbalanced data. A model that always predicts the majority class achieves 99% accuracy on a 99/1 split — but it catches zero minority cases. Always use AUC, F1, or Precision/Recall for imbalanced datasets.
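The 99/1 example can be verified in a few lines. A sketch with synthetic data:

```python
import numpy as np

# Synthetic 99/1 imbalanced labels; the "model" always predicts the majority class
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

accuracy = np.mean(y_true == y_pred)                          # 0.99
minority_recall = np.sum((y_true == 1) & (y_pred == 1)) / 10  # 0.0
```

High accuracy, zero minority recall: exactly the failure mode the warning describes.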

Regression Metrics

| Metric | Formula | Key Detail |
|---|---|---|
| MSE | mean((y - y_pred)²) | Penalizes large errors more (squared). Always positive |
| RMSE | sqrt(MSE) | Same units as the target variable. Most commonly used |
| MAE | mean(\|y - y_pred\|) | Robust to outliers (unlike MSE) |
| MAPE | mean(\|y - y_pred\| / \|y\|) × 100 | Percentage error. Interpretable but undefined when y = 0 |
| R² (R-squared) | 1 - (SS_res / SS_tot) | Proportion of variance explained. 1.0 = perfect, 0 = baseline |
```python
# Regression metric formulas
import numpy as np

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - (ss_res / ss_tot)
```
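MAPE from the table can be sketched the same way; per the y = 0 caveat, this version assumes all targets are nonzero:

```python
import numpy as np

def mape(y, y_pred):
    # Assumes no element of y is zero (MAPE is undefined there)
    y, y_pred = np.asarray(y, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y - y_pred) / y)) * 100
```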
Caution

Never use accuracy, precision, recall, or F1 for regression problems. These are classification-only metrics. For regression, use RMSE, MSE, MAE, MAPE, or R².

Forecasting Metrics

| Metric | What It Measures | Practical Guidance |
|---|---|---|
| Weighted Quantile Loss (wQL) | Accuracy of probabilistic forecasts at specific quantiles | Use a higher quantile (P75, P90) when underforecasting is more costly; a lower quantile (P10, P25) when overforecasting is more costly |
| Coverage Score | Whether prediction intervals are well-calibrated | Coverage at quantile q should approximately equal q. If a 90% interval covers only 70% of actuals, the model underestimates uncertainty |
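The quantile (pinball) loss underlying wQL can be sketched as follows; its asymmetry is why high quantiles penalize underforecasting more:

```python
import numpy as np

def quantile_loss(y, y_pred, q):
    # Pinball loss: under-forecasts weighted by q, over-forecasts by (1 - q)
    diff = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# At q = 0.9, under-forecasting by 10 costs 9.0; over-forecasting by 10 costs 1.0
under = quantile_loss([100.0], [90.0], q=0.9)
over = quantile_loss([100.0], [110.0], q=0.9)
```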

Flashcards

Question

When should you use precision as your primary metric?

Answer

When false positives are costly. Examples: spam filters (don't lose real emails), content moderation (don't remove legitimate content), loan approvals (don't wrongly deny applications). Precision = TP / (TP + FP).