# Evaluation Metrics
Choosing the right evaluation metric is just as important as choosing the right model. The wrong metric can give you a false sense of confidence. This section covers the essential metrics for classification, regression, and forecasting problems.
## Classification Metrics

### Confusion Matrix

The confusion matrix is the foundation for understanding classification metrics.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | TP (True Positive) | FN (False Negative): a "miss" (Type II error) |
| **Actual Negative** | FP (False Positive): a "false alarm" (Type I error) | TN (True Negative) |
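The four cells can be computed directly from label arrays. A minimal sketch (the labels here are synthetic, with 1 = positive and 0 = negative):

```python
import numpy as np

# Synthetic example labels: 1 = positive class, 0 = negative class.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # correctly flagged positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # misses (Type II errors)
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false alarms (Type I errors)
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # correctly flagged negatives

print(tp, fn, fp, tn)  # 3 1 1 3
```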
### Metric Reference
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes only |
| Precision | TP / (TP + FP) | When false positives are costly |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall |
| AUC-ROC | Area under ROC curve (TPR vs FPR) | Comparing classifiers; threshold-independent |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | When probability calibration matters |
| Specificity | TN / (TN + FP) | When correctly identifying negatives matters |
**Quick selection guide:**

- FP costly (wrongly accusing someone) → Precision
- FN costly (missing a disease) → Recall
- Need balance → F1 Score
- Compare models overall → AUC-ROC
- Imbalanced data → Never use accuracy; use AUC, F1, or Precision/Recall
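The formulas in the reference table translate directly into code. A minimal sketch from raw confusion-matrix counts (the function names are illustrative, not from a library):

```python
# Core classification metrics from confusion-matrix counts.
def precision(tp, fp):
    # Of everything predicted positive, what fraction was right?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, what fraction did we catch?
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 80 true positives, 20 false positives, 10 false negatives.
print(precision(80, 20))      # 0.8
print(recall(80, 10))         # 0.888...
print(f1_score(80, 20, 10))   # 0.8421...
```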
### Precision vs Recall: Real-World Examples
| Scenario | Priority | Why |
|---|---|---|
| Spam filter | Precision | Do not lose real emails (minimize false positives) |
| Cancer screening | Recall | Do not miss any cancer cases (minimize false negatives) |
| Fraud detection | Recall | Catch all fraudulent transactions |
| Content moderation | Precision | Do not wrongly remove legitimate content |
Accuracy is misleading on imbalanced data. A model that always predicts the majority class achieves 99% accuracy on a 99/1 split — but it catches zero minority cases. Always use AUC, F1, or Precision/Recall for imbalanced datasets.
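The 99/1 failure mode is easy to reproduce. A sketch with synthetic labels:

```python
import numpy as np

# 1000 samples, 1% positive class (synthetic data).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1                      # 10 positives out of 1000
y_pred = np.zeros(1000, dtype=int)   # model always predicts the majority class

accuracy = float((y_true == y_pred).mean())
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
recall = tp / (tp + fn)

print(accuracy)  # 0.99 — looks great
print(recall)    # 0.0  — catches zero minority cases
```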
## Regression Metrics
| Metric | Formula | Key Detail |
|---|---|---|
| MSE | mean((y - y_pred)²) | Penalizes large errors more (squared). Always positive |
| RMSE | sqrt(MSE) | Same units as the target variable. Most commonly used |
| MAE | mean(|y - y_pred|) | Robust to outliers (unlike MSE) |
| MAPE | mean(|y - y_pred| / |y|) * 100 | Percentage error. Interpretable but undefined when y=0 |
| R² (R-squared) | 1 - (SS_res / SS_tot) | Proportion of variance explained. 1.0 = perfect, 0 = baseline |
```python
# Regression metric formulas
import numpy as np

def rmse(y, y_pred):
    # Root mean squared error: same units as the target variable.
    return np.sqrt(np.mean((y - y_pred) ** 2))

def mae(y, y_pred):
    # Mean absolute error: robust to outliers.
    return np.mean(np.abs(y - y_pred))

def r_squared(y, y_pred):
    # Proportion of variance explained; 1.0 = perfect fit.
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - (ss_res / ss_tot)
```
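The table also lists MAPE, which the snippet above omits. A sketch with a guard for the undefined y = 0 case (masking out zero targets is my addition, not a universal convention):

```python
import numpy as np

def mape(y, y_pred):
    # Mean absolute percentage error; undefined where y == 0,
    # so those points are masked out before averaging.
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y != 0
    return float(np.mean(np.abs((y[mask] - y_pred[mask]) / y[mask])) * 100)

y = np.array([100.0, 200.0, 0.0, 400.0])
y_pred = np.array([110.0, 190.0, 5.0, 400.0])
print(mape(y, y_pred))  # 5.0 — averaged over the three nonzero targets
```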
Never use accuracy, precision, recall, or F1 for regression problems. These are classification-only metrics. For regression, use RMSE, MSE, MAE, MAPE, or R².
## Forecasting Metrics
| Metric | What It Measures | Practical Guidance |
|---|---|---|
| Weighted Quantile Loss (wQL) | Accuracy of probabilistic forecasts at specific quantiles | Use higher quantile (P75, P90) when underforecasting is more costly. Lower quantile (P10, P25) when overforecasting is costly |
| Coverage Score | Whether prediction intervals are well-calibrated | Coverage at quantile q should approximately equal q. If 90% interval covers only 70%, the model underestimates uncertainty |
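Both forecasting metrics can be sketched in a few lines; the function names here are illustrative, not taken from any particular forecasting library. The quantile (pinball) loss charges q per unit of under-forecast and 1 − q per unit of over-forecast, which is why a high quantile like P90 discourages forecasting low:

```python
import numpy as np

def quantile_loss(y, y_pred, q):
    # Pinball loss: under-forecasts (y above prediction) cost q per unit,
    # over-forecasts cost 1 - q per unit.
    diff = np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def coverage(y, lower, upper):
    # Fraction of actuals inside their prediction interval;
    # for a well-calibrated q-level band this should be close to q.
    y = np.asarray(y, dtype=float)
    return float(np.mean((np.asarray(lower) <= y) & (y <= np.asarray(upper))))

# At q = 0.9, missing low by 2 units costs 9x more than missing high by 2.
print(quantile_loss([10.0], [8.0], 0.9))   # 1.8
print(quantile_loss([10.0], [12.0], 0.9))  # 0.2
```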
## Flashcards
**Q:** When should you use precision as your primary metric?

**A:** When false positives are costly. Examples: spam filters (don't lose real emails), content moderation (don't remove legitimate content), loan approvals (don't wrongly deny applications). Precision = TP / (TP + FP).