# Statistics & Probability
A solid grasp of statistics underpins effective machine learning. From choosing the right data transforms to diagnosing model problems, statistical concepts help you make informed decisions at every stage of the ML pipeline.
## Distributions
Understanding data distributions helps you select appropriate models and transformations.
| Distribution | Shape / Type | When It Applies | Example |
|---|---|---|---|
| Normal (Gaussian) | Continuous, symmetric, bell-shaped | Most common assumption in statistics | Heights, test scores, measurement errors |
| Poisson | Discrete, count data | Count of events in a fixed interval | Number of calls per hour, defects per batch |
| Binomial | Discrete, number of successes | Fixed number of independent trials | Coin flips, pass/fail in n tests |
| Uniform | All values equally likely | No prior knowledge about distribution | Random number generators, initial random weights |
```python
import numpy as np

# Generate 1,000 samples from each common distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)    # mean 0, std dev 1
poisson_data = np.random.poisson(lam=5, size=1000)           # rate λ = 5
binomial_data = np.random.binomial(n=10, p=0.5, size=1000)   # 10 trials, p = 0.5
uniform_data = np.random.uniform(low=0, high=1, size=1000)
```
## Skewness and Transformations
| Distribution Shape | Relationship | Transform |
|---|---|---|
| Right-skewed (long tail to the right) | Mode < Median < Mean | Log transform to normalize |
| Left-skewed (long tail to the left) | Mean < Median < Mode | Square or exponential transform may help |
| Symmetric (Normal) | Mean ≈ Median ≈ Mode | No transform needed |
When you see "mode < median < mean" or "long tail to the right," the data is right-skewed. Apply a log transform to make the distribution more normal, which improves the performance of linear models and models that assume normality.
```python
import numpy as np

# Right-skewed data → log transform
original_data = [1, 2, 3, 5, 8, 15, 50, 200, 1000]
transformed = np.log1p(original_data)  # log(1 + x) handles zeros
```
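To confirm the transform actually normalized the data, compare skewness before and after. A quick check using `scipy.stats.skew` (the sample values here are illustrative):

```python
import numpy as np
from scipy.stats import skew

# Illustrative right-skewed sample (long tail to the right)
original_data = np.array([1, 2, 3, 5, 8, 15, 50, 200, 1000], dtype=float)
transformed = np.log1p(original_data)  # log(1 + x) handles zeros

print(f"skew before: {skew(original_data):.2f}")  # strongly positive
print(f"skew after:  {skew(transformed):.2f}")    # much closer to 0
```

A skewness near 0 indicates an approximately symmetric distribution; values above roughly 1 indicate strong right skew.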
## Correlation and Multicollinearity
Correlation measures the linear relationship between two variables. When features are highly correlated, it creates multicollinearity problems for many models.
| Method | What It Tells You | Threshold |
|---|---|---|
| Correlation Matrix / Heatmap | Pairwise correlation between features | \|r\| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much a feature's variance is inflated by correlation with others | VIF > 5-10 = problematic |
| PCA eigenvalues | Near-zero eigenvalue = features are linearly dependent | Eigenvalue near 0 = collinear |
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df / X: your feature DataFrame (assumed to be defined already)
correlation_matrix = df.corr()  # inspect pairs with |r| > 0.8

# Compute the VIF for every column of the feature matrix X
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data[vif_data["VIF"] > 5])  # flag problematic features
```
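The table's third method, near-zero eigenvalues, takes only a few lines. This sketch builds a hypothetical feature matrix where one column is an exact linear combination of the others, so the correlation matrix has an eigenvalue at (essentially) zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical feature matrix: x3 is perfectly collinear with x1 and x2
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 2 * X["x1"] - X["x2"]

eigenvalues = np.linalg.eigvalsh(X.corr().values)
print(eigenvalues)  # smallest eigenvalue ≈ 0 → features are linearly dependent
```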
## Outlier Detection
| Method | How It Works | When to Use |
|---|---|---|
| IQR Rule | Outlier if value < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR | General-purpose, robust |
| Z-Score | Outlier if \|z-score\| > 3 | Data is approximately normal |
| Box Plots | Visual identification of outliers | Quick exploratory analysis |
What to do with outliers: Remove them, cap/winsorize them, or apply a robust transformation (log, square root).
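A minimal sketch of the IQR rule from the table (the sample array is made up):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is the obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # → [95]
```

Because the rule is based on quartiles rather than the mean, the extreme value itself barely shifts the fences, which is what makes the method robust.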
## Residual Plot Analysis
Residual plots (predicted values on the x-axis, residuals on the y-axis) help diagnose model problems after training.
| Pattern in Residual Plot | What It Means | Action |
|---|---|---|
| Random scatter around zero | Good model fit | No action needed |
| Curved pattern (U-shape) | Non-linear relationship not captured | Add polynomial features or use a non-linear model |
| Fan shape (wider spread on one side) | Heteroscedasticity (non-constant variance) | Log transform the target variable or use robust regression |
| Systematic over/underestimation | Model has systematic bias | Add missing features or change model type |
A good residual plot shows random scatter around zero with no visible patterns. Any systematic pattern indicates your model is missing something.
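To see the U-shaped pattern in numbers rather than a plot, this sketch fits a straight line to deliberately quadratic synthetic data and inspects the residuals (`numpy.polyfit` stands in for any linear model):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # true relationship is quadratic

# Fit the wrong (linear) model and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# U-shape: residuals positive at both ends, negative in the middle
print(residuals[:20].mean(), residuals[90:110].mean(), residuals[-20:].mean())
```

The sign flip from the edges to the middle is exactly the curved pattern described in the table, and it signals that a non-linear term is missing.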
## p-Values and Statistical Significance
A p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.
- p < 0.05 → typically considered statistically significant
- Used in feature selection (chi-squared tests, ANOVA)
- Lower p-value = stronger evidence against the null hypothesis
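As a concrete example of the chi-squared test mentioned above, this sketch tests whether a hypothetical categorical feature is associated with a class label (the counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of feature category vs. class label (hypothetical counts)
#                       class 0  class 1
observed = np.array([[50, 10],   # feature = A
                     [20, 40]])  # feature = B

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.2e}")  # p < 0.05 → reject independence; keep the feature
```

A small p-value means the feature's categories are distributed very differently across the classes, which is why such tests are used for feature selection.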
## Flashcards
How do you identify a right-skewed distribution from summary statistics?
Mode < Median < Mean. The long tail extends to the right, pulling the mean higher than the median. Apply a log transform to normalize it.