Calibrating Win-Probability Models: Reliability Curves Using 100,000+ Game-State Snapshots
An XGBoost classifier produces probabilities. The probabilities are usually wrong — not in the sense that they pick the wrong winner, but in the sense that "60% confidence" predictions do not actually win 60% of the time. Sometimes they win 55%. Sometimes 65%. The mismatch between predicted probability and observed frequency is called miscalibration, and fixing it is the difference between a model you can size positions against and a model that will quietly destroy your bankroll.
This post walks through the full calibration pipeline for sports win-probability models: build reliability curves on 100,000+ game-state snapshots, compute Expected Calibration Error, fit isotonic and Platt calibrators, and validate on holdout data. Working Python throughout.
The reliability curve
A reliability curve plots predicted probability against observed frequency. Bin the predictions, either into equal-width probability bins (0-10%, 10-20%, and so on, as the code below does) or into equal-count deciles by rank. For each bin, compute the mean predicted probability and the actual win rate.
If the model is well-calibrated, the points lie on the diagonal y=x. Predictions of 30% have an actual 30% win rate. Predictions of 70% have an actual 70% win rate.
Most uncalibrated XGBoost models do not lie on the diagonal. They are overconfident at the extremes — the model says 90% but the actual rate is 80%. Sometimes they are under-confident in the middle — the model says 55% but the actual rate is 60%.
import numpy as np

def reliability_data(y_true, y_prob, n_bins=15):
    """Per-bin mean predicted probability and observed win rate."""
    bins = np.linspace(0, 1, n_bins + 1)
    points = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if hi == bins[-1]:          # include predictions of exactly 1.0 in the last bin
            mask |= y_prob == hi
        n = mask.sum()
        if n == 0:
            continue                # skip empty bins
        mean_pred = y_prob[mask].mean()
        actual = y_true[mask].mean()
        points.append({"bin_lo": lo, "bin_hi": hi, "n": int(n),
                       "mean_pred": mean_pred, "actual": actual})
    return points
Plot mean_pred vs actual with point sizes proportional to bin n. Add the diagonal. The further the points are from the diagonal, the worse the calibration.
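A minimal plotting sketch, assuming matplotlib is available; the marker scaling and labels are one choice among many:

import matplotlib.pyplot as plt

def plot_reliability(points):
    # Diagonal = perfect calibration
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    sizes = [p["n"] for p in points]
    max_n = max(sizes)
    plt.scatter([p["mean_pred"] for p in points],
                [p["actual"] for p in points],
                s=[200 * n / max_n for n in sizes])   # marker area proportional to bin count
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed win rate")
    plt.show()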
Expected Calibration Error
Expected Calibration Error (ECE) is the single-number summary of the reliability curve. It is the bin-count-weighted average of the absolute difference between mean predicted probability and observed frequency:
def expected_calibration_error(y_true, y_prob, n_bins=15):
    """Bin-count-weighted mean absolute gap between confidence and accuracy."""
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    n_total = len(y_true)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if hi == bins[-1]:          # include predictions of exactly 1.0 in the last bin
            mask |= y_prob == hi
        if mask.sum() == 0:
            continue
        bin_acc = y_true[mask].mean()    # observed win rate in this bin
        bin_conf = y_prob[mask].mean()   # mean predicted probability in this bin
        ece += (mask.sum() / n_total) * abs(bin_acc - bin_conf)
    return ece
For sports win-probability models, the practical thresholds are:
| ECE | Interpretation |
|---|---|
| < 0.03 | Excellent. Production-ready. |
| 0.03 - 0.05 | Good. Safe for Kelly sizing. |
| 0.05 - 0.07 | Marginal. Recalibrate before betting. |
| 0.07 - 0.10 | Bad. Do not size positions against. |
| > 0.10 | Broken. Retrain or rebuild. |
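If you want that table as code, a tiny helper works; the name interpret_ece and its return strings are just for illustration, the thresholds are the ones listed above:

def interpret_ece(ece):
    # Thresholds taken from the table above
    if ece < 0.03:
        return "excellent: production-ready"
    if ece < 0.05:
        return "good: safe for Kelly sizing"
    if ece < 0.07:
        return "marginal: recalibrate before betting"
    if ece < 0.10:
        return "bad: do not size positions against"
    return "broken: retrain or rebuild"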
Isotonic calibration
Isotonic regression learns a non-parametric monotonic mapping from predicted probabilities to calibrated probabilities. It is the right default for sports models because it can correct any monotonic miscalibration without imposing a specific functional form.
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(y_true_calib, y_prob_calib):
    # Learn a monotonic map from raw probability to calibrated probability
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(y_prob_calib, y_true_calib)
    return iso

# Fit on a held-out calibration set, NOT the training set
iso = fit_isotonic(y_calib, raw_pred_calib)

# At inference
def predict_calibrated(features):
    raw = clf.predict_proba(features)[:, 1]
    return iso.transform(raw)
Critical: fit the calibrator on a separate held-out set, not on the training data. Fitting on training data produces an overconfident calibrator that does not generalize.
Platt scaling
Platt scaling fits a logistic regression to map raw probabilities to calibrated ones. It is more constrained than isotonic (sigmoid shape only) but works well when the miscalibration is sigmoid-like and the calibration set is small.
from sklearn.linear_model import LogisticRegression

def fit_platt(y_true_calib, y_prob_calib):
    # Large C keeps regularization negligible, closer to classic Platt scaling
    lr = LogisticRegression(C=1e6)
    lr.fit(y_prob_calib.reshape(-1, 1), y_true_calib)
    return lr

platt = fit_platt(y_calib, raw_pred_calib)

def predict_platt(features):
    raw = clf.predict_proba(features)[:, 1].reshape(-1, 1)
    return platt.predict_proba(raw)[:, 1]
For sports models with 10,000+ calibration samples, isotonic typically produces lower ECE than Platt. For models with under 1,000 samples, Platt is more stable.
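When in doubt, measure both on a slice neither calibrator was fit on. A minimal comparison, assuming y_test and raw_test are such a held-out slice of outcomes and raw model probabilities:

iso_ece = expected_calibration_error(y_test, iso.transform(raw_test))
platt_ece = expected_calibration_error(
    y_test, platt.predict_proba(raw_test.reshape(-1, 1))[:, 1])
print(f"Isotonic ECE: {iso_ece:.4f}  Platt ECE: {platt_ece:.4f}")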
The 100,000+ snapshot dataset
Calibration matters more for in-play models than for pre-game models because the snapshot count is so much higher. A single NBA game produces 30-50 snapshots (one per minute or per significant event). A full season produces 30,000-50,000 snapshots. Three seasons produce 100,000+.
With this many calibration samples, isotonic regression has plenty to work with, and ECE typically drops to 0.02-0.04 after calibration. The reliability curve sits essentially on the diagonal.
# Production pipeline
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hold out 20% for calibration
# (with in-play snapshots, consider splitting by game so snapshots from the
# same game do not land on both sides of the split)
X_model, X_calib, y_model, y_calib = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42
)

# Train base model on the 80%
clf = XGBClassifier(n_estimators=400, max_depth=5, learning_rate=0.05)
clf.fit(X_model, y_model)

# Get raw predictions on the 20% calibration set
raw_calib = clf.predict_proba(X_calib)[:, 1]

# Fit isotonic on the calibration set
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_calib, y_calib)

# Final ECE
calibrated_calib = iso.transform(raw_calib)
print(f"Pre-calibration ECE:  {expected_calibration_error(y_calib, raw_calib):.4f}")
print(f"Post-calibration ECE: {expected_calibration_error(y_calib, calibrated_calib):.4f}")
A typical output for a well-trained NBA in-play model might show pre-calibration ECE around 0.05 and post-calibration ECE around 0.03. Real lift from a single calibration step.
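One caveat: the post-calibration number above is computed on the same set the isotonic was fit on, which flatters it slightly. To validate on data neither the model nor the calibrator has seen, carve out a third split. A sketch, with the split sizes as one reasonable choice rather than the only one:

# Three-way split: 60% model training, 20% calibration, 20% untouched validation
X_model, X_rest, y_model, y_rest = train_test_split(
    X_full, y_full, test_size=0.4, random_state=42
)
X_calib, X_test, y_calib, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# ... train clf on (X_model, y_model) and fit iso on (X_calib, y_calib) as above ...

raw_test = clf.predict_proba(X_test)[:, 1]
print(f"Holdout ECE: {expected_calibration_error(y_test, iso.transform(raw_test)):.4f}")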
Per-sport calibration tables
Different sports have different miscalibration patterns. NCAAMB models tend to be slightly overconfident; soccer models tend to be slightly under-confident. Fit a separate isotonic per sport rather than a global one. The per-sport approach typically reduces ECE by an additional 0.01-0.02.
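A sketch of the per-sport setup, assuming each calibration snapshot carries a sport label (sports_calib below is that assumed label array, aligned with raw_calib and y_calib):

# One isotonic per sport instead of a single global calibrator
calibrators = {}
for sport in np.unique(sports_calib):        # e.g. "nba", "ncaamb", "soccer"
    m = sports_calib == sport
    cal = IsotonicRegression(out_of_bounds="clip")
    cal.fit(raw_calib[m], y_calib[m])
    calibrators[sport] = cal

def predict_for_sport(sport, features):
    raw = clf.predict_proba(features)[:, 1]
    return calibrators[sport].transform(raw)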
Live recalibration
Calibration drifts over the season. The fitted isotonic from October is not perfect by April. Two countermeasures:
Periodic retraining: refit the isotonic monthly or weekly on the most recent N=500 outcomes per sport.
Live recalibrator: maintain a rolling window of recent (prediction, outcome) pairs and refit on every settlement event. The recalibrator runs in-process and adjusts model output at inference time.
from collections import deque

class LiveRecalibrator:
    def __init__(self, window=500):
        self.window = window
        self.preds = deque(maxlen=window)      # rolling raw predictions
        self.actuals = deque(maxlen=window)    # rolling settled outcomes (0/1)
        self._iso = None

    def record(self, pred, actual):
        # Called on every settlement event
        self.preds.append(pred)
        self.actuals.append(actual)
        if len(self.preds) >= 50:              # wait for a minimum sample before refitting
            self._iso = IsotonicRegression(out_of_bounds="clip")
            self._iso.fit(list(self.preds), list(self.actuals))

    def adjust(self, raw_prob):
        # Pass raw model output through the most recently fitted isotonic
        if self._iso is None:
            return raw_prob
        return float(self._iso.transform([raw_prob])[0])
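Usage is two calls: adjust at inference, record at settlement. Here snapshot_features and outcome are placeholders for your own feature row and settled result:

recal = LiveRecalibrator(window=500)

# At inference: calibrate the raw model output before sizing a position
raw_p = float(clf.predict_proba(snapshot_features)[:, 1][0])
p = recal.adjust(raw_p)

# At settlement: feed the realized outcome back in (1 = win, 0 = loss)
recal.record(raw_p, outcome)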
The bottom line
Calibration is the difference between a model that picks winners and a model whose probabilities are safe to size positions against. The pipeline is small — reliability curve, ECE, isotonic regression, periodic refit — but each piece is necessary. Skip any of them and Kelly sizing breaks.
Per-sport calibrated probabilities, ECE published
ZenHodl publishes Expected Calibration Error per sport on a public methodology page. Calibrated probabilities for 11 sports. Free seven-day trial.
Try ZenHodl free