Calibrating XGBoost Probabilities with Isotonic Regression
If you're calling model.predict_proba() on an XGBoost model and treating the output as a real probability, you're likely making decisions on the wrong numbers. XGBoost's raw probability outputs are often systematically miscalibrated — when the model says "80% likely," the true frequency might be anywhere from 60% to 90%.
This guide shows you how to measure calibration error, fix it with post-hoc calibration, and verify the fix works.
The Problem: Why Raw Probabilities Are Wrong
XGBoost optimizes log-loss during training, which encourages the model to be directionally correct but doesn't guarantee the predicted probabilities match real-world frequencies. This is especially true when:
- Your dataset is imbalanced (more of one class than another)
- You use aggressive hyperparameters (deep trees, high learning rate)
- You apply the model under domain shift (trained on 2024 data, predicting on 2026 data)
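To see the effect concretely, here is a minimal sketch on a synthetic imbalanced dataset. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same boosting family, no extra dependency); the dataset, class weights, and hyperparameters are all made up for illustration:

```python
# Sketch: deep boosted trees on imbalanced synthetic data tend to produce
# probabilities whose average disagrees with the actual positive rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 90/10 class imbalance, purely synthetic
X, y = make_classification(
    n_samples=4000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Aggressive settings (deep trees, many rounds) exaggerate the effect
clf = GradientBoostingClassifier(max_depth=6, n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Compare average predicted probability to the actual positive rate
print(f"mean predicted: {probs.mean():.3f}  actual rate: {y_te.mean():.3f}")
```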
Step 1: Measure Calibration Error (ECE)
Expected Calibration Error (ECE) bins predictions into buckets (e.g., 0.7–0.8) and compares the average predicted probability to the actual win rate in each bucket:
```python
import numpy as np
from sklearn.calibration import calibration_curve

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Compute ECE: weighted average of |predicted - actual| per bin."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    total = len(y_true)
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Close the last bin on the right so predictions of exactly 1.0 are counted
        if i == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        avg_pred = y_prob[mask].mean()
        avg_true = y_true[mask].mean()
        ece += abs(avg_pred - avg_true) * mask.sum() / total
    return ece

# Example: measure your model's calibration
# (assumes X_train, y_train, X_test, y_test splits already exist)
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
ece = expected_calibration_error(y_test, y_prob)
print(f"ECE before calibration: {ece:.4f}")
# Typical output: 0.04 - 0.15 (4-15 percentage points off)
```
Step 2: Apply Isotonic Regression
Isotonic regression learns a monotonic mapping from raw probabilities to calibrated ones. It's non-parametric (no assumptions about the shape of the miscalibration) and works better than Platt scaling for XGBoost:
```python
from sklearn.isotonic import IsotonicRegression

# IMPORTANT: fit on a held-out calibration set (X_cal, y_cal), NOT the training data
y_cal_prob = model.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(y_cal_prob, y_cal)

# Now calibrate test predictions
y_prob_raw = model.predict_proba(X_test)[:, 1]
y_prob_calibrated = iso.predict(y_prob_raw)

ece_after = expected_calibration_error(y_test, y_prob_calibrated)
print(f"ECE after isotonic: {ece_after:.4f}")
# Typical output: 0.005 - 0.02 (often a ~10x improvement)
```
Step 3: Verify with a Reliability Diagram
```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for ax, probs, title in [
    (ax1, y_prob_raw, "Before Calibration"),
    (ax2, y_prob_calibrated, "After Isotonic"),
]:
    fraction_pos, mean_pred = calibration_curve(
        y_test, probs, n_bins=10, strategy='uniform'
    )
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
    ax.plot(mean_pred, fraction_pos, 'o-', label='Model')
    ax.set_xlabel('Predicted probability')
    ax.set_ylabel('Actual frequency')
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.savefig('calibration_comparison.png', dpi=150)
plt.show()
```
Isotonic vs Platt Scaling
Two common calibration methods:
- Platt scaling: Fits a logistic regression on raw probabilities. Assumes sigmoid-shaped miscalibration. Works well for SVMs, less well for tree-based models.
- Isotonic regression: Non-parametric, learns any monotonic mapping. Better for XGBoost/LightGBM where miscalibration is often non-sigmoid. Needs more calibration data (~500+ samples).
For gradient-boosted trees, isotonic almost always wins.
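A small synthetic sketch of the difference. The miscalibration shape here (true rate proportional to the square root of the raw score) is made up to be deliberately non-sigmoid: Platt is forced into a sigmoid, while isotonic can bend to fit any monotonic shape:

```python
# Sketch: Platt scaling vs isotonic regression on the same held-out scores.
# Raw scores and outcomes are synthetic; the sqrt miscalibration is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw = rng.uniform(0, 1, 2000)
# True positive rate at raw score r is ~0.9 * sqrt(r): monotonic, non-sigmoid
y_cal = (rng.uniform(0, 1, 2000) < 0.9 * np.sqrt(raw)).astype(int)

# Platt scaling: logistic regression on the log-odds of the raw score
clipped = np.clip(raw, 1e-6, 1 - 1e-6)
logit = np.log(clipped / (1 - clipped))
platt = LogisticRegression().fit(logit.reshape(-1, 1), y_cal)

# Isotonic: non-parametric monotonic fit on the raw scores directly
iso = IsotonicRegression(out_of_bounds='clip').fit(raw, y_cal)

grid = np.linspace(0.05, 0.95, 5)
grid_logit = np.log(grid / (1 - grid))
print("platt   :", platt.predict_proba(grid_logit.reshape(-1, 1))[:, 1].round(3))
print("isotonic:", iso.predict(grid).round(3))
```

Comparing either row against the true rate 0.9 * sqrt(grid) shows how the sigmoid constraint limits Platt on this shape.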
Live Rolling Recalibration
In production, your model's calibration drifts over time as the data distribution changes (new seasons, meta shifts, etc.). A rolling recalibrator fixes this without retraining:
```python
class LiveRecalibrator:
    """Rolling isotonic recalibration on the last N resolved predictions."""

    def __init__(self, buffer_size=500, min_samples=50, refit_every=25):
        self.buffer_size = buffer_size
        self.min_samples = min_samples
        self.refit_every = refit_every
        self.preds = []
        self.actuals = []
        self.n_seen = 0
        self.iso = None

    def record(self, predicted_prob, actual_outcome):
        self.preds.append(predicted_prob)
        self.actuals.append(actual_outcome)
        self.n_seen += 1
        # Keep rolling window
        if len(self.preds) > self.buffer_size:
            self.preds = self.preds[-self.buffer_size:]
            self.actuals = self.actuals[-self.buffer_size:]
        # Refit periodically, counting total observations rather than
        # buffer length (which stops changing once the buffer is full)
        if len(self.preds) >= self.min_samples and self.n_seen % self.refit_every == 0:
            self.iso = IsotonicRegression(out_of_bounds='clip')
            self.iso.fit(self.preds, self.actuals)

    def adjust(self, raw_prob):
        if self.iso is None:
            return raw_prob
        return float(self.iso.predict([raw_prob])[0])
```
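Here is a self-contained simulation of the same rolling pattern, implemented inline with IsotonicRegression directly. The drift factor (a model whose "0.8" now means ~0.64 in reality), window size, and refit cadence are all made up for illustration:

```python
# Sketch: rolling isotonic recalibration correcting a drifted model's scores.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
preds, actuals = [], []
iso = None
for step in range(1, 601):
    raw = rng.uniform(0.05, 0.95)
    # Simulated drift: true positive rate is only 0.8x the raw score
    outcome = int(rng.uniform() < raw * 0.8)
    preds.append(raw)
    actuals.append(outcome)
    preds, actuals = preds[-500:], actuals[-500:]  # rolling window
    # Refit every 25 resolved predictions once 50 have accumulated
    if len(preds) >= 50 and step % 25 == 0:
        iso = IsotonicRegression(out_of_bounds='clip').fit(preds, actuals)

# A raw 0.8 should map well below 0.8, toward the drifted true rate
print(f"raw 0.8 -> recalibrated {iso.predict([0.8])[0]:.2f}")
```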
This pattern is used in production by sports prediction platforms like ZenHodl, which serves calibrated probabilities across 10 sports with ECE consistently under 0.01.
Common Pitfalls
- Don't calibrate on training data — use a held-out calibration split (typically 20% of your data, separate from both train and test)
- Don't calibrate before feature selection — calibrate as the very last step
- Watch for sample size — isotonic regression needs ~500+ samples. With fewer, use Platt scaling
- Re-calibrate periodically — calibration degrades as the world changes. Use rolling recalibration in production
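For the first pitfall, one way to produce the three splits the snippets above assume (the 60/20/20 ratio and variable names are illustrative, not prescriptive):

```python
# Sketch: three-way split into train / calibration / test sets.
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration
X = np.random.default_rng(0).normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)

# First carve off 40%, then split that half-and-half into cal and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
print(len(X_train), len(X_cal), len(X_test))  # 600 200 200
```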