
Calibrating XGBoost Probabilities with Isotonic Regression

April 14, 2026 · 10 min read · Python, XGBoost, Calibration

If you're using model.predict_proba() from XGBoost and treating the output as a real probability, you're likely making decisions on wrong numbers. XGBoost's raw probability outputs are systematically miscalibrated — when the model says "80% likely," the true frequency might be anywhere from 60% to 90%.

This guide shows you how to measure calibration error, fix it with post-hoc calibration, and verify the fix works.
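The snippets below assume an existing train / calibration / test split (X_train, X_cal, X_test and matching labels). If you want to run them end-to-end, here's a minimal synthetic stand-in; the dataset and the 60/20/20 split sizes are arbitrary choices for the demo:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=10_000, n_features=20,
                           n_informative=10, random_state=42)

# Three-way split: 60% train, 20% calibration, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```

The calibration split exists only to fit the calibrator later; it must overlap with neither the training nor the test data.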

The Problem: Why Raw Probabilities Are Wrong

XGBoost optimizes log-loss during training, which encourages the model to be directionally correct but doesn't guarantee that predicted probabilities match real-world frequencies. Miscalibration is especially common with imbalanced classes, heavy regularization, or early stopping.

Step 1: Measure Calibration Error (ECE)

Expected Calibration Error (ECE) bins predictions into buckets (e.g., 0.7–0.8) and compares the average predicted probability to the actual win rate in each bucket:

import numpy as np
from sklearn.calibration import calibration_curve

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Compute ECE: weighted average of |predicted - actual| per bin."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    total = len(y_true)

    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Close the last bin on the right so probabilities of exactly 1.0 count
        if i == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        avg_pred = y_prob[mask].mean()
        avg_true = y_true[mask].mean()
        ece += abs(avg_pred - avg_true) * mask.sum() / total

    return ece

# Example: measure your model's calibration
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
ece = expected_calibration_error(y_test, y_prob)
print(f"ECE before calibration: {ece:.4f}")
# Typical output: 0.04 - 0.15 (4-15 percentage points off)
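A quick way to trust the metric before pointing it at a model: feed it synthetic probabilities that are calibrated by construction, then the same probabilities shifted upward. The compact ece helper below mirrors the binning logic of the function above so the block runs standalone:

```python
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    """Compact ECE, same binning logic as expected_calibration_error."""
    edges = np.linspace(0, 1, n_bins + 1)
    total, err = len(y_true), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum():
            err += abs(y_prob[mask].mean() - y_true[mask].mean()) * mask.sum() / total
    return err

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 50_000)
y = (rng.uniform(size=50_000) < p).astype(int)  # outcomes occur at exactly rate p

print(f"calibrated:   {ece(y, p):.4f}")                       # close to 0
print(f"shifted +0.2: {ece(y, np.clip(p + 0.2, 0, 1)):.4f}")  # much larger
```

The first number is limited only by sampling noise; the second reflects the systematic 0.2 shift.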

Step 2: Apply Isotonic Regression

Isotonic regression learns a monotonic mapping from raw probabilities to calibrated ones. It's non-parametric (no assumptions about the shape of the miscalibration) and works better than Platt scaling for XGBoost:

from sklearn.isotonic import IsotonicRegression

# IMPORTANT: use a held-out calibration set, NOT the training data
y_cal_prob = model.predict_proba(X_cal)[:, 1]

iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(y_cal_prob, y_cal)

# Now calibrate test predictions
y_prob_raw = model.predict_proba(X_test)[:, 1]
y_prob_calibrated = iso.predict(y_prob_raw)

ece_after = expected_calibration_error(y_test, y_prob_calibrated)
print(f"ECE after isotonic: {ece_after:.4f}")
# Typical output: 0.005 - 0.02 (10x improvement)
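It can help to see what isotonic regression actually learns: a monotone step function mapping raw scores to empirical frequencies. A self-contained sketch on deliberately overconfident synthetic scores (the 1.5x stretch away from 0.5 is invented for the demo):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
true_p = rng.uniform(0, 1, 20_000)
y = (rng.uniform(size=20_000) < true_p).astype(int)

# Overconfident scores: stretched away from 0.5, then clipped to [0, 1]
raw = np.clip(0.5 + 1.5 * (true_p - 0.5), 0, 1)

iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(raw, y)

# The learned map pulls extreme scores back toward the middle
for r in [0.1, 0.3, 0.7, 0.9]:
    print(f"raw {r:.1f} -> calibrated {iso.predict([r])[0]:.2f}")
```

A raw 0.9 here corresponds to a true frequency near 0.77, and the fitted mapping recovers roughly that value.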

Step 3: Verify with a Reliability Diagram

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

for ax, probs, title in [
    (ax1, y_prob_raw, "Before Calibration"),
    (ax2, y_prob_calibrated, "After Isotonic")
]:
    fraction_pos, mean_pred = calibration_curve(
        y_test, probs, n_bins=10, strategy='uniform'
    )
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
    ax.plot(mean_pred, fraction_pos, 'o-', label='Model')
    ax.set_xlabel('Predicted probability')
    ax.set_ylabel('Actual frequency')
    ax.set_title(title)
    ax.legend()

plt.tight_layout()
plt.savefig('calibration_comparison.png', dpi=150)
plt.show()
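When there's no display handy, the same comparison works as a plain table: calibration_curve returns per-bin (actual frequency, mean predicted) pairs you can print. Synthetic overconfident scores again, so the sketch runs standalone:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 10_000)
y = (rng.uniform(size=10_000) < p).astype(int)
overconfident = np.clip(0.5 + 1.4 * (p - 0.5), 0, 1)

# Returns (fraction_of_positives, mean_predicted_value) per bin
frac_pos, mean_pred = calibration_curve(y, overconfident, n_bins=10, strategy='uniform')

print("pred   actual  |gap|")
for mp, fp in zip(mean_pred, frac_pos):
    print(f"{mp:.2f}   {fp:.2f}    {abs(mp - fp):.2f}")
```

The gap column is largest in the outer bins, where the overconfidence bites hardest.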

Isotonic vs Platt Scaling

Two common calibration methods:

  1. Platt scaling: fits a single sigmoid to the raw scores. Parametric, so it works with small calibration sets, but it can only correct sigmoid-shaped miscalibration.
  2. Isotonic regression: fits a monotonic step function. Non-parametric, so it can fix arbitrary monotone distortions, but it needs more data (roughly 500+ samples) to avoid overfitting.

For gradient-boosted trees, isotonic almost always wins.
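For reference, Platt scaling is just a one-feature logistic regression on the raw scores. A minimal sketch using LogisticRegression directly (sklearn's CalibratedClassifierCV with method='sigmoid' packages the same idea; the synthetic scores are invented for the demo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 5_000)
y = (rng.uniform(size=5_000) < p).astype(int)
raw = np.clip(0.5 + 1.3 * (p - 0.5), 0, 1)  # mildly overconfident scores

# Platt scaling: fit sigmoid(a * score + b) against the outcomes
platt = LogisticRegression()
platt.fit(raw.reshape(-1, 1), y)

calibrated = platt.predict_proba(np.array([[0.1], [0.9]]))[:, 1]
print(f"raw 0.1 -> {calibrated[0]:.2f}, raw 0.9 -> {calibrated[1]:.2f}")
```

Because the correction is a single sigmoid, it fits reliably on small calibration sets but can't bend to match arbitrary miscalibration shapes the way isotonic can.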

Live Rolling Recalibration

In production, your model's calibration drifts over time as the data distribution changes (new seasons, meta shifts, etc.). A rolling recalibrator fixes this without retraining:

class LiveRecalibrator:
    """Rolling isotonic recalibration on last N resolved predictions."""

    def __init__(self, buffer_size=500, min_samples=50, refit_every=25):
        self.buffer_size = buffer_size
        self.min_samples = min_samples
        self.refit_every = refit_every
        self.n_seen = 0  # total recorded; survives window trimming
        self.preds = []
        self.actuals = []
        self.iso = None

    def record(self, predicted_prob, actual_outcome):
        self.preds.append(predicted_prob)
        self.actuals.append(actual_outcome)
        self.n_seen += 1
        # Keep rolling window
        if len(self.preds) > self.buffer_size:
            self.preds = self.preds[-self.buffer_size:]
            self.actuals = self.actuals[-self.buffer_size:]
        # Refit periodically. Count total samples seen, not the window
        # length: the window plateaus at buffer_size, and since
        # buffer_size % 25 == 0 that would trigger a refit on every call.
        if len(self.preds) >= self.min_samples and self.n_seen % self.refit_every == 0:
            self.iso = IsotonicRegression(out_of_bounds='clip')
            self.iso.fit(self.preds, self.actuals)

    def adjust(self, raw_prob):
        if self.iso is None:
            return raw_prob
        return float(self.iso.predict([raw_prob])[0])
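The same pattern condensed into a standalone loop: a model that runs systematically hot by 8 points (the bias is invented for the demo), with the window trimming and refit cadence inlined so this runs without the class:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
preds, actuals, iso = [], [], None

for step in range(1, 601):
    true_p = rng.uniform(0.05, 0.95)
    raw = min(true_p + 0.08, 1.0)        # model predicts ~8 points too high
    preds.append(raw)
    actuals.append(int(rng.uniform() < true_p))
    preds, actuals = preds[-500:], actuals[-500:]   # rolling window
    if len(preds) >= 50 and step % 25 == 0:         # periodic refit
        iso = IsotonicRegression(out_of_bounds='clip')
        iso.fit(preds, actuals)

print(f"raw 0.80 -> adjusted {iso.predict([0.80])[0]:.2f}")
```

After a few hundred resolved predictions, the recalibrator pulls a raw 0.80 back toward the true frequency near 0.72, without ever touching the underlying model.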

This pattern is used in production by sports prediction platforms like ZenHodl, which serves calibrated probabilities across 10 sports with ECE consistently under 0.01.

Common Pitfalls

  1. Don't calibrate on training data — use a held-out calibration split (typically 20% of your data, separate from both train and test)
  2. Don't calibrate before feature selection — calibrate as the very last step
  3. Watch for sample size — isotonic regression needs ~500+ samples. With fewer, use Platt scaling
  4. Re-calibrate periodically — calibration degrades as the world changes. Use rolling recalibration in production
