Calibrating XGBoost Probabilities with Isotonic Regression
If you're calling model.predict_proba() on an XGBoost model and treating the output as a real probability, you're likely making decisions on the wrong numbers. XGBoost's raw probability outputs are often systematically miscalibrated — when the model says "80% likely," the true frequency might be anywhere from 60% to 90%.
This guide shows you how to measure calibration error, fix it with post-hoc calibration, and verify the fix works.
The Problem: Why Raw Probabilities Are Wrong
XGBoost optimizes log-loss during training, which encourages the model to be directionally correct but doesn't guarantee the predicted probabilities match real-world frequencies. This is especially true when:
- Your dataset is imbalanced (more of one class than another)
- You use aggressive hyperparameters (deep trees, high learning rate)
- You apply the model under domain shift (trained on 2024 data, predicting on 2026 data)
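To see the effect concretely, here is a minimal sketch on a synthetic imbalanced dataset. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same boosting family, no extra dependency); the dataset, class weights, and hyperparameters are all made up for illustration:

```python
# Sketch: deep boosted trees on imbalanced synthetic data tend to produce
# probabilities whose average disagrees with the actual positive rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# 90/10 class imbalance, purely synthetic
X, y = make_classification(
    n_samples=4000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Aggressive settings (deep trees, many rounds) exaggerate the effect
clf = GradientBoostingClassifier(max_depth=6, n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Compare average predicted probability to the actual positive rate
print(f"mean predicted: {probs.mean():.3f}  actual rate: {y_te.mean():.3f}")
```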
Step 1: Measure Calibration Error (ECE)
Expected Calibration Error (ECE) bins predictions into buckets (e.g., 0.7–0.8) and compares the average predicted probability to the actual win rate in each bucket:
```python
import numpy as np
from sklearn.calibration import calibration_curve

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Compute ECE: weighted average of |predicted - actual| per bin."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    total = len(y_true)
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Close the last bin on the right so predictions of exactly 1.0 are counted
        if i == n_bins - 1:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        avg_pred = y_prob[mask].mean()
        avg_true = y_true[mask].mean()
        ece += abs(avg_pred - avg_true) * mask.sum() / total
    return ece

# Example: measure your model's calibration
# (assumes X_train, y_train, X_test, y_test splits already exist)
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:, 1]
ece = expected_calibration_error(y_test, y_prob)
print(f"ECE before calibration: {ece:.4f}")
# Typical output: 0.04 - 0.15 (4-15 percentage points off)
```
Step 2: Apply Isotonic Regression
Isotonic regression learns a monotonic mapping from raw probabilities to calibrated ones. It's non-parametric (no assumptions about the shape of the miscalibration) and works better than Platt scaling for XGBoost:
```python
from sklearn.isotonic import IsotonicRegression

# IMPORTANT: fit on a held-out calibration set (X_cal, y_cal), NOT the training data
y_cal_prob = model.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(y_cal_prob, y_cal)

# Now calibrate test predictions
y_prob_raw = model.predict_proba(X_test)[:, 1]
y_prob_calibrated = iso.predict(y_prob_raw)

ece_after = expected_calibration_error(y_test, y_prob_calibrated)
print(f"ECE after isotonic: {ece_after:.4f}")
# Typical output: 0.005 - 0.02 (often a ~10x improvement)
```
Step 3: Verify with a Reliability Diagram
```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
for ax, probs, title in [
    (ax1, y_prob_raw, "Before Calibration"),
    (ax2, y_prob_calibrated, "After Isotonic"),
]:
    fraction_pos, mean_pred = calibration_curve(
        y_test, probs, n_bins=10, strategy='uniform'
    )
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
    ax.plot(mean_pred, fraction_pos, 'o-', label='Model')
    ax.set_xlabel('Predicted probability')
    ax.set_ylabel('Actual frequency')
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.savefig('calibration_comparison.png', dpi=150)
plt.show()
```
Isotonic vs Platt Scaling
Two common calibration methods:
- Platt scaling: Fits a logistic regression on raw probabilities. Assumes sigmoid-shaped miscalibration. Works well for SVMs, less well for tree-based models.
- Isotonic regression: Non-parametric, learns any monotonic mapping. Better for XGBoost/LightGBM where miscalibration is often non-sigmoid. Needs more calibration data (~500+ samples).
For gradient-boosted trees, isotonic almost always wins.
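A small synthetic sketch of the difference. The miscalibration shape here (true rate proportional to the square root of the raw score) is made up to be deliberately non-sigmoid: Platt is forced into a sigmoid, while isotonic can bend to fit any monotonic shape:

```python
# Sketch: Platt scaling vs isotonic regression on the same held-out scores.
# Raw scores and outcomes are synthetic; the sqrt miscalibration is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw = rng.uniform(0, 1, 2000)
# True positive rate at raw score r is ~0.9 * sqrt(r): monotonic, non-sigmoid
y_cal = (rng.uniform(0, 1, 2000) < 0.9 * np.sqrt(raw)).astype(int)

# Platt scaling: logistic regression on the log-odds of the raw score
clipped = np.clip(raw, 1e-6, 1 - 1e-6)
logit = np.log(clipped / (1 - clipped))
platt = LogisticRegression().fit(logit.reshape(-1, 1), y_cal)

# Isotonic: non-parametric monotonic fit on the raw scores directly
iso = IsotonicRegression(out_of_bounds='clip').fit(raw, y_cal)

grid = np.linspace(0.05, 0.95, 5)
grid_logit = np.log(grid / (1 - grid))
print("platt   :", platt.predict_proba(grid_logit.reshape(-1, 1))[:, 1].round(3))
print("isotonic:", iso.predict(grid).round(3))
```

Comparing either row against the true rate 0.9 * sqrt(grid) shows how the sigmoid constraint limits Platt on this shape.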
Live Rolling Recalibration
In production, your model's calibration drifts over time as the data distribution changes (new seasons, meta shifts, etc.). A rolling recalibrator fixes this without retraining:
```python
class LiveRecalibrator:
    """Rolling isotonic recalibration on the last N resolved predictions."""

    def __init__(self, buffer_size=500, min_samples=50, refit_every=25):
        self.buffer_size = buffer_size
        self.min_samples = min_samples
        self.refit_every = refit_every
        self.preds = []
        self.actuals = []
        self.n_seen = 0
        self.iso = None

    def record(self, predicted_prob, actual_outcome):
        self.preds.append(predicted_prob)
        self.actuals.append(actual_outcome)
        self.n_seen += 1
        # Keep rolling window
        if len(self.preds) > self.buffer_size:
            self.preds = self.preds[-self.buffer_size:]
            self.actuals = self.actuals[-self.buffer_size:]
        # Refit periodically, counting total observations rather than
        # buffer length (which stops changing once the buffer is full)
        if len(self.preds) >= self.min_samples and self.n_seen % self.refit_every == 0:
            self.iso = IsotonicRegression(out_of_bounds='clip')
            self.iso.fit(self.preds, self.actuals)

    def adjust(self, raw_prob):
        if self.iso is None:
            return raw_prob
        return float(self.iso.predict([raw_prob])[0])
```
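Here is a self-contained simulation of the same rolling pattern, implemented inline with IsotonicRegression directly. The drift factor (a model whose "0.8" now means ~0.64 in reality), window size, and refit cadence are all made up for illustration:

```python
# Sketch: rolling isotonic recalibration correcting a drifted model's scores.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
preds, actuals = [], []
iso = None
for step in range(1, 601):
    raw = rng.uniform(0.05, 0.95)
    # Simulated drift: true positive rate is only 0.8x the raw score
    outcome = int(rng.uniform() < raw * 0.8)
    preds.append(raw)
    actuals.append(outcome)
    preds, actuals = preds[-500:], actuals[-500:]  # rolling window
    # Refit every 25 resolved predictions once 50 have accumulated
    if len(preds) >= 50 and step % 25 == 0:
        iso = IsotonicRegression(out_of_bounds='clip').fit(preds, actuals)

# A raw 0.8 should map well below 0.8, toward the drifted true rate
print(f"raw 0.8 -> recalibrated {iso.predict([0.8])[0]:.2f}")
```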
This pattern is used in production by sports prediction platforms like ZenHodl, which serves calibrated probabilities across 10 sports with ECE consistently under 0.01.
Common Pitfalls
- Don't calibrate on training data — use a held-out calibration split (typically 20% of your data, separate from both train and test)
- Don't calibrate before feature selection — calibrate as the very last step
- Watch for sample size — isotonic regression needs ~500+ samples. With fewer, use Platt scaling
- Re-calibrate periodically — calibration degrades as the world changes. Use rolling recalibration in production
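For the first pitfall, one way to produce the three splits the snippets above assume (the 60/20/20 ratio and variable names are illustrative, not prescriptive):

```python
# Sketch: three-way split into train / calibration / test sets.
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration
X = np.random.default_rng(0).normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)

# First carve off 40%, then split that half-and-half into cal and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)
print(len(X_train), len(X_cal), len(X_test))  # 600 200 200
```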