How to Build a March Madness Prediction Model in Python (With Calibrated Bracket Probabilities)

April 22, 2026 · 14 min read · Python, XGBoost, ELO, Machine Learning, College Basketball

Most "March Madness prediction model" tutorials reduce to one of two things: a seed-based cheat sheet (which hits ~72% in the first round and collapses after), or a KenPom-mimicking power ranking with no calibration check. Neither teaches you how to build something you'd actually trust with money on the line.

A real March Madness prediction model has to do three things the shortcut versions won't:

  1. Produce calibrated win probabilities for every tournament game — when it says 70%, the team should win about 70% of the time
  2. Account for the neutral-site nature of tournament games, which kills home-court advantage assumptions baked into regular-season models
  3. Hold up to honest out-of-sample backtesting — no peeking at tournament games during training or calibration

This guide walks through all three using Python, XGBoost, and ESPN's publicly available play-by-play data. At the end, you'll have a working NCAAMB win-probability model you can apply to any tournament bracket — including a reproducible backtest on the 2026 tournament (67 games) that we ran ourselves and published the results of.

What You'll Build

By the end of this tutorial you'll have:

  1. A fetcher for ESPN's public NCAAMB scoreboard data
  2. An MoV-adjusted ELO rating system with home-court advantage
  3. Rolling pace and efficiency priors for every team
  4. A calibrated XGBoost win-probability model with a measured ECE
  5. A tournament backtest harness that freezes ELO at the regular-season cutoff

Everything is Python 3.11+. You need pandas, numpy, xgboost, scikit-learn, and requests. That's it.

Step 1: Get ESPN Men's College Basketball Data

ESPN has a public (undocumented) API that exposes scoreboards and play-by-play for every NCAAMB game. No auth. You just hit a URL with a date and parse JSON.

import requests, pandas as pd

def fetch_ncaamb_games(date_yyyymmdd: str) -> list[dict]:
    """Pull every NCAAMB game for a given date with final scores."""
    url = "https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard"
    r = requests.get(url, params={"dates": date_yyyymmdd, "limit": 200}, timeout=20)
    r.raise_for_status()
    out = []
    for ev in r.json().get("events", []):
        comp = ev["competitions"][0]
        home = next(t for t in comp["competitors"] if t["homeAway"] == "home")
        away = next(t for t in comp["competitors"] if t["homeAway"] == "away")
        out.append({
            "game_id": int(ev["id"]),
            "date": date_yyyymmdd,
            "home_team": home["team"]["abbreviation"],
            "away_team": away["team"]["abbreviation"],
            "home_score": int(home.get("score", 0) or 0),
            "away_score": int(away.get("score", 0) or 0),
            "home_won": int(home.get("winner", False)),
            "season_type": ev.get("season", {}).get("type"),  # 3 = postseason
            "notes": [n.get("headline","") for n in comp.get("notes", [])],
        })
    return out

For a full season, iterate over every day from early November to early April:

from datetime import date, timedelta

def fetch_season(start: str, end: str) -> pd.DataFrame:
    """Fetch every NCAAMB game between two ISO dates."""
    d0 = date.fromisoformat(start); d1 = date.fromisoformat(end)
    rows = []
    cur = d0
    while cur <= d1:
        try:
            rows.extend(fetch_ncaamb_games(cur.strftime("%Y%m%d")))
        except Exception as e:
            print(f"Skipping {cur}: {e}")
        cur += timedelta(days=1)
    return pd.DataFrame(rows)

games_2425 = fetch_season("2024-11-04", "2025-04-07")

That's your base dataset: one row per game, with final scores and a season_type flag that tells you which games are regular season (type 2) vs. tournament (type 3).

Heads up: ESPN rate-limits requests to this endpoint. If you're pulling multiple seasons, add time.sleep(0.5) between calls or you'll get intermittent 429s.
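One way to stay under the limit is a small backoff wrapper. This is a sketch: `fetch_with_backoff` and its injectable `get` parameter are illustrative helpers, not part of ESPN's API.

```python
import time
import requests

def fetch_with_backoff(url: str, params: dict, max_retries: int = 4,
                       pause: float = 0.5, get=requests.get):
    """GET with a pause between attempts, doubling the wait on HTTP 429."""
    for attempt in range(max_retries):
        r = get(url, params=params, timeout=20)
        if r.status_code != 429:
            r.raise_for_status()
            return r
        # Rate-limited: back off exponentially (0.5s, 1s, 2s, ...)
        time.sleep(pause * (2 ** attempt))
    r.raise_for_status()
    return r
```

Swap this in for the raw `requests.get` call in `fetch_ncaamb_games` when pulling multiple seasons.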

Step 2: Build an ELO Rating System (With Basketball MoV)

ELO is the single most important feature for a college basketball prediction model. It captures team strength in one number, updates after every game, and — critically — is available pre-game for any future matchup.

The basic ELO update is:

expected = 1 / (1 + 10 ** ((opp_elo - team_elo) / 400))
team_elo_new = team_elo + K * (actual - expected)

Where K is the learning rate (we use 20 for college basketball) and actual is 1 for a win, 0 for a loss. But basketball has two wrinkles pure ELO doesn't handle: home-court advantage and margin of victory.

Home-court advantage (HFA): Home teams win ~60% of regular-season college basketball games. Bake that into the expectation with an ELO bonus (~70 points).
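To sanity-check that number, plug a 70-point bonus into the ELO expectation for two evenly matched 1500-rated teams. This standalone sketch of the same expectation formula recovers roughly the observed 60% home win rate:

```python
def elo_expected(team_elo: float, opp_elo: float, hfa: float = 0.0) -> float:
    """Expected win probability from an ELO gap, with an optional home bonus."""
    return 1.0 / (1.0 + 10 ** ((opp_elo - team_elo - hfa) / 400.0))

even = elo_expected(1500, 1500)            # 0.5: no edge, no home court
home = elo_expected(1500, 1500, hfa=70.0)  # ~0.60: the bonus alone
```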

Margin of victory (MoV): A 40-point win should shift ratings more than a 1-point win. We use the standard basketball formula:

def mov_multiplier(margin: float, elo_diff_winner: float) -> float:
    """Basketball MoV multiplier (FiveThirtyEight style)."""
    raw = ((abs(margin) + 3) ** 0.8) / (7.5 + 0.006 * max(0, elo_diff_winner))
    return max(1.0, min(raw, 2.5))

Full ELO loop with HFA and MoV:

K = 20.0
HFA = 70.0

def compute_elo(games: pd.DataFrame) -> tuple[dict, dict]:
    """Return (current_ratings, pre_game_elo_diff_by_game_id)."""
    elo = {}  # team -> current rating
    game_elo_diff = {}  # game_id -> home_elo - away_elo (pre-game, no HFA)

    games = games.sort_values(["date", "game_id"]).reset_index(drop=True)  # chronological order
    for _, r in games.iterrows():
        h, a = r["home_team"], r["away_team"]
        hs, as_ = r["home_score"], r["away_score"]
        if not h or not a or pd.isna(hs) or pd.isna(as_):
            continue
        he = elo.get(h, 1500.0)
        ae = elo.get(a, 1500.0)
        game_elo_diff[int(r["game_id"])] = he - ae  # neutral diff for features

        expected_h = 1.0 / (1.0 + 10 ** ((ae - he - HFA) / 400.0))
        actual_h = 1.0 if hs > as_ else (0.5 if hs == as_ else 0.0)
        margin = abs(hs - as_)
        # Winner's pre-game edge (signed: negative for upsets, which
        # mov_multiplier's max(0, ...) clamp handles).
        mov = mov_multiplier(margin, (he + HFA - ae) if hs > as_ else (ae - he - HFA))

        delta = K * mov * (actual_h - expected_h)
        elo[h] = he + delta
        elo[a] = ae - delta

    return elo, game_elo_diff

After running this on a full season of regular-season games, your top 10 teams should look reasonable — typically the same teams KenPom has in his top 10, just with different absolute numbers.
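A quick way to eyeball that check (an illustrative helper; assumes `elo` is the ratings dict returned by `compute_elo`):

```python
def top_teams(elo: dict[str, float], n: int = 10) -> list[tuple[str, float]]:
    """Highest-rated teams, best first."""
    return sorted(elo.items(), key=lambda kv: kv[1], reverse=True)[:n]

# for team, rating in top_teams(elo):
#     print(f"{team:6s} {rating:7.1f}")
```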

Step 3: Build Team-Level Priors (Pace, ORTG, DRTG)

ELO alone is a single-dimension team rating. To get matchup-specific predictions (does a fast team beat a slow team?), you need pace and efficiency priors.

From box-score data:

def compute_team_priors(boxscores: pd.DataFrame) -> pd.DataFrame:
    """
    boxscores: one row per team per game with possessions, points_for, points_against.
    Returns rolling-to-date priors: pace (poss/40 min), ORTG (pts/100 poss), DRTG (opp pts/100 poss).
    """
    bs = boxscores.sort_values(["team", "game_id"]).copy()
    bs["poss_40"] = bs["possessions"] * (40.0 / bs["minutes_played"])
    bs["ortg"] = 100.0 * bs["points_for"] / bs["possessions"]
    bs["drtg"] = 100.0 * bs["points_against"] / bs["possessions"]

    # Rolling mean, expanding window, shifted by 1 so we never see the current game
    for col in ["poss_40", "ortg", "drtg"]:
        bs[f"prior_{col}"] = bs.groupby("team")[col].transform(
            lambda s: s.shift(1).expanding().mean()
        )

    return bs[["game_id", "team", "prior_poss_40", "prior_ortg", "prior_drtg"]]

Key detail: shift by 1 before the expanding mean. This is what prevents the feature from "seeing the future" — on game n, the prior is computed from games 1 through n-1, never including the current game.

This is the single most common bug in sports prediction models. If you skip the shift, your training ECE looks amazing and your production ECE is garbage.
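A three-game toy series makes the pattern concrete: the prior entering game 3 averages only games 1 and 2.

```python
import pandas as pd

pts = pd.Series([10.0, 20.0, 30.0])     # points scored in games 1, 2, 3
prior = pts.shift(1).expanding().mean()
# prior: [NaN, 10.0, 15.0]. Game 1 has no history; game 3's prior is
# mean(10, 20). The current game's 30 never leaks in.
```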

Step 4: Train the XGBoost Win Probability Model

The model is trained on in-game snapshots plus pre-game team features; for a pure pre-game prediction, the in-game terms (score differential and its interactions) sit at their tip-off values. The full feature list:

Training:

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

FEATURES = [
    "score_diff", "seconds_remaining", "period", "time_fraction",
    "elo_diff", "is_home", "pregame_wp",
    "score_diff_x_tf", "score_diff_sq",
    "total_score", "score_diff_x_elo",
    "pace_diff", "ortg_diff", "drtg_diff",
]

X = training_df[FEATURES].values
y = training_df["home_won_final"].values

X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.3, shuffle=False  # chronological split
)

model = XGBClassifier(
    max_depth=5,
    learning_rate=0.05,
    n_estimators=500,
    objective="binary:logistic",
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_cal, y_cal)], verbose=False)

Use a chronological split (shuffle=False), not random. Random splits inflate accuracy because the model sees future games during training.

Step 5: Calibrate With Isotonic Regression

XGBoost's predict_proba output is often miscalibrated — especially for extreme probabilities. A model that says "95%" might actually be right only 87% of the time in that bucket.

Isotonic regression fixes this by learning a monotone mapping from raw XGBoost probabilities to empirical frequencies:

from sklearn.isotonic import IsotonicRegression

raw_probs_cal = model.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_probs_cal, y_cal)

def calibrated_predict(X):
    raw = model.predict_proba(X)[:, 1]
    return iso.transform(raw)

Always calibrate on a held-out slice (the X_cal set above) — never on the training data. Calibrating on training data will make your probabilities perfect on training and useless in production.

Step 6: Measure Expected Calibration Error (ECE)

The single most important number for a probabilistic prediction model. ECE is the weighted average gap between predicted probability and empirical frequency, across probability buckets:

import numpy as np

def expected_calibration_error(y_pred: np.ndarray, y_true: np.ndarray, n_bins: int = 10) -> float:
    """Standard ECE with equal-width probability buckets."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        mask = (y_pred >= bins[i]) & (y_pred < bins[i+1])
        if i == n_bins - 1:
            mask = (y_pred >= bins[i]) & (y_pred <= bins[i+1])
        if mask.sum() == 0:
            continue
        gap = abs(y_pred[mask].mean() - y_true[mask].mean())
        weight = mask.sum() / len(y_pred)
        ece += gap * weight
    return ece

test_probs = calibrated_predict(X_test)
ece = expected_calibration_error(test_probs, y_test, n_bins=10)
print(f"ECE: {ece:.4f}")

What's good? For NCAAMB, we target ECE below 3%. Anything above 5% means your probabilities are systematically miscalibrated and you shouldn't trust them for position sizing.
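For intuition on what ECE measures, consider the degenerate case where every prediction lands in one bucket, so the weighted sum collapses to a single gap (a toy example, nothing beyond numpy):

```python
import numpy as np

y_true = np.array([1] * 7 + [0] * 3)   # 7 wins in 10 games: 70% frequency
claimed = np.full(10, 0.9)             # model says 90% every time

# All 10 predictions fall in the [0.9, 1.0] bucket, so ECE reduces to
# one term: |mean predicted - mean actual|.
gap = abs(claimed.mean() - y_true.mean())   # 0.2 (20 points overconfident)
```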

Step 7: Backtest on a Past Tournament

Here's the payoff. Pick a past tournament, freeze the ELO state at the regular-season cutoff, and run the model on every tournament game.

def backtest_tournament(tournament_games: pd.DataFrame, model, iso, elo: dict, priors: pd.DataFrame):
    """Run calibrated predictions on every tournament game with PRE-tournament ELO."""
    preds = []
    for _, g in tournament_games.iterrows():
        he = elo.get(g["home_team"], 1500.0)
        ae = elo.get(g["away_team"], 1500.0)
        elo_diff = he - ae  # neutral-site — no HFA
        pregame_wp = 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))
        hp = priors.loc[g["home_team"]]; ap = priors.loc[g["away_team"]]

        features = np.array([[
            0.0, 2400.0, 1, 1.0,           # tip-off state: tied, full clock
            elo_diff, 0, pregame_wp,       # is_home = 0: neutral site, no HFA
            0.0, 0.0, 0.0, 0.0,
            hp["prior_poss_40"] - ap["prior_poss_40"],
            hp["prior_ortg"] - ap["prior_ortg"],
            hp["prior_drtg"] - ap["prior_drtg"],
        ]])
        raw = model.predict_proba(features)[:, 1]
        cal = iso.transform(raw)[0]
        preds.append({**g.to_dict(), "model_wp": cal, "pred": int(cal >= 0.5)})
    return pd.DataFrame(preds)

Evaluate accuracy, Brier score, and ECE on the tournament predictions. A well-calibrated model should hit 65-72% overall accuracy on a modern NCAA tournament, with tournament ECE in the single digits on a reasonable sample size.
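A minimal scoring helper for that evaluation, assuming `preds` is the DataFrame returned by `backtest_tournament` (the `home_won` column carries through from Step 1; `score_backtest` is an illustrative name):

```python
import numpy as np
import pandas as pd

def score_backtest(preds: pd.DataFrame) -> dict[str, float]:
    """Accuracy and Brier score over a set of tournament predictions."""
    p = preds["model_wp"].to_numpy(dtype=float)
    y = preds["home_won"].to_numpy(dtype=float)
    return {
        "accuracy": float((preds["pred"] == preds["home_won"]).mean()),
        "brier": float(np.mean((p - y) ** 2)),  # lower is better; 0.25 = coin flip
    }
```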

What we got: When we ran this exact pipeline on the 2026 bracket (67 games), our model hit 48/67 = 71.6% — including 100% on the Final Four and National Championship. Full per-round breakdown and every wrong pick is in the public retrospective.

Common Mistakes That Kill a March Madness Model

Not excluding tournament games from ELO training

If your ELO updates include the tournament games you're trying to predict, you've leaked the future. Every tournament game must be run with ELO frozen at the regular-season cutoff.
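In practice that's a filter on the `season_type` flag from Step 1, applied before the ELO pass (a small helper sketch; the function name is illustrative):

```python
import pandas as pd

REGULAR_SEASON, POSTSEASON = 2, 3   # ESPN season_type codes

def split_by_season_type(games: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a games frame into (regular-season, postseason) subsets."""
    reg = games[games["season_type"] == REGULAR_SEASON].copy()
    post = games[games["season_type"] == POSTSEASON].copy()
    return reg, post

# reg, ncaa = split_by_season_type(games_2425)
# elo_frozen, _ = compute_elo(reg)   # tournament games never touch the ratings
```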

Applying home-court advantage to neutral-site games

Regular-season models bake in a ~70-point ELO HFA. Tournament games are at neutral sites. If you don't strip HFA from your pre-game features for tournament predictions, your home team will be overweighted. This systematically biases your picks toward higher seeds.

Calibrating on training data

Fit your isotonic regressor on a held-out calibration set, not the training set. If you don't split cleanly, your training ECE will look perfect and your production ECE will be terrible.

Not measuring ECE at all

Accuracy alone can hide a miscalibrated model. A 71% accurate model whose "80% confident" picks only win 65% of the time is giving you bad Kelly sizing. ECE is the only honest check.
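A worked example of the Kelly point, under simplifying assumptions (a single bet at fair odds for a true 65% side; `kelly_fraction` is an illustrative helper):

```python
def kelly_fraction(p: float, b: float) -> float:
    """Kelly stake fraction for win probability p at net fractional odds b."""
    return max(0.0, (b * p - (1.0 - p)) / b)

b = 0.35 / 0.65                       # break-even net odds for a true 65% team
f_honest = kelly_fraction(0.65, b)    # 0.0 (no edge, no bet)
f_inflated = kelly_fraction(0.80, b)  # ~0.43 of bankroll on a zero-edge spot
```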

Ignoring the Sweet 16 compression

Every serious model — KenPom, 538, Torvik, ours — underperforms its overall accuracy in the Sweet 16 because the talent gap compresses. Don't panic-tune your model when it hits 55-65% in that round. Structural variance is not a bug.

Want to skip building this yourself?

ZenHodl's API gives you pre-built, pre-calibrated NCAAMB win probabilities, plus historical snapshots for backtesting your own bracket strategies. 7-day free trial, no credit card.

Get API access →

Where to Go From Here

The pipeline above gets you to a solid baseline — roughly on par with KenPom for first-round accuracy. From here, any extension is fair game as long as it survives the same checks: chronological splits, held-out calibration, and a frozen-ELO tournament backtest.

Summary

A production-grade March Madness prediction model is ELO + team efficiency priors + calibrated XGBoost + an honest tournament backtest. The ingredients are all publicly available from ESPN. The hard part is the discipline: chronological splits, held-out calibration, and measuring ECE. Get those right and you can build something that competes with KenPom-grade models in a weekend.

Next tutorial: how to build the college-football equivalent — same calibration discipline, different sport, different feature set.

Related Reading