How to Build a College Football Prediction Model in Python (With Calibrated Win Probabilities)
Most "how to predict college football" tutorials stop at training a logistic regression on final scores and calling it a day. That's not a prediction model — that's a classifier with a 52% accuracy ceiling. A real college football prediction model has to do three things the shortcut version won't:
- Produce live win probabilities that update every play, not just pregame picks
- Be calibrated — when it says 70%, the team actually wins about 70% of the time
- Hold up against the real market (Pinnacle lines, ESPN WP, Polymarket prices) when you backtest
This guide walks through all three — using Python, XGBoost, and ESPN's publicly available play-by-play data. At the end, you'll have a working CFB win probability model you can actually trust, and the exact code to measure whether your probabilities are calibrated (most people's aren't).
What You'll Build
By the end of this tutorial:
- A Python pipeline that loads ESPN CFB play-by-play for the last 3 seasons
- An XGBoost classifier predicting home team win probability at any game state
- Isotonic calibration to fix the model's systematic over/under-confidence
- An Expected Calibration Error (ECE) measurement that tells you the model's honesty score
- A simple backtest that treats the model as a trading signal against ESPN WP
Why College Football Is Harder Than the NFL
Before we write code, understand what makes CFB prediction a different animal:
- 130+ FBS teams vs. 32 NFL teams — 4× the teams and roughly 16× the possible matchups, with far less training data per pairing
- Massive talent gaps — Georgia vs. UMass is not Chiefs vs. Jaguars. Blowouts break models trained on close games
- Roster turnover every year — transfer portal + NFL draft + recruiting = your "team identity" feature degrades each May
- Coaching staff changes — schemes shift, historical play-pattern features leak
- More variance per play — college QB play swings wider than NFL, so win probability shifts harder on single plays
The good news: these are all solvable. You just need the right features and the right calibration step.
Prerequisites
- Python 3.9+
- pip install pandas numpy xgboost scikit-learn pyarrow requests
- Disk: ~2 GB for 3 seasons of play-by-play data
- RAM: 8 GB+ (XGBoost with 500k+ rows is memory-hungry)
Step 1: Get the Training Data
ESPN publishes play-by-play JSON for every FBS game at predictable URLs. You can pull the last several seasons with a simple scraper, or use cfbfastR if you prefer R. Here's the Python approach:
import json
import requests
import pandas as pd
from pathlib import Path
def fetch_espn_cfb_game(game_id: int) -> dict:
"""Fetch ESPN's play-by-play JSON for a CFB game."""
url = f"https://site.api.espn.com/apis/site/v2/sports/football/college-football/summary?event={game_id}"
r = requests.get(url, timeout=10)
r.raise_for_status()
return r.json()
def extract_plays(game_json: dict) -> list:
"""Flatten the nested plays structure into rows."""
rows = []
    # ESPN's boxscore convention lists the away team first, home second —
    # verify against a sample response before trusting index 1
    home_id = game_json["boxscore"]["teams"][1]["team"]["id"]
for drive in game_json.get("drives", {}).get("previous", []):
for play in drive.get("plays", []):
rows.append({
"game_id": game_json["header"]["id"],
"period": play["period"]["number"],
"clock_seconds": parse_clock(play["clock"]["displayValue"], play["period"]["number"]),
"home_score": play.get("homeScore", 0),
"away_score": play.get("awayScore", 0),
"possession_home": play.get("start", {}).get("team", {}).get("id") == home_id,
"down": play.get("start", {}).get("down"),
"distance": play.get("start", {}).get("distance"),
"yard_line": play.get("start", {}).get("yardLine"),
"espn_wp": play.get("winProbability", {}).get("homeWinPercentage"),
})
return rows
def parse_clock(clock_str: str, period: int) -> int:
    """Convert '14:32' + period to total seconds remaining in regulation."""
    mins, secs = map(int, clock_str.split(":"))
    # 900s for each full quarter still to come, plus the current quarter's clock
    return (4 - period) * 900 + (mins * 60 + secs)
Loop this over a season's game IDs (ESPN's weekly scoreboard endpoint lists them) and save to Parquet. You'll want 3 seasons minimum — 2 to train, 1 to test. Never train and test on the same season.
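That harvesting loop can be sketched as follows. The scoreboard endpoint and its query parameters (dates, seasontype, week) are assumptions based on ESPN's public site API — check a sample response before relying on them. fetch_espn_cfb_game and extract_plays are the functions defined above; week_game_ids and harvest_season are names introduced here:

```python
import time

import pandas as pd
import requests

# Assumed scoreboard endpoint -- verify against a live response
SCOREBOARD = ("https://site.api.espn.com/apis/site/v2/sports/football/"
              "college-football/scoreboard")

def scoreboard_params(season: int, week: int) -> dict:
    """Query params for one regular-season week (seasontype=2 is assumed)."""
    return {"dates": season, "seasontype": 2, "week": week, "limit": 400}

def week_game_ids(season: int, week: int) -> list:
    """Pull the game IDs for one week from the scoreboard JSON."""
    r = requests.get(SCOREBOARD, params=scoreboard_params(season, week), timeout=10)
    r.raise_for_status()
    return [int(ev["id"]) for ev in r.json().get("events", [])]

def harvest_season(season: int, weeks: range, out_path: str) -> None:
    """Fetch play-by-play for every game in a season, save one Parquet file."""
    rows = []
    for week in weeks:
        for gid in week_game_ids(season, week):
            try:
                rows.extend(extract_plays(fetch_espn_cfb_game(gid)))
            except Exception as exc:  # a few games have no play-by-play
                print(f"skipping {gid}: {exc}")
            time.sleep(0.5)  # be polite to the public API
    pd.DataFrame(rows).to_parquet(out_path, index=False)
```

Run it once per season (e.g. harvest_season(2023, range(1, 16), "pbp_2023.parquet")), then concatenate the Parquet files before feature engineering.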
Step 2: Feature Engineering
Raw play-by-play isn't what you feed a model. You need features that capture game state, not play outcome. The classics:
import numpy as np
import pandas as pd
def build_features(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# Core game-state features
df["score_diff"] = df["home_score"] - df["away_score"]
df["secs_remaining"] = df["clock_seconds"] # regulation only
df["time_fraction"] = 1 - (df["secs_remaining"] / 3600) # 0 = start, 1 = end
# Possession matters a lot when the game is close
df["possession_home_int"] = df["possession_home"].astype(int)
# Down/distance/field position (proxy for drive danger)
df["down"] = df["down"].fillna(0).astype(int)
df["distance"] = df["distance"].fillna(0).clip(upper=40)
df["yards_to_goal"] = (100 - df["yard_line"]).fillna(50)
# Interaction: late-game close-score states are the hardest to predict
df["late_and_close"] = (
(df["time_fraction"] > 0.85) & (df["score_diff"].abs() <= 8)
).astype(int)
# Log-scaled distance from 50/50 score (helps tree splits)
df["abs_score_diff_log"] = np.log1p(df["score_diff"].abs())
return df
Gotcha: don't include ESPN's winProbability.homeWinPercentage as a feature. It's the answer leaking in. Use it only as a market comparison baseline later in the backtest.
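One gap worth closing before training: the model step below needs a per-play home_won label and a season column, and neither comes out of extract_plays. A minimal sketch, assuming plays arrive in chronological order so the last row of each game carries the final score (add_labels is a name introduced here):

```python
import pandas as pd

def add_labels(plays: pd.DataFrame, season: str) -> pd.DataFrame:
    """Attach the season and the final game outcome to every play."""
    plays = plays.copy()
    plays["season"] = season
    # Assumes rows are in game order: the last play holds the final score
    finals = plays.groupby("game_id").tail(1)
    won = (finals["home_score"] > finals["away_score"]).astype(int)
    plays["home_won"] = plays["game_id"].map(dict(zip(finals["game_id"], won)))
    return plays

# Toy check: game 1 ends 21-14 (home win), game 2 ends 10-17 (home loss)
toy = pd.DataFrame({
    "game_id":    [1, 1, 2, 2],
    "home_score": [0, 21, 0, 10],
    "away_score": [0, 14, 0, 17],
})
labeled = add_labels(toy, "2023")
print(labeled["home_won"].tolist())  # [1, 1, 0, 0]
```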
Step 3: Train the XGBoost Model
A gradient-boosted tree handles non-linear interactions (time × score × possession) far better than logistic regression. Keep it simple with ~500 trees and light depth:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
FEATURES = [
"score_diff", "secs_remaining", "time_fraction",
"possession_home_int", "down", "distance", "yards_to_goal",
"late_and_close", "abs_score_diff_log", "period",
]
# Time-split: train on 2023+2024, test on 2025 — never mix seasons
train = df[df["season"].isin(["2023", "2024"])]
test = df[df["season"] == "2025"]
X_train, y_train = train[FEATURES], train["home_won"]
X_test, y_test = test[FEATURES], test["home_won"]
model = XGBClassifier(
n_estimators=500,
max_depth=5,
learning_rate=0.05,
subsample=0.85,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
eval_metric="logloss",
early_stopping_rounds=25,
random_state=42,
)
# Note: early-stopping against the test set leaks its outcomes into the
# choice of n_estimators; for real work, hold out a validation slice from train
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)
from sklearn.metrics import log_loss

raw_probs = model.predict_proba(X_test)[:, 1]
print(f"Log loss on test: {log_loss(y_test, raw_probs):.4f}")
You'll get log-loss somewhere in the 0.40 – 0.48 range. That's respectable but the probabilities are still not calibrated. The model is confident where it shouldn't be.
Step 4: Calibrate With Isotonic Regression
The single most under-appreciated step in sports prediction. XGBoost gives you "probabilities" that are actually scores — they rank cases correctly but the absolute numbers are wrong. Isotonic regression maps the raw score to the actual observed win rate at each bucket.
from sklearn.isotonic import IsotonicRegression
# Fit calibrator on a held-out slice of training data (not test!)
# train_test_split returns (big, small): fit on the 80% slice,
# calibrate on the 20% held-out slice
X_train_fit, X_train_cal, y_train_fit, y_train_cal = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
model.fit(
    X_train_fit, y_train_fit,
    eval_set=[(X_train_cal, y_train_cal)],  # early stopping needs an eval set
    verbose=False,
)
# Calibrate on the held-out slice
cal_probs_raw = model.predict_proba(X_train_cal)[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(cal_probs_raw, y_train_cal)
# Now transform test set predictions
test_raw = model.predict_proba(X_test)[:, 1]
test_prob = calibrator.transform(test_raw)
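Before reducing calibration to one number in the next step, it helps to look at the whole reliability curve. sklearn's calibration_curve bins predictions and returns observed vs. predicted rates per bin; the synthetic data below is just an illustration of what an overconfident model looks like:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)

# Simulate outcomes from true probabilities, then exaggerate the predictions
true_p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < true_p).astype(int)
overconfident = np.clip(true_p + 0.3 * (true_p - 0.5), 0.0, 1.0)

# prob_true = observed win rate per bin, prob_pred = mean prediction per bin
prob_true, prob_pred = calibration_curve(y, overconfident, n_bins=10)
for pred, obs in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

On a calibrated model the two columns track each other; an overconfident one prints observed rates that lag the predictions toward both extremes.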
Step 5: Measure Expected Calibration Error (ECE)
ECE tells you, in one number, how honest your probabilities are. Bin your predictions, compare the bin's average predicted probability to its actual win rate, and weight by bin size. Lower is better. Below 2% is excellent; above 5% means you still need work.
import numpy as np
def expected_calibration_error(y_true, y_prob, n_bins=20):
    bins = np.linspace(0, 1, n_bins + 1)
    bins[-1] = 1.0 + 1e-9  # right-open bins would otherwise drop prob == 1.0
    ece = 0.0
n = len(y_prob)
for i in range(n_bins):
mask = (y_prob >= bins[i]) & (y_prob < bins[i + 1])
if mask.sum() == 0:
continue
bin_prob = y_prob[mask].mean()
bin_true = y_true[mask].mean()
bin_weight = mask.sum() / n
ece += bin_weight * abs(bin_prob - bin_true)
return ece
print(f"Raw ECE: {expected_calibration_error(y_test.values, test_raw):.4f}")
print(f"Calibrated ECE: {expected_calibration_error(y_test.values, test_prob):.4f}")
Typical results: raw ECE lands around 4-6%, isotonic drops it to 1-2%. If you don't see that reduction, your calibrator is probably being fit on the wrong data (i.e., you're leaking between calibration and test folds).
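ECE isn't the only honesty metric worth tracking. The Brier score (mean squared error between probability and outcome) rewards calibration and sharpness together, and it has a useful fixed reference point; a quick sketch:

```python
import numpy as np

def brier_score(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return float(np.mean((y_prob - y_true) ** 2))

y = np.array([1, 0, 1, 1, 0])

# A constant 50% predictor always scores exactly 0.25 ...
print(brier_score(y, np.full(5, 0.5)))                      # 0.25
# ... so any useful model must come in well under that
print(brier_score(y, np.array([0.9, 0.1, 0.8, 0.7, 0.2])))  # ~0.038 here
```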
Step 6: Backtest Against ESPN's Win Probability
Here's where most tutorials stop. Don't. A model with good log-loss can still lose money against the real market. Simulate it:
MIN_EDGE = 0.05 # require 5 percentage points of disagreement
FEE = 0.02 # 2% taker fee (Polymarket-realistic)
SLIPPAGE = 0.01 # 1% slippage
test_df = X_test.copy()
test_df["y_true"] = y_test.values
test_df["model_prob"] = test_prob
test_df["market_prob"] = test["espn_wp"].values  # ESPN WP column captured in Step 1
# Signal: take a position when our model disagrees with ESPN
test_df["edge"] = test_df["model_prob"] - test_df["market_prob"]
buys = test_df[test_df["edge"] >= MIN_EDGE] # model says home is undervalued
sells = test_df[test_df["edge"] <= -MIN_EDGE] # model says home is overvalued
# Simulate "buy at market, settle at true outcome"
buy_pnl = (buys["y_true"] - buys["market_prob"]) - FEE - SLIPPAGE
sell_pnl = ((1 - sells["y_true"]) - (1 - sells["market_prob"])) - FEE - SLIPPAGE
total_pnl = buy_pnl.sum() + sell_pnl.sum()
n_trades = len(buys) + len(sells)
print(f"Trades: {n_trades} | Total PnL: {total_pnl:.2f} | Per-trade: {total_pnl/n_trades:.4f}")
A healthy CFB model clears 0.5–1.5 cents per trade net at a 5-point minimum edge. Less than that and the fees eat your signal.
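A per-trade average on its own can be luck. Before trusting it, bootstrap a confidence interval over the individual trade PnLs; if the interval straddles zero, you don't have a signal yet. A sketch on hypothetical PnL data (the loc/scale values are made up for illustration):

```python
import numpy as np

def bootstrap_pnl_ci(pnl: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """95% bootstrap confidence interval for the mean per-trade PnL."""
    rng = np.random.default_rng(seed)
    # Resample trades with replacement, n_boot times
    idx = rng.integers(0, len(pnl), size=(n_boot, len(pnl)))
    means = pnl[idx].mean(axis=1)
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

# Hypothetical backtest: 200 trades averaging ~1 cent with wide variance
rng = np.random.default_rng(7)
pnl = rng.normal(loc=0.01, scale=0.15, size=200)
lo, hi = bootstrap_pnl_ci(pnl)
print(f"mean {pnl.mean():+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
```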
The Gotchas That Will Bite You
- Look-ahead bias: your features cannot include any column that was populated after the play happened. Final score = leak. Drive result = leak.
- Game-level splits: don't shuffle rows across games in your train/test split. All plays from one game must be in the same fold, or you leak the outcome through adjacent plays.
- Class imbalance: if your training set is 55% home wins, the model will lean home. Use scale_pos_weight or subsampling.
- Calibration on test set: fitting the calibrator on test data is cheating. Always use a held-out calibration fold from the training set.
- Early season: weeks 1-3 are the noisiest period of the season. Every model underperforms there. Either down-weight them or cut them from your training data.
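The game-level split gotcha above is easy to enforce with sklearn's GroupShuffleSplit, using game_id as the group key so no game's plays straddle the train/test boundary. A toy sketch:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: six plays spread across three games
game_ids = np.array([101, 101, 102, 102, 103, 103])
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 0, 0, 1, 1])

# Split by game, not by row: whole games land on one side or the other
gss = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=game_ids))

assert set(game_ids[train_idx]).isdisjoint(set(game_ids[test_idx]))
print("train games:", sorted(set(game_ids[train_idx])))
print("test games: ", sorted(set(game_ids[test_idx])))
```

In production you'd pass the real game_id column as groups; combined with the season-level split from Step 3, this closes both leak paths.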
The Shortcut: Use a Pre-Built Calibrated API
Doing all of this yourself is a weekend of work if everything goes right, or a month if it doesn't. If you want the output without writing the pipeline, there's a calibrated probability API that runs this exact stack — XGBoost ensembles, isotonic calibration, published ECE under 2% — across 10+ sports including CFB.
Every call returns fair_prob (the calibrated home win probability), market_prob (current Polymarket price), and edge (the difference). You skip the data pipeline and get straight to backtesting or live trading:
import requests
resp = requests.get(
"https://zenhodl.net/v1/edges",
headers={"X-API-Key": "your_key"},
params={"sport": "CFB", "min_edge": 5},
)
for edge in resp.json():
print(f"{edge['team']} vs {edge['opponent']}: "
f"model={edge['fair_prob']:.2f} market={edge['market_prob']:.2f} "
f"edge={edge['edge_c']}c")
Skip the training pipeline. Get the signals.
Every prediction is calibrated, live, and measured. Public track record, ECE published per sport, full API access.
Bottom Line
A college football prediction model isn't hard to build. A college football prediction model that you can trust with money requires the three steps almost nobody does: isotonic calibration, ECE measurement, and honest market backtesting. Do those, and you'll have a model that beats the casual picks sites. Skip them, and you'll have one more overconfident classifier.
Now you know the math. Go build it before the season kicks off in August.
Related Reading
- Our CFB Model's Honest Retrospective — the debugging story of finding the one feature that made the model profitable.
- Build a Super Bowl prediction model — NFL-tuned ELO + Monte Carlo bracket simulator.
- Build a March Madness prediction model — the college basketball tournament equivalent.
- Best College Basketball Prediction Sites 2026 — competitor buyer's guide (KenPom, Torvik, 538).
- Calibrating XGBoost probabilities with isotonic regression — the calibration step in depth.
- Feature engineering for sports win probability — the 15 features that matter.