How to Build a College Football Prediction Model in Python (With Calibrated Win Probabilities)
Most "how to predict college football" tutorials stop at training a logistic regression on final scores and calling it a day. That's not a prediction model — that's a classifier with a 52% accuracy ceiling. A real college football prediction model has to do three things the shortcut version won't:
- Produce live win probabilities that update every play, not just pregame picks
- Be calibrated — when it says 70%, the team actually wins about 70% of the time
- Hold up against the real market (Pinnacle lines, ESPN WP, Polymarket prices) when you backtest
This guide walks through all three — using Python, XGBoost, and ESPN's publicly available play-by-play data. At the end, you'll have a working CFB win probability model you can actually trust, and the exact code to measure whether your probabilities are calibrated (most people's aren't).
What You'll Build
By the end of this tutorial:
- A Python pipeline that loads ESPN CFB play-by-play for the last 3 seasons
- An XGBoost classifier predicting home team win probability at any game state
- Isotonic calibration to fix the model's systematic over/under-confidence
- An Expected Calibration Error (ECE) measurement that tells you the model's honesty score
- A simple backtest that treats the model as a trading signal against ESPN WP
Why College Football Is Harder Than the NFL
Before we write code, understand what makes CFB prediction a different animal:
- 130+ FBS teams vs. 32 NFL teams — 4× the teams and roughly 16× the possible matchups, with far less training data per pairing
- Massive talent gaps — Georgia vs. UMass is not Chiefs vs. Jaguars. Blowouts break models trained on close games
- Roster turnover every year — transfer portal + NFL draft + recruiting = your "team identity" feature degrades each May
- Coaching staff changes — schemes shift, historical play-pattern features leak
- More variance per play — college QB play swings wider than NFL, so win probability shifts harder on single plays
The good news: these are all solvable. You just need the right features and the right calibration step.
Prerequisites
- Python 3.9+
- pip install pandas numpy xgboost scikit-learn pyarrow requests
- Disk: ~2 GB for 3 seasons of play-by-play data
- RAM: 8 GB+ (XGBoost with 500k+ rows is memory-hungry)
Step 1: Get the Training Data
ESPN publishes play-by-play JSON for every FBS game at predictable URLs. You can pull the last several seasons with a simple scraper, or use cfbfastR if you prefer R. Here's the Python approach:
import json
import requests
import pandas as pd
from pathlib import Path
def fetch_espn_cfb_game(game_id: int) -> dict:
"""Fetch ESPN's play-by-play JSON for a CFB game."""
url = f"https://site.api.espn.com/apis/site/v2/sports/football/college-football/summary?event={game_id}"
r = requests.get(url, timeout=10)
r.raise_for_status()
return r.json()
def extract_plays(game_json: dict) -> list:
"""Flatten the nested plays structure into rows."""
rows = []
    # ESPN's boxscore convention lists the away team first, home second —
    # verify against a sample response before trusting index 1
    home_id = game_json["boxscore"]["teams"][1]["team"]["id"]
for drive in game_json.get("drives", {}).get("previous", []):
for play in drive.get("plays", []):
rows.append({
"game_id": game_json["header"]["id"],
"period": play["period"]["number"],
"clock_seconds": parse_clock(play["clock"]["displayValue"], play["period"]["number"]),
"home_score": play.get("homeScore", 0),
"away_score": play.get("awayScore", 0),
"possession_home": play.get("start", {}).get("team", {}).get("id") == home_id,
"down": play.get("start", {}).get("down"),
"distance": play.get("start", {}).get("distance"),
"yard_line": play.get("start", {}).get("yardLine"),
"espn_wp": play.get("winProbability", {}).get("homeWinPercentage"),
})
return rows
def parse_clock(clock_str: str, period: int) -> int:
    """Convert '14:32' + period to total seconds remaining in regulation."""
    mins, secs = map(int, clock_str.split(":"))
    # 900s for each full quarter still to come, plus the current quarter's clock
    return (4 - period) * 900 + (mins * 60 + secs)
Loop this over a season's game IDs (ESPN's weekly scoreboard endpoint lists them) and save to Parquet. You'll want 3 seasons minimum — 2 to train, 1 to test. Never train and test on the same season.
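That harvesting loop can be sketched as follows. The scoreboard endpoint and its query parameters (dates, seasontype, week) are assumptions based on ESPN's public site API — check a sample response before relying on them. fetch_espn_cfb_game and extract_plays are the functions defined above; week_game_ids and harvest_season are names introduced here:

```python
import time

import pandas as pd
import requests

# Assumed scoreboard endpoint -- verify against a live response
SCOREBOARD = ("https://site.api.espn.com/apis/site/v2/sports/football/"
              "college-football/scoreboard")

def scoreboard_params(season: int, week: int) -> dict:
    """Query params for one regular-season week (seasontype=2 is assumed)."""
    return {"dates": season, "seasontype": 2, "week": week, "limit": 400}

def week_game_ids(season: int, week: int) -> list:
    """Pull the game IDs for one week from the scoreboard JSON."""
    r = requests.get(SCOREBOARD, params=scoreboard_params(season, week), timeout=10)
    r.raise_for_status()
    return [int(ev["id"]) for ev in r.json().get("events", [])]

def harvest_season(season: int, weeks: range, out_path: str) -> None:
    """Fetch play-by-play for every game in a season, save one Parquet file."""
    rows = []
    for week in weeks:
        for gid in week_game_ids(season, week):
            try:
                rows.extend(extract_plays(fetch_espn_cfb_game(gid)))
            except Exception as exc:  # a few games have no play-by-play
                print(f"skipping {gid}: {exc}")
            time.sleep(0.5)  # be polite to the public API
    pd.DataFrame(rows).to_parquet(out_path, index=False)
```

Run it once per season (e.g. harvest_season(2023, range(1, 16), "pbp_2023.parquet")), then concatenate the Parquet files before feature engineering.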
Step 2: Feature Engineering
Raw play-by-play isn't what you feed a model. You need features that capture game state, not play outcome. The classics:
import numpy as np
import pandas as pd
def build_features(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# Core game-state features
df["score_diff"] = df["home_score"] - df["away_score"]
df["secs_remaining"] = df["clock_seconds"] # regulation only
df["time_fraction"] = 1 - (df["secs_remaining"] / 3600) # 0 = start, 1 = end
# Possession matters a lot when the game is close
df["possession_home_int"] = df["possession_home"].astype(int)
# Down/distance/field position (proxy for drive danger)
df["down"] = df["down"].fillna(0).astype(int)
df["distance"] = df["distance"].fillna(0).clip(upper=40)
df["yards_to_goal"] = (100 - df["yard_line"]).fillna(50)
# Interaction: late-game close-score states are the hardest to predict
df["late_and_close"] = (
(df["time_fraction"] > 0.85) & (df["score_diff"].abs() <= 8)
).astype(int)
# Log-scaled distance from 50/50 score (helps tree splits)
df["abs_score_diff_log"] = np.log1p(df["score_diff"].abs())
return df
Gotcha: don't include ESPN's winProbability.homeWinPercentage as a feature. It's the answer leaking in. Use it only as a market comparison baseline later in the backtest.
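One gap worth closing before training: the model step below needs a per-play home_won label and a season column, and neither comes out of extract_plays. A minimal sketch, assuming plays arrive in chronological order so the last row of each game carries the final score (add_labels is a name introduced here):

```python
import pandas as pd

def add_labels(plays: pd.DataFrame, season: str) -> pd.DataFrame:
    """Attach the season and the final game outcome to every play."""
    plays = plays.copy()
    plays["season"] = season
    # Assumes rows are in game order: the last play holds the final score
    finals = plays.groupby("game_id").tail(1)
    won = (finals["home_score"] > finals["away_score"]).astype(int)
    plays["home_won"] = plays["game_id"].map(dict(zip(finals["game_id"], won)))
    return plays

# Toy check: game 1 ends 21-14 (home win), game 2 ends 10-17 (home loss)
toy = pd.DataFrame({
    "game_id":    [1, 1, 2, 2],
    "home_score": [0, 21, 0, 10],
    "away_score": [0, 14, 0, 17],
})
labeled = add_labels(toy, "2023")
print(labeled["home_won"].tolist())  # [1, 1, 0, 0]
```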
Step 3: Train the XGBoost Model
A gradient-boosted tree handles non-linear interactions (time × score × possession) far better than logistic regression. Keep it simple with ~500 trees and light depth:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
FEATURES = [
"score_diff", "secs_remaining", "time_fraction",
"possession_home_int", "down", "distance", "yards_to_goal",
"late_and_close", "abs_score_diff_log", "period",
]
# Time-split: train on 2023+2024, test on 2025 — never mix seasons
train = df[df["season"].isin(["2023", "2024"])]
test = df[df["season"] == "2025"]
X_train, y_train = train[FEATURES], train["home_won"]
X_test, y_test = test[FEATURES], test["home_won"]
model = XGBClassifier(
n_estimators=500,
max_depth=5,
learning_rate=0.05,
subsample=0.85,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
eval_metric="logloss",
early_stopping_rounds=25,
random_state=42,
)
# Note: early-stopping against the test set leaks its outcomes into the
# choice of n_estimators; for real work, hold out a validation slice from train
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False,
)
from sklearn.metrics import log_loss

raw_probs = model.predict_proba(X_test)[:, 1]
print(f"Log loss on test: {log_loss(y_test, raw_probs):.4f}")
You'll get log-loss somewhere in the 0.40 – 0.48 range. That's respectable but the probabilities are still not calibrated. The model is confident where it shouldn't be.
Step 4: Calibrate With Isotonic Regression
The single most under-appreciated step in sports prediction. XGBoost gives you "probabilities" that are actually scores — they rank cases correctly but the absolute numbers are wrong. Isotonic regression maps the raw score to the actual observed win rate at each bucket.
from sklearn.isotonic import IsotonicRegression
# Fit calibrator on a held-out slice of training data (not test!)
# train_test_split returns (big, small): fit on the 80% slice,
# calibrate on the 20% held-out slice
X_train_fit, X_train_cal, y_train_fit, y_train_cal = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
model.fit(
    X_train_fit, y_train_fit,
    eval_set=[(X_train_cal, y_train_cal)],  # early stopping needs an eval set
    verbose=False,
)
# Calibrate on the held-out slice
cal_probs_raw = model.predict_proba(X_train_cal)[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(cal_probs_raw, y_train_cal)
# Now transform test set predictions
test_raw = model.predict_proba(X_test)[:, 1]
test_prob = calibrator.transform(test_raw)
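Before reducing calibration to one number in the next step, it helps to look at the whole reliability curve. sklearn's calibration_curve bins predictions and returns observed vs. predicted rates per bin; the synthetic data below is just an illustration of what an overconfident model looks like:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)

# Simulate outcomes from true probabilities, then exaggerate the predictions
true_p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < true_p).astype(int)
overconfident = np.clip(true_p + 0.3 * (true_p - 0.5), 0.0, 1.0)

# prob_true = observed win rate per bin, prob_pred = mean prediction per bin
prob_true, prob_pred = calibration_curve(y, overconfident, n_bins=10)
for pred, obs in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}")
```

On a calibrated model the two columns track each other; an overconfident one prints observed rates that lag the predictions toward both extremes.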
Step 5: Measure Expected Calibration Error (ECE)
ECE tells you, in one number, how honest your probabilities are. Bin your predictions, compare the bin's average predicted probability to its actual win rate, and weight by bin size. Lower is better. Below 2% is excellent; above 5% means you still need work.
import numpy as np
def expected_calibration_error(y_true, y_prob, n_bins=20):
    bins = np.linspace(0, 1, n_bins + 1)
    bins[-1] = 1.0 + 1e-9  # right-open bins would otherwise drop prob == 1.0
    ece = 0.0
n = len(y_prob)
for i in range(n_bins):
mask = (y_prob >= bins[i]) & (y_prob < bins[i + 1])
if mask.sum() == 0:
continue
bin_prob = y_prob[mask].mean()
bin_true = y_true[mask].mean()
bin_weight = mask.sum() / n
ece += bin_weight * abs(bin_prob - bin_true)
return ece
print(f"Raw ECE: {expected_calibration_error(y_test.values, test_raw):.4f}")
print(f"Calibrated ECE: {expected_calibration_error(y_test.values, test_prob):.4f}")
Typical results: raw ECE lands around 4-6%, isotonic drops it to 1-2%. If you don't see that reduction, your calibrator is probably being fit on the wrong data (i.e., you're leaking between calibration and test folds).
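ECE isn't the only honesty metric worth tracking. The Brier score (mean squared error between probability and outcome) rewards calibration and sharpness together, and it has a useful fixed reference point; a quick sketch:

```python
import numpy as np

def brier_score(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return float(np.mean((y_prob - y_true) ** 2))

y = np.array([1, 0, 1, 1, 0])

# A constant 50% predictor always scores exactly 0.25 ...
print(brier_score(y, np.full(5, 0.5)))                      # 0.25
# ... so any useful model must come in well under that
print(brier_score(y, np.array([0.9, 0.1, 0.8, 0.7, 0.2])))  # ~0.038 here
```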
Step 6: Backtest Against ESPN's Win Probability
Here's where most tutorials stop. Don't. A model with good log-loss can still lose money against the real market. Simulate it:
MIN_EDGE = 0.05 # require 5 percentage points of disagreement
FEE = 0.02 # 2% taker fee (Polymarket-realistic)
SLIPPAGE = 0.01 # 1% slippage
test_df = X_test.copy()
test_df["y_true"] = y_test.values
test_df["model_prob"] = test_prob
test_df["market_prob"] = test["espn_wp"].values  # ESPN WP column captured in Step 1
# Signal: take a position when our model disagrees with ESPN
test_df["edge"] = test_df["model_prob"] - test_df["market_prob"]
buys = test_df[test_df["edge"] >= MIN_EDGE] # model says home is undervalued
sells = test_df[test_df["edge"] <= -MIN_EDGE] # model says home is overvalued
# Simulate "buy at market, settle at true outcome"
buy_pnl = (buys["y_true"] - buys["market_prob"]) - FEE - SLIPPAGE
sell_pnl = ((1 - sells["y_true"]) - (1 - sells["market_prob"])) - FEE - SLIPPAGE
total_pnl = buy_pnl.sum() + sell_pnl.sum()
n_trades = len(buys) + len(sells)
print(f"Trades: {n_trades} | Total PnL: {total_pnl:.2f} | Per-trade: {total_pnl/n_trades:.4f}")
A healthy CFB model clears 0.5–1.5 cents per trade net at a 5-point minimum edge. Less than that and the fees eat your signal.
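A per-trade average on its own can be luck. Before trusting it, bootstrap a confidence interval over the individual trade PnLs; if the interval straddles zero, you don't have a signal yet. A sketch on hypothetical PnL data (the loc/scale values are made up for illustration):

```python
import numpy as np

def bootstrap_pnl_ci(pnl: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """95% bootstrap confidence interval for the mean per-trade PnL."""
    rng = np.random.default_rng(seed)
    # Resample trades with replacement, n_boot times
    idx = rng.integers(0, len(pnl), size=(n_boot, len(pnl)))
    means = pnl[idx].mean(axis=1)
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))

# Hypothetical backtest: 200 trades averaging ~1 cent with wide variance
rng = np.random.default_rng(7)
pnl = rng.normal(loc=0.01, scale=0.15, size=200)
lo, hi = bootstrap_pnl_ci(pnl)
print(f"mean {pnl.mean():+.4f}, 95% CI [{lo:+.4f}, {hi:+.4f}]")
```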
The Gotchas That Will Bite You
- Look-ahead bias: your features cannot include any column that was populated after the play happened. Final score = leak. Drive result = leak.
- Game-level splits: don't shuffle rows across games in your train/test split. All plays from one game must be in the same fold, or you leak the outcome through adjacent plays.
- Class imbalance: if your training set is 55% home wins, the model will lean home. Use scale_pos_weight or subsampling.
- Calibration on test set: fitting the calibrator on test data is cheating. Always use a held-out calibration fold from the training set.
- Early season: weeks 1-3 are the noisiest period of the season. Every model underperforms there. Either down-weight them or cut them from your training data.
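The game-level split gotcha above is easy to enforce with sklearn's GroupShuffleSplit, using game_id as the group key so no game's plays straddle the train/test boundary. A toy sketch:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: six plays spread across three games
game_ids = np.array([101, 101, 102, 102, 103, 103])
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 0, 0, 1, 1])

# Split by game, not by row: whole games land on one side or the other
gss = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=game_ids))

assert set(game_ids[train_idx]).isdisjoint(set(game_ids[test_idx]))
print("train games:", sorted(set(game_ids[train_idx])))
print("test games: ", sorted(set(game_ids[test_idx])))
```

In production you'd pass the real game_id column as groups; combined with the season-level split from Step 3, this closes both leak paths.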
The Shortcut: Use a Pre-Built Calibrated API
Doing all of this yourself is a weekend of work if everything goes right, or a month if it doesn't. If you want the output without writing the pipeline, there's a calibrated probability API that runs this exact stack — XGBoost ensembles, isotonic calibration, published ECE under 2% — across 10+ sports including CFB.
Every call returns fair_prob (the calibrated home win probability), market_prob (current Polymarket price), and edge (the difference). You skip the data pipeline and get straight to backtesting or live trading:
import requests
resp = requests.get(
"https://zenhodl.net/v1/edges",
headers={"X-API-Key": "your_key"},
params={"sport": "CFB", "min_edge": 5},
)
for edge in resp.json():
print(f"{edge['team']} vs {edge['opponent']}: "
f"model={edge['fair_prob']:.2f} market={edge['market_prob']:.2f} "
f"edge={edge['edge_c']}c")
Skip the training pipeline. Get the signals.
Every prediction is calibrated, live, and measured. Public track record, ECE published per sport, full API access.
Bottom Line
A college football prediction model isn't hard to build. A college football prediction model that you can trust with money requires the three steps almost nobody does: isotonic calibration, ECE measurement, and honest market backtesting. Do those, and you'll have a model that beats the casual picks sites. Skip them, and you'll have one more overconfident classifier.
Now you know the math. Go build it before the season kicks off in August.
Related Reading
- Our CFB Model's Honest Retrospective — the debugging story of finding the one feature that made the model profitable.
- Build a Super Bowl prediction model — NFL-tuned ELO + Monte Carlo bracket simulator.
- Build a March Madness prediction model — the college basketball tournament equivalent.
- Best College Basketball Prediction Sites 2026 — competitor buyer's guide (KenPom, Torvik, 538).
- Calibrating XGBoost probabilities with isotonic regression — the calibration step in depth.
- Feature engineering for sports win probability — the 15 features that matter.