How to Build an MLB Prediction Model in Python (ELO + Starting Pitcher Features)
MLB prediction is different from every other major sport. Games are played nearly every day, the season is 162 games long, home-field advantage is tiny, and the single biggest game-by-game variable is who's pitching — not how good the team is overall. A correctly tuned MLB prediction model has to reflect all four of those differences or it will produce numbers that look right on paper and lose money every time.
This guide walks through a production-grade MLB win-probability pipeline in Python — ELO tuned for baseball's 162-game season, starting pitcher features (ERA, WHIP, K/9 differentials), pre-game probability calibration, and a playoff backtest harness that runs against any past postseason bracket.
What You'll Build
- A Python data pipeline that pulls ESPN MLB game data for any regular season
- An MLB-tuned ELO system (K=4, HFA=24 — much smaller than football or basketball)
- A starting pitcher feature pipeline that joins ERA, WHIP, and K/9 differentials for the scheduled starters
- A pre-game win-probability function that combines ELO and SP features
- A postseason backtest harness with no look-ahead bias
- An ECE (Expected Calibration Error) measurement so you know whether your probabilities are honest
Python 3.11+. Dependencies: pandas, numpy, xgboost, scikit-learn, requests.
Step 1: Pull ESPN MLB Data
```python
import requests
import pandas as pd
from datetime import date, timedelta

def fetch_mlb_games(date_yyyymmdd: str) -> list[dict]:
    """Pull every MLB game for a given date with final scores."""
    url = "https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard"
    r = requests.get(url, params={"dates": date_yyyymmdd, "limit": 30}, timeout=20)
    r.raise_for_status()
    out = []
    for ev in r.json().get("events", []):
        comp = ev["competitions"][0]
        home = next(t for t in comp["competitors"] if t["homeAway"] == "home")
        away = next(t for t in comp["competitors"] if t["homeAway"] == "away")
        if not (home.get("winner") or away.get("winner")):
            continue  # game not final yet
        # Pull starting pitcher info from the probables list
        home_sp = _extract_starting_pitcher(home)
        away_sp = _extract_starting_pitcher(away)
        out.append({
            "game_id": int(ev["id"]),
            "date": date_yyyymmdd,
            "home_team": home["team"]["abbreviation"],
            "away_team": away["team"]["abbreviation"],
            "home_score": int(home.get("score", 0) or 0),
            "away_score": int(away.get("score", 0) or 0),
            "home_won": int(home.get("winner", False)),
            "home_sp": home_sp,
            "away_sp": away_sp,
            "season_type": ev.get("season", {}).get("type"),
        })
    return out

def _extract_starting_pitcher(team_comp: dict) -> str | None:
    """ESPN embeds the starting pitcher in each competitor's probables list."""
    for prob in team_comp.get("probables", []):
        if prob.get("playerClass") == "pitcher":
            return prob.get("athlete", {}).get("fullName")
    return None
```
Then iterate over the full season from March opening day through end of September:
```python
def fetch_season(start: str, end: str) -> pd.DataFrame:
    d0, d1 = date.fromisoformat(start), date.fromisoformat(end)
    rows = []
    cur = d0
    while cur <= d1:
        try:
            rows.extend(fetch_mlb_games(cur.strftime("%Y%m%d")))
        except Exception as e:
            print(f"skip {cur}: {e}")
        cur += timedelta(days=1)
    return pd.DataFrame(rows)

season_2025 = fetch_season("2025-03-27", "2025-09-30")
```
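If you want to keep the regular season and the playoffs separate, you can filter on the `season_type` field captured above. A small sketch on a hypothetical frame; ESPN commonly uses type 2 for the regular season and 3 for the postseason, which is worth verifying against your own pull:

```python
import pandas as pd

# Hypothetical rows mimicking the fetch_season() output.
games = pd.DataFrame([
    {"game_id": 1, "home_team": "LAD", "season_type": 2},
    {"game_id": 2, "home_team": "NYY", "season_type": 2},
    {"game_id": 3, "home_team": "LAD", "season_type": 3},
])

# Split regular season from postseason (2 = regular, 3 = postseason on ESPN).
regular = games[games["season_type"] == 2].reset_index(drop=True)
playoffs = games[games["season_type"] == 3].reset_index(drop=True)
print(len(regular), len(playoffs))  # 2 1
```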
Step 2: MLB-Tuned ELO (K=4, HFA=24)
MLB has a much longer season than any other major sport (162 games vs. 17 for NFL, 82 for NBA), which means a smaller K is correct — you want ELO to move slowly because you have so much data. Home-field advantage is also much smaller in baseball: historically about 54% home win rate, which works out to a ~24-point ELO bonus.
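The ~24-point figure is easy to sanity-check by running it through the standard ELO expectation formula:

```python
def elo_to_wp(elo_diff: float) -> float:
    # Standard logistic ELO expectation: win probability for the favored side.
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

# 24 points of home bonus between two otherwise even 1500-rated teams:
print(round(elo_to_wp(24.0), 3))  # 0.534
```

A 24-point bonus works out to roughly a 53.4% home win probability, in line with the historical rate.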
```python
import numpy as np

K = 4.0     # smaller than football/basketball — 162-game season
HFA = 24.0  # MLB home-field advantage, in ELO points

def compute_mlb_elo(games: pd.DataFrame) -> tuple[dict, dict]:
    """
    Returns (current_ratings, pre_game_elo_diff_by_game_id).
    elo_diff in the second dict is neutral (no HFA applied).
    """
    elo = {}
    game_elo_diff = {}
    games = games.sort_values("game_id").reset_index(drop=True)
    for _, r in games.iterrows():
        h, a = r["home_team"], r["away_team"]
        hs, as_ = r["home_score"], r["away_score"]
        if not h or not a or pd.isna(hs) or pd.isna(as_):
            continue
        he = elo.get(h, 1500.0)
        ae = elo.get(a, 1500.0)
        game_elo_diff[int(r["game_id"])] = he - ae
        expected_h = 1.0 / (1.0 + 10 ** ((ae - he - HFA) / 400.0))
        actual_h = 1.0 if hs > as_ else 0.0  # final MLB games can't tie
        margin = abs(hs - as_)
        # Margin-of-victory multiplier, dampened when ratings are far apart
        mov = np.log(1 + margin) * (2.2 / (abs(he - ae) * 0.001 + 2.2))
        mov = max(1.0, min(mov, 2.5))
        delta = K * mov * (actual_h - expected_h)
        elo[h] = he + delta
        elo[a] = ae - delta
    return elo, game_elo_diff
```
Expected range after a full season: top teams hit 1570-1590, bottom teams sink to 1400-1420. This is a tighter range than NFL/NBA because regression toward the mean is much stronger in a 162-game sport.
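To see what that tighter range implies in practice, plug the extremes into the ELO expectation: the top-rated team hosting the bottom-rated one.

```python
def elo_to_wp(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

HFA = 24.0
# Best team (~1580) hosting the worst (~1420): the biggest mismatch
# the end-of-season rating range allows.
wp = elo_to_wp((1580.0 - 1420.0) + HFA)
print(round(wp, 3))  # 0.743
```

Even the most lopsided matchup the range allows lands around 74%, which is why MLB pre-game probabilities cluster much closer to 50% than in other sports.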
Step 3: Starting Pitcher Features
This is the feature that separates a toy MLB model from a real one. The probable starting pitchers are public ~24 hours before first pitch. Pull their season stats and compute home-away differentials:
```python
def fetch_pitcher_stats(pitcher_name: str, year: int) -> dict:
    """Pull season-to-date ERA, WHIP, K/9 for a pitcher. Cache aggressively."""
    # Replace with your preferred source: MLB Stats API, Fangraphs, etc.
    # Below is a placeholder — you'll want a real cached lookup.
    return {"era": 4.20, "whip": 1.30, "k_per_9": 8.5}  # league-average default

def compute_sp_features(home_sp: str, away_sp: str, year: int) -> dict:
    """Compute pre-game pitcher differentials (home - away)."""
    h = fetch_pitcher_stats(home_sp, year) if home_sp else None
    a = fetch_pitcher_stats(away_sp, year) if away_sp else None
    if h is None or a is None:
        return {"sp_era_diff": 0.0, "sp_whip_diff": 0.0, "sp_k9_diff": 0.0}
    return {
        "sp_era_diff": h["era"] - a["era"],          # negative is good for home
        "sp_whip_diff": h["whip"] - a["whip"],       # negative is good for home
        "sp_k9_diff": h["k_per_9"] - a["k_per_9"],   # positive is good for home
    }
```
Data source note: ESPN's free scoreboard gives you probable pitcher names but not their stats. For stats you'll need to call the MLB Stats API (statsapi.mlb.com), Fangraphs, or scrape Baseball-Reference. Always cache by (pitcher_name, year) to avoid re-fetching.
Step 4: Pre-Game Win Probability
Combine ELO and SP features into a pre-game probability. For the baseline, a logistic combination works well:
```python
def pregame_wp(home_team: str, away_team: str, home_sp: str, away_sp: str,
               elo: dict, year: int) -> float:
    """Pre-game home win probability combining ELO + SP features."""
    he = elo.get(home_team, 1500.0)
    ae = elo.get(away_team, 1500.0)
    elo_diff = he - ae + HFA
    sp = compute_sp_features(home_sp, away_sp, year)
    # SP contribution: a good home starter means negative ERA/WHIP diffs
    # and a positive K/9 diff
    sp_score = -0.15 * sp["sp_era_diff"] - 0.10 * sp["sp_whip_diff"] \
               + 0.03 * sp["sp_k9_diff"]
    # Convert the SP score to ELO-equivalent points (the 60x scale is a
    # tunable heuristic)
    total_elo_diff = elo_diff + 60 * sp_score
    return 1.0 / (1.0 + 10 ** (-total_elo_diff / 400.0))
```
The coefficients above are starting points. With 4,000+ regular-season games in your training set, you can fit them with a logistic regression to get properly tuned weights.
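As a sketch of that fitting step, here is a logistic regression on synthetic features with the same shape as the real training frame. The column scales and the planted coefficients are illustrative assumptions, not measured values; in practice you would pass your actual feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000  # roughly a season and a half of games

# Synthetic stand-in: elo_diff (incl. HFA) plus the three SP differentials.
X = np.column_stack([
    rng.normal(0, 60, n),     # elo_diff
    rng.normal(0, 1.2, n),    # sp_era_diff
    rng.normal(0, 0.15, n),   # sp_whip_diff
    rng.normal(0, 1.5, n),    # sp_k9_diff
])
# Planted signal: ELO edge helps the home team, a worse home ERA hurts it.
logit = 0.004 * X[:, 0] - 0.15 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

lr = LogisticRegression(max_iter=1000).fit(X, y)
print(lr.coef_.round(4))  # learned weight per feature, signs should match
```

The fitted signs (positive on `elo_diff`, negative on `sp_era_diff`) recover the planted structure; with real game data the magnitudes become your tuned weights.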
Step 5: Upgrade to XGBoost With Calibration
Once you have ELO and SP features working as a baseline, swap in XGBoost + isotonic calibration for better discrimination:
```python
from xgboost import XGBClassifier
from sklearn.isotonic import IsotonicRegression

# training_df: one row per game, ELO + SP features joined (Steps 2-3)
FEATURES = ["elo_diff", "sp_era_diff", "sp_whip_diff", "sp_k9_diff"]
X = training_df[FEATURES].values
y = training_df["home_won"].values

# Chronological split — never shuffle MLB data (ratings depend on prior games)
n = len(X)
i_cal, i_test = int(0.6 * n), int(0.8 * n)
X_train, X_cal, X_test = X[:i_cal], X[i_cal:i_test], X[i_test:]
y_train, y_cal, y_test = y[:i_cal], y[i_cal:i_test], y[i_test:]

model = XGBClassifier(
    max_depth=4, learning_rate=0.05, n_estimators=300,
    objective="binary:logistic", eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_cal, y_cal)], verbose=False)

# Calibrate on the held-out calibration slice, never on training data
raw_cal = model.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_cal, y_cal)

def calibrated_predict(X):
    return iso.transform(model.predict_proba(X)[:, 1])
```
Step 6: Measure Expected Calibration Error
```python
def ece(y_pred, y_true, n_bins=10):
    """Expected Calibration Error: bin-weighted gap between confidence and accuracy."""
    bins = np.linspace(0, 1, n_bins + 1)
    total = 0.0
    for i in range(n_bins):
        mask = (y_pred >= bins[i]) & (y_pred < bins[i + 1])
        if i == n_bins - 1:  # last bin is closed on the right
            mask = (y_pred >= bins[i]) & (y_pred <= bins[i + 1])
        if mask.sum() == 0:
            continue
        gap = abs(y_pred[mask].mean() - y_true[mask].mean())
        total += gap * (mask.sum() / len(y_pred))
    return total

print(f"ECE: {ece(calibrated_predict(X_test), y_test):.4f}")
```
Target: under 3% ECE on the full regular season. MLB is an easier calibration problem than NFL or NBA because the sample size is huge (2,400+ games per season).
Step 7: Backtest on a Past Postseason
The real test: freeze ELO at the end of the regular season, then run pre-game predictions on every playoff game.
```python
def backtest_postseason(playoff_games, model, iso, elo_frozen, year):
    preds = []
    for _, g in playoff_games.iterrows():
        he = elo_frozen.get(g["home_team"], 1500.0)
        ae = elo_frozen.get(g["away_team"], 1500.0)
        sp = compute_sp_features(g["home_sp"], g["away_sp"], year)
        features = np.array([[
            he - ae + HFA,
            sp["sp_era_diff"],
            sp["sp_whip_diff"],
            sp["sp_k9_diff"],
        ]])
        # Use the passed-in model and calibrator rather than globals
        p = iso.transform(model.predict_proba(features)[:, 1])[0]
        preds.append({**g.to_dict(), "model_wp": p,
                      "correct": int((p >= 0.5) == g["home_won"])})
    return pd.DataFrame(preds)
```
What we got: Our MLB model on the 2025 postseason (47 games) hit 59.6% overall. It called the NLDS at 8/9 (89%) and the NLCS at 4/4 (100%). Full breakdown and every World Series game prediction is in the public retrospective.
Common Mistakes That Kill an MLB Model
Using football/basketball K values
K=20 is fine for the NBA. For MLB it's catastrophic: your ratings will swing wildly from game to game. Use K=4 and let the long season do the smoothing work.
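One upset is enough to see the difference. Using the plain ELO update (no margin-of-victory multiplier, for simplicity):

```python
def elo_update(k: float, winner_elo: float, loser_elo: float) -> float:
    """Rating points transferred to the winner after a single game."""
    expected_w = 1.0 / (1.0 + 10 ** ((loser_elo - winner_elo) / 400.0))
    return k * (1.0 - expected_w)

# A 1450 underdog beats a 1550 favorite:
print(round(elo_update(4.0, 1450, 1550), 2))   # 2.56  (MLB-tuned K)
print(round(elo_update(20.0, 1450, 1550), 2))  # 12.8  (basketball-scale K)
```

One underdog win moves a rating about 2.6 points at K=4 but nearly 13 at K=20; across 162 games the latter never settles.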
Over-weighting home-field advantage
HFA in baseball is tiny: home teams win roughly 53-54% of games, a far smaller edge than in basketball or football. If home-field advantage alone is putting your home teams at 58-62% pre-game, the model is over-weighting it.
Using regular-season ERA for playoff games
Playoff rotations are compressed. The team's ace pitches Game 1 and Game 4. Their #5 starter never pitches. If you use season-average ERA for playoff predictions, you'll underweight the ace effect. Use game-specific SP matchups.
Ignoring the starting pitcher entirely
An ELO-only MLB model is missing 30-40% of the signal. The starting pitcher is a real, measurable, game-by-game variable.
Calibrating on training data
The same rule as every other sport: hold out a calibration set. Never fit the isotonic regressor on the same data as the base XGBoost.
Want to skip building this yourself?
ZenHodl's API gives you pre-built, pre-calibrated MLB win probabilities with SP-adjusted features, plus historical snapshots for backtesting. 7-day free trial, no credit card.
Get API access →
Summary
A production-grade MLB prediction model is baseball-tuned ELO + starting-pitcher differentials + XGBoost + isotonic calibration. MLB is the highest-variance playoff sport because short series amplify randomness — but that same variance is why calibration matters more here than anywhere else. When your model says 55%, a 162-game dataset makes sure it really means 55%.
Next tutorial: how to build the NBA Finals equivalent — same calibration discipline, different sport, different tuning constants.
Related Reading
- Our World Series 2025 Retrospective — 59.6% across 47 postseason games, including 100% on NLCS.
- Build an NHL Stanley Cup prediction model — sibling sport with seasonal depth + best-of-seven playoff bracket.
- Build an NBA Finals prediction model — pace/ORTG/DRTG features applied to a bracket sport.
- Calibrating XGBoost probabilities with isotonic regression — why raw XGBoost probabilities need post-hoc calibration.
- Sample weights in XGBoost — how to emphasize recent data or correct class imbalance.