Feature Engineering for Sports Win Probability: 15 Features That Actually Matter
Your sports win-probability model is only as good as the features you feed it. Most tutorials throw "score_diff, seconds_remaining, and team_id" at a gradient booster and call it done. That model will reach roughly 75% accuracy, but the last 10% of performance is where trading edge lives, and that 10% comes from feature engineering.
This post catalogs the 15 features that consistently matter across NBA, NHL, MLB, and esports win-probability models, why each one works, and how to compute each in Python. Every code block is copy-pasteable.
The Core 6: You Can't Skip These
1. Score differential (home - away)
The most predictive feature by a wide margin. Use signed difference (positive = home leading) so the model learns that a +5 NBA lead with 2 minutes left is very different from a -5 deficit.
```python
df['score_diff'] = df['home_score'] - df['away_score']
```
2. Seconds remaining
The model needs to know how much runway is left. Use game clock in seconds, normalized only if your game lengths vary.
```python
# NBA: 48 min = 2880 s. NHL: 60 min = 3600 s. MLB: use outs_remaining instead.
def seconds_remaining(period, clock_mm_ss, sport):
    mins, secs = map(int, clock_mm_ss.split(':'))
    clock_secs = mins * 60 + secs
    if sport == 'NBA':
        return max(0, (4 - period) * 720 + clock_secs)
    elif sport == 'NHL':
        return max(0, (3 - period) * 1200 + clock_secs)
    return clock_secs
```
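For bulk feature generation, the same conversion can be vectorized over a whole DataFrame. This is a sketch that assumes `period` and `clock` ('MM:SS') columns, matching the names used in the pipeline at the end of this post; it does not handle overtime periods.

```python
import pandas as pd

def add_seconds_remaining(df: pd.DataFrame, sport: str) -> pd.DataFrame:
    # Split 'MM:SS' into minutes and seconds, then convert to total seconds
    mmss = df['clock'].str.split(':', expand=True).astype(int)
    clock_secs = mmss[0] * 60 + mmss[1]
    period_len = {'NBA': 720, 'NHL': 1200}[sport]
    n_periods = {'NBA': 4, 'NHL': 3}[sport]
    # Remaining full periods plus the current period's clock, clamped at 0
    df['seconds_remaining'] = ((n_periods - df['period']) * period_len
                               + clock_secs).clip(lower=0)
    return df
```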
3. Time fraction
Seconds-remaining as a ratio of total game length. Makes late-game states directly comparable to early-game states without hard-coding sport-specific thresholds.
```python
TOTAL = {'NBA': 2880, 'NHL': 3600}
df['time_fraction'] = df['seconds_remaining'] / TOTAL[sport]
```
4. Period / Inning
Not redundant with seconds_remaining. A 5-point lead with 2 minutes left on the period clock means something very different in Q2 than in Q4, even though the clock reading is identical; the period tells the model where the structural breaks in the game are. Include as an integer.
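Raw feeds often report the period as a string, including overtime labels. A hypothetical helper for normalizing NBA-style labels to integers (the 'OT'/'2OT' format is an assumption about your feed, not a standard):

```python
def period_to_int(p):
    """Normalize a raw period label ('3', 'OT', '2OT', or an int) to an integer."""
    if isinstance(p, int):
        return p
    p = str(p).strip().upper()
    if p.endswith('OT'):
        n = p[:-2]
        return 4 + (int(n) if n else 1)   # NBA: 'OT' -> 5, '2OT' -> 6
    return int(p)
```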
5. Elo differential
Captures the pre-game skill asymmetry. Compute via rolling Elo update on historical match outcomes; K=32, start_elo=1500. At game time, use (home_elo - away_elo).
```python
def update_elo(elo_a, elo_b, a_won, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
    a_result = 1.0 if a_won else 0.0
    new_a = elo_a + k * (a_result - expected_a)
    new_b = elo_b + k * ((1.0 - a_result) - (1.0 - expected_a))
    return new_a, new_b
```
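As a usage sketch, here is the same update rule rolled over a short match history (teams and results are invented for illustration; the rule is inlined so the snippet stands alone). Note the update is zero-sum: rating points move between the two teams, so the pool total never drifts.

```python
def update(elo_a, elo_b, a_won, k=32):
    # Same Elo update rule as update_elo above, inlined for self-containment
    exp_a = 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))
    a_res = 1.0 if a_won else 0.0
    return elo_a + k * (a_res - exp_a), elo_b + k * (exp_a - a_res)

elos = {'BOS': 1500.0, 'LAL': 1500.0}
# (home, away, home_won) triples in chronological order -- invented results
history = [('BOS', 'LAL', True), ('BOS', 'LAL', True), ('LAL', 'BOS', True)]
for home, away, home_won in history:
    elos[home], elos[away] = update(elos[home], elos[away], home_won)
```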
6. Home indicator / side flag
Home teams win more often than away teams, even after controlling for skill. Include as a binary feature.
The Engineered 5: Where Calibration Hides
7. score_diff × time_fraction (interaction)
A 5-point lead at time_fraction=0.9 is nearly guaranteed; at time_fraction=0.1 it's meaningless. A GBM can learn this interaction, but handing it explicitly often improves both training speed and out-of-sample calibration.
```python
df['score_diff_x_tf'] = df['score_diff'] * df['time_fraction']
```
8. score_diff² (squared)
The relationship between lead size and win probability is non-linear: the marginal impact of extending a 20-point lead to 25 points is much smaller than 5 to 10. Squaring gives the model direct access to this curvature.
9. Pre-game WP (prior)
The win probability implied before the game starts, based purely on team strength. For NBA, this typically comes from pre-game markets or Elo. Lock this in at tip-off and use it as a constant feature for all in-game snapshots from that game.
```python
def pregame_wp_from_elo(home_elo, away_elo, home_advantage=75):
    # Home advantage: ~75 Elo points in NBA
    return 1.0 / (1.0 + 10 ** ((away_elo - (home_elo + home_advantage)) / 400.0))
```
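Two quick sanity checks on this prior (ratings invented; the formula is inlined so the snippet stands alone): a 100-point rating edge at home works out to roughly a 73% favorite, and an away rating edge of exactly 75 points cancels the home bump to a coin flip.

```python
def pregame_wp(home_elo, away_elo, home_advantage=75):
    # Same prior as pregame_wp_from_elo above
    return 1.0 / (1.0 + 10 ** ((away_elo - (home_elo + home_advantage)) / 400.0))

wp = pregame_wp(1600, 1500)      # 100-point edge + home court: ~0.73
even = pregame_wp(1500, 1575)    # away edge exactly offsets home advantage: 0.5
```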
10. Pace differential (basketball)
High-pace teams create more possessions, which increases variance. A 10-point lead in a 110-possession game is less safe than the same lead in an 85-possession game. Compute as (home_pace - away_pace) using season-to-date averages.
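The season-to-date average has a leakage trap: the current game's possessions must not feed its own feature. A sketch under assumed column names (`team`, `game_date`, `possessions` in a game-level frame), using a one-game shift before the expanding mean:

```python
import pandas as pd

def season_to_date_pace(games: pd.DataFrame) -> pd.DataFrame:
    games = games.sort_values('game_date')
    # shift(1) excludes the current game, so each row sees only prior games
    games['pace_s2d'] = (games.groupby('team')['possessions']
                              .transform(lambda s: s.shift(1).expanding().mean()))
    return games
```

The first game of each team's season comes out NaN by construction; impute with a league-average pace or drop those rows before training.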
11. Rolling scoring momentum
Captures streaks and runs. Compute as the net scoring differential over a rolling window (120s, 300s typical for basketball).
```python
def rolling_run_diff(game_events, window_s=120):
    """Return home_minus_away scoring in the last window_s seconds."""
    if not game_events:
        return 0
    now_t = game_events[-1]['t']
    cutoff = now_t - window_s
    recent = [e for e in game_events if e['t'] >= cutoff]
    return sum(e['delta'] for e in recent)
```
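The same computation unrolled on a few invented events, where `t` is elapsed game time in seconds and `delta` is the home-minus-away points for that scoring event:

```python
events = [
    {'t': 100, 'delta': 2},    # early home bucket
    {'t': 500, 'delta': -3},   # away three
    {'t': 560, 'delta': 2},    # home answer
]
now_t = events[-1]['t']
recent = [e for e in events if e['t'] >= now_t - 120]
run_120 = sum(e['delta'] for e in recent)   # away is on a 3-2 run over 2 min
```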
The Sport-Specific 4: Where Real Edge Comes From
12. NHL: Shots on goal differential
SOG differential is predictive of future goal rate above and beyond current score. A team down 0-1 with a 25-10 SOG lead has a real comeback chance; the same deficit with 5-20 SOG is closer to hopeless.
13. NHL: Power play state
Goals score at ~5x the rate during a power play. Include (home_pp, away_pp, home_skater_advantage) as binary/integer features.
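If your feed exposes on-ice skater counts rather than explicit power-play flags, the three features can be derived from the counts. A hypothetical sketch assuming `home_skaters`/`away_skaters` columns (goalies excluded), with invented values:

```python
import pandas as pd

df = pd.DataFrame({'home_skaters': [5, 5, 4], 'away_skaters': [5, 4, 5]})
# Positive = home has the man advantage
df['home_skater_advantage'] = df['home_skaters'] - df['away_skaters']
df['home_pp'] = (df['home_skater_advantage'] > 0).astype(int)
df['away_pp'] = (df['home_skater_advantage'] < 0).astype(int)
```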
14. MLB: Starter ERA/WHIP differential
For the first 5 innings, starting pitcher quality dominates everything. Include (home_sp_era - away_sp_era), (home_sp_whip - away_sp_whip), and (home_sp_k9 - away_sp_k9).
15. MLB: is_home_batting
Binary. Tells the model whether the current inning half is the home team batting. Combined with score_diff and inning count, this captures the "bottom of the ninth with the tying run at bat" state structurally.
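If the feed labels half-innings as top/bottom, the flag is a one-liner. A sketch assuming a `half` column with `'top'`/`'bottom'` labels (column name and label format are assumptions about your data source):

```python
import pandas as pd

df = pd.DataFrame({'half': ['top', 'bottom', 'top', 'bottom']})
# Home team bats in the bottom half of every inning
df['is_home_batting'] = (df['half'] == 'bottom').astype(int)
```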
Features to Avoid
| Avoid | Why |
|---|---|
| ESPN's live win probability | If you train on it, your model learns to agree with the market. You lose independent signal. |
| Attendance, weather, referee ID | Tiny or zero effect, high cardinality, adds noise. |
| Team one-hot encoded | Elo already captures team strength. One-hot overfits to roster changes. |
| Raw timestamps | Teaches the model to memorize specific games. Always derive relative time features. |
Feature Importance: What a Production Model Actually Weights
Typical feature importance distribution for a tuned NBA XGBoost model trained on ~600k in-game snapshots:
```
score_diff         0.38
seconds_remaining  0.19
time_fraction      0.11
score_diff_x_tf    0.09
elo_diff           0.07
score_diff_sq      0.06
pregame_wp         0.04
run_diff_300s      0.03
pace_diff          0.02
run_diff_120s      0.01
```
Score diff and time are 70% of the signal. Everything else is getting the last 10% of ECE reduction. But the last 10% is where edge lives — a 1-2pp ECE improvement at the tails (80-90% bucket) is the difference between profitable and breakeven trading.
Putting It Together: Feature Pipeline in pandas
```python
import pandas as pd

def engineer_features(df: pd.DataFrame, sport: str) -> pd.DataFrame:
    # Base
    df['score_diff'] = df['home_score'] - df['away_score']
    df['seconds_remaining'] = df.apply(
        lambda r: seconds_remaining(r['period'], r['clock'], sport), axis=1)
    df['time_fraction'] = df['seconds_remaining'] / TOTAL[sport]
    # 1 = home perspective; set to 0 for neutral-site games if your data has them
    df['is_home'] = 1
    # Engineered
    df['score_diff_x_tf'] = df['score_diff'] * df['time_fraction']
    df['score_diff_sq'] = df['score_diff'] ** 2
    df['pregame_wp'] = df.apply(
        lambda r: pregame_wp_from_elo(r['home_elo'], r['away_elo']), axis=1)
    df['elo_diff'] = df['home_elo'] - df['away_elo']
    # Basketball: season-to-date pace differential
    if sport in ('NBA', 'NCAAMB'):
        df['pace_diff'] = df['home_pace_s2d'] - df['away_pace_s2d']
    # Hockey: shots on goal and power-play state
    if sport == 'NHL':
        df['sog_diff'] = df['home_sog'] - df['away_sog']
        df['home_pp'] = df['home_on_pp'].astype(int)
        df['away_pp'] = df['away_on_pp'].astype(int)
    # MLB: starter differentials, signed so positive = home advantage
    # (lower ERA/WHIP is better, higher K/9 is better)
    if sport == 'MLB':
        df['sp_era_diff'] = df['away_sp_era'] - df['home_sp_era']
        df['sp_whip_diff'] = df['away_sp_whip'] - df['home_sp_whip']
        df['sp_k9_diff'] = df['home_sp_k9'] - df['away_sp_k9']
        df['is_home_batting'] = df['is_home_batting'].astype(int)
    return df
```
Don't want to engineer features from scratch? ZenHodl's prediction API ships calibrated win probabilities for 11 sports via REST — features already engineered, models already trained.
See the API docs.

Further reading: Calibrating XGBoost Probabilities with Isotonic Regression · How to Build a Sports Prediction Model with Python
Related Reading
- NCAAMB 2025-26 Season Report — these features applied to 5,345 college basketball games.
- Build a March Madness prediction model — these features in a tournament context.
- Build a Super Bowl prediction model — NFL-specific feature set.
- Build an MLB prediction model — starting pitcher features on top of this baseline.
- Build a soccer prediction model — adapting these features for 3-outcome sports.
- XGBoost sample weights — the training technique paired with these features.