Feature Engineering for Sports Betting Models: What Actually Moves Win Probability
You can throw 100 features at an XGBoost model and it will use most of them. The question is which ones actually carry signal and which are noise the model merely pretends to use. This post walks through the features that consistently move win probability across NBA, NHL, MLB, and football, based on SHAP attribution from production models, each trained on thousands of games.
The conclusion up front: the right feature set is small, mostly the same across sports, and dominated by a handful of features that the model relies on disproportionately.
The features that always matter
Five features carry the bulk of the signal in every sport win-probability model we have built:
| Feature | Why it matters |
|---|---|
| Score differential (home - away) | The single largest signal in any in-play model. Late in the game, score diff dominates everything else. |
| Time remaining (normalized) | How much of the game is left for the score to change. |
| Pre-game win probability | From your team-strength model (Elo, KenPom, FiveThirtyEight). Carries everything the model knows before the game starts. |
| Score diff × time remaining | Interaction term. A 5-point lead with 2 minutes left is very different from a 5-point lead with 30 minutes left. |
| Possession indicator (where applicable) | In basketball and football, who has the ball is critical near the end of close games. |
These five features alone get you a model that is 90% as good as a 30-feature one. Everything beyond is incremental.
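As a concrete sketch, here is one way to assemble the universal five with pandas. The input column names (home_score, seconds_remaining, home_has_possession, and so on) are illustrative, and pregame_wp is assumed to already come from an upstream team-strength model:

```python
import pandas as pd

def add_universal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the five universal win-probability features.

    Assumes the frame has home_score, away_score, seconds_remaining,
    game_length_seconds, pregame_wp, and home_has_possession columns
    (names are illustrative, not a fixed schema).
    """
    out = df.copy()
    out["score_diff"] = out["home_score"] - out["away_score"]
    # Normalize time remaining to [0, 1] so it is comparable across sports.
    out["time_remaining"] = out["seconds_remaining"] / out["game_length_seconds"]
    # pregame_wp is kept as-is; it carries everything known before tip-off.
    out["score_diff_x_time_remaining"] = out["score_diff"] * out["time_remaining"]
    out["possession"] = out["home_has_possession"].astype(int)
    return out

# One NBA snapshot: home up 5 with 2 minutes of a 48-minute game left.
game = pd.DataFrame({
    "home_score": [102], "away_score": [97],
    "seconds_remaining": [120], "game_length_seconds": [2880],
    "pregame_wp": [0.55], "home_has_possession": [True],
})
feats = add_universal_features(game)
```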
Sport-specific features that meaningfully add
For each sport, two or three sport-specific features add real value beyond the universal five.
Basketball (NBA, NCAAMB)
- Pace (estimated possessions per 48 min): scaled to expected total points. A 100-pace game has more variance than an 85-pace game.
- Offensive rating differential: home offensive rating minus away offensive rating, shrunk toward zero early in the season while sample sizes are small.
- Free-throw count differential: late in close games, the team in the bonus has a meaningful structural advantage.
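Pace itself is rarely in the box score and has to be estimated. A minimal sketch using the standard box-score possession estimate (FGA + 0.44·FTA − ORB + TOV, where 0.44 is the conventional weight for free-throw trips that end a possession):

```python
def estimate_possessions(fga: float, fta: float, orb: float, tov: float) -> float:
    # Field-goal attempts, plus an adjusted share of free-throw attempts,
    # minus offensive rebounds (which extend a possession), plus turnovers.
    return fga + 0.44 * fta - orb + tov

def pace_per_48(possessions: float, minutes_played: float) -> float:
    # Scale to a full 48-minute NBA game so overtime games are comparable.
    return possessions * 48.0 / minutes_played
```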
Hockey (NHL)
- Shots-on-goal differential: a leading indicator of goals; teams generating more shots are more likely to score next.
- Power-play state: the team on the power play has dramatically higher per-minute scoring probability.
- Empty-net flag: in the final minutes, pulling the goalie is a giant single-feature jump in opponent scoring probability.
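All three hockey features can be derived from a play-by-play snapshot. A sketch, with illustrative column names like home_sog and home_skaters:

```python
import pandas as pd

def add_nhl_state_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive NHL game-state features (column names are illustrative)."""
    out = df.copy()
    out["shot_diff"] = out["home_sog"] - out["away_sog"]
    # +1 = home on the power play, -1 = away on the power play, 0 = even strength.
    out["pp_state"] = (
        (out["home_skaters"] > out["away_skaters"]).astype(int)
        - (out["away_skaters"] > out["home_skaters"]).astype(int)
    )
    # Either net empty (typically the trailing team pulling its goalie late).
    out["empty_net"] = (~out["home_goalie_in"] | ~out["away_goalie_in"]).astype(int)
    return out

# Snapshot: away team outshooting, home team on the power play, both goalies in.
snap = pd.DataFrame({
    "home_sog": [28], "away_sog": [31],
    "home_skaters": [5], "away_skaters": [4],
    "home_goalie_in": [True], "away_goalie_in": [True],
})
state = add_nhl_state_features(snap)
```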
Baseball (MLB)
- Starting pitcher quality differential: ERA, FIP, K/9 of starting pitchers. Massive game-to-game effect.
- Inning: the relative importance of each remaining at-bat shifts late in games.
- Bases occupied state: scoring probability per at-bat varies dramatically with runners on base.
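The bases-occupied state is naturally a 3-bit integer, which hands a tree model all eight states as a single feature. A minimal sketch:

```python
def base_state(first: bool, second: bool, third: bool) -> int:
    # 3-bit encoding: bit 0 = runner on first, bit 1 = second, bit 2 = third.
    # 0 = bases empty, 7 = bases loaded; 8 distinct states in total.
    return int(first) | (int(second) << 1) | (int(third) << 2)
```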
Football (NFL, CFB)
- Down and distance: 1st-and-10 and 4th-and-15 are very different situations.
- Field position: yard line scales scoring probability for the team in possession.
- Two-minute warning flag: clock management changes dramatically in the final two minutes of each half.
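A sketch of these situational features as a single function; the names and the log1p transform on distance are illustrative choices, not a prescribed encoding:

```python
import math

def football_situation(down: int, yards_to_go: int,
                       yardline_100: int, seconds_left_in_half: int) -> dict:
    """Situational NFL/CFB features (names and encodings are illustrative)."""
    return {
        "down": down,
        # log1p compresses the long tail of extreme long-yardage situations.
        "log_yards_to_go": math.log1p(yards_to_go),
        # Yards from the opponent's end zone; smaller = better field position.
        "yardline_100": yardline_100,
        "two_minute": int(seconds_left_in_half <= 120),
    }

# 4th-and-15 from your own 40 with 90 seconds left in the half.
sit = football_situation(down=4, yards_to_go=15,
                         yardline_100=60, seconds_left_in_half=90)
```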
The interaction features that punch above their weight
A handful of engineered interaction features show up in the top-10 SHAP importance for almost every sport:
df["score_diff_x_time_remaining"] = df["score_diff"] * df["time_remaining"]
df["score_diff_sq"] = df["score_diff"] ** 2
df["score_diff_x_elo_diff"] = df["score_diff"] * df["elo_diff"]
df["pregame_wp_x_time_elapsed"] = df["pregame_wp"] * (1 - df["time_remaining"])
The first three add 1-3 percentage points of model AUC over the raw features in our backtests. The fourth captures the way the model should weight pregame information differently early vs late in a game.
The features that look important but are not
Several features that intuitively seem like they should matter contribute almost nothing in practice:
- Day of week. No measurable effect after controlling for matchup quality.
- Travel distance for the away team. Effect size is in the noise band; not worth the data engineering.
- Days of rest. Noisy. Some research shows an effect on starting-pitcher fatigue in MLB, but it does not survive holdout validation in our data.
- Referee identity. Effect exists in some sports but is small enough to not survive in a calibrated probability model.
- Weather (outdoor sports). Useful only in extreme conditions; noisy across the typical range.
A model that includes these features will use them, but the SHAP attribution shows they contribute fractions of a percentage point to AUC. Cut them and the model performs as well or better.
How to find your own important features
Use SHAP on a trained model:
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

model = xgb.XGBClassifier(n_estimators=400, max_depth=5, learning_rate=0.05)
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Mean absolute SHAP per feature
mean_abs_shap = np.abs(shap_values).mean(axis=0)
shap_df = pd.DataFrame({
    "feature": X_train.columns,
    "mean_abs_shap": mean_abs_shap,
}).sort_values("mean_abs_shap", ascending=False)
print(shap_df.head(15))
Read the top 10. Drop everything below position 20. Retrain. Compare AUC. The right feature set is usually 12-15 features, not 50.
Feature stability across seasons
The features that matter in February usually matter in October. The features that matter in 2025 usually matter in 2024. Win-probability features are remarkably stable across seasons because the underlying physics of the games does not change.
The exceptions: rule changes (NBA's 2-minute review rules, NCAA's 30-second shot clock changes) can shift feature importance over the course of one season. Pace changes (the NBA's pace increased significantly between 2010 and 2020) shift the relative importance of pace-related features over multiple seasons.
The practical implication: retraining the model annually is enough to capture most drift. Quarterly is overkill. Monthly is unnecessary unless you have a specific reason.
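One lightweight way to quantify season-over-season stability on your own data is the Spearman rank correlation between two seasons' mean-|SHAP| importance vectors. A self-contained sketch with illustrative numbers:

```python
import numpy as np

def rank(values) -> np.ndarray:
    # Convert raw importances to 0-based ranks (ties ignored for simplicity).
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b) -> float:
    # Spearman rank correlation = Pearson correlation of the ranks.
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# Illustrative mean-|SHAP| importances for the same features in two seasons.
season_a = [0.41, 0.22, 0.15, 0.09, 0.05]
season_b = [0.39, 0.24, 0.13, 0.10, 0.06]
stability = spearman(season_a, season_b)
```

A value near 1.0 means the importance ordering barely moved between seasons; a sharp drop is a hint that a rule change or pace shift is worth investigating.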
The bottom line
Feature engineering for sports models is mostly subtraction, not addition. Start with the universal five (score diff, time, pregame WP, score-time interaction, possession), add two or three sport-specific features, engineer a handful of interaction terms, and stop. Anything beyond is usually noise the model pretends to learn from.
Production win-probability models for 11 sports
ZenHodl publishes calibrated probabilities using the feature sets described in this post. Free seven-day trial.