Time-Series Cross-Validation for Sports Models: Why K-Fold Lies

May 5, 2026 · 14 min read · Python, Cross-Validation, Sports ML

You trained an XGBoost model on 5 seasons of NBA play-by-play. cross_val_score with cv=5 returns a mean ROC AUC of 0.94 and a Brier score of 0.038. You ship it. Live performance: AUC 0.78, Brier 0.21. The model didn't suddenly forget how to predict basketball — the validation framework was lying.

Random k-fold cross-validation is the default in scikit-learn for a reason: it's a sane choice for IID tabular data. Sports data isn't IID. There's a strict temporal ordering, the underlying distribution shifts over seasons, and adjacent samples (consecutive plays in the same game) are statistically dependent in ways that make random splits leak future information into training. This post explains the failure modes, walks through three time-aware CV strategies in Python, and shows how to detect the subtler leakage that survives even the obvious fixes.

How K-Fold Lies on Sports Data

Random k-fold puts game N's third-quarter play-by-play in the train set and that same game's fourth-quarter play-by-play in the test set. The model "predicts" the fourth quarter having seen the third — a luxury it never has in production. AUC is inflated by the trivial information shared between adjacent rows: same lineup, same matchup, same momentum.

It gets worse. Even at the game level, with GroupKFold by game_id, the splits still mix seasons. A test fold from October 2023 gets predictions from a model trained on data through April 2024, so players' later form, lineups, and tactics leak backward into the prediction of an earlier game. Subtle but real.
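
Both failure modes are easy to audit on your own data. A minimal sketch, assuming a play-by-play frame with game_id and season columns (the column names are illustrative):

from sklearn.model_selection import KFold, GroupKFold

def audit_splits(df):
    # Failure 1: random k-fold lands rows from the same game on both sides.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    train_idx, test_idx = next(kf.split(df))
    shared = set(df.iloc[train_idx]["game_id"]) & set(df.iloc[test_idx]["game_id"])
    print(f"KFold: {len(shared)} games straddle the train/test boundary")

    # Failure 2: grouping by game fixes within-game leakage, but the train
    # fold still contains seasons that come after the test fold's games.
    gkf = GroupKFold(n_splits=5)
    train_idx, test_idx = next(gkf.split(df, groups=df["game_id"]))
    print("GroupKFold test seasons: ", sorted(df.iloc[test_idx]["season"].unique()))
    print("GroupKFold train seasons:", sorted(df.iloc[train_idx]["season"].unique()))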

Here's a quick experiment we ran on 5 seasons of NCAAMB data (~5,300 games, ~890,000 play-by-play rows). Same XGBoost classifier, same features, three CV strategies:

CV Strategy               ROC AUC   Brier   Live AUC
Random KFold (5)            0.943   0.038       0.78
GroupKFold by game (5)      0.872   0.082       0.78
Walk-forward by season      0.791   0.151       0.78

Walk-forward is the only one whose CV score matches live performance. The other two over-promise by 9 to 16 AUC points. If you ship based on the random k-fold number, you'll be devastated when production reality arrives.

Strategy 1: Walk-Forward (Sliding Window)

Walk-forward splits respect time. Train on data up to time T; test on data from T+1 to T+H (the holdout horizon); slide the window forward and repeat. Implementation in pandas:

import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, brier_score_loss

def walk_forward_cv(df, feature_cols, target_col, ts_col,
                    train_window_days=540, test_window_days=90, step_days=90):
    """
    Train on a sliding window, test on the immediately following window.
    Returns list of (auc, brier, n_test) per fold.
    """
    df = df.sort_values(ts_col).reset_index(drop=True)
    t_min, t_max = df[ts_col].min(), df[ts_col].max()

    results = []
    train_start = t_min
    while True:
        train_end = train_start + pd.Timedelta(days=train_window_days)
        test_end = train_end + pd.Timedelta(days=test_window_days)
        if test_end > t_max:
            break

        train = df[(df[ts_col] >= train_start) & (df[ts_col] < train_end)]
        test = df[(df[ts_col] >= train_end) & (df[ts_col] < test_end)]
        if len(train) < 1000 or len(test) < 200:
            train_start += pd.Timedelta(days=step_days)
            continue

        model = XGBClassifier(n_estimators=300, max_depth=5,
                              learning_rate=0.05, eval_metric="logloss")
        model.fit(train[feature_cols], train[target_col])
        probs = model.predict_proba(test[feature_cols])[:, 1]

        results.append({
            "train_end": train_end,
            "auc": roc_auc_score(test[target_col], probs),
            "brier": brier_score_loss(test[target_col], probs),
            "n_test": len(test),
        })
        train_start += pd.Timedelta(days=step_days)

    return pd.DataFrame(results)

The fold-by-fold results expose stability problems random k-fold hides. If your model's AUC drops from 0.82 in 2019 folds to 0.71 in 2024 folds, that's regime drift. K-fold averages it away into a single misleading number. Walk-forward shows you the trajectory.
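
Running it is a one-liner, and printing the per-fold rows makes that trajectory visible. A usage sketch; FEATURES, "home_win", and "ts" are placeholder names for your own columns:

# FEATURES, "home_win", and "ts" are placeholders for your frame.
res = walk_forward_cv(df, feature_cols=FEATURES,
                      target_col="home_win", ts_col="ts")
print(res.to_string(index=False))
print(f"AUC mean {res['auc'].mean():.3f}, std {res['auc'].std():.3f}, "
      f"min {res['auc'].min():.3f}")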

Strategy 2: Expanding Window

Expanding window keeps the start of the train set fixed at t_min and grows it forward. Each fold has more training data than the last. This matches how production retraining typically works (you accumulate data over time and refit periodically) but it gives unequal-sized folds:

def expanding_window_cv(df, feature_cols, target_col, ts_col,
                        initial_train_days=720, test_window_days=90, step_days=90):
    df = df.sort_values(ts_col).reset_index(drop=True)
    t_min, t_max = df[ts_col].min(), df[ts_col].max()

    results = []
    train_end = t_min + pd.Timedelta(days=initial_train_days)
    while True:
        test_end = train_end + pd.Timedelta(days=test_window_days)
        if test_end > t_max:
            break

        train = df[df[ts_col] < train_end]
        test = df[(df[ts_col] >= train_end) & (df[ts_col] < test_end)]

        model = XGBClassifier(n_estimators=300, max_depth=5,
                              learning_rate=0.05, eval_metric="logloss")
        model.fit(train[feature_cols], train[target_col])
        probs = model.predict_proba(test[feature_cols])[:, 1]

        results.append({
            "train_end": train_end,
            "n_train": len(train),
            "auc": roc_auc_score(test[target_col], probs),
            "brier": brier_score_loss(test[target_col], probs),
        })
        train_end += pd.Timedelta(days=step_days)

    return pd.DataFrame(results)
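
If you can live with index-based windows (fold boundaries at row counts rather than calendar dates), scikit-learn's built-in TimeSeriesSplit does the same expanding-window split on a time-sorted frame, and its gap parameter can hold back a buffer between train and test. A sketch, assuming a "ts" timestamp column:

from sklearn.model_selection import TimeSeriesSplit

df = df.sort_values("ts").reset_index(drop=True)   # must be time-sorted
tscv = TimeSeriesSplit(n_splits=5, gap=0)          # gap > 0 leaves a buffer
for train_idx, test_idx in tscv.split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    print(len(train), len(test))   # fit and score as in expanding_window_cv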

Use expanding window when older data is still informative (deep team-level features that change slowly). Use sliding window when concept drift is fast (rule changes, meta shifts in esports, post-COVID NBA pace) and old data hurts more than it helps.

Strategy 3: Season-Based Splits

For sports with sharp season boundaries, a season-based split is the cleanest formulation: train on seasons 1 through N−1, test on season N. This matches the deployment pattern (you retrain in the offseason and run all season) and avoids weird mid-season cuts where playoff teams behave differently than regular-season teams.

def season_walk_forward(df, feature_cols, target_col, season_col,
                        min_train_seasons=2):
    seasons = sorted(df[season_col].unique())
    results = []
    for i in range(min_train_seasons, len(seasons)):
        train_seasons = seasons[:i]
        test_season = seasons[i]
        train = df[df[season_col].isin(train_seasons)]
        test = df[df[season_col] == test_season]

        model = XGBClassifier(n_estimators=300, max_depth=5,
                              learning_rate=0.05, eval_metric="logloss")
        model.fit(train[feature_cols], train[target_col])
        probs = model.predict_proba(test[feature_cols])[:, 1]

        results.append({
            "test_season": test_season,
            "n_train": len(train),
            "n_test": len(test),
            "auc": roc_auc_score(test[target_col], probs),
            "brier": brier_score_loss(test[target_col], probs),
        })
    return pd.DataFrame(results)

This is what we use as the canonical CV for sports models in production. It's interpretable (one row per season is easy to read), it tests on out-of-distribution data (the test season has new players, new rules, new tactics), and it's directly comparable across model versions.

The Subtler Leakage You Won't Catch

Even with proper walk-forward CV, leakage can sneak in through feature engineering. Three common cases:

1. Season-Aggregated Features

If your feature is "team's offensive rating this season," computed over the entire season's games, you're leaking. The October version of that feature shouldn't include March games. Fix: compute features as-of the prediction date.

# Wrong: this aggregates the whole season including future games
team_ortg = df.groupby(["team", "season"])["ortg"].transform("mean")

# Right: rolling mean over only the team's past games
df = df.sort_values(["team", "ts"])
team_ortg = (df.groupby("team")["ortg"]
               .transform(lambda s: s.shift(1)    # exclude the current game
                                     .rolling(window=20, min_periods=5)
                                     .mean()))

Note that the shift happens inside the groupby. A global .shift(1) applied after resetting the group index bleeds the last value of one team into the first row of the next, which is its own small leak.

2. Elo Ratings Built In Reverse

If you build Elo ratings by iterating through the historical match log, that's correct. If you load Elo ratings from a snapshot file that was computed on the full dataset, including future games, your training features have peeked. We hit this once and the model's "AUC" jumped by 0.08 — a clear sign something was off, but easy to mistake for genuine improvement.
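
The safe construction iterates the match log in date order and records each team's rating before the game as the feature, so no rating ever reflects the game it predicts. A minimal sketch; the K-factor, base rating, and column names (home, away, home_won) are illustrative:

from collections import defaultdict

def elo_features(games, k=20.0, base=1500.0):
    """games: one row per game, sorted by date, with home/away/home_won columns."""
    ratings = defaultdict(lambda: base)
    pre_home, pre_away = [], []
    for row in games.itertuples():
        rh, ra = ratings[row.home], ratings[row.away]
        pre_home.append(rh)    # the feature is the rating BEFORE the game
        pre_away.append(ra)
        exp_home = 1.0 / (1.0 + 10 ** ((ra - rh) / 400.0))
        delta = k * (row.home_won - exp_home)
        ratings[row.home] += delta
        ratings[row.away] -= delta
    return games.assign(elo_home=pre_home, elo_away=pre_away)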

3. Calibration Set Reuse

Calibration (isotonic, Platt) needs a held-out set. If that set overlaps with the training set, the calibrator memorizes training noise and your validation calibration is fictitious. Three-way split: train, calibration, test. None overlap, all are time-ordered.

def three_way_temporal_split(df, ts_col, train_frac=0.7, cal_frac=0.15):
    df = df.sort_values(ts_col).reset_index(drop=True)
    n = len(df)
    i_train = int(n * train_frac)
    i_cal = int(n * (train_frac + cal_frac))
    return df.iloc[:i_train], df.iloc[i_train:i_cal], df.iloc[i_cal:]
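
Wiring it together, the calibrator only ever sees the middle slice. A sketch using scikit-learn's IsotonicRegression; FEATURES and TARGET are placeholder names:

from sklearn.isotonic import IsotonicRegression

# FEATURES and TARGET are placeholders for your columns.
train, cal, test = three_way_temporal_split(df, ts_col="ts")

model = XGBClassifier(n_estimators=300, max_depth=5,
                      learning_rate=0.05, eval_metric="logloss")
model.fit(train[FEATURES], train[TARGET])

iso = IsotonicRegression(out_of_bounds="clip")   # fitted on cal only
iso.fit(model.predict_proba(cal[FEATURES])[:, 1], cal[TARGET])

probs = iso.predict(model.predict_proba(test[FEATURES])[:, 1])
print("test Brier:", brier_score_loss(test[TARGET], probs))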

Detecting Leakage Empirically

If your train AUC is meaningfully higher than your test AUC even on a properly time-ordered split, you have leakage. The diagnostic that catches the most cases is the "shuffle the target" test: randomly permute the labels and re-run CV. The score should drop to chance (AUC ~0.5). If it doesn't, an information leak survives randomization, usually because near-duplicate rows straddle the train/test boundary (the model memorizes the permuted label it saw in training) or because a feature is recomputed from the target inside the pipeline:

import numpy as np

def label_permutation_test(df, feature_cols, target_col, ts_col, cv_func):
    df_shuffled = df.copy()
    df_shuffled[target_col] = np.random.permutation(df[target_col].values)
    return cv_func(df_shuffled, feature_cols, target_col, ts_col)

# AUC should be ~0.5 on permuted labels. If it's 0.6+, you have leakage.

Another diagnostic: train a model only on data from after the test window and predict the test set. The future-trained model doing somewhat better is expected, since it has seen the same teams and players from the near side of the test window. If the past-trained model is better, suspect leakage in the other direction: the training data somehow encodes the test labels.
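
A sketch of that comparison; test_start and test_end are boundaries you choose for your data:

def reverse_time_check(df, feature_cols, target_col, ts_col,
                       test_start, test_end):
    """Score a past-trained and a future-trained model on the same window."""
    test = df[(df[ts_col] >= test_start) & (df[ts_col] < test_end)]
    scores = {}
    for name, train in [("past", df[df[ts_col] < test_start]),
                        ("future", df[df[ts_col] >= test_end])]:
        m = XGBClassifier(n_estimators=300, max_depth=5,
                          learning_rate=0.05, eval_metric="logloss")
        m.fit(train[feature_cols], train[target_col])
        p = m.predict_proba(test[feature_cols])[:, 1]
        scores[name] = roc_auc_score(test[target_col], p)
    return scores   # scores["past"] > scores["future"] is a red flag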

Reporting CV Results Honestly

A single AUC number from a five-fold split is almost always an overstatement. When publishing CV results, include:

  1. Per-fold scores, not just the mean. Show the variance.
  2. Train and test sample sizes per fold. Small test folds are noisy.
  3. The CV strategy explicitly: "season walk-forward" or "expanding window with 90-day test horizon."
  4. The temporal range of each fold's test set. A 2024 test fold tells you about 2024; it doesn't tell you about 2025.
  5. Calibration metrics alongside discrimination metrics. AUC + Brier + ECE.

This is not academic rigor for its own sake. It's the difference between knowing your model is shippable and finding out the hard way.
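
A small helper that prints a report along those lines from any of the results frames above (ECE omitted for brevity; FEATURES and TARGET are placeholders):

def report_cv(results, strategy_name):
    """Per-fold scores, sizes, and spread, not just the mean."""
    print(f"CV strategy: {strategy_name}")
    print(results.to_string(index=False))
    print(f"AUC   {results['auc'].mean():.3f} ± {results['auc'].std():.3f} "
          f"(min {results['auc'].min():.3f})")
    print(f"Brier {results['brier'].mean():.3f} ± {results['brier'].std():.3f}")

report_cv(season_walk_forward(df, FEATURES, TARGET, "season"),
          "season walk-forward")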

When Random K-Fold Is Actually Fine

Two cases where random CV is the right choice on sports data:

  1. Sport-agnostic embeddings. If you're learning a player embedding from career-long stats with no temporal feature, random CV by player is reasonable.
  2. Synthetic / simulated data. If your data is generated from a stationary process you control, IID assumptions hold and random CV is fine.

Everywhere else — play-by-play, game outcomes, in-game win probability — default to walk-forward. The cost is a slightly more complex split function. The benefit is a CV score you can actually trust.

Common Pitfalls Recap

  1. Random k-fold on time-ordered data — inflates AUC by 9-16 points
  2. GroupKFold by game without time ordering — better, still leaks across seasons
  3. Aggregating features across the entire season — even with walk-forward CV, this leaks
  4. Reusing the test set for hyperparameter tuning — tune on the calibration set, evaluate on test once
  5. Reporting only the mean CV score — per-fold variance is the diagnostic
  6. Skipping the label-permutation test — it catches the deepest leaks
