Sample Weights in XGBoost: Fixing Class Imbalance and Regime Drift
You've trained an XGBoost classifier. Accuracy looks good. AUC is 0.87. You deploy, and the model is wildly miscalibrated on the rare-event bucket that actually matters. You try hyperparameters; nothing fixes it.
What you're missing is sample_weight — XGBoost's single most underused parameter. It tells the model which rows should count more during gradient computation. Used correctly, it fixes class imbalance, emphasizes recent data, handles regime-specific drift, and boosts calibration on rare-but-critical examples.
This post walks through the four cases where sample_weight is the right tool, with copy-pasteable Python code for each.
The Four Cases for sample_weight
| Case | What you weight heavier | Typical weight range |
|---|---|---|
| Class imbalance | Minority class rows | 2-10x |
| Recency bias | Recent rows | 1-5x |
| Regime emphasis | Rows matching target regime (playoffs, etc.) | 2-5x |
| Cost-weighted learning | High-stakes rows | 2-20x |
Case 1: Fixing Class Imbalance
Training a model on MLB moneyline outcomes? ~53% of home teams win. NBA? ~58%. Not too bad. But fraud detection, conversion prediction, rare-disease classification? You might have 1-5% positives. Without rebalancing, the model learns "always predict 0" and hits 98% accuracy.
The standard fix: inverse-frequency weights
```python
import numpy as np
from xgboost import XGBClassifier

def balanced_sample_weights(y):
    """Weight each class inversely proportional to its frequency."""
    class_counts = np.bincount(y)
    n_classes = len(class_counts)
    weights = len(y) / (n_classes * class_counts)
    return weights[y]

# Training
y_train = np.array([0, 0, 0, 0, 0, 1, 1])  # 5:2 imbalance
weights = balanced_sample_weights(y_train)
# weights -> [0.7, 0.7, 0.7, 0.7, 0.7, 1.75, 1.75]

model = XGBClassifier(n_estimators=300, max_depth=4)
model.fit(X_train, y_train, sample_weight=weights)
```
XGBoost also has scale_pos_weight, which does the same thing but only for binary classification: set scale_pos_weight = n_negative / n_positive. For multi-class problems you must use sample_weight. Rule of thumb: scale_pos_weight for binary imbalance, sample_weight for everything else.
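For the binary case the two approaches are equivalent up to a constant factor, and the scalar is easy to compute. A standalone sketch (the commented XGBClassifier call assumes the X_train/y_train arrays from the example above):

```python
import numpy as np

# Same hypothetical labels as above: 5:2 imbalance
y_train = np.array([0, 0, 0, 0, 0, 1, 1])

# scale_pos_weight is simply the negative/positive count ratio
n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())
spw = n_neg / n_pos  # 2.5 -- same ratio as 1.75 / 0.7 above

# Equivalent binary-only usage:
# model = XGBClassifier(n_estimators=300, max_depth=4, scale_pos_weight=spw)
# model.fit(X_train, y_train)
print(spw)
```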
Case 2: Emphasizing Recent Data
Sports, finance, ads — any time-series domain has distribution drift. Rules change. Team rosters change. Market structure evolves. A 2021 game and a 2026 game are not from the same distribution, even though they're both labeled "NBA regular season."
Solution: decay weights exponentially by row age.
```python
import numpy as np
import pandas as pd
from datetime import datetime

def recency_weights(dates, half_life_days: float = 365):
    """Exponential decay: rows half_life_days old get half the weight."""
    today = datetime.now()
    ages_days = np.array([(today - d).days for d in dates])
    decay_rate = np.log(2) / half_life_days
    return np.exp(-decay_rate * ages_days)

# Example
df['w'] = recency_weights(df['date'], half_life_days=730)  # 2-year half-life
model.fit(df[feature_cols], df['y'], sample_weight=df['w'].values)
```
A 2-year half-life means 2024 data counts at full weight, 2022 data at 50%, 2020 at 25%. Adjust based on how fast your domain drifts. Trading markets: 6-12 months. Sports: 1-2 years. Medical imaging: longer.
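To sanity-check a chosen half-life, the decay curve can be evaluated directly. A standalone version of the same math as recency_weights:

```python
import numpy as np

def decay_weight(age_days, half_life_days):
    """Weight of a row age_days old under exponential half-life decay."""
    return np.exp(-np.log(2) / half_life_days * age_days)

# 730-day (2-year) half-life: verify the weights the prose claims
print(round(decay_weight(0, 730), 2))     # fresh row: 1.0
print(round(decay_weight(730, 730), 2))   # 2 years old: 0.5
print(round(decay_weight(1460, 730), 2))  # 4 years old: 0.25
```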
Case 3: Regime Emphasis (Playoffs, Early Season, Weather)
This is where sample_weight becomes a precision tool. You have data across multiple regimes, but you trade primarily in one of them.
Example: emphasizing NBA playoffs
```python
# Assume df has an is_playoff column (1 = playoff game, 0 = regular season).
# Playoffs are ~6% of games historically; we weight them 3x during training.
df['sample_weight'] = np.where(df['is_playoff'] == 1, 3.0, 1.0)

# Effective mass: 94 regular + 6 * 3 = 112 total, of which 18 is playoff.
# So playoffs go from 6% of the sample to 18/112 ≈ 16% of the effective mass,
# and the model pays 3x as much attention to playoff patterns.

model.fit(
    df[feature_cols],
    df['home_wins'],
    sample_weight=df['sample_weight'].values,
)
```
After one real-world case where our NBA live model was badly miscalibrated on playoff games (84% predicted → 44% actual in the 80-90% confidence bucket), weighting playoff rows 3x during training dropped playoff Expected Calibration Error from 21% to 5.5%. Regular-season ECE barely moved.
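Expected Calibration Error is straightforward to compute yourself. A minimal equal-width-bin sketch (the y_val / p_val / pmask names in the comments are hypothetical):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Mean |predicted - actual| across equal-width probability bins,
    weighted by the fraction of samples landing in each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

# Hypothetical: score each regime separately on a held-out set
# ece_playoff = expected_calibration_error(y_val[pmask], p_val[pmask])
# ece_regular = expected_calibration_error(y_val[~pmask], p_val[~pmask])
```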
Computing regime indicators
For NBA/NHL, the regime indicator is "was this game a playoff game?" Derived from game dates against the league's playoff window:
```python
def is_playoff_nba(date) -> int:
    """NBA playoffs: roughly April 18 - June 22 each year."""
    if not (4 <= date.month <= 6):
        return 0
    if date.month == 4 and date.day < 18:
        return 0
    if date.month == 6 and date.day > 22:
        return 0
    return 1

df['is_playoff'] = df['date'].apply(is_playoff_nba)
```
For ESPN game data, a more reliable method is to hit the ESPN scoreboard endpoint with dates=YYYYMMDD and check each event's season.type field (2 = regular, 3 = postseason).
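A sketch of parsing that field, assuming the scoreboard JSON nests a season.type value under each event as described above (verify the exact field names against a live response before relying on this):

```python
def playoff_flags(scoreboard_json):
    """Map event id -> 1 if postseason, else 0, from an ESPN-style
    scoreboard response (season.type: 2 = regular, 3 = postseason)."""
    flags = {}
    for event in scoreboard_json.get("events", []):
        season_type = event.get("season", {}).get("type")
        flags[event["id"]] = 1 if season_type == 3 else 0
    return flags

# Hypothetical response fragment (ids made up for illustration):
sample = {"events": [
    {"id": "401585601", "season": {"type": 2}},
    {"id": "401585999", "season": {"type": 3}},
]}
print(playoff_flags(sample))  # -> {'401585601': 0, '401585999': 1}
```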
Case 4: Cost-Weighted Learning
Some errors cost more than others. Misclassifying a fraud transaction costs 100x more than misclassifying a legit one. In trading, misclassifying a high-confidence trade costs 10x more than missing a small edge.
Use sample_weight proportional to each row's cost-weight:
```python
# Fraud detection: weight positive class by transaction amount
df.loc[df['is_fraud'] == 1, 'w'] = df['amount']  # cost = $ at risk
df.loc[df['is_fraud'] == 0, 'w'] = 1.0

model.fit(X, y, sample_weight=df['w'].values)
```
This turns the classifier into a cost-sensitive learner: it minimizes expected dollar-weighted loss, not raw classification error.
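The same cost weights should show up in evaluation, or you are optimizing one objective and measuring another. A minimal dollar-weighted log loss, as a sketch:

```python
import numpy as np

def weighted_log_loss(y_true, p, w):
    """Log loss where each row's contribution is scaled by its cost weight."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    ll = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.average(ll, weights=w))

y = np.array([1, 0])
p = np.array([0.9, 0.9])  # confident, but wrong on the second row
print(weighted_log_loss(y, p, np.array([1, 1])))   # equal weights
print(weighted_log_loss(y, p, np.array([10, 1])))  # first row is high-stakes
```

With equal weights the bad call on row two dominates; once row one carries 10x the cost, the score rewards getting the high-stakes row right.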
Combining Multiple Weights
You often want multiple corrections simultaneously. Multiply the weights:
```python
df['w_imbalance'] = balanced_sample_weights(df['y'].values)
df['w_recency'] = recency_weights(df['date'].values, half_life_days=730)
df['w_regime'] = np.where(df['is_playoff'] == 1, 3.0, 1.0)

df['w'] = df['w_imbalance'] * df['w_recency'] * df['w_regime']
model.fit(df[feature_cols], df['y'], sample_weight=df['w'].values)
```
Keep the final weights roughly between ~0.01x and ~100x. Extreme weights cause XGBoost to effectively ignore the low-weighted rows, which is rarely what you want.
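A simple guard is to clip the combined weights and then renormalize them to mean 1, so the overall loss scale stays comparable to unweighted training (a sketch):

```python
import numpy as np

def clip_and_normalize(w, lo=0.01, hi=100.0):
    """Clip runaway weights, then rescale so the mean weight is 1.0."""
    w = np.clip(np.asarray(w, dtype=float), lo, hi)
    return w / w.mean()

# One crushed weight and one runaway weight get pulled back into range
w = clip_and_normalize([0.001, 1.0, 500.0, 2.0])
print(w.round(4))
```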
Evaluating Weighted Models
A model trained with sample_weight will score differently than one without. Some gotchas:
- Evaluate on unweighted data. Your test set should reflect the real-world distribution, not the rebalanced training distribution.
- Measure calibration separately per regime. If you weighted playoffs 3x, check ECE on playoffs AND regular season. A model calibrated on both is better than a model calibrated on one and miscalibrated on the other.
- Always apply post-hoc calibration after using sample_weight for imbalance or regime emphasis. Fit isotonic regression on the unweighted validation set.
```python
from sklearn.isotonic import IsotonicRegression

# Train with weights
base_model.fit(X_train, y_train, sample_weight=weights_train)

# Calibrate on unweighted validation set
cal_raw = base_model.predict_proba(X_val)[:, 1]
iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
iso.fit(cal_raw, y_val)  # <- no weights here

# At inference
def calibrated_predict(X):
    raw = base_model.predict_proba(X)[:, 1]
    return iso.predict(raw)
```
When sample_weight Is the Wrong Answer
Three situations where sample_weight is the wrong move:
- Tiny minority class (< 50 examples total). No amount of weighting fixes "not enough data." Collect more or accept the limitation.
- Data quality issue on rare class. If your rare-class labels are noisy, weighting amplifies the noise. Fix the labels first.
- Testing a structural hypothesis. If you want to know whether playoff games behave differently, fit two separate models — one on playoffs, one on regular season — and compare. Single-model weighting blends regimes.
Want to skip the training pipeline? ZenHodl's API ships playoff-aware, calibrated sports predictions across 11 sports — sample-weighted, isotonic-calibrated, production-ready.
Further reading: Calibrating XGBoost Probabilities with Isotonic Regression · 15 Features That Matter for Sports Win Probability
Related Reading
- NCAAMB 2025-26 Season Report — sample-weighted training applied to 5,345 games.
- Build a March Madness prediction model — where sample weights emphasize tournament games.
- Build an MLB prediction model — sample weighting for playoff emphasis.
- Calibrating XGBoost probabilities with isotonic regression — post-training calibration.
- From Jupyter to production ML API — how to ship a weighted model.