Module 1 Deep Dive: Scraping ESPN's Scoreboard Endpoints for Every NBA Game in a Season
Building a sports prediction model starts with one question: where does the data come from? For NBA games specifically, the answer is ESPN's scoreboard JSON endpoints — undocumented, free, no auth required, and remarkably stable across years. This is the deep dive on Module 1 of our Polymarket Bot Course: scraping every NBA game in a full season into a clean parquet file ready for downstream ML.
What you will end up with
By the end of this module, you will have a parquet file containing every NBA game from the 2024-25 season — 1,230 regular-season games plus the playoffs — with columns:
- game_id — ESPN's stable identifier
- game_date — ISO date
- home_team, away_team — abbreviations
- home_score, away_score — final scores
- status — "STATUS_FINAL" for completed games
- venue, attendance — metadata
This dataset is the input to the rest of the course: feature engineering in Module 2, model training in Module 3, backtesting in Module 4, live deployment in Module 5.
The endpoint
ESPN's NBA scoreboard endpoint takes a date parameter:
https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates=20250215
The date format is YYYYMMDD with no separators. The response is JSON containing all games on that date. To scrape an entire season, you iterate every day from October (season start) to June (Finals end).
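For orientation, here is an abridged sketch of the response shape showing only the fields this module touches; the leaf values are placeholders, and the real payload carries far more metadata than shown:

{
  "events": [
    {
      "id": "…",
      "date": "…",
      "status": {"type": {"name": "STATUS_FINAL"}},
      "competitions": [
        {
          "venue": {"fullName": "…"},
          "attendance": 0,
          "competitors": [
            {"homeAway": "home", "team": {"abbreviation": "…"}, "score": "…"},
            {"homeAway": "away", "team": {"abbreviation": "…"}, "score": "…"}
          ]
        }
      ]
    }
  ]
}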
The naive scraper
The simplest version that works:
import requests
from datetime import date, timedelta

def fetch_day(d: date) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={d.strftime('%Y%m%d')}"
    return requests.get(url, timeout=10).json()

start = date(2024, 10, 22)
end = date(2025, 6, 30)

all_games = []
d = start
while d <= end:
    data = fetch_day(d)
    for ev in data.get("events", []):
        all_games.append(ev)
    d += timedelta(days=1)
This runs in about 5 minutes for a full season. It also has every problem a production scraper needs to handle: no rate limiting, no retry, no deduplication, no error handling. Let's fix each one.
Defensive scraping
import requests, time, logging
from datetime import date, timedelta

logger = logging.getLogger(__name__)

SESSION = requests.Session()
SESSION.headers.update({"User-Agent": "MyBot/1.0 (educational)"})

def fetch_day_safe(d: date, retries: int = 3) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={d.strftime('%Y%m%d')}"
    for attempt in range(retries):
        try:
            r = SESSION.get(url, timeout=10)
            if r.status_code == 429:
                wait = int(r.headers.get("Retry-After", 2 ** attempt))
                logger.warning(f"Rate limited on {d}, waiting {wait}s")
                time.sleep(wait)
                continue
            r.raise_for_status()
            return r.json()
        except (requests.RequestException, ValueError) as e:
            logger.warning(f"Fetch failed {d} attempt {attempt}: {e}")
            time.sleep(2 ** attempt)
    return {}
Three things this version does right: it respects the Retry-After header on 429 responses, backs off exponentially on transient errors, and returns an empty dict rather than crashing after repeated failures, so one bad day does not stop the rest of the scrape.
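A quick spot check before committing to the full season loop; the date is the example from earlier in this post:

data = fetch_day_safe(date(2025, 2, 15))
print(f"{len(data.get('events', []))} games found on 2025-02-15")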
Parsing one event into a row
def parse_event(ev: dict) -> dict | None:
    try:
        comp = ev["competitions"][0]
        teams = comp["competitors"]
        home = next(t for t in teams if t["homeAway"] == "home")
        away = next(t for t in teams if t["homeAway"] == "away")
        return {
            "game_id": ev["id"],
            "game_date": ev["date"][:10],
            "home_team": home["team"]["abbreviation"],
            "away_team": away["team"]["abbreviation"],
            "home_score": int(home.get("score", 0)),
            "away_score": int(away.get("score", 0)),
            "status": ev["status"]["type"]["name"],
            "venue": comp.get("venue", {}).get("fullName", ""),
            "attendance": comp.get("attendance", 0),
        }
    except (KeyError, IndexError, StopIteration) as e:
        logger.warning(f"Parse failed for event {ev.get('id')}: {e}")
        return None
The try/except is not optional. ESPN's response shape varies slightly between regular season, playoffs, postponed games, and forfeit games. A single malformed event should not crash the entire scrape.
Deduplication
The same game can appear in two days' scoreboards if it spans midnight UTC. Postponed games can have two game_ids. Always deduplicate by game_id before persisting:
seen = set()
unique_games = []
for ev in all_games:
    parsed = parse_event(ev)
    if parsed is None:
        continue
    if parsed["game_id"] in seen:
        continue
    seen.add(parsed["game_id"])
    unique_games.append(parsed)
Politeness: rate-limit yourself
ESPN does not publish a rate limit, but their endpoints have throttled us at sustained rates above ~2 requests per second. Add a small sleep between requests:
import time

DELAY_S = 0.5  # 2 req/sec

while d <= end:
    data = fetch_day_safe(d)
    for ev in data.get("events", []):
        all_games.append(ev)
    d += timedelta(days=1)
    time.sleep(DELAY_S)
For a full season (about 250 days October to June), this adds about 2 minutes of total scrape time. Worth it to never get rate-limited.
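For reference, here is one way to assemble the pieces above into a single helper; scrape_season is our name for it, not something from ESPN or the course notebooks:

def scrape_season(start: date, end: date, delay_s: float = DELAY_S) -> list[dict]:
    # Walk every calendar day, fetch defensively, parse each event,
    # and deduplicate by game_id as we go.
    seen: set[str] = set()
    rows: list[dict] = []
    d = start
    while d <= end:
        for ev in fetch_day_safe(d).get("events", []):
            parsed = parse_event(ev)
            if parsed and parsed["game_id"] not in seen:
                seen.add(parsed["game_id"])
                rows.append(parsed)
        d += timedelta(days=1)
        time.sleep(delay_s)
    return rows

unique_games = scrape_season(date(2024, 10, 22), date(2025, 6, 30))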
Persisting to parquet
import pandas as pd
df = pd.DataFrame(unique_games)
df = df[df["status"] == "STATUS_FINAL"] # Drop in-progress and postponed
df = df.sort_values(["game_date", "game_id"]).reset_index(drop=True)
df.to_parquet("nba_games_2024_25.parquet")
print(f"Saved {len(df)} completed games")
Parquet is the right format for this dataset because it is columnar (fast to read specific columns) and compressed (small on disk). A full NBA season is around 1.5 MB on disk.
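The columnar layout pays off as soon as a downstream step needs only a few columns. A minimal sketch, assuming the file written above; the margin column is just an illustrative feature:

import pandas as pd

# Read back only the columns needed, skipping venue/attendance entirely.
scores = pd.read_parquet(
    "nba_games_2024_25.parquet",
    columns=["game_date", "home_team", "away_team", "home_score", "away_score"],
)
scores["margin"] = scores["home_score"] - scores["away_score"]
print(scores.tail())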
Going beyond scoreboard: the summary endpoint
The scoreboard gives you final scores. For modeling, you typically want play-by-play. ESPN's summary endpoint returns it:
def fetch_summary(game_id: str) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/summary?event={game_id}"
    return requests.get(url, timeout=15).json()

summary = fetch_summary("401705412")
plays = summary.get("plays", [])
print(f"{len(plays)} plays in this game")
Each play has period, clock, score after the play, type, and (sometimes) coordinates. This is the input to in-play win-probability models. We cover building those in Module 3.
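A sketch of flattening those plays into a table. The nested key names used here (period.number, clock.displayValue, homeScore, awayScore, type.text, coordinate) are assumptions about the summary response shape; verify them against a live response before relying on them:

import pandas as pd

def plays_to_rows(plays: list[dict]) -> pd.DataFrame:
    # Flatten nested play dicts into one row per play; missing keys become None.
    rows = []
    for p in plays:
        rows.append({
            "period": p.get("period", {}).get("number"),
            "clock": p.get("clock", {}).get("displayValue"),
            "home_score": p.get("homeScore"),
            "away_score": p.get("awayScore"),
            "play_type": p.get("type", {}).get("text"),
            "x": p.get("coordinate", {}).get("x"),
            "y": p.get("coordinate", {}).get("y"),
        })
    return pd.DataFrame(rows)

pbp = plays_to_rows(plays)
print(pbp.head())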
What can go wrong
- Time zones. ESPN's date field is UTC; some games on the West Coast cross midnight UTC and appear on two consecutive scoreboard days. Dedup catches this.
- Postponements. Postponed games sometimes get a new game_id when rescheduled. Sometimes they keep the original. Both happen. Dedup catches the repeated-id case but not the new-id case, so manual review is required for season-end completeness (a quick check follows this list).
- Cancellations. Some games are cancelled outright. Their status will be STATUS_CANCELLED. Filter by status before modeling.
- Schema changes. ESPN occasionally changes a field's name or type. We have not seen a breaking change in years, but assume it can happen and write parsers defensively.
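Before declaring the season scrape complete, a quick sanity check surfaces most of the issues above: look at the status breakdown and compare the completed-game count against the 1,230 regular-season games mentioned at the top (playoffs push the true total higher):

import pandas as pd

df = pd.DataFrame(unique_games)

# Postponed, cancelled, and in-progress games all show up here.
print(df["status"].value_counts())

# The regular season alone is 1,230 games; with playoffs included, the
# completed count should comfortably exceed that.
finals = df[df["status"] == "STATUS_FINAL"]
assert len(finals) >= 1230, f"Only {len(finals)} completed games; scrape looks incomplete"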
The bottom line
A clean NBA season scrape is a foundational dataset for any prediction model. ESPN's free JSON endpoints make it easy. The 100 lines of Python in this post turn into a parquet file you can build the rest of your pipeline on. Module 2 (feature engineering) and Module 3 (model training) both consume this file directly.
The full Polymarket Bot Course
Six Jupyter modules: ESPN scraping, Elo, win-probability models, backtesting, live bot, deployment. $49 standalone or included with every ZenHodl API plan.
Get the course