Module 1 Deep Dive: Scraping ESPN's Scoreboard Endpoints for Every NBA Game in a Season
Building a sports prediction model starts with one question: where does the data come from? For NBA games specifically, the answer is ESPN's scoreboard JSON endpoints — undocumented, free, no auth required, and remarkably stable across years. This is the deep dive on Module 1 of our Polymarket Bot Course: scraping every NBA game in a full season into a clean parquet file ready for downstream ML.
What you will end up with
By the end of this module, you will have a parquet file containing every NBA game from the 2024-25 season — 1,230 regular-season games plus the playoffs — with columns:
- game_id — ESPN's stable identifier
- game_date — ISO date
- home_team, away_team — abbreviations
- home_score, away_score — final scores
- status — "STATUS_FINAL" for completed games
- venue, attendance — metadata
This dataset is the input to the rest of the course: feature engineering in Module 2, model training in Module 3, backtesting in Module 4, live deployment in Module 5.
The endpoint
ESPN's NBA scoreboard endpoint takes a date parameter:
https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates=20250215
The date format is YYYYMMDD with no separators. The response is JSON containing all games on that date. To scrape an entire season, you iterate every day from October (season start) to June (Finals end).
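For orientation, here is an abridged sketch of the response shape showing only the fields this module touches; the leaf values are placeholders, and the real payload carries far more metadata than shown:

{
  "events": [
    {
      "id": "…",
      "date": "…",
      "status": {"type": {"name": "STATUS_FINAL"}},
      "competitions": [
        {
          "venue": {"fullName": "…"},
          "attendance": 0,
          "competitors": [
            {"homeAway": "home", "team": {"abbreviation": "…"}, "score": "…"},
            {"homeAway": "away", "team": {"abbreviation": "…"}, "score": "…"}
          ]
        }
      ]
    }
  ]
}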
The naive scraper
The simplest version that works:
import requests
from datetime import date, timedelta

def fetch_day(d: date) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={d.strftime('%Y%m%d')}"
    return requests.get(url, timeout=10).json()

start = date(2024, 10, 22)
end = date(2025, 6, 30)

all_games = []
d = start
while d <= end:
    data = fetch_day(d)
    for ev in data.get("events", []):
        all_games.append(ev)
    d += timedelta(days=1)
This runs in about 5 minutes for a full season. It also has every problem a production scraper needs to handle: no rate limiting, no retry, no deduplication, no error handling. Let's fix each one.
Defensive scraping
import requests, time, logging
from datetime import date, timedelta

logger = logging.getLogger(__name__)

SESSION = requests.Session()
SESSION.headers.update({"User-Agent": "MyBot/1.0 (educational)"})

def fetch_day_safe(d: date, retries: int = 3) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard?dates={d.strftime('%Y%m%d')}"
    for attempt in range(retries):
        try:
            r = SESSION.get(url, timeout=10)
            if r.status_code == 429:
                wait = int(r.headers.get("Retry-After", 2 ** attempt))
                logger.warning(f"Rate limited on {d}, waiting {wait}s")
                time.sleep(wait)
                continue
            r.raise_for_status()
            return r.json()
        except (requests.RequestException, ValueError) as e:
            logger.warning(f"Fetch failed {d} attempt {attempt}: {e}")
            time.sleep(2 ** attempt)
    return {}
Three things this version does right: it respects the Retry-After header on 429 responses, backs off exponentially on transient errors, and returns an empty dict rather than crashing after repeated failures, so one bad day does not stop the rest of the scrape.
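A quick spot check before committing to the full season loop; the date is the example from earlier in this post:

data = fetch_day_safe(date(2025, 2, 15))
print(f"{len(data.get('events', []))} games found on 2025-02-15")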
Parsing one event into a row
def parse_event(ev: dict) -> dict | None:
    try:
        comp = ev["competitions"][0]
        teams = comp["competitors"]
        home = next(t for t in teams if t["homeAway"] == "home")
        away = next(t for t in teams if t["homeAway"] == "away")
        return {
            "game_id": ev["id"],
            "game_date": ev["date"][:10],
            "home_team": home["team"]["abbreviation"],
            "away_team": away["team"]["abbreviation"],
            "home_score": int(home.get("score", 0)),
            "away_score": int(away.get("score", 0)),
            "status": ev["status"]["type"]["name"],
            "venue": comp.get("venue", {}).get("fullName", ""),
            "attendance": comp.get("attendance", 0),
        }
    except (KeyError, IndexError, StopIteration) as e:
        logger.warning(f"Parse failed for event {ev.get('id')}: {e}")
        return None
The try/except is not optional. ESPN's response shape varies slightly between regular season, playoffs, postponed games, and forfeit games. A single malformed event should not crash the entire scrape.
Deduplication
The same game can appear in two days' scoreboards if it spans midnight UTC. Postponed games can have two game_ids. Always deduplicate by game_id before persisting:
seen = set()
unique_games = []
for ev in all_games:
    parsed = parse_event(ev)
    if parsed is None:
        continue
    if parsed["game_id"] in seen:
        continue
    seen.add(parsed["game_id"])
    unique_games.append(parsed)
Politeness: rate-limit yourself
ESPN does not publish a rate limit, but their endpoints have throttled us at sustained rates above ~2 requests per second. Add a small sleep between requests:
import time

DELAY_S = 0.5  # 2 req/sec

while d <= end:
    data = fetch_day_safe(d)
    for ev in data.get("events", []):
        all_games.append(ev)
    d += timedelta(days=1)
    time.sleep(DELAY_S)
For a full season (about 250 days October to June), this adds about 2 minutes of total scrape time. Worth it to never get rate-limited.
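For reference, here is one way to assemble the pieces above into a single helper; scrape_season is our name for it, not something from ESPN or the course notebooks:

def scrape_season(start: date, end: date, delay_s: float = DELAY_S) -> list[dict]:
    # Walk every calendar day, fetch defensively, parse each event,
    # and deduplicate by game_id as we go.
    seen: set[str] = set()
    rows: list[dict] = []
    d = start
    while d <= end:
        for ev in fetch_day_safe(d).get("events", []):
            parsed = parse_event(ev)
            if parsed and parsed["game_id"] not in seen:
                seen.add(parsed["game_id"])
                rows.append(parsed)
        d += timedelta(days=1)
        time.sleep(delay_s)
    return rows

unique_games = scrape_season(date(2024, 10, 22), date(2025, 6, 30))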
Persisting to parquet
import pandas as pd
df = pd.DataFrame(unique_games)
df = df[df["status"] == "STATUS_FINAL"] # Drop in-progress and postponed
df = df.sort_values(["game_date", "game_id"]).reset_index(drop=True)
df.to_parquet("nba_games_2024_25.parquet")
print(f"Saved {len(df)} completed games")
Parquet is the right format for this dataset because it is columnar (fast to read specific columns) and compressed (small on disk). A full NBA season is around 1.5 MB on disk.
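The columnar layout pays off as soon as a downstream step needs only a few columns. A minimal sketch, assuming the file written above; the margin column is just an illustrative feature:

import pandas as pd

# Read back only the columns needed, skipping venue/attendance entirely.
scores = pd.read_parquet(
    "nba_games_2024_25.parquet",
    columns=["game_date", "home_team", "away_team", "home_score", "away_score"],
)
scores["margin"] = scores["home_score"] - scores["away_score"]
print(scores.tail())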
Going beyond scoreboard: the summary endpoint
The scoreboard gives you final scores. For modeling, you typically want play-by-play. ESPN's summary endpoint returns it:
def fetch_summary(game_id: str) -> dict:
    url = f"https://site.api.espn.com/apis/site/v2/sports/basketball/nba/summary?event={game_id}"
    return requests.get(url, timeout=15).json()

summary = fetch_summary("401705412")
plays = summary.get("plays", [])
print(f"{len(plays)} plays in this game")
Each play has period, clock, score after the play, type, and (sometimes) coordinates. This is the input to in-play win-probability models. We cover building those in Module 3.
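A sketch of flattening those plays into a table. The nested key names used here (period.number, clock.displayValue, homeScore, awayScore, type.text, coordinate) are assumptions about the summary response shape; verify them against a live response before relying on them:

import pandas as pd

def plays_to_rows(plays: list[dict]) -> pd.DataFrame:
    # Flatten nested play dicts into one row per play; missing keys become None.
    rows = []
    for p in plays:
        rows.append({
            "period": p.get("period", {}).get("number"),
            "clock": p.get("clock", {}).get("displayValue"),
            "home_score": p.get("homeScore"),
            "away_score": p.get("awayScore"),
            "play_type": p.get("type", {}).get("text"),
            "x": p.get("coordinate", {}).get("x"),
            "y": p.get("coordinate", {}).get("y"),
        })
    return pd.DataFrame(rows)

pbp = plays_to_rows(plays)
print(pbp.head())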
What can go wrong
- Time zones. ESPN's date field is UTC; some games on the West Coast cross midnight UTC and appear on two consecutive scoreboard days. Dedup catches this.
- Postponements. Postponed games sometimes get a new game_id when rescheduled. Sometimes they keep the original. Both happen. Dedup catches the repeated-id case but not the new-id case, so manual review is required for season-end completeness (a quick check follows this list).
- Cancellations. Some games are cancelled outright. Their status will be STATUS_CANCELLED. Filter by status before modeling.
- Schema changes. ESPN occasionally changes a field's name or type. We have not seen a breaking change in years, but assume it can happen and write parsers defensively.
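Before declaring the season scrape complete, a quick sanity check surfaces most of the issues above: look at the status breakdown and compare the completed-game count against the 1,230 regular-season games mentioned at the top (playoffs push the true total higher):

import pandas as pd

df = pd.DataFrame(unique_games)

# Postponed, cancelled, and in-progress games all show up here.
print(df["status"].value_counts())

# The regular season alone is 1,230 games; with playoffs included, the
# completed count should comfortably exceed that.
finals = df[df["status"] == "STATUS_FINAL"]
assert len(finals) >= 1230, f"Only {len(finals)} completed games; scrape looks incomplete"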
The bottom line
A clean NBA season scrape is a foundational dataset for any prediction model. ESPN's free JSON endpoints make it easy. The 100 lines of Python in this post turn into a parquet file you can build the rest of your pipeline on. Module 2 (feature engineering) and Module 3 (model training) both consume this file directly.
The full Polymarket Bot Course
Six Jupyter modules: ESPN scraping, Elo, win-probability models, backtesting, live bot, deployment. $49 standalone or included with every ZenHodl API plan.
Get the course