MLB Statcast Real-Time Data Pipeline

Washington University in St. Louis — Fall 2025

Built with co-author Eddy Sul for Washington University's CSE 5114 (Data Manipulation and Management at Scale) course, this project implements a lambda-architecture pipeline for MLB Statcast pitch-level data sourced via pybaseball. Apache Airflow DAGs handle historical batch ingestion into Snowflake, while a Kafka producer paired with Spark Streaming handles the simulated live stream.
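As a rough illustration of the streaming side, the sketch below replays pitch records in game order and hands each one to a sink as a JSON payload. It is not the project's actual producer: the function name `replay_pitches`, the `speed` and `base_delay` parameters, and the injected `send` and `sleep` callables are all hypothetical, and the field names assume the standard Statcast columns `game_pk`, `at_bat_number`, and `pitch_number`.

```python
import json
import time


def replay_pitches(pitches, send, speed=1.0, base_delay=1.0, sleep=time.sleep):
    """Emit pitches in game order, pausing base_delay / speed seconds between them.

    `send` is any callable that accepts bytes (hypothetical stand-in for a
    Kafka producer call). `sleep` is injectable so the pacing can be tested
    without actually waiting.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Assumed ordering key: game, then plate appearance, then pitch within it.
    ordered = sorted(
        pitches,
        key=lambda p: (p["game_pk"], p["at_bat_number"], p["pitch_number"]),
    )

    for index, pitch in enumerate(ordered):
        if index > 0:
            # A higher speed shortens the gap between consecutive pitches.
            sleep(base_delay / speed)
        send(json.dumps(pitch).encode("utf-8"))

    return len(ordered)
```

In a real deployment, `send` could wrap a Kafka client call that publishes each payload to a topic; the topic name and client configuration are not given in this summary, so they are left out here.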

The result is a Streamlit dashboard with three views: a Game Simulation mode that replays historical games pitch-by-pitch with adjustable playback speed and interactive strike-zone visualizations, a Game Explorer for pitcher analytics (velocity, spin rate, strikeout rate), and a Team Matchups view for head-to-head historical comparisons.
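The pitcher metrics named above can be sketched as a simple aggregation over pitch records. This is an assumption-laden example rather than the dashboard's actual query: `pitcher_summary` is a hypothetical helper, the field names assume the Statcast columns `pitcher`, `release_speed`, `release_spin_rate`, and `events`, missing values are assumed to be `None`, and strikeout rate is assumed to mean strikeouts per completed plate appearance.

```python
from collections import defaultdict
from statistics import mean


def pitcher_summary(pitches):
    """Return average velocity, average spin rate, and strikeout rate per pitcher."""
    by_pitcher = defaultdict(list)
    for pitch in pitches:
        by_pitcher[pitch["pitcher"]].append(pitch)

    summary = {}
    for pitcher, rows in by_pitcher.items():
        speeds = [r["release_speed"] for r in rows if r.get("release_speed") is not None]
        spins = [r["release_spin_rate"] for r in rows if r.get("release_spin_rate") is not None]

        # Assumption: `events` is set only on the pitch that ends a plate appearance.
        outcomes = [r["events"] for r in rows if r.get("events")]
        # Counts "strikeout" and variants such as "strikeout_double_play".
        strikeouts = sum(1 for e in outcomes if e.startswith("strikeout"))

        summary[pitcher] = {
            "avg_velocity_mph": mean(speeds) if speeds else None,
            "avg_spin_rpm": mean(spins) if spins else None,
            "strikeout_rate": strikeouts / len(outcomes) if outcomes else None,
        }
    return summary
```

The same grouping could be expressed as a Pandas `groupby` or a Snowflake SQL aggregate; plain Python is used here only to keep the sketch dependency-free.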

Highlights

  • Lambda architecture combining Airflow batch ingestion with Kafka/Spark streaming
  • Pitch-by-pitch historical game replay with interactive strike-zone visualization
  • Pitcher analytics: velocity, spin rate, and strikeout rate
  • Team head-to-head matchup comparisons across seasons

Technologies Applied

Python, Apache Airflow, Snowflake, Apache Kafka, Apache Spark, Streamlit, pybaseball, Pandas