MLB Statcast Real-Time Data Pipeline
Washington University in St. Louis — Fall 2025
Built with co-author Eddy Sul for Washington University's CSE 5114 (Data Manipulation and Management at Scale) course, this project implements a lambda-architecture pipeline for pitch-level MLB Statcast data sourced via pybaseball. Apache Airflow DAGs handle historical batch ingestion into Snowflake, while a Kafka producer paired with Spark Streaming handles the speed layer, which processes simulated live data.
The result is a Streamlit dashboard with three views: a Game Simulation mode that replays historical games pitch by pitch with adjustable playback speed and interactive strike-zone visualizations, a Game Explorer for pitcher analytics (velocity, spin rate, strikeout rate), and a Team Matchups view for head-to-head historical comparisons.
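As a rough illustration of how a replay with adjustable playback speed might feed the streaming side, the sketch below orders one game's pitches and emits them with a delay scaled by a speed factor. It is not the project's actual producer: the function name `replay_game`, the `send` callback (a stand-in for a Kafka producer's send call), and the `base_delay_s` parameter are all hypothetical. The `at_bat_number` and `pitch_number` keys follow the Statcast column names returned by pybaseball.

```python
import json
import time


def replay_game(pitches, send, speed=1.0, base_delay_s=2.0, sleep=time.sleep):
    """Emit pitches in game order, pausing base_delay_s / speed between them.

    `send` is a hypothetical callback standing in for a Kafka producer;
    it receives each pitch as UTF-8 encoded JSON bytes.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Statcast numbers plate appearances and the pitches within each one,
    # so sorting on both keys restores pitch-by-pitch game order.
    ordered = sorted(
        pitches, key=lambda p: (p["at_bat_number"], p["pitch_number"])
    )

    for i, pitch in enumerate(ordered):
        if i > 0:
            # A higher speed factor shortens the gap between pitches.
            sleep(base_delay_s / speed)
        send(json.dumps(pitch).encode("utf-8"))

    return len(ordered)
```

Passing `sleep` and `send` as arguments keeps the sketch testable without a running broker; in a real deployment `send` would wrap whatever producer client the pipeline uses.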
Highlights
- Lambda architecture combining Airflow batch ingestion with Kafka/Spark streaming
- Pitch-by-pitch historical game replay with interactive strike-zone visualization
- Pitcher analytics: velocity, spin rate, and strikeout rate
- Team head-to-head matchup comparisons across seasons
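
The pitcher metrics listed above could be computed along these lines. This is a minimal sketch under stated assumptions, not the dashboard's actual query: the function name `pitcher_summary` is hypothetical, pitches are represented as plain dictionaries with `None` for missing values, and strikeout rate is assumed to mean strikeouts divided by completed plate appearances, which the project description does not spell out. The keys `release_speed`, `release_spin_rate`, and `events` follow Statcast column names, where `events` is populated only on the pitch that ends a plate appearance.

```python
from statistics import mean


def pitcher_summary(pitches):
    """Summarize velocity, spin rate, and strikeout rate for one pitcher."""
    speeds = [
        p["release_speed"] for p in pitches
        if p.get("release_speed") is not None
    ]
    spins = [
        p["release_spin_rate"] for p in pitches
        if p.get("release_spin_rate") is not None
    ]
    # Only the final pitch of a plate appearance carries an event label.
    outcomes = [p["events"] for p in pitches if p.get("events")]
    # Assumption: any event starting with "strikeout" counts as a strikeout.
    strikeouts = sum(1 for e in outcomes if e.startswith("strikeout"))

    return {
        "avg_velocity_mph": mean(speeds) if speeds else None,
        "avg_spin_rpm": mean(spins) if spins else None,
        "strikeout_rate": strikeouts / len(outcomes) if outcomes else None,
    }
```

Returning `None` for an empty input avoids a division by zero when a pitcher has no completed plate appearances in the selected range.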