When I first started doing data analysis and completing projects for clients, I primarily used either SQL Server or an application called ACL. I learned data import and ETL workflows way back in the days of SQL Server’s Data Transformation Services (DTS), sometimes even creating packages programmatically. The landscape for data analysis (databases, analytical tools, visualization software, etc.) has changed significantly since those early days, and staying up to date remains an area of interest for me.
I try to keep up with these changes by completing small projects that introduce me to a new tool, application, or methodology. My list had been piling up, so I gathered some baseball data and went to work.
Among the major sports (and, arguably, all sports), baseball stands out as having the most data-centric culture among both fans and team management. One site that provides access to this wealth of baseball data is baseballsavant.com. The site is owned by MLBAM and developed and managed by Daren Willman, who works for MLB.com. The site provides a gateway into MLB’s Statcast database. Users can enter queries and filter results right from the page. More importantly for my purposes, users can download the raw data in CSV format. There are just a few caveats: (1) the size of each download is limited, so an entire season’s worth of data can’t be downloaded at once; (2) the CSV format needs some slight touch-up work, depending on how you want to use it; (3) individual records are pitch-by-pitch and contain a number of different values that are not normalized. None of these is a big issue, and all were easily dealt with.
During some recent downtime (read: while watching football), I downloaded the raw pitch-by-pitch data for the 2016 season into weekly CSV files and then concatenated them into a full season. The complete dataset is over 725,000 records. (There are more elegant, programmatic ways to download that data. I didn’t use them.) While certainly not GB-scale, a set of more than 700,000 records provides a decent population to work with.
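For anyone who wants to script the concatenation step, here is a minimal pandas sketch of one way to do it. The folder layout and the `concat_weekly_csvs` helper name are hypothetical, and the sketch assumes every weekly file shares the same header row; it is not necessarily how the files were combined for this post.

```python
from pathlib import Path

import pandas as pd


def concat_weekly_csvs(folder, pattern="*.csv"):
    """Stack every CSV in `folder` that matches `pattern` into one DataFrame.

    Assumes (hypothetically) that each weekly download lives in the same
    folder and that all files have identical column headers.
    """
    # Sort the paths so the files are read in a stable, name-based order.
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")

    # Read each weekly file into its own DataFrame.
    frames = [pd.read_csv(path) for path in paths]

    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    # instead of repeating each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

Calling `concat_weekly_csvs("statcast_2016")` on a folder of weekly downloads (the folder name is only an example) returns a single DataFrame, and `len()` of the result makes a handy control total to compare against the sum of the weekly row counts.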
I set a few goals for myself while working with this data:
- Become more fluent in Python, and specifically pandas — SQL is my native language for data analysis, and I wanted to broaden my skills
- Kick the tires of the new version 5 of the ELK stack, including Kibana
- Explore AWS’s new QuickSight product
The best thing to do to accomplish a goal is to just start. I’m prepared for this to be a fairly elementary exercise — no deep analysis or innovative findings. “Keep it simple” is the motto today.
Last fall, the aforementioned Daren Willman posted the following on Twitter:
Although Daren tweets a lot of similarly themed stats, I focused on this particular one for my exercises. Average fastball spin rate is a pretty easy calculation (if perhaps not “interesting” by itself), and I used it as both my control total and the target “analysis” to accomplish for each of my goals. I say “analysis” because it’s not really a big deal to calculate the average spin rate and then sort by pitcher. But when I’m learning new skills and tools, I’d rather replicate something simple, like recalculating an average.
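In pandas, the calculation described above might look roughly like the sketch below. The column names (`player_name`, `pitch_type`, `release_spin_rate`) and the `"FF"` four-seam fastball code are assumptions about the download's layout, so check them against your own file's header. The `avg_fastball_spin` function and its `min_pitches` cutoff are hypothetical illustrations, not the method behind the tweet.

```python
import pandas as pd


def avg_fastball_spin(
    pitches,
    pitcher_col="player_name",     # assumed column holding the pitcher
    type_col="pitch_type",         # assumed column holding the pitch code
    spin_col="release_spin_rate",  # assumed column holding the spin rate
    fastball_codes=("FF",),        # assumed code(s) that count as a fastball
    min_pitches=1,                 # hypothetical minimum sample size
):
    """Return average fastball spin rate per pitcher, highest first."""
    # Keep only the rows whose pitch type is one of the fastball codes.
    fastballs = pitches[pitches[type_col].isin(list(fastball_codes))].copy()

    # Spin may arrive as text with placeholder values, so coerce it to
    # numbers and drop the rows that cannot be parsed.
    fastballs[spin_col] = pd.to_numeric(fastballs[spin_col], errors="coerce")
    fastballs = fastballs.dropna(subset=[spin_col])

    # Count the pitches and average the spin for each pitcher.
    summary = fastballs.groupby(pitcher_col)[spin_col].agg(["count", "mean"])
    summary = summary.rename(columns={"count": "n_pitches", "mean": "avg_spin"})

    # Drop small samples, then sort so the highest average spin is on top.
    summary = summary[summary["n_pitches"] >= min_pitches]
    return summary.sort_values("avg_spin", ascending=False)
```

Passing the full-season DataFrame to `avg_fastball_spin` and calling `.head(10)` on the result gives a top-ten list to check against a published figure. Raising `min_pitches` keeps pitchers with only a handful of fastballs from crowding the top of the ranking.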
Accomplishing these three goals was a lot of fun. Gaining a working familiarity with more technologies is always beneficial, and I feel like I did that with these three. I also really liked working with the baseball data set. It offers a variety of data attributes and possible analyses, and a volume of records that, while not huge, provides at least a small degree of real-world relevance.
In the near future, I hope to take a look at Amazon Athena and dig more into Timelion, which is now integrated into Kibana 5.