Statistical Analysis of the CrossFit Games Open

My personal complaint with most athletic training programs is how little they’re based on real science.  We have tons of anecdotal information but not enough real studies to say for a fact what training program will or won’t work.

With over 200k participants, the 2013 CrossFit Games Open gives us an amazing opportunity to analyze athletic performance, especially as it relates to CrossFit. If you take a look at the CrossFit Games website, we have a huge sample of people who have all performed the same athletic events to exacting standards. These people vary in age, ability, sex, place, strength and more.

But besides the controlled experiment of the five workouts themselves, these people have self-reported a large amount of data about themselves. Here’s an example: http://games.crossfit.com/athlete/12764 (Yes, that’s me.) I have provided my age, my max lifts, scores in certain benchmark workouts, how I eat, my sex, place, etc. Across tens of thousands of people, we have a pretty powerful data set.

Purpose

So, I set out to do some science to try to answer some fundamental questions about human performance and CrossFit:

  • What is the ideal height and/or weight for CrossFit?
  • How does athletic performance degrade over time and vary with age?
  • What other exercises or workouts correlate most strongly with better CrossFit performance? Many people believe it is a bigger back squat. Others claim it is the snatch. There is a clear “Church” of Olympic lifting in the CrossFit community that says that people who Olympic lift more will do better at CrossFit. Is that true?
  • Can a formula be developed that would project CrossFit performance based on a combination of the most common strength lifts and the most common CrossFit benchmark workouts?
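
To make the last two questions concrete, here is a rough sketch of what "correlates with" and "a formula" could mean in code. It is not the analysis I ran: it only uses one predictor (a single lift) instead of a combination, the function names are made up for illustration, and the numbers in the usage example are toy values, not data from the Open.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares fit of: score = intercept + slope * lift."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Toy numbers only (NOT real athletes): a lift and a workout score.
    lifts = [1.0, 2.0, 3.0]
    scores = [3.0, 5.0, 7.0]
    print("correlation:", pearson(lifts, scores))
    print("slope, intercept:", fit_line(lifts, scores))
```

A correlation near 1 or -1 would suggest that lift tracks closely with Open performance; a real projection formula would need several predictors at once, which is a job for a proper statistics tool.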

Data Gathering

First up was assembling the data. Using a series of Python scripts, I collected every athlete’s profile, removed their names and then associated their scores for all five weeks of the Open. That resulted in a comma-separated values file that I could then load into the different data analysis tools out there. This was really my first attempt at Python after a ten-year hiatus from programming (life, by the way, has gotten ridiculously easy for programmers!).
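
For the curious, the join-and-anonymize step looks roughly like the sketch below. This is not my actual script: the field names (athlete_id, wk1 through wk5, and so on) are assumptions made up for illustration, and it takes for granted that the profiles and scores have already been collected into Python dictionaries.

```python
import csv
import io

# Hypothetical column names; the real profile fields are not listed in this post.
PROFILE_FIELDS = ["athlete_id", "age", "sex", "height", "weight", "back_squat", "snatch"]
SCORE_FIELDS = ["wk1", "wk2", "wk3", "wk4", "wk5"]


def build_rows(profiles, scores_by_athlete):
    """Drop names from each profile and attach that athlete's five Open scores."""
    rows = []
    for profile in profiles:
        # Copy only the whitelisted fields, so "name" never reaches the output.
        row = {field: profile.get(field, "") for field in PROFILE_FIELDS}
        weekly = scores_by_athlete.get(profile["athlete_id"], {})
        for field in SCORE_FIELDS:
            row[field] = weekly.get(field, "")  # blank if a week is missing
        rows.append(row)
    return rows


def write_csv(rows, out):
    """Write the rows to an open text file (or any file-like object) as CSV."""
    writer = csv.DictWriter(out, fieldnames=PROFILE_FIELDS + SCORE_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up example athlete, not real data.
    profiles = [{"athlete_id": 1, "name": "Example Athlete", "age": 30, "sex": "M"}]
    scores = {1: {"wk1": 150, "wk2": 200}}
    buffer = io.StringIO()
    write_csv(build_rows(profiles, scores), buffer)
    print(buffer.getvalue())
```

Copying only a fixed list of fields, rather than deleting the name afterwards, means any other identifying field that shows up in a profile is dropped by default.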

Analysis

The second step was to analyze the data using a variety of tools. I ultimately used a combination of Excel, R, an online service called StatWing, and a startup called DataRobot (still in stealth mode). For full disclosure, my data analysis training consists of one semester of undergraduate statistics. It also involves a lot of looking up things on Wikipedia 🙂

An expert in statistical analysis can clearly do some fancier stuff with this data, or even find where I have made a grievous error. Totally cool! If you have ability in data analysis, send me an email and I will work on getting you the data so you can do more. I’m not doing this for profit; I’m hoping we can learn more from the data and train better.

Michael Girdley