Analyzing 538's March Madness Win Probabilities with Alteryx and Tableau
This post is a run-through of the approach and some of the techniques used to build the long-form story visualization embedded at the bottom of this post.
Data Source:
First and foremost, the data source for this viz (as well as a great deal of inspiration) is the FiveThirtyEight (538) March Madness Predictions site (one of my favorites each year). You will note that the in-game probability line chart is also a re-make of 538’s work and not my original design.
I leveraged Alteryx to pull and parse the data for the analysis (see my workflow below). It makes things so easy that I had to add this note to the post. At one point in the project I decided to tweak something, and it took only a matter of minutes to pull up my flow, add or remove a few tools, and generate an updated set of files for Tableau. Everything just worked.
Story:
The concept for the story comes from visually comparing these two games (and several others like them). As I followed the early rounds of the 2017 tournament and tracked the games via 538’s predictions site, it seemed, more often than not, that Men’s tournament games were closer than the Women’s.
As I continued to watch the tournaments and eyeball the difference in games between them, this often seemed to be the case. Once I was able to find the data for the 538 charts, I had pretty much sold myself on doing a comparison between the two tournaments to see what the actual numbers would tell us.
Viz:
After a few iterations, I realized that I needed to walk through how I was taking the detailed win prediction data and then aggregating it. This turned into the approach section at the top of the viz, shown below:
Here we take a specific game's win probability by team and create four different views from this one set of data.
- A line chart showing the two teams' win probabilities throughout the game, a re-make of the 538 charts.
- Gantt bars showing the difference between the two win probabilities. Three easy steps in Tableau:
  - Change the mark type to Gantt
  - Remove the secondary axis pill
  - Drag axis to Size and change the pill from team1_prob to -(team1_prob-team2_prob)
- A bar chart showing the absolute difference with the baseline at zero. Three steps:
  - Change the mark type to Bars
  - Remove the secondary axis pill
  - Change the pill to ABS(team1_prob-team2_prob)
- A bar chart showing the above transformed into a close game index. One step:
  - Change the pill to 10-(10*ABS(team1_prob-team2_prob))
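For anyone who wants to sanity-check these values outside Tableau, here is a minimal Python sketch of the same arithmetic. It assumes the probabilities are on a 0 to 1 scale (which is what makes the index run from 0 to 10); the function name `game_views` and the output field names are hypothetical and not part of the workbook.

```python
def game_views(team1_prob, team2_prob):
    """Return, per time step, the values behind the four views of one game.

    Both inputs are equal-length sequences of win probabilities on a 0-1 scale
    (an assumption; rescale first if the data is stored as 0-100).
    """
    rows = []
    for p1, p2 in zip(team1_prob, team2_prob):
        diff = p1 - p2
        abs_diff = abs(diff)
        rows.append({
            "team1_prob": p1,                           # line chart, team 1
            "team2_prob": p2,                           # line chart, team 2
            "gantt_size": -diff,                        # -(team1_prob-team2_prob)
            "abs_diff": abs_diff,                       # ABS(team1_prob-team2_prob)
            "close_game_index": 10 - (10 * abs_diff),   # 10 = toss-up, 0 = decided
        })
    return rows


if __name__ == "__main__":
    for row in game_views([0.5, 0.75, 1.0], [0.5, 0.25, 0.0]):
        print(row)
```

A dead-even moment scores 10 on the index and a fully decided game scores 0, so higher values mean a closer game.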
Bracket:
I wanted to take a look at seeding and how it potentially impacts how close games are. For example, the 1 seed vs 16 seed games will not be as closely contested as the 8 seed vs 9 seed games. I settled on trying to build a heat map view across the bracket, with seeding intact. This would allow the user to follow seeding from the first round to the final four. For example, we can see the Women’s 1 seeded teams appear to dominate (i.e., the close game index is very low) most of their games until the elite eight.
There are a lot of different ways you can go about building a bracket in Tableau and I am definitely not the first one to do it. For example, here is another March Madness viz that Corey Jones just shared a few days ago. One thing that occurred to me is that a bracket is a fixed ancestor tree. Each child has two parents, and that is not going to change: two teams play, one loses, one wins. Back when I built my family tree in Tableau, I worked out the logic for the ancestor tree view. I could take that piece of work to create a bracket, then just map the results to the view. If I went that route (and didn’t change any formatting), the result would have looked like this.
Ultimately, I decided to go for a supporting tree structure that was closer to the brackets you are used to seeing (if you happen to follow either one of the tournaments). Here is where I landed with the bracket “heat map”.
This viz is not dynamic (but it could be!). It is built specifically for this bracket, with 32 nodes; the math was pre-calculated and the node coordinates were fed into Tableau. The main reasons behind this? (1) It was easier to build it this way and (2) I don’t have any requirements to change this structure.
The bracket is a dual-axis chart with one axis for the bracket lines and a second for the hexagon points. The hexagon axis leverages an IF statement which only returns a value for nodes on the coordinates where we want them displayed. This field was also used to place the labels on the sheets above the brackets.
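As an illustration of what pre-calculating node coordinates can look like, here is a small Python sketch. It is not the math used for this viz: it assumes a generic single-elimination tree in which each node sits at the midpoint of the two nodes feeding it, and the function name `bracket_coordinates`, the `n_leaves` parameter, and the output fields are all hypothetical.

```python
def bracket_coordinates(n_leaves=16):
    """Precompute one (round, slot, x, y) record per node of a bracket tree.

    Round 0 holds the starting slots; each later round places a node at the
    vertical midpoint of the two nodes that feed into it.
    """
    if n_leaves < 1 or n_leaves & (n_leaves - 1):
        raise ValueError("n_leaves must be a power of two")

    nodes = []
    ys = [float(i) for i in range(n_leaves)]  # evenly spaced starting slots
    round_no = 0
    while True:
        for slot, y in enumerate(ys):
            nodes.append({"round": round_no, "slot": slot,
                          "x": float(round_no), "y": y})
        if len(ys) == 1:  # reached the root of the tree
            break
        # Each pair of neighbours feeds one node in the next round.
        ys = [(ys[i] + ys[i + 1]) / 2 for i in range(0, len(ys), 2)]
        round_no += 1
    return nodes


if __name__ == "__main__":
    for node in bracket_coordinates(4):
        print(node)
```

The resulting rows could be written to a file and joined to the game results, with the x and y fields driving the two axes.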
Trellis Panel:
You already know that I am a big fan of 538’s win probability line chart. I also wanted to have a chart for the teams’ scores, as the game was often much closer in score than in win probability. For example, you may have a heavy favorite down a few points early, but still have a win probability over 80% (with good reason). I think it is pretty interesting to compare the score of the game to the win prediction 538 calculated, so I really wanted to get the two graphs side by side.
Here I ran into some trouble with the data structure I had, trying to get the two graphs into my dynamic trellis and next to one another horizontally. This is the resulting view I came up with, and below I walk through some steps on how to build it.
I realized when trying to force pills onto columns that I was going to need to do this all using a single pill. This is by no means a deal breaker, but it does add some complexity. Then the same thing happened to me on the rows shelf: I realized I needed a single scale for the two charts to work off.
For the columns issue, I took a page out of Noah Salvaterra’s book from his Chord Diagram post. I decided to create an additional “copy” of the data (think union of your data), allowing me to create more layers in the single pill. Noah demonstrates having 5 layers in his post; I only need 2, so it is a much simpler piece of work for me. Here is an overview of how I got this to work and the corresponding formula below (hardcoded for now based on my specific y-axis scale, but doesn’t need to be).
Now that I have this done, I am left with the issue of rows needing to be on the same scale. For this I used a percentage transformation (especially since win probability was already a percentage). The downside here is that you cannot show the axis to the user; you would need to use either mark labels or reference lines to assist with reading your graph. The upside is that vastly different scales can be displayed next to each other using the same axis.
Here are the calculations I leveraged, the steps are:
- Calculate the maximum game score for team 1 and team 2 across the whole data (via Fixed LOD).
- Find the maximum value between the two results above.
- Use that maximum value as the denominator for the percentage transformation.
- If we are in copy 1, use the probability; if we are in copy 2, use the percent-transformed score.
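The four steps above can be sketched in plain Python as follows. This is only an approximation of the Tableau logic: a global `max` stands in for the Fixed LOD, probabilities are assumed to be on a 0 to 1 scale, and the function name `percent_transform` and the field names (`time`, `team1_score`, and so on) are hypothetical.

```python
def percent_transform(rows):
    """Duplicate the data into two copies that share one 0-1 value scale.

    rows: list of dicts with keys time, team1_prob, team2_prob,
    team1_score, team2_score (hypothetical field names).
    """
    # Step 1: maximum score for each team across the whole data set.
    max_team1 = max(r["team1_score"] for r in rows)
    max_team2 = max(r["team2_score"] for r in rows)
    # Steps 2 and 3: the larger of the two becomes the denominator.
    denominator = max(max_team1, max_team2) or 1  # avoid dividing by zero

    out = []
    for copy in (1, 2):  # the "union" of the data with itself
        for r in rows:
            if copy == 1:
                # Step 4, copy 1: win probability is already a percentage.
                v1, v2 = r["team1_prob"], r["team2_prob"]
            else:
                # Step 4, copy 2: score as a share of the highest score seen.
                v1 = r["team1_score"] / denominator
                v2 = r["team2_score"] / denominator
            out.append({"copy": copy, "time": r["time"],
                        "team1_value": v1, "team2_value": v2})
    return out


if __name__ == "__main__":
    sample = [
        {"time": 0, "team1_prob": 0.5, "team2_prob": 0.5,
         "team1_score": 0, "team2_score": 0},
        {"time": 1, "team1_prob": 0.75, "team2_prob": 0.25,
         "team1_score": 50, "team2_score": 25},
    ]
    for row in percent_transform(sample):
        print(row)
```

Because both copies end up between 0 and 1, they can share a single axis, at the cost of that axis no longer reading as points scored.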
One Big Improvement:
One of the things that bugs me about the view is that the reader has to scroll and transition themselves from one graph to another. This is a situation where a little animation could have gone a long way. Imagine the approach section as a single view with transitions from one step to the next, ultimately transitioning into the summary below… More to come… possibly...
Lastly, thanks to Anya A'Hearn and Corey Jones for the feedback they provided on iterations of the visualization below.