Making picks using statistically similar games

Ok, I'll admit it. I like to follow lines, totals, and results - for entertainment purposes only, of course (insert obvious wink).

It's not like the Worldwide Leader needs the bump, but they have a pretty comprehensive pick center for college football. It's essentially a dashboard of info centered around the gambling lines. While I personally don't place much value on the pick center info, buried towards the bottom of a single game's dashboard is a table listing historical games that are the most statistically similar. That table has always intrigued me, so I've decided to recreate it.

An important part of this exercise is deciding which statistics to use for determining game similarity. Frequent visitors to this site are well aware of the importance of the Five Factors, so I'll use offensive and defensive statistics representing each of the five (PpP for Explosiveness, success rate for Efficiency, average drive start position for Field Position, points per trip inside the 40 for Finishing Drives, and havoc rate for Turnovers). Casinos and sportsbooks don't tend to go bankrupt (unless they're owned by Donald Trump), because Vegas knows waaaaay more than the betting public. Therefore, in addition to the O and D stats I'm using to represent the Five Factors, I'll include the home line and the over under total in the statistics used for determining game similarity.

How do we tackle this? Let's start with the data. The dataset I'm using consists of all games between FBS teams (aren't both divisions "FCS" at this point?) from week 4 through the bowls in every season from 2005 to 2014. For a game in week 8, the Five Factor statistics will be represented by opponent-adjusted statistics through week 7 of the same season. The home spread lines and over under totals are typically near closing values. I'm starting at week 4 games because every team has played a game by that point in the season.

Now for each game we'll need to calculate its "similarity" to all the other games in the dataset. For this, we'll use the initial step of performing a hierarchical clustering: calculating a distance matrix. The distance matrix contains the calculated distance of every game from every other game. The distance is calculated using Euclidean Distance. Without getting too mathy, I think I can explain Euclidean Distance: in one dimension, the Euclidean Distance is the absolute value of the difference between two numbers; in two dimensions, it's the length of a straight line between two points. Our Euclidean Distance will be calculated on 22 dimensions - we're using 22 statistics (home line, over under total, 5 Home Offense Five Factor stats, 5 Home Defense FF stats, 5 Away Offense FF stats, and 5 Away Defense FF stats).
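For anyone who wants to see the arithmetic, here is a minimal Python sketch of Euclidean Distance and a distance matrix. It is not the code behind this post; the function names `euclidean` and `distance_matrix` are made up for illustration, and the same function works whether a game has 2 statistics or 22.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(games):
    """Symmetric matrix where entry [i][j] is the distance between game i and game j."""
    n = len(games)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(games[i], games[j])
            dist[i][j] = dist[j][i] = d  # distance is the same in both directions
    return dist

# One dimension: the absolute difference between two numbers.
print(euclidean([3.0], [7.5]))            # 4.5
# Two dimensions: the length of the line between two points (a 3-4-5 triangle).
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Each game would simply be a list of its 22 statistics, and the matrix has one row and one column per game.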

Next, we need to consider the scale of the variables/statistics being included. Calculating the distance on the raw variable values will give more weight to variables on higher scales. To correct this, I've normalized most variables - each normalized variable is scaled so that the scaled values have a mean of 0 and standard deviation of 1. To illustrate, a team's success rate always falls between 0 and 1, so two games can differ by at most 1 on that variable, while average drive start positions can theoretically differ by up to 100. Using un-normalized values would give much more weight to average drive start position. Because Vegas is always right, I didn't normalize the home line. I DID normalize the over under total for a couple of reasons (if you're terribly interested in the reasons why, ask in the comments or ask me on twitter @RadDad_17).
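To make the scaling issue concrete, here is a small Python sketch of the mean-0, standard-deviation-1 rescaling. The numbers are made up for illustration (they are not real team stats), `standardize` is a hypothetical helper, and using the population standard deviation is my assumption; a sample standard deviation would work just as well for this purpose.

```python
import statistics

def standardize(values):
    """Rescale values so they have mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD: an assumption, see lead-in
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]

# Made-up values for three games, for illustration only.
success_rate = [0.38, 0.42, 0.46]  # lives on a 0-1 scale
drive_start = [26.0, 30.0, 34.0]   # lives on a 0-100 scale

# Raw gaps between the first and last game: drive start dwarfs success rate.
raw_gap_sr = abs(success_rate[0] - success_rate[2])
raw_gap_ds = abs(drive_start[0] - drive_start[2])
print(raw_gap_sr, raw_gap_ds)

# After standardizing, both variables contribute on the same footing.
z_sr = standardize(success_rate)
z_ds = standardize(drive_start)
print(abs(z_sr[0] - z_sr[2]), abs(z_ds[0] - z_ds[2]))
```

In this toy case both columns are evenly spread, so their standardized gaps come out equal even though the raw gaps differ by a factor of 100.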

Now that our data is in order and the game distances are calculated on mostly normalized data - pick a game. Its most "similar" games are the games in the dataset that are the shortest distance from the selected game. Here are some examples.
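The lookup itself is just a sort by distance. Here is a rough Python sketch, assuming each game is stored as a vector of already-scaled statistics keyed by some game id; the ids, vectors, and the `most_similar` name are all hypothetical.

```python
import math

def most_similar(target_id, games, k=5):
    """Return the k (distance, game_id) pairs closest to the target game."""
    target = games[target_id]
    scored = [
        (math.dist(target, vec), game_id)  # Euclidean distance to the target
        for game_id, vec in games.items()
        if game_id != target_id            # a game is not its own neighbor
    ]
    return sorted(scored)[:k]              # shortest distances first

# Toy two-statistic "games", purely for illustration.
games = {
    "A": [0.0, 0.0],
    "B": [1.0, 0.0],
    "C": [0.0, 3.0],
    "D": [5.0, 5.0],
}
print(most_similar("A", games, k=2))  # [(1.0, 'B'), (3.0, 'C')]
```

With a precomputed distance matrix you would sort one row instead of recomputing the distances, but the idea is the same.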

Let's start with a game between elite teams - last year's national semifinal game between Ohio State and Alabama. As a reminder, here was the outcome:

And the top 5 similar games:

Looking at these games, they do seem really similar. Almost all of them are between highly-ranked teams in bowl games. Four out of five of the similar games matched the actual game on all three results: the winning team, the team that won against the spread, and the over under total.

For comparison, let's also look at similar games between some bottom-feeders. Here are the similar games for Kent State @ Miami (Ohio) in Week 9 of last year:

And the top 5 similar games:

Again, these games are pretty similar and mostly contested by bad teams. For this example, all five of the similar games agree with the winning team outcome and four out of five games agree with the winning team against the spread.
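One simple way to turn a top 5 list like the ones above into a single pick is a majority vote. To be clear, this is an assumed rule for illustration and not necessarily how the similar games are scored anywhere in this post; `majority_pick` and the outcome labels are made up.

```python
from collections import Counter

def majority_pick(outcomes):
    """Return the most common outcome among the similar games, or None on a tie."""
    counts = Counter(outcomes).most_common()
    if not counts:
        return None  # no similar games, no pick
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two outcomes are tied, so sit this one out
    return counts[0][0]

# Hypothetical against-the-spread results for five similar games.
ats_results = ["home", "home", "away", "home", "home"]
print(majority_pick(ats_results))  # home
```

The same function works for over under results by passing labels like "over" and "under" instead.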

If we cherry picked and just used the two examples above, we'd think we could all go running to our nearest sportsbook, or closest friend named "Vinny"... and over the long haul be swimming in gold coins like Scrooge McDuck, right? Sadly, no. I ran the numbers for all games in the dataset and the top 5 similar games will predict the correct outcomes against the spread and on over under totals about 53% of the time. Better than half, but that's right around the roughly 52.4% you need just to break even at standard -110 odds, so there's no real cushion left once you figure in the vigorish.
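For reference, the break-even bar can be worked out directly from the odds. Here is a quick Python sketch, assuming standard -110 pricing (risk 110 to win 100); the post's numbers don't specify the odds, so that's my assumption, and `break_even_win_rate` is a hypothetical helper.

```python
def break_even_win_rate(american_odds):
    """Win probability needed to break even at the given American odds."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        risk, win = -american_odds, 100  # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds   # e.g. +150: risk 100 to win 150
    # Break even when p * win == (1 - p) * risk, so p = risk / (risk + win).
    return risk / (risk + win)

print(round(break_even_win_rate(-110), 4))  # 0.5238
```

At those odds the bar is about 52.4%, so a hit rate of "about 53%" sits very close to it, and pushes or slightly worse pricing can easily eat up the difference.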

However, I still think this is a really useful tool. I've deployed a Shiny app that will let you select any game in the dataset and see its top 200 similar games - view it here. There are two tabs that show the most similar games - one using normalized over under totals and one using raw totals. Once we hit week 4 of the current season, I will update the app weekly to include upcoming games. I'll try to have it updated on Tuesdays - check back in Week 4 or 5 and use it as an aid for making your own picks (for entertainment purposes only, of course).

(This is not the first time I've written about predicting games using the Five Factors)