In my previous posts I looked at the relationship between sack rates and margin of victory, down and distance, and team passing statistics. I now want to determine the best way to predict in-season sack rates based only on what a team has done up to that point.
Once again, I used the play-by-play data from CFB Stats to build my passing play data set. I filtered for non-garbage plays and only included games between FBS opponents from 2005-12. I wanted to be able to split the data into first-half and second-half statistics, but the game files do not include a "game number" that tells you if the game was a team's 1st, 2nd, 3rd, etc. They do, however, include a date for the game, and I managed to use R to calculate the week of the year in which each game occurred. I used the first reported week of the year as "Week 1" of the season, and then worked from there. Any games that occurred in December or January I left just as "Post-Season" games.
Somehow, this all managed to work out to a 15-week season. This left us with 134,974 plays in the first half of the seasons and 167,449 plays in the second half of the seasons. This discrepancy probably comes from the fact that a lot of teams have bye weeks early, and there are also more FCS games (which are excluded here) in the first half of the season. I then calculated multiple statistics for each team's half-season in each year that would help us predict sack rates.
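For anyone curious, the week-numbering step described above might look something like the sketch below. I did the real work in R, so this Python version is only an illustration: the function name `season_week`, the use of ISO week numbers, and the example dates are all assumptions, not the actual code or data behind this post.

```python
from datetime import date


def season_week(game_date, first_game_date):
    """Label a game with its season week, or "Post-Season" for Dec/Jan games.

    Hypothetical helper: assumes ISO week-of-year numbering, which may not
    match the week function used in the original R analysis.
    """
    # Games in December or January are lumped together as post-season.
    if game_date.month in (12, 1):
        return "Post-Season"
    # Week of the year for this game and for the first reported game.
    game_week = game_date.isocalendar()[1]
    first_week = first_game_date.isocalendar()[1]
    # The first reported week of the year becomes Week 1 of the season.
    return game_week - first_week + 1


# Illustrative dates only (a Thursday opener and later Saturdays).
opener = date(2009, 9, 3)
for d in (date(2009, 9, 5), date(2009, 9, 12), date(2009, 12, 5)):
    print(d, season_week(d, opener))
```

Because ISO weeks run Monday through Sunday, a Thursday game and a Saturday game from the same weekend land in the same season week, which is the behavior you want here.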
Regression to the Mean
Before I even get into calculating correlations and running regressions and doing all that other fun stuff, I wanted to introduce the topic of regression to the mean. Regression to the mean tells us that any team that is performing greatly above average will probably perform at a level between their current production and an average team in the future. The same thing applies to teams performing terribly.
Essentially, it's tough to be really, really bad or really, really good over an extended period of time, and when predicting the future performance of something we should add some level of league average performance to our expectations.
But how much league average performance? This excellent, math-heavy and nerdy post from Phil Birnbaum (an excellent sports analytics guy for any aspiring fans out there) goes into the math to tell us just how much league average performance to add to any statistic to best predict future performance. The next few paragraphs go into the math behind it, so feel free to skip them if you don't care at all.
(Before I get started, I want to say that if I miss anything or do something wrong with the math, my bad. Please let me know in the comments if you have any suggestions for improvement.)
For any binomially distributed statistic, the observed variance can be expressed as the sum of the true variance and the binomial variance (the variance from a natural binomial process), or mathematically as Obs_Var = True_Var + Bin_Var. We can model a team's sack rate as a binomial process if we treat each of the team's passing plays as a trial with some probability of ending in a sack and a complementary probability of not ending in a sack.
The average sack rate for all teams from 2005-12 was 6.139%, so this is the probability of an event occurring that we can use for our binomial distribution. The binomial variance can be expressed as the probability of the event occurring times the complementary probability, all divided by the number of trials, expressed mathematically as p*(1-p)/n.
For the number of trials I used the average passing plays for all team-seasons in my data set, 335. The variance occurring from a binomial distribution with p = .06139 and n = 335 is .000172. The observed variance of sack rates for all team-seasons was .000638. This means the true sack rate variance is .000638 - .000172 = .000466. In Phil Birnbaum's post, he showed that the number of league average plays to add to any variable is equal to p*(1-p)/True_Var. This is the case for any binomial distribution. Because we know the true variance of a team's sack rate is .000466, we simply calculate .06139*(1-.06139)/.000466 to find that we should be adding 124 plays of league average sack rates to any past sack rates in order to accurately predict future sack rates. At 6.139% this would mean adding eight sacks (124 * .06139 is about 7.6, which I rounded up) and 124 drop-backs to any team's totals. Wasn't that fun??
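If you want to check the arithmetic yourself, here is a minimal Python sketch that reruns it using only the league-wide figures quoted above. The variable names are mine and purely illustrative.

```python
# League-wide inputs quoted in this post (2005-12 team-seasons).
p = 0.06139              # league-average sack rate
n = 335                  # average passing plays per team-season
observed_var = 0.000638  # observed variance of team-season sack rates

# Variance expected from binomial luck alone: p * (1 - p) / n.
binomial_var = p * (1 - p) / n

# Obs_Var = True_Var + Bin_Var, so the true variance is what is left over.
true_var = observed_var - binomial_var

# League-average plays to add: p * (1 - p) / True_Var.
plays_to_add = p * (1 - p) / true_var
# Sacks carried by those plays at the league-average rate.
sacks_to_add = plays_to_add * p

print(round(binomial_var, 6), round(true_var, 6))
print(round(plays_to_add), round(sacks_to_add))
```

The unrounded values come out a little under 124 plays and a little under 8 sacks, so the whole numbers used in the rest of the post are rounded up.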
Predicting Future Sack Rates
Taking the data I described earlier, I calculated each team's 1st half and 2nd half sacks and pass attempts for each season. For example, in 2009 Georgia Tech had 9 sacks on 77 drop-backs in the first half of the season, and 3 sacks on 69 drop-backs in the second half. In order to regress their first half performance to the mean, I added 124 drop-backs of league average sack performance. Their sack rate went from 9/77 = .117 to (9+8)/(77+124) = .0846. Because Georgia Tech is an option team that doesn't throw very much, the league average rate we added to their performance is a huge factor. For teams that throw a ton, it counts for less in their overall average. The key is that we add the same amount of league average performance to each team's sack rate, no matter how many attempts they have.
With this we can run some simple correlation studies on how well we can predict a team's sack rate in the 2nd half of the season based only on the information we have from the first half of the season. The correlation between a team's 1st half and 2nd half sack rates? .44. The correlation between a team's 1st half regressed sack rate and their 2nd half sack rate? .444. So we are splitting hairs here. Predictability increases, but only slightly. There are obviously other factors at play here.
What about when we look at other passing information? Here is a table listing the correlation between different 1st half passing statistics that I looked at in my previous post and that same season's 2nd half sack rate:
| 1st Half Statistic | Correlation with 2nd Half Sack Rate |
|---|---|
| Yards per Completion | .067 |
| Adj. Net Yards/Attempt | -.127 |
| Regressed Sack Rate | .444 |
I ran a couple of different regressions with these variables and also interaction terms and there was really no significant change from just using regressed sack rate. A chart showing the correlation among these variables can be seen at this image link.
And again, because plots are the best, here is a visual representation of the difference between using Sack Rate and Regressed Sack Rate to predict a team's sack rate in the 2nd half of the season:
As you can see, using regression to the mean restricts the possible values for your 1st half regressed sack rate. We don't have teams with a sack rate of 0.0%, and instead have many teams clustered around average. But with so much variation in sack rates I think this is okay.
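To make that squeezing effect concrete, here is a small Python sketch of the regressed sack rate. The Georgia Tech numbers are the ones from earlier in this post; the other two teams are hypothetical and exist only to show how the same 124 drop-backs and 8 sacks of league average play affect different workloads. The function name is my own invention for illustration.

```python
LEAGUE_PLAYS = 124  # league-average drop-backs added to every team
LEAGUE_SACKS = 8    # sacks those drop-backs carry at roughly 6.139%


def regressed_sack_rate(sacks, dropbacks):
    """Sack rate after adding the same league-average ballast to any team."""
    return (sacks + LEAGUE_SACKS) / (dropbacks + LEAGUE_PLAYS)


# 2009 Georgia Tech, first half of the season (figures from this post).
gt_raw = 9 / 77
gt_regressed = regressed_sack_rate(9, 77)

# Hypothetical pass-heavy team with nearly the same raw rate (35 / 300).
heavy_raw = 35 / 300
heavy_regressed = regressed_sack_rate(35, 300)

# Hypothetical team that has not allowed a sack yet.
clean_regressed = regressed_sack_rate(0, 150)

print(f"Georgia Tech: {gt_raw:.4f} -> {gt_regressed:.4f}")
print(f"Pass-heavy:   {heavy_raw:.4f} -> {heavy_regressed:.4f}")
print(f"No sacks:     0.0000 -> {clean_regressed:.4f}")
```

The low-volume team gets pulled much closer to the league average than the pass-heavy team, and the team with zero sacks no longer sits at 0.0%.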
There is one last piece to this puzzle.
The last thing I will test that could impact sack rate is the set of opponents a team faces. You would expect a team to get sacked more against a tough schedule full of dominating defensive lines, but can we measure this effect? To keep things simple, I calculated the average sack rate forced by the defenses that an offense faced in the first half of the season. I then calculated how many sacks above or below expected each offense allowed in the 1st half of each season. Take it as you will. We are only dealing with six games in the sample, and I am only doing one "loop" of opponent adjustments, so there is plenty of room for variation. I'd love to hear some strategies for improving this section of the analysis.
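Here is a rough Python sketch of how a single-pass opponent adjustment like this could be computed. Treat it as one plausible reading rather than my exact calculation: it assumes the expected sack rate is the simple unweighted mean of the opponents' sack rates forced, and the function name and all of the numbers are made up for illustration.

```python
def sacks_over_expected(sacks_allowed, dropbacks, opponent_rates):
    """Return (sacks over expected, sacks over expected per play).

    Assumption: the expected rate is the unweighted mean of the sack rates
    forced by each defense the offense faced (one "loop" of adjustment).
    """
    expected_rate = sum(opponent_rates) / len(opponent_rates)
    expected_sacks = expected_rate * dropbacks
    over = sacks_allowed - expected_sacks
    # Positive values mean the offense was sacked more than expected.
    return over, over / dropbacks


# Hypothetical offense: 15 sacks on 200 drop-backs against six defenses.
rates_faced = [0.05, 0.07, 0.06, 0.08, 0.04, 0.06]
total, per_play = sacks_over_expected(15, 200, rates_faced)
print(round(total, 2), round(per_play, 4))
```

Note that each opponent's sack rate forced includes the games it played against the offense in question, which is why the two measures end up correlated with each other.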
After doing this, I had a "Sacks Over Expected" and "Sacks over Expected per Play" measure for each offense in the first half of each season in my data set. Before we get into any predictions with this information I want to see how it relates to some other stats. Ah hell, let's just do it all in one graph.
This plot shows an offense's Opponent Average Sack Rate Forced vs other statistics about that offense: their sack rate in the first half of the season, their sack rate in the 2nd half of the season, their Sacks over Expected measure for the first half of the season, and their Sacks over Expected per Play (positive values for these measures mean you got sacked more than you should have).
- Since your opponents' average sack rate forced is partly a product of your own sack performance, you would expect these two variables to be correlated. Your performance should account for about 20% of the variation in your opponents' defensive rating (1 game out of 5 or so), but it actually explains about 29.6% (the correlation of roughly .54, squared).
- Your first half opponents' defensive rating doesn't tell you much about your sack rate in the 2nd half of the season, with a correlation of about half that between 1st half sack rate and 2nd half sack rate.
- The Sacks over Expected is just a raw total, so it will also be influenced by the number of passes a team attempts. Both that measure and Sacks over Expected on a per play basis are only loosely correlated with your opponents' average sack rate forced. This makes sense, I think; you should be able to perform better than expected regardless of the quality of your opponent.
- When I ran a regression combining first half regressed sack rate and quality of opponent faced, I got basically the exact same result as using just regressed sack rate. So once again, all this really doesn't matter.
I think it's about time to wrap this one up, don't you? What have we learned through all of this? The best single predictor of a team's future sack rate is their current sack rate, regressed to the league average performance. Other models with more predictors perform only slightly better than just regressed 1st half sack rate. Even taking into account the rate at which the defenses you faced forced sacks does not meaningfully improve your forecasts. Here are some other takeaways from this series of posts:
- Down and Distance do impact your team's sack rate on a per play basis, but the effect is so small it doesn't show up much at an overall team level.
- Various passing statistics correlate with your team's sack rate, but none can meaningfully improve upon using your own team's 1st half sack rate to predict your future sack rate performance.
- Obviously there are a ton of possible improvements on this analysis. Just to name a few, I could have better defined a team's 1st half and 2nd half using the actual series of games instead of just the weeks the games occurred, done proper opponent adjustments, used a true measure of defensive talent instead of just average sack rate forced, and predicted future sack rate based on the opponents you will be facing. I am sure there are many, many others. But I am at 2000 words and I think it's time to quit.