My first post looked at sack rate's impact on the point differential for a team. In this post I wanted to take a look at what factors impact sack rate. I will look at Down and Distance, Passing Statistics, and strength of schedule.
As always I have to start by thanking www.cfbstats.com for providing a wealth of play-by-play information on college football over the years. I grabbed all passing plays from the 2005-2012 seasons. From there I kept only games between two FBS teams and eliminated any plays that occurred in garbage time. This left me with a data set of 319,047 passing plays from the last eight seasons.
Down and Distance
Down and distance to a first down are the first factors that come to mind that might affect a team's sack rate. The following table shows the number of plays that occur at each down for a few distance ranges (when I use "distance" from here on out I mean distance to a first down).
| |1st Down|2nd Down|3rd Down|4th Down|
|---|---|---|---|---|
|All Passing Plays|118,631|103,177|91,625|5,615|
|Distance less than 26|118,569|102,812|91,142|5,574|
|Distance less than 21|118,107|101,478|89,722|5,496|
Since the vast majority of plays occur within 20 yards of a first down, I am going to use that cutoff from now on. I don't think I'm losing any valuable information, but removing those plays may introduce a small bias. So what do the actual sack rates by down look like? I'm glad you asked. Here is a plot showing the average sack rate by down and distance. On this plot sack rate is measured as a decimal, so a sack rate of 10% is shown as .10.
There are a couple key takeaways from this graph:
- Sack rates on 1st and 2nd downs increase only slightly (and roughly linearly) with distance, not nearly as much as on 3rd and 4th downs.
- 4th downs have much less data than 3rd downs, but they still follow the same general trend as 3rd downs: increasing at first, then tailing off once the distance to a first gets longer than 15 yards.
- I would consider 2nd down to be much more reliable than 1st, and 3rd down much more reliable than 4th. Very few 1st downs occur anywhere but 1st and 10, so while there are a lot of 1st downs overall, there are far more 2nd downs at every distance other than 10 yards to go. The same goes for 3rd down versus 4th down: there just aren't that many 4th-and-12/13/14s to build a predictive model from.
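The averages behind a plot like this can be computed with a simple group-by. Here is a minimal Python sketch, assuming each play is stored as a (down, distance, sacked) tuple; the sample rows and the function name are made up for illustration and are not from the cfbstats data.

```python
from collections import defaultdict

# Hypothetical play records: (down, distance to a first down, sacked?).
# These rows are invented for illustration only.
plays = [
    (1, 10, False), (1, 10, True), (1, 10, False), (1, 10, False),
    (3, 5, False), (3, 5, False),
    (3, 12, True), (3, 12, False),
    (3, 25, False),  # beyond the 20-yard cutoff, so it gets dropped
]

MAX_DISTANCE = 20  # cutoff used in this post


def sack_rate_by_down_and_distance(plays, max_distance=MAX_DISTANCE):
    """Return {(down, distance): sack rate as a decimal}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [sacks, plays]
    for down, distance, sacked in plays:
        if distance > max_distance:
            continue
        counts[(down, distance)][0] += int(sacked)
        counts[(down, distance)][1] += 1
    return {key: sacks / n for key, (sacks, n) in counts.items()}


rates = sack_rate_by_down_and_distance(plays)
for (down, distance), rate in sorted(rates.items()):
    print(f"down {down}, {distance} to go: {rate:.2f}")
```

With real data, the cells with very few plays (like those long 4th downs) will produce noisy rates, which is the reliability concern from the last bullet.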
Since there is a clear relationship between sack rate and distance, I wanted to try to estimate this relationship using regression. Instead of running a linear regression on the average sack rate by down and distance, I wanted to run a logistic regression on whether or not a team was sacked on a play. Logistic regression is used to predict the probability of an event occurring, which is exactly what a sack rate is.
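For anyone who wants the idea in symbols, here is the general form of a logistic regression. The notation is mine (p for the probability of a sack on a play, x for the predictors, beta for the coefficients) and is not taken from any model output.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left side is the log of the odds, which is why the raw coefficients are hard to read directly; the right side converts a fitted value back into a probability between 0 and 1.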
The dependent variable (what we are trying to predict) in logistic regression is binary: the event either happened or it didn't. In our case we are trying to predict the likelihood of getting sacked given the down and distance (the sack rate), and the binary variable lives at the individual play level: did you get sacked on this play or not? For you data geeks out there, the best fit (judged by eye only, I didn't run any model diagnostics) was provided by a polynomial model with Distance and Distance^2 as predictors. I ran this model separately for each down instead of adding a dummy variable for down. The fit for 3rd downs can be seen below.
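If you want to try a model of this shape yourself, here is a rough, dependency-free Python sketch that fits logit(p) = b0 + b1*Distance + b2*Distance^2 for a single down using Newton's method. This is not the code I used for the post: the coefficients in TRUE_BETA are made up purely to generate illustrative counts, and the data is grouped by distance (plays and sacks per distance), which gives the same fit as working play by play.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the small linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic_quadratic(grouped, iterations=25):
    """Fit logit(p) = b0 + b1*d + b2*d^2 to (distance, plays, sacks) rows."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3                      # gradient of the log-likelihood
        info = [[0.0] * 3 for _ in range(3)]  # negative Hessian
        for d, n, k in grouped:
            x = [1.0, d, d * d]
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            for i in range(3):
                grad[i] += x[i] * (k - n * p)
                for j in range(3):
                    info[i][j] += n * p * (1 - p) * x[i] * x[j]
        step = solve(info, grad)              # Newton step
        beta = [b + s for b, s in zip(beta, step)]
    return beta


def predict(beta, d):
    return sigmoid(beta[0] + beta[1] * d + beta[2] * d * d)


# Made-up coefficients, used only to build example counts (NOT my fitted values).
TRUE_BETA = (-3.3, 0.12, -0.004)


def true_rate(d):
    return predict(TRUE_BETA, d)


# 10,000 hypothetical plays at each distance from 1 to 20 yards.
grouped = [(d, 10_000, round(10_000 * true_rate(d))) for d in range(1, 21)]
beta = fit_logistic_quadratic(grouped)
for d in (5, 10, 15, 20):
    print(f"{d} to go: predicted sack rate {predict(beta, d):.4f}")
```

To mirror the approach in this post you would run the fit once per down on that down's plays. A statistics package will also give you standard errors and diagnostics, which this sketch skips.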
All four downs can be seen at this image link. That is a pretty decent fit in my opinion. I will put the summary output in the comments so you can see the actual equation, but because logistic regression is essentially linear regression on the log of the odds of an event occurring, the coefficients don't mean much without transforming them. This graph may be hard to get specifics out of, so I have provided an example of what the regression model could tell us for a specific instance. In this table I compare down and 5 to go for a first to down and 10 to go for a first, with both my predicted values and the observed values:
|Down|Sack Rate on Down-and-5 (Predicted / Observed)|Sack Rate on Down-and-10 (Predicted / Observed)|Relative Increase from 5 to 10 (Predicted / Observed)|
|---|---|---|---|
|1st|4.13% / 4.60%|4.91% / 4.91%|18.7% / 6.7%|
|2nd|4.65% / 4.80%|5.31% / 5.24%|14.4% / 9.2%|
|3rd|6.78% / 6.29%|9.66% / 9.33%|42.4% / 48.3%|
|4th|5.92% / 5.99%|8.88% / 8.54%|48.5% / 42.6%|
On 3rd and 4th downs you would expect to see about a 45% increase in your sack rate when you move from 5 to 10 yards to go for a first. On 1st and 2nd downs the increase is much smaller, somewhere between about 7% and 19% depending on whether you go by the observed or the predicted rates. A 45% increase in sack rate is still not that much in absolute terms: the rate rises by roughly 3 percentage points, so a team would have to drop back about 33 times on 3rd/4th downs for this increase to add up to one extra sack. But it's not insignificant.
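The arithmetic behind those two numbers is short enough to show. This Python sketch uses the predicted 3rd-down rates from the table; the function name is just something I made up for the example.

```python
def increase_summary(rate_and_5, rate_and_10):
    """Return (relative increase, drop-backs needed for one extra sack)."""
    extra_per_dropback = rate_and_10 - rate_and_5
    relative_increase = extra_per_dropback / rate_and_5
    dropbacks_per_extra_sack = 1 / extra_per_dropback
    return relative_increase, dropbacks_per_extra_sack


# Predicted 3rd-down sack rates from the table, written as decimals.
relative, dropbacks = increase_summary(0.0678, 0.0966)
print(f"relative increase: {relative:.1%}")
print(f"drop-backs per extra sack: {dropbacks:.1f}")
```

Because the table values are rounded, the results will differ slightly from the table's last column and from my rough figure of 33 drop-backs, but they land in the same neighborhood.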
While this probably won't change a team's strategy that much (I'm sure staying out of third-and-longs was already part of that strategy), it does allow someone (me) to compare a team's actual sacks to its expected sacks when taking down and distance into account, to see how each team performed. So I pulled the 2013 passing plays and restricted the data set to FBS-FBS games only. For each team I found the difference between its actual number of sacks and its expected number of sacks given the down and distance of each play. I also compared each team's sacks to what you would expect from the average sack rate for all passing plays, 5.86%. This gives me sacks over expected when taking down and distance into account, and sacks over expected based only on the average sack rate for all plays.
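Here is a small Python sketch of the two "sacks over expected" numbers described above. The expected rates are the predicted values from the table earlier in the post; the teams and plays are hypothetical, and a real version would get the expected rate for every down and distance from the fitted model rather than from a four-entry lookup.

```python
LEAGUE_AVERAGE = 0.0586  # average sack rate for all passing plays

# Predicted sack rates from the table earlier in the post.
expected_rate = {
    (1, 10): 0.0491,
    (2, 5): 0.0465,
    (3, 5): 0.0678,
    (3, 10): 0.0966,
}

# Hypothetical plays: (team, down, distance, sacked?).
plays = [
    ("Team A", 1, 10, False), ("Team A", 3, 10, True), ("Team A", 3, 5, False),
    ("Team B", 1, 10, False), ("Team B", 2, 5, False), ("Team B", 3, 10, False),
]


def sacks_over_expected(plays):
    """Return {team: (over down-and-distance expectation, over flat average)}."""
    totals = {}
    for team, down, distance, sacked in plays:
        t = totals.setdefault(team, {"sacks": 0, "model": 0.0, "flat": 0.0})
        t["sacks"] += int(sacked)
        t["model"] += expected_rate[(down, distance)]  # expected sacks, model
        t["flat"] += LEAGUE_AVERAGE                    # expected sacks, flat rate
    return {
        team: (t["sacks"] - t["model"], t["sacks"] - t["flat"])
        for team, t in totals.items()
    }


results = sacks_over_expected(plays)
for team, (vs_model, vs_flat) in sorted(results.items()):
    print(f"{team}: {vs_model:+.3f} vs model, {vs_flat:+.3f} vs flat average")
```

A positive number means the team was sacked more often than expected, a negative number means less often.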
The correlation for these two numbers was .997. That took the wind out of my sails a little bit. That is some pretty damning evidence that it isn't necessary to take down and distance into account when looking at sack rate, but this analysis may still be important.
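For reference, the correlation here is the ordinary Pearson correlation between the two per-team numbers. A minimal Python sketch follows; the five values in each list are invented to show the calculation and are not the 2013 results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical sacks over expected for five teams, computed both ways.
with_down_and_distance = [4.1, -2.3, 0.6, -5.0, 2.8]
with_flat_average = [4.4, -2.0, 0.3, -5.3, 3.1]

r = pearson(with_down_and_distance, with_flat_average)
print(f"correlation: {r:.3f}")
```

When the two lists rank teams almost identically, as they do here, the correlation sits very close to 1.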
This post has gotten a little longer than I thought it would, so I will leave the analysis of passing statistics and sack rates until next time. If you have any questions or comments then please add to the discussion and comment below.