Some of you may be familiar with the Win Probability Model I developed last season. Basically, I built a logistic regression model that takes in the down, distance to first, yard line, current score, time remaining in the game, and indicators for home, away, or neutral site games, and predicts the probability that the team on offense is going to win the game.
I'll have a post coming out soon that details this model further and includes validation and improvements from last season (you can view a preview here). Today we are going to discuss a little problem with the underlying theory of the model, namely that in simple regression modeling we assume our observations and errors to be independent.
Unfortunately this isn't the case for a play-by-play win probability model, but through a modified bootstrap procedure I think I can show that it isn't influencing the results too much.
Independence of Errors
I'll preface this by saying that while I technically learned all this I am by no means an expert on the subject of linear regression assumptions. So if I am not explaining things exactly right, please let me know.
There are many assumptions that simple least squares regression modeling makes, and Wikipedia has a lot of decent explanations of them here. But the expectation that the errors in your model are independent is one of the more important ones. What this means is, if the residuals in your model -- the error between what you predict and what was actually observed -- are not independent between your observations, then the estimates that the model produces may be influenced by this dependence and may no longer be reliable. While the Win Probability Model is technically a logistic regression model, the same general modeling principles should still apply (*).
* Author's note: This comment was added to clear things up, thanks to Hermers for pointing it out.
Basically this means the coefficients that the model develops for having the ball on first down, or second down, etc... won't reflect the true relationship between this variable and the dependent variable, whether or not the team on offense ended up winning the game.
Unfortunately this is the case for a win probability model built from play-by-play data. The model attempts to predict, at a per-play level, the outcome of the game, which means that during the fitting of the model I use all plays from a game to predict a single game outcome. So all 120 plays or so have the same dependent variable.
So it may be unclear what is driving the actual outcome -- is it being up 10 in the first quarter or up 14 in the fourth? With enough observations from enough games, the hope is that these things even out and the model will be able to tease out the true main effects from the play-by-play variables. So the question becomes, how do we determine if our coefficient estimates are garbage or actually reflect the true relationships at play? I think you can answer this using a modified bootstrapping algorithm.
The bootstrap is a nifty statistical procedure that allows you to get estimates of the uncertainty of whatever quantity you are attempting to estimate. Conceptually it is very simple -- all you do is repeatedly sample, with replacement, from whatever data set you have and record the estimate of the quantity you want each time. If you do this, say, 1,000 times, you will have 1,000 different estimates of the value you want, all built off different data sets, allowing you to quantify the uncertainty around this estimate much more clearly than you may otherwise be able to.
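To make the idea concrete, here is a minimal sketch of the vanilla bootstrap using only the Python standard library. The margins of victory are made-up numbers for illustration, and `bootstrap_estimates` is a hypothetical helper name, not code from the model itself.

```python
import random
import statistics


def bootstrap_estimates(data, estimator, n_resamples=1000, seed=0):
    """Resample `data` with replacement and apply `estimator` to each resample."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Each resample is the same size as the original data set.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimates


# Made-up margins of victory, purely for illustration.
margins = [3, 7, 10, 14, 21, 1, 17, 24, 6, 13]

boot_means = bootstrap_estimates(margins, statistics.mean)

# The spread of the resampled means is a rough measure of uncertainty
# around the mean of the original sample.
spread = statistics.stdev(boot_means)
print(f"{len(boot_means)} bootstrap means, spread of about {spread:.2f}")
```

Any estimator that takes a list and returns a number can be passed in place of `statistics.mean`.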
How does this play into our situation? I'm not going to use the vanilla bootstrap exactly. What I did was get a list of each college football game from 2010 to 2014. I then randomly sampled one and only one play from each game. This gave me nearly 4,000 observations that were all independent. I then estimated a model on this sample data and recorded the coefficient estimates. I then repeated this 200 times. I'm not sure if there is a technical term for this procedure, which is why I just refer to it as a modified bootstrap procedure.
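The sampling step can be sketched roughly like this. Everything here is an assumption for illustration: the `game_id` and `offense_won` field names, the toy data, and the helper names are all hypothetical, and `offense_win_share` is only a stand-in for fitting the actual logistic regression.

```python
import random
from collections import defaultdict


def group_plays_by_game(plays):
    """Bucket plays by game so one play can be drawn from each game."""
    by_game = defaultdict(list)
    for play in plays:
        by_game[play["game_id"]].append(play)
    return by_game


def sample_one_play_per_game(by_game, rng):
    """Draw one and only one play at random from every game."""
    return [rng.choice(game_plays) for game_plays in by_game.values()]


def modified_bootstrap(plays, fit_model, n_rounds=200, seed=0):
    """Repeat the one-play-per-game draw and record what `fit_model` returns."""
    rng = random.Random(seed)
    by_game = group_plays_by_game(plays)
    return [
        fit_model(sample_one_play_per_game(by_game, rng))
        for _ in range(n_rounds)
    ]


# Toy play-by-play data: 5 games with 8 plays each (made-up values).
toy_plays = [
    {"game_id": g, "down": 1 + p % 4, "offense_won": (g + p) % 2}
    for g in range(5)
    for p in range(8)
]


def offense_win_share(sample):
    """Stand-in for model fitting: share of sampled plays the offense won."""
    return sum(play["offense_won"] for play in sample) / len(sample)


estimates = modified_bootstrap(toy_plays, offense_win_share)
print(len(estimates), "rounds, each fit on", len(group_plays_by_game(toy_plays)), "plays")
```

In a real run, `fit_model` would fit the win probability model on the sampled plays and return its coefficients instead of a single number.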
So what does this modified bootstrap procedure allow us to do? Now I have 200 estimates of the coefficients for each variable in my win probability model, and I can generate loose confidence intervals on my original coefficient estimates simply by looking at the 5th and 95th percentiles of the coefficient values I observed in my modified bootstrap results. Here is how those estimates compare to the full fitted values. The blue dots are the training data coefficients and the boxes represent the 5th percentile, median, and 95th percentile bootstrap estimates.
Basically, the taller the box, the more variance there is in the underlying coefficient estimate. But more importantly, and this is the whole point of this post, our full data coefficient estimates don't vary much, if at all in some cases, from the estimates we get when we fit models with no correlated observations from the same game.
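The comparison in the chart boils down to a percentile band per coefficient (the 5th to 95th percentile range covers 90% of the bootstrap estimates) and a check of whether the full-data estimate lands inside it. A minimal sketch, where the 200 values are random draws standing in for real bootstrap coefficients and the helper names are hypothetical:

```python
import random
import statistics


def percentile_band(values, lower=5, upper=95):
    """Return (lower percentile, median, upper percentile) of `values`."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower - 1], statistics.median(values), cuts[upper - 1]


def inside_band(full_fit_value, values):
    """Check whether the full-data estimate falls inside the bootstrap band."""
    low, _, high = percentile_band(values)
    return low <= full_fit_value <= high


# Fake bootstrap coefficients for one variable (illustration only).
rng = random.Random(0)
fake_bootstrap_coefs = [rng.gauss(0.5, 0.1) for _ in range(200)]

low, median, high = percentile_band(fake_bootstrap_coefs)
print(f"5th pct: {low:.3f}, median: {median:.3f}, 95th pct: {high:.3f}")
```

A wide gap between `low` and `high` corresponds to a tall box in the chart.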
I'd say this is pretty good evidence that we can still draw inference from a win probability model built off play by play data. I am not saying that this is 100% proof that we don't need to look at this issue more but I think it's a great first step.
If you have any questions, concerns, comments, or suggestions for improvements on this approach then I'm happy to discuss them with you. You can comment here or find me on twitter, @millsGT49. I also think you can get my email from my SB Nation fan page.