Hello! December has been Ratings Redesign month, and I wanted to give an update.
I'm getting close to something pretty awesome, a far more significant redesign than I originally intended or expected. The short(ish) version:
1. Instead of using a simple combination of Success Rate and PPP (or IsoPPP) to determine ratings, I'm tying the numbers pretty tightly to my Five Factors concept.
Success Rates and IsoPPP still play heavy roles, but so do the components of field position and finishing drives, the propensity for negative plays (a more consistent data field than I expected), and the components of turnovers. (Note: turnovers themselves aren't used; what I've tried to get at is a team's expected turnovers based on other factors -- forced fumbles, sack rates, etc.)
There is a slightly heavier emphasis placed on a team's success rates in scoring opportunities (i.e. first downs inside the opponent's 40). It is more complicated but far more descriptive and comprehensive. And it all comes back to what I've determined are college football's five factors: efficiency, explosiveness, field position, finishing drives, and turnovers.
2. Because of a focus on field position (for which punting, kickoffs, and returns play a large role) and finishing drives (place-kicking), special teams are getting mushed into offensive and defensive categories.
Indirectly, that means that punts and kickoffs have sort of turned into part of offensive ratings (because they create your opponent's field position), while returns have turned into part of defensive ratings (because they create yours). I might try stripping those out at some point, but it isn't a huge concern at the moment.
3. The bell curve has become very, very important.
The 68-95-99.7 rule says that 68 percent of data points within a normal distribution will fall within one standard deviation of the mean, 95 percent will fall within two standard deviations, and 99.7 percent will fall within three standard deviations. College football is meant for normal distributions.
|Range||Normal Distribution||Points scored in a given game (2014)||A team's pct. of points scored over a season (2014)|
|Within 1 Std. Dev.||68.3%||67.9%||70.1%|
|Within 2 Std. Dev.||95.5%||96.8%||95.5%|
|Within 3 Std. Dev.||99.7%||99.5%||99.3%|
If game and season output follow a normal distribution, it would stand to reason that a system rating teams that play these games and seasons would, too.
My old ratings incorporated standard deviations into the calculations but didn't adhere to the normal curve in the way that they should have. The new ones do, and it's made a difference.
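For anyone who wants to run the same bell-curve check on their own data, here is a minimal sketch. The function name `share_within` is made up for illustration, and the sample is random draws standing in for real point totals (the mean of 28 and spread of 13 are placeholders, not 2014 figures).

```python
import random
from statistics import mean, pstdev

def share_within(values, k):
    """Fraction of values that fall within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(abs(v - mu) <= k * sigma for v in values) / len(values)

# Hypothetical sample: simulated draws, NOT real game scores.
rng = random.Random(0)
sample = [rng.gauss(28, 13) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"Within {k} std. dev.: {share_within(sample, k):.1%}")
```

Swapping real game scores in for `sample` would give the kind of comparison shown in the table above.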
4. The results thus far are incredibly encouraging. Like, really, really encouraging.
I have gone back and simulated the 2005-14 seasons to check the results the new system is producing. It's a little bit tricky because I've had to use our old F/+ projections for the "preseason projections" portion of the calculations -- coming up with newer, more apt and relevant projections will be next on the To Do list -- but going back to 2011, this new system is just about as accurate as any.
Using the Prediction Tracker, I can compare my results to those of more than 50 other computer rankings. This new S&P+ (or whatever I end up calling it) finishes second over the last three years and third (by a tiny margin) over the last four.
|System||2011-14 ATS||Rk||2012-14 ATS||Rk|
|Computer Adjusted Line||53.6%||1||54.3%||1|
|Least Squares w/ HFA||53.2%||2||53.6%||5|
|Payne Power Ratings||51.1%||19||52.7%||7|
|ARGH Power Ratings||51.5%||12||51.8%||17|
|Daniel Curry Index||51.6%||20|
|Born Power Index||51.5%||10||51.5%||22|
Now, finishing behind the Computer Adjusted Line makes sense. Here's more on that rating:
The computer adjusted line works similarly to the way the sportsbook adjust line. It starts off equal to the line but fractions are added or taken away from the line depending on what proportion of the computer systems are on one side of the line. Where one side means being at least 1.5 points away from the line. I think the fact that this 'system' is beating the line in this category is showing that the computers taken as a whole, do add useful information.
So basically, it is a summation of and reaction to other computer ratings. Hard to compete with that. (Meanwhile, you can find more about the Least Squares Model here. Warning: PDF.)
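The quoted description can be read roughly as follows. This is only a sketch of that reading: the size of the shift (`max_shift`) is not given anywhere in the description and is a hypothetical parameter, as is the function name.

```python
def computer_adjusted_line(line, system_predictions, threshold=1.5, max_shift=1.0):
    """Nudge the line toward the side that more computer systems favor.

    A system counts as being "on one side" only if its prediction is at
    least `threshold` points away from the line. `max_shift` is a
    hypothetical cap on the adjustment; the real value is not published here.
    """
    n = len(system_predictions)
    if n == 0:
        return line
    above = sum(p >= line + threshold for p in system_predictions)
    below = sum(p <= line - threshold for p in system_predictions)
    # Shift by the net proportion of systems on one side of the line.
    return line + max_shift * (above - below) / n

# Two systems well above the line, one well below, one too close to count.
print(computer_adjusted_line(7.0, [9.0, 9.0, 7.5, 4.0]))
```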
5. Because of normal distributions, no more clunky win probabilities.
I have come up with a much cleaner way to do them than a shaky formula based on past results.
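The exact method isn't spelled out here, but one common way to get a win probability from a normal distribution is to treat the actual scoring margin as normally distributed around the projected margin. The sketch below assumes that approach; `margin_sd` (how much real margins scatter around projections) is a hypothetical placeholder value, not a number from these ratings.

```python
from statistics import NormalDist

def win_probability(projected_margin, margin_sd=17.0):
    """P(actual margin > 0) if actual margin ~ Normal(projected_margin, margin_sd).

    margin_sd=17.0 is an illustrative assumption, not a fitted value.
    """
    return NormalDist().cdf(projected_margin / margin_sd)

for margin in (0.0, 3.0, 7.9, 17.0):
    print(f"{margin:+5.1f} -> {win_probability(margin):.1%}")
```

A projected margin of zero comes out to a coin flip, and the probability climbs smoothly from there, with no lookup table of past results required.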
6. These are thus far only my numbers.
I haven't yet figured out a combination of my S&P+ and Brian's FEI that makes these ratings better. I assume one exists, and there will almost certainly still be a combined F/+ rating moving forward, but for now I've been focused solely on my end of that combo.
7. Uncle Mo has a seat at the table.
In part one of this series, I talked about the difference between predictive and retrodictive numbers. Well, overall, the version of New S&P+ that is most predictive is just about the most retrodictive as well.
Now, as I mentioned in part one of this series, you can get more accurately retrodictive by giving heavy weight to big plays.
[A]s I tinker with a new ratings structure, I find that the most predictive measure I have for explaining previous results is IsoPPP margin, and the most predictive opponent-adjusted measures are offensive and defensive IsoPPP+. This makes sense to a degree -- if you had seven big gainers (however we define that) and your opponent had three, the odds are pretty good that you scored more points.
But there's one problem with IsoPPP and with looking at big plays in this way: big plays are quite random. Your ability to produce them is key to winning games, but you don't really know when the next one is going to happen. Plus, we're lopping the sample size down by only looking at the magnitude of 35 to 50 percent of a team's non-garbage time plays (since that's where most success rates fall).
Still, by using the weights that are best for prediction, I can get pretty close to that retrodictively successful number. Close enough to satisfy me, anyway. (And that's what counts, right?)
By looking at week-to-week success with the picks, however, I noticed a pretty substantial slide happening late in most seasons. After quite a bit of tinkering, I came up with a pretty interesting, clean adjustment for that. If I use only the last eight weeks of data for picks, the percentages are pretty fantastic.
|Performance against the spread||W||L||T||%|
|Weeks 10-13 (or 14 when applicable)||489||400||21||54.9%|
|Post-season (Championship Week, Bowls)||114||117||5||49.4%|
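The percentages in the table are consistent with counting a push as half a win. That is an inference from the numbers rather than a stated rule, but it reproduces both rows:

```python
def ats_percentage(wins, losses, ties):
    """Against-the-spread percentage, counting each tie (push) as half a win."""
    games = wins + losses + ties
    return 100 * (wins + 0.5 * ties) / games

print(round(ats_percentage(489, 400, 21), 1))  # 54.9
print(round(ats_percentage(114, 117, 5), 1))   # 49.4
```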
Things get pretty random when the postseason hits. For Championship Week, be that Week 14 or Week 15, New S&P+ has had semi-crazy results: 44% against the spread in 2011, 45% in 2012, 27% (!) in 2013, and 61% in 2014. Meanwhile, bowls have had an interesting tale to tell as well: 51% in 2011, 54% in 2012, and 34% in 2013. (Also: 71% in 2010! And 43% to date in 2014.) There's still some potential work to be done here in solving the bowl mystery, but for the stretch run of the season, using only eight weeks of data instead of the whole season improves the numbers by a strong amount. This isn't an acknowledgement of momentum as much as it is an acknowledgement that in late-November, you aren't the same team you were in early-September.
(This does cause a bit of awkwardness in presentation, as I will always feel compelled to share a full-season number, too. So there will be an Overall S&P+ figure and a weighted, 8-week number.)
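Mechanically, the eight-week idea is just a filter on which games feed the ratings. A minimal sketch, assuming each game record carries a `week` field; the record layout and the use of average margin as a stand-in "rating" are both hypothetical.

```python
def last_n_weeks(games, current_week, n=8):
    """Keep only games played in the n most recent weeks, ending at current_week."""
    first_week = current_week - n + 1
    return [g for g in games if first_week <= g["week"] <= current_week]

# Hypothetical game log: week number and scoring margin.
games = [{"week": w, "margin": m}
         for w, m in [(1, -10), (2, -3), (5, 7), (9, 14), (12, 21)]]

recent = last_n_weeks(games, current_week=12)
full_avg = sum(g["margin"] for g in games) / len(games)
recent_avg = sum(g["margin"] for g in recent) / len(recent)
print(full_avg, recent_avg)
```

In this made-up log, the early-season losses drop out of the window, so the recent figure looks quite different from the full-season one.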
8. The possibilities for presentation are endless.
I've also been tinkering a lot with presentation.
One of the steps in creating these new ratings is coming up with a projected score of sorts. Like, you scored 38 points, but based on your success rate, IsoPPP, etc., you could expect a score of about 36.2 points on average. These ratings are grounded somewhat in point totals, so why not present them as such? Instead of worrying about explaining that 100.0 is average (as in the current S&P+ ratings arrangement) or even trying to communicate the idea that a plus-10.0% rating means that a team/offense/defense/whatever is 10 percent better than average, what about presenting the rating based on projected point totals? What makes more sense, a 110.0 rating, a plus-10.0% rating, or saying the offense is worth about 31.5 points per game, that your defense is worth about 23.6 points per game allowed, and that your projected scoring margin -- now your overall rating -- is plus-7.9 points per game?
I'm not completely sold on this, but I'm getting pretty close.
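The arithmetic behind that last option is as simple as it sounds. Using the example figures above (the function name is just for illustration):

```python
def projected_margin(offense_ppg, defense_ppg_allowed):
    """Overall rating as projected points scored minus projected points allowed."""
    return offense_ppg - defense_ppg_allowed

margin = projected_margin(31.5, 23.6)
print(f"{margin:+.1f} points per game")  # +7.9 points per game
```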
Basically, between this concept, the Last 8 Weeks concept above, and the Second-order Wins idea I discussed a couple of weeks ago, the new S&P+ ratings page could look something like what is below. (I used late-2013 because 2013 Auburn is a pretty fantastic case study.)
(And yes, I'm thinking about even tossing in a general Strength of Schedule figure, even though I hate them.)
|2013 S&P+ after 15 weeks (inc. Championship Week)|
|Team||W-L||Pyth Wins||Diff||S&P+||Rk||Wtd S&P+||Rk||Off. S&P+
Using weighted and overall ratings, you certainly get a better feel for a team like 2013 Auburn, don't you? The Tigers were barely top-10 for the season heading into the BCS Championship game, but in terms of weighted averages, they were up to second overall. Sometimes you need more than one number to talk about a team that changed so starkly from the start of the season to the end. (Baylor is another good example of this. As you might remember, the Bears were dominant in September and October before getting dinged up and fading a bit in November.)
You still have some oddities in there, of course, like Indiana ranking ahead of Michigan State for the season as a whole (but very much not in the weighted average). That will always be there when you break things down into components.
Plus, while I intend to present the offensive and defensive ratings as means for your own projection -- combine Team A's offensive rating and Team B's defensive rating, and you have a projected score for Team A if the two teams played (with a home field adjustment of sorts) -- that obviously gets a little weird when you've got the overall number and a weighted number that don't match. The weighted numbers are shared above, but that doesn't feel quite right. And at some point, column space is at a premium.
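How the two ratings get combined isn't specified here, so the sketch below assumes one simple option: add the offense's points-per-game rating to the opposing defense's points-allowed rating and subtract a league-average scoring figure. The `league_avg` and `home_field` values are hypothetical placeholders.

```python
def projected_score(off_rating, opp_def_rating, league_avg=28.0, home_field=0.0):
    """Projected points for a team against a specific opponent.

    Assumed formula (not the published method): offense rating plus opposing
    defense rating, minus league-average scoring, plus any home-field bump.
    league_avg=28.0 is an illustrative placeholder.
    """
    return off_rating + opp_def_rating - league_avg + home_field

# Team A's offense (31.5) against Team B's defense (23.6), neutral site.
print(round(projected_score(31.5, 23.6), 1))
# Same matchup with a hypothetical 1.5-point home-field adjustment for Team A.
print(round(projected_score(31.5, 23.6, home_field=1.5), 1))
```

Running it once per team, each with its own offense and the other side's defense, gives a projected score for both sides of a matchup.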
Still, this is an exciting (to me) format.
9. I can't say I'm really married to the S&P+ moniker anymore.
This has certainly grown beyond the original approach. It would feel strange to change it now, though, wouldn't it?
10. I'm absolutely seeking feedback for all of this, especially when it comes to the layout.
The goal is to design a system that is both wonderfully predictive and easily useful for analysis. (After all, analysis is why I began tinkering with numbers in the first place, not beating Vegas.) Combine an overall ratings layout like this with the same type of breakout for offensive and defensive ratings (rushing, passing, standard downs, passing downs, etc.), and you've got all sorts of powerful tools. But I want to make sure I'm sharing what is most useful, and in the cleanest, most relevant way. The floor's yours.