Peeking under the S&P+ hood

Some notes from a college football data redesign.

I'm getting closer and closer to deciding on a Sagarin-style setup: one set of ratings to judge your performance to date and another set for predicting the path forward.

Talk me out of it.

I usually view my work at Study Hall and Football Outsiders as work-in-progress stuff. My philosophy has never been to wait until something's perfected to share it and use it; part of that is because nothing is ever perfected, and part is because I want to have a conversation as I go. This backfires on me occasionally, as someone jumps in and assumes that if I'm using a given measure or line of thinking, I'm sold on it and willing to live and die by the results.

Part of that is an offshoot of the general "Watch the games, nerd" philosophy that tends to end up in my inbox at least once a week, and part is because you do sometimes see stat guys saying "LOOK! NEXT BIG THING RIGHT HERE!" It doesn't happen nearly as much as the anti-stat crowd thinks, but it does happen.

F/+ and S&P+ are good measures. They do just well enough in their predictive ability, and they pass the eyeball test for probably 95 percent of FBS teams, so I pretty frequently use them as reference points for what my eyes see. But they can always be better, and I'm always looking to make them better. (Quick differentiation for beginners: the S&P+ measure is mine, and F/+ is the combination of my numbers and Brian Fremeau's FEI.)

I mentioned a couple of times in November that I was tinkering with an overall ratings redesign. In early 2013, I rolled out a new approach to S&P+ that involved drive ratings, and with the Five Factors concept I've tossed around over the last year, I've been trying to figure out whether that could improve not only my analysis but my ratings as well.

In November at Football Outsiders, I did some serious over-fitting to figure out how close I could get to explaining a team's percentage of points scored with success rate, IsoPPP, red zone success rate, turnovers and turnover luck, sacks, and special teams. I got pretty close.

Following up on that pursuit, I've spent a lot of time in my data hole this week. (It's not literally a hole, though our office is indeed downstairs.) What have I been tinkering with so far?

Separating all plays into two silos: Inside the opponent's 40 (or some similar threshold) and outside

I go on and on about how "the game changes when you get closer to the goal line," and I include finishing drives as part of the Five Factors, and it's time to figure out if the stats also change in some way that we should be measuring.

(The early conclusion: explosiveness outside of the 40 and efficiency inside the 40 are the most important things. Makes sense, right?)
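For the curious, here's a rough sketch of what that split might look like in code. The column names (`yards_to_goal`, `success`, `eq_points`) are made up for illustration, and the real play-by-play data is structured differently, but the idea is the same: bucket each play by field position, then compute efficiency and explosiveness within each bucket.

```python
# A minimal sketch of the two-silo split. Assumed (hypothetical) columns:
#   yards_to_goal - distance from the opponent's goal line
#   success       - 1 if the play met the success rate threshold, else 0
#   eq_points     - equivalent points generated by the play
import pandas as pd

def silo_summary(plays: pd.DataFrame, threshold: int = 40) -> pd.DataFrame:
    """Split plays at the opponent's 40 and summarize each silo."""
    plays = plays.copy()
    plays["silo"] = plays["yards_to_goal"].apply(
        lambda ytg: "inside_40" if ytg <= threshold else "outside_40"
    )
    successes = plays[plays["success"] == 1]
    return pd.DataFrame({
        "plays": plays.groupby("silo").size(),
        "success_rate": plays.groupby("silo")["success"].mean(),
        # IsoPPP: average equivalent points on successful plays only
        "isoppp": successes.groupby("silo")["eq_points"].mean(),
    })
```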

Tying measures and projection attempts to normal distributions

Because this started as a "fun with numbers" thing way back in 2007, and because I've never given myself the time to draw up a new data structure from scratch, I basically just build on top of what already exists with each passing year. I use a pretty good structure, and the ratings themselves are pretty mathematically sound. But while standard deviations creep into my work here and there, I've never gone out of my way to tie ratings to normal distributions. And I think that would be a pretty good thing to do because things like points scored and a team's percentage of points scored follow the normal distribution awfully closely.

The 68-95-99.7 rule says that 68 percent of data points within a normal distribution will fall within one standard deviation of the mean, 95 percent will fall within two standard deviations, and 99.7 percent will fall within three standard deviations. College football is meant for normal distributions.

| Range | Normal distribution | Points scored in a given game (2014) | A team's pct. of points scored over a season (2014) |
| --- | --- | --- | --- |
| Within 1 Std. Dev. | 68.3% | 67.9% | 70.1% |
| Within 2 Std. Dev. | 95.5% | 96.8% | 95.5% |
| Within 3 Std. Dev. | 99.7% | 99.5% | 99.3% |
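If you want to run this check yourself, the math is simple enough. Here's a small sketch that takes a list of values, such as points scored in each game, and reports the share falling within one, two, and three standard deviations of the mean (the `points_per_game` name in the comment is just illustrative):

```python
import numpy as np

def coverage_by_std_dev(values):
    """Share of values within 1, 2, and 3 standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return {k: float(np.mean(np.abs(values - mean) <= k * std)) for k in (1, 2, 3)}

# e.g. coverage_by_std_dev(points_per_game) might return something like
# {1: 0.679, 2: 0.968, 3: 0.995} for the 2014 per-game scoring column above.
```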

If game and season output follow a normal distribution, it would stand to reason that a system rating the teams that play these games and seasons would, too. And they almost do. Almost, but not quite.

| Range | Normal distribution | F/+ | Off. F/+ | Def. F/+ | ST F/+ | S&P+ | Off. S&P+ | Def. S&P+ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Within 1 Std. Dev. | 68.3% | 57.8% | 64.1% | 60.9% | 74.2% | 65.6% | 64.1% | 68.8% |
| Within 2 Std. Dev. | 95.5% | 93.0% | 94.6% | 96.1% | 91.4% | 94.5% | 95.4% | 96.1% |
| Within 3 Std. Dev. | 99.7% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |

Basically, for most of these measures, there aren't enough teams within one standard deviation, and there are none anywhere close to three standard deviations away; basically all of the data points fall within about 2.2 standard deviations of the mean. That's not bad, and it's pretty damn consistent, but it's a different curve. So I'm taking steps to fit the numbers I use onto the right curve. It won't change much, but it could make some important changes when it comes to point distributions and projections of that nature.
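For what it's worth, one common way to force a set of numbers onto a normal curve is a rank-based transform: convert each rating to its percentile, then push that percentile through the inverse normal CDF. I'm not saying that's exactly what the S&P+ adjustment will look like, but as a sketch:

```python
import numpy as np
from scipy.stats import norm, rankdata

def normalize_to_curve(ratings, target_mean=0.0, target_std=1.0):
    """Map ratings onto a normal curve via their percentile ranks."""
    ratings = np.asarray(ratings, dtype=float)
    # Percentile ranks, kept strictly between 0 and 1 so norm.ppf stays finite
    pct = rankdata(ratings) / (len(ratings) + 1)
    return norm.ppf(pct) * target_std + target_mean
```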

Success Rate vs. IsoPPP

My ideas regarding efficiency and explosiveness have shifted pretty significantly over the last year or so. Both will always be important, but my idea of the proportion of each has changed. Compare:

From my 2010 "Four Truths" piece for FO:

Truth No. 2: Big plays win games.

The OPS measure in baseball is a direct split between on-base percentage and slugging percentage. It is a wonderful measure, but it is not quite as accurate as it could be, as on-base percentage matters more to good offense than slugging. The opposite is true in college football -- the explosiveness measure (PPP, the "slugging percentage" of the S&P equation) is almost always tied more closely to winning games than success rates (the 'on-base percentage' piece). Both matter, but on offense and defense, PPP matters more.

Again, this is the way it should be. Nothing is more demoralizing than giving up a 20-play, 80-yard, nine-minute drive. But unless your team is Navy, that doesn't happen too often. Defensive coaches often teach their squads the concept of leverage -- prevent the ball-carrier from getting the outside lane, steer him to the middle, make the tackle, and live to play another down. It is the bend-don't-break style of defense, and it often works because if you give the offense enough opportunities, they might eventually make a drive-killing mistake, especially at the collegiate level. If you allow them 40 yards in one play, their likelihood of making a drive-killing mistake plummets.

From last year's second Five Factors post, which stripped efficiency from explosiveness:

Instead of simply looking at Success Rate and PPP (Equivalent Points Per Play), what if we added together Success Rate and the PPP for only successful plays? It puts efficiency first, which isn't a surefire winner, but it frames things in an interesting way: How efficient are you, and when you're successful, how successful are you?

Using full-season game data from 2012 and 2013 (with FCS games removed), I crafted a new version of S&P using Success Rate and this Isolated PPP idea (PPP on successful plays only). The most effective weights: 86% Success Rate, 14% IsoPPP. With that weighting, I was able to almost exactly recreate the strong correlations between S&P and both points scored and percentage of points scored.

Eighty-six percent efficiency, fourteen percent isolated explosiveness. That makes it sound like efficiency is far and away the most important aspect of college football offense and defense. And I think it is.
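To make that combination concrete, here's a rough sketch of how Success Rate and IsoPPP could be blended with those 86/14 weights. Since the two live on different scales (a percentage versus equivalent points), this version standardizes each against a national distribution first; the actual S&P+ scaling may differ, and the table and column names here are hypothetical.

```python
import pandas as pd

WEIGHT_SUCCESS = 0.86  # efficiency weight from the Five Factors post
WEIGHT_ISOPPP = 0.14   # isolated explosiveness weight

def team_sp(team_plays: pd.DataFrame, natl: dict) -> float:
    """Blend a team's Success Rate and IsoPPP into one number (z-score scale)."""
    success_rate = team_plays["success"].mean()
    # IsoPPP: magnitude of successful plays only
    isoppp = team_plays.loc[team_plays["success"] == 1, "eq_points"].mean()
    sr_z = (success_rate - natl["sr_mean"]) / natl["sr_std"]
    iso_z = (isoppp - natl["iso_mean"]) / natl["iso_std"]
    return WEIGHT_SUCCESS * sr_z + WEIGHT_ISOPPP * iso_z
```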

However, as I tinker with a new ratings structure, I find that the most predictive measure I have for explaining previous results is IsoPPP margin, and the most predictive opponent-adjusted measures are offensive and defensive IsoPPP+. This makes sense to a degree -- if you had seven big gainers (however we define that) and your opponent had three, the odds are pretty good that you scored more points.

But there's one problem with IsoPPP and with looking at big plays in this way: big plays are quite random. Your ability to produce them is key to winning games, but you don't really know when the next one is going to happen. Plus, we're lopping the sample size down by only looking at the magnitude of 35 to 50 percent of a team's non-garbage-time plays (since that's where most success rates fall).

One way to look at the reliability of a measure is to compare a team's averages in two different clumps and see how well they correlate to each other. Doing that shows us that IsoPPP is far from reliable.

| Correlation between first and second half of 2014 | Inside opponent's 40: Success Rate | Inside opponent's 40: IsoPPP | Outside opponent's 40: Success Rate | Outside opponent's 40: IsoPPP |
| --- | --- | --- | --- | --- |
| Offense | 0.428 | 0.188 | 0.384 | 0.161 |
| Defense | 0.174 | 0.143 | 0.362 | 0.116 |
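The check behind that table is straightforward: average each team's stat over the first and second halves of the season, then correlate the two columns. A sketch, with a made-up `games` table and a hypothetical midpoint week:

```python
import pandas as pd

def split_half_correlation(games: pd.DataFrame, stat: str, midpoint_week: int = 7) -> float:
    """Correlate each team's first-half and second-half averages for one stat."""
    games = games.copy()
    games["half"] = games["week"].apply(lambda w: "second" if w > midpoint_week else "first")
    halves = games.groupby(["team", "half"])[stat].mean().unstack("half")
    return halves["first"].corr(halves["second"])

# e.g. split_half_correlation(games, "success_rate") vs.
#      split_half_correlation(games, "isoppp")
```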

The difficulty of the schedule between the first and second halves of a team's season can change pretty drastically. More often than not, the first half of the season features your three to four non-conference games, and for most teams, that means at least one or two relative cupcake games. So the change in schedule strength will tamp down some of these correlations a decent amount.

Still ... the pretty decent first-half-to-second-half correlations in the success rate categories (aside from defensive success rate inside the 40, which is interesting/odd) turn into pretty damn weak correlations in the IsoPPP categories.

I can come up with a pretty damn accurate way of summarizing what has happened to date by using IsoPPP and other small-sample things like special teams outcomes. But as I begin to simulate previous seasons to see how different types of ratings would do in real time, I'm thinking that using IsoPPP and special teams will hold back the predictiveness of more reliable measures like success rate.

And now we basically reach the crossroads of any system of ratings. Are we looking to evaluate a team based on what's happened, or are we specifically trying to predict what will happen moving forward? Ken Pomeroy openly and frequently reminds people that his basketball ratings are intended to predict, but I've liked the way S&P+ and F/+ more or less split the difference, telling you what has happened but giving you a pretty good idea of what will happen, too. When I began all of this almost eight years ago, I wasn't really thinking about differentiating between the two, but I'm almost leaning toward exploring that.

Again, tell me why that's wrong.