Introduction to Mixed Models
By Matthew Buchalter, PlusEV Analytics
I was chatting with a friend about a Bayesian Inference model that I had helped him build. I tried to explain to him how it worked – his response was “that’s tough to grasp”. And he was right – the guy was certainly intelligent enough, it’s just that he had never been exposed to this type of thinking before.
I have many criticisms of the actuarial education system. There are too many exams, they’re too difficult, they’re too expensive and their main purpose seems to be to create barriers to entry to the profession to keep salaries high – a ploy that worked well in the 20th century but is dangerous now that there are all kinds of data scientists out there competing with us. However, there is one subject that seems to be covered more in the actuarial curriculum than anywhere else, and it’s taken me so far in both my career and my hobby that it almost makes it worth having endured all the other stuff. It’s called “credibility theory”.
Credibility theory is a set of tools to help you deal with the common problem of small sample sizes. There are many real-world problems where you have a data set but it’s not as large as you’d ideally like it to be, and both accepting the data outright and rejecting it outright would be bad decisions. Credibility theory opens up the entire spectrum in between by providing a set of formulas to tell you how much weight, between 0% and 100%, you should give to your observed data set. Like a good student I studied the formulas, memorized them and applied them to pass my exams. It took me a good ten years after that to gain a full understanding of what they really mean and how they can be used to quantify the process of learning from information. This, dear reader, is my attempt to save you those ten years by explaining the foundation of credibility theory: the mixed model.
To illustrate the concept, we’re going to build a model to predict NFL quarterbacks’ Expected Points Added (EPA) per game. If I were really interested in modeling QB performance I’d make some slightly different choices, but we’re really here to explain how mixed models work and this is the most straightforward way to do that.
To model any given quarterback’s EPA in any given game, we’re going to use our old reliable friend, the Normal distribution. To build a normal distribution you need two ingredients – a mean (let’s call it M) and a standard deviation (let’s call it S).
EPA per game ~ Normal (M, S)
S is a measure of the random variance that is a natural part of NFL football. Sometimes good QBs have bad games, and sometimes bad QBs have good games. We’re going to simplify the model by assuming that all quarterbacks share the same value of S.
M is the average or expected value of the EPA; as such, it will be higher for better quarterbacks and lower for worse ones. But how do we differentiate between better and worse quarterbacks? We know that Drew Brees (7.28 average EPA over 292 games) is a good quarterback and we’re pretty sure that Mark Sanchez (-2.10 average EPA over 84 games) is a bad one, but what about Drew Lock (5.18 average EPA over 5 games) or Tua Tagovailoa (0 games)?
Here’s the tricky part – we don’t know what each quarterback’s M is. The real world isn’t like a video game where you can look at a player’s profile and read his skill rating. M is an example of something called a “latent variable” – something that definitely is real but cannot be observed directly, only indirectly. Looking at a player’s career average EPA, you get some indication of what their M might be but you see it through a fog of game-to-game randomness that lifts gradually as the sample size increases.
Other examples of latent variables:
- Your driving skill (indirectly observed by your frequency of car accidents and/or traffic tickets)
- Your intelligence (indirectly observed by school grades, IQ tests, job performance, aptitude tests, etc)
- Your golf ability (indirectly measured by your handicap, which is an average of your previous scores adjusted for course difficulty)
So let’s recap.
EPA for a quarterback in a game ~ Normal (that quarterback’s M, every quarterback’s shared S).
Both M and S are unknown. S is easy to figure out because there’s only one S and we have thousands of historical games we can use to estimate it. M is hard to figure out because there are so many Ms – one for each quarterback who’s ever played – and for some of them we have very little data to work with (or none at all).
Now, here’s the fun part: We’re going to model M for each quarterback using ANOTHER Normal distribution:
Quarterback’s M ~ Normal (Quarterback’s M_M, Quarterback’s M_S)
You can think of this as the distribution of all possible (hypothetical) quarterbacks. Some are really good, some are really bad, most are average.
If we know M_M and M_S, we know everything there is to know about M. So, our model has three parameters: M_M, M_S and S. This is called a “normal-normal mixed model”.
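If it helps to see the two-level structure as code, here’s a minimal simulation sketch in Python (my illustration, not part of the original analysis; the parameter values are placeholders, and numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder values for illustration; the fitted values appear later in the post.
M_M, M_S = 1.5, 3.0   # mean and sd of the distribution of quarterback skill
S = 10.0              # game-to-game sd, shared by every quarterback

# Level 1: draw each hypothetical quarterback's latent skill M.
n_qbs, n_games = 5, 16
M = rng.normal(M_M, M_S, size=n_qbs)

# Level 2: draw each quarterback's per-game EPA around his own M.
epa = rng.normal(M[:, None], S, size=(n_qbs, n_games))

for m, games in zip(M, epa):
    print(f"latent M = {m:6.2f}, observed 16-game average = {games.mean():6.2f}")
```

With S this much larger than the spread in M, even a full season’s average is a foggy view of true skill, which is exactly the small-sample problem credibility theory exists to handle.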
Before we fit these parameters, let’s talk about what they mean.
M_M is our best estimate of the quarterback’s skill level.
S is the game-to-game random variance in any quarterback’s performance.
What is M_S? Both S and M_S are standard deviations, so what’s the difference between them?
M_S is the standard deviation of our estimate of M. That is, it represents our degree of confidence around the estimate M_M.
S² is called the “process variance” while M_S² is called the “parameter variance”.
M_M and M_S are both measures of belief. M_M is an answer to the question “how good do we believe this quarterback is, based on the information available to us?” M_S is an answer to the question “how confident are we in that belief?” For Brees, “he’s very good” and “we’re pretty damn confident in that assessment”. For Tagovailoa, “average, I guess?” and “no confidence at all, maybe he’s the next Brees, maybe he’s the next Sanchez”.
So here’s the magic of the whole thing…both M_M and M_S can learn. As a quarterback accumulates more games and we accumulate more information, two things happen.
- M_M gets updated to reflect the new information. If M_M is 3 EPA per game and our guy puts up 20 EPA in a game, our new estimate of M_M going into the next game will be more than 3. If he puts up -10, our new estimate will be less.
- M_S shrinks, because our new estimate of M_M will necessarily be based on more information than the previous estimate of M_M. Larger sample size means more confidence, which means lower M_S.
But let’s not get ahead of ourselves. Before we start updating, we need to establish a starting point. How can we estimate a quarterback’s EPA per game without knowing any of that quarterback’s history? We can use other things we know, such as the quarterback’s draft position and where they are in their career (QBs tend to significantly improve between their first and second years).
Because we’re working with normal distributions, a plain old linear regression will do the trick here…almost. The only thing I had to do was tweak my regression to estimate both the process variance (between games for the same quarterback) and the parameter variance (between quarterbacks) at the same time.
This gives the following:
Initial estimate of M_M = 3.224 – 0.478 * log(draft position) – 1.737 * (rookie year)
Initial estimate of M_S = 3.030
S = 10.034
For Drew Brees (drafted 32nd, not rookie year), initial estimates are M_M = 1.567 and M_S = 3.030.
For Tua Tagovailoa (drafted 5th, rookie year), initial estimates are M_M = 0.718 and M_S = 3.030. Because we have no performance data on Tua, these are also his final estimates!
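As a quick sanity check, here’s that arithmetic in Python. One assumption on my part: the “log” in the regression is the natural log, since that’s what reproduces both of the numbers above.

```python
import math

def initial_estimates(draft_position, rookie_year):
    """Initial (M_M, M_S) before seeing any games.

    rookie_year is an indicator: 1 in a QB's first season, 0 afterwards.
    M_S comes out of the fit as a constant, the same for every quarterback.
    """
    m_m = 3.224 - 0.478 * math.log(draft_position) - 1.737 * rookie_year
    return m_m, 3.030

print(initial_estimates(32, rookie_year=0))  # Drew Brees: about (1.567, 3.030)
print(initial_estimates(5, rookie_year=1))   # Tua Tagovailoa: about (0.718, 3.030)
```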
Now, how do we take these estimates and update them with a quarterback’s actual performance? I’ve written previously about Bayes Theorem and Conjugate Priors, but thanks to Wikipedia, someone has already done all the algebra and calculus so we don’t have to. The Wikipedia page on Conjugate Priors is my bible. Looking at the normal-normal mixture, the updated estimates of M_M and M_S² are given by:

Updated M_M = (µ0/σ0² + Σxi/σ²) / (1/σ0² + n/σ²)

Updated M_S² = 1 / (1/σ0² + n/σ²)
Translating from Wikipedia’s nomenclature to ours:
µ0 is our initial estimate of M_M
σ0 is our initial estimate of M_S
σ is S
n is the number of games we’ve observed from the quarterback
Σxi is the total EPA we’ve observed from the quarterback
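Here’s that update written in code, translated into our notation; it’s a direct transcription of the formulas above. In the example, Drew Brees’s total EPA is back-calculated from the career average quoted earlier, so the inputs (and outputs) are approximate:

```python
def update(m_m, m_s, s, n, total_epa):
    """Updated (M_M, M_S) after observing n games totalling total_epa.

    A precision-weighted average of the prior mean and the data:
    the more games we observe, the more weight the data gets.
    """
    prior_precision = 1 / m_s**2       # 1 / M_S^2
    data_precision = n / s**2          # n / S^2
    new_var = 1 / (prior_precision + data_precision)
    new_m_m = new_var * (m_m * prior_precision + total_epa / s**2)
    return new_m_m, new_var**0.5

# Drew Brees: prior from the regression, then 292 games averaging 7.28 EPA.
m_m, m_s = update(1.567, 3.030, s=10.034, n=292, total_epa=7.28 * 292)
print(m_m, m_s)  # roughly 7.07 and 0.58
```

Note what happened: with 292 games of data, the updated M_M sits almost on top of his raw 7.28 average, barely shrunk toward the prior, and M_S has collapsed from 3.03 to about 0.58.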
Here’s what it looks like for a few selected QBs:

[Figure: updated M_M and M_S estimates for selected quarterbacks]
You can see the “learning” in action here – the veteran quarterbacks have much smaller M_S, meaning we can peg them with a much higher degree of confidence than the less experienced quarterbacks. You can visualize this “learning” process by taking one quarterback, let’s say Drew Brees, and recalculating his M_M and M_S throughout the course of his career one game at a time:

[Figure: Drew Brees’s M_M and 95% confidence band, updated game by game]
The blue line shows how Brees’s M_M evolves from game 1 to game 291. The orange and grey lines show the “95% confidence interval”, i.e. the range from M_M – 2*M_S to M_M + 2*M_S. See how the range gradually narrows over time? That’s because M_S is shrinking as our sample size increases.
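A handy property of this update is that applying it one game at a time, with each game’s posterior becoming the next game’s prior, lands you in exactly the same place as one batch update over all the games; that’s all the chart is doing. A sketch, reusing the update function from above (the game results here are made up for illustration):

```python
# Sequential updating: yesterday's posterior becomes today's prior.
m_m, m_s = 1.567, 3.030          # Brees's initial estimates from the regression
game_epas = [12.0, -3.5, 8.2]    # made-up single-game EPA values

for i, epa in enumerate(game_epas, start=1):
    m_m, m_s = update(m_m, m_s, s=10.034, n=1, total_epa=epa)
    lo, hi = m_m - 2 * m_s, m_m + 2 * m_s
    print(f"after game {i}: M_M = {m_m:5.2f}, M_S = {m_s:4.2f}, "
          f"95% band ({lo:5.2f}, {hi:5.2f})")
```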
Again, the intent here is not to teach you how to model a quarterback’s EPA, it’s to show you how a mixed model works…so there are some nuances that I purposely left out (such as how to deal with undrafted quarterbacks, or how to deal with the overall upward drift in league-wide EPA over the past 20 years) for the sake of keeping things as simple and clear as possible. And again, these concepts are NOT easy to grasp, especially not the first few times you encounter them. But I would say it’s definitely worth the effort, as this opens up a whole world of possibilities for building models that can handle small, large or zero sample sizes and that incorporate new information as it emerges.
If you’ve made it this far, dear reader, congratulations! I’ll leave you with one more neat trick…the “predictive distribution” (that is, the distribution of what a quarterback’s EPA will be in their NEXT game) of a normal-normal mixture is also normal! Hopefully you’ve guessed by now that its mean will be the revised M_M. But what is its standard deviation? It obeys something called the “law of total variance”:
Total variance = process variance + parameter variance
where variance is standard deviation squared. Remember that S² is the process variance and M_S² is the parameter variance.
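A quick check with the numbers we’ve already computed shows where this leads (again a sketch; the Brees M_S here is the updated value from the example above):

```python
def predictive_sd(m_s, s=10.034):
    # Law of total variance: predictive variance = process + parameter variance.
    return (s**2 + m_s**2) ** 0.5

print(predictive_sd(0.58))   # Brees after 292 games: about 10.05
print(predictive_sd(3.030))  # Tua with zero games: about 10.48
```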
Here’s our chart above, with the predictive distribution added:

[Figure: the chart above with each quarterback’s predictive distribution overlaid]
See how the predictive standard deviations are all pretty close to each other? In mathematical terms, this means that for a single game result the process variance dominates the parameter variance. In non-mathematical terms, it translates to one of the most famous expressions in NFL history:
“On any given Sunday, any team in the NFL can beat any other team.”
Copyright in the contents of this blog is owned by Plus EV Sports Analytics Inc. and all related rights are reserved thereto.