DIY Linear (and Nonlinear!) Regression using Maximum Likelihood

By Matthew Buchalter, PlusEV Analytics

Around the house, I am far from a handyman. I can change a light bulb, but anything more difficult than that and I’m gladly paying someone else to do it. Put me in cyberspace, though, and I become a regular Al Borland. There are plenty of plug & play software tools for building regression-type models (Generalized Linear Models if you’re fancy); they all do pretty much the same thing, and I’m going to show you why I reject them all and do it myself from scratch.

Linear regression models look like this:

Thing you’re predicting ~ normal (intercept + bunch of coefficients * bunch of variables, a constant standard deviation that you may or may not care about)

Generalized Linear Models look like this:

Some transformation function(thing you’re predicting) ~ some distribution (intercept + bunch of coefficients * bunch of variables)

I’ll spare you the long-winded explanation because it’s available in a million other places, but basically the GLM comes in handy if you don’t want to use the normal distribution, and/or if the thing you’re predicting only lives in a certain range of values such as “can’t be negative” or “must be between 0 and 1”.

Statisticians talk a lot about “fitting” a model. All this means is finding the parameters (intercept, coefficients, etc.) that result in the closest match in your data set between what the model would have predicted and what actually happened. There are several possible choices for how to define “closest match,” but the most common, called “maximum likelihood,” works as follows:

(Note: The following is simplified for a non-technical audience. If you’re a statistics PhD and you feel the need to nitpick, don’t. I know, and I don’t care.)

The closest match is achieved when the model’s predicted probability of observing exactly what actually happened is at its highest possible point.

Simple example: suppose a team goes 7-3 in its first 10 games and you need to predict the probability of the team winning its next game, without being given any additional information. For any possible choice of your parameter p, between 0 and 1, the probability of observing what actually happened is, in Excel format, =BINOM.DIST(7,10,p,FALSE). This is called the “likelihood function”.

Here is the likelihood function for all possible values of p:

The red X shows where the likelihood is at its highest point; unsurprisingly, it’s at p = 0.70 where the likelihood is approximately 0.27. So our maximum likelihood fitting procedure would result in a selection of p = 0.70.
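If you’d rather follow along outside Excel, here’s a minimal Python sketch of the same calculation, using scipy’s binomial pmf in place of BINOM.DIST (the grid of candidate p values is my own choice):

```python
import numpy as np
from scipy.stats import binom

# Candidate values of p between 0 and 1
p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood of observing exactly 7 wins in 10 games for each candidate p
# (same as Excel's =BINOM.DIST(7, 10, p, FALSE))
likelihood = binom.pmf(7, 10, p_grid)

best_p = p_grid[np.argmax(likelihood)]
print(best_p, likelihood.max())   # ~0.70 and ~0.27, matching the red X
```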

This works great when our data set has 10 observations and our likelihood function has a simple distribution. In real life with large data sets and complex distributions, your likelihood function will never get anywhere close to 0.27. It may look like something closer to 0.000000(insert a hundred thousand more 0s)27. Which is fine, the math still works perfectly well, it’s just that computers have a hard time handling numbers that small…they’ll just say “meh whatever, close enough” and round the number to zero. But when you’re doing maximum likelihood it’s not meh whatever. It’s not meh whatever at all. So we amend as follows:

The closest match is achieved when the logarithm of the model’s predicted probability of observing exactly what actually happened is at its highest possible point.

Maximizing the logarithm of the likelihood (“loglikelihood”) is mathematically equivalent to maximizing the likelihood itself, but it avoids the tiny-number computer problem because the logarithm of a tiny number is a large, but manageable, negative number. Even though it’s negative, you still want to maximize it; a loglikelihood of -49999 is better than a loglikelihood of -50000. Another handy thing about loglikelihoods is that if your data set contains a bunch of independent observations, the total loglikelihood is the SUM of the loglikelihoods of each of the individual observations.
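To see the tiny-number problem in action, here’s a quick illustration (the probability of 0.001 and the 200 observations are made-up numbers, chosen only to force the underflow):

```python
import numpy as np

# 200 independent observations that each had probability 0.001 under some model
probs = np.full(200, 0.001)

print(np.prod(probs))          # 0.0 -- the true value, 1e-600, underflows to zero
print(np.sum(np.log(probs)))   # about -1381.6 -- a perfectly manageable number
```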

When you use a pre-built linear regression or GLM function in R, Python, Stata, SAS, etc., the code behind those functions is very likely doing what’s described above – finding the set of parameters that maximizes the loglikelihood. Let’s go through an example of how to do it the DIY way, then I will show you why it’s better.

We’re going to build a simple linear regression model to estimate the relationship between NBA player age and player performance, as measured by Basketball Reference’s “win shares per 48 minutes” metric.

The data from the 2020-2021 season looks like this, for all players who played 500+ total minutes (n = 362):

Player                       Age   WS/48
Precious Achiuwa              21   0.085
Steven Adams                  27   0.119
Bam Adebayo                   23   0.197
LaMarcus Aldridge             35   0.08
Nickeil Alexander-Walker      22   0.035
Grayson Allen                 25   0.101
Jarrett Allen                 22   0.166
Kyle Anderson                 27   0.143
Giannis Antetokounmpo         26   0.244
Thanasis Antetokounmpo        28   0.057
Carmelo Anthony               36   0.073
Cole Anthony                  20  -0.001
etc…                         etc…  etc…

(Note for nitpickers: this is a terrible way to build a model in the real world because the data set contains selection bias – only the best players stay in the league into their late 30s. But it’s a free, easily accessible data set that I’m using to explain modeling concepts that can be applied anywhere, so let’s just roll with it, ok?)

Step 1: Set up your parameters. This is a simple linear regression, so there is an intercept and an age parameter. We also need to specify a standard deviation – that may seem a little more foreign to you, because packages do it automatically behind the scenes. For the initial values of the parameters, you can choose anything as long as it doesn’t cause errors (i.e., don’t set the standard deviation to zero).

Step 2: Use your data set and your parameters to set up your loglikelihood.

Step 3: Take the sum of the loglikelihoods.

Step 4: Use Solver (or any other optimization package if you’re not doing this in Excel) to find the parameters that maximize the total loglikelihood.
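If you’re not doing this in Excel, here’s roughly what Steps 1 through 4 might look like in Python, with scipy’s optimizer standing in for Solver (the data frame, the column names age and ws48, the file name, and the starting values are all my own assumptions):

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical file: one row per player, columns "age" and "ws48"
df = pd.read_csv("nba_ws48_2021.csv")

def negative_loglikelihood(params):
    intercept, b, s = params                    # Step 1: the parameters
    mean = intercept + b * (df["age"] - 19)     # Step 2: the model's predicted mean for each player
    loglik = norm.logpdf(df["ws48"], mean, s)   # Step 2: loglikelihood of each observation
    return -loglik.sum()                        # Step 3: total loglikelihood (negated for the minimizer)

# Step 4: maximize the loglikelihood (= minimize its negative), starting from harmless values
start = [0.1, 0.0, 0.05]                        # intercept, b, s
fit = minimize(negative_loglikelihood, start, method="Nelder-Mead")
print(fit.x)                                    # fitted intercept, b, s
```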

And, we solve.

Oops! This is a common problem in the world of DIY model fitting in Excel. In exploring the set of possible parameters to see which one maximizes the likelihood, it hits one that’s so terrible that the loglikelihood doesn’t even calculate, which causes the whole process to go to hell. We can use a simple trick to make sure these cases don’t crash the system but they also don’t get selected as the maximum likelihood:
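In Excel, one way to do that is to wrap the loglikelihood in something like IFERROR so a bad parameter set returns a huge negative number instead of an error. Here’s the same idea applied to the Python sketch above (the 1e10 penalty is an arbitrary “so bad it will never win” value):

```python
def negative_loglikelihood(params):
    intercept, b, s = params
    if s <= 0:                                  # a "terrible" parameter set: the normal density isn't defined
        return 1e10                             # huge penalty: doesn't crash, will never be the maximum
    mean = intercept + b * (df["age"] - 19)
    total = norm.logpdf(df["ws48"], mean, s).sum()
    if not np.isfinite(total):                  # catch any other numerical blow-ups
        return 1e10
    return -total
```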

With that unpleasantness out of the way, we can continue with Solver:

And there’s our DIY regression.

So what’s the point of going to all this trouble to get the exact same model that the software package would give, other than the satisfaction of doing it yourself? One word:

To explain, let’s start by recapping the current state of our model. It can be specified as follows:

Player’s WS/48 ~ Normal ( intercept + b * (age – 19) , s )

With fitted values of intercept = 0.079842, b = 0.002513 and s = 0.056919.

Here is a graph of the average actual and fitted values by age:

To paraphrase Dickens, “it was the best of fits, it was the worst of fits”. The orange line is the best possible straight-line fit to the blue data, but it doesn’t fit the data very well at all. The orange line is too flat on the left side, and it’s too steep on the right side. What we have here is a classic “underfit” – our model is too simple to accurately represent the data. We need to add something to spice it up a little bit.

To explore how we can do that, let’s go back to the definition of a linear regression:

Thing you’re predicting ~ normal (intercept + bunch of coefficients * bunch of variables, a constant standard deviation that you may or may not care about)

Our only option for upgrading this thing is to add to the “bunch” of coefficients/variables, as our current total of one is a pretty sorry bunch.

We don’t have any variables in our data set other than age, but we can still beef up our model by including transformations of the age variable.

One common transformation is the polynomial, such as:

Player’s WS/48 ~ Normal ( intercept + b1 * (age – 19) + b2 * (age – 19)^2 + b3 * (age – 19)^3 + b4 * (age – 19)^4 , s )
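In the DIY setup, this only changes the mean line inside the loglikelihood function; a sketch (the parameter ordering is my own choice, with s kept at the end of the parameter vector):

```python
def polynomial_mean(params, age):
    # params = [intercept, b1, b2, b3, b4, s]; s is used in the loglikelihood, not here
    intercept, b1, b2, b3, b4 = params[:5]
    x = age - 19
    return intercept + b1 * x + b2 * x**2 + b3 * x**3 + b4 * x**4
```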

Looks pretty good, right? Hold on, just a second, I’m getting some breaking news coming across the wire…WOW, Dikembe Mutombo is coming out of retirement next year! Trae Young convinced him to return at age 55 to help lead the Hawks to glory! Ok cool, let’s just add him in:

The job needed a hammer and by using a polynomial, we brought an atomic bomb. Got the job done at the mere cost of blowing up the entire city. The thing about polynomial transformations is that they tend to look really good for “typical” values in your data set, but any outliers can blow you to smithereens – especially outliers that weren’t present when you initially fit the model. So what do you say, Dikembe? Do we like polynomial transformations?

Another, less explodey option would be to use what’s called a “linear spline”. In a linear spline, the predicted values still form a straight line, but that line is broken into segments across the x-axis, each with its own slope, and the segments are delineated by edge points called “knots”.

Spline, spline, everywhere a spline…

Here’s how we would specify a linear spline with a single knot at age 30:

Player’s WS/48 ~ Normal ( intercept + b1 * (min(age,30) – 19) + b2 * max(age – 30, 0) , s )
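Again, only the mean changes in the DIY loglikelihood; a sketch of the single-knot version with the knot fixed at 30:

```python
import numpy as np

def spline_mean(params, age):
    # params = [intercept, b1, b2, s]; the knot is fixed at age 30 here
    intercept, b1, b2 = params[:3]
    return (intercept
            + b1 * (np.minimum(age, 30) - 19)   # slope up to the knot
            + b2 * np.maximum(age - 30, 0))     # extra slope after the knot
```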

Looks pretty good. But how do we know that 30 is the best possible place to put the knot? We could try putting it in a few different places and see which one produces the best loglikelihood…but that’s time-consuming. But wait – “try a lot of different values and pick the one that produces the best loglikelihood” is exactly how we’re already fitting our parameters, just automated using Solver. So can’t we just fit the knot point as another parameter in the model?

You’re damn right we can.

Player’s WS/48 ~ Normal ( intercept + b1 * (min(age,k) – 19) + b2 * max(age – k, 0) , s )

Voila. The best possible knot point is at age 28.
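In code, the change is just as small: k moves into the parameter vector and the optimizer searches over it along with everything else. A sketch, continuing the earlier setup (the starting value of 27 for k is my own guess):

```python
def negative_loglikelihood(params):
    intercept, b1, b2, k, s = params            # k is now just another parameter
    if s <= 0:
        return 1e10                             # same guard trick as before
    mean = (intercept
            + b1 * (np.minimum(df["age"], k) - 19)
            + b2 * np.maximum(df["age"] - k, 0))
    return -norm.logpdf(df["ws48"], mean, s).sum()

start = [0.1, 0.0, 0.0, 27.0, 0.05]             # intercept, b1, b2, k, s
fit = minimize(negative_loglikelihood, start, method="Nelder-Mead")
```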

In this final version of the model, I slipped in a MASSIVE change, but I did it so smoothly that you probably didn’t even notice. Let’s take another look at our model:

Player’s WS/48 ~ Normal ( intercept + b1 * (min(age,k) – 19) + b2 * max(age – k, 0) , s )

The parameters are intercept, b1, b2, k and s.

Refer back to the definition of linear regression:

Thing you’re predicting ~ normal (intercept + bunch of coefficients * bunch of variables, a constant standard deviation that you may or may not care about)

Every version of the model we’ve built so far fits nicely into the “bunch of coefficients * bunch of variables” format and could easily be done with any regression software in the world…until now. By sticking that k in there, we just reached escape velocity, broke the surly bonds of linearity and are free floating in the vast space of Generalized Nonlinear Models. Try this in your regression software and it will be somewhere between “ok but it’s difficult and have fun learning a scary new function” and “sorry sir/ma’am, we don’t do that here and I’m going to have to ask you to leave before you make a fuss”. In our DIY regression, it’s so simple that we didn’t even notice (or really care about) the shift into the nonlinear class of models.

And it’s not just splines. Once you cast off the shackles of coefficients * variables, all kinds of possibilities open up. Caps, floors, ballast models, all kinds of parameters that used to have to be set by the modeler making an assumption can now be directly fit to the data.

The Generalized Nonlinear Model:

Thing you’re predicting ~ any distribution you want(any parameters you want, specified in any way you want)
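As a closing sketch, here’s what that general recipe can look like in code: pick any distribution from scipy.stats, write any function you like for its parameters, and hand the negative total loglikelihood to an optimizer. Everything below is illustrative, not a specific model from this post:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm   # swap in any scipy.stats distribution you want

def negative_loglikelihood(params, x, y):
    # Any parameters you want, specified in any way you want --
    # here a slope, a cap and a standard deviation, purely as an example:
    slope, cap, s = params
    mean = np.minimum(slope * x, cap)           # a capped relationship: not "coefficient * variable"
    if s <= 0:
        return 1e10                             # guard against invalid parameter sets
    return -norm.logpdf(y, mean, s).sum()

# Usage (x and y are your data arrays):
# fit = minimize(negative_loglikelihood, [1.0, 5.0, 1.0], args=(x, y), method="Nelder-Mead")
```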

