Python, My New Year’s Resolution?

By Matthew Buchalter, PlusEV Analytics

Yes, it’s that time of year where we look back on what we’ve accomplished and look forward to the future.

Global pandemic aside, it’s been a pretty cool year for Plus EV Analytics. I started my own blog, and I got to appear on some great podcasts including my friends at the Deep Dive, Captain Jack’s Happy Hour and the legendary Gambling With An Edge. My content is now reaching 15,000 site visitors and nearly 8,000 Twitter followers. We found a fun angle on the MLB shortened season that many readers were able to profit from.

So what will 2021 bring? For one thing, I will be turning 40 and crossing the unofficial boundary into “middle aged”. It’s the kind of thing that makes one reflect on one’s purpose in life. And of course I want to be a good husband, good father, blah blah blah but this a math blog…so what’s my purpose in math? I’m fortunate enough over the last couple years, through my day job, my betting pursuits and this blog, to have developed a pretty good sense of my purpose in math.

My purpose is to become one of the world’s leading experts in problems of “small data”.

What is “small data”? Is it a term I just made up? (Obviously, yes.) At a time when the world is increasingly focused on “big data” and machine learning, it’s just not something I have much interest in, or much ability either. You can’t teach a middle aged dog new tricks. To borrow from actuarial terminology, “small data” exists at the intersection of two distinct modeling approaches:

  • “Exposure Rating” is where you predict the value (or distribution) of an unknown outcome using a set of predictor variables (“covariates”) that are correlated with the outcome. Common techniques for exposure rating include regressions, generalized linear models, decision trees, etc.
  • “Experience rating” is where you predict the value (or distribution) of an unknown outcome using a set of previous observations from the same process. Common techniques for experience rating include means, medians, capping or filtering for outliers, interpolation, extrapolation, etc.

Experience rating works best when your sample size is high; i.e. when you have many previous observations of the thing you’re trying to predict and/or when you have few or no or weak covariates available. Exposure rating works best when you have zero previous observations of the thing you’re trying to predict and/or when you have many strong covariates available. My world is the in between: Where you have some previous observations but not enough to rely entirely on experience rating, and you have some covariates but not strong enough to rely entirely on exposure rating. Finding that balance requires a set of tools that don’t seem to exist anywhere in textbooks or academia, as far as I’ve found. So I’ve developed them myself…

…which leads me to my New Year’s Resolution? (emphasis on the ?) My twitter followers know how much of a fan I am of doing data science in Excel and that I have refused to switch to R or Python despite being called a dinosaur many times by very smart people. Some of it is pure stubbornness on my part…yes Excel is clunky and slow, but I love being able to visualize each step of a complex algorithm as cells instead of as lines of code. It allows me to spot and fix errors much more quickly and thoroughly, both logic errors and coding errors. Also, pivot tables are a freaking godsend and I might cry if I lost them. But I’ve experimented a little with Python and it seems like something I would be able to get used to eventually if I put in the effort.

Here’s the main problem though. To solve “small data” problems, I need Excel solver like I need oxygen. The reason is that predefined functions exist (both in Excel and in Python) to fit exposure rating models, and predefined functions exist to fit experience rating models. There are no predefined functions to fit the kind of hybrid models that I develop. So I do it using first principles, finding the set of parameters that maximizes the loglikelihood of the data. This is where Excel shines with its cell structure and its Solver module; and this, loyal readers, is where I need your help if I’m going to make the switch to Python. I need to find a replacement.

Here’s an example of the type of “small data” model I would like to build. In this made-up problem, we are trying to predict the rate of goals scored per game for a set of hockey players using this data:

In the exposure rating module, we are fitting a “prior” as a function of the salary and the rookie indicator in a form similar to a log-link GLM, except that we are adding a “floor” parameter on the salary to account for a non-linearity in the effect:

ln(prior) = P_intercept + P_logsalary * ln(max(salary, P_minsalary)) + P_rookie * (rookie indicator)

In the experience rating module, we are combining our prior with our last year observed using a “ballast” approach to form a posterior:

Posterior = (last year goals + P_ballast * prior) / (last year games played + P_ballast)

That gives us 5 total parameters, that we are going to fit using maximum posterior likelihood:

Loglik = ln(poisson.dist(x = this year goals, mean = this year games played * posterior, cumulative = false))

Find {P_intercept, P_logsalary, P_minsalary, P_rookie, P_ballast} that maximizes loglik.

Here’s how to fit the model using Solver:

Here is the Excel file if you want to see the formulas:

Is there any way to do this in Python without writing hundreds of lines of custom code? If yes, please email me at PlusEVAnalytics@outlook.com or DM me on Twitter.

Thanks, and happy new year everyone!

Copyright in the contents of this blog are owned by Plus EV Sports Analytics Inc. and all related rights are reserved thereto.

Leave a Comment

Your email address will not be published.

We uses cookies to analyze website traffic and optimize your website experience. By accepting our use of cookies, your data will be aggregated with all other user data.