To The Extreme: MLB Season Props

By Matthew Buchalter, PlusEV Analytics

“To the extreme, I rock a mic like a vandal
Light up a stage and wax a chump like a candle” – Vanilla Ice

“Any player scores 49+ runs?”

“Any player has 77+ hits?”

“Any player hits 20+ home runs?”

How do we price these “any player” props, adjusting for the fact that this year’s MLB season is 60 games instead of the standard 162?

The natural inclination is to take the expectation for a typical season’s league leading total, and prorate it by a factor of 60/162. That would be wrong, for reasons that were well articulated by one of my Twitter followers:

Pretty much every year, there’s someone early in the season who is on pace to hit .400 / 80 home runs / 180 RBI / etc. They never keep it up through the entire season, because random variance decreases as sample size increases. These guys aren’t 80 home run hitters, they’re 50 home run hitters with a run of favourable variance that put them, temporarily, on an 80 home run pace. There are an equal number of 50 home run hitters with a run of unfavourable variance that put them, temporarily, on a 20 home run pace – but they’re not at the top of the leaderboard so we don’t care so much about them. The bets in question are all about extreme values, and shorter seasons mean more extreme values.

So here’s my modeling approach to this problem. Yours may be different, possibly even better…but we’re talking about props that are mostly only available at 5Dimes and with a $250 limit, so I’m going “quick & dirty”…meaning that I’m taking some shortcuts that sacrifice a little bit of predictive accuracy to save a lot of time. I built this model in approximately 2 hours.

Let’s pause here to go over some theory, to lay the foundation for what we’re about to do.

Suppose you have 50 shooters each taking 20 shots at a target, and that each shot hits the bullseye with 10% probability.

For any given shooter, the probability of under 5.5 bullseyes can be calculated using the Binomial distribution. In Excel this is =BINOM.DIST(5.5,20,0.1,TRUE) = 98.9%. (Excel truncates the 5.5 to 5, so this is the probability of 5 or fewer bullseyes.)

Assuming the shooters are independent of each other, the probability that ALL 50 shooters go under 5.5 bullseyes is =BINOM.DIST(5.5,20,0.1,TRUE)^50 = 56.8%.

The opposite of that, 100% – 56.8% = 43.2%, is the probability that not all shooters go under 5.5 bullseyes; that is, the probability that at least one shooter goes over 5.5.

The probability that the top score is exactly 6 bullseyes is the probability that at least one shooter goes over 5.5 minus the probability that at least one shooter goes over 6.5. In Excel, =BINOM.DIST(6.5,20,0.1,TRUE)^50-BINOM.DIST(5.5,20,0.1,TRUE)^50 = 31.95%.

Repeating this calculation for every possible score gives the complete distribution of the top score.
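Here is a minimal Python sketch of the shooter example, using only the standard library. The function names `binom_cdf` and `top_score_pmf` are illustrative (they are not from the spreadsheet); `binom_cdf` plays the role of BINOM.DIST(...,TRUE).

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); mirrors BINOM.DIST(k, n, p, TRUE)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def top_score_pmf(n, p, z):
    """P(top score == x) for x = 0..n among z independent, identical shooters."""
    pmf = []
    prev = 0.0  # probability that every shooter scores below 0 is zero
    for x in range(n + 1):
        cur = binom_cdf(x, n, p) ** z  # probability that all z shooters score <= x
        pmf.append(cur - prev)
        prev = cur
    return pmf

# 50 shooters, 20 shots each, 10% bullseye probability per shot
pmf = top_score_pmf(n=20, p=0.1, z=50)
for x, prob in enumerate(pmf):
    if prob >= 0.0005:  # skip scores that are essentially impossible
        print(f"top score {x}: {prob:.1%}")
```

The entry for a top score of 6 can be checked against the 31.95% worked out above.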

So this is the functional form of the model we’re going to use in our study of MLB props:

P(top score of x) = BINOM.DIST(x+0.5,n,p,TRUE)^z – BINOM.DIST(x-0.5,n,p,TRUE)^z

Where n is the number of at-bats, p is the probability of getting a run/hit/home run/whatever in each at-bat, and z is the number of “contenders”, each of whom plays a full season of n at-bats with success probability p. Think of n as the 20 shots per shooter, p as the 10% bullseyes per shot, and z as the 50 shooters.

Of all of these components, z is the toughest one to wrap my head around. Just like in the shooter example, the model assumes that everyone has the same number of attempts and that everyone has the same probability of success. There are hundreds of MLB players, but only a small number of those will play the full season AND are skilled enough to contend for the league lead in whatever category is in question. Assuming that each contender is equally skilled, rather than some kind of hierarchy, is one of those “quick & dirty” shortcuts.

Because z is unknown and can vary from season to season, we’re going to treat it as a random variable with its own distribution. I’m going to start with a pool of 50 potential players and pick a subset of those 50 using another binomial distribution – but truncated because z cannot be 0. The subset will depend on two parameters, a “survival” parameter to indicate the probability of playing a full season and a “skill” parameter to indicate the size of the subset that is skilled enough to contend for the league lead in the stat in question.

P(z) = BINOM.DIST(z,50,survival*skill,FALSE) / (1-BINOM.DIST(0,50,survival*skill,FALSE))
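As a rough sketch, the truncated binomial for z can be written in a few lines of Python. The function name `contender_pmf` is made up for illustration, and the survival and skill values plugged in are the ones this article arrives at for hits (0.612 and 0.118); any other pair would work the same way.

```python
from math import comb

def contender_pmf(survival, skill, pool=50):
    """P(z) for z = 1..pool: Binomial(pool, survival*skill) conditioned on z >= 1."""
    q = survival * skill
    raw = [comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(pool + 1)]
    norm = 1.0 - raw[0]  # drop the z = 0 case and rescale what is left
    return {z: raw[z] / norm for z in range(1, pool + 1)}

pz = contender_pmf(survival=0.612, skill=0.118)
mean_z = sum(z * prob for z, prob in pz.items())
print(f"average number of contenders, given at least one: {mean_z:.2f}")
```

Because the z = 0 case is removed, the average comes out slightly above the untruncated 50 * survival * skill.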

For the survival parameter – in 2019, 101 players played 140 or more games. If we assume that the average AL team has 6 full time starting hitters and the average NL team has 5, we can estimate the survival parameter as 101/(30*5.5) = 0.612, so around 61% of full time starters will go the entire year without a significant injury.

We’re also going to assume that a full season consists of 600 at-bats.

So our complete model is:

P(top score of x, conditional on z) = BINOM.DIST(x+0.5,600,p,TRUE)^z – BINOM.DIST(x-0.5,600,p,TRUE)^z

P(z) = BINOM.DIST(z,50,0.612*skill,FALSE) / (1-BINOM.DIST(0,50,0.612*skill,FALSE))

This can be set up easily in a spreadsheet by having each column represent a different value of z and each row represent a different value of x.
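The same rows-and-columns layout can be sketched in Python. This is an illustration under the model's stated assumptions, not the actual spreadsheet: the helper names are invented, and the p and skill values in the usage lines are the fitted hits parameters reported further down in this article.

```python
from math import comb, exp, lgamma, log

def binom_cdf_table(n, p):
    """[P(X <= x) for x in 0..n] for X ~ Binomial(n, p), built in log space."""
    cdf, running = [], 0.0
    for k in range(n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1 - p))
        running += exp(log_pmf)
        cdf.append(min(running, 1.0))  # guard against rounding above 1
    return cdf

def contender_weights(pool, q):
    """P(z) for z = 1..pool: Binomial(pool, q) truncated to exclude z = 0."""
    raw = [comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(pool + 1)]
    return {z: raw[z] / (1 - raw[0]) for z in range(1, pool + 1)}

def top_score_pmf(n, p, survival, skill, pool=50):
    """Unconditional P(top score == x): the conditional model averaged over z."""
    cdf = binom_cdf_table(n, p)
    weights = contender_weights(pool, survival * skill)
    pmf = []
    for x in range(n + 1):                       # one "row" per value of x
        below = cdf[x - 1] if x > 0 else 0.0
        row = sum(w * (cdf[x]**z - below**z)     # one "column" per z, weighted by P(z)
                  for z, w in weights.items())
        pmf.append(row)
    return pmf

def median_of(pmf):
    """Smallest x whose cumulative probability reaches 50%."""
    running = 0.0
    for x, prob in enumerate(pmf):
        running += prob
        if running >= 0.5:
            return x

hits_pmf = top_score_pmf(n=600, p=0.336, survival=0.612, skill=0.118)
print("median league-leading hits, full season:", median_of(hits_pmf))
```

Summing each row across the z columns, weighted by P(z), is what turns the conditional distribution into the one actually used for pricing.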

This leaves us with two unknown parameters: p and skill. How do we fit them? I used a method called “maximum likelihood estimation” where I use Excel Solver to find the set of parameters that provides the best fit to a set of historical data. For my data set I’m using the last 12 years of league leaders, 2008-2019, roughly corresponding to the post-steroid era in MLB.
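For readers without Solver, here is a hedged sketch of the same idea in Python: write down the log-likelihood of the 12 observed league-leading totals and search for the (p, skill) pair that maximizes it. A coarse grid search stands in for Excel Solver here, so the answer depends on the grid spacing and need not match the Solver output exactly. The data are the hits leaders listed later in this article; the function names and grid bounds are assumptions made for illustration.

```python
from math import comb, exp, lgamma, log

# League-leading hit totals, 2019 back to 2008, as listed in this article
HITS_LEADERS = [206, 192, 213, 216, 205, 225, 199, 216, 213, 214, 225, 213]

def binom_cdf_table(n, p):
    """[P(X <= x) for x in 0..n] for X ~ Binomial(n, p), built in log space."""
    cdf, running = [], 0.0
    for k in range(n + 1):
        running += exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                       + k * log(p) + (n - k) * log(1 - p))
        cdf.append(min(running, 1.0))
    return cdf

def log_likelihood(p, skill, leaders, n=600, survival=0.612, pool=50):
    """Sum of log P(top score == x) over the observed league leaders."""
    cdf = binom_cdf_table(n, p)
    q = survival * skill
    raw = [comb(pool, z) * q**z * (1 - q)**(pool - z) for z in range(pool + 1)]
    norm = 1 - raw[0]  # truncation: z cannot be 0
    total = 0.0
    for x in leaders:
        below = cdf[x - 1] if x > 0 else 0.0
        like = sum(raw[z] / norm * (cdf[x]**z - below**z)
                   for z in range(1, pool + 1))
        total += log(max(like, 1e-300))  # floor avoids log(0) far from the data
    return total

def fit_grid(leaders):
    """Coarse grid search over p and skill; returns (log-likelihood, p, skill)."""
    best = None
    for i in range(36):                 # p from 0.300 to 0.370 in steps of 0.002
        p = 0.300 + 0.002 * i
        for j in range(1, 21):          # skill from 0.02 to 0.40 in steps of 0.02
            skill = 0.02 * j
            ll = log_likelihood(p, skill, leaders)
            if best is None or ll > best[0]:
                best = (ll, p, skill)
    return best

best_ll, best_p, best_skill = fit_grid(HITS_LEADERS)
print(f"grid fit: p = {best_p:.3f}, skill = {best_skill:.2f}, log-likelihood = {best_ll:.2f}")
```

A finer grid, or a proper optimizer, would refine the estimate; the point is only that "best fit" means "highest likelihood of the observed leaders".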

Once we have estimated p and skill, the last thing we have to do is convert the model from a 162 game season to a 60 game season:

  • The number of at-bats changes from 600 to 600*60/162 = 222.
  • Because it’s easier to survive a short season than a long one, the injury risk is scaled down by a factor of 60/162: 1 - (1 - 0.612)*60/162 = 0.856. This changes the survival parameter from 0.612 to 0.856.
  • p and skill are unchanged.

P(top score of x, conditional on z, 60 game season) = BINOM.DIST(x+0.5,222,p,TRUE)^z – BINOM.DIST(x-0.5,222,p,TRUE)^z

P(z, 60 game season) = BINOM.DIST(z,50,0.856*skill,FALSE) / (1-BINOM.DIST(0,50,0.856*skill,FALSE))
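The two adjustments are simple enough to check by hand, but here they are as a short Python sketch. The variable names are illustrative; the inputs (600 at-bats, survival of 0.612, 60 and 162 games) come straight from the text.

```python
FULL_SEASON_GAMES = 162
SHORT_SEASON_GAMES = 60
scale = SHORT_SEASON_GAMES / FULL_SEASON_GAMES

# At-bats shrink in proportion to the schedule.
at_bats_60 = round(600 * scale)

# Injury risk (1 - survival) shrinks in proportion to the schedule,
# so survival moves closer to 1.
survival_full = 0.612
survival_60 = 1 - (1 - survival_full) * scale

print(f"at-bats in a 60 game season: {at_bats_60}")
print(f"survival in a 60 game season: {survival_60:.3f}")
```

Note that the adjustment scales the injury risk, not the survival probability itself; prorating 0.612 directly would wrongly make a short season harder to survive.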

Ready? Let’s go.

All lines are from 5Dimes as of July 21. Note: If you bet these, it’s at your own risk. I’m confident in this model but it’s not perfect, it’s quick & dirty. Feel free to agree or disagree with it, feel free to bet it or not bet it. I’m not a tout, I’m just a math guy.

Hits

Line: Any player has 77+ hits -145 / No player has 77+ hits +115

12 year history:

Season  League-leading hits
2019    206
2018    192
2017    213
2016    216
2015    205
2014    225
2013    199
2012    216
2011    213
2010    214
2009    225
2008    213
Average: 211.4
Average prorated to 60 games: 78.3
2019 leader after 60 games: 80

See how the 2019 leader after 60 games had more than the prorated average? That’s evidence of the “fewer games = more variance” phenomenon described in the tweet at the top of this article. It’s going to be a recurring theme as we go. The fact that this prop is lined at 76.5, even below the prorated average (which we know is already too low) is a good sign.
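A quick simulation makes the same point. This is a sketch under loudly stated assumptions: 4 equally skilled contenders, each getting a hit in 33.6% of at-bats, with 222 of 600 at-bats standing in for the first 60 games. None of these numbers is fitted here; they are only chosen to be in the right ballpark for hits.

```python
import random

def simulate_leaders(trials=500, contenders=4, p=0.336,
                     full_ab=600, short_ab=222, seed=1):
    """Average leader after short_ab at-bats vs. the prorated full-season leader."""
    rng = random.Random(seed)
    early_total = 0
    final_total = 0
    for _ in range(trials):
        early_best = 0
        final_best = 0
        for _ in range(contenders):
            # hits in the first short_ab at-bats, then in the rest of the season
            early = sum(rng.random() < p for _ in range(short_ab))
            rest = sum(rng.random() < p for _ in range(full_ab - short_ab))
            early_best = max(early_best, early)
            final_best = max(final_best, early + rest)
        early_total += early_best
        final_total += final_best
    avg_early = early_total / trials
    avg_final_prorated = (final_total / trials) * short_ab / full_ab
    return avg_early, avg_final_prorated

avg_early, avg_prorated = simulate_leaders()
print(f"average leader after 222 at-bats:       {avg_early:.1f}")
print(f"average full-season leader, prorated:   {avg_prorated:.1f}")
```

Every simulated player has identical skill, so any gap between the two averages comes purely from the extra variance of the shorter sample.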

Fitting our model gives:

  • Skill = 0.118. This is a pretty low number: it means that on average there are 50 * 0.612 * 0.118 = 3.6 contenders for the league lead in hits each year. This makes sense because hits are not very random over the course of a full season – it’s very unlikely for a bad hitter or even an average hitter to have a stretch of good luck that’s enough to lead the league in hits for a full year.
  • p = 0.336. This means that each contender will average 0.336 hits per at-bat.

For the full year, the fitted distribution has a median of 212, which seems in line with the historical numbers. So far so good.

Converting to a 60 game schedule by making the adjustments to the parameters for total at-bats and survival as described above gives a median of 82.5. That’s where I would put the number if I were the bookie.

Over 76.5 has a projected probability of 87.2%, which at -145 odds yields a tidy +47% EV.
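To make the last two steps concrete, here is a hedged Python sketch that recomputes the "any player has 77+ hits" probability from the 60 game parameters and then converts a win probability and American odds into EV per unit staked. The function names are invented for illustration, and because this is a re-implementation rather than the original spreadsheet, small differences from the 87.2% quoted above are possible.

```python
from math import comb, exp, lgamma, log

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space."""
    total = 0.0
    for i in range(k + 1):
        total += exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                     + i * log(p) + (n - i) * log(1 - p))
    return min(total, 1.0)

def prob_any_player_at_least(threshold, n, p, survival, skill, pool=50):
    """P(top score >= threshold), averaging over the truncated binomial for z."""
    single_below = binom_cdf(threshold - 1, n, p)  # one contender stays under
    q = survival * skill
    p_zero = (1 - q)**pool
    all_below = sum(comb(pool, z) * q**z * (1 - q)**(pool - z) / (1 - p_zero)
                    * single_below**z
                    for z in range(1, pool + 1))
    return 1 - all_below

def expected_value(prob_win, american_odds):
    """EV per 1 unit staked: win probability times payout, minus loss probability."""
    if american_odds < 0:
        payout = 100 / -american_odds   # e.g. -145 pays 100 per 145 risked
    else:
        payout = american_odds / 100    # e.g. +115 pays 115 per 100 risked
    return prob_win * payout - (1 - prob_win)

model_prob = prob_any_player_at_least(77, n=222, p=0.336,
                                      survival=0.856, skill=0.118)
print(f"model probability of 77+ hits: {model_prob:.1%}")
print(f"EV at -145 using the quoted 87.2%: {expected_value(0.872, -145):+.0%}")
```

The same `expected_value` call works for every prop in this article: plug in the projected probability and the posted odds.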

Runs

Line: Any player scores 49+ runs -150 / No player scores 49+ runs +120

12 year history:

Season  League-leading runs
2019    135
2018    129
2017    137
2016    123
2015    122
2014    115
2013    126
2012    129
2011    136
2010    115
2009    124
2008    125
Average: 126.3
Average prorated to 60 games: 46.8
2019 leader after 60 games: 53

Skill = 0.185. Runs are a little more random than hits, as is to be expected because scoring runs has a dependency on your teammates to drive you in.

p = 0.191 runs per at-bat.

Median, full season: 126.5 runs.

Median, 60 game season: 50.5 runs.

Probability of over 48.5 runs: 71.2%. At -150 odds, that’s a +19% EV.

Home Runs

Line: Any player hits 20+ home runs -125 / No player hits 20+ home runs -105.

This is a tougher one because there’s been a substantial variation in the home run rate over the past few years due to the composition of the ball and some physics stuff that is way beyond my comprehension (google it). I’m going to run my model as normal, ignoring the ball stuff, but interpret the results with a big grain of salt.

12 year history:

Season  League-leading home runs
2019    53
2018    48
2017    59
2016    47
2015    47
2014    40
2013    53
2012    44
2011    43
2010    54
2009    47
2008    48
Average: 48.6
Average prorated to 60 games: 18.0
2019 leader after 60 games: 22

Skill = 0.163

p = 0.069 HR per at-bat

Median, full season: 48.5 HR

Median, 60 game season: 20.5 HR

Probability of over 19.5 HR: 62.7%. At -125 that’s a +13% EV. If you think we’re getting the 2019 juiced ball, it’s better than that.

RBI

Line: Any player records 48+ RBI -150 / No player records 48+ RBI +120

Technically, you can’t use our binomial-based model for RBI because a single at-bat can produce more than one of them. Practically, I don’t think it makes a significant difference – so we press on.

12 year history:

Season  League-leading RBI
2019    126
2018    130
2017    132
2016    133
2015    130
2014    116
2013    138
2012    139
2011    126
2010    126
2009    141
2008    146
Average: 131.9
Average prorated to 60 games: 48.9
2019 leader after 60 games: 53

Skill = 0.143

p = 0.203 RBI per at-bat.

Median, full season: 132 RBI

Median, 60 game season: 52.5 RBI.

Probability of over 47.5 RBI: 88.8%. At -150 that’s a whopping +48% EV.

Stolen Bases

Line: Any player steals 19+ bases -145 / No player steals 19+ bases +115

Again, the binomial model isn’t totally correct because stolen bases aren’t a subset of at-bats. Again, we don’t care. Again, we press on. Quick & dirty, my friends.

12 year history:

Season  League-leading stolen bases
2019    46
2018    45
2017    60
2016    62
2015    58
2014    64
2013    52
2012    49
2011    61
2010    68
2009    70
2008    68
Average: 58.6
Average prorated to 60 games: 21.7
2019 leader after 60 games: 21

What I’m more concerned about in applying this model is the general decreasing trend in stolen bases from the Rickey Henderson era through the Moneyball era to the present. So, proceed with extreme caution.

Skill = 0.022. You don’t win the stolen base title by accident. You basically know before the start of the season which one or two guys will definitely win it if they stay healthy. The number of skilled “contenders” is quite low.

p = 0.096 SB per at-bat.

Median, full season: 59 SB.

Median, 60 game season: 22.5 SB.

Probability of over 18.5 stolen bases: 81.6%. At -145 this would be a +38% EV. But I’m passing on this one. With a skill parameter this low, the “randomness” impact that gives us our edge is diminished. The model is giving an edge purely because it’s assuming no change in the stolen base rate over the past 12 years, whereas the book is assuming there is a change. I agree with the book on this one.

Doubles

Line: Any player hits 21+ doubles -150 / No player hits 21+ doubles +120

12 year history:

Season  League-leading doubles
2019    58
2018    51
2017    56
2016    48
2015    45
2014    53
2013    55
2012    51
2011    48
2010    49
2009    56
2008    54
Average: 52.0
Average prorated to 60 games: 19.3
2019 leader after 60 games: 21

Skill = 0.358. This is where the randomness really starts to get cranked up. There are players who specialize in home runs, stolen bases, even singles…but not really players who specialize in doubles. Any good hitter has a chance to lead the league in doubles in any given year.

p = 0.070 doubles per at-bat.

Median, full season: 52 doubles.

Median, 60 game season: 22.5 doubles.

Probability of over 20.5 doubles: 79.2%. At -150 this is a +32% EV.

Triples

Line: Any player hits 5+ triples -215 / No player hits 5+ triples +170.

12 year history:

Season  League-leading triples
2019    10
2018    12
2017    14
2016    11
2015    15
2014    12
2013    11
2012    15
2011    16
2010    14
2009    13
2008    19
Average: 13.5
Average prorated to 60 games: 5.0
2019 leader after 60 games: 8

Skill = 0.258. More random than home runs, less random than doubles. If you want to lead the league in triples, you’re going to need some wheels.

p = 0.015 triples per at-bat.

Median, full season: 13.5 triples.

Median, 60 game season: 6.5 triples.

Probability of over 4.5 triples: 94.2%. At -215 that’s a +38% EV.

Now, before you rush to bet these, we should discuss some of the limitations of this model and why it may not be a perfect representation of the real world.

Things not considered in the model, that could HURT the overs:

  • A return to the 2018 “dead” ball, especially for home runs / runs / RBI.
  • The strain of a compressed schedule with very little training camp may lead to more injuries.
  • COVID may lead to more games lost due to illness, whether from COVID itself or cold/flu.
  • If a team is out of playoff contention or has clinched a playoff spot, it may sit the regulars for the last few games. In a 60 game schedule that would be a meaningful percentage of the season that would be lost.

Things not considered in the model, that could HELP the overs:

  • A return to the 2019 “juiced” ball, especially for home runs / runs / RBI.
  • DH in the National League.
  • A compressed schedule means that each game is more meaningful, so players may get fewer rest days.
  • The number of “contenders” was assumed to be fixed, but it may be larger in a shorter season due to random variance. Maybe an average hitter CAN lead the league in hits, etc.
  • Because we’re dealing in extreme values, there is an element of antifragility that you get with betting overs. Meaning, any kind of unexpected situation or event will tend to cause the extreme values to be more extreme, rather than less extreme. You can never tell what these surprises are going to be ahead of time, but it’s a good spot to know that they are more likely to help you than hurt you!

All in all, I think the “things that could help” at least balance, if not outweigh, the “things that could hurt”.

We’ll check back during and after the season to see how we do on these. If you bet them, good luck!

 

Copyright in the contents of this blog are owned by Plus EV Sports Analytics Inc. and all related rights are reserved thereto.
