Horse Racing: Sport of Kings, Sport of Quants
By Matthew Buchalter, PlusEV Analytics

June 8, 2012
Shakopee, Minnesota
It was a regular Friday evening card at Canterbury Park, one of America’s less prestigious racetracks. My syndicate placed a couple hundred dollars in bets. The horse we liked best – ironically named “All Bets Are Off” – came in second, and the horse we liked second best – unironically named “Gracias” – came in first. We had a few bucks on the winning exacta and a few more on the winning trifecta. We netted maybe $500 on the race. An unremarkable result at an unremarkable track on an unremarkable day…except for one thing. In our second year of operation, one of those dollars of profit was our one millionth. This is pocket change compared to some of the most successful horse bettors, and my share was only a small fraction, but nevertheless it remains the proudest accomplishment of my professional life to date. To build something from scratch and have it make a million dollars is an amazing feeling. I’m going to share some of the lessons I learned working as the modeler for a North American horse racing syndicate for the first half of the 2010s.
As with most of my ventures in the gambling space, it started with me sharing some content in public and a heavy hitter seeing it and contacting me to work together. In this case, I became a trusted expert on a now-defunct Canadian AP forum by posting some of my analytical work. One of the members contacted me privately to tell me that he was starting a horse racing syndicate and to ask if I would be interested in doing the modeling. My only experience with horse racing was as a purely recreational bettor and fan, and I was skeptical that anything I built would be able to beat both the high track takeout (vig) and the competition of other more experienced, better-resourced quants already established in the space.
What convinced me to accept the project was when I learned the best-kept (to casual fans) / worst-kept (to insiders) secret in the horse racing industry: the ability of high-volume players to negotiate “rebates” where a portion of the takeout on each bet would be refunded, win or lose. It makes sense from an economic perspective for the same reason that a 24-pack of Coca-Cola costs less than 24× the price of a single can – the costs of operating a horse track should not be shared by the players in 100% proportion to dollars bet. I no longer had to be +EV at track odds, I just had to be small enough -EV that the rebates would put us over the top. Still, my outlook was pessimistic enough that rather than taking a lower up-front fee, I negotiated a higher up-front fee that would be offset against my share of the profits (if any!).
Armed with a huge data set that my contact had procured, it was time to get to work. My tool of choice? SAS, because that’s what I had been using at work at the time to build models…it was before R really became popular. Funny thing about SAS: when you try to buy their software, they get extremely nosy about how you’re going to use it. First I said “sports analytics” but that didn’t fly because they already had contracts with all of the local pro teams with some kind of non-compete in them. So I said “sports betting analytics”, which led to “our risk management team thinks you’re a bookie”. So I had to call them and explain exactly what I was doing and convince them that it wasn’t illegal so that they would grant me the privilege of buying their stupidly expensive software. Good start!
My data set contained basically the same information you could get from the daily racing form, compiled in a database. For each horse in each race, you’d get:
- Race characteristics (track, distance, condition, purse, etc.)
- Horse’s timing splits and finishing position
- Horse’s recent workouts
- Final odds and results
- Past performance stats for the horse, the jockey and the trainer
- Breeding info (how the horse’s parents and siblings performed)
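To make the shape of the data concrete, here is a rough sketch of what one row might look like as a Python record. The field names are my own invention for illustration – they are not the actual schema of the data set I worked with.

```python
from dataclasses import dataclass

# Hypothetical sketch of one row: one horse in one race.
# Field names are illustrative, not the real schema.
@dataclass
class HorseRaceRecord:
    track: str                      # e.g. "CBY" for Canterbury Park
    distance_furlongs: float
    surface: str                    # "dirt" or "turf"
    condition: str                  # "fast", "sloppy", "yielding", ...
    purse: int                      # dollars
    finish_position: int            # 1 = winner
    split_times: list[float]        # fractional times, seconds
    recent_workouts: list[float]    # recent workout times, seconds
    final_odds: float               # decimal odds at post time
    horse_win_pct: float            # career win % for the horse
    jockey_win_pct: float
    trainer_win_pct: float
    sire_progeny_win_pct: float     # breeding info
```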
I tried two model forms: a multinomial logistic regression and a probit model. I had read that some other modelers preferred probit, but I found that the logistic version worked best for me.
Without giving too much away (in case I ever get back into the game), here’s a random list of some of the most important things I had to figure out when building this model.
Offset the Public Odds
When I first built the model, I tried all kinds of variables in all kinds of combinations. I would fit a model, backtest it on historical races, and every time it would lose money (even after rebates). Then I tried something that made all the difference. Instead of fitting a regular logistic model:
ln(win prob / (1-win prob)) = intercept + coefficients * variables
I took the track odds, converted them to implied probabilities and used them as “offsets”:
ln(win prob / (1-win prob)) = ln(odds-implied prob / (1-odds-implied prob)) + intercept + coefficients * variables
What does that do? It changes the entire purpose of the model. Instead of building a probability from scratch, I am starting with the assumption that the market odds are an efficient predictor of win probability, and finding variables that provide residual signal to those odds. It’s a much easier task. Suppose my model is missing an important variable that is uncorrelated with the other variables in my model. In a model from scratch, all my predictions will be wrong. In an odds offset model, as long as that variable is priced into the public odds it will end up in my model implicitly; I won’t find an edge, but I won’t lose one either.
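Here is a minimal sketch of the offset trick in Python with statsmodels (I was using SAS at the time, the column names are hypothetical, and a real version would also handle the within-race renormalization that a multinomial setup gives you):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# df has one row per horse per race, with:
#   'won'          - 1 if the horse won, 0 otherwise
#   'implied_prob' - win probability implied by the track odds,
#                    normalized so each race's probabilities sum to 1
# plus whatever candidate predictor columns you want to test.
def fit_offset_model(df: pd.DataFrame, predictors: list[str]):
    # The offset is the log-odds of the market-implied probability.
    offset = np.log(df["implied_prob"] / (1 - df["implied_prob"]))
    X = sm.add_constant(df[predictors])
    model = sm.GLM(df["won"], X, family=sm.families.Binomial(), offset=offset)
    result = model.fit()
    # Coefficients now measure *residual* signal: anything already
    # priced into the odds is absorbed by the offset term.
    return result
```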
Interpretation of parameters is a little different in an odds offset model:
| Variable | Interpretation (regular model) | Interpretation (odds offset model) |
| --- | --- | --- |
| With positive coefficient | Positive contribution to win probability | Positive contribution to win probability that is underpriced by the market odds, OR negative contribution to win probability that is overpriced by the market odds |
| With negative coefficient | Negative contribution to win probability | Negative contribution to win probability that is underpriced by the market odds, OR positive contribution to win probability that is overpriced by the market odds |
| Rejected from model | Not predictive of win probability | Not predictive of win probability, OR predictive signal that is accurately priced into the market odds |
Take, for example, the variable “horse won its previous race = YES”. This is obviously a positive predictor of its win probability in this race. But, it’s also the most obvious stat to a casual reader of the racing form, and if the betting public tends to overvalue it then it’s possible that it could be significant with a negative coefficient in an odds offset model.
Normalize Everything
Knowing that horse A had his last workout in 59.3 seconds and horse B had his in 57.2 means very little. Knowing that horse A worked in 0.976× the average workout time for that distance, surface, track and condition, while horse B worked in 1.015× that same average, means a lot more.
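In pandas terms the normalization is a one-liner per statistic – a sketch with hypothetical column names (I did the equivalent in SAS):

```python
import pandas as pd

# Convert raw workout times into ratios vs. the average for the same
# distance / surface / track / condition combination.
def normalize_workouts(df: pd.DataFrame) -> pd.DataFrame:
    group_cols = ["track", "surface", "distance_furlongs", "condition"]
    group_mean = df.groupby(group_cols)["workout_time"].transform("mean")
    out = df.copy()
    out["workout_ratio"] = out["workout_time"] / group_mean  # 0.976, 1.015, ...
    return out
```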
Timing is Everything
As with any other gambling proposition, you calculate your EV using your estimated win probability along with the payout odds that the bet would receive. The challenge is that North American horse racing operates under a “parimutuel” system, where your payout odds are not locked in when you place your bet and can change substantially as other people bet into the pool. You can see what the odds are at any given point in time, but many large players don’t place their bets until the last possible second. Also, the races never start on time – a race with a post time of 8:30 would typically have the process of loading the horses into the starting gate begin around 8:32 and end whenever it ends, depending on how well behaved the horses happen to be. While this is happening, bets are being added to the pool every second because the betting remains open until the instant the race actually begins.
Because the information that impacts your bets changes every second, you generally want to bet as late as possible. Even then, there’s some movement in the odds because bets that were submitted before the start of the race continue to get processed by the system for around 30 additional seconds after the race begins. So now you need a second model – one that predicts how the odds will move in between the time your bet is made and the time the final odds are tallied, which requires you to basically reverse engineer these last second bettors’ strategies. Not easy!
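If I were sketching the skeleton of such a model today, it might start as simply as this – purely illustrative, since the real thing has to reverse engineer much more structure than one linear drift term:

```python
import numpy as np
import statsmodels.api as sm

# Predict each horse's *final* implied probability from the probability
# visible at bet time. In practice the drift also depends on pool size,
# time to post, track, etc.; this is only the skeleton.
def fit_drift_model(snapshot_prob: np.ndarray, final_prob: np.ndarray):
    def logit(p):
        return np.log(p / (1 - p))
    X = sm.add_constant(logit(snapshot_prob))
    # Linear model on the logit scale: final log-odds as a function of
    # the log-odds observed when the bet is placed.
    return sm.OLS(logit(final_prob), X).fit()
```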
Exotics
Being able to predict the winner of the race only opens up a small portion of the betting opportunities. You also have place bets (on a horse to finish in the top 2), show bets (on a horse to finish in the top 3) and the so-called “exotics”: exacta (on a combination of horses to finish 1-2 in order), trifecta (on a combination of horses to finish 1-2-3 in order) and superfecta (on a combination of horses to finish 1-2-3-4 in order). To evaluate these, you need to generate probabilities not only of winning, but also of finishing 2nd/3rd/4th.
For this, I was pointed to a series of academic papers written by, among others, the legendary Dr. William Ziemba. Their approach, a “discounted” version of the Harville formula, begins with Harville’s assumption that the probability of a horse finishing 2nd is proportional to that horse’s win probability. So if the #5 horse has a 60% win probability and the #6 horse has a 20% win probability, the conditional probability of #6 finishing 2nd given that #5 wins is 20% / (1-60%) = 50%, and the probability of a 5-6 exacta is 60% x 50% = 30%.
Then, they make the following adjustment…this is what makes it “discounted”. If a heavy favourite doesn’t win, it’s usually because something unexpected happens – getting bumped, blocked, the jockey falling off, etc. For this reason, heavy favourites tend to either win or finish at the back…they don’t finish 2nd as often as the formula above would predict. So instead of making the 2nd place probability proportional to the win probability, it becomes proportional to (win probability)^k for some exponent k between 0 and 1. The lower the k, the more “randomness” is injected and the further the 2nd place probabilities are divorced from the original win probabilities.
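A sketch of the calculation for an exacta (the value of k here is an arbitrary illustration, not the one from the papers or the one I used):

```python
# Discounted Harville: the fight for 2nd place is decided in proportion
# to (win probability)^k among the horses that didn't win.
def exacta_prob(win_probs: dict[int, float], first: int, second: int,
                k: float = 0.8) -> float:
    p_first = win_probs[first]
    strengths = {h: p ** k for h, p in win_probs.items() if h != first}
    p_second_given_first = strengths[second] / sum(strengths.values())
    return p_first * p_second_given_first

# The example from the text: k = 1 recovers plain (undiscounted) Harville.
probs = {5: 0.60, 6: 0.20, 7: 0.20}
print(exacta_prob(probs, 5, 6, k=1.0))  # 0.60 * (0.20 / 0.40) = 0.30
```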
Multi-Race Bets
In addition to win/place/show and exotics, there are “multi-race bets” such as daily double, pick-3, pick-4, pick-5 and pick-6 where you have to pick the winners of multiple consecutive races. Because the races are independent events, these behave much more like a regular parlay – simply multiply the win probabilities together. However, despite multiple attempts, I was never able to come up with a model that made money at multi-race bets. The strengths of the “odds offset” approach become weaknesses here: the offset leans on the market’s final win odds, and when a multi-race pool closes, the betting on the later legs hasn’t happened yet – for those races you’re back to building probabilities from scratch.
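The combinatorics themselves are trivial – the hard part, as noted, is getting trustworthy probabilities for the later legs:

```python
import math

# A pick-3 ticket is a parlay: multiply the win probabilities of the legs.
def pick_n_prob(leg_probs: list[float]) -> float:
    return math.prod(leg_probs)

print(pick_n_prob([0.40, 0.30, 0.25]))  # 0.03
```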
More recently, tracks have invented the “rainbow pick-6” where a progressive jackpot is paid if there is exactly one winning ticket in the pool. This would have made for an even more difficult modeling challenge.
Optimal Bet Sizing
In a parimutuel system, your bets have a direct influence on the payout odds. For small bets and/or large pools this influence is negligible; however, professional syndicates generally bet large enough that this makes a big difference. Every dollar you add to your bet size diminishes the EV of all of the previous dollars in your bet. This is why “staking methods change your variance but not your EV” is correct everywhere BUT here, and why using the Kelly Criterion alone can turn a winning model into a losing result.
When you factor in the impact of your bet on the final odds, the EV is no longer just win probability × odds − 1, it becomes a fairly complex function of the win probability, the odds-implied probability, the pool size, the takeout and the rebate. Hey, remember when you were in high school thinking “calculus is stupid, I’m never going to use this”?
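To make that concrete, here is a rough sketch of the win-bet EV under standard parimutuel mechanics (my real formula had more moving parts). If p is your win probability, P the size of the win pool before your bet, B the amount already bet on your horse, t the takeout and r your rebate rate, then betting b dollars gives:

EV(b) = p × b × (1 − t) × (P + b) / (B + b) − b + r × b

The (P + b) / (B + b) piece is exactly where your own money dilutes your payout odds.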

The optimal bet size is the point where each successive dollar added to your bet stops adding to your EV and before it starts reducing your EV. That is, the solution to
d(EV) / d(bet size) = 0
Yay calculus!
This works well for everything except huge pools like the Triple Crown races, where the optimal bet size could be millions of dollars. So we took the lesser of the optimal bet size according to this formula and the optimal bet size according to the Kelly Criterion.
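As a sketch, the whole procedure fits in a few lines of Python – the parameters, and the fixed dollar figure standing in for a separately computed Kelly stake, are all illustrative:

```python
from scipy.optimize import minimize_scalar

# EV of betting b dollars on a horse, from the formula sketched above.
def parimutuel_ev(b, p, pool, on_horse, takeout, rebate):
    payout = (1 - takeout) * (pool + b) / (on_horse + b)  # odds after our bet
    return p * b * payout - b + rebate * b

def optimal_bet(p, pool, on_horse, takeout=0.17, rebate=0.08, kelly_cap=5_000):
    # EV(b) is concave here, so a bounded 1-D optimizer finds d(EV)/db = 0.
    res = minimize_scalar(
        lambda b: -parimutuel_ev(b, p, pool, on_horse, takeout, rebate),
        bounds=(0, pool), method="bounded",
    )
    return min(res.x, kelly_cap)  # lesser of EV-optimal size and Kelly size

# A 25% horse in a $200k pool with $30k already on it: the EV-optimal
# size is about $8,500, so the illustrative Kelly cap binds.
print(round(optimal_bet(0.25, 200_000, 30_000)))  # 5000
```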
The end…or is it?
Around 4 years into our journey, we stopped winning. We broke even for a while, then slowly started losing. Just like any other prediction market, horse racing is a data science arms race. The competition is improving their models and getting sharper every day. They have armies of quants, and I even heard rumours that some of them have spotters at the tracks to look for things that are not captured in the data. I was one guy with a full-time job doing this on evenings and weekends, plus I had married my wife and had our first child during this period. I just couldn’t keep up…so we called it quits. During our run we bet a total of approximately $100 million, with a return after rebates of around 3%. Being a part of it is one of the coolest things I’ve ever done.
One interesting tidbit…the program that took in the data and calculated the bets was set up to hold up to 16 horses for each race, which is sufficient for every race at every track in North America…except one. We never bet the Kentucky Derby, because its 20-horse field didn’t fit into our program. Derby day was still our busiest day of the year, because of the huge pools for the undercard races at Churchill and at other tracks around the country. One of our biggest ever wins was the 6th race at Churchill on Derby Day, May 5, 2012. We placed a total of $39,606 on 866 different bets, including:
- $8,678 on #2 to win at 5.40 odds
- $3,630 on #2 to place at 2.30
- $2,868 on #2 to show at 1.80
- $1,038 on 2-4 exacta at 16.00
- $375 on 2-4-3 trifecta at 208.40
- $33 on 2-4-3-10 superfecta at 670.70
For a total return of $177,264. My dad was pretty impressed when I told him about it.
Having sharpened my Bayesian analysis skills between my horse racing days and now, I think I would approach this problem differently if I had to start again. Instead of a logistic regression on win probability, I would base my model on either normalized finish time or one of the calculated speed scores that already exists. I would think of each horse’s “ability” as their theoretical best race they could ever possibly run. The universe of all horses would probably look something like a bell curve from Secretariat to Zippy Chippy. Then each horse’s result in any particular race would follow some kind of probability distribution (their “trip”) bounded by their ability and influenced by the jockey and other factors as well as plain randomness. Each race’s result could be used to update a posterior distribution of each horse’s ability – if a horse with estimated ability of 1:10 on a time scale puts up a 1:08, we now know that the horse’s ability must be 1:08 or better. If a horse with an estimated ability of 1:10 runs a 1:16, it gets chalked up to a “bad trip” and it probably doesn’t change our assessment of the horse’s ability too much (but may change our assessment of the jockey). We then combine the distribution of each horse’s ability with the distribution of each horse’s trip to simulate the probability distribution of each race’s outcome. Could be a fun project some day, when I have more free time. There’s nothing more exciting than watching a race with money on the line, and it would be cool to get that feeling back.
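As a very rough sketch of what that updating step might look like (the distributions and numbers are illustrative guesses, not a worked-out model):

```python
import numpy as np

# Each horse has a latent "ability" (its theoretical best time); each
# result is ability plus a non-negative "trip" penalty. Keep a grid
# posterior over ability and update it after every race.
ability_grid = np.linspace(66.0, 76.0, 201)  # candidate best times, seconds

def update_posterior(prior: np.ndarray, observed_time: float,
                     trip_scale: float = 1.5) -> np.ndarray:
    penalty = observed_time - ability_grid
    # A horse can run slower than its ability, never faster: exponential
    # penalty with mean trip_scale seconds, zero likelihood otherwise.
    likelihood = np.where(penalty >= 0,
                          np.exp(-penalty / trip_scale) / trip_scale, 0.0)
    post = prior * likelihood
    return post / post.sum()

# A 68.0s result rules out every ability slower than 68.0 while only
# smoothly down-weighting faster abilities - matching the intuition above.
prior = np.ones_like(ability_grid) / ability_grid.size
posterior = update_posterior(prior, observed_time=68.0)
```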
Gracias for reading!
Copyright in the contents of this blog is owned by Plus EV Sports Analytics Inc. and all related rights are reserved thereto.