Producing Win Probabilities for Professional Tennis Matches from any Score

Jacob Gollub
Harvard College, Cambridge, Massachusetts

Thesis submitted in partial fulfillment of the requirements for the degree of Bachelor of Arts in Statistics and Computer Science

Supervised by Kevin Rader

November 6, 2017

Acknowledgements

After coming across Michael Sipko's "Machine Learning for the Prediction of Professional Tennis Matches," I realized that I could write my senior thesis on the sport I am most passionate about. While I was abroad this past summer, much of this project's implementation took place in cafés around Tokyo, Japan. Thank you to my advisor, Kevin Rader, for his enthusiasm and support. Thank you to my parents, for pushing me to make the most of my academic opportunities at college. Finally, thank you to my former coaches Joaquín, Stewart, and Paul for their lessons and wisdom over the years. My lifelong obsession with tennis persists through this project.

Contents

1 Introduction
  1.1 In-game Sports Analytics
  1.2 History of Tennis Forecasting
  1.3 Match/Point-by-Point Datasets
  1.4 Implementation
2 Scoring
  2.1 Modeling games
  2.2 Modeling sets
  2.3 Modeling a tiebreak game
  2.4 Modeling a best-of-three match
  2.5 Win Probability Equation
3 Pre-game Match Prediction
  3.1 Overview
  3.2 Elo Ratings
  3.3 ATP Rank
  3.4 Point-based Model
  3.5 James-Stein Estimator
  3.6 Opponent-Adjusted Serve/Return Statistics
  3.7 Results
    3.7.1 Discussion
4 In-game Match Prediction
  4.1 Logistic Regression
    4.1.1 Cross Validation
  4.2 Random Forests
  4.3 Hierarchical Markov Model
    4.3.1 Beta Experiments
    4.3.2 Elo-induced Serve Probabilities
    4.3.3 Importance of fij + fji
  4.4 Results
    4.4.1 Model Calibration
    4.4.2 By Set
    4.4.3 By Surface
  4.5 Visualizing Win Probability
5 Conclusion
  5.1 Applications to Betting Markets
  5.2 Future Steps
  5.3 Visualizing Grand Slam Matches
    5.3.1 Roger Federer vs. Rafael Nadal, Australian Open 2017 Final
    5.3.2 Stanislas Wawrinka vs. Novak Djokovic, French Open 2015 Final
    5.3.3 Novak Djokovic vs. Roger Federer, US Open 2011 Semi-Final

1. Introduction

1.1 In-game Sports Analytics

"The win probability graphic/discussion on ESPN is literally taking a sword and sticking it through the chest of any fun left in baseball" — Kenny Ducey (@KennyDucey) April 2, 2017

"ESPN showing win probability is extremely good. Next up, run expectancy!" — Neil Weinberg (@NeilWeinberg44) April 4, 2017

"Thank the Lord we have the ESPN Win Probability stat to tell us the team that's ahead has a good chance to win." — Christian Schneider (@Schneider_CM) April 17, 2017

In recent years, win probability has become increasingly prevalent in sports broadcasting. Brian Burke, an NFL commentator, has posted live win-probability graphs of NFL playoff games on his website for the past few years (Burke, 2014). Earlier this year, ESPN began to post win probabilities atop the score box in televised Major League Baseball games (Steinberg, 2017). Despite mixed reactions from fans, as shown above, these developments represent a transition in sports broadcasting's modern narrative.

A win probability from any scoreline communicates how much a team or player is favored to win. While one can produce this from any model of choice, those in sports analytics strive to produce the most well-informed estimate available. With the recent proliferation of online betting, in-match win probability dictates an entire market of its own. Betfair's platform matched over 40 million euros during the 2011 French Open Final (Huang et al., 2011); other high-profile matches often draw comparable volume over betting exchanges.
While the majority of tennis prediction papers concern pre-match prediction, around 80% of online betting occurs while matches are in progress (Sipko and Knottenbelt, 2015). At the very least, in-match win probability concerns all participants in this betting market.

1.2 History of Tennis Forecasting

Statistical prediction models have been applied to tennis matches for over twenty years. Klaassen and Magnus tested the assumption that points in a tennis match are independent and identically distributed, or i.i.d. (Klaassen and Magnus, 2001). While the assumption is false, they concluded that deviations are small enough that constant probabilities provide a reasonable approximation. Under this assumption, they constructed a hierarchical Markov model in conjunction with tennis' scoring system. From this model, they offer an analytical equation for match-win probability from any score, given each player's probability of winning a point on serve (Klaassen and Magnus, 2003). Barnett and Clarke then offered a method of estimating each player's serve probability from historical data (Barnett and Clarke, 2005). Years later, Bevc proposed updating each of these serve probabilities with a beta distribution between points and computing the corresponding win probability with the above model (Bevc, 2015).

Recently, Kovalchik assessed the performance of 11 different pre-match prediction models on 2014 ATP tour-level matches. While the market-based Bookmaker Consensus Model performed best overall, Elo ratings proved most effective among methods that rely on historical data (Kovalchik, 2016). Over the past several years, Jeff Sackmann has released the largest publicly available library of tennis datasets via GitHub.
This collection contains match summaries of every ATP and WTA match in the Open Era, point-by-point summaries of nearly 100,000 tour-level and satellite matches, and a crowd-sourced match-charting project spanning over 3,000 matches, in which volunteers record each shot's type and direction over an entire match. While FiveThirtyEight and Kovalchik have used Jeff Sackmann's match data to generate Elo ratings, none of the aforementioned papers have tested models with his point-by-point dataset.

As past papers span several decades, datasets and model evaluation are not consistent. Klaassen and Magnus' original point-based model and Bevc's beta experiments both apply methods to around 500 Wimbledon matches from 1991-94. Barnett explores point-based models in great depth, yet primarily applies them to a single marathon match between Andy Roddick and Younes El Aynaoui from the 2003 Australian Open (Barnett et al., 2006). Except for Bevc, none of these papers assess in-game match prediction with metrics such as accuracy or cross-entropy. While Bevc does record the accuracy of results over 500 Wimbledon matches, he only reports accuracy of predictions from the third set onwards, a subset of the entire dataset. In other words, no one has tested all in-game prediction models at a large scale. For this reason, I test all relevant in-match prediction methods on thousands of matches from the past seven years. This paper combines Elo ratings, a wealth of data, and current technology to survey which in-match prediction methods perform best. We build upon past research by testing variations of previous state-of-the-art methods and applying new concepts to these datasets, from Random Forest models used in football to Elo-induced serving percentages.

1.3 Match/Point-by-Point Datasets

This project uses two different types of datasets: one with match summary statistics and one with point-by-point information.
Both are publicly available on GitHub, courtesy of Jeff Sackmann (https://github.com/JeffSackmann). We use the matches dataset under "tennis_atp" to test pre-match prediction methods. This dataset covers over 150,000 tour-level matches, dating back to 1968. Relevant features include:

• match info (date, tournament, player names)
• match statistics (serve/return match statistics for each player)

The data in "tennis_pointbypoint" offers more granular detail about a single match's point progression. Each match contains a string listing the order in which players won points and switched serve. While atpworldtour.com does not publicly list point-by-point summaries, Jeff Sackmann has made this information available from a betting website. Unlike the match dataset, this dataset covers around 12,000 tour-level matches dating back to 2010. Relevant features include:

• match info (date, tournament, player names)
• point-by-point (a string detailing the sequence in which points were won)

1.4 Implementation

With the above datasets serving as a basis for this project, thorough re-formatting of the data was required in order to connect point-by-point strings to their corresponding observations in the match dataset. As both datasets include player names, year, and score, we connected matches across datasets with a hashing scheme.¹ Due to observed inconsistencies, we standardized player names between the match and point-by-point datasets in order to maximize the number of available matches with point-by-point summaries.² While a portion of the point-by-point summaries are from satellite events, akin to the minor leagues of tennis, we were able to match around 10,600 of 12,000 point-by-point strings to their respective tour-level matches. Then, we could associate information generated from the match dataset's entirety (Elo ratings, year-long adjusted serve stats, etc.) with corresponding point-by-point strings.
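As a rough illustration, such a join key can be built from fields present in both datasets. This is a sketch only; the key format and helper names here are hypothetical, and the project's actual hashing scheme may differ:

```python
# Standardize known name inconsistencies before building keys,
# e.g. "Stan Wawrinka" -> "Stanislas Wawrinka" (examples from the text;
# the project's full alias table is larger).
ALIASES = {
    "Stan Wawrinka": "Stanislas Wawrinka",
    "Federico Del Bonis": "Federico Delbonis",
}

def standardize(name):
    """Map a player name to its canonical spelling."""
    return ALIASES.get(name, name)

def match_key(winner, loser, year, score):
    """Hypothetical join key combining names, year, and score."""
    return f"{standardize(winner)} {standardize(loser)} {year} {score}"

key = match_key("Roger Federer", "Tomas Berdych", 2012, "3-6 7-5 7-5")
# key == "Roger Federer Tomas Berdych 2012 3-6 7-5 7-5"
```

Because both datasets record the same names, year, and score, the same key is produced on each side, allowing point-by-point strings to be joined to match-summary rows.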
To view this process in more depth or access any of the resulting datasets, visit https://github.com/jgollub1/tennis_match_prediction. Implementations of each method in this project³, and instructions on how to generate all relevant statistics, are also provided.

¹ e.g. hash(matchX) = "Roger Federer Tomas Berdych 2012 3-6 7-5 7-5"
² e.g. "Stan Wawrinka" → "Stanislas Wawrinka", "Federico Del Bonis" → "Federico Delbonis"
³ All methods covered in this project required feature construction across the two datasets.

2. Scoring

Tennis' scoring system consists of three levels: sets, games, and points. Consider a tennis match between two players, pi and pj. We can represent any score as (si, sj, gi, gj, xi, xj), where pi is serving and sk, gk, xk represent pk's score in sets, games, and points, respectively.¹ The players alternate serve each game and continue until someone clinches the match by winning two sets (best-of-three) or three sets (best-of-five).²

The majority of in-play tennis models utilize a graphical structure that embodies the levels within tennis' scoring system. Madurska referred to this as a hierarchical Markov model (Madurska, 2012). Barnett formally defines this representation for tennis scores (Barnett et al., 2002). With pi and pj winning points on serve with probabilities fij and fji, each in-match scoreline (si, sj, gi, gj, xi, xj) progresses to one of its two neighbors, (si, sj, gi, gj, xi + 1, xj) or (si, sj, gi, gj, xi, xj + 1), with transition probabilities dependent on the current server. Assuming all points in a match are i.i.d. for each server, we can then use the below model to recursively determine win probability.

Consider the scoreboard below, with Rafael Nadal serving to Juan Martin Del Potro at 7-6 3-6 6-6 2-1 in the 2011 Wimbledon 4th round. Representing the score from the server's point of view, we find:

¹ While tennis officially refers to a game's first three points as 15, 30, 40, we will call them 1, 2, 3 for simplicity's sake.
² The best-of-five format is typically reserved for men's grand slam and Davis Cup events.

Figure 2.1: Nadal serving to Del Potro in a third-set tiebreaker

Pm(1, 1, 6, 6, 2, 1) = P(pi wins the match, serving from this scoreline)
Pm(1, 1, 6, 6, 2, 1) = fij · Pm(1, 1, 6, 6, 3, 1) + (1 − fij) · Pm(1, 1, 6, 6, 2, 2)

In the following sections, we specify boundary values for each level of our hierarchical model.

2.1 Modeling games

Within a game, either pi or pj serves every point. Every game starts at (0, 0), and to win a game, a player must win four or more points by a margin of at least two. Consequently, all games with valid scores (xi, xj) where xi + xj > 6 and |xi − xj| ≤ 1 are reduced to (3, 3), (3, 2), or (2, 3). Furthermore, the win probability at (3, 3) can be calculated directly. From (3, 3), the server wins the next two points with probability fij², the returner wins the next two points with probability (1 − fij)², or the players split the two points and return to (3, 3) with probability 2fij(1 − fij). Relating the game's remainder to a geometric sequence, we find

Pg(3, 3) = fij² / (fij² + (1 − fij)²)

As the probability of winning each point remains constant for both players, this graph assumes that points within a game are independent. In the following levels, we likewise assume that games within a set and sets within a match are independent.

[Diagram: directed graph of possible point-score sequences in a game, from (0, 0) to the boundary states, where "a" denotes pi winning the following point and "b" denotes pj winning the following point.]

Boundary values:

Pg(xi, xj) =
  1, if xi = 4 and xj ≤ 2
  0, if xj = 4 and xi ≤ 2
  fij² / (fij² + (1 − fij)²), if xi = xj = 3
  fij · Pg(xi + 1, xj) + (1 − fij) · Pg(xi, xj + 1), otherwise   (2.1)

With the above specifications, we can efficiently compute pi's win probability from any score Pg(xi, xj).
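As an illustration, the recursion in equation (2.1) can be memoized and evaluated from any score. This is a minimal sketch (function and variable names are mine, not the project's):

```python
from functools import lru_cache

def p_game(p_serve):
    """Server's probability of holding a game from 0-0, given a constant
    probability p_serve of winning each point on serve (points i.i.d.)."""

    @lru_cache(maxsize=None)
    def pg(xi, xj):
        # Boundary values of equation (2.1)
        if xi == 4 and xj <= 2:
            return 1.0
        if xj == 4 and xi <= 2:
            return 0.0
        if xi == 3 and xj == 3:
            # Deuce: closed form from the geometric-series argument
            return p_serve ** 2 / (p_serve ** 2 + (1 - p_serve) ** 2)
        # Otherwise, recurse on the two possible following scores
        return p_serve * pg(xi + 1, xj) + (1 - p_serve) * pg(xi, xj + 1)

    return pg(0, 0)

hold = p_game(0.64)  # hold probability for a server winning 64% of points
```

Note that the hold probability amplifies the point probability: a server who wins well over half of service points holds the game far more often than the raw point percentage suggests.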
2.2 Modeling sets

Within a set, pi and pj alternate serve every game. Every set starts at (0, 0). To win a set, a player must win six or more games by a margin of at least two. If the set score reaches (6, 6), a special tiebreak game is played to determine the outcome of the set.

[Diagram: directed graph of possible game-score sequences in a set, from (0, 0) to the boundary states, where "a" denotes pi winning the following game, "b" denotes pj winning the following game, and a′/b′ denote winning the tiebreak game.]

Boundary values:

Ps(gi, gj) =
  1, if gi ≥ 6 and gi − gj ≥ 2
  0, if gj ≥ 6 and gj − gi ≥ 2
  Ptb(gi, gj), if gi = gj = 6
  Pg(0, 0)(1 − Ps(gj, gi + 1)) + (1 − Pg(0, 0))(1 − Ps(gj + 1, gi)), otherwise   (2.2)

Since serve alternates each game, the recursive branch evaluates the set's remainder from the next server's point of view; this is why each recursive call swaps the game scores and is subtracted from one.

2.3 Modeling a tiebreak game

Within a tiebreak game, one player serves the first point and players alternate serve every two points from then on. The game starts at (0, 0), and a player must win seven or more points by a margin of at least two to win.

Boundary values:

Pt(xi, xj) =
  1, if xi ≥ 7 and xi − xj ≥ 2
  0, if xj ≥ 7 and xj − xi ≥ 2
  fij · Pt(xi + 1, xj) + (1 − fij) · Pt(xi, xj + 1), if (xi + xj) mod 2 = 1
  fij(1 − Pt(xj, xi + 1)) + (1 − fij)(1 − Pt(xj + 1, xi)), if (xi + xj) mod 2 = 0   (2.3)

2.4 Modeling a best-of-three match

In a best-of-three match, the first player to win two sets claims the match. This project primarily applies models to best-of-three matches. In a best-of-five match, the graph would instead have boundary values at si ≥ 3 and sj ≥ 3, since a player must win three sets to finish the match.
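The tiebreak recursion can be sketched directly, tracking the server explicitly rather than through the perspective swap of equation (2.3). This is an illustrative sketch (names are mine); the closed form at 6-6 is my own analogue of the Pg(3, 3) argument, using the fact that from 6-6 onward each consecutive pair of points contains one serve by each player:

```python
from functools import lru_cache

def p_tiebreak(f_ij, f_ji):
    """pi's probability of winning a tiebreak from 0-0, with pi serving
    the first point; f_ij / f_ji are pi's and pj's probabilities of
    winning a point on their own serve (points i.i.d.)."""
    # From 6-6 on, each pair of points has one serve per player, so the
    # score only advances when someone wins both points of a pair
    # (a deuce-style geometric argument, analogous to Pg(3, 3)).
    a = f_ij * (1 - f_ji)   # pi wins both points of a pair
    b = (1 - f_ij) * f_ji   # pj wins both points of a pair
    p_66 = a / (a + b)

    @lru_cache(maxsize=None)
    def pt(xi, xj):
        if xi >= 7 and xi - xj >= 2:
            return 1.0
        if xj >= 7 and xj - xi >= 2:
            return 0.0
        if xi == 6 and xj == 6:
            return p_66
        # pi serves point 1, then serve alternates every two points:
        # pi serves point n+1 exactly when n mod 4 is 0 or 3
        n = xi + xj
        p_point = f_ij if n % 4 in (0, 3) else 1 - f_ji
        return p_point * pt(xi + 1, xj) + (1 - p_point) * pt(xi, xj + 1)

    return pt(0, 0)
```

Under the i.i.d. assumption, equally matched players should split tiebreaks evenly, regardless of who serves first.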
[Diagram: directed graph of possible set-score sequences in a best-of-three match, from (0, 0) to the boundary states, where "a" denotes pi winning the following set and "b" denotes pj winning the following set.]

Boundary values:

Pm(si, sj) =
  1, if si ≥ 2
  0, if sj ≥ 2
  Ps(0, 0) · Pm(si + 1, sj) + (1 − Ps(0, 0)) · Pm(si, sj + 1), otherwise   (2.4)

2.5 Win Probability Equation

Combining the above equations, we can recursively calculate pi's win probability when serving from (si, sj, gi, gj, xi, xj) as:

Pm(si, sj, gi, gj, xi, xj) = fij · Pm(si, sj, gi, gj, xi + 1, xj) + (1 − fij) · Pm(si, sj, gi, gj, xi, xj + 1)

Unless (xi, xj) satisfies a boundary condition, Pm(si, sj, gi, gj, xi, xj) will recursively call itself on the two possible following scores. Since we can compute Pg(3, 3) directly, we can eventually represent each recursive probability in terms of boundary values, which allows us to compute the probability of winning the current game. By traversing the set and match models in similar recursive fashion, we can eventually determine pi's probability of winning the match.

3. Pre-game Match Prediction

3.1 Overview

Before play has started, an in-match prediction model cannot draw on information from the match itself. Thus, before a match between pi and pj commences, the most well-informed pre-match forecast π̂ij(t) should serve as a basis for in-match prediction. Therefore, we first explore pre-match models.

Earlier this year, Kovalchik released a survey of eleven different pre-match prediction models, assessing them side-by-side in accuracy, log-loss, calibration, and discrimination. FiveThirtyEight's Elo-based model and the Bookmaker Consensus Model (BCM) performed the best. Elo-based prediction incorporates pi and pj's entire match histories, while the BCM incorporates information encoded in various betting markets.
In this section, we explore the following methods and variations of the most successful models in her survey:

• Elo Ratings
• ATP Rank
• Point-based Models
• James-Stein Estimator
• Opponent-Adjusted Serve/Return Statistics

3.2 Elo Ratings

Elo was originally developed as a head-to-head rating system for chess players (Elo, 1978). Recently, FiveThirtyEight's Elo variant has gained prominence in the media (Bialik et al., 2016). For a match at time t between pi and pj with Elo ratings Ei(t) and Ej(t), pi is forecasted to win with probability:

π̂ij(t) = (1 + 10^((Ej(t) − Ei(t))/400))^(−1)

pi's rating for the following match t + 1 is then updated accordingly:

Ei(t + 1) = Ei(t) + Kit · (Wi(t) − π̂ij(t))

Wi(t) is an indicator for whether pi won the given match, while Kit is the learning rate for pi at time t. According to FiveThirtyEight's analysts, Elo ratings perform optimally when allowing Kit to decay slowly over time (Bialik et al., 2016). With mi(t) representing pi's career matches played at time t, we update our learning rate:

Kit = 250 / (5 + mi(t))^0.4

This variant updates a player's Elo most quickly when we have no information about a player and makes smaller changes as mi(t) accumulates. To apply this Elo rating method to our dataset, we initialize each player's Elo rating at Ei(0) = 1500 and match history mi(0) = 0. Then, we iterate through all tour-level matches from 1968-2017¹ in chronological order, storing Ei(t) and Ej(t) for each match and updating each player's Elo accordingly.

3.3 ATP Rank

While Klaassen incorporated ATP rank into his prediction model (Klaassen and Magnus, 2003), Kovalchik and FiveThirtyEight concur that Elo outperforms ranking-based methods (Kovalchik, 2016). On ATP match data from 2010-present, we found:

¹ Tennis' Open Era began in 1968, when professionals were allowed to enter grand slam tournaments.
Before then, only amateurs played these events.

method         accuracy
ATP Rank       66.5%
Standard Elo   69.0%

With each rating system, we forecast the higher-rated player as the winner. Accuracy represents the proportion of correctly forecasted matches. As the ATP does not provide a win-probability equation to go along with its ranking system, we cannot compare the approaches in terms of log loss or calibration. Still, considering Elo's superior accuracy in recent years, models in this paper use Elo ratings to represent a player's ability in place of official tour rank.

3.4 Point-based Model

The hierarchical Markov model offers an analytical solution to win probability π̂ij(t) between players pi and pj, given serving probabilities fij and fji. This hinges on the Markov assumption that transition probabilities from any state depend only on the current state. In tennis, this means all points are independent and the probability of winning a given point depends only on the current server. While this is counter-intuitive, Klaassen and Magnus demonstrated that deviations from the i.i.d. assumption within matches are small enough to reasonably justify point-based models (Klaassen and Magnus, 2001). Later on, Barnett and Clarke demonstrated a method to estimate fij and fji from players' historical serve/return averages (Barnett and Clarke, 2005):

fij = ft + (fi − fav) − (gj − gav)
fji = ft + (fj − fav) − (gi − gav)

Each player's serve percentage is a function of their own serving ability and their opponent's returning ability. ft denotes the average serve percentage for the match's given tournament, while fi, fj and gi, gj represent pi and pj's percentages of points won on serve and return, respectively. fav and gav are tour-level averages for serve and return percentage. Since every point is won by either the server or the returner, fav = 1 − gav.
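As a short sketch (function and variable names are mine), the combining formula can be written directly, here evaluated with the Anderson-Verdasco figures worked through later in this section:

```python
def combined_serve_probs(f_t, f_i, g_i, f_j, g_j, f_av, g_av):
    """Barnett-Clarke combined serve-win probabilities for pi and pj.

    f_t        : average serve percentage at the match's tournament
    f_i, g_i   : pi's serve/return win percentages (f_j, g_j for pj)
    f_av, g_av : tour-level serve/return averages (f_av = 1 - g_av)
    """
    f_ij = f_t + (f_i - f_av) - (g_j - g_av)
    f_ji = f_t + (f_j - f_av) - (g_i - g_av)
    return f_ij, f_ji

# Anderson vs. Verdasco, 2013 Australian Open (rounded inputs from the
# example in Section 3.4); results land close to .6221 and .6199
f_ij, f_ji = combined_serve_probs(
    f_t=.6153, f_i=.6799, g_i=.3478, f_j=.6461, g_j=.3795,
    f_av=.6468, g_av=.3532)
```

Because the corrections are additive, a strong returner (large gj) lowers the opposing server's fij by exactly the amount gj exceeds the tour average.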
By combining differences in ability relative to the average, this formula assumes that their effects are additive. As per Barnett and Clarke's formula, we use the previous year's tournament serving statistics to calculate ft for a given tournament and year, where (w, y) represents the set of all matches played at tournament w in year y:

ft(w, y) = (Σk∈(w,y−1) # of points won on serve in match k) / (Σk∈(w,y−1) # of points played in match k)

In their paper, Barnett and Clarke only apply this method to a single match: Roddick vs. El Aynaoui, Australian Open 2003. Furthermore, their ability to calculate serve and return percentages is limited by the aggregate statistics supplied by atpworldtour.com. That is, they can only use year-to-date serve and return statistics to calculate fi, gi, fj, gj. Since these statistics do not list corresponding sample sizes, they must assume that each best-of-three match lasts 165 points, which adds another layer of uncertainty in estimating players' abilities.

Implementing this method with year-to-date statistics proves troublesome because fi and gi decrease in uncertainty as pi accumulates matches throughout the year. Due to this imbalance in available data, match forecasts in September will be far more reliable than ones made in January. However, with our tour-level match dataset, we can keep a year-long tally of serve/return statistics for each player at any point in time.
Where (pi, y, m) represents the set of pi's matches in year y, month m, we obtain the following statistics:²

fi(y, m) = (Σt=1..12 Σk∈(i,y−1,m+t) # of points won on serve by pi in match k) / (Σt=1..12 Σk∈(i,y−1,m+t) # of points played on serve by pi in match k)

gi(y, m) = (Σt=1..12 Σk∈(i,y−1,m+t) # of points won on return by pi in match k) / (Σt=1..12 Σk∈(i,y−1,m+t) # of points played on return by pi in match k)

Keeping consistent with this format, we also calculate fav and gav, where (y, m) represents the set of tour-level matches played in year y, month m:

fav(y, m) = (Σt=1..12 Σk∈(y−1,m+t) # of points won on serve in match k) / (Σt=1..12 Σk∈(y−1,m+t) # of points played in match k) = 1 − gav(y, m)

² For the current month m, we only collect month-to-date matches.

Now, the variance of fi and gi no longer depends on the time of year. Since the number of points won on serve is recorded in each match, we also know each player's number of serve/return points played. Below, we combine player statistics over the past 12 months to produce fij, fji for the 3rd-round match between Kevin Anderson (RSA) and Fernando Verdasco (ESP) at the 2013 Australian Open.

player name         s points won   s points   fi      r points won   r points   gi
Kevin Anderson      3292           4842       .6799   1726           4962       .3478
Fernando Verdasco   2572           3981       .6461   1560           4111       .3795

From 2012 Australian Open statistics, ft = .6153. From tour-level data spanning 2010-2017, fav = .6468 and gav = 1 − fav = .3532. Using the above serve/return statistics from 02/12-01/13, we can calculate:

fij = ft + (fi − fav) − (gj − gav) = .6153 + (.6799 − .6468) − (.3795 − .3532) = .6221
fji = ft + (fj − fav) − (gi − gav) = .6153 + (.6461 − .6468) − (.3478 − .3532) = .6199

With the above serving percentages, Kevin Anderson is favored to win the best-of-five match with probability Pm(0, 0, 0, 0, 0, 0) = .5139.

3.5 James-Stein Estimator

Decades ago, Efron and Morris described a method to estimate groups of sample means (Efron and Morris, 1977).
The James-Stein estimator shrinks sample means toward the overall mean, with shrinkage proportional to each estimator's variance. Regardless of the value of the true parameter θ, this method produces results superior in expectation to maximum-likelihood estimation.

To estimate serve/return parameters for players who do not regularly play tour-level events, fi and gi must be calculated from limited sample sizes. Consequently, match probabilities based on these estimates may be skewed by noise. The James-Stein estimator offers a more reasonable estimate of serve and return ability for players with limited match history. To shrink serving percentages, we compute the variance of all recorded fi statistics in our match dataset Dm:³

τ̂² = (Σfi∈Dm (fi − fav)²) / (|Dm| − 1)

³ Each fi is computed from the previous twelve months of player data.

Then, each estimator fi is based on ni service points. With each estimator fi representing the fraction of ni service points won, we can compute estimator fi's variance and a corresponding normalization coefficient:

σ̂i² = fi(1 − fi) / ni

Bi = σ̂i² / (τ̂² + σ̂i²)

Finally, the James-Stein estimator takes the form:

JS(fi) = fi + Bi(fav − fi)

We repeat the same process with gi to obtain James-Stein estimators for return statistics. To see how shrinkage makes our model robust to small sample sizes, consider the following example. When Daniel Elahi (COL) and Ivo Karlovic (CRO) faced off at ATP Bogota 2015, Elahi had played only one tour-level match in the past year. From a previous one-sided victory, his year-long serve percentage, fi = 51/64 = .7969, was abnormally high compared to the year-long tour-level average of fav = .6423.
player name    s points won   s points   fi      r points won   r points   gi      Elo rating
Daniel Elahi   51             64         .7969   22             67         .3284   1516.92
Ivo Karlovic   3516           4654       .7555   1409           4903       .2874   1876.95

With ft = .6676:

fij = ft + (fi − fav) − (gj − gav) = .6676 + (.7969 − .6423) − (.2874 − .3577) = .8925
fji = ft + (fj − fav) − (gi − gav) = .6676 + (.7555 − .6423) − (.3284 − .3577) = .8101

Following Klaassen and Magnus' method of combining player outcomes, we compute Elahi's service percentage to be 89.3%. This is extremely high, eclipsing Karlovic's 81.01% serve projection. This is strange, given that Karlovic is one of the most effective servers in the history of the game. From these serving stats, our hierarchical Markov model computes Elahi's win probability as Pm(0, 0, 0, 0, 0, 0) = .8095. This forecast seems unreasonably confident of Elahi's victory, given that we have only collected his player statistics for one match. By contrast, Karlovic's 360-point Elo advantage implies a win probability for Elahi of

π̂ij(t) = (1 + 10^((1876.95 − 1516.92)/400))^(−1) = .1459

which leads us to further question the validity of this approach when using limited historical data. Thus, we turn to the James-Stein estimator to shrink Elahi's serving and return probabilities toward the overall means fav, gav:

JS(fi) = fi + Bi(fav − fi) = .7969 + .7117(.6423 − .7969) = .6869
JS(gi) = gi + Bi(gav − gi) = .3284 + .7624(.3577 − .3284) = .3507
JS(fj) = fj + Bj(fav − fj) = .7555 + .0328(.6423 − .7555) = .7518
JS(gj) = gj + Bj(gav − gj) = .2874 + .0420(.3577 − .2874) = .2904

JS(fij) = ft + (JS(fi) − fav) − (JS(gj) − gav) = .6676 + (.6869 − .6423) − (.2904 − .3577) = .7795
JS(fji) = ft + (JS(fj) − fav) − (JS(gi) − gav) = .6676 + (.7518 − .6423) − (.3507 − .3577) = .7841

Above, we can see that the James-Stein estimator shrinks Elahi's stats far more than Karlovic's, since Karlovic has played many tour-level matches in the past year. Given JS(fij) and JS(fji), we compute Pm(0, 0, 0, 0, 0, 0) = .4806.
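The shrinkage computation above can be sketched as follows (names are mine, and τ̂² is passed in as a pre-computed population variance; the value used here is purely illustrative):

```python
def james_stein(f, n, f_av, tau_sq):
    """Shrink a serve/return percentage f, estimated from n points,
    toward the tour average f_av; tau_sq is the (pre-computed) variance
    of the statistic across all players."""
    sigma_sq = f * (1 - f) / n            # variance of this estimator
    B = sigma_sq / (tau_sq + sigma_sq)    # normalization coefficient
    return f + B * (f_av - f)

# Illustrative tau_sq; with only 64 points, sigma_sq dominates and the
# estimate shrinks most of the way toward f_av, while a 4,654-point
# sample barely moves.
small_sample = james_stein(0.7969, 64, 0.6423, 0.002)
big_sample = james_stein(0.7555, 4654, 0.6423, 0.002)
```

The coefficient Bi tends to 1 as ni shrinks (full shrinkage to the mean) and to 0 as ni grows (the raw percentage stands).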
By shrinking the serve/return statistics, our model lowers Elahi's inflated serve percentage and becomes more robust to small sample sizes. As point-based forecasts on limited data threaten model performance, especially with respect to cross-entropy, the James-Stein estimator offers a safer way to predict match outcomes. Later on, we will use the James-Stein estimator to normalize not only year-long serve/return statistics, but also surface-specific and opponent-adjusted percentages.

[Figure: normalization coefficient Bi as a function of n, for a player who has averaged fi = .64 over n points in the past twelve months.]

3.6 Opponent-Adjusted Serve/Return Statistics

While Barnett and Clarke's equation does consider an opponent's serve and return ability, it does not track the average ability of opponents within a player's history. This is important, as a player's serve/return percentages may become inflated from playing weaker opponents, or vice versa. In this section, we propose a variation of Barnett and Clarke's equation which, for pi, replaces fav, gav with opponent-adjusted averages 1 − gi_opp_av, 1 − fi_opp_av. The equations then become:

fij = ft + (fi − (1 − gi_opp_av)) − (gj − (1 − fj_opp_av))
fji = ft + (fj − (1 − gj_opp_av)) − (gi − (1 − fi_opp_av))

gi_opp_av represents the average return ability of the opponents pi has faced in the last twelve months. To calculate it, we weight each opponent's return ability gj by the number of points in their respective match:

gi_opp_av = (Σt=1..12 Σk∈(i,y−1,m+t) (# of pi's return points in match k) · (pj's return ability before match k)) / (Σt=1..12 Σk∈(i,y−1,m+t) # of pi's return points in match k)

To illustrate the effect of tracking opponent-adjusted statistics, we consider the 2014 US Open first-round match between Mikhail Youzhny (RUS) and Nick Kyrgios (AUS).
player name       s points won   s points   fi      r points won   r points   gi      Elo rating
Mikhail Youzhny   1828           2960       .6176   1145           2947       .3885   1749.77
Nick Kyrgios      900            1370       .6569   424            1323       .3205   1645.03

From the standard approach, with ft = .6583, we compute fij = .6446, fji = .6159. Then, to calculate adjusted statistics, we consider the expected number of points won on serve/return, given Youzhny and Kyrgios' past opponents. As we can observe from fi_opp_av and gi_opp_av, Kyrgios has faced slightly stronger opponents than Youzhny over the past twelve months.

player name       gi_opp_av   s expected   s adj    fi_opp_av   r expected   r adj
Mikhail Youzhny   .4156       1729.74      .0332    .6812       939.60       .0697
Nick Kyrgios      .4313       779.19       .0882    .7060       388.92       .0265

s_adj_i, r_adj_i, and the adjusted serve probabilities adj(fij), adj(fji) are calculated as follows:

s_adj_i = (# points won on serve / # points on serve) − (1 − gi_opp_av) = fi − (1 − gi_opp_av)
r_adj_i = (# points won on return / # points on return) − (1 − fi_opp_av) = gi − (1 − fi_opp_av)

adj(fij) = ft + (fi − (1 − gi_opp_av)) − (gj − (1 − fj_opp_av)) = ft + s_adj_i − r_adj_j = .6288
adj(fji) = ft + (fj − (1 − gj_opp_av)) − (gi − (1 − fi_opp_av)) = ft + s_adj_j − r_adj_i = .6406

Using regular serve probabilities, Youzhny is favored to win with Pm(0, 0, 0, 0, 0, 0) = .6410. With adjusted serving percentages, we factor in the stronger opponents in Kyrgios' match history, and Youzhny's win probability drops to Pm(0, 0, 0, 0, 0, 0) = .4411. While adjusted serve/return statistics do not always cause significant shifts in win probability, they consistently account for the relative ability of players' opponents in producing serve probabilities fij, fji.

3.7 Results

The following results were obtained using ATP best-of-three matches in our point-by-point dataset, excluding Davis Cup.⁴ We observe performance across variants of Elo-based predictions and point-based models.
Since all original implementations provide explicit formulas with no optimization, we directly assess their performance on 2014 tour-level data (2,409 matches). In the case of logistic regression, the model was trained on tour-level data from 2011-2013 (7,828 matches). We measure each model's performance in terms of accuracy, as in section 3.3. We also consider log loss, a measure of the overall likelihood of a model's predictions. Log loss is important because it penalizes models that assign over-confident or under-confident win probabilities. As we are hoping to optimize in-match prediction models' log loss, it is important to consider the likelihood of various pre-match forecasts.

4The Davis Cup is a year-long competition between teams organized by country. Since Kovalchik excluded Davis Cup matches from her evaluation (Kovalchik, 2016) and many match-ups involve relatively unknown players, we exclude all Davis Cup matches.

The following terms express variations of the point-based and Elo models:

KM - point-based hierarchical Markov model with combined serve/return percentages from Barnett/Clarke
James-Stein - version of a KM model with all serve/return percentages normalized with the James-Stein estimator
surface - version of a model where all ratings and percentages are stratified by surface (Hard, Clay, Grass)
538 - denotes FiveThirtyEight's "decaying k-factor" method of computing Elo ratings

Table 3.1: Results Table

Method                           Variation              Accuracy   Log Loss
KM                                                      64.8       .649
KM                               James-Stein            65.4       .616
KM                               surface                63.3       .707
KM                               surface James-Stein    63.6       .639
KM                               adjusted               67.8       .632
KM                               adjusted James-Stein   67.9       .617
Elo                                                     69.1       .586
Elo                              surface                68.4       .591
Elo                              538                    69.2       .587
Elo                              surface 538            69.4       .592
Logit Elo 538, surface Elo 538                          69.5       .577

3.7.1 Discussion

As expected, James-Stein normalization significantly improves each point-based model's log loss.
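Log loss, the metric reported in Table 3.1, can be computed as in the following minimal sketch (a standard-library implementation with illustrative names, not the thesis's code):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary match outcomes.

    y_true: list of 0/1 outcomes (1 = pi won); y_pred: predicted win
    probabilities for pi. Predictions are clipped away from 0 and 1 so
    the logarithm is always defined.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

An over-confident wrong forecast (e.g. p = .99 for a match that is lost) contributes −log(.01) ≈ 4.6 to the sum, which is why log loss punishes poorly calibrated models far more than accuracy does.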
While surface-based Elo is competitive with regular Elo, restricting Klaassen and Magnus' point-based method to surface-specific data clearly hurts performance, presumably due to limited surface-specific data. As many players perform differently across surfaces, computing serve percentages from surface-specific data offers the potential to express how players' performance varies by surface. Yet, even with James-Stein normalization, the loss in sample size appears to outweigh the benefits obtained from surface-specific data. As we did not rigorously explore possibilities for weighting players' surface-specific statistics, there may still be potential for a point-based model that effectively considers surface in computing serve/return percentages. Finally, using opponent-adjusted serve/return statistics further improved performance. By fully incorporating each opponent's ability in a player's match history, this model approached Elo ratings in the similarity of its predictions, with win-probability forecasts between the two sharing R2 = .76. By plugging Elo and surface Elo into a logistic regression model, we achieve a log loss of .577. After fitting to the training data, this model learned coefficients that assigned a 56% weight to regular Elo ratings and a 44% weight to surface Elo ratings. The near-even weighting of both types of ratings suggests how important surface-specific ability is in forecasting matches. Aside from models which draw information directly from the betting market, no other model has documented a preferable log loss. Kovalchik reported 70% accuracy using 538's Elo method (Kovalchik, 2016), calculating Elo ratings using satellite events (ATP Challengers and Futures) in addition to tour-level events.5 While this accounted for a small increase in accuracy, her method achieved a log loss of .59, which does not outperform our implementation.
For simplicity's sake, we calculated all Elo ratings and statistics from tour-level matches alone.6

5One can observe her implementation at https://github.com/skoval/deuce.
6There are about 3,000 tour-level matches each year.

In the end, our Elo-based logistic regression's incremental improvement over singular Elo ratings implies that an effective rating system should consider overall ability as well as surface-specific ability. Sipko recently explored machine-learning methods for pre-match prediction, surveying logistic regression, the common-opponent model, and an artificial neural net (Sipko and Knottenbelt, 2015). While he claimed to have achieved a 4.3% return-on-investment off the betting market with an artificial neural net, his machine-learning models did not beat a log loss of .61 when predicting 2013-2014 ATP matches.7 Despite the current deep-learning craze throughout industry and academia (Lewis-Kraus, 2016), no one has published findings to suggest its superiority in predicting tennis matches. Short of market-based models, it appears most effective to stick with Elo's "ruthlessly Bayesian design," which FiveThirtyEight so frequently touts (Bialik et al., 2016). Transitioning onward, we will consider how these findings may inform an effective in-match prediction model.

7The models were trained on 2004-2010 ATP matches and validated on 2011-2012 ATP matches.

4. In-game Match Prediction

The following methods will be tested on tour-level matches for which we have point-by-point data. The matches span 2010-2017, accounting for nearly half of all tour-level matches within this time. Point-by-point records in Sackmann's dataset take the form of the following string: Mikhail Youzhny vs.
Evgeny Donskoy (Australian Open 2013, Round 2)

P = ‘‘SSRSS;RRRR;SRSSS;SRRSRSSS;SRSSRS;RSRSSS;SRSRSS;RSRSRSSS;SSSS.SSSRRRSS;RSSSS;SSRSS;SSSRS;SSSS;RRRSSSSRRSSRRSRSSS;SRSRSS;SSSRS;RSRSSRSS;SSSS;SRSSS;RSRSSRRSSS;R/SR/SS/RR/RS/SR.RSRRR; ...’’

S denotes a point won by the server and R a point won by the returner. Individual games are separated by ";", sets by ".", and service changes in tiebreaks by "/". By iterating through the string, one can construct n data points {P0, P1, ..., Pn−1} from a match with n total points, with Pi representing the subset of the match after which i points have been played.

P0 = ""
P1 = "S"
P2 = "SS"
P3 = "SSR"
...

With M = {M1, M2, ..., Mk} representing the complete match-strings in our point-by-point dataset, the size of our enumerated dataset becomes Σ_{i=1}^{k} |Mi|. In total, this sums to over 1.2 million individual points within ATP matches spanning 2010-2017. While many in-match prediction models utilize the hierarchical Markov model structure, we will first test several machine-learning methods as a baseline. Now, we will explore the following methods:

• Logistic Regression
• Random Forests
• Hierarchical Markov Model
• Beta Experiments
• Elo-induced Serve Percentages

4.1 Logistic Regression

Consider a logistic regression model where each observation xi = [xi1, xi2, ..., xik] has k features. When P represents pi's win probability,

log(P/(1 − P)) = β0 + β1(xi1) + β2(xi2) + ... + βk(xik)

From any scoreline (si, sj, gi, gj, xi, xj), we can simply feed these values into our model as features of xi. Logistic regression's structure makes it easy to consider additional features for each player, such as Elo difference, surface Elo difference, or break advantage. Before adding more features to the model, we consider two baselines: a model using score differentials (si − sj, gi − gj, xi − xj) and another model trained on Elo differences and the lead heuristic Lij.
This heuristic estimates pi's total lead in sets, games, and points:

Lij = (si − sj) + (1/6)(gi − gj) + (1/24)(xi − xj)

The coefficients preserve order between sets, games, and points, as one cannot lead by six games without winning a set, or by four points without winning a game. While it assumes these relative orderings, the heuristic is meant to approximate how much pi leads pj from any scoreline. In this model, we consider the following features in predicting the winner of the match:

Table 4.1: Logistic Regression Features

Variable          Description
lead margin       lead heuristic Lij
eloDiff           Elo(p0) − Elo(p1)
s eloDiff         s Elo(p0) − s Elo(p1)
setDiff           SetsWon(p0) − SetsWon(p1)
gameDiff          inSetGamesWon(p0) − inSetGamesWon(p1)
pointDiff         inGamePointsWon(p0) − inGamePointsWon(p1)
breakAdv          ServeGamesWon(p0) − ServeGamesWon(p1) + I(currently serving)
brkPointAdv       I(holding break point) − I(facing break point)
sv points pct 0   percentage of points won on p0's serve thus far
sv points pct 1   percentage of points won on p1's serve thus far

Next, we test the following combinations of features:
1) setDiff + gameDiff + pointDiff (Differentials)
2) lead margin + eloDiff + s eloDiff (Elo)
3) all features (All)

4.1.1 Cross Validation

Each match in our best-of-three dataset has around 160 points on average. In tuning hyperparameters for machine-learning models, we implement k-fold group validation with k = 5. This randomly splits the training set into five subsets of roughly equal size, while always grouping points from the same match in the same fold. By grouping points from the same match together, this method prevents overlap across train, validation, and test sets. Treating each subset as a validation set, we train all models on the remaining data before testing performance on the corresponding validation set. Then we average performance across all k validation sets and select the hyperparameters whose models performed best.
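The group-wise splitting described above can be sketched as follows (a minimal standard-library version; function and variable names are illustrative, and scikit-learn's GroupKFold provides the same behavior):

```python
import random
from collections import defaultdict

def group_kfold(match_ids, k=5, seed=0):
    """Split point indices into k folds, keeping every point from the
    same match in the same fold.

    match_ids: list where match_ids[i] names the match containing point i.
    Returns k folds, each a list of point indices. Round-robin assignment
    over shuffled matches gives roughly equal folds when match lengths
    are similar, as they are in our best-of-three dataset.
    """
    by_match = defaultdict(list)
    for idx, match in enumerate(match_ids):
        by_match[match].append(idx)
    matches = list(by_match)
    random.Random(seed).shuffle(matches)
    folds = [[] for _ in range(k)]
    for j, match in enumerate(matches):
        folds[j % k].extend(by_match[match])  # whole matches, never split
    return folds
```

Each fold then serves once as the validation set while models train on the remaining folds; because matches never straddle folds, no points from a validation match leak into training.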
When we assess model performance on validation and test sets, results then reflect performance on matches a model has never seen before.

4.2 Random Forests

Brian Burke's win-probability models are among the most well-known in sports (Burke, 2014). They calculate a team's win probability at any point in a game from historical data, through a combination of binning inputs and smoothing the resulting probabilities. Lock and Nettleton built on this method by using a Random Forest approach (Lock and Nettleton, 2014). A Random Forest consists of an ensemble of randomly generated classification trees. Each tree forms decision functions for a subset of features, with splits chosen for maximum discriminatory ability. As Lock and Nettleton do so with football-specific features such as down, points, and yards to goal, we do so with analogous tennis features and train on our validation set.

Table 4.2: Random Forest Features

Variable      Description
surface       hard, clay, grass
set           first, second, third
eloDiff       Elo(p0) − Elo(p1)
setDiff       SetsWon(p0) − SetsWon(p1)
gameDiff      inSetGamesWon(p0) − inSetGamesWon(p1)
pointDiff     inGamePointsWon(p0) − inGamePointsWon(p1)
breakAdv      ServeGamesWon(p0) − ServeGamesWon(p1) + I(currently serving)
brkPointAdv   I(holding break point) − I(facing break point)

4.3 Hierarchical Markov Model

With serving percentages already calculated from historical data, our hierarchical Markov model is well-equipped to produce in-match win probability estimates. Using the analytical equation with players' serving abilities fij, fji, we compute Pm(si, sj, gi, gj, xi, xj) from every scoreline (si, sj, gi, gj, xi, xj) in a match. To assess this model's performance, we repeat this on every match in our dataset, with fij, fji computed by one's method of choice.

4.3.1 Beta Experiments

The above approach only takes into account the current score and pre-calculated serve percentages when computing win probability.
However, in many cases, relevant information may be collected from Pk. Consider the following in-match substring:

P = "SSSS;RSSSRRSS;SSSS;SRRSRSRSSS;SSSS;RRRSSSRSRSSS;"

The above sequence demonstrates a current scoreline of three games all. However, pi has won 12/12 service points, while pj has won 18/30 service points. If both players continue serving at similar rates, pi is much more likely to break serve and win the match. Since the original forecasts fij, fji are based on historical serving percentages, it makes sense that in-match serving percentages may help us better determine each player's serving ability on a given day. To do this, we can update fij, fji at time t of the match to factor in each player's serving performance thus far. Bevc explored this method by modeling each player's serving percentage fij with a beta prior b_prior that could be updated mid-match to yield posterior estimates (Bevc, 2015). Through beta-binomial conjugacy, we can update b_prior when a player has won swon points on serve out of stotal trials:

b_prior ∼ Beta(a, b)
b_posterior ∼ Beta(a + swon, b + stotal − swon)

If fij is our prior, a, b must satisfy E(b_prior) = a/(a + b) = fij. Then a is a hyperparameter that determines the strength of our prior. To find a suitable prior strength, we will test various values of a via cross-validation on our training set. Regardless of a, the match's influence on our posterior serve estimates always grows as more points are played. At any point in the match, we can obtain a posterior serving percentage for either player of the form

f′ij = E(b_posterior) = (a + swon) / (a + b + stotal)

4.3.2 Elo-induced Serve Probabilities

Earlier on, Klaassen and Magnus suggested a method to infer serving probabilities from a pre-match win forecast πij. By imposing the constraint fij + fji = t, we can create a one-to-one function S : (πij, t) → (f̂ij, f̂ji), which generates serving probabilities f̂ij, f̂ji for both players such that Pm(0, 0, 0, 0, 0, 0) = πij.1
As their paper was published in 2002, Klaassen and Magnus inverted their match probability equations to produce serve probabilities for ATP rank-based forecasts. However, since Elo outperforms ATP rank, we apply this method to Elo forecasts. Due to branching complexity (see section 2.5), our hierarchical Markov model's match probability equation has no analytical inverse, even when we specify fij + fji = t. Therefore, we turn to the following approximation algorithm to generate serving percentages that correspond to a win probability within ε of our Elo forecast's:

Algorithm 1 elo-induced serve probabilities
procedure EloInducedServe(prob, sum, ε)
    s0 ← sum/2
    currentProb ← .5
    diff ← sum/4
    while |currentProb − prob| > ε do:
        if currentProb < prob then
            s0 += diff
        else
            s0 −= diff
        diff ← diff/2
        currentProb ← matchProb(s0, sum − s0)
    return s0, sum − s0

To generate elo-induced serve probabilities for a given match, we run the above algorithm with prob = πij, sum = fij + fji, and ε set to a desired precision level.2 At each step, we call matchProb() to compute the win probability from the start of the match if pi and pj had serve probabilities fij = s0, fji = sum − s0, respectively. Then we compare currentProb to prob and increment s0 by diff, which halves at every iteration. This process continues until the serve probabilities s0, sum − s0 produce a win probability within ε of prob, taking O(log(1/ε)) calls to matchProb. This inverse algorithm is useful for several reasons. Given any effective pre-match forecast πij, we can produce serve probabilities that are consistent with πij according to our hierarchical Markov model. By setting the constraint fij + fji = t, we also ensure that the sum of our players' serve probabilities agrees with historical data.3

1fij and fji are computed as specified in 3.4 with James-Stein normalization, so as to prevent extreme results.
2For purposes of this project, setting ε = .001 is sufficiently accurate.
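Algorithm 1 can be sketched in Python as follows. The full hierarchical Markov matchProb is out of scope here, so a stand-in monotone forecast is injected as a parameter; all names are illustrative, and the stand-in is not the thesis's model:

```python
import math

def elo_induced_serve(prob, total, eps, match_prob):
    """Bisection search for serve probabilities (s0, total - s0) whose
    implied match-win probability lands within eps of the target prob.

    match_prob(fi, fj) must increase in fi and decrease in fj, as the
    hierarchical Markov model's match probability does.
    """
    s0 = total / 2
    current = 0.5
    diff = total / 4
    while abs(current - prob) > eps:
        if current < prob:
            s0 += diff  # implied forecast too low: strengthen pi's serve
        else:
            s0 -= diff  # implied forecast too high: weaken it
        diff /= 2       # halve the step, as in the algorithm above
        current = match_prob(s0, total - s0)
    return s0, total - s0

def toy_match_prob(fi, fj):
    """Monotone stand-in for matchProb (NOT the thesis's model):
    a logistic curve in the serve-probability gap."""
    return 1 / (1 + math.exp(-12 * (fi - fj)))
```

For example, `elo_induced_serve(0.70, 1.29, 0.001, toy_match_prob)` returns a pair summing to t = 1.29 whose implied win probability is within .001 of .70, after O(log(1/ε)) evaluations.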
3While it is just an estimate, t = fij + fji is our best predictor of overall serving ability among the methods covered in this project.

While Klaassen and Magnus argue that t = fij + fji is largely independent of πij, t holds greater importance when predicting specific scorelines of a match (Barnett et al., 2006). Using the hierarchical Markov model's equations, Barnett specifically computed the probability of reaching any set score given fij, fji. When comparing probabilities across matches, one can observe that as t increases, closer set scores and tiebreak games become more likely.4 That is, t encodes information regarding the likely trajectories of a match scoreline and the relative importance of winning each game on serve. For higher t, service games are won more often and service breaks hold relatively more importance, while the opposite holds for lower t. Now, given πij and t, we can produce elo-induced serve probabilities for any two players.

4E.g. 7-6, 6-7, 7-5, 5-7.

4.3.3 Importance of fij + fji

To illustrate the importance of t = fij + fji, we consider the following two matches:

match   pi                pj            tny name             surface   πij    t
m1      Feliciano Lopez   John Isner    ATP Shanghai 2014    hard      .478   1.55
m2      Andreas Seppi     Juan Monaco   ATP Kitzbühel 2014   clay      .476   1.14

In both matches, πij is approximately the same, while t differs significantly. To visualize how service breaks can have varying importance, consider win-probability graphs corresponding to the following progression of points, P. In each scenario, pi and pj's serve probabilities are calculated with the method of 4.3.2.

P = "SSSS;SSSS;SSSS;SSSS;SSSS;RRRR"

Over the first five games, both players win every point on serve. When t = 1.55, service holds are expected and win probability hardly moves. After pi breaks serve in the sixth game, his win probability suddenly spikes up, indicating a substantial advantage. When t = 1.14, our model assumes that players win service games with much lower probability.
Therefore, each time a player holds serve, his win probability noticeably increases. When pi breaks in the sixth game, his win probability climbs, but not as dramatically as in the previous case. When service games are won with lower probability, service-break advantages become weaker. t's influence on the trajectory of a match suggests that prediction from any in-match scoreline will be most effective when t accurately reflects the players' overall serving ability. While Elo ratings or betting odds yield the most effective pre-match predictions (Kovalchik, 2016), their implied odds offer no information about t in a given match-up. As t holds significance for in-match forecasting from specific scorelines, method 4.3.2 allows us to combine any effective pre-match forecast πij with t to produce new serving probabilities.

4.4 Results

The following models were trained on 2011-2013 point-by-point data (4,584 matches) and tested on 2014 point-by-point data (1,855 matches). For consistency, we only used best-of-three ATP tour-level matches and excluded matches from the Davis Cup. From cross-validation on the training set, we determined that setting hyperparameter a = 300 yielded optimal performance across beta experiments. We then added a beta variation to KM elo-induced and KM logit-induced, our two best-performing point-based models. As each observation corresponds to a single point in a match, longer matches contribute more to our dataset and have a larger influence on evaluation metrics. Since addressing this outsized influence could complicate the results' interpretability, we present model evaluation on a point-by-point basis.
LR - Logistic Regression model with the feature set specified in 4.1
RF - Random Forest model with the feature set specified in 4.2
Equivalent - KM model with serve probabilities set to tour-level averages (assumes equal player ability)
elo-induced - KM model with serve probabilities generated from the Elo-based win probability
logit-elo - KM model with serve probabilities generated from the logistic regression on Elo and surface Elo

Table 4.3: Results Table

Method   Variation              Accuracy   Log Loss   Calibration
LR       Differentials          71.4       .540       .993
LR       Elo                    75.2       .505       1.006
LR       All                    76.2       .488       .993
RF                              70.1       .562       .902
KM       Equivalent             71.5       .538       1.040
KM       James-Stein            74.9       .501       1.002
KM       adjusted James-Stein   75.9       .498       .992
KM       elo-induced            76.2       .486       1.001
KM       elo-induced (a=300)    76.3       .481       1.003
KM       logit-elo              76.5       .481       .996
KM       logit-elo (a=300)      76.5       .477       .999

Of our models, Random Forest had the least predictive power. While this is perhaps due to the difficulty of constructing an effective feature space, nearly all features fed into each machine-learning model had an additive effect on win probability: positive score differentials, break advantages, and Elo differences all boost a player's expected win probability. The only non-additive features considered were "set" and "surface." While the Random Forest model was able to consider various situations among these two features, we conclude that its inclination to group similar examples together, rather than estimate linear relationships between variables, ultimately accounted for its performance. Logistic Regression, on the other hand, is designed to learn additive relationships between features and its probability output. Its performance with the full feature set was competitive with our point-based models, despite their superior grasp of tennis' scoring system.5 Presumably, this model holds an advantage over point-based models in its ability to fit itself to the training data.
Aside from our hyperparameter search over a in the beta experiments, the point-based models offer no comparable method to optimize performance with respect to a training set.

5Consider, for example, how logistic regression has no knowledge of when tiebreak games occur.

Of the point-based models, KM logit-elo (a=300) achieved the best performance, although the improvement from running beta experiments was marginal. Since Elo ratings produce more reliable pre-match win forecasts πij than estimates of fij, fji from historical serve/return data, using serve probabilities induced from Elo ratings, and then from our logit Elo model, improved our model's performance by significant increments. Over the next few sections, we will examine our models' calibration and performance across subsets of our dataset.

4.4.1 Model Calibration

Calibration represents the ratio between the observed rate of success and the mean predicted rate of success. From a perfectly calibrated model that predicts a win probability of p, we can always expect a win to occur with probability p. While the Random Forest model under-predicted wins, the rest of our models closely resemble a perfectly calibrated model. From our table, KM logit-elo and KM James-Stein exhibited near-perfect calibration of 1.001 and .999, while Logistic Regression reported .993. As there is a degree of uncertainty in measuring calibration, we could not reasonably expect to produce more well-calibrated models. Given a win probability forecast πij at any point of a match within our test set, we expect pi to win with probability πij.

4.4.2 By Set

Since we have less information about a match early on, it is always most difficult to forecast matches during the first set. However, across both metrics, models are most effective during the second set. By this time in the match, the winner of the first set is often a clear favorite to win the match, unless their opponent has a significantly higher Elo rating or serve percentage fji.
As 64.3% of matches in our test set are straight-sets victories, many matches offer relatively easy predictions for our model during the second set.6 Uncertainty increases once the match enters a third set, because the set score has evened at 1-1. While the set score is tied in both the first and third sets, it is easier to predict the outcome during the third set, because its outcome corresponds directly to the match outcome. When predicting in the first set, the player who wins the first set can still go on to lose the match, which increases uncertainty.

6Best-of-three matches that end in two sets are straight-sets victories.

4.4.3 By Surface

By surface, models predicted grass-court matches most effectively and clay-court matches least effectively. As one can observe from historical tournament averages ft across surfaces, grass courts tend to yield higher serving percentages fij, fji than clay courts, presumably because the ball bounces more quickly off grass than off clay. One explanation for the observed differences is that service breaks become more likely on clay courts because fij, fji are lower. With service breaks more likely, it becomes harder for our models to predict the match, because advantages from a current scoreline are more likely to be turned around. Hard courts, where the majority of matches took place, ultimately proved in between grass and clay in predictability. Further insight into the volatility of matches may formally shed light on the differing results across surfaces.7

7We may calculate a measure of volatility from point-by-point sequences or a model's win probability output.

4.5 Visualizing Win Probability

As the most effective Logistic Regression and KM models were not far off in predictive performance, we now visualize their forecasts over several tour-level matches. Over the first thirty points of the match, the contrast between our models is reminiscent of 4.3.3.
Since our Logistic Regression model has learned specific coefficients for the "point diff" feature, Reister's win probability spikes up and down noticeably as he and Gasquet exchange games on serve. Our KM logit-elo model seems to have set t = fij + fji to a high value, expecting frequent service holds and producing a relatively static win probability over this time frame. While we can communicate the relative importance of service breaks with the constraint t = fij + fji, there is no direct method for controlling probability fluctuations with Logistic Regression. Over the last thirty points of the match, consider how Reister's win probability slowly declines according to the KM model, while it remains approximately constant according to Logistic Regression. One drawback of our Logistic Regression feature sets was that they could not distinguish between situations whose score differentials are equivalent. As Gasquet maintains a break advantage through the end of the third set, this explains Logistic Regression's behavior, since score differentials remain similar. In general, consider a player serving at (1,1,5,4,3,0) and one serving at (1,1,1,0,3,0). In the first situation, the player serving wins the match if he wins any of the next three points. In the second scenario, the player serving holds a break advantage early in the final set, from which the returner has many more chances to come back. Assuming each player serves at fi = fj = .64, our point-based win-probability equation suggests a substantial difference between these two scenarios:

Pm(1, 1, 5, 4, 3, 0) = .994
Pm(1, 1, 1, 0, 3, 0) = .800

Although the first situation is clearly favorable, Logistic Regression will observe equivalent differentials (set diff, game diff, and point diff) and compute approximately the same probability in both scenarios.8 In a similar example, Logistic Regression may also fail to notice when a higher-ranked player is about to lose a close match.
Simon leads for most of the match until Gabashvili comes back to force a third set. At the end of the third set, Gabashvili serves out the match from 5-4, yet the model fails to detect that Simon is about to lose. This is because, with sets at 1-1 and game differentials nearly even, Simon's Elo advantage outweighs his current scoreline disadvantage. Consequently, his win probability hovers around 50% as he loses the match, rather than falling toward 0. While we considered several score-based interaction terms to prevent scenarios like this, adding them to our feature set did not significantly change results. Still, when comparing model performance in specific scenarios, it is important to note that the KM model was specifically designed for tennis prediction, while our Logistic Regression model was simply fed score-related features that seemed helpful. Despite apparent faults, observed performance suggests potential in further exploring machine-learning methods.

8After fitting coefficients, P(x) = P(si − sj, gi − gj, xi − xj) = e^(β0 + β1(si−sj) + β2(gi−gj) + β3(xi−xj)) / (1 + e^(β0 + β1(si−sj) + β2(gi−gj) + β3(xi−xj))), and P(1,1,5,4,3,0) = P(1,1,1,0,3,0) since the differentials are equal.

5. Conclusion

For the first time, we have documented the performance of in-match prediction methods on thousands of matches. While previous work only referenced individual matches or the 1992-1995 Wimbledon match dataset (Klaassen and Magnus, 2003), we have applied methods to thousands of matches from recent years. Results from this project and Jeff Sackmann's point-by-point dataset should serve as an appropriate benchmark for future work in this field. Although many papers have referenced Klaassen and Magnus' point-based model, none has explicitly outlined a way to correct for uncertainty stemming from small sample sizes in players' serve/return history. Applying a normalization method, such as the James-Stein estimator in 3.5, significantly boosts the performance of point-based models.
By factoring in the overall ability of a player's opponents in the past twelve months, calculating fij, fji with the opponent-adjusted method in 3.6 offers additional predictive power. Still, none of these variations produce pre-match forecasts as effectively as Elo ratings. Using our approximation algorithm in 4.3.2, we can derive serve probabilities from any pre-match forecast πij and the historical serve parameter t = fij + fji. A combination of this method, using πij derived from an Elo-based logistic regression, and our KM model ultimately produced the most reliable in-match prediction model. Still, logistic regression's performance with a score-based feature set proved noteworthy. Machine-learning models offer potential in their ability to learn from a training set. While serve parameters fij, fji may be obtained from a machine-learning model's forecast πij, our traditional point-based hierarchical Markov model offers no comparable method to "train" itself on a dataset. Future work with point-based models that offers some aspect of optimization with respect to a training set may be of much interest to fans and bettors alike. For the time being, however, we have outlined several approaches to produce well-calibrated win probability forecasts at any point in a tennis match.

5.1 Applications to Betting Markets

Methods explored in this project may be applied to betting markets in several ways. First, we can take any pre-match forecast πij derived from implied market odds, together with the historical serve parameter t = fij + fji, and produce corresponding serve probabilities fij, fji from method 4.3.2. As Kovalchik found that the Bookmaker Consensus Model outperformed Elo ratings (Kovalchik, 2016), we could likely further improve in-match prediction models by using pre-match betting odds. Next, we may back-test any of our in-match prediction techniques against historical in-match odds.
Betfair, an online gambling company, offers historical market data which is time-stamped yet lacks corresponding match scores. While mapping time-stamped market odds to corresponding scorelines across our point-by-point dataset proved out of scope, analyses with in-match betting data do exist. Huang demonstrated score inference from betting odds (Huang et al., 2011), and market-implied odds may serve as a benchmark against which to compare models. Finally, one may also test betting strategies with any of these models in real time, provided up-to-date Elo ratings and fij, fji.

5.2 Future Steps

Aside from the beta experiments, all point-based models in this paper operated under the assumption that points played by each server are i.i.d. While Klaassen and Magnus concluded that deviations from i.i.d. are small enough for such models to provide a reasonable approximation (Klaassen and Magnus, 2001), the truth is that points within a match are not independent. Although players train themselves to ignore the scoreboard, no player is truly capable of maintaining a constant level of play throughout every match. As soon as a player's level fluctuates, sequential points become dependent, suggesting dips or spikes in form throughout the match. Future work exploring momentum through dependencies between points, games, and sets could add another dimension of prediction to our point-based models. To start, one could quantify how much winning a point on serve affects the probability of winning the next point on serve. In total, Jeff Sackmann's point-by-point dataset offers over 100,000 matches over which to explore the effects of momentum.1

Methods explored in this paper were tested exclusively on ATP matches. All methods and analyses conducted in this paper may also be applied to women's WTA matches in Sackmann's point-by-point dataset and contrasted to illustrate match predictability across the men's and women's tours.
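As a starting point for the momentum question raised above, one could parse a Sackmann-style match string and compare a server's point-win rate immediately after winning versus losing the previous service point. The sketch below uses the encoding described in section 4; the function is an illustration of the idea, not the thesis's code:

```python
def serve_momentum(match_string):
    """Estimate P(server wins point | won previous serve point) vs.
    P(server wins point | lost previous serve point).

    match_string uses Sackmann's encoding: 'S' = point won by the
    server, 'R' = by the returner; ';' separates games, '.' sets, and
    '/' serve changes in tiebreaks. We condition within a game so the
    same player serves both points; tiebreaks are skipped for simplicity.
    """
    after_win = [0, 0]   # [points won, points seen] following an S
    after_loss = [0, 0]  # [points won, points seen] following an R
    for game in match_string.replace('.', ';').split(';'):
        if '/' in game:  # tiebreak: the server alternates, so skip
            continue
        for prev, cur in zip(game, game[1:]):
            bucket = after_win if prev == 'S' else after_loss
            bucket[1] += 1
            bucket[0] += 1 if cur == 'S' else 0
    rate = lambda b: b[0] / b[1] if b[1] else None
    return rate(after_win), rate(after_loss)
```

Under the i.i.d. assumption the two rates should agree in expectation, so a persistent gap aggregated over many matches would be evidence of serve momentum.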
5.3 Visualizing Grand Slam Matches

Below, we view win-probability graphs for several historical matches. Probabilities were generated by KM logit-elo (a = 300).

5.3.1 Roger Federer vs. Rafael Nadal, Australian Open 2017 Final

Earlier this year, Roger Federer overcame Rafael Nadal in the Australian Open final 6-4 3-6 6-1 3-6 6-3.

1 While we examined ATP tour-level matches, there are also corresponding WTA matches and satellite-level events for each tour.

5.3.2 Stanislas Wawrinka vs. Novak Djokovic, French Open 2015 Final

With Djokovic having faced pressure for years to complete a career Grand Slam with a French Open title,2 the title finally appeared within his grasp as he knocked out Nadal, then a nine-time French Open champion, en route to the 2015 final. However, Wawrinka had other plans on the day, turning the match around and overpowering Djokovic with unwavering determination.

2 This entails winning each of the Grand Slams once: Australian Open, French Open, Wimbledon, US Open.

5.3.3 Novak Djokovic vs. Roger Federer, US Open 2011 Semi-Final

In one of the most famous comebacks in tennis history, Djokovic recovered from a two-set deficit and then escaped defeat at 3-5 15-40 on Federer's serve in the final set.3

3 If one were to factor in Djokovic's decision to hit an all-or-nothing return winner at 3-5 15-40 in the fifth set, his win probability could be computed as even lower than p = .015, as Federer himself surmised (https://www.theguardian.com/sport/2011/sep/11/us-open-2011-federer-djokovic). Calculating in-match win probability within points, based on shot selection, is realistic nowadays thanks to Jeff Sackmann's Match Charting Project and recent analyses of Hawkeye ball-tracking data.

Bibliography

Barnett, T. and Clarke, S. R. (2005). Combining player statistics to predict outcomes of tennis matches. IMA Journal of Management Mathematics, 16(2):113–120.

Barnett, T. J., Clarke, S. R., et al. (2002). Using Microsoft Excel to model a tennis match.
In 6th Conference on Mathematics and Computers in Sport, pages 63–68. Queensland, Australia: Bond University.

Barnett, T. J. et al. (2006). Mathematical modelling in hierarchical games with specific reference to tennis. PhD thesis, Swinburne University of Technology.

Bevc, M. (2015). Predicting the outcome of tennis matches from point-by-point data.

Bialik, C., Morris, B., and Boice, J. (2016). How we're forecasting the 2016 U.S. Open. http://fivethirtyeight.com/features/how-were-forecasting-the-2016-us-open/. Accessed: 2017-10-30.

Burke, B. (2014). Win probability and WPA. http://www.advancedfootballanalytics.com/index.php/home/stats/stats-explained/win-probability-and-wpa. Accessed: 2017-10-30.

Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. WH Freeman.

Elo, A. E. (1978). The rating of chessplayers, past and present. Arco Pub.

Huang, X., Knottenbelt, W., and Bradley, J. (2011). Inferring tennis match progress from in-play betting odds. Final year project, Imperial College London, South Kensington Campus, London, SW7 2AZ.

Klaassen, F. J. and Magnus, J. R. (2003). Forecasting the winner of a tennis match. European Journal of Operational Research, 148(2):257–267.

Klaassen, F. J. G. M. and Magnus, J. R. (2001). Are points in tennis independent and identically distributed? Evidence from a dynamic binary panel data model. Journal of the American Statistical Association, 96(454):500–509.

Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3):127–138.

Lewis-Kraus, G. (2016). The great AI awakening. The New York Times Magazine.

Lock, D. and Nettleton, D. (2014). Using random forests to estimate win probability before each play of an NFL game. Journal of Quantitative Analysis in Sports, 10(2):197–205.

Madurska, A. M. (2012). A set-by-set analysis method for predicting the outcome of professional singles tennis matches. Imperial College London, Department of Computing, Tech. Rep.
Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of professional tennis matches.

Steinberg, D. (2017). Why ESPN uses those in-game win probability stats that drive some baseball fans nuts. https://www.washingtonpost.com/news/dc-sports-bog/wp/2017/04/23/why-espn-uses-those-in-game-win-probability-stats-that-drive-some-baseball-fans-nuts/?utm_term=.8163323a5ee2. Accessed: 2017-10-30.