Producing Win Probabilities for Professional Tennis Matches from any Score

Jacob Gollub
Harvard College, Cambridge, Massachusetts

Thesis submitted in partial fulfillment of the requirements for the degree of Bachelor of Arts in Statistics and Computer Science

Supervised by Kevin Rader

November 6, 2017

Acknowledgements

After coming across Michael Sipko's "Machine Learning for the Prediction of Professional Tennis Matches," I realized that I could write my senior thesis on the sport I am most passionate about. While I was abroad this past summer, much of this project's implementation took place in cafés around Tokyo, Japan. Thank you to my advisor, Kevin Rader, for his enthusiasm and support. Thank you to my parents, for pushing me to make the most of my academic opportunities at college. Finally, thank you to my former coaches Joaquín, Stewart, and Paul for their lessons and wisdom over the years. My lifelong obsession with tennis persists through this project.

Contents

1 Introduction
  1.1 In-game Sports Analytics
  1.2 History of Tennis Forecasting
  1.3 Match/Point-by-Point Datasets
  1.4 Implementation
2 Scoring
  2.1 Modeling games
  2.2 Modeling sets
  2.3 Modeling a tiebreak game
  2.4 Modeling a best-of-three match
  2.5 Win Probability Equation
3 Pre-game Match Prediction
  3.1 Overview
  3.2 Elo Ratings
  3.3 ATP Rank
  3.4 Point-based Model
  3.5 James-Stein Estimator
  3.6 Opponent-Adjusted Serve/Return Statistics
  3.7 Results
    3.7.1 Discussion
4 In-game Match Prediction
  4.1 Logistic Regression
    4.1.1 Cross Validation
  4.2 Random Forests
  4.3 Hierarchical Markov Model
    4.3.1 Beta Experiments
    4.3.2 Elo-induced Serve Probabilities
    4.3.3 Importance of fij + fji
  4.4 Results
    4.4.1 Model Calibration
    4.4.2 By Set
    4.4.3 By Surface
  4.5 Visualizing Win Probability
5 Conclusion
  5.1 Applications to Betting Markets
  5.2 Future Steps
  5.3 Visualizing Grand Slam Matches
    5.3.1 Roger Federer vs. Rafael Nadal, Australian Open 2017 Final
    5.3.2 Stanislas Wawrinka vs. Novak Djokovic, French Open 2015 Final
    5.3.3 Novak Djokovic vs. Roger Federer, US Open 2011 Semi-Final

1. Introduction

1.1 In-game Sports Analytics

"The win probability graphic/discussion on ESPN is literally taking a sword and sticking it through the chest of any fun left in baseball" — Kenny Ducey (@KennyDucey) April 2, 2017

"ESPN showing win probability is extremely good. Next up, run expectancy!" — Neil Weinberg (@NeilWeinberg44) April 4, 2017

"Thank the Lord we have the ESPN Win Probability stat to tell us the team that's ahead has a good chance to win." — Christian Schneider (@Schneider_CM) April 17, 2017

In recent years, win probability has become increasingly prevalent in sports broadcasting. Brian Burke, an NFL commentator, has posted live win-probability graphs of NFL playoff games on his website for the past few years (Burke, 2014). Earlier this year, ESPN began to post win probabilities atop the score box in televised Major League Baseball games (Steinberg, 2017). Despite mixed reactions from fans, as shown above, these developments represent a transition in sports broadcasting's modern narrative.

A win probability from any scoreline communicates how much a team or player is favored to win. While one can produce this from any model of choice, those in sports analytics strive to produce the most well-informed estimate available. With the recent proliferation of online betting, in-match win probability dictates an entire market of its own. Betfair's platform matched over 40 million euros during the 2011 French Open Final (Huang et al., 2011); other high-profile matches often draw comparable volume over betting exchanges.
While the majority of tennis prediction papers concern pre-match prediction, around 80% of online betting occurs while matches are in progress (Sipko and Knottenbelt, 2015). At the very least, in-match win probability concerns all participants in this betting market.

1.2 History of Tennis Forecasting

Statistical prediction models have been applied to tennis matches for over twenty years. Klaassen and Magnus tested the assumption that points in a tennis match are independent and identically distributed, or i.i.d. (Klaassen and Magnus, 2001). While the assumption is false, they concluded that deviations are small enough that constant probabilities provide a reasonable approximation. Under this assumption, they constructed a hierarchical Markov model in conjunction with tennis' scoring system. From this model, they offer an analytical equation for match-win probability from any score, given each player's probability of winning a point on serve (Klaassen and Magnus, 2003). Barnett and Clarke then offered a method of estimating each player's serve probability from historical data (Barnett and Clarke, 2005). Years later, Bevc proposed updating each of these serve probabilities with a beta distribution between points and computing the corresponding win probability with the above model (Bevc, 2015).

Recently, Kovalchik assessed the performance of 11 different pre-match prediction models on 2014 ATP tour-level matches. While the market-based Bookmaker Consensus Model performed best overall, Elo ratings proved most effective among methods that rely on historical data (Kovalchik, 2016). Over the past several years, Jeff Sackmann has released the largest publicly available library of tennis datasets via GitHub.
This collection contains match summaries of every ATP and WTA match in the Open Era, point-by-point summaries of nearly 100,000 tour-level and satellite matches, and a crowd-sourced match-charting project spanning over 3,000 matches, in which volunteers record each shot's type and direction over an entire match. While FiveThirtyEight and Kovalchik have used Jeff Sackmann's match data to generate Elo ratings, none of the aforementioned papers have tested models with his point-by-point dataset.

As past papers span several decades, datasets and model evaluation are not consistent. Klaassen and Magnus' original point-based model and Bevc's beta experiments both apply methods to around 500 Wimbledon matches from 1991-94. Barnett explores point-based models in great depth, yet primarily applies them to a single marathon match between Andy Roddick and Younes El Aynaoui from the 2003 Australian Open (Barnett et al., 2006). Except for Bevc, none of these papers assess in-game match prediction with metrics such as accuracy or cross-entropy. While Bevc does record the accuracy of results over 500 Wimbledon matches, he only reports accuracy of predictions from the third set onwards, a subset of the entire dataset. In other words, no one has tested all in-game prediction models at a large scale. For this reason, I test all relevant in-match prediction methods on thousands of matches from the past seven years. This paper combines Elo ratings, a wealth of data, and current technology to survey which in-match prediction methods perform best. We build upon past research by testing variations of previous state-of-the-art methods and applying new concepts to these datasets, from Random Forest models used in football to Elo-induced serving percentages.

1.3 Match/Point-by-Point Datasets

This project uses two different types of datasets: one with match summary statistics and one with point-by-point information.
Both are publicly available on GitHub, courtesy of Jeff Sackmann (https://github.com/JeffSackmann). We use the matches dataset under "tennis_atp" to test pre-match prediction methods. This dataset covers over 150,000 tour-level matches, dating back to 1968. Relevant features include:

• match info (date, tournament, player names)
• match statistics (serve/return match statistics for each player)

The data in "tennis_pointbypoint" offers more granular detail about a single match's point progression. Each match contains a string listing the order in which players won points and switched serve. While atpworldtour.com does not publicly list point-by-point summaries, Jeff Sackmann has made this information available from a betting website. Unlike the match dataset, this dataset covers around 12,000 tour-level matches dating back to 2010. Relevant features include:

• match info (date, tournament, player names)
• point-by-point (a string detailing the sequence in which points were won)

1.4 Implementation

With the above datasets serving as a basis for this project, thorough re-formatting of the data was required in order to connect point-by-point strings to their corresponding observations in the match dataset. As both datasets include player names, year, and score, we connected matches across datasets with a hashing scheme.¹ Due to observed inconsistencies, we standardized player names between the match and point-by-point datasets in order to maximize the number of available matches with point-by-point summaries.² While a portion of the point-by-point summaries are from satellite events, akin to the minor leagues of tennis, we were able to match around 10,600 of 12,000 point-by-point strings to their respective tour-level matches. Then, we could associate information generated from the match dataset's entirety (Elo ratings, year-long adjusted serve stats, etc.) with corresponding point-by-point strings.
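As a rough illustration, such a join key can be built from fields present in both datasets. This is a sketch only; the key format and helper names here are hypothetical, and the project's actual hashing scheme may differ:

```python
# Standardize known name inconsistencies before building keys,
# e.g. "Stan Wawrinka" -> "Stanislas Wawrinka" (examples from the text;
# the project's full alias table is larger).
ALIASES = {
    "Stan Wawrinka": "Stanislas Wawrinka",
    "Federico Del Bonis": "Federico Delbonis",
}

def standardize(name):
    """Map a player name to its canonical spelling."""
    return ALIASES.get(name, name)

def match_key(winner, loser, year, score):
    """Hypothetical join key combining names, year, and score."""
    return f"{standardize(winner)} {standardize(loser)} {year} {score}"

key = match_key("Roger Federer", "Tomas Berdych", 2012, "3-6 7-5 7-5")
# key == "Roger Federer Tomas Berdych 2012 3-6 7-5 7-5"
```

Because both datasets record the same names, year, and score, the same key is produced on each side, allowing point-by-point strings to be joined to match-summary rows.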
To view this process in more depth or access any of the resulting datasets, visit https://github.com/jgollub1/tennis_match_prediction. Implementations of each method in this project³, and instructions on how to generate all relevant statistics, are also provided.

¹ e.g. hash(matchX) = "Roger Federer Tomas Berdych 2012 3-6 7-5 7-5"
² e.g. "Stan Wawrinka" → "Stanislas Wawrinka", "Federico Del Bonis" → "Federico Delbonis"
³ All methods covered in this project required feature construction across the two datasets.

2. Scoring

Tennis' scoring system consists of three levels: sets, games, and points. Consider a tennis match between two players, pi and pj. We can represent any score as (si, sj, gi, gj, xi, xj), where pi is serving and sk, gk, xk represent pk's score in sets, games, and points, respectively.¹ The players alternate serve each game and continue until someone clinches the match by winning two sets (best-of-three) or three sets (best-of-five).²

The majority of in-play tennis models utilize a graphical structure that embodies the levels within tennis' scoring system. Madurska referred to this as a hierarchical Markov model (Madurska, 2012). Barnett formally defines this representation for tennis scores (Barnett et al., 2002). With pi and pj winning points on serve with probabilities fij and fji, each in-match scoreline (si, sj, gi, gj, xi, xj) progresses to one of its two neighbors, (si, sj, gi, gj, xi + 1, xj) or (si, sj, gi, gj, xi, xj + 1), with transition probabilities dependent on the current server. Assuming all points in a match are i.i.d. for each server, we can then use the below model to recursively determine win probability.

Consider the scoreboard below, with Rafael Nadal serving to Juan Martin Del Potro at 7-6 3-6 6-6 2-1 in the 2011 Wimbledon 4th round. Representing the score from the server's point of view, we find:

¹ While tennis officially refers to a game's first three points as 15, 30, 40, we will call them 1, 2, 3 for simplicity's sake.
² The best-of-five format is typically reserved for men's grand slam and Davis Cup events.

Figure 2.1: Nadal serving to Del Potro in a third-set tiebreaker

Pm(1, 1, 6, 6, 2, 1) = P(pi wins the match, serving from this scoreline)
Pm(1, 1, 6, 6, 2, 1) = fij · Pm(1, 1, 6, 6, 3, 1) + (1 − fij) · Pm(1, 1, 6, 6, 2, 2)

In the following sections, we specify boundary values for each level of our hierarchical model.

2.1 Modeling games

Within a game, either pi or pj serves every point. Every game starts at (0, 0), and to win a game, a player must win four or more points by a margin of at least two. Consequently, all games with valid scores (xi, xj) where xi + xj > 6 and |xi − xj| ≤ 1 are reduced to (3, 3), (3, 2), or (2, 3). Furthermore, the win probability at (3, 3) can be calculated directly. From (3, 3), the server wins the next two points with probability fij², the returner wins the next two points with probability (1 − fij)², or the players split the two points and return to (3, 3) with probability 2fij(1 − fij). Relating the game's remainder to a geometric sequence, we find

Pg(3, 3) = fij² / (fij² + (1 − fij)²)

As the probability of winning each point remains constant for both players, this graph assumes that points within a game are independent. In the following levels, we likewise assume that games within a set and sets within a match are independent.

[Diagram: directed graph of possible point-score sequences in a game, from (0, 0) to the boundary states, where "a" denotes pi winning the following point and "b" denotes pj winning the following point.]

Boundary values:

Pg(xi, xj) =
  1, if xi = 4 and xj ≤ 2
  0, if xj = 4 and xi ≤ 2
  fij² / (fij² + (1 − fij)²), if xi = xj = 3
  fij · Pg(xi + 1, xj) + (1 − fij) · Pg(xi, xj + 1), otherwise   (2.1)

With the above specifications, we can efficiently compute pi's win probability from any score Pg(xi, xj).
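As an illustration, the recursion in equation (2.1) can be memoized and evaluated from any score. This is a minimal sketch (function and variable names are mine, not the project's):

```python
from functools import lru_cache

def p_game(p_serve):
    """Server's probability of holding a game from 0-0, given a constant
    probability p_serve of winning each point on serve (points i.i.d.)."""

    @lru_cache(maxsize=None)
    def pg(xi, xj):
        # Boundary values of equation (2.1)
        if xi == 4 and xj <= 2:
            return 1.0
        if xj == 4 and xi <= 2:
            return 0.0
        if xi == 3 and xj == 3:
            # Deuce: closed form from the geometric-series argument
            return p_serve ** 2 / (p_serve ** 2 + (1 - p_serve) ** 2)
        # Otherwise, recurse on the two possible following scores
        return p_serve * pg(xi + 1, xj) + (1 - p_serve) * pg(xi, xj + 1)

    return pg(0, 0)

hold = p_game(0.64)  # hold probability for a server winning 64% of points
```

Note that the hold probability amplifies the point probability: a server who wins well over half of service points holds the game far more often than the raw point percentage suggests.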
2.2 Modeling sets

Within a set, pi and pj alternate serve every game. Every set starts at (0, 0). To win a set, a player must win six or more games by a margin of at least two. If the set score reaches (6, 6), a special tiebreak game is played to determine the outcome of the set.

[Diagram: directed graph of possible game-score sequences in a set, from (0, 0) to the boundary states, where "a" denotes pi winning the following game, "b" denotes pj winning the following game, and a′/b′ denote winning the tiebreak game.]

Boundary values:

Ps(gi, gj) =
  1, if gi ≥ 6 and gi − gj ≥ 2
  0, if gj ≥ 6 and gj − gi ≥ 2
  Ptb(gi, gj), if gi = gj = 6
  Pg(0, 0)(1 − Ps(gj, gi + 1)) + (1 − Pg(0, 0))(1 − Ps(gj + 1, gi)), otherwise   (2.2)

Since serve alternates each game, the recursive branch evaluates the set's remainder from the next server's point of view; this is why each recursive call swaps the game scores and is subtracted from one.

2.3 Modeling a tiebreak game

Within a tiebreak game, one player serves the first point and players alternate serve every two points from then on. The game starts at (0, 0), and a player must win seven or more points by a margin of at least two to win.

Boundary values:

Pt(xi, xj) =
  1, if xi ≥ 7 and xi − xj ≥ 2
  0, if xj ≥ 7 and xj − xi ≥ 2
  fij · Pt(xi + 1, xj) + (1 − fij) · Pt(xi, xj + 1), if (xi + xj) mod 2 = 1
  fij(1 − Pt(xj, xi + 1)) + (1 − fij)(1 − Pt(xj + 1, xi)), if (xi + xj) mod 2 = 0   (2.3)

2.4 Modeling a best-of-three match

In a best-of-three match, the first player to win two sets claims the match. This project primarily applies models to best-of-three matches. In a best-of-five match, the graph would instead have boundary values at si ≥ 3 and sj ≥ 3, since a player must win three sets to finish the match.
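The tiebreak recursion can be sketched directly, tracking the server explicitly rather than through the perspective swap of equation (2.3). This is an illustrative sketch (names are mine); the closed form at 6-6 is my own analogue of the Pg(3, 3) argument, using the fact that from 6-6 onward each consecutive pair of points contains one serve by each player:

```python
from functools import lru_cache

def p_tiebreak(f_ij, f_ji):
    """pi's probability of winning a tiebreak from 0-0, with pi serving
    the first point; f_ij / f_ji are pi's and pj's probabilities of
    winning a point on their own serve (points i.i.d.)."""
    # From 6-6 on, each pair of points has one serve per player, so the
    # score only advances when someone wins both points of a pair
    # (a deuce-style geometric argument, analogous to Pg(3, 3)).
    a = f_ij * (1 - f_ji)   # pi wins both points of a pair
    b = (1 - f_ij) * f_ji   # pj wins both points of a pair
    p_66 = a / (a + b)

    @lru_cache(maxsize=None)
    def pt(xi, xj):
        if xi >= 7 and xi - xj >= 2:
            return 1.0
        if xj >= 7 and xj - xi >= 2:
            return 0.0
        if xi == 6 and xj == 6:
            return p_66
        # pi serves point 1, then serve alternates every two points:
        # pi serves point n+1 exactly when n mod 4 is 0 or 3
        n = xi + xj
        p_point = f_ij if n % 4 in (0, 3) else 1 - f_ji
        return p_point * pt(xi + 1, xj) + (1 - p_point) * pt(xi, xj + 1)

    return pt(0, 0)
```

Under the i.i.d. assumption, equally matched players should split tiebreaks evenly, regardless of who serves first.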
[Diagram: directed graph of possible set-score sequences in a best-of-three match, from (0, 0) to the boundary states, where "a" denotes pi winning the following set and "b" denotes pj winning the following set.]

Boundary values:

Pm(si, sj) =
  1, if si ≥ 2
  0, if sj ≥ 2
  Ps(0, 0) · Pm(si + 1, sj) + (1 − Ps(0, 0)) · Pm(si, sj + 1), otherwise   (2.4)

2.5 Win Probability Equation

Combining the above equations, we can recursively calculate pi's win probability when serving from (si, sj, gi, gj, xi, xj) as:

Pm(si, sj, gi, gj, xi, xj) = fij · Pm(si, sj, gi, gj, xi + 1, xj) + (1 − fij) · Pm(si, sj, gi, gj, xi, xj + 1)

Unless (xi, xj) satisfies a boundary condition, Pm(si, sj, gi, gj, xi, xj) will recursively call itself on the two possible following scores. Since we can compute Pg(3, 3) directly, we can eventually represent each recursive probability in terms of boundary values, which allows us to compute the probability of winning the current game. By traversing the set and match models in similar recursive fashion, we can eventually determine pi's probability of winning the match.

3. Pre-game Match Prediction

3.1 Overview

Before play has started, an in-match prediction model cannot draw on information from the match itself. Thus, before a match between pi and pj commences, the most well-informed pre-match forecast π̂ij(t) should serve as a basis for in-match prediction. Therefore, we first explore pre-match models.

Earlier this year, Kovalchik released a survey of eleven different pre-match prediction models, assessing them side-by-side in accuracy, log-loss, calibration, and discrimination. FiveThirtyEight's Elo-based model and the Bookmaker Consensus Model (BCM) performed the best. Elo-based prediction incorporates pi and pj's entire match histories, while the BCM incorporates information encoded in various betting markets.
In this section, we explore the following methods and variations of the most successful models in her survey:

• Elo Ratings
• ATP Rank
• Point-based Models
• James-Stein Estimator
• Opponent-Adjusted Serve/Return Statistics

3.2 Elo Ratings

Elo was originally developed as a head-to-head rating system for chess players (Elo, 1978). Recently, FiveThirtyEight's Elo variant has gained prominence in the media (Bialik et al., 2016). For a match at time t between pi and pj with Elo ratings Ei(t) and Ej(t), pi is forecasted to win with probability:

π̂ij(t) = (1 + 10^((Ej(t) − Ei(t))/400))^(−1)

pi's rating for the following match t + 1 is then updated accordingly:

Ei(t + 1) = Ei(t) + Kit · (Wi(t) − π̂ij(t))

Wi(t) is an indicator for whether pi won the given match, while Kit is the learning rate for pi at time t. According to FiveThirtyEight's analysts, Elo ratings perform optimally when allowing Kit to decay slowly over time (Bialik et al., 2016). With mi(t) representing pi's career matches played at time t, we update our learning rate:

Kit = 250 / (5 + mi(t))^0.4

This variant updates a player's Elo most quickly when we have no information about a player and makes smaller changes as mi(t) accumulates. To apply this Elo rating method to our dataset, we initialize each player's Elo rating at Ei(0) = 1500 and match history mi(0) = 0. Then, we iterate through all tour-level matches from 1968-2017¹ in chronological order, storing Ei(t) and Ej(t) for each match and updating each player's Elo accordingly.

3.3 ATP Rank

While Klaassen incorporated ATP rank into his prediction model (Klaassen and Magnus, 2003), Kovalchik and FiveThirtyEight concur that Elo outperforms ranking-based methods (Kovalchik, 2016). On ATP match data from 2010-present, we found:

¹ Tennis' Open Era began in 1968, when professionals were allowed to enter grand slam tournaments.
Before then, only amateurs played these events.

method         accuracy
ATP Rank       66.5%
Standard Elo   69.0%

With each rating system, we forecast the higher-rated player as the winner. Accuracy represents the proportion of correctly forecasted matches. As the ATP does not provide a win-probability equation to go along with its ranking system, we cannot compare the approaches in terms of log loss or calibration. Still, considering Elo's superior accuracy in recent years, models in this paper use Elo ratings to represent a player's ability in place of official tour rank.

3.4 Point-based Model

The hierarchical Markov model offers an analytical solution to win probability π̂ij(t) between players pi and pj, given serving probabilities fij and fji. This hinges on the Markov assumption that transition probabilities from any state depend only on the current state. In tennis, this means all points are independent and the probability of winning a given point depends only on the current server. While this is counter-intuitive, Klaassen and Magnus demonstrated that deviations from the i.i.d. assumption within matches are small enough to reasonably justify point-based models (Klaassen and Magnus, 2001). Later on, Barnett and Clarke demonstrated a method to estimate fij and fji from players' historical serve/return averages (Barnett and Clarke, 2005):

fij = ft + (fi − fav) − (gj − gav)
fji = ft + (fj − fav) − (gi − gav)

Each player's serve percentage is a function of their own serving ability and their opponent's returning ability. ft denotes the average serve percentage for the match's given tournament, while fi, fj and gi, gj represent pi and pj's percentages of points won on serve and return, respectively. fav and gav are tour-level averages for serve and return percentage. Since every point is won by either the server or the returner, fav = 1 − gav.
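As a short sketch (function and variable names are mine), the combining formula can be written directly, here evaluated with the Anderson-Verdasco figures worked through later in this section:

```python
def combined_serve_probs(f_t, f_i, g_i, f_j, g_j, f_av, g_av):
    """Barnett-Clarke combined serve-win probabilities for pi and pj.

    f_t        : average serve percentage at the match's tournament
    f_i, g_i   : pi's serve/return win percentages (f_j, g_j for pj)
    f_av, g_av : tour-level serve/return averages (f_av = 1 - g_av)
    """
    f_ij = f_t + (f_i - f_av) - (g_j - g_av)
    f_ji = f_t + (f_j - f_av) - (g_i - g_av)
    return f_ij, f_ji

# Anderson vs. Verdasco, 2013 Australian Open (rounded inputs from the
# example in Section 3.4); results land close to .6221 and .6199
f_ij, f_ji = combined_serve_probs(
    f_t=.6153, f_i=.6799, g_i=.3478, f_j=.6461, g_j=.3795,
    f_av=.6468, g_av=.3532)
```

Because the corrections are additive, a strong returner (large gj) lowers the opposing server's fij by exactly the amount gj exceeds the tour average.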
By combining differences in ability relative to the average, this formula assumes that their effects are additive. As per Barnett and Clarke's formula, we use the previous year's tournament serving statistics to calculate ft for a given tournament and year, where (w, y) represents the set of all matches played at tournament w in year y:

ft(w, y) = (Σk∈(w,y−1) # of points won on serve in match k) / (Σk∈(w,y−1) # of points played in match k)

In their paper, Barnett and Clarke only apply this method to a single match: Roddick vs. El Aynaoui, Australian Open 2003. Furthermore, their ability to calculate serve and return percentages is limited by the aggregate statistics supplied by atpworldtour.com. That is, they can only use year-to-date serve and return statistics to calculate fi, gi, fj, gj. Since these statistics do not list corresponding sample sizes, they must assume that each best-of-three match lasts 165 points, which adds another layer of uncertainty in estimating players' abilities.

Implementing this method with year-to-date statistics proves troublesome because fi and gi decrease in uncertainty as pi accumulates matches throughout the year. Due to this imbalance in available data, match forecasts in September will be far more reliable than ones made in January. However, with our tour-level match dataset, we can keep a year-long tally of serve/return statistics for each player at any point in time.
Where (pi, y, m) represents the set of pi's matches in year y, month m, we obtain the following statistics:²

fi(y, m) = (Σt=1..12 Σk∈(i,y−1,m+t) # of points won on serve by pi in match k) / (Σt=1..12 Σk∈(i,y−1,m+t) # of points played on serve by pi in match k)

gi(y, m) = (Σt=1..12 Σk∈(i,y−1,m+t) # of points won on return by pi in match k) / (Σt=1..12 Σk∈(i,y−1,m+t) # of points played on return by pi in match k)

Keeping consistent with this format, we also calculate fav and gav, where (y, m) represents the set of tour-level matches played in year y, month m:

fav(y, m) = (Σt=1..12 Σk∈(y−1,m+t) # of points won on serve in match k) / (Σt=1..12 Σk∈(y−1,m+t) # of points played in match k) = 1 − gav(y, m)

² For the current month m, we only collect month-to-date matches.

Now, the variance of fi and gi no longer depends on the time of year. Since the number of points won on serve is recorded in each match, we also know each player's number of serve/return points played. Below, we combine player statistics over the past 12 months to produce fij, fji for the 3rd-round match between Kevin Anderson (RSA) and Fernando Verdasco (ESP) at the 2013 Australian Open.

player name         s points won   s points   fi      r points won   r points   gi
Kevin Anderson      3292           4842       .6799   1726           4962       .3478
Fernando Verdasco   2572           3981       .6461   1560           4111       .3795

From 2012 Australian Open statistics, ft = .6153. From tour-level data spanning 2010-2017, fav = .6468 and gav = 1 − fav = .3532. Using the above serve/return statistics from 02/12-01/13, we can calculate:

fij = ft + (fi − fav) − (gj − gav) = .6153 + (.6799 − .6468) − (.3795 − .3532) = .6221
fji = ft + (fj − fav) − (gi − gav) = .6153 + (.6461 − .6468) − (.3478 − .3532) = .6199

With the above serving percentages, Kevin Anderson is favored to win the best-of-five match with probability Pm(0, 0, 0, 0, 0, 0) = .5139.

3.5 James-Stein Estimator

Decades ago, Efron and Morris described a method to estimate groups of sample means (Efron and Morris, 1977).
The James-Stein estimator shrinks sample means toward the overall mean, with shrinkage proportional to each estimator's variance. Regardless of the value of the true parameter θ, this method produces results superior in expectation to maximum-likelihood estimation.

To estimate serve/return parameters for players who do not regularly play tour-level events, fi and gi must be calculated from limited sample sizes. Consequently, match probabilities based on these estimates may be skewed by noise. The James-Stein estimator offers a more reasonable estimate of serve and return ability for players with limited match history. To shrink serving percentages, we compute the variance of all recorded fi statistics in our match dataset Dm:³

τ̂² = (Σfi∈Dm (fi − fav)²) / (|Dm| − 1)

³ Each fi is computed from the previous twelve months of player data.

Then, each estimator fi is based on ni service points. With each estimator fi representing the fraction of ni service points won, we can compute estimator fi's variance and a corresponding normalization coefficient:

σ̂i² = fi(1 − fi) / ni

Bi = σ̂i² / (τ̂² + σ̂i²)

Finally, the James-Stein estimator takes the form:

JS(fi) = fi + Bi(fav − fi)

We repeat the same process with gi to obtain James-Stein estimators for return statistics. To see how shrinkage makes our model robust to small sample sizes, consider the following example. When Daniel Elahi (COL) and Ivo Karlovic (CRO) faced off at ATP Bogota 2015, Elahi had played only one tour-level match in the past year. From a previous one-sided victory, his year-long serve percentage, fi = 51/64 = .7969, was abnormally high compared to the year-long tour-level average of fav = .6423.
player name    s points won   s points   fi      r points won   r points   gi      Elo rating
Daniel Elahi   51             64         .7969   22             67         .3284   1516.92
Ivo Karlovic   3516           4654       .7555   1409           4903       .2874   1876.95

With ft = .6676:

fij = ft + (fi − fav) − (gj − gav) = .6676 + (.7969 − .6423) − (.2874 − .3577) = .8925
fji = ft + (fj − fav) − (gi − gav) = .6676 + (.7555 − .6423) − (.3284 − .3577) = .8101

Following Klaassen and Magnus' method of combining player outcomes, we compute Elahi's service percentage to be 89.3%. This is extremely high, eclipsing Karlovic's 81.01% serve projection. This is strange, given that Karlovic is one of the most effective servers in the history of the game. From these serving stats, our hierarchical Markov model computes Elahi's win probability as Pm(0, 0, 0, 0, 0, 0) = .8095. This forecast seems unreasonably confident of Elahi's victory, given that we have only collected his player statistics for one match. By contrast, Karlovic's 360-point Elo advantage implies a win probability for Elahi of

π̂ij(t) = (1 + 10^((1876.95 − 1516.92)/400))^(−1) = .1459

which leads us to further question the validity of this approach when using limited historical data. Thus, we turn to the James-Stein estimator to shrink Elahi's serving and return probabilities toward the overall means fav, gav:

JS(fi) = fi + Bi(fav − fi) = .7969 + .7117(.6423 − .7969) = .6869
JS(gi) = gi + Bi(gav − gi) = .3284 + .7624(.3577 − .3284) = .3507
JS(fj) = fj + Bj(fav − fj) = .7555 + .0328(.6423 − .7555) = .7518
JS(gj) = gj + Bj(gav − gj) = .2874 + .0420(.3577 − .2874) = .2904

JS(fij) = ft + (JS(fi) − fav) − (JS(gj) − gav) = .6676 + (.6869 − .6423) − (.2904 − .3577) = .7795
JS(fji) = ft + (JS(fj) − fav) − (JS(gi) − gav) = .6676 + (.7518 − .6423) − (.3507 − .3577) = .7841

Above, we can see that the James-Stein estimator shrinks Elahi's stats far more than Karlovic's, since Karlovic has played many tour-level matches in the past year. Given JS(fij) and JS(fji), we compute Pm(0, 0, 0, 0, 0, 0) = .4806.
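The shrinkage computation above can be sketched as follows (names are mine, and τ̂² is passed in as a pre-computed population variance; the value used here is purely illustrative):

```python
def james_stein(f, n, f_av, tau_sq):
    """Shrink a serve/return percentage f, estimated from n points,
    toward the tour average f_av; tau_sq is the (pre-computed) variance
    of the statistic across all players."""
    sigma_sq = f * (1 - f) / n            # variance of this estimator
    B = sigma_sq / (tau_sq + sigma_sq)    # normalization coefficient
    return f + B * (f_av - f)

# Illustrative tau_sq; with only 64 points, sigma_sq dominates and the
# estimate shrinks most of the way toward f_av, while a 4,654-point
# sample barely moves.
small_sample = james_stein(0.7969, 64, 0.6423, 0.002)
big_sample = james_stein(0.7555, 4654, 0.6423, 0.002)
```

The coefficient Bi tends to 1 as ni shrinks (full shrinkage to the mean) and to 0 as ni grows (the raw percentage stands).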
By shrinking the serve/return statistics, our model lowers Elahi's inflated serve percentage and becomes more robust to small sample sizes. As point-based forecasts on limited data threaten model performance, especially with respect to cross-entropy, the James-Stein estimator offers a safer way to predict match outcomes. Later on, we will use the James-Stein estimator to normalize not only year-long serve/return statistics, but also surface-specific and opponent-adjusted percentages.

[Figure: normalization coefficient Bi as a function of n, for a player who has averaged fi = .64 over n points in the past twelve months.]

3.6 Opponent-Adjusted Serve/Return Statistics

While Barnett and Clarke's equation does consider an opponent's serve and return ability, it does not track the average ability of opponents within a player's history. This is important, as a player's serve/return percentages may become inflated from playing weaker opponents, or vice versa. In this section, we propose a variation of Barnett and Clarke's equation which, for pi, replaces fav, gav with opponent-adjusted averages 1 − gi_opp_av, 1 − fi_opp_av. The equations then become:

fij = ft + (fi − (1 − gi_opp_av)) − (gj − (1 − fj_opp_av))
fji = ft + (fj − (1 − gj_opp_av)) − (gi − (1 − fi_opp_av))

gi_opp_av represents the average return ability of the opponents pi has faced in the last twelve months. To calculate it, we weight each opponent's return ability gj by the number of points in their respective match:

gi_opp_av = (Σt=1..12 Σk∈(i,y−1,m+t) (# of pi's return points in match k) · (pj's return ability before match k)) / (Σt=1..12 Σk∈(i,y−1,m+t) # of pi's return points in match k)

To illustrate the effect of tracking opponent-adjusted statistics, we consider the 2014 US Open first-round match between Mikhail Youzhny (RUS) and Nick Kyrgios (AUS).
player name       s points won   s points   fi      r points won   r points   gi      Elo rating
Mikhail Youzhny   1828           2960       .6176   1145           2947       .3885   1749.77
Nick Kyrgios      900            1370       .6569   424            1323       .3205   1645.03

From the standard approach, with ft = .6583, we compute fij = .6446, fji = .6159. Then, to calculate adjusted statistics, we consider the expected number of points won on serve/return, given Youzhny and Kyrgios' past opponents. As we can observe from fi_opp_av and gi_opp_av, Kyrgios has faced slightly stronger opponents than Youzhny over the past twelve months.

player name       gi_opp_av   s expected   s adj    fi_opp_av   r expected   r adj
Mikhail Youzhny   .4156       1729.74      .0332    .6812       939.60       .0697
Nick Kyrgios      .4313       779.19       .0882    .7060       388.92       .0265

s_adj_i, r_adj_i, and the adjusted serve probabilities adj(fij), adj(fji) are calculated as follows:

s_adj_i = (# points won on serve / # points on serve) − (1 − gi_opp_av) = fi − (1 − gi_opp_av)
r_adj_i = (# points won on return / # points on return) − (1 − fi_opp_av) = gi − (1 − fi_opp_av)

adj(fij) = ft + (fi − (1 − gi_opp_av)) − (gj − (1 − fj_opp_av)) = ft + s_adj_i − r_adj_j = .6288
adj(fji) = ft + (fj − (1 − gj_opp_av)) − (gi − (1 − fi_opp_av)) = ft + s_adj_j − r_adj_i = .6406

Using regular serve probabilities, Youzhny is favored to win with Pm(0, 0, 0, 0, 0, 0) = .6410. With adjusted serving percentages, we factor in the stronger opponents in Kyrgios' match history, and Youzhny's win probability drops to Pm(0, 0, 0, 0, 0, 0) = .4411. While adjusted serve/return statistics do not always cause significant shifts in win probability, they consistently account for the relative ability of players' opponents in producing serve probabilities fij, fji.

3.7 Results

The following results were obtained using ATP best-of-three matches in our point-by-point dataset, excluding Davis Cup.⁴ We observe performance across variants of Elo-based predictions and point-based models.
Since all original implementations provide explicit formulas with no optimization, we directly assess their performance on 2014 tour-level data (2,409 matches). In the case of logistic regression, the model was trained on tour-level data from 2011-2013 (7,828 matches). We measure each model's performance in terms of accuracy, as in section 3.3. We also consider log loss, a measure of the overall likelihood of a model's predictions. Log loss is important because it penalizes models that assign over-confident or under-confident win probabilities. As we are hoping to optimize in-match prediction models' log loss, it is important to consider the likelihood of various pre-match forecasts.

4The Davis Cup is a year-long competition between teams organized by country. Since Kovalchik excluded Davis Cup matches from her evaluation (Kovalchik, 2016) and many match-ups involve relatively unknown players, we exclude all Davis Cup matches.

The following terms express variations of the point-based and Elo models:

KM - point-based hierarchical Markov model with combined serve/return percentages from Barnett/Clarke
James-Stein - version of a KM model with all serve/return percentages normalized with the James-Stein estimator
surface - version of a model where all ratings and percentages are stratified by surface (Hard, Clay, Grass)
538 - denotes FiveThirtyEight's "decaying k-factor" method of computing Elo ratings

Table 3.1: Results Table

Method                           Variation              Accuracy   Log Loss
KM                                                      64.8       .649
KM                               James-Stein            65.4       .616
KM                               surface                63.3       .707
KM                               surface James-Stein    63.6       .639
KM                               adjusted               67.8       .632
KM                               adjusted James-Stein   67.9       .617
Elo                                                     69.1       .586
Elo                              surface                68.4       .591
Elo                              538                    69.2       .587
Elo                              surface 538            69.4       .592
Logit Elo 538, surface Elo 538                          69.5       .577

3.7.1 Discussion

As expected, James-Stein normalization significantly improves each point-based model's log loss.
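Log loss, the metric reported in Table 3.1, can be computed as in the following minimal sketch (a standard-library implementation with illustrative names, not the thesis's code):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary match outcomes.

    y_true: list of 0/1 outcomes (1 = pi won); y_pred: predicted win
    probabilities for pi. Predictions are clipped away from 0 and 1 so
    the logarithm is always defined.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

An over-confident wrong forecast (e.g. p = .99 for a match that is lost) contributes −log(.01) ≈ 4.6 to the sum, which is why log loss punishes poorly calibrated models far more than accuracy does.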
While surface-based Elo is competitive with regular Elo, restricting Klaassen and Magnus' point-based method to surface-specific data clearly hurts performance, presumably due to limited surface-specific data. As many players perform differently across surfaces, computing serve percentages from surface-specific data offers the potential to express how players' performance varies by surface. Yet, even with James-Stein normalization, the loss in sample size appears to outweigh the benefits obtained from surface-specific data. As we did not rigorously explore possibilities for weighting players' surface-specific statistics, there may still be potential for a point-based model that effectively considers surface in computing serve/return percentages. Finally, using opponent-adjusted serve/return statistics further improved performance. By fully incorporating each opponent's ability in a player's match history, this model approached Elo ratings in the similarity of its predictions, with win-probability forecasts between the two sharing R2 = .76. By plugging Elo and surface Elo into a logistic regression model, we achieve a log loss of .577. After fitting to the training data, this model learned coefficients that assigned a 56% weight to regular Elo ratings and a 44% weight to surface Elo ratings. The near-even weighting of both types of ratings suggests how important surface-specific ability is in forecasting matches. Aside from models which draw information directly from the betting market, no other model has documented a preferable log loss. Kovalchik reported 70% accuracy using 538's Elo method (Kovalchik, 2016), calculating Elo ratings using satellite events (ATP Challengers and Futures) in addition to tour-level events.5 While this accounted for a small increase in accuracy, her method achieved a log loss of .59, which does not outperform our implementation.
For simplicity's sake, we calculated all Elo ratings and statistics from tour-level matches alone.6

5One can observe her implementation at https://github.com/skoval/deuce.
6There are about 3,000 tour-level matches each year.

In the end, our Elo-based logistic regression's incremental improvement over singular Elo ratings implies that an effective rating system should consider overall ability as well as surface-specific ability. Sipko recently explored machine-learning methods for pre-match prediction, surveying logistic regression, the common-opponent model, and an artificial neural net (Sipko and Knottenbelt, 2015). While he claimed to have achieved a 4.3% return-on-investment off the betting market with an artificial neural net, his machine-learning models did not beat a log loss of .61 when predicting 2013-2014 ATP matches.7 Despite the current deep-learning craze throughout industry and academia (Lewis-Kraus, 2016), no one has published findings to suggest its superiority in predicting tennis matches. Short of market-based models, it appears most effective to stick with Elo's "ruthlessly Bayesian design," which FiveThirtyEight so frequently touts (Bialik et al., 2016). Transitioning onward, we will consider how these findings may inform an effective in-match prediction model.

7The models were trained on 2004-2010 ATP matches and validated on 2011-2012 ATP matches.

4. In-game Match Prediction

The following methods will be tested on tour-level matches for which we have point-by-point data. The matches span 2010-2017, accounting for nearly half of all tour-level matches within this time. Point-by-point records in Sackmann's dataset take the form of the following string: Mikhail Youzhny vs.
Evgeny Donskoy (Australian Open 2013, Round 2)

P = ‘‘SSRSS;RRRR;SRSSS;SRRSRSSS;SRSSRS;RSRSSS;SRSRSS;RSRSRSSS;SSSS.SSSRRRSS;RSSSS;SSRSS;SSSRS;SSSS;RRRSSSSRRSSRRSRSSS;SRSRSS;SSSRS;RSRSSRSS;SSSS;SRSSS;RSRSSRRSSS;R/SR/SS/RR/RS/SR.RSRRR; ...’’

S denotes a point won by the server and R a point won by the returner. Individual games are separated by ";", sets by ".", and service changes in tiebreaks by "/". By iterating through the string, one can construct n data points {P0, P1, ..., Pn−1} from a match with n total points, with Pi representing the subset of the match after which i points have been played.

P0 = ""
P1 = "S"
P2 = "SS"
P3 = "SSR"
...

With M = {M1, M2, ..., Mk} representing the complete match-strings in our point-by-point dataset, the size of our enumerated dataset becomes Σ_{i=1}^{k} |Mi|. In total, this sums to over 1.2 million individual points within ATP matches spanning 2010-2017. While many in-match prediction models utilize the hierarchical Markov model structure, we will first test several machine-learning methods as a baseline. Now, we will explore the following methods:

• Logistic Regression
• Random Forests
• Hierarchical Markov Model
• Beta Experiments
• Elo-induced Serve Percentages

4.1 Logistic Regression

Consider a logistic regression model where each observation xi = [xi1, xi2, ..., xik] has k features. When P represents pi's win probability,

log(P/(1 − P)) = β0 + β1(xi1) + β2(xi2) + ... + βk(xik)

From any scoreline (si, sj, gi, gj, xi, xj), we can simply feed these values into our model as features of xi. Logistic regression's structure makes it easy to consider additional features for each player, such as Elo difference, surface Elo difference, or break advantage. Before adding more features to the model, we consider two baselines: a model using score differentials (si − sj, gi − gj, xi − xj) and another model trained on Elo differences and the lead heuristic Lij.
This heuristic estimates pi's total lead in sets, games, and points:

Lij = (si − sj) + (1/6)(gi − gj) + (1/24)(xi − xj)

The coefficients preserve order between sets, games, and points, as one cannot lead by six games without winning a set, or by four points without winning a game. While it assumes these relative orderings, the heuristic is meant to approximate how much pi leads pj from any scoreline. In this model, we consider the following features in predicting the winner of the match:

Table 4.1: Logistic Regression Features

Variable          Description
lead margin       lead heuristic Lij
eloDiff           Elo(p0) − Elo(p1)
s eloDiff         s Elo(p0) − s Elo(p1)
setDiff           SetsWon(p0) − SetsWon(p1)
gameDiff          inSetGamesWon(p0) − inSetGamesWon(p1)
pointDiff         inGamePointsWon(p0) − inGamePointsWon(p1)
breakAdv          ServeGamesWon(p0) − ServeGamesWon(p1) + I(currently serving)
brkPointAdv       I(holding break point) − I(facing break point)
sv points pct 0   percentage of points won on p0's serve thus far
sv points pct 1   percentage of points won on p1's serve thus far

Next, we test the following combinations of features:
1) setDiff + gameDiff + pointDiff (Differentials)
2) lead margin + eloDiff + s eloDiff (Elo)
3) all features (All)

4.1.1 Cross Validation

Each match in our best-of-three dataset has around 160 points on average. In tuning hyperparameters for machine-learning models, we implement k-fold group validation with k = 5. This randomly splits the training set into five subsets of roughly equal size, while always grouping points from the same match in the same fold. By grouping points from the same match together, this method prevents overlap across train, validation, and test sets. Treating each subset as a validation set, we train all models on the remaining data before testing performance on the corresponding validation set. Then we average performance across all k validation sets and select the hyperparameters whose models performed best.
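The group-wise splitting described above can be sketched as follows (a minimal standard-library version; function and variable names are illustrative, and scikit-learn's GroupKFold provides the same behavior):

```python
import random
from collections import defaultdict

def group_kfold(match_ids, k=5, seed=0):
    """Split point indices into k folds, keeping every point from the
    same match in the same fold.

    match_ids: list where match_ids[i] names the match containing point i.
    Returns k folds, each a list of point indices. Round-robin assignment
    over shuffled matches gives roughly equal folds when match lengths
    are similar, as they are in our best-of-three dataset.
    """
    by_match = defaultdict(list)
    for idx, match in enumerate(match_ids):
        by_match[match].append(idx)
    matches = list(by_match)
    random.Random(seed).shuffle(matches)
    folds = [[] for _ in range(k)]
    for j, match in enumerate(matches):
        folds[j % k].extend(by_match[match])  # whole matches, never split
    return folds
```

Each fold then serves once as the validation set while models train on the remaining folds; because matches never straddle folds, no points from a validation match leak into training.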
When we assess model performance on validation and test sets, results then reflect performance on matches a model has never seen before.

4.2 Random Forests

Brian Burke's win-probability models are among the most well-known in sports (Burke, 2014). They calculate a team's win probability at any point in a game from historical data, through a combination of binning inputs and smoothing the resulting probabilities. Lock and Nettleton built on this method by using a Random Forest approach (Lock and Nettleton, 2014). A Random Forest consists of an ensemble of randomly generated classification trees. Each tree forms decision functions for a subset of features, with splits chosen for maximum discriminatory ability. As Lock and Nettleton do so with football-specific features such as down, points, and yards to goal, we do so with analogous tennis features and train on our validation set.

Table 4.2: Random Forest Features

Variable      Description
surface       hard, clay, grass
set           first, second, third
eloDiff       Elo(p0) − Elo(p1)
setDiff       SetsWon(p0) − SetsWon(p1)
gameDiff      inSetGamesWon(p0) − inSetGamesWon(p1)
pointDiff     inGamePointsWon(p0) − inGamePointsWon(p1)
breakAdv      ServeGamesWon(p0) − ServeGamesWon(p1) + I(currently serving)
brkPointAdv   I(holding break point) − I(facing break point)

4.3 Hierarchical Markov Model

With serving percentages already calculated from historical data, our hierarchical Markov model is well-equipped to produce in-match win probability estimates. Using the analytical equation with players' serving abilities fij, fji, we compute Pm(si, sj, gi, gj, xi, xj) from every scoreline (si, sj, gi, gj, xi, xj) in a match. To assess this model's performance, we repeat this on every match in our dataset, with fij, fji computed by one's method of choice.

4.3.1 Beta Experiments

The above approach only takes into account the current score and pre-calculated serve percentages when computing win probability.
However, in many cases, relevant information may be collected from Pk. Consider the following in-match substring:

P = "SSSS;RSSSRRSS;SSSS;SRRSRSRSSS;SSSS;RRRSSSRSRSSS;"

The above sequence demonstrates a current scoreline of three games all. However, pi has won 12/12 service points, while pj has won 18/30 service points. If both players continue serving at similar rates, pi is much more likely to break serve and win the match. Since the original forecasts fij, fji are based on historical serving percentages, it makes sense that in-match serving percentages may help us better determine each player's serving ability on a given day. To do this, we can update fij, fji at time t of the match to factor in each player's serving performance thus far. Bevc explored this method by modeling each player's serving percentage fij with a beta prior b_prior that could be updated mid-match to yield posterior estimates (Bevc, 2015). Through beta-binomial conjugacy, we can update b_prior when a player has won swon points on serve out of stotal trials:

b_prior ∼ Beta(a, b)
b_posterior ∼ Beta(a + swon, b + stotal − swon)

If fij is our prior, a, b must satisfy E(b_prior) = a/(a + b) = fij. Then a is a hyperparameter that determines the strength of our prior. To find a suitable prior strength, we will test various values of a via cross-validation on our training set. Regardless of a, the match's influence on our posterior serve estimates always grows as more points are played. At any point in the match, we can obtain a posterior serving percentage for either player of the form

f′ij = E(b_posterior) = (a + swon) / (a + b + stotal)

4.3.2 Elo-induced Serve Probabilities

Earlier on, Klaassen and Magnus suggested a method to infer serving probabilities from a pre-match win forecast πij. By imposing the constraint fij + fji = t, we can create a one-to-one function S : (πij, t) → (f̂ij, f̂ji), which generates serving probabilities f̂ij, f̂ji for both players such that Pm(0, 0, 0, 0, 0, 0) = πij.1
As their paper was published in 2002, Klaassen and Magnus inverted their match probability equations to produce serve probabilities for ATP rank-based forecasts. However, since Elo outperforms ATP rank, we apply this method to Elo forecasts. Due to branching complexity (see section 2.5), our hierarchical Markov model's match probability equation has no analytical inverse, even when we specify fij + fji = t. Therefore, we turn to the following approximation algorithm to generate serving percentages that correspond to a win probability within ε of our Elo forecast's:

Algorithm 1 elo-induced serve probabilities
procedure EloInducedServe(prob, sum, ε)
    s0 ← sum/2
    currentProb ← .5
    diff ← sum/4
    while |currentProb − prob| > ε do:
        if currentProb < prob then
            s0 += diff
        else
            s0 −= diff
        diff ← diff/2
        currentProb ← matchProb(s0, sum − s0)
    return s0, sum − s0

To generate elo-induced serve probabilities for a given match, we run the above algorithm with prob = πij, sum = fij + fji, and ε set to a desired precision level.2 At each step, we call matchProb() to compute the win probability from the start of the match if pi and pj had serve probabilities fij = s0, fji = sum − s0, respectively. Then we compare currentProb to prob and increment s0 by diff, which halves at every iteration. This process continues until the serve probabilities s0, sum − s0 produce a win probability within ε of prob, taking O(log(1/ε)) calls to matchProb. This inverse algorithm is useful for several reasons. Given any effective pre-match forecast πij, we can produce serve probabilities that are consistent with πij according to our hierarchical Markov model. By setting the constraint fij + fji = t, we also ensure that the sum of our players' serve probabilities agrees with historical data.3

1fij and fji are computed as specified in 3.4 with James-Stein normalization, so as to prevent extreme results.
2For purposes of this project, setting ε = .001 is sufficiently accurate.
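Algorithm 1 can be sketched in Python as follows. The full hierarchical Markov matchProb is out of scope here, so a stand-in monotone forecast is injected as a parameter; all names are illustrative, and the stand-in is not the thesis's model:

```python
import math

def elo_induced_serve(prob, total, eps, match_prob):
    """Bisection search for serve probabilities (s0, total - s0) whose
    implied match-win probability lands within eps of the target prob.

    match_prob(fi, fj) must increase in fi and decrease in fj, as the
    hierarchical Markov model's match probability does.
    """
    s0 = total / 2
    current = 0.5
    diff = total / 4
    while abs(current - prob) > eps:
        if current < prob:
            s0 += diff  # implied forecast too low: strengthen pi's serve
        else:
            s0 -= diff  # implied forecast too high: weaken it
        diff /= 2       # halve the step, as in the algorithm above
        current = match_prob(s0, total - s0)
    return s0, total - s0

def toy_match_prob(fi, fj):
    """Monotone stand-in for matchProb (NOT the thesis's model):
    a logistic curve in the serve-probability gap."""
    return 1 / (1 + math.exp(-12 * (fi - fj)))
```

For example, `elo_induced_serve(0.70, 1.29, 0.001, toy_match_prob)` returns a pair summing to t = 1.29 whose implied win probability is within .001 of .70, after O(log(1/ε)) evaluations.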
3While it is just an estimate, t = fij + fji is our best predictor of overall serving ability among the methods covered in this project.

While Klaassen and Magnus argue that t = fij + fji is largely independent of πij, t holds greater importance when predicting specific scorelines of a match (Barnett et al., 2006). Using the hierarchical Markov model's equations, Barnett specifically computed the probability of reaching any set score given fij, fji. When comparing probabilities across matches, one can observe that as t increases, closer set scores and tiebreak games become more likely.4 That is, t encodes information regarding the likely trajectories of a match scoreline and the relative importance of winning each game on serve. For higher t, service games are won more often and service breaks hold relatively more importance, while the opposite holds for lower t. Now, given πij and t, we can produce elo-induced serve probabilities for any two players.

4E.g. 7-6, 6-7, 7-5, 5-7.

4.3.3 Importance of fij + fji

To illustrate the importance of t = fij + fji, we consider the following two matches:

match   pi                pj            tny name             surface   πij    t
m1      Feliciano Lopez   John Isner    ATP Shanghai 2014    hard      .478   1.55
m2      Andreas Seppi     Juan Monaco   ATP Kitzbühel 2014   clay      .476   1.14

In both matches, πij is approximately the same, while t differs significantly. To visualize how service breaks can have varying importance, consider win-probability graphs corresponding to the following progression of points, P. In each scenario, pi and pj's serve probabilities are calculated with the method of 4.3.2.

P = "SSSS;SSSS;SSSS;SSSS;SSSS;RRRR"

Over the first five games, both players win every point on serve. When t = 1.55, service holds are expected and win probability hardly moves. After pi breaks serve in the sixth game, his win probability suddenly spikes up, indicating a substantial advantage. When t = 1.14, our model assumes that players win service games with much lower probability.
Therefore, each time a player holds serve, his win probability noticeably increases. When pi breaks in the sixth game, his win probability climbs, but not as dramatically as in the previous case. When service games are won with lower probability, service-break advantages become weaker. t's influence on the trajectory of a match suggests that prediction from any in-match scoreline will be most effective when t accurately reflects the players' overall serving ability. While Elo ratings or betting odds yield the most effective pre-match predictions (Kovalchik, 2016), their implied odds offer no information about t in a given match-up. As t holds significance for in-match forecasting from specific scorelines, method 4.3.2 allows us to combine any effective pre-match forecast πij with t to produce new serving probabilities.

4.4 Results

The following models were trained on 2011-2013 point-by-point data (4,584 matches) and tested on 2014 point-by-point data (1,855 matches). For consistency, we only used best-of-three ATP tour-level matches and excluded matches from the Davis Cup. From cross-validation on the training set, we determined that setting hyperparameter a = 300 yielded optimal performance across beta experiments. We then added a beta variation to KM elo-induced and KM logit-induced, our two best-performing point-based models. As each observation corresponds to a single point in a match, longer matches contribute more to our dataset and have a larger influence on evaluation metrics. Since addressing this outsized influence could complicate the results' interpretability, we present model evaluation on a point-by-point basis.
LR - Logistic Regression model with the feature set specified in 4.1
RF - Random Forest model with the feature set specified in 4.2
Equivalent - KM model with serve probabilities set to tour-level averages (assumes equal player ability)
elo-induced - KM model with serve probabilities generated from the Elo-based win probability
logit-elo - KM model with serve probabilities generated from the logistic regression on Elo and surface Elo

Table 4.3: Results Table

Method   Variation              Accuracy   Log Loss   Calibration
LR       Differentials          71.4       .540       .993
LR       Elo                    75.2       .505       1.006
LR       All                    76.2       .488       .993
RF                              70.1       .562       .902
KM       Equivalent             71.5       .538       1.040
KM       James-Stein            74.9       .501       1.002
KM       adjusted James-Stein   75.9       .498       .992
KM       elo-induced            76.2       .486       1.001
KM       elo-induced (a=300)    76.3       .481       1.003
KM       logit-elo              76.5       .481       .996
KM       logit-elo (a=300)      76.5       .477       .999

Of our models, Random Forest had the least predictive power. While this is perhaps due to the difficulty of constructing an effective feature space, nearly all features fed into each machine-learning model had an additive effect on win probability: positive score differentials, break advantages, and Elo differences all boost a player's expected win probability. The only non-additive features considered were "set" and "surface." While the Random Forest model was able to consider various situations among these two features, we conclude that its inclination to group similar examples together, rather than estimate linear relationships between variables, ultimately accounted for its performance. Logistic Regression, on the other hand, is designed to learn additive relationships between features and its probability output. Its performance with the full feature set was competitive with our point-based models, despite their superior grasp of tennis' scoring system.5 Presumably, this model holds an advantage over point-based models in its ability to fit itself to the training data.
Aside from our hyperparameter search over a in the beta experiments, the point-based models offer no comparable method to optimize performance with respect to a training set.

5Consider, for example, how logistic regression has no knowledge of when tiebreak games occur.

Of the point-based models, KM logit-elo (a=300) achieved the best performance, although the improvement from running beta experiments was marginal. Since Elo ratings produce more reliable pre-match win forecasts πij than estimates of fij, fji from historical serve/return data, using serve probabilities induced from Elo ratings, and then from our logit Elo model, improved our model's performance by significant increments. Over the next few sections, we will examine our models' calibration and performance across subsets of our dataset.

4.4.1 Model Calibration

Calibration represents the ratio between the observed rate of success and the mean predicted rate of success. From a perfectly calibrated model that predicts a win probability of p, we can always expect a win to occur with probability p. While the Random Forest model under-predicted wins, the rest of our models closely resemble a perfectly calibrated model. From our table, KM logit-elo and KM James-Stein exhibited near-perfect calibration of 1.001 and .999, while Logistic Regression reported .993. As there is a degree of uncertainty in measuring calibration, we could not reasonably expect to produce more well-calibrated models. Given a win probability forecast πij at any point of a match within our test set, we expect pi to win with probability πij.

4.4.2 By Set

Since we have less information about a match early on, it is always most difficult to forecast matches during the first set. However, across both metrics, models are most effective during the second set. By this time in the match, the winner of the first set is often a clear favorite to win the match, unless their opponent has a significantly higher Elo rating or serve percentage fji.
As 64.3% of matches in our test set are straight-sets victories, many matches offer relatively easy predictions for our model during the second set.6 Uncertainty increases once the match enters a third set, because the set score has evened at 1-1. While the set score is tied in both the first and third sets, it is easier to predict the outcome during the third set, because its outcome corresponds directly to the match outcome. When predicting in the first set, the player who wins the first set can still go on to lose the match, which increases uncertainty.

6Best-of-three matches that end in two sets are straight-sets victories.

4.4.3 By Surface

By surface, models predicted grass-court matches most effectively and clay-court matches least effectively. As one can observe from historical tournament averages ft across surfaces, grass courts tend to yield higher serving percentages fij, fji than clay courts, presumably because the ball bounces more quickly off grass than off clay. One explanation for the observed differences is that service breaks become more likely on clay courts because fij, fji are lower. With service breaks more likely, it becomes harder for our models to predict the match, because advantages from a current scoreline are more likely to be turned around. Hard courts, where the majority of matches took place, ultimately proved in between grass and clay in predictability. Further insight into the volatility of matches may formally shed light on the differing results across surfaces.7

7We may calculate a measure of volatility from point-by-point sequences or a model's win probability output.

4.5 Visualizing Win Probability

As the most effective Logistic Regression and KM models were not far off in predictive performance, we now visualize their forecasts over several tour-level matches. Over the first thirty points of the match, the contrast between our models is reminiscent of 4.3.3.
Since our Logistic Regression model has learned specific coefficients for the "point diff" feature, Reister's win probability spikes up and down noticeably as he and Gasquet exchange games on serve. Our KM logit-elo model seems to have set t = fij + fji to a high value, expecting frequent service holds and producing a relatively static win probability over this time frame. While we can communicate the relative importance of service breaks with the constraint t = fij + fji, there is no direct method for controlling probability fluctuations with Logistic Regression. Over the last thirty points of the match, consider how Reister's win probability slowly declines according to the KM model, while it remains approximately constant according to Logistic Regression. One drawback of our Logistic Regression feature sets was that they could not distinguish between situations whose score differentials are equivalent. As Gasquet maintains a break advantage through the end of the third set, this explains Logistic Regression's behavior, since score differentials remain similar. In general, consider a player serving at (1,1,5,4,3,0) and one serving at (1,1,1,0,3,0). In the first situation, the player serving wins the match if he wins any of the next three points. In the second scenario, the player serving holds a break advantage early in the final set, from which the returner has many more chances to come back. Assuming each player serves at fi = fj = .64, our point-based win-probability equation suggests a substantial difference between these two scenarios:

Pm(1, 1, 5, 4, 3, 0) = .994
Pm(1, 1, 1, 0, 3, 0) = .800

Although the first situation is clearly favorable, Logistic Regression will observe equivalent differentials (set diff, game diff, and point diff) and compute approximately the same probability in both scenarios.8 In a similar example, Logistic Regression may also fail to notice when a higher-ranked player is about to lose a close match.
Simon leads for most of the match until Gabashvili comes back to force a third set. At the end of the third set, Gabashvili serves out the match from 5-4, yet the model fails to detect that Simon is about to lose. This is because, with sets at 1-1 and game differentials nearly even, Simon's Elo advantage outweighs his current scoreline disadvantage. Consequently, his win probability hovers around 50% as he loses the match, rather than falling toward 0. While we considered several score-based interaction terms to prevent scenarios like this, adding them to our feature set did not significantly change results. Still, when comparing model performance in specific scenarios, it is important to note that the KM model was specifically designed for tennis prediction, while our Logistic Regression model was simply fed score-related features that seemed helpful. Despite apparent faults, observed performance suggests potential in further exploring machine-learning methods.

8After fitting coefficients, P(x) = P(si − sj, gi − gj, xi − xj) = e^(β0 + β1(si−sj) + β2(gi−gj) + β3(xi−xj)) / (1 + e^(β0 + β1(si−sj) + β2(gi−gj) + β3(xi−xj))), and P(1,1,5,4,3,0) = P(1,1,1,0,3,0) since the differentials are equal.

5. Conclusion

For the first time, we have documented the performance of in-match prediction methods on thousands of matches. While previous work only referenced individual matches or the 1992-1995 Wimbledon match dataset (Klaassen and Magnus, 2003), we have applied methods to thousands of matches from recent years. Results from this project and Jeff Sackmann's point-by-point dataset should serve as an appropriate benchmark for future work in this field. Although many papers have referenced Klaassen and Magnus' point-based model, none has explicitly outlined a way to correct for uncertainty stemming from small sample sizes in players' serve/return history. Applying a normalization method, such as the James-Stein estimator in 3.5, significantly boosts the performance of point-based models.
By factoring in the overall ability of a player's opponents in the past twelve months, calculating fij, fji with the opponent-adjusted method in 3.6 offers additional predictive power. Still, none of these variations produce pre-match forecasts as effectively as Elo ratings. Using our approximation algorithm in 4.3.2, we can derive serve probabilities from any pre-match forecast πij and the historical serve parameter t = fij + fji. A combination of this method, using πij derived from an Elo-based logistic regression, and our KM model ultimately produced the most reliable in-match prediction model. Still, logistic regression's performance with a score-based feature set proved noteworthy. Machine-learning models offer potential in their ability to learn from a training set. While serve parameters fij, fji may be obtained from a machine-learning model's forecast πij, our traditional point-based hierarchical Markov model offers no comparable method to "train" itself on a dataset. Future work with point-based models that offers some aspect of optimization with respect to a training set may be of much interest to fans and bettors alike. For the time being, however, we have outlined several approaches to produce well-calibrated win probability forecasts at any point in a tennis match.

5.1 Applications to Betting Markets

Methods explored in this project may be applied to betting markets in several ways. First, we can take any pre-match forecast πij derived from implied market odds, together with the historical serve parameter t = fij + fji, and produce corresponding serve probabilities fij, fji from method 4.3.2. As Kovalchik found that the Bookmaker Consensus Model outperformed Elo ratings (Kovalchik, 2016), we could likely further improve in-match prediction models by using pre-match betting odds. Next, we may back-test any of our in-match prediction techniques against historical in-match odds.
Betfair, an online gambling company, offers historical market data which is time-stamped yet lacks corresponding match scores. While mapping time-stamped market odds to corresponding scorelines across our point-by-point dataset proved out of scope, analyses with in-match betting data do exist. Huang demonstrated score inference from betting odds (Huang et al., 2011), and market-implied odds may serve as a benchmark against which to compare models. Finally, one may also test betting strategies with any of these models in real time, provided up-to-date Elo ratings and fij, fji.

5.2 Future Steps

Aside from the beta experiments, all point-based models in this paper operated under the assumption that points played by each server are i.i.d. While Klaassen and Magnus concluded that deviations from i.i.d. are small enough for such models to provide a reasonable approximation (Klaassen and Magnus, 2001), the truth is that points within a match are not independent. Although players train themselves to ignore the scoreboard, no player is truly capable of maintaining a constant level of play throughout every match. As soon as a player's level fluctuates, sequential points become dependent, suggesting dips or spikes in form throughout the match. Future work exploring momentum through dependencies between points, games, and sets could add another dimension of prediction to our point-based models. To start, one could quantify how much winning a point on serve affects the probability of winning the next point on serve. In total, Jeff Sackmann's point-by-point dataset offers over 100,000 matches over which to explore the effects of momentum.1

Methods explored in this paper were tested exclusively on ATP matches. All methods and analyses conducted in this paper may also be applied to women's WTA matches in Sackmann's point-by-point dataset and contrasted to illustrate match predictability across the men's and women's tours.
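As a starting point for the momentum question raised above, one could parse a Sackmann-style match string and compare a server's point-win rate immediately after winning versus losing the previous service point. The sketch below uses the encoding described in section 4; the function is an illustration of the idea, not the thesis's code:

```python
def serve_momentum(match_string):
    """Estimate P(server wins point | won previous serve point) vs.
    P(server wins point | lost previous serve point).

    match_string uses Sackmann's encoding: 'S' = point won by the
    server, 'R' = by the returner; ';' separates games, '.' sets, and
    '/' serve changes in tiebreaks. We condition within a game so the
    same player serves both points; tiebreaks are skipped for simplicity.
    """
    after_win = [0, 0]   # [points won, points seen] following an S
    after_loss = [0, 0]  # [points won, points seen] following an R
    for game in match_string.replace('.', ';').split(';'):
        if '/' in game:  # tiebreak: the server alternates, so skip
            continue
        for prev, cur in zip(game, game[1:]):
            bucket = after_win if prev == 'S' else after_loss
            bucket[1] += 1
            bucket[0] += 1 if cur == 'S' else 0
    rate = lambda b: b[0] / b[1] if b[1] else None
    return rate(after_win), rate(after_loss)
```

Under the i.i.d. assumption the two rates should agree in expectation, so a persistent gap aggregated over many matches would be evidence of serve momentum.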
5.3 Visualizing Grand Slam Matches

Below, we view win-probability graphs for several historical matches. Probabilities were generated by KM logit-elo (a = 300).

5.3.1 Roger Federer vs. Rafael Nadal, Australian Open 2017 Final

Earlier this year, Roger Federer overcame Rafael Nadal in the Australian Open final 6-4 3-6 6-1 3-6 6-3.

1 While we examined ATP tour-level matches, there are also corresponding WTA matches and satellite-level events for each tour.

5.3.2 Stanislas Wawrinka vs. Novak Djokovic, French Open 2015 Final

With Djokovic having faced pressure for years to complete a career Grand Slam with a French Open title,2 the title finally appeared within his grasp as he knocked out Nadal, then a nine-time French Open champion, en route to the 2015 final. However, Wawrinka had other plans on the day, turning the match around and overpowering Djokovic with unwavering determination.

2 This entails winning each of the Grand Slams once: Australian Open, French Open, Wimbledon, US Open.

5.3.3 Novak Djokovic vs. Roger Federer, US Open 2011 Semi-Final

In one of the most famous comebacks in tennis history, Djokovic recovered from a two-set deficit and then escaped defeat at 3-5 15-40 on Federer's serve in the final set.3

3 If one were to factor in Djokovic's decision to hit an all-or-nothing return winner at 3-5 15-40 in the fifth set, his win probability could be computed as even lower than p = .015, as Federer himself surmised (https://www.theguardian.com/sport/2011/sep/11/us-open-2011-federer-djokovic). Calculating in-match win probability within points, based on shot selection, is realistic nowadays thanks to Jeff Sackmann's Match Charting Project and recent analyses of Hawkeye ball-tracking data.

Bibliography

Barnett, T. and Clarke, S. R. (2005). Combining player statistics to predict outcomes of tennis matches. IMA Journal of Management Mathematics, 16(2):113–120.

Barnett, T. J., Clarke, S. R., et al. (2002). Using Microsoft Excel to model a tennis match.
In 6th Conference on Mathematics and Computers in Sport, pages 63–68. Queensland, Australia: Bond University.

Barnett, T. J. et al. (2006). Mathematical modelling in hierarchical games with specific reference to tennis. PhD thesis, Swinburne University of Technology.

Bevc, M. (2015). Predicting the outcome of tennis matches from point-by-point data.

Bialik, C., Morris, B., and Boice, J. (2016). How we're forecasting the 2016 U.S. Open. http://fivethirtyeight.com/features/how-were-forecasting-the-2016-us-open/. Accessed: 2017-10-30.

Burke, B. (2014). Win probability and WPA. http://www.advancedfootballanalytics.com/index.php/home/stats/stats-explained/win-probability-and-wpa. Accessed: 2017-10-30.

Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. WH Freeman.

Elo, A. E. (1978). The rating of chessplayers, past and present. Arco Pub.

Huang, X., Knottenbelt, W., and Bradley, J. (2011). Inferring tennis match progress from in-play betting odds. Final year project, Imperial College London, South Kensington Campus, London, SW7 2AZ.

Klaassen, F. J. and Magnus, J. R. (2003). Forecasting the winner of a tennis match. European Journal of Operational Research, 148(2):257–267.

Klaassen, F. J. G. M. and Magnus, J. R. (2001). Are points in tennis independent and identically distributed? Evidence from a dynamic binary panel data model. Journal of the American Statistical Association, 96(454):500–509.

Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3):127–138.

Lewis-Kraus, G. (2016). The great AI awakening. The New York Times Magazine.

Lock, D. and Nettleton, D. (2014). Using random forests to estimate win probability before each play of an NFL game. Journal of Quantitative Analysis in Sports, 10(2):197–205.

Madurska, A. M. (2012). A set-by-set analysis method for predicting the outcome of professional singles tennis matches. Imperial College London, Department of Computing, Tech. Rep.
Sipko, M. and Knottenbelt, W. (2015). Machine learning for the prediction of professional tennis matches.

Steinberg, D. (2017). Why ESPN uses those in-game win probability stats that drive some baseball fans nuts. https://www.washingtonpost.com/news/dc-sports-bog/wp/2017/04/23/why-espn-uses-those-in-game-win-probability-stats-that-drive-some-baseball-fans-nuts/?utm_term=.8163323a5ee2. Accessed: 2017-10-30.