Journal of Sports Analytics xx (2021) x–xx 1 DOI 10.3233/JSA-200345 IOS Press
1 Forecasting serve performance
2 in professional tennis matches
∗ 3 Jacob Gollub 4 Harvard University, Department of Statistics
5 Abstract. Many research papers on tennis match prediction use a hierarchical Markov Model. To predict match outcomes, 6 this model requires input parameters for each player’s serving ability. While these parameters are often computed directly 7 from each player’s historical percentages of points won on serve and return, doing so fails to address bias due to limited 8 sample size and differences in strength of schedule. In this paper, we explore a handful of novel approaches to forecasting 9 serve performance that specifically address these limitations. By applying an Efron-Morris estimator, we provide a means 10 to robustly forecast outcomes when players have limited match data over the past year. Next, through tracking expected 11 serve and return performance in past matches, we account for strength of schedule across all points in a player’s match 12 history. Finally, we demonstrate a new way to synthesize historical serve data with the predictive power of Elo ratings. When 13 forecasting serve performance across 7,622 ATP tour-level matches from 2014-2016, all three of these proposed methods 14 outperformed Barnett and Clarke’s standard approach.
15 Keywords: Match prediction, point-based models, Efron-Morris estimator, Elo ratings
16 1. Introduction Assuming that serve and return performances fol- 35 low a normal distribution on a match-by-match basis, 36 17 Statistical prediction models have been applied to Newton and Aslam (2009) ran Monte Carlo simu- 37 18 tennis for decades, often to predict the winner of a lations to predict the winner of a match. Adapting 38 19 match. Laying the groundwork for research on point- Barnett and Clarke’s approach, Spanias and Knotten- 39 20 based models, Klaassen and Magnus (2001) first belt (2012) predicted serve performance by modeling 40 21 tested the assumption that points in a tennis match the probabilities of specific point outcomes (e.g. ace, 41 22 are independent and identically distributed, or i.i.d. rally win on first serve, etc.) within a service game. 42 23 While this was proven false, they concluded devi- In order to reduce bias from variations in strength 43 24 ations are small enough that the i.i.d. assumption of schedule, Knottenbelt et al. (2012) then proposed 44 25 provides a reasonable approximation. By estimating the Common-Opponent Model, which measures per- 45 26 each player’s probability of winning a point on serve formance relative to common adversaries in their 46 27 and computing win probability from these parame- respective match histories. More recently, Kovalchik 47 28 ters, they demonstrated a new way to forecast matches and Reid (2018) demonstrated a way to calibrate 48 29 under this assumption (Klaassen and Magnus, 2003). serve parameters to the win probability implied by 49 30 In the time since, we have seen a variety of ways each player’s Elo rating. 50 31 to forecast serve performance. Barnett and Clarke Building upon previous work in serve performance 51 32 (2005) estimated these parameters byUncorrected adjusting his- prediction, Author this exploration servesProof two purposes. 52 33 torical tournament statistics by the performance of Following a previous assessment of 11 different pre- 53 34 the server and returner, relative to the average player. match prediction models on 2014 ATP tour-level 54 matches (Kovalchik, 2016), we evaluate serve perfor- 55 ∗ mance prediction methods across the same dataset as 56 Corresponding author: Jacob Gollub, Harvard University, Department of Statistics. E-mail: [email protected]. a benchmark for future work. We also introduce sev- 57
ISSN 2215-020X © 2021 – The authors. Published by IOS Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0). 2 J. Gollub / Forecasting serve performance in professional tennis matches
58 eral new methods to more robustly compute expected directly from point-level probabilities, while Elo rat- 103 59 serve performance. When forecasting thousands of ings infer player ability solely from match outcomes. 104 60 matches, the previously stated methods tend to run
61 into problems with limited match data and differences 2.1. Hierarchical markov model 105 1 62 in strength of schedule. To avoid these pitfalls, we
63 explore variations of Barnett and Clarke’s approach Consider a tennis match between player i and 106 64 that specifically address these issues. player j, where fij estimates the probability that 107 65 In section 2, we review the hierarchical Markov player i wins a point on serve against player j in a 108 66 Model and Elo ratings, two of the most often-cited given match. Given serve parameters fij ,fji we can 109 67 methods for predicting tennis match outcomes. In calculate the probability that player i wins the match, 110 68 section 3, we review Barnett and Clarke’s formula πij , from any score. To do so, consider how scores 111 69 and explore further approaches to serve performance are composed of points, games, and sets. With player 112 70 prediction. Section 3.1.1 introduces a Bayesian esti- i serving to player j, let ai,aj represent each player’s 113 71 mator to more robustly predict serve performance within-game score. Then we compute the probabil- 114 72 from limited match data. Section 3.1.2 demonstrates ity of player i winning the current game from score 115 73 an approach that tracks expected serve and return per- ai − aj as follows: 116 74 formances in every match, allowing us to consider 75 strength of schedule over the course of a player’s ⎧ ⎪ , a = ,a ≤ 76 entire match history. Section 3.2 harnesses Elo rat- ⎪1 if i 4 j 2 ⎪ 77 ings’ predictive power to more accurately predict ⎪ = ≤ ⎪0, if aj 4,ai 2 78 serve performance while maintaining the overall ⎪ ⎨ 2 79 serve ability between two players. In section 4, we (fij ) = , if a = a = 3 Pg(ai,aj) 2 + − 2 i j 80 present several case studies to examine how each ⎪ (fij ) (1 fij ) ⎪ 81 approach informs predictions for individual matches. ⎪ ∗ + + ⎪fij Pg(ai 1,aj) 82 In section 5, we evaluate all proposed methods ⎪ ⎪ 83 ⎩ alongside established prediction models. Section 6 (1 − fij )Pg(ai,aj + 1), otherwise. 84 summarizes findings and suggests future steps for 117 85 research with point-based models. Following this approach, one may calculate player 118 86 Effective serve performance prediction holds i’s probability of winning the current set, Ps(gi,gj), 119 87 strong implications for both player analytics and and then the match, Pm(si,sj), through similar recur- 120 88 in-match forecasting. With more effective forecasts, sive relationships. Barnett et al. (2002) describe the 121 89 coaches can better understand how their players recursion in full detail. 122 90 match up with opponents on serve and return, 91 and adjust their strategies accordingly. By using 92 the hierarchical Markov Model, we can also com- 2.2. Elo ratings 123 93 pute win probability as a function of each player’s 94 expected serve performance from any score. There- Elo was originally developed as a head-to-head rat- 124 95 fore, improving existing methods will provide means ing system for chess players (Elo, 1978). Recently, 125 96 to more confidently predict match outcomes while FiveThirtyEight’s Elo variant has gained prominence 126 97 play is in progress, a problem with significant applica- in the media (Bialik et al., 2016). For a match at 127 98 tion to betting markets and real-time sports analytics. time t between player i and player j with Elo rat- 128 ings Ei(t) and Ej(t), player i is forecasted to win 129 with probability: 130
99 2. Models for match win prediction − Ej(t)−Ei(t) 1 πˆ (t) = 1 + 10 400 . 131 100 Newly proposed methods will require the context ij 101 of two popular match prediction models.Uncorrected The hier- Author Proof 102 archical Markov Model computes win probability For the following match, player i’s rating is then 132 1Knottenbelt’s Common-Opponent model addresses strength updated accordingly: 133 of schedule at the cost of reducing the amount of available match data. Kovalchik and Reid’s approach indirectly addresses both of + = + ∗ − these issues through the use of Elo ratings. Ei(t 1) Ei(t) Kit (Wi(t) πˆ ij (t)). 134 J. Gollub / Forecasting serve performance in professional tennis matches 3
135 Wi(t) is an indicator for whether player i won the return. ft denotes the percentage of points won on 172 136 given match at time t, while Kit is the learning rate serve at the match’s given tournament and fav,gav 173 137 for player i at time t. represent tour-level averages for the percentages of 174 138 According to FiveThirtyEight’s analysts, Elo rat- points won on serve and return, respectively. 175 139 ings perform optimally when allowing Kit to decay While Barnett and Clarke’s dataset was limited to 176 140 slowly over time (Bialik et al., 2016). With mi(t) rep- year-to-date statistics, we may calculate fi,gi with 177 141 resenting the number of player i’s career matches at the past twelve months of match data for any given 178 2 Mi 142 time t, we update the learning rate as follows : match. Where (y,m) represents the set of player i’s 179 matches in year y, month m, we obtain the following 180 250 4 = statistics : 181 143 Kit .4 . (5 + mi(t)) 12 = ∈Mi wik t 1 k (y−1,m+t) 144 This variant updates a player’s Elo rating most = 182 fi(y, m) 12 = i n 145 quickly when we have no information about them t 1 k∈M − + ik (y 1,m t) 146 and makes smaller changes as mi(t) accumulates. 12 = ∈Mi njk − wjk 147 To apply this rating system to all ATP tour-level t 1 k (y−1,m+t) = 183 gi(y, m) 12 . 148 matches, we initalize each player’s Elo rating at = ∈Mi njk t 1 k (y−1,m+t) 149 Ei(0) = 1500 and match history mi(0) = 0. Then, 150 we iterate through all tour-level matches from 1968- wik denotes the number of service points won by 184 151 present in chronological order, storing Ei(t),Ej(t) player i in match k and nik the total number of service 185 152 for each match and updating each player’s Elo rating 3 points played by player i in match k. 186 153 accordingly. Next, we calculate ft for a given tournament and 187 year, where M(v,y) represents the set of all matches 188 played at tournament v in year y: 189 154 3. Predicting serve performance k∈M − wk 155 (v,y 1) A significant portion of research in tennis match f (v, y) = . 190 t n 156 prediction concerns estimating each player’s proba- k∈M(v,y−1) k 157 bility of winning a point on serve. We present several 158 variations to Barnett and Clarke’s approach in Sec- wk and nk represent the number of service points won 191 159 tion 3.1 and a way to reconcile Elo ratings with these and played in match k, respectively. 192 160 point-based methods in Section 3.2. Finally, we calculate fav,gav where M(y,m) repre- 193 sents the set of tour-level matches played in year y, 194 161 3.1. Barnett-clarke formula month m: 195 12 162 Given players’ historical serve/return perfor- w t=1 k∈M(y−1,m+t) k f (y, m) = 196 163 mance, Barnett and Clarke (2005) demonstrated a av 12 t=1 k∈M − + nk 164 method to calculate fij ,fji. (y 1,m t) gav(y, m) = 1 − fav(y, m). 197 165 fij = ft + (fi − fav) − (gj − gav)
166 fji = ft + (fj − fav) − (gi − gav) Overall, Barnett and Clarke’s formula assumes that 198 differences between player serve and return abil- 199 167 In a match between player i and player j, each param- ity are additive. Next, we explore variations to this 200 168 eter estimates one player’s probability of winning a approach. 201 169 point on serve against the other. fi represents player
170 i’s historical percentage of points won on serve, while 3.1.1. Efron-morris estimator 202 171 g corresponds to their percentage of points won on i UncorrectedIn the case Author of players who do notProof regularly compete 203 in tour-level events, fi,gi must be calculated from 204 2The constants in this equation are parameter values that limited sample sizes. Consequently, match probabili- 205 FiveThirtyEight’s team chose after fitting this model on decades of tour-level match data. ties based on these estimates can be skewed by noise. 206 3Tennis’ Open Era began in 1968, when professionals were allowed to enter grand slam tournaments. 4For current month m, we only collect month-to-date matches. 4 J. Gollub / Forecasting serve performance in professional tennis matches
2 207 To address this, we turn to the Efron-Morris estimator Normalization coefficient Bi depends on both τ , the 247 208 to provide alternative parameters of the form: 2 variance of true service ability, and σi , the variance 248 of fi given true service ability θi. With fˆi denoting 249 = + − − − 209 fij ft (fi fav) (gj gav) the observed value of fi across ni points, we first 250 estimate mean and variance of true service ability 251 210 f = f + (f − f ) − (g − g ). ji t j av i av from all matches in our dataset: 252 211 Rather than directly apply the estimator to f ,f ,we fˆ ij ji fi∈F i fav = 253 212 normalize the serve/return parameters fi,gi which |F| 213 constitute Barnett and Clarke’s equations. 214 Decades ago, Efron and Morris (1977) described (fˆ − f )2 2 fi∈F i av 215 a method to estimate a group of sample means τˆ = . 254 |F|−1 216 with unequal variances. The Efron-Morris estima- 217 tor shrinks sample means toward the overall mean Then, using maximum likelihood estimation, we esti- 255 218 by a magnitude proportional to each sample mean’s 2 mate Bi and σi in order to produce the Efron-Morris 256 219 uncertainty, producing a mean-squared error favor- estimator: 257 220 able in expectation to that of Maximum-Likelihood 221 Estimation. While Barnett and Clarke use raw histor- 2 σˆi 222 ical averages of serve and return points won, we can Bˆ = 258 i 2 2 τˆ + σˆi 223 instead use this estimator to feed more reliable param- 224 eters into their equations. Just as Efron and Morris ˆ − ˆ 2 fi(1 fi) σˆ = 259 225 estimated toxoplasmosis rates across hospitals with i ni 226 uneven populations, we will apply this method to = ˆ + ˆ − ˆ 227 serve performance prediction. fi fi Bi(fav fi). 260 228 Consider our match dataset M, consisting of 229 all tour-level matches from 1968-present. For each When ni is large, our uncertainty in fi decreases 261 ˆ 230 match k between players i and j, we calculate each and shrinkage coefficient Bi approaches zero. As ni 262 ˆ 231 player’s historical percentage of points won fik,fjk gets smaller, Bi approaches one and fi more closely 263 232 as outlined in section 3.1. Then, F contains each resembles the average serving ability. By repeating 264 233 player’s historical serve performance before every the same calculations with gi in place of fi ,wecan 265 234 match: obtain Efron-Morris estimators for return ability and 266 then compute f ,f from these serve and return 267 ij ji F = F estimators. 268 235 k∈M k While Barnett and Clarke’s original paper 269 236 Fk ={fik,fjk}. computed fi,gi with sample means, using an Efron- 270 Morris estimator will produce more robust forecasts 271
237 To model serve performance according to a Bayesian across large datasets, where the amount of available 272 238 distribution, we consider each fi ∈ F to be a ran- data for a given match varies significantly. In addi- 273 239 dom variable that approximates a player’s true service tion, this estimator can be applied to other variations 274 240 ability θi as follows: of Barnett and Clarke’s approach, as we will observe 275 when evaluating methods in Section 5. 276 | ∼ 2 241 fi θi N(θi,σi ) 3.1.2. Opponent-adjusted ratings 277 2 242 θi ∼ N(μ, τ ). While Barnett and Clarke’s equation considers the 278 opponent’s serve and return ability, it does not track 279
243 Put together, the above statements imply the follow- strength of schedule throughout each player’s match 280 history. This is important, as a player’s win per- 281 244 ing (Efron and Morris, 1975): Uncorrected Author Proof centages on serve/return may become inflated from 282 | ∼ + − 2 − playing weaker opponents or vice versa. In this sec- 283 245 θi fi N(fi Bi(μ fi),σi (1 Bi)) tion, we propose a variation to Barnett and Clarke’s 284 σ 2 i equation which replaces f ,g with opponent- 285 246 B = . av av i τ2 + σ 2 − − i adjusted averages 1 gi, 1 fi . The equations then 286 J. Gollub / Forecasting serve performance in professional tennis matches 5
287 become: based forecasts. However, given that Elo has since 325 been demonstrated to outperform ATP rank in pre- 326 = + − − − − − 288 fij ft (fi (1 gi)) (gj (1 fj)) dicting match outcomes, we apply this method with 327 Elo forecasts. 328 289 f = ft + (fj − (1 − g )) − (gi − (1 − f )). ji j i Even when we impose the constraint fij + fji = b, 329 our hierarchical Markov Model’s match probability 330 290 f ,g represent the average serve and return abil- i i equation has no analytical solution to its inverse. 331 291 ities of player i’s opponents in the last twelve Therefore, we turn to the following approximation 332 292 months. To calculate this, we weight each opponent’s algorithm to generate serving percentages that 333 293 serve/return ability by the number of points played correspond to a win probability within of our Elo 334 294 throughout player i’s match history: forecast: 335