
Journal of Sports Analytics xx (2021) x–xx 1 DOI 10.3233/JSA-200345 IOS Press

Forecasting serve performance in professional tennis matches

Jacob Gollub∗
Harvard University, Department of Statistics

Abstract. Many research papers on tennis match prediction use a hierarchical Markov Model. To predict match outcomes, this model requires input parameters for each player’s serving ability. While these parameters are often computed directly from each player’s historical percentages of points won on serve and return, doing so fails to address bias due to limited sample size and differences in strength of schedule. In this paper, we explore a handful of novel approaches to forecasting serve performance that specifically address these limitations. By applying an Efron-Morris estimator, we provide a means to robustly forecast outcomes when players have limited match data over the past year. Next, through tracking expected serve and return performance in past matches, we account for strength of schedule across all points in a player’s match history. Finally, we demonstrate a new way to synthesize historical serve data with the predictive power of Elo ratings. When forecasting serve performance across 7,622 ATP tour-level matches from 2014-2016, all three of these proposed methods outperformed Barnett and Clarke’s standard approach.

Keywords: Match prediction, point-based models, Efron-Morris estimator, Elo ratings

1. Introduction

Statistical prediction models have been applied to tennis for decades, often to predict the winner of a match. Laying the groundwork for research on point-based models, Klaassen and Magnus (2001) first tested the assumption that points in a tennis match are independent and identically distributed, or i.i.d. While this was proven false, they concluded deviations are small enough that the i.i.d. assumption provides a reasonable approximation. By estimating each player’s probability of winning a point on serve and computing win probability from these parameters, they demonstrated a new way to forecast matches under this assumption (Klaassen and Magnus, 2003).

In the time since, we have seen a variety of ways to forecast serve performance. Barnett and Clarke (2005) estimated these parameters by adjusting historical tournament statistics by the performance of the server and returner, relative to the average player. Assuming that serve and return performances follow a normal distribution on a match-by-match basis, Newton and Aslam (2009) ran Monte Carlo simulations to predict the winner of a match. Adapting Barnett and Clarke’s approach, Spanias and Knottenbelt (2012) predicted serve performance by modeling the probabilities of specific point outcomes (e.g., rally win on first serve, etc.) within a service game. In order to reduce bias from variations in strength of schedule, Knottenbelt et al. (2012) then proposed the Common-Opponent Model, which measures performance relative to common adversaries in their respective match histories. More recently, Kovalchik and Reid (2018) demonstrated a way to calibrate serve parameters to the win probability implied by each player’s Elo rating.

Building upon previous work in serve performance prediction, this exploration serves two purposes. Following a previous assessment of 11 different pre-match prediction models on 2014 ATP tour-level matches (Kovalchik, 2016), we evaluate serve performance prediction methods across the same dataset as a benchmark for future work. We also introduce several new methods to more robustly compute expected serve performance. When forecasting thousands of matches, the previously stated methods tend to run into problems with limited match data and differences in strength of schedule.¹ To avoid these pitfalls, we explore variations of Barnett and Clarke’s approach that specifically address these issues.

∗ Corresponding author: Jacob Gollub, Harvard University, Department of Statistics. E-mail: [email protected].

ISSN 2215-020X © 2021 – The authors. Published by IOS Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (CC BY-NC 4.0).

J. Gollub / Forecasting serve performance in professional tennis matches

¹ Knottenbelt’s Common-Opponent model addresses strength of schedule at the cost of reducing the amount of available match data. Kovalchik and Reid’s approach indirectly addresses both of these issues through the use of Elo ratings.

In section 2, we review the hierarchical Markov Model and Elo ratings, two of the most often-cited methods for predicting tennis match outcomes. In section 3, we review Barnett and Clarke’s formula and explore further approaches to serve performance prediction. Section 3.1.1 introduces a Bayesian estimator to more robustly predict serve performance from limited match data. Section 3.1.2 demonstrates an approach that tracks expected serve and return performances in every match, allowing us to consider strength of schedule over the course of a player’s entire match history. Section 3.2 harnesses Elo ratings’ predictive power to more accurately predict serve performance while maintaining the overall serve ability between two players. In section 4, we present several case studies to examine how each approach informs predictions for individual matches. In section 5, we evaluate all proposed methods alongside established prediction models. Section 6 summarizes findings and suggests future steps for research with point-based models.

Effective serve performance prediction holds strong implications for both player analytics and in-match forecasting. With more effective forecasts, coaches can better understand how their players match up with opponents on serve and return, and adjust their strategies accordingly. By using the hierarchical Markov Model, we can also compute win probability as a function of each player’s expected serve performance from any score. Therefore, improving existing methods will provide means to more confidently predict match outcomes while play is in progress, a problem with significant application to betting markets and real-time sports analytics.

2. Models for match win prediction

Newly proposed methods will require the context of two popular match prediction models. The hierarchical Markov Model computes win probability directly from point-level probabilities, while Elo ratings infer player ability solely from match outcomes.

2.1. Hierarchical Markov model

Consider a tennis match between player $i$ and player $j$, where $f_{ij}$ estimates the probability that player $i$ wins a point on serve against player $j$ in a given match. Given serve parameters $f_{ij}, f_{ji}$ we can calculate the probability that player $i$ wins the match, $\pi_{ij}$, from any score. To do so, consider how scores are composed of points, games, and sets. With player $i$ serving to player $j$, let $a_i, a_j$ represent each player’s within-game score. Then we compute the probability of player $i$ winning the current game from score $a_i$-$a_j$ as follows:

$$
P_g(a_i, a_j) =
\begin{cases}
1, & \text{if } a_i = 4,\ a_j \le 2 \\
0, & \text{if } a_j = 4,\ a_i \le 2 \\
\dfrac{(f_{ij})^2}{(f_{ij})^2 + (1 - f_{ij})^2}, & \text{if } a_i = a_j = 3 \\
f_{ij}\, P_g(a_i + 1, a_j) + (1 - f_{ij})\, P_g(a_i, a_j + 1), & \text{otherwise.}
\end{cases}
$$

Following this approach, one may calculate player $i$’s probability of winning the current set, $P_s(g_i, g_j)$, and then the match, $P_m(s_i, s_j)$, through similar recursive relationships. Barnett et al. (2002) describe the recursion in full detail.

2.2. Elo ratings

Elo was originally developed as a head-to-head rating system for chess players (Elo, 1978). Recently, FiveThirtyEight’s Elo variant has gained prominence in the media (Bialik et al., 2016). For a match at time $t$ between player $i$ and player $j$ with Elo ratings $E_i(t)$ and $E_j(t)$, player $i$ is forecasted to win with probability:

$$
\hat{\pi}_{ij}(t) = \left(1 + 10^{\frac{E_j(t) - E_i(t)}{400}}\right)^{-1}.
$$

For the following match, player $i$’s rating is then updated accordingly:

$$
E_i(t + 1) = E_i(t) + K_{it} \left(W_i(t) - \hat{\pi}_{ij}(t)\right).
$$
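As a minimal sketch, the Elo forecast and rating update of Section 2.2 can be written in a few lines of Python. The function names and the `k` argument are illustrative choices rather than the author's code; the learning rate is simply passed in by the caller. The ratings in the usage lines are those listed for Elahi and Karlovic in Section 4.1.

```python
def elo_win_prob(e_i: float, e_j: float) -> float:
    """Forecast probability that player i beats player j, given their Elo ratings."""
    return 1.0 / (1.0 + 10.0 ** ((e_j - e_i) / 400.0))


def elo_update(e_i: float, e_j: float, won: bool, k: float) -> float:
    """Return player i's rating after one match against player j.

    `won` plays the role of the indicator W_i(t); `k` is the learning rate
    for this match and is supplied by the caller (an assumption of this sketch).
    """
    outcome = 1.0 if won else 0.0
    return e_i + k * (outcome - elo_win_prob(e_i, e_j))


# Ratings from the Elahi-Karlovic case study (Section 4.1).
elahi, karlovic = 1585.93, 1952.86
p_elahi = elo_win_prob(elahi, karlovic)
print(f"Elahi win forecast: {p_elahi:.4f}")
# Hypothetical learning rate of 32, chosen only for illustration.
print(f"Elahi rating after an upset win: {elo_update(elahi, karlovic, True, 32.0):.2f}")
```

Because the update is proportional to the gap between the outcome and the forecast, an upset win moves the rating far more than an expected one.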

$W_i(t)$ is an indicator for whether player $i$ won the given match at time $t$, while $K_{it}$ is the learning rate for player $i$ at time $t$.

According to FiveThirtyEight’s analysts, Elo ratings perform optimally when allowing $K_{it}$ to decay slowly over time (Bialik et al., 2016). With $m_i(t)$ representing the number of player $i$’s career matches at time $t$, we update the learning rate as follows:²

$$
K_{it} = \frac{250}{(5 + m_i(t))^{0.4}}.
$$

² The constants in this equation are parameter values that FiveThirtyEight’s team chose after fitting this model on decades of tour-level match data.

This variant updates a player’s Elo rating most quickly when we have no information about them and makes smaller changes as $m_i(t)$ accumulates. To apply this rating system to all ATP tour-level matches, we initialize each player’s Elo rating at $E_i(0) = 1500$ and match history $m_i(0) = 0$. Then, we iterate through all tour-level matches from 1968-present in chronological order, storing $E_i(t), E_j(t)$ for each match and updating each player’s Elo rating accordingly.³

³ Tennis’ Open Era began in 1968, when professionals were allowed to enter Grand Slam tournaments.

3. Predicting serve performance

A significant portion of research in tennis match prediction concerns estimating each player’s probability of winning a point on serve. We present several variations to Barnett and Clarke’s approach in Section 3.1 and a way to reconcile Elo ratings with these point-based methods in Section 3.2.

3.1. Barnett-Clarke formula

Given players’ historical serve/return performance, Barnett and Clarke (2005) demonstrated a method to calculate $f_{ij}, f_{ji}$:

$$
\begin{aligned}
f_{ij} &= f_t + (f_i - f_{av}) - (g_j - g_{av}) \\
f_{ji} &= f_t + (f_j - f_{av}) - (g_i - g_{av})
\end{aligned}
$$

In a match between player $i$ and player $j$, each parameter estimates one player’s probability of winning a point on serve against the other. $f_i$ represents player $i$’s historical percentage of points won on serve, while $g_i$ corresponds to their percentage of points won on return. $f_t$ denotes the percentage of points won on serve at the match’s given tournament and $f_{av}, g_{av}$ represent tour-level averages for the percentages of points won on serve and return, respectively.

While Barnett and Clarke’s dataset was limited to year-to-date statistics, we may calculate $f_i, g_i$ with the past twelve months of match data for any given match. Where $\mathcal{M}^i_{(y,m)}$ represents the set of player $i$’s matches in year $y$, month $m$, we obtain the following statistics:⁴

$$
f_i(y, m) = \frac{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} w_{ik}}{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{ik}}
$$

$$
g_i(y, m) = \frac{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} (n_{jk} - w_{jk})}{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{jk}}.
$$

⁴ For current month $m$, we only collect month-to-date matches.

$w_{ik}$ denotes the number of service points won by player $i$ in match $k$ and $n_{ik}$ the total number of service points played by player $i$ in match $k$.

Next, we calculate $f_t$ for a given tournament and year, where $\mathcal{M}_{(v,y)}$ represents the set of all matches played at tournament $v$ in year $y$:

$$
f_t(v, y) = \frac{\sum_{k \in \mathcal{M}_{(v,\,y-1)}} w_k}{\sum_{k \in \mathcal{M}_{(v,\,y-1)}} n_k}.
$$

$w_k$ and $n_k$ represent the number of service points won and played in match $k$, respectively.

Finally, we calculate $f_{av}, g_{av}$ where $\mathcal{M}_{(y,m)}$ represents the set of tour-level matches played in year $y$, month $m$:

$$
f_{av}(y, m) = \frac{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}_{(y-1,\,m+t)}} w_k}{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}_{(y-1,\,m+t)}} n_k}
$$

$$
g_{av}(y, m) = 1 - f_{av}(y, m).
$$

Overall, Barnett and Clarke’s formula assumes that differences between player serve and return ability are additive. Next, we explore variations to this approach.

3.1.1. Efron-Morris estimator

In the case of players who do not regularly compete in tour-level events, $f_i, g_i$ must be calculated from limited sample sizes. Consequently, match probabilities based on these estimates can be skewed by noise.
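To see how directly a small sample feeds into these parameters, the Barnett-Clarke combination of Section 3.1 can be sketched as follows. The helper names are illustrative, the tournament average `F_T` is a hypothetical value, and the second Elahi match is invented purely to show the sensitivity; the remaining counts and averages are those listed in the Section 4.1 case study.

```python
def pooled_rate(won: list, played: list) -> float:
    """Share of points won, pooled over a list of matches."""
    return sum(won) / sum(played)


def barnett_clarke(f_t: float, f_i: float, g_j: float, f_av: float) -> float:
    """Serve parameter f_ij = f_t + (f_i - f_av) - (g_j - g_av), with g_av = 1 - f_av."""
    g_av = 1.0 - f_av
    return f_t + (f_i - f_av) - (g_j - g_av)


F_T = 0.65      # hypothetical tournament serve average, for illustration only
F_AV = 0.6423   # tour-level serve average quoted in Section 4.1

# Server with a single match on record (51 of 64 service points won).
f_i_one_match = pooled_rate([51], [64])
# Opponent's return rate from a full year of data (1409 of 4903 return points won).
g_j = pooled_rate([1409], [4903])

# Same server after one additional, invented match (40 of 70 service points won).
f_i_two_matches = pooled_rate([51, 40], [64, 70])

print(round(barnett_clarke(F_T, f_i_one_match, g_j, F_AV), 4))
print(round(barnett_clarke(F_T, f_i_two_matches, g_j, F_AV), 4))
```

Since the formula is additive, any swing in the raw serve percentage passes one-for-one into the forecast, which is why a single lopsided match can dominate the estimate.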

To address this, we turn to the Efron-Morris estimator to provide alternative parameters of the form:

$$
\begin{aligned}
f'_{ij} &= f_t + (f'_i - f_{av}) - (g'_j - g_{av}) \\
f'_{ji} &= f_t + (f'_j - f_{av}) - (g'_i - g_{av}).
\end{aligned}
$$

Rather than directly apply the estimator to $f_{ij}, f_{ji}$, we normalize the serve/return parameters $f_i, g_i$ which constitute Barnett and Clarke’s equations.

Decades ago, Efron and Morris (1977) described a method to estimate a group of sample means with unequal variances. The Efron-Morris estimator shrinks sample means toward the overall mean by a magnitude proportional to each sample mean’s uncertainty, producing a mean-squared error favorable in expectation to that of Maximum-Likelihood Estimation. While Barnett and Clarke use raw historical averages of serve and return points won, we can instead use this estimator to feed more reliable parameters into their equations. Just as Efron and Morris estimated toxoplasmosis rates across hospitals with uneven populations, we will apply this method to serve performance prediction.

Consider our match dataset $\mathcal{M}$, consisting of all tour-level matches from 1968-present. For each match $k$ between players $i$ and $j$, we calculate each player’s historical percentage of points won $f_{ik}, f_{jk}$ as outlined in section 3.1. Then, $F$ contains each player’s historical serve performance before every match:

$$
F = \bigcup_{k \in \mathcal{M}} F_k, \qquad F_k = \{f_{ik}, f_{jk}\}.
$$

To model serve performance according to a Bayesian distribution, we consider each $f_i \in F$ to be a random variable that approximates a player’s true service ability $\theta_i$ as follows:

$$
\begin{aligned}
f_i \mid \theta_i &\sim N(\theta_i, \sigma_i^2) \\
\theta_i &\sim N(\mu, \tau^2).
\end{aligned}
$$

Put together, the above statements imply the following (Efron and Morris, 1975):

$$
\begin{aligned}
\theta_i \mid f_i &\sim N\left(f_i + B_i(\mu - f_i),\ \sigma_i^2 (1 - B_i)\right) \\
B_i &= \frac{\sigma_i^2}{\tau^2 + \sigma_i^2}.
\end{aligned}
$$

Normalization coefficient $B_i$ depends on both $\tau^2$, the variance of true service ability, and $\sigma_i^2$, the variance of $f_i$ given true service ability $\theta_i$. With $\hat{f}_i$ denoting the observed value of $f_i$ across $n_i$ points, we first estimate mean and variance of true service ability from all matches in our dataset:

$$
f_{av} = \frac{\sum_{f_i \in F} \hat{f}_i}{|F|}
$$

$$
\hat{\tau}^2 = \frac{\sum_{f_i \in F} (\hat{f}_i - f_{av})^2}{|F| - 1}.
$$

Then, using maximum likelihood estimation, we estimate $B_i$ and $\sigma_i^2$ in order to produce the Efron-Morris estimator:

$$
\begin{aligned}
\hat{B}_i &= \frac{\hat{\sigma}_i^2}{\hat{\tau}^2 + \hat{\sigma}_i^2} \\
\hat{\sigma}_i^2 &= \frac{\hat{f}_i (1 - \hat{f}_i)}{n_i} \\
f'_i &= \hat{f}_i + \hat{B}_i (f_{av} - \hat{f}_i).
\end{aligned}
$$

When $n_i$ is large, our uncertainty in $f_i$ decreases and shrinkage coefficient $\hat{B}_i$ approaches zero. As $n_i$ gets smaller, $\hat{B}_i$ approaches one and $f'_i$ more closely resembles the average serving ability. By repeating the same calculations with $g'_i$ in place of $f'_i$, we can obtain Efron-Morris estimators for return ability and then compute $f'_{ij}, f'_{ji}$ from these serve and return estimators.

While Barnett and Clarke’s original paper computed $f_i, g_i$ with sample means, using an Efron-Morris estimator will produce more robust forecasts across large datasets, where the amount of available data for a given match varies significantly. In addition, this estimator can be applied to other variations of Barnett and Clarke’s approach, as we will observe when evaluating methods in Section 5.

3.1.2. Opponent-adjusted ratings

While Barnett and Clarke’s equation considers the opponent’s serve and return ability, it does not track strength of schedule throughout each player’s match history. This is important, as a player’s win percentages on serve/return may become inflated from playing weaker opponents or vice versa. In this section, we propose a variation to Barnett and Clarke’s equation which replaces $f_{av}, g_{av}$ with opponent-adjusted averages $1 - g'_i, 1 - f'_i$. The equations then become:

$$
\begin{aligned}
f'_{ij} &= f_t + (f_i - (1 - g'_i)) - (g_j - (1 - f'_j)) \\
f'_{ji} &= f_t + (f_j - (1 - g'_j)) - (g_i - (1 - f'_i)).
\end{aligned}
$$

$f'_i, g'_i$ represent the average serve and return abilities of player $i$’s opponents in the last twelve months. To calculate this, we weight each opponent’s serve/return ability by the number of points played throughout player $i$’s match history:

$$
f'_i(y, m) = \frac{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{jk} (1 - g'_{ijk})}{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{jk}}
$$

$$
g'_i(y, m) = \frac{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{ik} (1 - f'_{ijk})}{\sum_{t=1}^{12} \sum_{k \in \mathcal{M}^i_{(y-1,\,m+t)}} n_{ik}}.
$$

Once again, $n_{ik}$ denotes the number of service points played by player $i$ in match $k$. In calculating $f'_i$, we use $n_{jk}$ to denote the number of return points played by player $i$ in match $k$.⁵ Calculated from historical data at the time of each match, $f'_{ijk}, g'_{ijk}$ represent the opponent-adjusted likelihood of player $i$ winning a point against player $j$ in match $k$ on serve and return, respectively.

⁵ The number of player $j$’s service points in match $k$ is equal to the number of player $i$’s return points in match $k$.

Since the formula considers each player’s opponent-adjusted ratings at the time of each match, we must compute ratings in chronological order. Similarly to Elo, we initialize all players’ opponent-adjusted ratings to $f_{av}, g_{av}$ before iterating through all tour-level matches 1968-present and calculating player ratings for each match accordingly.

3.2. Klaassen-Magnus Elo ratings

Before Barnett and Clarke’s approach, Klaassen and Magnus (2001) suggested a method to infer serving probabilities from a pre-match win forecast $\pi_{ij}$. As $b = f_{ij} + f_{ji}$ represents the overall serve ability between two players, they impose the constraint that any new serve parameters $f'_{ij}, f'_{ji}$ must satisfy $f'_{ij} + f'_{ji} = b$. Using this, we may create a one-to-one function $S : (\pi_{ij}, b) \to (f'_{ij}, f'_{ji})$, which generates serving probabilities $f'_{ij}, f'_{ji}$ for both players such that $P_m(0, 0) = \pi_{ij}$.

As this paper was published in 2002, Klaassen and Magnus produced serve parameters from ATP rank-based forecasts. However, given that Elo has since been demonstrated to outperform ATP rank in predicting match outcomes, we apply this method with Elo forecasts.

Even when we impose the constraint $f_{ij} + f_{ji} = b$, our hierarchical Markov Model’s match probability equation has no analytical solution to its inverse. Therefore, we turn to the following approximation algorithm to generate serving percentages that correspond to a win probability within $\epsilon$ of our Elo forecast:

Algorithm 1 Klaassen-Magnus Elo Serve Parameters
procedure EloServeProbabilities(π, b, ε)
    f ← b/2
    diff ← b/4
    currentProb ← .5
    while |currentProb − π| > ε do
        if currentProb < π then
            f ← f + diff
        else
            f ← f − diff
        diff ← diff/2
        currentProb ← matchProb(f, b − f)
    return f, b − f

To generate serve probabilities for a given match, we first compute $\pi_{ij}$ as player $i$’s chance of victory against player $j$ given their Elo ratings and $f_{ij}, f_{ji}$ as specified by any Barnett-Clarke variation in section 3.1. Then we run the above algorithm with $\pi = \pi_{ij}$, $b = f_{ij} + f_{ji}$, and $\epsilon$ set to a desired precision level.⁶ At each step, we call matchProb() to compute the win probability from the start of the match if player $i$ and player $j$ had serve parameters $f_{ij} = f$, $f_{ji} = b - f$, respectively. Then we compare currentProb to $\pi$ and move $f$ up or down by diff, which halves at every iteration. This process continues until the serve parameters $f, b - f$ correspond to a win probability within $\epsilon$ of $\pi_{ij}$, taking $O(\log \frac{1}{\epsilon})$ calls to matchProb().

⁶ For this project, we set $\epsilon = .001$.

Given any pre-match forecast $\pi_{ij}$, we can produce $f'_{ij}, f'_{ji}$ consistent with $\pi_{ij}$, according to our hierarchical Markov Model. While Kovalchik and Reid (2018) recently outlined a similar method for inferring these parameters from Elo ratings subject to difference in serve ability, $f_{ij} - f_{ji} = \delta$, we set the constraint $f'_{ij} + f'_{ji} = b$ to ensure that overall serve ability agrees with historical data. As Klaassen and Magnus (2003) demonstrated, $b$ encodes important information regarding likely trajectories of a match score and the relative importance of service breaks.
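A runnable sketch of Algorithm 1 needs a concrete matchProb(). The version below is one possible implementation under stated assumptions, not the author's code: points are i.i.d., every set uses a tiebreak at 6-6, matches are best of three sets, and serve probabilities stay strictly between 0 and 1. The function names and the iteration cap are illustrative.

```python
from functools import lru_cache
from math import comb


def game_prob(p: float) -> float:
    """Server's chance of winning a game when each point is won with probability p."""
    @lru_cache(maxsize=None)
    def pg(a: int, b: int) -> float:
        if a == 4 and b <= 2:
            return 1.0
        if b == 4 and a <= 2:
            return 0.0
        if a == 3 and b == 3:  # deuce, closed form
            return p * p / (p * p + (1 - p) ** 2)
        return p * pg(a + 1, b) + (1 - p) * pg(a, b + 1)
    return pg(0, 0)


def tiebreak_prob(f_i: float, f_j: float) -> float:
    """Chance that player i wins a first-to-7 tiebreak, with i serving the first point."""
    @lru_cache(maxsize=None)
    def pt(a: int, b: int) -> float:
        if a >= 7 and a - b >= 2:
            return 1.0
        if b >= 7 and b - a >= 2:
            return 0.0
        if a == 6 and b == 6:  # each player serves one of the next two points
            wi, wj = f_i * (1 - f_j), (1 - f_i) * f_j
            return wi / (wi + wj)
        i_serves = ((a + b + 1) // 2) % 2 == 0  # serve order i, j, j, i, i, ...
        q = f_i if i_serves else 1 - f_j
        return q * pt(a + 1, b) + (1 - q) * pt(a, b + 1)
    return pt(0, 0)


def set_prob(f_i: float, f_j: float) -> float:
    """Chance that player i wins a tiebreak set, with i serving the first game."""
    hold_i, hold_j = game_prob(f_i), game_prob(f_j)
    tiebreak = tiebreak_prob(f_i, f_j)

    @lru_cache(maxsize=None)
    def ps(a: int, b: int) -> float:
        if a == 7 or (a == 6 and b <= 4):
            return 1.0
        if b == 7 or (b == 6 and a <= 4):
            return 0.0
        if a == 6 and b == 6:
            return tiebreak
        q = hold_i if (a + b) % 2 == 0 else 1 - hold_j
        return q * ps(a + 1, b) + (1 - q) * ps(a, b + 1)
    return ps(0, 0)


def match_prob(f_i: float, f_j: float, sets_to_win: int = 2) -> float:
    """Chance that player i wins the match (best of three by default, an assumption)."""
    s = set_prob(f_i, f_j)
    return sum(comb(sets_to_win - 1 + k, k) * s ** sets_to_win * (1 - s) ** k
               for k in range(sets_to_win))


def elo_serve_probabilities(pi: float, b: float, eps: float = 1e-3, max_iter: int = 60):
    """Bisection of Algorithm 1: find (f, b - f) whose match probability is within eps of pi."""
    f, diff = b / 2, b / 4
    current = match_prob(f, b - f)  # equal serve parameters give an even match
    for _ in range(max_iter):       # cap added as a safeguard in this sketch
        if abs(current - pi) <= eps:
            break
        f += diff if current < pi else -diff
        diff /= 2
        current = match_prob(f, b - f)
    return f, b - f


# Inputs from the Nishikori-Monfils case study (Section 4.3).
f_ij, f_ji = elo_serve_probabilities(0.7094, 0.6204 + 0.6263)
print(round(f_ij, 4), round(f_ji, 4))
```

Under the i.i.d. assumption the set-win probability does not depend on who serves first, which is why the sketch reuses one set probability for every set. The exact parameters it returns depend on these modelling choices and need not coincide with the figures reported in the case studies.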

Keeping $f'_{ij}, f'_{ji}$ consistent with $b$ therefore lends itself more naturally to in-match prediction, a clear future application of the methods explored here.

Most importantly, this approach allows us to generate serve parameters from forecasts that are not point-based. While Kovalchik (2016) explored a variety of point-based methods specific to tennis’ scoring system, none outperformed Elo ratings in predicting match outcomes. However, Elo ratings alone lack sufficient context to predict outcomes at a point level. Using the above approach, we significantly expand the possibilities when producing serve parameters for a given match. Should methods superior to Elo arise in the future, we may similarly infer $f'_{ij}, f'_{ji}$ from their match forecasts.

4. Case studies

The following examples illustrate applications of newly proposed methods in several ATP tour-level matches.

4.1. Efron-Morris estimator

To see how the Efron-Morris estimator makes our model robust to small sample sizes, consider the following match. When Daniel Elahi (COL) and Ivo Karlovic (CRO) faced off at ATP Bogota 2015, Elahi had played only one tour-level match in the past year. From a previous one-sided victory, his year-long percentage of service points won, $f_i = .7969$, was abnormally high compared to the year-long tour-level average of $f_{av} = .6423$.

Player name           Daniel Elahi   Ivo Karlovic
Service points won    51             3516
Service points        64             4654
f_i                   .7969          .7555
Return points won     22             1409
Return points         67             4903
g_i                   .3284          .2874
Elo rating            1585.93        1952.86

Following Barnett and Clarke’s method, we predict Elahi to win 89.25% of points on serve, which eclipses Karlovic’s forecast of 81.01%.

$$
\begin{aligned}
f_{ij} &= f_t + (f_i - f_{av}) - (g_j - g_{av}) = .8925 \\
f_{ji} &= f_t + (f_j - f_{av}) - (g_i - g_{av}) = .8101
\end{aligned}
$$

Given that Karlovic is one of the most effective servers in the history of the game, this estimate seems unrealistic. From the serving stats, our hierarchical Markov Model computes Elahi’s win probability as $\pi_{ij} = .8095$, mainly in consequence of only having collected his player statistics for one match. On the other hand, Karlovic becomes a strong favorite when we calculate Elahi’s win probability via Elo ratings:

$$
\begin{aligned}
d &= \frac{E_j(t) - E_i(t)}{400} = \frac{1952.86 - 1585.93}{400} = .9173 \\
\hat{\pi}_{ij}(t) &= (1 + 10^{d})^{-1} = .1079.
\end{aligned}
$$

This leads us to further question the validity of this approach when using limited historical data. Thus, we turn to the Efron-Morris estimator to shrink Elahi’s serving and return probabilities toward $f_{av}, g_{av}$.

$$
\begin{aligned}
f'_i &= f_i + B_i (f_{av} - f_i) = .6599 \\
f'_j &= f_j + B_j (f_{av} - f_j) = .7480 \\
g'_i &= g_i + B_i (g_{av} - g_i) = .3648 \\
g'_j &= g_j + B_j (g_{av} - g_j) = .2935 \\
f'_{ij} &= f_t + (f'_i - f_{av}) - (g'_j - g_{av}) = .7495 \\
f'_{ji} &= f_t + (f'_j - f_{av}) - (g'_i - g_{av}) = .7663
\end{aligned}
$$

Above, we can see that the Efron-Morris estimator shrinks Elahi’s stats far more than Karlovic’s, since Karlovic has played many more tour-level matches in the past year. Given $f'_i, f'_j$, we compute $\pi_{ij} = .4277$. By shrinking the serve and return parameters, our model normalizes Elahi’s inflated $f_i$ and demonstrates robustness to small sample sizes.

Figure 1 illustrates how normalization coefficient $B_i$ varies with $n$ throughout our dataset. When $n$ is small, as in Elahi’s case, the Efron-Morris estimator strongly shrinks estimates toward the mean. Over the first one thousand points of play, however, $B_i$ sharply declines in strength, approaching zero as match history accumulates further.

Fig. 1. Strength of $B_i$ against number of points played with $f_i = .64$, $\hat{\tau}^2 = .00176$.

4.2. Opponent-adjusted ratings

To illustrate the effect of tracking opponent-adjusted statistics, we consider the 2014 US Open first-round match between Mikhail Youzhny (RUS) and Nick Kyrgios (AUS).

Player name           Mikhail Youzhny   Nick Kyrgios
Service points won    1828              900
Service points        2960              1370
f_i                   .6176             .6569
Return points won     1145              424
Return points         2947              1323
g_i                   .3885             .3205
Elo rating            1941.65           1931.07

From the standard approach, with $f_t = .6583$, we compute $f_{ij} = .6446$, $f_{ji} = .6159$. To then calculate opponent-adjusted statistics, we must consider each player’s expected performance given their past opponents. As we observe, Kyrgios has faced slightly stronger opponents than Youzhny over the past twelve months.

Player name           Mikhail Youzhny   Nick Kyrgios
g'_i                  .4332             .4486
(1 − g'_i) ∗ n_ik     1677.62           755.44
f_i − (1 − g'_i)      .0508             .1055
f'_i                  .6982             .7237
(1 − f'_i) ∗ n_ik     889.32            365.59
g_i − (1 − f'_i)      .0868             .0441

In the above table, $g'_i$ represents overall return ability of player $i$’s opponents in the past twelve months. $(1 - g'_i) * n_{ik}$ then represents the expected number of points won on serve given player $i$’s match history and $f_i - (1 - g'_i)$ the performance relative to this baseline. Opponent-adjusted serve probabilities $f'_{ij}, f'_{ji}$ are calculated as follows:

$$
\begin{aligned}
f'_{ij} &= f_t + (f_i - (1 - g'_i)) - (g_j - (1 - f'_j)) = .6280 \\
f'_{ji} &= f_t + (f_j - (1 - g'_j)) - (g_i - (1 - f'_i)) = .6413.
\end{aligned}
$$

Using regular serve parameters, Youzhny is favored to win with $\pi_{ij} = .6410$. With opponent-adjusted serving percentages, we factor in the stronger opponents in Kyrgios’ match history and Youzhny’s win probability drops to $\pi_{ij} = .4334$. As we see, opponent-adjusted ratings can turn around forecasts when one player has faced tougher opponents over the past twelve months.

4.3. Klaassen-Magnus Elo ratings

Finally, we demonstrate the use of Klaassen-Magnus Elo ratings. In the quarterfinals of the 2016 Olympics at Rio de Janeiro, Kei Nishikori (JP) faced off against Gael Monfils (FRA).

Player name           Kei Nishikori   Gael Monfils
Service points won    3309            2345
Service points        5069            3533
f_i                   .6528           .6637
Return points won     2103            1433
Return points         5229            3608
g_i                   .4021           .3972
Elo rating            2295.56         2140.56

Though Elo ratings clearly favored Nishikori, serve/return statistics over the past twelve months put Monfils at a slight advantage.

$$
\begin{aligned}
f_{ij} &= f_t + (f_i - f_{av}) - (g_j - g_{av}) = .6204 \\
f_{ji} &= f_t + (f_j - f_{av}) - (g_i - g_{av}) = .6263
\end{aligned}
$$

To produce serve parameters that reflect Nishikori’s Elo advantage and the pair’s overall serve ability, we implement the approach from Section 3.2 with $\pi_{ij} = .7094$, $f_{ij} = .6204$, $f_{ji} = .6263$.

$$
f'_{ij}, f'_{ji} = S(\pi_{ij}, f_{ij} + f_{ji}) = .6451, .6016
$$

Now we can forecast match outcomes with the hierarchical Markov Model, using serve parameters that respect each player’s Elo rating.

5. Results

We evaluated methods across three years of tour-level matches, including Barnett and Clarke’s standard approach and Knottenbelt’s Common-Opponent Model, outlined in Appendix A, as benchmarks. Across the board, Klaassen-Magnus Elo ratings fared the best, while Efron-Morris estimators improved performance to a varying degree.
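For reference, the shrinkage step behind the Efron-Morris variants defined in Section 3.1.1 can be sketched as below. The function name is illustrative, and reusing the value of $\hat{\tau}^2$ from the Figure 1 caption is an assumption made only for this example, so the numbers it prints need not match those in the Section 4.1 case study.

```python
def efron_morris(f_hat: float, n: int, f_av: float, tau2: float):
    """Shrink an observed rate f_hat (from n points) toward the average f_av.

    Returns (shrunken estimate, shrinkage coefficient B).
    """
    sigma2 = f_hat * (1.0 - f_hat) / n   # binomial variance of the sample rate
    b = sigma2 / (tau2 + sigma2)         # B -> 1 for tiny samples, B -> 0 for large ones
    return f_hat + b * (f_av - f_hat), b


F_AV = 0.6423    # tour-level serve average from Section 4.1
TAU2 = 0.00176   # taken from the Figure 1 caption; assumed to apply here

# Service records listed in Section 4.1: Elahi 51 of 64, Karlovic 3516 of 4654.
elahi_est, elahi_b = efron_morris(51 / 64, 64, F_AV, TAU2)
karlovic_est, karlovic_b = efron_morris(3516 / 4654, 4654, F_AV, TAU2)

print(f"Elahi:    B = {elahi_b:.3f}, shrunken serve rate = {elahi_est:.4f}")
print(f"Karlovic: B = {karlovic_b:.3f}, shrunken serve rate = {karlovic_est:.4f}")
```

The player with one match on record is pulled strongly toward the tour average, while the player with a full season of data is left almost unchanged.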

492 5.1. Dataset Then we computed accuracy and log loss by compar- 531 ing these forecasts with observed wins or losses for 532 493 This project drew from a publicly available match each match in Mt. In measuring accuracy, we clas- 533 494 dataset (Sackmann, 2018). Match summary statistics sified πij as a predicted win when πij ≥ .5 and a loss 534 495 cover over 150,000 ATP tour-level matches dating otherwise. 535 496 back to 1968. Relevant features included:

r 5.3. Discussion 536 497 match date, tournament, surface type, player 498 names r We evaluated methods across 7,622 ATP tour-level 537 499 match serve/return statistics matches from 2014-16. To compute Klaassen- 538

500 The matches in this repository comprised dataset Magnus Elo ratings, we set the constraint b = fij + 539 501 M, from which we determined players’ historical fji, where fij ,fji were computed with the Efron- 540 502 serve/return performance and Elo ratings. While all Morris Estimator, as described in Section 3.1.1. 541 503 methods generated serve parameters from histori- Because a player’s performance can vary signifi- 542 504 cal data, none required a training set for tuning cantly across surfaces, we included a Barnett and 543 7 505 hyper-parameters. Excluding matches, Clarke variation which only considers performance 544 8 506 we designated all matches from 2014-16 as test set on the given match’s surface. In Table 1 and Table 2, 545 507 Mt and produced serve parameters for each match “EM” denotes a variation that used the Efron-Morris 546 508 in Mt based only on prior matches. Implementations estimator to compute fi,gi in Barnett and Clarke’s 547 509 of all methods in this paper may be found at . equation. 548 Table 1 displays each method’s RMSE in predict- 549

5.2. Evaluation

All methods produced parameters $f_{ij}, f_{ji}$ to estimate each player's probability of winning a point on serve in a given match. We evaluated methods by predicting both the winner of every match and each player's percentage of points won on serve. To evaluate serve performance prediction, we considered the RMSE (root-mean-square-error) between each method's parameters and observed performance. By observing the proportion of points won on serve by each player for a single match $k$, we obtained $s_{ik}, s_{jk}$ and realized error terms $e_{ik}, e_{jk}$:

$$s_{ik} = \frac{w_{ik}}{n_{ik}}, \quad s_{jk} = \frac{w_{jk}}{n_{jk}}$$

$$e_{ik} = |s_{ik} - f_{ij}|, \quad e_{jk} = |s_{jk} - f_{ji}|.$$

Over test set $M_t$, a method's RMSE computed to:

$$r = \sqrt{\frac{\sum_{k \in M_t} \left(e_{ik}^2 + e_{jk}^2\right)}{2|M_t|}}.$$

To produce match win forecasts, we returned to the hierarchical Markov Model. As a function of a method's parameters $f_{ij}, f_{ji}$ we calculated match win probability $\pi_{ij}$ recursively from the probabilities of winning sets and games, as described in Section 2.1.

ing the proportion of points won on serve. Of all Barnett-Clarke variations, the surface-based estimator fared worst. This was likely due to decreased sample sizes, as filtering matches by surface limited the amount of available data. Its Efron-Morris variant performed significantly better in all years, suggesting there may be value in a surface-specific approach when bias from limited data is offset. However, this variant only came close to reaching parity with the standard Barnett-Clarke model in the year 2014.

While it was intended to address issues with varying strength of schedule, the Common-Opponent Model underperformed all other methods in predicting serve performance. When two players shared few opponents across their match histories, this approach proved fairly susceptible to bias. Furthermore, while Barnett-Clarke variants considered the past twelve months of data, this approach considered all historical matches, curbing its ability to express recent trends as players' match histories accumulated. On the other hand, opponent-adjusted ratings demonstrated an improvement to both Barnett-Clarke and the Common-Opponent Model. By estimating serve and return ability on a continuous scale, opponent-adjusted ratings allowed us to consider all matches within the last twelve months and avoid a major shortcoming of the Common-Opponent Model.

Overall, Klaassen-Magnus Elo ratings proved most effective in forecasting serve performance. As Elo

7 Davis Cup matches frequently involve lower-ranked players.
8 All matches occurred on hard, clay, or grass courts.

Table 1
Serve performance prediction of 2014-16 ATP matches

Variation                              2014 (n=2,488)   2015 (n=2,540)   2016 (n=2,594)
                                       RMSE             RMSE             RMSE
Barnett-Clarke                         .0846            .0916            .0845
Barnett-Clarke EM                      .0823            .0909            .0823
Barnett-Clarke surface                 .0883            .0968            .0904
Barnett-Clarke surface EM              .0850            .0944            .0872
Barnett-Clarke opponent-adjusted       .0829            .0898            .0825
Barnett-Clarke opponent-adjusted EM    .0821            .0894            .0818
Klaassen-Magnus Elo                    .0804            .0890            .0798
Common-Opponent                        .0957            .1046            .0943

Table 2
Match win prediction of 2014-16 ATP matches

                                       2014 (n=2,488)            2015 (n=2,540)            2016 (n=2,594)
Variation                              Accuracy (%)  Log Loss    Accuracy (%)  Log Loss    Accuracy (%)  Log Loss
Barnett-Clarke                         64.8          .648        65.0          .634        64.6          .663
Barnett-Clarke EM                      65.6          .613        65.2          .609        64.7          .628
Barnett-Clarke surface                 63.3          .705        63.2          .712        62.0          .742
Barnett-Clarke surface EM              63.5          .628        62.9          .628        62.5          .645
Barnett-Clarke opponent-adjusted       67.8          .637        68.0          .613        67.4          .641
Barnett-Clarke opponent-adjusted EM    67.8          .625        68.1          .605        67.6          .632
Klaassen-Magnus Elo                    69.1          .589        69.3          .579        69.7          .594
Common-Opponent                        63.6          .628        64.5          .627        63.4          .650
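As an illustration of how the measures reported in Tables 1 and 2 could be computed, the following is a minimal Python sketch rather than the code used in this study. The RMSE follows the definition in Section 5.2. The record layout, the function names, the 0.5 decision threshold for accuracy, and the use of the standard binary log loss with natural logarithms are all assumptions made for the example.

```python
import math

def serve_rmse(matches):
    """RMSE r over a test set, following Section 5.2.

    Each record is a hypothetical tuple (w_ik, n_ik, f_ij, w_jk, n_jk, f_ji):
    points won on serve, points played on serve, and the forecast parameter
    for players i and j in match k.
    """
    total = 0.0
    for w_i, n_i, f_ij, w_j, n_j, f_ji in matches:
        e_i = abs(w_i / n_i - f_ij)  # e_ik = |s_ik - f_ij|
        e_j = abs(w_j / n_j - f_ji)  # e_jk = |s_jk - f_ji|
        total += e_i ** 2 + e_j ** 2
    return math.sqrt(total / (2 * len(matches)))

def accuracy(win_probs, outcomes):
    """Share of matches where the forecast favourite (pi_ij > 0.5) won.

    outcomes[k] is 1 if player i won match k and 0 otherwise. Treating
    pi_ij = 0.5 as a pick for player j is an assumption of this sketch.
    """
    hits = sum((p > 0.5) == bool(y) for p, y in zip(win_probs, outcomes))
    return hits / len(outcomes)

def log_loss(win_probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the observed winners (natural log)."""
    total = 0.0
    for p, y in zip(win_probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(outcomes)
```

Because log loss penalizes confident misses much more heavily than accuracy does, a method that tempers extreme forecasts can improve its log loss noticeably while its accuracy barely changes.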

ratings were already known to estimate player ability more effectively than point-based models, it follows that adjusting serve parameters in accordance with these ratings would significantly improve forecasts. In contrast, Barnett-Clarke variations still appear to leave out important information by ignoring match outcomes and forecasting strictly from point-level data. While point-based models have often proven necessary for predicting more granular outcomes (Barnett et al., 2006), Klaassen-Magnus Elo ratings reconciled point-based models' expressivity with Elo ratings' predictive power. Following the recent work of Kovalchik and Reid (2018), this further demonstrated the effectiveness of synthesizing Elo ratings with point-level models specific to tennis's scoring system.

When applied to Barnett-Clarke variants, the Efron-Morris estimator improved performance to varying degrees. Presumably, this stemmed from robustness with respect to small sample sizes. To better understand its application to our tour-level match dataset, we consider the distribution of points within player match histories.

Figure 2 illustrates the distribution of sample sizes when computing $f_i, f_j$ with an Efron-Morris estimator.$^9$ Under Barnett and Clarke's original approach, estimates produced with sample sizes from the leftmost range of the distribution were particularly susceptible to bias. However, recalling normalization strength from Figure 1, we know the Efron-Morris estimator reduced bias by shrinking these estimates furthest toward the mean. Since players accumulate return points over time at approximately the same rate, we can assume the estimator normalized $g_i, g_j$ in a similar manner. If only by reducing errant predictions in this range, the Efron-Morris estimator succeeded in helping all Barnett-Clarke variants more effectively predict serve performance.

Fig. 2. Number of points played on serve by players $i, j$ in the past twelve months before every match in $M_t$.

9 For a given match $k$ with serve parameters $f_{ik} = \frac{w_{ik}}{n_{ik}}, f_{jk} = \frac{w_{jk}}{n_{jk}}$, we considered $n_{ik}, n_{jk}$.

Table 2 details each method's performance in predicting match winners, where many of the same trends surfaced. Once again, variations with an Efron-Morris estimator outperformed their counterparts. In most cases, the improvement in log loss was greater than that of accuracy, supporting the notion that this estimator mitigates extreme results when faced with limited data. The opponent-adjusted model performed several percentage points better than Barnett and Clarke's original approach across all years, establishing itself as the most effective point-based method surveyed in this exploration. Once again, Klaassen-Magnus Elo ratings predicted outcomes most effectively. However, it is worth noting that this method produced win probabilities identical to those implied by the Elo ratings fed into the model. In that regard, its relative performance confirmed results from previous studies noting Elo's dominance in match prediction (Kovalchik, 2016).

6. Conclusion

Although Barnett and Clarke's approach has long been established in tennis match prediction, there are many ways to forecast serve performance. In order to extract more information from historical match data, we outlined a handful of novel approaches to generating the parameters in their equation. Using an Efron-Morris estimator improved performance across the board, emphasizing the need to address scenarios with limited match data. The opponent-adjusted model demonstrated a new way to quantify strength of schedule through point-level data, resulting in significantly improved performance over Barnett-Clarke and the Common-Opponent Model. Finally, Klaassen-Magnus Elo ratings synthesized historical serve data with the predictive power of Elo ratings while maintaining the overall serve ability between two players.

A clear next step to this exploration would involve applying these methods to in-match prediction. With point-by-point datasets available today, we can use these same methods to forecast match outcomes while play is in progress and set similar benchmarks with tour-level match data.

References

Barnett, T. and Clarke, S. R. 2005, Combining player statistics to predict outcomes of tennis matches, IMA Journal of Management Mathematics, 16(2), 113-120.

Barnett, T. J., Clarke, S. R., et al. 2002, Using Microsoft Excel to model a tennis match. In 6th Conference on Mathematics and Computers in Sport, pages 63-68. Queensland: Bond University.

Barnett, T. J. et al. 2006, Mathematical modelling in hierarchical games with specific reference to tennis. PhD thesis, Swinburne University of Technology.

Bialik, C., Morris, B., and Boice, J. 2016, How we're forecasting the 2016 U.S. Open. http://fivethirtyeight.com/features/how-were-forecasting-the-2016-us-open/. Accessed: 2017-10-30.

Efron, B. and Morris, C. 1975, Data analysis using Stein's estimator and its generalizations, Journal of the American Statistical Association, 70(350), 311-319.

Efron, B. and Morris, C. N. 1977, Stein's paradox in statistics. WH Freeman.

Elo, A. E. 1978, The rating of chessplayers, past and present. Arco Pub.

Klaassen, F. J. and Magnus, J. R. 2003, Forecasting the winner of a tennis match, European Journal of Operational Research, 148(2), 257-267.

Klaassen, F. J. G. M. and Magnus, J. R. 2001, Are points in tennis independent and identically distributed? Evidence from a dynamic binary panel data model, Journal of the American Statistical Association, 96(454), 500-509.

Knottenbelt, W. J., Spanias, D., and Madurska, A. M. 2012, A common-opponent stochastic model for predicting the outcome of professional tennis matches, Computers & Mathematics with Applications, 64(12), 3820-3827.

Kovalchik, S. and Reid, M. 2018, A calibration method with dynamic updates for within-match forecasting of wins in tennis, International Journal of Forecasting.

Kovalchik, S. A. 2016, Searching for the GOAT of tennis win prediction, Journal of Quantitative Analysis in Sports, 12(3), 127-138.

Newton, P. K. and Aslam, K. 2009, Monte Carlo tennis: a stochastic Markov chain model, Journal of Quantitative Analysis in Sports, 5(3).

Sackmann, J. 2018, Tennis atp. https://github.com/JeffSackmann/tennis_atp.

Spanias, D. and Knottenbelt, W. J. 2012, Predicting the outcomes of tennis matches using a low-level point model, IMA Journal of Management Mathematics, 24(3), 311-320.

Appendix A: Common-Opponent Model

Consider a match between players $i$ and $j$, who share $N$ common opponents throughout their entire match histories. To quantify their relative performance against opponent $C_k$, let $spw(i, C_k)$ denote the percentage of service points won by player $i$ against opponent $C_k$ and $rpw(i, C_k)$ the corresponding percentage of return points won. We may quantify player $i$'s advantage over player $j$ with respect to opponent $C_k$ as follows:

$$\Delta_k^{ij} = \big(spw(i, C_k) - (1 - rpw(i, C_k))\big) - \big(spw(j, C_k) - (1 - rpw(j, C_k))\big).$$

Then we approximate match win probability in terms of this relative advantage:

$$P(i \text{ beats } j \text{ via } C_k) = \frac{M_3(.6 + \Delta_k^{ij}, .4) + M_3(.6, .4 + \Delta_k^{ij})}{2}$$

In the above equation, $M_3(f, g)$ denotes player $i$'s win probability in a best-of-three match with $f, g$ representing their probability of winning a point on serve and return, respectively. To compute $\pi_{ij}$ via the Common-Opponent Model, we average the win probability over all $N$ common opponents shared:

$$\pi_{ij} = \frac{\sum_{k=1}^{N} P(i \text{ beats } j \text{ via } C_k)}{N}.$$

While Knottenbelt et al. (2012) did not originally use this model to predict serve performance, we inferred parameters $f_{ij}, f_{ji}$ from the model's win probability equation as follows:

$$\Delta^{ij} = \frac{\sum_{k=1}^{N} \Delta_k^{ij}}{N}$$

$$f_{ij} = .6 + \Delta^{ij}/2$$

$$f_{ji} = .6 - \Delta^{ij}/2.$$

Following the above steps, we may predict the winner and serve performance of any match using the Common-Opponent Model.
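The Common-Opponent steps in Appendix A can be sketched in a few lines of Python. This is a hedged illustration, not the implementation used in this study: the function names and the tuple layout are hypothetical, inputs are taken as proportions in [0, 1], and the best-of-three function $M_3$ is not implemented here but passed in by the caller as `m3`.

```python
from statistics import mean

def advantage(spw_i, rpw_i, spw_j, rpw_j):
    """Delta_k^{ij}: player i's edge over player j against one common opponent C_k."""
    return (spw_i - (1 - rpw_i)) - (spw_j - (1 - rpw_j))

def serve_parameters(stats):
    """Infer (f_ij, f_ji) from the average advantage over all common opponents.

    stats is a hypothetical list with one tuple per common opponent:
    (spw(i, C_k), rpw(i, C_k), spw(j, C_k), rpw(j, C_k)).
    """
    if not stats:
        raise ValueError("players share no common opponents")
    delta = mean(advantage(*s) for s in stats)
    return 0.6 + delta / 2, 0.6 - delta / 2

def win_probability(stats, m3):
    """pi_ij: mean of P(i beats j via C_k) over all common opponents.

    m3(f, g) must return a best-of-three match win probability given the
    probabilities of winning a point on serve (f) and on return (g);
    supplying it is left to the caller.
    """
    if not stats:
        raise ValueError("players share no common opponents")
    probs = []
    for s in stats:
        d = advantage(*s)
        probs.append((m3(0.6 + d, 0.4) + m3(0.6, 0.4 + d)) / 2)
    return mean(probs)
```

For instance, with two common opponents and inputs `[(0.65, 0.40, 0.62, 0.36), (0.70, 0.38, 0.66, 0.41)]`, the per-opponent advantages are roughly 0.07 and 0.01, so `serve_parameters` returns values close to 0.62 and 0.58. The two parameters always sum to 1.2, which reflects the fixed .6 baseline in the equations above.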