A Common-Opponent Stochastic Model for Predicting the Outcome of Professional Tennis Matches

Computers and Mathematics with Applications 64 (2012) 3820–3827

William J. Knottenbelt, Demetris Spanias ∗, Agnieszka M. Madurska

Department of Computing, Imperial College London, South Kensington Campus, London, SW7 2AZ, United Kingdom

Keywords: Stochastic modelling; Tennis; Sport

Abstract

Tennis features among the most popular sports internationally, with professional matches played for 11 months of the year around the globe. The rise of the internet has stimulated a dramatic increase in tennis-related financial activity, much of which depends on quantitative models. This paper presents a hierarchical Markov model which yields a pre-play estimate of the probability of each player winning a professional singles tennis match. Crucially, the model provides a fair basis of comparison between players by analysing match statistics for opponents that both players have encountered in the past. Subsequently the model exploits elements of transitivity to compute the probability of each player winning a point on their serve, and hence the match. When evaluated using a data set of historical match statistics and bookmakers' odds, the model yields a 3.8% return on investment over 2173 ATP matches played on a variety of surfaces during 2011.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Tennis has been growing in popularity around the globe ever since the first Wimbledon Championships were held at the All England Club in 1877, attracting an audience of around 200.
Indeed, in China alone, an estimated audience of 65 million¹ watched Li Na play Francesca Schiavone in the 2011 French Open Women's final. Stimulated by the growth of the internet, this rise in popularity has been accompanied by a dramatic increase in the financial activity related to tennis, both in terms of traditional bookmaking and modern betting exchange volumes. For example, on the Betfair betting exchange more than £40 million was traded in the Match Odds market of the 2011 US Open Men's final. In recent times hedge funds have been set up to exploit the excellent risk diversification opportunities offered by sporting events including tennis matches.² A significant proportion of tennis-related financial activity depends on the availability of mathematical models for predicting the probability of victory of each player. It is such a model which forms the focus of this paper.

∗ Corresponding author. Tel.: +44 7588053479.
¹ http://www.guardian.co.uk/sport/2011/jun/04/french-open-2011-li-na-francesca-schiavone.
² http://www.businessweek.com/magazine/content/10_29/b4187069936116.htm.
doi:10.1016/j.camwa.2012.03.005

Professional singles tennis is an attractive sport to model mathematically for a number of reasons. Firstly, each match is played between just two players, as opposed to the multitude of players involved in a team-based game such as football, rugby or basketball. Consequently, there is no need to analyse the offensive and defensive strengths of possible combinations of starting line-ups; nor do we need to consider the likelihood and potential effects of events such as substitutions or players being sent off. Secondly, a wealth of statistics about every match is gathered and made publicly available both by professional bodies and websites. The detail of these statistics is bound to increase in the future due to the use of systems such as Hawk-Eye, and perhaps future innovations in video and audio processing of match footage such as the rally analysis proposed by Hunter [1]. This facilitates model parameterisation using historical data and backtesting. Thirdly, there are only two possible outcomes of a match, whereas in sports such as horse racing numerous outcomes are possible. Finally, scoring in tennis is fine-grained, resulting in more gradual in-play odds movements.

There have been numerous attempts to solve the problem of modelling a tennis match. Clarke and Dyte, Klaassen and Magnus, and Radicchi approached tennis modelling using player rankings [2–4]. Liu, Barnett, Newton and Keller, and O'Malley all generated equivalent hierarchical expressions to model the probabilities of winning a tennis game, set and match based on the probability of a player winning a point while serving [5–8]. Tennis is an ideal candidate for a hierarchical model as a match consists of a sequence of sets, which in turn consist of a sequence of games, which in turn consist of a sequence of points. By making the assumption that points are independent and identically distributed (i.i.d.) [9], one can create a hierarchical model for the match which depends only on the probabilities of the two players winning a point while serving. This is why authors like Barnett and Clarke, Newton and Aslam, and Spanias and Knottenbelt focus in detail on the estimation of that variable [10–12].

In particular, the technique presented by Barnett and Clarke, subsequently adopted by Newton and Aslam, and Spanias and Knottenbelt, approaches this problem by averaging each player's match statistics over a period of time (optionally filtered by surface type).
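To make the hierarchical idea above concrete, the following minimal sketch computes the lowest level of such a hierarchy: the probability that the server wins a single game, given a constant probability p of winning each point on serve. It uses the standard closed form that follows from the i.i.d. assumption. The function name `game_win_probability` is our own, and the expression is not claimed to be the exact form used in any of the cited papers.

```python
def game_win_probability(p: float) -> float:
    """Probability that the server wins a game, assuming every point on
    serve is won independently with the same probability p (i.i.d.)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p

    # Server wins before deuce: the server takes the final point and the
    # receiver takes 0, 1 or 2 of the points played before it.
    # Orderings: C(3,0) = 1, C(4,1) = 4, C(5,2) = 10.
    win_before_deuce = p**4 * (1 + 4 * q + 10 * q**2)

    # Deuce is reached when each player has won 3 of the first 6 points:
    # C(6,3) = 20 orderings.
    reach_deuce = 20 * p**3 * q**3

    # From deuce the server must win two points in a row, possibly after
    # any number of split pairs of points (a geometric series).
    win_from_deuce = p**2 / (1 - 2 * p * q)

    return win_before_deuce + reach_deuce * win_from_deuce


if __name__ == "__main__":
    for p in (0.50, 0.60, 0.65):
        print(f"p = {p:.2f} -> P(server wins game) = {game_win_probability(p):.4f}")
```

Under these assumptions, a server who wins 60% of points on serve wins roughly 74% of service games, which illustrates how a small edge at the point level is amplified at each level of the hierarchy. Set and match probabilities can be built on top of this function in the same way.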
Averaging in this way represents the performance of a player against the average opponent faced during that time period rather than against a particular opponent. Barnett and Clarke attempt to combine the serving ability of one player against an average opponent with the returning ability of the other player against an average opponent. To do this, their approach computes two statistics for each player, $f_i$ and $g_i$, reflecting their serving and receiving strengths against an average opponent respectively [10]:

$$f_i = a_i b_i + (1 - a_i) c_i$$

$$g_i = a_{av} d_i + (1 - a_{av}) e_i$$

where:
$f_i$ = proportion of points won on serve by player $i$,
$g_i$ = proportion of points won on return by player $i$,
$a_i$ = probability of successful first serve by player $i$,
$a_{av}$ = average probability of successful first serve (across all players),
$b_i$ = proportion of points won by player $i$ on own successful first serves,
$c_i$ = proportion of points won by player $i$ on own second serves,
$d_i$ = proportion of points won on return by player $i$ against the opponent's first serves, and
$e_i$ = proportion of points won on return by player $i$ against the opponent's second serves.

Now for the particular case where player $i$ plays a match against player $j$, the proportion of points won on serve by player $i$, $f_{ij}$, is estimated as [10]:

$$f_{ij} = f_t + (f_i - f_{av}) - (g_j - g_{av}) \qquad (1)$$

where:
$f_t$ = average percentage of points won on serve for the tournament,
$f_{av}$ = average percentage of points won on serve (across all players), and
$g_{av}$ = average percentage of points won on return (across all players).

Although intuitively appealing, this solution is not perfect. A good player is more likely to advance in a given tournament and therefore to face other strong players. At the same time a weaker player will usually drop out earlier, and on average play against less demanding opponents. This immediately distorts any results, as for each of the two players the notion of the "average opponent" can be quite different. The method we propose here eliminates this bias by exploiting statistics from matches played against common opponents (i.e. opponents that both players faced in the past).

The intuition behind our approach is that tennis is a transitive sport to a certain extent. Thus, we should be able to exploit the fact that if Player A is better than Player C and Player C is better than Player B, then (usually) Player A is better than Player B. The challenge is to combine the results of matches between A and C and the results of matches between B and C to reason about a match between A and B.

In the case of professional singles tennis matches, this reasoning has the potential to be especially fruitful, because of the limited number of players involved. There are only roughly 150 active professional tennis players in each of the ATP and WTA tours. Within each tour, players frequently play against each other in a variety of tournaments. Although there are a limited number of head-to-head encounters for any two given players, many pairs of players share a rich set of common opponents.

Fig. 1. Probability of the better player winning a best-of-three-sets tennis match with fixed differences of 0.01, 0.02, 0.05 and 0.10 in the two players' probability of winning a point on serve.

2. O'Malley's tennis formulae

When calculating our results we used the hierarchical equations developed by Barnett [13] and O'Malley in combination with our proposed method of calculating the probability of a player winning a point on serve. Both sets of equations assume that the probabilities of a player winning a point on serve and on return are constant throughout the match and are independent of their previous values. We briefly describe O'Malley's closed-form equations here, as they will be referred to in the application of our model. O'Malley derives the probabilities of a certain player winning
