A Simple Bayesian Procedure for Forecasting the Outcomes of the UEFA Champions League Matches
Revised 12-02-15

Jean-Louis Foulley 1

Abstract: This article presents a Bayesian implementation of a cumulative probit model to forecast the outcomes of the UEFA Champions League matches. The argument of the normal CDF involves a cut-off point, a home vs away playing effect and the difference in strength of the two competing teams. Team strength is assumed to follow a Gaussian distribution, the expectation of which is expressed as a linear regression on an external rating of the team, e.g. the UEFA Club Ranking (UEFACR) or the Football Club World Ranking (FCWR). Priors on these parameters are updated at the beginning of each season from their posterior distributions obtained at the end of the previous one. This allows making predictions of match results for each phase of the competition: group stage and knock-out. An application is presented for the 2013-2014 season. Adjustment based on the FCWR performs better than on UEFACR. Overall, using the former provides a net improvement of 24% and 23% in accuracy and Brier's score over the control (zero prior expected difference between teams). A rating and ranking list of teams on their performance at this tournament and possibilities to include extra sources of information (expertise) into the model are also discussed.

Keywords: Football; UEFA Champions League; cumulative probit; Bayesian forecasting.

1 Affiliated to Institut de Mathématiques et de Modélisation de Montpellier (3M), Université de Montpellier II, 34095 Montpellier Cedex 05. E-mail: [email protected]

1. Introduction
There is a growing interest in statistics on sport and competitions from both a theoretical and an applied point of view, due in particular to the availability of a large amount of data, the development of betting internet sites and the large media coverage of sporting events.
Major competitions such as the UEFA Champions League and FIFA World Cup offer great opportunities to test and implement various models for analyzing the data generated by these competitions. These kinds of competitions are especially appealing to statisticians as they involve two different steps: i) a round-robin tournament where each contestant meets all the others within its group, followed for the top-ranked teams by ii) a knock-out stage in which only the winners of each round (round of 16, quarter-finals and semi-finals) play the next round, up to the final. These two steps raise some difficulties for the statistician, mainly because the groups drawn for the mini-championship differ in overall level, which can affect which teams qualify for and play in the next round. Another key aspect consists of fitting models to data either for rating and ranking teams or for forecasting outcomes of forthcoming matches. Although these two goals are not disconnected, they require different procedures and criteria to assess their efficiency. With this in mind, Section 2 presents a brief description of the competition and its different stages. Statistical methods are expounded in Section 3 with a focus on the cumulative probit model, its Bayesian implementation and its use in forecasting match results. An application to the 2013-2014 season is presented in Section 4 to illustrate these procedures. Finally, Section 5 summarizes this study and discusses several alternatives and possible improvements of the model.

2. The tournament
The tournament per se begins with a double round-robin group stage of 32 teams distributed into eight groups: A, B, C, D, E, F, G and H. The eight groups result from a draw among the 32 teams allocated into four "pots" based on their UEFA club coefficients (Table 1). Among the 32 participants, 22 were automatically qualified and the ten remaining teams were selected through two qualification streams, one for national league champions and one for non-champions.
In the case studied here (2013-14 season), the 22 teams were: the title holder of the previous season (Bayern Munich for 2013-2014), the top three clubs of the England (GBR), Spain (ESP) and Germany (DEU) championships, the top two of Italy (ITA), Portugal (PRT) and France (FRA), and the champions of Russia (RUS), Ukraine (UKR), Netherlands (NLD), Turkey (TUR), Denmark (DNK) and Greece (GRC).

Table 1: Distribution of the 32 qualified teams for UEFA 2013-14 into the four pots and their UEFA coefficient

Pot 1                          Pot 2                           Pot 3                         Pot 4
Bayern Munich*-DEU 146.922     Atlético Madrid-ESP 99.605      Zenith Petersburg-RUS 70.766  FC Copenhagen-DNK 47.140
FC Barcelona-ESP 157.605       Shakhtar Donetsk-UKR 94.951     Manchester City-GBR 70.592    SSC Napoli-ITA 46.829
Chelsea-GBR 137.592            AC Milano-ITA 93.829            Ajax Amsterdam-NLD 64.945     RSC Anderlecht-BEL 44.880
Real de Madrid-ESP 136.605     Schalke 04-DEU 84.922           Borussia Dortmund-DEU 61.922  Celtic Glasgow-SCO 37.538
Manchester United-GBR 130.592  O Marseille-FRA 78.800          FC Basel-CHE 59.785           Steaua Bucarest-ROU 35.604
Arsenal-GBR 113.592            CSKA Moscow-RUS 77.766          Olympiacos-GRC 57.800         Viktoria Plzen-CZE 28.745
FC Porto-PRT 104.833           Paris Saint Germain-FRA 71.800  Galatasaray SK-TUR 54.400     Real Sociedad-ESP 17.605
Benfica Lisbon-PRT 102.833     Juventus-ITA 70.829             Bayer Leverkusen-DEU 53.922   Austria Wien-AUT 16.575

Team underlined: National League Champion; Team in italics: qualified through the play-off rounds
*Title holder of the UEFA Champions League 2012-13

The group stage, played in autumn, consists of a mini-championship of twelve home and away matches between the four teams of the same group. The winner and the runner-up of each group progress to the knock-out stage, made up of home and away matches between the winner of one group and the runner-up of another group after a random draw held in December, which excludes pairings of teams from the same group or the same association country for the round of 16.
This exclusion rule does not apply later on (quarter-finals and semi-finals), the final being played as a single match on neutral ground. Points are based on the following scoring system: 3, 1 and 0 for a win, draw and loss respectively, with a preference for the goals scored at the opponent's stadium in case of a tied aggregate score.

3. Statistical methods

3.1 A benchmark model
Outcomes of matches under the format of Win, Draw and Loss can be predicted either directly or indirectly via the number of goals scored by the two teams. As there is little practical difference between these two approaches (Goddard, 2005), we adopt, for the sake of simplicity, the former approach in its latent ordered probit form, known in sports statistics as the Glenn and David (1960) model.

Let us briefly remind the reader of the structure of this model. The random variable $X_{ij}$ pertaining to the outcome of the match $m(ij)$ between home team $i$ and away team $j$ is expressed via the cumulative distribution function of an associated latent variable $Z_{ij}$ as follows (Agresti, 1992):

$p_{ij,1} = \Pr(X_{ij} = 1) = \Pr(Z_{ij} > \delta)$  (1a)
$p_{ij,2} = \Pr(X_{ij} = 2) = \Pr(-\delta \le Z_{ij} \le \delta)$  (1b)
$p_{ij,3} = \Pr(X_{ij} = 3) = \Pr(Z_{ij} < -\delta)$  (1c)

with $\delta$ being the threshold or cut-off point on the underlying scale and 1, 2, 3 standing for win, draw and loss respectively (from the home team's point of view). Assuming that $Z_{ij}$ has a Gaussian distribution with mean $\mu_{ij}$ and unit standard deviation, the cumulative probit model consists of expressing $\mu_{ij}$ as the sum of the difference $\Delta s_{ij} = s_i - s_j$ in strength between teams $i$ and $j$ and a home vs away playing effect $h_{ij}$, usually taken as a constant $h$, so that:

$\mu_{ij} = \Delta s_{ij} + h$  (2)
$p_{ij,1} = L(\delta - \Delta s_{ij} - h)$  (3a)
$p_{ij,3} = L(\delta + \Delta s_{ij} + h)$  (3b)
$p_{ij,2} = 1 - p_{ij,1} - p_{ij,3}$  (3c)

where $L(x) = \Pr(X > x) = 1 - F(x)$ is the survival function, equal to 1 minus the CDF $F(x)$, here the standard normal.
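As an illustration of (3a)-(3c), the following minimal Python sketch computes the three outcome probabilities from a strength difference, a home effect and a cut-off. The function name `outcome_probabilities`, its argument names and the numerical values are assumptions made for this example only; they are not estimates or code from the article.

```python
from statistics import NormalDist


def outcome_probabilities(delta_s, h, delta):
    """Return (p_win, p_draw, p_loss) for the home team, following (3a)-(3c).

    delta_s : strength difference s_i - s_j (home minus away)
    h       : constant home vs away playing effect
    delta   : cut-off point on the latent scale
    """
    std_normal = NormalDist()  # standard normal, so F(x) = std_normal.cdf(x)

    def survival(x):
        # L(x) = 1 - F(x)
        return 1.0 - std_normal.cdf(x)

    p_win = survival(delta - delta_s - h)   # (3a)
    p_loss = survival(delta + delta_s + h)  # (3b)
    p_draw = 1.0 - p_win - p_loss           # (3c)
    return p_win, p_draw, p_loss


# Illustrative values only (hypothetical, not taken from the article).
p_win, p_draw, p_loss = outcome_probabilities(delta_s=0.5, h=0.3, delta=0.4)
print(p_win, p_draw, p_loss)
```

With `delta_s = 0` and `h = 0` the win and loss probabilities coincide, and increasing either quantity shifts probability mass from a home loss towards a home win.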
The larger the difference in strength between teams $i$ and $j$, the higher the probability that team $i$ beats team $j$, as expected; the same reasoning applies to the well-known home vs away playing advantage.

3.2 Bayesian implementation
There are several ways to implement statistical methods for making inference about the parameters of this basic model. Classical methods consider the team strength and home effects as fixed and make inference about them and other covariate effects using maximum likelihood procedures: see e.g. the general review by Cattelan (2012) in the context of the Bradley-Terry model. For others, model fitting to data is accomplished within a Bayesian framework (e.g. Glickman, 1999), and this is the way chosen here. The first stage of the Bayesian hierarchical model reduces to a generalized Bernoulli or categorical distribution Cat(.) with probability parameters of the three possible outcomes described in (3abc):

1) $X_{ij} \mid \theta \sim \mathrm{Cat}(\Pi_{ij})$  (4)

where $\theta$ refers to all the model parameters and $\Pi_{ij} = (p_{ij,k})$ for $k = 1, 2, 3$.

At the second stage, we have to specify the distributions of the parameters involved in $\Pi_{ij}$, viz. team strength, cut-off value and home effect. This is done as follows:

2) $s_i \mid \eta_i, \sigma_s^2 \overset{id}{\sim} \mathcal{N}(\eta_i, \sigma_s^2)$  (5)

where the team strength $s_i$ is assumed normally distributed with mean $\eta_i$ and variance $\sigma_s^2$.
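To make the two stages concrete, the sketch below simulates the model forwards: team strengths are first drawn from (5), then match outcomes are drawn from the categorical distribution (4). It is a generative illustration only and does not perform the posterior inference described in the article. The helper names `draw_strengths` and `simulate_outcome`, the team labels and all numerical values (means, standard deviation, home effect, cut-off) are hypothetical.

```python
import random
from statistics import NormalDist


def draw_strengths(eta, sigma_s, rng):
    """Stage 2, eq. (5): draw s_i ~ N(eta_i, sigma_s^2) independently per team."""
    return {team: rng.gauss(mean, sigma_s) for team, mean in eta.items()}


def simulate_outcome(s_home, s_away, h, delta, rng):
    """Stage 1, eq. (4): draw X_ij from Cat(p_1, p_2, p_3) given the strengths."""
    cdf = NormalDist().cdf           # standard normal CDF F
    mu = s_home - s_away + h         # eq. (2)
    p_win = 1.0 - cdf(delta - mu)    # eq. (3a)
    p_loss = 1.0 - cdf(delta + mu)   # eq. (3b)
    # eq. (3c); max() guards against tiny negative rounding error
    p_draw = max(0.0, 1.0 - p_win - p_loss)
    # 1 = home win, 2 = draw, 3 = home loss
    return rng.choices([1, 2, 3], weights=[p_win, p_draw, p_loss])[0]


rng = random.Random(2014)  # fixed seed for reproducibility

# Hypothetical prior means for two teams (not values from the article).
eta = {"A": 0.4, "B": -0.2}
strengths = draw_strengths(eta, sigma_s=0.5, rng=rng)

# Simulate many replicates of the match A (home) vs B (away).
outcomes = [
    simulate_outcome(strengths["A"], strengths["B"], h=0.3, delta=0.4, rng=rng)
    for _ in range(2000)
]
frequencies = {k: outcomes.count(k) / len(outcomes) for k in (1, 2, 3)}
print(frequencies)
```

In a Bayesian fit the direction is reversed: observed outcomes are used to update the distributions of the strengths, the cut-off and the home effect, typically by a sampling algorithm rather than by direct simulation as here.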