Journal of Sports Analytics 4 (2018) 73–84 73 DOI 10.3233/JSA-170175 IOS Press A machine learning approach to analyze ODI cricket predictors
Kalanka P. Jayalath∗ Department of Mathematics and Statistics, University of Houston Clear Lake, Houston, TX, USA
Abstract. As one-day international (ODI) games rise in popularity, it is important to understand the possible predictors that affect the game outcome. The home-field advantage, coin-toss result, bat-first or second, and day vs day-night game format are such popular variables being considered in the cricket literature. This article focuses on a comprehensive study of quantifying the significance of those important predictors via graphical ‘classification and regression tree’ (CART) and the popular logistic regression approaches. This study reveals the importance of the home-field advantage for major cricket playing nations in one-day international games but questions the uniformity of such factors under different playing conditions. Importantly, the home-field advantage is investigated further based on the opponent’s geographical location. Conclusively, the CART approach provides interesting and novel interpretations for popular predictors in ODI games.
Keywords: Classification trees, cricket, logistic regression, ODI, regression trees
1. Introduction that the home-field effect is well researched in sport statistics literature. For instance, Clarke and Norman Cricket has become one of the world’s most popu- (1995) discuss the home-field effect in English soc- lar outdoor sports. The International Cricket Council cer leagues, Pollard and Gomez´ (2014) discuss the (ICC) identified 106 cricket playing nations which home-field effect of men and women soccer leagues 10 of them are full members, 37 of them are asso- in Europe, and Karlis and Ntzoufras (1998) dis- ciates, and the remaining 59 are affiliate members. cuss a predictive modeling method to predict soccer One day international (ODI) is one of the three main outcomes using many factors including the home- types of cricket matches and is considered the most field effect. Harville and Smith (1994) discuss the popular. One reason for its popularity is due to recent home court advantage in college basketball games. advances in technology which allows for a day-night Levernier and Barilla (2007) discuss the home-field format as opposed to the classical day-only form. In advantage in major league baseball using logit mod- addition, umpire decisions and their review systems, els, Vaz et al. (2012) discuss the effect of alternating more strict bowling rules, and options of power plays home and away field advantage in rugby champi- (Silva et al., 2015) have turned it into a more offensive onship, and recently Ribeiro et al. (2016) discuss the game. These changes have made 21st-century cricket microscopic, team-specific, and evolving features of more competitive and spectacular than ever before. home NBA games. There are numerous factors that can affect a cricket Cricket is increasingly popular among the statis- game’s outcome. Like in many other team sports, tical science community, but the unpredictable and the home-field advantage is believed to be a crit- inconsistent natures of this game make it challenging ically important factor in cricket. It can be seen to apply in common probability models. However, numerous researchers successfully applied various
∗ statistical methods to cricket data. An early attempt Corresponding author: Kalanka P. Jayalath, Department of of modeling cricket batsmen data can be found in Mathematics and Statistics, University of Houston Clear Lake, Houston, TX 77058, USA. Tel.: +1 281 283 3778; E-mail: Elderton and Wood (1945). Interestingly, the home- [email protected]. field advantage in ODI cricket matches was discussed
2215-020X/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved This article is published online with Open Access and distributed under the terms of the Creative Commons Attribution Non-Commercial License (CC BY-NC 4.0). 74 K.P. Jayalath / A machine learning approach to analyze ODI cricket predictors by De Silva and Swartz (1998) and they found that the The article is organized as follows: in Section 2, effect is a significant advantage for the home team. we review the logistic regression model and introduce They also addressed a continent based advantage the ‘continent’ effect to better understand the home- when a game is played in a neutral venue. Fernando field advantage based on the geographical continents et al. (2013) applied a logistic regression method of the opponent team. In section 3, we introduce the to address the home-field advantage in ODI games. CART-based classification tree approach to identify The logistic regression modeling is a commonly the effects of the predictors on the game outcome. In used powerful and easily interpretable modeling tech- section 4, we suggest the use of the margin of victories nique. However, it may be questionable that the as the response variable in place of dichotomized win- factors such as home-field advantage and the form of loss responses and adapt the regression tree approach the game are uniformly influential regardless of the to better understand the predictors for the Australian other factors such as coin-toss result and the factors team. A discussion of conclusions is included in associated with the opponent team. Fitting sepa- section 6. rate models for different forms may apprehend the use of remaining predictors. This question is further discussed via a machine learning based alternative 2. Uncontrollable variables in cricket game modeling technique and the findings were compared to the results from logistic regression models. Like in many sports, ODI cricket has both The decision trees or recursive partitioning mod- controllable and uncontrollable variables. Playing els are a tree-like graph that can be used to make combination, in and out field tactics including decisions in machine learning and data mining disci- aggressive and offensive playing behaviors may be plines. These trees consist of nodes and branches that considered controllable variables. However, venue, form easily interpretable hierarchical structures. The game type (day-only or day-night) and coin-toss decision trees are increasingly popular in inductive result are the main uncontrollable variables in the inference and commonly used in machine learning ODI format. Another partially controllable variable and predictive analytics. Decision trees with a dis- is the choice of batting first or second in a game. It is crete or qualitative response variable that can take a decision made by the coin-toss winner of the two only a finite set of values are called classification competing teams. Hence, this variable can be identi- trees and decision trees with a continuous response fied as a conditional variable. A better understanding variable are called regression trees. In classification of the uncontrollable variables may be important in trees, nodes represent class labels and branches repre- the decision-making process in cricket matches. sent the decision rules that lead to those class labels. Previous studies (see De Silva and Swartz (1998), The idea of the hierarchical tree splitting possibly Allsopp (2005), and Bandulasiri (2008)) indicated first appeared in Belson (1959), and the subse- that variables such as home-field advantage, the result quent work related to Automatic Interaction Detector of coin-toss, and batting first or second may provide (AID) algorithm by Morgan and Sonquist (1963) evidence of some probabilistic relationships to the suggested growing binary regression trees. Messen- match outcome. A study conducted by Fernando et al. ger and Mandell (1972) and Morgan and Messenger (2013) used a logistic regression model to identify (1973) extended AID for categorical outcomes using possible predictors in ODI games. In the first half of theta criterion named THAID (THeta AID) algo- this study, we accommodate a similar logistic regres- rithm. An earlier discussion of these algorithms can sion model and our model includes uncontrollable be found in Fielding and O’Muircheartaigh (1977). variables such as home-field advantage (HM), day Classification and regression trees (CART) is a pow- or night game (DN), coin-toss result (TS), and bat- erful algorithm suggested by Breiman et al. (1984), ting first choice (BF). The win-loss log odds ratio is and CART can handle both qualitative and quan- used as the response variable of this logistic regres- titative response variables. Since this study mainly sion model. Subsequently, we extend this model to focuses on the binary outcomes (win or loss), we capture the opponent team’s continent effect. attempt to classify the outcome of the game via clas- sification tree approach. For instance, one may desire 2.1. The logistic regression approach to estimate the effect of home-field advantage when the home team lost the coin-toss in a day-night game Assume that the team i plays a game against the against an Asian team. team j for the k-th time and let pijk is the probability of K.P. Jayalath / A machine learning approach to analyze ODI cricket predictors 75 the team i wins that game. Then the win-loss odd ratio model (1) has been extended to have five different pijk of that game is logit(pijk ) = . Accordingly, the continent parameters for each team including its own 1−pijk continent. outcomes of such bilateral games can be modeled by the following logistic regression model, pijk ln = β + β DNi + β TSi + β BFi 1 − p 0 1 2 3 pijk ijk ln + + + 1 − pijk β4AFRj β5AMRj β6ASAj
= β0 + β1HMi + β2DNi + β3TSi + β4BFi, (1) +β7ERPj + β8OCNj, (2) where the predictors in this model are defined as where continent variables for away teams are defined follows. as 1 if the away team “j" ∈ Africa 1 if it is a home game for the team “i" = HMi = AFRj 0 otherwise 0 otherwise