Basketball Predictions in the NCAAB and NBA: Similarities and Differences Albrecht Zimmermann
Total Page:16
File Type:pdf, Size:1020Kb
Basketball predictions in the NCAAB and NBA: Similarities and differences Albrecht Zimmermann To cite this version: Albrecht Zimmermann. Basketball predictions in the NCAAB and NBA: Similarities and differences. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016, 9 (5), pp.350 - 364. 10.1002/sam.11319. hal-01597747 HAL Id: hal-01597747 https://hal.archives-ouvertes.fr/hal-01597747 Submitted on 28 Sep 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Basketball predictions in the NCAAB and NBA: similarities and differences Albrecht Zimmermann [email protected] February 6, 2016 Abstract cruiting and resource bases play a lesser role. Most work on predicting the outcome of basketball 2. Teams play almost all other teams every season. matches so far has focused on NCAAB games. Since 3. Teams play more games, particularly in the play- NCAAB and professional (NBA) basketball have a offs, where the NCAAB's "one and done\ is in number of differences, it is not clear to what degree sharp contrast to the NBA's best-of-seven series. these results can be transferred. We explore a num- ber of different representations, training settings, and To the best of our knowledge, it is not clear how classifiers, and contrast their results on NCAAB and those differences affect the task of learning a predic- NBA data. We find that adjusted efficiencies work tive model for the sports: the first point implies that well for the NBA, that the NCAAB regular season is prediction becomes harder, whereas the other two in- not ideal for training to predict its post-season, the dicate that there is more and more reliable data. two leagues require classifiers with different bias, and Most of the existing work in the field is more or less Na¨ıve Bayes predicts the outcome of NBA playoff se- statistical in nature, with much of it developed in blog ries well. posts or web columns. Many problems that can be addressed by statistical methods also offer themselves up as Machine Learning settings, with the expected 1 Introduction gain that the burden of specifying the particulars of the model shifts from a statistician to the algorithm. Predicting the outcome of contests in organized Yet so far there is relatively little such work in the sports can be attractive for a number of reasons ML literature. such as betting on those outcomes, whether in orga- We intend to add to the body of work on sports an- nized sports betting or informally with colleagues and alytics in the ML community by building on earlier friends, or simply to stimulate conversations about work [13], and evaluating different match representa- who \should have won". tions, learning settings, and classifiers. We compare Due to wide-spread betting during the NCAAB the results on NCAAB data to those on NBA data, playoffs (or "March Madness\), much work has been with a particular focus on post-season predictions. undertaken on predicting the outcome of college bas- In the next section, we will discuss how to represent ketball matches. It is not clear how much of that teams in terms of their performance statistics, fol- work can be transferred easily to the NBA. Profes- lowed by ways of performing prediction in Section 3. sional basketball shows several differences to college In Section 4, we discuss existing work on NCAAB and basketball: NBA match prediction. Sections 6 and 7 are given 1. Teams are typically closer in skill level, since re- to the evaluation of different prediction settings on 1 NCAAB and NBA data, respectively, before we com- in a game in which both teams combined to shoot pare classifier behavior on those two types of data in 100 times at the basket is different from 40 rebounds more detail in Section 8. when there were only 80 scoring attempts. For nor- malization, one can calculate the number of posses- sions in a given game: 2 Descriptive statistics for ∗ − − ∗ teams P ossessions = 0:96 (F GA OR TO+(0:475 FTA)) and derive teams' points scored/allowed per 100 pos- In this section, we will discuss the different options sessions, deriving offensive and defensive efficiencies: for describing basketball teams via the use of game PPG ∗ 100 statistics that we evaluate later in the paper. We OE = , will begin by a recap of the state-of-the-art, after- P ossessions P AG ∗ 100 wards discussing our own extensions and aggregate DE = (1) statistics over the course of the season. P ossessions It should be noted that the factor 0:475 is empirically 2.1 State of the art estimated { when first introducing the above formu- lation for the NBA, Dean Oliver estimated the factor The most straight-forward way of describing basket- as 0:4 [10]. ball teams in such a way that success in a match can This is currently the most-used way of describ- be predicted relate to scoring points { either scor- ing basketball teams in the NBA. When discussing ing points offensively or preventing the opponent's complete teams or certain line-ups (five (or fewer) scoring defensively. Relatively easy to measure offen- player groups), the phrase "points per 100 pos- sive statistics include field goals made (FGM), three- sessions" makes frequent appearance on sites such point shots made (3FGM), free throws after fouls as fivethirtyeight.com, www.sbnation.com, and (FT), offensive rebounds that provide an additional hardwoodparoxysm.com. attempt at scoring (OR), but also turnovers that de- While such statistics are normalized w.r.t. the prive a team of an opportunity to score (TO). De- \pace" of a game, they do not take the opponent's fensively speaking, there are defensive rebounds that quality into account, which can be of particular im- end the opponent's possession and give a team con- portance in the college game: a team that puts up im- trol of the ball (DR), steals that have the same effect pressive offensive statistics against (an) opponent(s) and make up part of the opponent's turnovers (STL), that is (are) weak defensively, should be considered and blocks, which prevent the opponent from scor- less good than a team that can deliver similar statis- ing (BLK). And of course, there are points per game tics against better-defending opponents. For best ex- (PPG) and points allowed per game (PAG). pected performance, one should therefore normalize The problem with these statistics is that they are w.r.t. pace, opponent's level, and national average, all raw numbers, which limits their expressiveness. If deriving adjusted efficiencies: a team collects 30 rebounds in total during a game, we cannot know whether to consider this a good re- OE ∗ avg (OE) AdjOE = all teams , sult unless we know how many rebounds were there to AdjDEopponent be had in the first place. 30 of 40 is obviously a bet- DE ∗ avg (DE) AdjDE = all teams (2) ter rebound rate than 30 of 60. Similar statements AdjOE can be made for field goals and free throws, which opponent is why statistics like offensive rebound rate (ORR), The undeniable success of those two statistics, pi- turnover rate (TOR), or field goals attempted (FGA) oneered by Ken Pomeroy [11], in identifying the will paint a better picture. Even in that case, how- strongest teams in NCAA basketball have made them ever, such statistics are not normalized: 40 rebounds the go-to descriptors for NCAA basketball teams. 2 Dean Oliver has also singled out four statistics as illustration, imagine a team that played on day 1, 3, being of particular relevance for a team's success, 10, and 15 and we want to make a prediction for day the so-called \Four Factors" (in order of importance, 16. For TOR, for instance, we therefore average in with their relative weight in parentheses): the following way (we use superscripts to denote the day on which the statistic has been recorded): Effective field goal percentage (0.4): F GM + 0:5 · 3F GM T ORavg = eF G% = (3) F GA 1 · T OR1 + 3 · T OR3 + 10 · T OR10 + 15 · T OR15 Turnover percentage (0.25): 1 + 3 + 10 + 15 = 28 TO TO% = (4) P ossessions The impact of the most recent match is more than Offensive Rebound Percentage (0.2): half in this case, whereas match 1 { two weeks ago OR OR% = (5) { has very little impact. Enumerating game days (OR + DROpponent) would give the first match a quarter of the impact of Free throw rate (0.15): the most recent one, instead. FTA FTR = (6) F GA 2.4 Calculating adjusted statistics Due to the averaging, each team's adjusted statistics 2.2 Adjusted Four Factors are directly influenced by their opponents', and indi- In an earlier work [13], we have introduced the idea rectly by those opponents' opponents. As an illustra- of adjusting the Four Factors in the same way as effi- tion, consider a schedule like the one shown in Table ciencies and evaluated their usefulness for predicting 1. To calculate T eam1's adjusted offensive efficiency college basketball matches. While multi-layer per- after match 3, we need to know T eam4's adjusted de- ceptrons (MLP) achieved better results using the ad- fensive efficiency before the match (i.e.