The growing world of sports data and analysis

Biostatistics - CHL 5250

Snoopy

Biostatistics Dalla Lana School of Public Health University of Toronto

February 19, 2016

Introduction

Data is everywhere, and every industry is creating it. Manufacturing, marketing, health, genetics and education are a few examples of industries where researchers are generating, exploring and analyzing data. The sporting industry is also part of this game of data creation and exploration.

Sports data has been collected since the 1800s. Cricket was among the first sports to collect and keep records of game data. Baseball and basketball were also early collectors of game data. In time other sports joined in, compiling game data both for the team as a whole and for individual players. Individual sports such as those showcased in the Olympic Games have been collecting data for many years. Olympic results can be found online as far back as the Athens 1896 games, including winners of selected events and final results and standings, [9]. Some sports that have been latecomers to data collection and analysis, or that are more difficult to analyze, are soccer, football and hockey. These are touched upon later.

Analysis of sports data became better known to the general public with the release of the movie Moneyball, based on Michael Lewis’ 2003 book, [4]. It is the story of how Billy Beane, the general manager of the Oakland Athletics, used Sabermetrics, [5], to prospect and select players. The Oakland A’s had great success despite having a very small payroll in comparison to other teams. The team continually wins in the American League West but struggles in the postseason. Bill James is credited with creating Sabermetrics, one of the earliest attempts to use baseball’s performance metrics to benefit a team as a whole, [10]. In 1977 Bill James, [6], started publishing a series of abstracts that contained his insights, unorthodox ranking formulae, new statistical performance measures and other thoughts and ideas about baseball. It was through these publications that Sabermetrics was born. In 1977 Bill James’ Baseball Abstract sold only 50 copies. Now the Society for American Baseball Research, SABR, continues research into the sport of baseball, including statistical analysis, and its work is used and referenced by fans and enthusiasts all over.

Analyzing existing data provides useful information; however, there is growing interest in using data mining techniques to investigate the collected data. Another area of interest is using existing data to make predictions. In hockey, for example, predictions can inform which players to put on the ice together and when (i.e. the first, second, third or fourth lines and the defence players), which players to trade and when, and of course betting decisions. The disciplines used in data mining of sports data include statistics, artificial intelligence and machine learning.

This report briefly looks at sport and data. Some examples of areas of interest for analysis are presented. Techniques and tools that have been created for sports analysis are introduced. Some resources where sports data and analysis can be found are also presented. Finally, an overview of some more recent contributions to the field is given.

Areas of Analysis

In team sports, if not all sports, the goal is to win the game. But what contributes to a win? Individual player ability is of course very important. However, a team of top-ranked players whose strengths do not complement each other’s does not necessarily win. Team balance is also very important to consider. Games also require strategy, and strategy means looking at weaknesses, both within your own team and within your opposition. This can help determine which plays to execute, which players to use and when, which opposing players to keep an eye on and more. Another unfortunate and inevitable part of sport is injury. An injured player changes the entire balance of a team and its game. Player performance, team balance, opposition weaknesses and the possibility of debilitating injury are some of the more common areas of interest when analyzing sports data.

When attempting to analyze any one of these areas, for example player performance, a means of measuring it is needed. The plus-minus, +/−, statistic is one such measure. It was first introduced in hockey and has been an official National Hockey League, NHL, statistic since 1968. In 2003 Roland Beech of 82games.com, [12], brought it to the National Basketball Association, NBA, and soon after it became an official NBA statistic. Curiously, the statistic is more meaningful when there is more scoring happening in a game. The plus-minus statistic looks at how a team performs with a certain player in play versus how the team performs when that player is not in play. The overall impact the player has on the team’s success is then calculated. For example in basketball, let’s say a team is 8 points better than its opponents, +8, when player A is on the court. However, when player A is not on the court the team is 2 points worse, −2, than its opponents. To find player A’s +/− rating, the difference of these two values is taken, +8 − (−2) = +10. A positive value implies that the team performs better with player A on the court than without, [11].

Dean Oliver, the “Bill James” of basketball, focused his efforts on the use of possession statistics. Possession here means the period of time one team has the ball. Dean Oliver’s contribution was to evaluate a team’s performance by how many points it scored, or allowed opponents to score, per 100 possessions. The idea behind APBRmetrics, the basketball version of Sabermetrics named after the Association for Professional Basketball Research, [13], is that the team must function as an entire cohesive unit. As a result, the team as a whole should be analyzed, rather than trying to analyze how players perform with one another at an individual level.
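The plus-minus arithmetic and the per-100-possession measure described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, and the team totals in the second example are invented numbers rather than real data.

```python
def on_off_plus_minus(net_on: float, net_off: float) -> float:
    """Team point differential with the player on the court
    minus the differential with the player off the court."""
    return net_on - net_off


def points_per_100_possessions(points: float, possessions: float) -> float:
    """Scale a points total to a common base of 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


# Player A from the text: the team is +8 with A on the court and -2 without.
rating_a = on_off_plus_minus(8, -2)

# Hypothetical team totals, for illustration only.
offensive_rating = points_per_100_possessions(points=105, possessions=98)
defensive_rating = points_per_100_possessions(points=99, possessions=97)

print(f"Player A on/off rating: {rating_a:+}")
print(f"Points scored per 100 possessions: {offensive_rating:.1f}")
print(f"Points allowed per 100 possessions: {defensive_rating:.1f}")
```

Scaling to 100 possessions lets teams that play at different tempos be compared on the same footing, which is the motivation given for possession statistics above.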

Hockey is another sport that can be challenging to analyze, perhaps because it is a fast-paced, continuous game. However, it does have many metrics associated with it, and analysis is being attempted on varying aspects of the game. Among the metrics measured in hockey, possession metrics are of great interest. The idea is that whoever has the puck, player or team, has more control of events in the game, including a greater chance of scoring or of preventing the opposing team from scoring, and hence a better chance of winning. There are of course finer details, events and situations to consider, but this is the essence behind possession metrics. Possession metrics include shot%, Fenwick and Corsi values. Shot% may be thought of as the team’s shots on goal divided by the total shots on goal for and against the team under consideration. Fenwick is an expanded version of shot% that counts all shots, goals and misses for and against the team under consideration. Corsi values are similar to Fenwick values but also include blocked shots for and against the team. What is interesting about possession metrics is when these values are observed. A team’s ability to control events during a game differs depending on the state of the game: whether it is a “tied game” situation, the team is leading by 2 or 3 goals, or trailing by 1 or 2 goals. Normally teams in a tied game, or in a game with a 1 goal difference, will make more effort to gain and maintain control of events, and hence their drive to gain possession of the puck will increase. This is usually when the possession metric values are measured, [1].
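As a rough sketch, the three possession metrics can be computed from shot counts as follows. The sketch assumes each metric is reported as the team’s percentage share of the relevant shot attempts, which is one common convention; the `ShotCounts` structure, the function names and the sample counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShotCounts:
    on_goal: int   # shots on goal, including goals
    missed: int    # shot attempts that missed the net
    blocked: int   # shot attempts blocked before reaching the net


def share(count_for: int, count_against: int) -> float:
    """Percentage of the combined total that belongs to the team."""
    total = count_for + count_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * count_for / total


def possession_metrics(team: ShotCounts, opponent: ShotCounts) -> dict:
    """Shot%, Fenwick% and Corsi% for `team` against `opponent`."""
    # Fenwick counts unblocked attempts; Corsi adds blocked attempts.
    fenwick_for = team.on_goal + team.missed
    fenwick_against = opponent.on_goal + opponent.missed
    corsi_for = fenwick_for + team.blocked
    corsi_against = fenwick_against + opponent.blocked
    return {
        "shot_pct": share(team.on_goal, opponent.on_goal),
        "fenwick_pct": share(fenwick_for, fenwick_against),
        "corsi_pct": share(corsi_for, corsi_against),
    }


# Hypothetical single-game counts, for illustration only.
metrics = possession_metrics(ShotCounts(30, 12, 8), ShotCounts(25, 10, 15))
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

To reflect the point about game state, the same function could be applied only to shot attempts recorded while the score was tied or within one goal.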

As mentioned in the introduction, some sports are latecomers to data analysis. Historically, people in soccer have not given much consideration to statistics. Claims have been made that the nuances, beauty and elegance of soccer cannot be captured statistically. For example, when 3 or 4 beautiful passes are made before a goal, how is this statistically modeled? Perhaps it is the continuous nature of the game that makes it more daunting to represent its events with statistical models. Sports like baseball, football and basketball can be broken down into discrete events, e.g. a pitcher throws a ball, a batter hits a ball, a left fielder catches a ball, etc. There are over 29 categories or metrics that measure baseball hitters on Major League Baseball’s website http://mlb.mlb.com/home, [14], but only 6 categories for Major League Soccer, i.e. goals, assists, saves, shots, shots on goal and yellow cards, [15]. However, more data for soccer is being accumulated. Opta, a London-based company, and Match Analysis, a US company, are examples of companies dedicated to collecting soccer data. This is accomplished by watching live feeds of soccer games and charting and counting all distinct actions. In some cases up to 2500 actions are coded and put into databases with video footage. Data and statistics are being created that were not previously available. For example, a team’s number of touches was not previously recorded, but now that it is counted it has been shown to be highly correlated with the FIFA (Fédération Internationale de Football Association) world ranking. Not everyone feels that statistics in soccer are a good predictor of who will win. Still, in the professional ranks of sports, where signing players for millions of dollars is a game in itself, any extra information is welcome. This is one way in which the statistics behind the sport become valuable. There are companies working hard to provide soccer statistics. Prozone, a British company, has worked with the US national team. Amisco, a French company, has worked for Real Madrid and Chelsea. Opta, another UK company, has worked for Arsenal and Manchester City, while Match Analysis, a US company, has established a foothold in North America, [16].

Football is another latecomer whose statistical techniques are not as advanced as those for baseball and basketball. One possible reason is a lack of statistical data on individual players, apart from basics such as the number of touchdowns, receptions and interceptions. Another reason may be that only 16 games are played in a football season, compared with 162 games in a baseball season and 82 games in a basketball season. One useful measure in football, however, is defence adjusted value over average, DVOA. It is a comparative measure of success for a particular play. This statistic treats each play as a new event and measures its potential success against the average success of the league. A useful player-based statistic is defence adjusted points above replacement, DPAR. This is used to determine the point-based contribution of a player compared to the performance of a replacement player. For example, if a player has 3 DPAR then the team should score 3 points due to that player’s presence in the lineup, while 3 points would be lost if that player were replaced, [2].

Simulation, Methods and Tools

Predictive modeling is a long-standing goal of individuals and organizations interested in sports. Baseball has used simulation for a long time through fantasy and rotisserie leagues. A fantasy or rotisserie league is a game where participants act as owners and build teams that compete against other fantasy owners’ teams based on the statistics generated by the real individual players or teams of the particular sports league, [17]. For example, Markov chains can be used in simulations to find the optimal pinch hitter, a batter who is used to replace an existing player. Another way of predicting baseball’s division winners uses a two-stage Bayesian model based on a team’s relative strength. Here the assumption is that teams playing at home possess an advantage. This idea was used to simulate Major League Baseball’s, MLB, entire 2001 season. The method turned out to be quite accurate, predicting 5 of the 6 division winners by July 30th, 2001. Bayesian models have also been used to predict Cy Young winners, an award given annually to the best pitchers in the American and National baseball leagues in honour of Hall of Fame pitcher Cy Young, [2, 18].

Another interesting example of simulation being used in baseball prediction is the simulation method developed at Loyola Marymount. Here historical player data is used to predict future home run totals by analyzing the frequency distributions of home runs. The top performers, players with record-breaking seasons, are considered “large” events. The frequencies of these “large” events are related to the frequencies of “smaller”, individual events. This idea of relating “large” events to “small” events is based on a similar approach used to model earthquake frequencies and intensity distributions. When applied to baseball, a single hit can be thought of as a “small” tremor, whereas a cluster of hits would increase the model’s intensity, [2].

Other sports that have used simulation include hockey, football and soccer. In hockey, hidden Markov chains have been used to determine expected outcomes based upon where the puck is located and which team has possession. In football, regressive and autoregressive techniques have been used in simulation to determine the factors most responsible for scoring events. Finally, in soccer Monte Carlo methods have been used to simulate game play, [2].
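To give a flavour of what a Monte Carlo simulation of game play can look like, the following minimal sketch simulates many soccer matches and tallies the outcomes. It is not the method used in the cited work: it assumes, purely for illustration, that each team’s goals follow an independent Poisson distribution, and the scoring means and function names are hypothetical.

```python
import math
import random


def poisson_sample(rng: random.Random, mean: float) -> int:
    """Draw a Poisson variate by multiplying uniforms (suitable for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_match_outcomes(home_mean: float, away_mean: float,
                            n_sims: int = 10_000, seed: int = 0) -> dict:
    """Estimate home win, draw and away win probabilities by repeated simulation."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    tally = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home_goals = poisson_sample(rng, home_mean)
        away_goals = poisson_sample(rng, away_mean)
        if home_goals > away_goals:
            tally["home"] += 1
        elif home_goals < away_goals:
            tally["away"] += 1
        else:
            tally["draw"] += 1
    return {outcome: n / n_sims for outcome, n in tally.items()}


# Hypothetical average goals per match for each side.
probabilities = simulate_match_outcomes(home_mean=1.6, away_mean=1.1)
for outcome, p in probabilities.items():
    print(f"{outcome}: {p:.3f}")
```

A more realistic simulator would replace the fixed scoring means with quantities estimated from match data.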

While specific methods used in analysis will not be discussed in detail here, a brief outline of some of them is given. In the game of hockey, when to pull the goalie and replace the goalie with a skater is a question faced by many, if not all, coaches at some point during a game. Papers by researchers and hockey enthusiasts have discussed the best time to do so. A dynamic programming approach is used by Washburn, [27], to determine the optimal time to pull a goalie. Zaman, [28], considered the problem of when to pull the goalie using Markov chains. He defines 7 events and estimates the transition probabilities. Through his arguments he suggests when to pull the goalie based on the current location of the puck, i.e. the defensive, offensive or neutral zone.

Beaudoin and Swartz, [29], develop a program that simulates hockey games under certain situations and goalie-pulling strategies, with the aid of constrained Bayes estimation and Markov chain Monte Carlo methods. By repeatedly simulating each strategy, it can be assessed when to pull the goalie. In Addona and Yates, [30], the relative age effect in the NHL is studied. A child who is a few months older than another child when introduced to hockey has an advantage, as these few months can make a difference in size, strength and athletic ability. This advantage in turn carries on throughout the child’s athletic career in the sport. Change point analysis is used to study this effect, and a Bayesian change point analysis is used to show robustness.
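The goalie-pulling question above can be illustrated with a deliberately simplified calculation. The sketch below is not the dynamic programming, Markov chain or simulation approach of the cited papers. It assumes a team trailing by one goal, independent Poisson scoring, and that only the next goal matters; all scoring rates and function names are hypothetical.

```python
import math


def prob_scores_first(rate_for: float, rate_against: float, minutes: float) -> float:
    """P(the trailing team scores before the opponent and before time expires)."""
    total = rate_for + rate_against
    if total == 0:
        return 0.0
    return rate_for / total * (1.0 - math.exp(-total * minutes))


def prob_tie_with_pull(minutes_left: float, pull_at: float,
                       normal: tuple, pulled: tuple) -> float:
    """Chance the trailing team scores the next goal in time when the goalie
    is pulled with `pull_at` minutes remaining.

    `normal` and `pulled` are (rate_for, rate_against) in goals per minute.
    """
    pull_at = min(pull_at, minutes_left)
    normal_phase = minutes_left - pull_at
    p_normal = prob_scores_first(normal[0], normal[1], normal_phase)
    # Probability nobody scores before the goalie is pulled.
    p_no_goal = math.exp(-(normal[0] + normal[1]) * normal_phase)
    return p_normal + p_no_goal * prob_scores_first(pulled[0], pulled[1], pull_at)


# Hypothetical scoring rates (goals per minute), for illustration only.
NORMAL = (0.045, 0.045)
PULLED = (0.12, 0.30)

for pull_at in (0.0, 1.0, 2.0, 3.0, 5.0):
    p = prob_tie_with_pull(10.0, pull_at, NORMAL, PULLED)
    print(f"pull with {pull_at:.0f} min left: P(tie) = {p:.3f}")
```

Comparing the printed probabilities across pull times mimics, in a very crude way, how a strategy can be assessed; the published approaches also account for factors this sketch ignores, such as puck location and play after the next goal.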

Many computer tools have been developed to aid in the prediction of sport and other aspects of sports using collected sports data. For example, Advanced Scout, developed by IBM in the mid 1990s, is a data mining tool used to reveal hidden patterns within NBA game data, providing additional information to coaches and organization officials, [3]. SportsViz is a tool that provides a graphical way of finding interesting patterns in data, [20]. Sports Data Hub is an interactive data source for fantasy team owners, [19]. Front Office Football from Solecismic Software allows users to build a professional football team, adjust rosters and make play calls to test optimum team performance, [21]. Synergy Online allows individual access to real-time game data, [22].

Resources for Sports Data and Analysis

There are many free web resources for sports data and analysis. Some are developed by their respective league’s governing body. For example, MLB.com, NBA.com, NFL.com and NHL.com are sites intended for the general public where the raw data is presented in easy-to-read charts, graphs and projections. Some other sources for data are listed below.

Baseball
• Retrosheet - Provides historical game scoring and recording of relevant game events.
• MLB.com - http://mlb.mlb.com/home
• Baseball-reference.com - Contains major and minor league player and team data.
• Baseball Archive - Oldest baseball data website. This site was an answer to Bill James’ complaint about baseball statistics not being freely available to the public.

Basketball
• NBA - http://www.nba.com/
• Basketball-reference.com
• 82games.com

Cricket
• ESPNcricinfo - From ESPN; includes the StatsGuru tool, which provides sortable stats allowing users to look through the data and find interesting bits and pieces.
• Cricket21
• howstat.com

Football
• NFL - http://www.nfl.com/
• Pro-football-reference.com
• AdvancedNFLStats.com - Research-driven collection of football enthusiasts’ insights about the sport.

Soccer
• Match Analysis - no usable data
• MLS - http://www.mlssoccer.com/ - has the application GameNav, which provides video footage of game events.
• Soccer Base

Hockey
• nhl.com - http://www.nhl.com/
• Hockey-reference.com

General sports
• STATS - Tracks and maintains data for official sport governing bodies for subscribing members.
• ATSStats - Subscription-based site providing betting information.
• Open Source Sports - Various sports databases

Sports related journals

There are a handful of sports journals, many dedicated to sports medicine, marketing, business and more. Below are a few that are more closely related to quantitative work and analysis of sports.
• The Baseball Research Journal
• Journal of Quantitative Analysis in Sports
• Journal of Sports Economics
• Journal of Applied Sport Management

Discussion

Analysis of data by means of statistics, data mining, computing, machine learning and other available methods is growing in importance and relevance in all industries. This report has given a brief snapshot of the who, what, where, why and how of analysis in the sports industry. It is a growing field with additions being made daily. People of all ages are interested in sports. Sports franchise owners and team managers have an interest in obtaining the best value for their investment, and consulting sports-related analysis is one means of validating that investment. Sports enthusiasts who love the sport follow their favorite teams and players. These dedicated fans use their existing skills and their desire to learn more about their teams and favorite players, and end up providing their own analysis. For example, in the sport of hockey there are many individuals who dedicate their time to analyzing and interpreting the game, qualitatively and quantitatively.

• Hockey Analytics: http://hockeyanalytics.com/about/
• Hockey Graphs: http://hockey-graphs.com/
• Stats Hockey Analysis: http://stats.hockeyanalysis.com/
• Hockey Fans: http://www.hockey-fans.com/stats/

Fantasy leagues exist for fans to try their own hand at managing teams and players. Big Data Baseball is the “Moneyball” story for a new generation. It is the 2013 story of the Pittsburgh Pirates and how “big data” came to their rescue, [26]. Courses dedicated to sports analysis are being created and offered in university programs, [8, 31]. Sophisticated theories, models and methodologies such as Markov chains, Bayesian methods and regressive and autoregressive techniques are being used to analyze sports.

Recently the Montreal Canadiens, an NHL team, announced that they have hired free agent Matt Pfeffer as a data specialist on a part-time, contract basis. Matt Pfeffer, from Ottawa, is a university student at Trent University studying Economics. He got his start with the Ottawa 67’s in 2013 after publishing an article in TheScout, [25], detailing which midget AAA programs, the highest calibre of minor hockey, are overrated and underrated at turning out future productive OHLers, Ontario Hockey League players. He applies the economic theories he learns at school to hockey and its analysis. With the hiring of Matt Pfeffer, the Montreal Canadiens are no longer the last team in the Atlantic division without an official statistical analyst. The Ottawa Senators claim that honour, [23, 24].

Analysis of sports data is clearly gaining the respect and attention of all ranks within sports teams. No one wants to be left out. The region where sports, data and analysis intersect is growing, and with it so is the interest of the general public.

References

[1] Thomas Drance, NHL: Behind The Numbers - Possession metrics. Available from URL: http://blog.playnow.com/nhl-behind-the-numbers-possesion-metrics/
[2] Robert P. Schumaker, Osama K. Solieman, Hsinchun Chen, Sports Data Mining, New York, Springer, 2010.
[3] Inderpal Bhandari, Edward Colet, Jennifer Parker, Zachary Pines, Rajiv Pratap, Krishnakumar Ramanujam, Brief Application Description Advanced Scout: Data Mining and Knowledge Discovery in NBA Data, Data Mining and Knowledge Discovery, 1997, 1, 121-125.
[4] Michael Lewis, Moneyball: The Art of Winning an Unfair Game, New York, W.W. Norton & Company, 2003.
[5] Sabermetrics. Available from URL: http://sabr.org/sabermetrics
[6] Don Zminda, Bill James, The Baseball Research Journal, 2010, 39, (1), 125.
[7] G. M. Hall, How to Write a Paper, 3rd Ed., London, BMJ Books, 2003.
[8] Keith A. Willoughby, The Science of Sports: Combining Quantitative Analysis and Sports Applications in an Undergraduate Course, INFORMS Transactions on Education, 2004, 5, (1), 88-99.
[9] Olympic.org. Available from URL: www.olympic.org
[10] Rick Blechta, Bill Braund, How True is Moneyball? Available from URL: http://lateinnings.blogspot.ca/2012/03/how-true-is-moneyball.html
[11] Basketball Plus-Minus. Available from URL: http://www.sportingcharts.com/dictionary/nba/basketball-plus-minus.aspx
[12] 82games. Available from URL: http://82games.com/
[13] The Association for Professional Basketball Research. Available from URL: http://www.apbr.org/
[14] MLB.com. Available from URL: http://mlb.mlb.com/home
[15] Major League Soccer, MLS. Available from URL: http://www.mlssoccer.com/stats
[16] Thomas Kaplan, When It Comes to Stats, Soccer Seldom Counts. Available from URL: http://www.nytimes.com/2010/07/09/sports/soccer/09soccerstats.html?_r=1
[17] Fantasy sport. Available from URL: https://en.wikipedia.org/wiki/Fantasy_sport
[18] Cy Young Award. Available from URL: https://en.wikipedia.org/wiki/Cy_Young_Award
[19] Sports Data Hub. Available from URL: http://sportsdatahub.com/
[20] SportsViz. Available from URL: http://sportsviz.com/
[21] Solecismic Software. Available from URL: http://www.solecismic.com/fof/index.php
[22] Synergy Sports Technology. Available from URL: http://corp.synergysportstech.com/
[23] Marc Dumont, The Habs Hire Analytics Consultant Matt Pfeffer. Available from URL: http://www.habseyesontheprize.com/latest-news/2015/7/7/8909663/the-habs-hire-analytics-consultant-matt-pfeffer-canadiens-advance-statistics-employee
[24] Matt Douglas, Trent Student Matt Pfeffer Working Dream Job With Ottawa 67’s. Available from URL: http://trentarthur.ca/trent-student-matt-pfeffer-working-dream-job-with-ottawa-67s/
[25] The Scout.ca. Available from URL: http://thescout.ca/web/index.php
[26] Travis Sawchik, Big Data Baseball, New York, Flatiron Books, 2015.
[27] A. Washburn, Still more on pulling the goalie, Interfaces, 1991, 21, 59-64.
[28] Z. Zaman, Coach Markov Pulls Goalie Poisson, Chance, 2001, 14, (2), 31-35.
[29] David Beaudoin, Tim B. Swartz, Strategies for Pulling the Goalie in Hockey, The American Statistician, 2010, 64, (3), 197-204.
[30] Vittorio Addona, Philip A. Yates, A Closer Look at the Relative Age Effect in the National Hockey League, Journal of Quantitative Analysis in Sports, 2010, 6, (4), Article 9, 1-17.
[31] A Graduate Course in Statistics in Sport. Available from URL: http://people.stat.sfu.ca/~tim/papers/isi_portugal.pdf
[32] Case Study 2: Baseball Strategies, Case Studies in Data Analysis Poster Competition 2015, Statistical Society of Canada. Available from URL: http://ssc.ca/en/meetings/2015/case-studies#case2
