Actions Speak Louder Than Goals: Valuing Player Actions in Soccer
Total Page:16
File Type:pdf, Size:1020Kb
Actions Speak Louder than Goals: Valuing Player Actions in Soccer Tom Decroos Lotte Bransen KU Leuven SciSports Leuven, Belgium Amersfoort, Netherlands [email protected] [email protected] Jan Van Haaren Jesse Davis SciSports KU Leuven Amersfoort, Netherlands Leuven, Belgium [email protected] [email protected] ABSTRACT 1 INTRODUCTION Assessing the impact of the individual actions performed by soccer How will a soccer player’s actions impact his or her team’s perfor- players during games is a crucial aspect of the player recruitment mances in games? This question is relevant for a variety of tasks process. Unfortunately, most traditional metrics fall short in ad- within a soccer club such as player acquisition, player evaluation, dressing this task as they either focus on rare actions like shots and scouting. It is also important for the media and building fan and goals alone or fail to account for the context in which the engagement, as fans like nothing better than comparing players actions occurred. This paper introduces (1) a new language for de- and arguing why their favorite player is better than the others. scribing individual player actions on the pitch and (2) a framework Nevertheless, the task of objectively quantifying the impact of for valuing any type of player action based on its impact on the the individual actions performed by soccer players during games game outcome while accounting for the context in which the action remains largely unexplored to date. What complicates the task is happened. By aggregating soccer players’ action values, their total the low-scoring and dynamic nature of soccer games. While most offensive and defensive contributions to their team can bequan- actions do not impact the scoreline directly, they often do have tified. We show how our approach considers relevant contextual important longer-term effects. For example, a long pass from one information that traditional player evaluation metrics ignore and flank to the other may not immediately lead to a goal but canopen present a number of use cases related to scouting and playing style up space to set up a goal chance several actions down the line. characterization in the 2016/2017 and 2017/2018 seasons in Europe’s Most existing approaches for valuing actions in soccer suffer top competitions. from three important limitations. First, these approaches largely ignore actions other than goals and shots as most work to date has CCS CONCEPTS focused on the concept of the expected value of a goal attempt [1, 5, • Information systems → Data mining; Data stream mining; 20, 21]. Second, existing approaches tend to assign a fixed value to • Computing methodologies → Machine learning; Artificial each action, regardless of the circumstances under which the action intelligence; Supervised learning. was performed. For example, many pass-based metrics treat passes between defenders in the defensive third of the pitch without any KEYWORDS pressure whatsoever and passes between attackers in the offensive third under heavy pressure from the opponents similarly. Third, sports analytics; event stream data; soccer match data; valuing most approaches only consider immediate effects and fail to account actions; probabilistic classification for an action’s effects a bit further down the line. ACM Reference Format: To help fill the gap in objectively quantifying player perfor- Tom Decroos, Lotte Bransen, Jan Van Haaren, and Jesse Davis. 2019. Actions mances, this paper proposes a novel data-driven framework for arXiv:1802.07127v2 [stat.AP] 10 Jul 2019 Speak Louder than Goals: Valuing Player Actions in Soccer. In The 25th ACM valuing actions in a soccer game. Unlike most existing work, it con- SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), siders all types of actions (e.g., passes, crosses, dribbles, take-ons, August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 11 pages. and shots) and accounts for the circumstances under which each of https://doi.org/10.1145/3292500.3330758 these actions happened as well as their possible longer-term effects. Intuitively, an action value reflects the action’s expected influence Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed on the scoreline. That is, an action valued at +0.05 is expected to for profit or commercial advantage and that copies bear this notice and the full citation contribute 0.05 goals in favor of the team performing the action, on the first page. Copyrights for components of this work owned by others than ACM whereas an action valued at -0.05 is expected to yield 0.05 goals for must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a their opponent. Our approach fits within the growing line of data fee. Request permissions from [email protected]. science research for analyzing sports data (e.g., [9, 19, 24, 27]). KDD ’19, August 4–8, 2019, Anchorage, AK, USA In summary, this paper makes the following five contributions: © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6201-6/19/08...$15.00 https://doi.org/10.1145/3292500.3330758 (1) a language for representing player actions; (2) a framework for valuing player actions and rating players events and more detailed information. For example, Opta has four based on their impact on the game; different event types for a shot, depending on its result, which (3) a model for predicting short-term scoring and conceding makes it extremely cumbersome to query shot characteristics. probabilities at any moment in a game; The fourth challenge is that most vendors offer optional informa- (4) a number of use cases showcasing our most interesting re- tion snippets per type of event. For example, for fouls, Opta often sults and insights; specifies more details on the exact type of foul that was commit- (5) a Python package1 that (a) converts existing event stream ted. While sometimes useful, this dynamic information makes it data to our language, (b) implements our framework, and extremely hard to apply automatic analysis tools. (c) constructs a model that estimates scoring and conceding The final challenge is that most machine learning algorithms probabilities. require fixed-length feature vectors and cannot handle variable- sized vectors arising from, e.g., the sporadic presense of optional 2 SPADL: A LANGUAGE FOR DESCRIBING information snippets. Therefore, analysts usually have to write a PLAYER ACTIONS complicated event preprocessor that extracts the features relevant to their analysis. Developing these preprocessors requires an excessive Two primary data sources about soccer games exist that can be amount of programming effort and intimate knowledge of the event used to value actions: (1) event stream data and (2) optical tracking stream format, but the end result is a one-off script tailored to one data. Event stream data annotates the times and locations of specific specific vendor’s current event stream format. events (e.g., passes, shots, and cards) that occur in a game. Optical tracking data records the locations of the players and the ball at a high frequency using optical tracking systems during games. 2.2 Language description Multiple different companies (e.g., Opta, Wyscout, STATS, Second Based on domain knowledge and feedback from soccer experts, we Spectrum, SciSports, and StatsBomb) generate one or both of these propose SPADL (Soccer Player Action Description Language) as an types of data. Due to the high cost of optical tracking systems, attempt to unify the existing event stream formats into a common tracking data is only available in wealthy leagues or clubs, while vocabulary that enables subsequent data analysis. It is designed event stream data is more widely and cheaply available. Moreover, to be human-interpretable, simple and complete to accurately de- optical tracking data is usually not shared across leagues. Hence, fine and describe actions on the pitch. The human-interpretability this paper focuses exclusively on event stream data. However, the allows reasoning about what happens on the pitch and verifying contributions in this paper could also be applied to full tracking whether the action values correspond to soccer experts’ intuitions. data with some minor extensions. A key challenge from a data The simplicity reduces the chance of making mistakes when au- science perspective is that the nature of the event stream data tomatically processing the language. The completeness enables complicates analysis. We first describe the data science challenges expressing all the information required to analyze actions in their posed by currently available event stream data and then show how full context. our proposed SPADL language addresses these challenges. To address the challenges posed by the variety of event stream formats and to benefit the data science community, we release 2.1 Five data science challenges posed by a Python package that automatically converts event streams to current event stream data SPADL. Our package currently supports event streams provided by Opta, Wyscout, and StatsBomb. The first challenge is that the event stream data serves multiple SPADL is a language for describing player actions, as opposed different objectives (e.g., reporting information to broadcasters, to the formats by commercial vendors that describe events. The newspapers, or soccer clubs), which means that the data is not distinction is that actions are a subset of events that require a necessarily designed to facilitate data analysis. Some important player to perform the action. For example, a passing event is an information can be missing (e.g., Wyscout does not record exact action, whereas an event signifying the end of the game is not end locations for shots).