Archetypoid Analysis for Sports Analytics

Data Min Knowl Disc (2017) 31:1643–1677
DOI 10.1007/s10618-017-0514-1

G. Vinué¹ · I. Epifanio²

Received: 2 February 2016 / Accepted: 22 May 2017 / Published online: 3 June 2017
© The Author(s) 2017

Abstract

We aim to make sense of the growing amount of sports performance data by finding extreme data points, which makes human interpretation easier. In archetypoid analysis, each datum is expressed as a mixture of actual observations (archetypoids). It therefore allows us to identify not only extreme athletes and teams, but also the composition of other athletes (or teams) in terms of the archetypoid athletes, and to establish a ranking. The utility of archetypoids in sports is illustrated with basketball and soccer data in three scenarios. First, with multivariate data, where archetypoids are compared with alternative methods and give the best results. Second, although functional data (time series or trajectories) are common in sports, functional data analysis has not been exploited until now because of the sparseness of the functions. In this second scenario, we extend archetypoid analysis to sparse functional data and show the potential of functional data analysis in sports analytics. Finally, in the third scenario, features are not available, so we use proximities, and we extend archetypoid analysis to data with asymmetric relations. This study yields valuable knowledge about player, team and league performance that can be used to analyze athletes’ careers.

Responsible editors: A. Zimmermann and U. Brefeld.

This work has been partially supported by Grant DPI2013-47279-C2-1-R. The databases and R code (including the web application) to reproduce the results can be freely accessed at www.uv.es/vivigui/software.

G. Vinué [email protected]

¹ Department of Statistics and O.R., University of Valencia, 46100 Burjassot, Spain
² Dept.
Matemàtiques and Institut de Matemàtiques i Aplicacions de Castelló, Campus del Riu Sec, Universitat Jaume I, 12071 Castelló, Spain

Keywords Archetype analysis · Sports data mining · Functional data analysis · Extreme point · Multidimensional scaling · Performance analysis

1 Introduction

A high level of professionalism, advances in technology and complex data sets containing detailed information about player and team performance have contributed to the development of sport science (Williams and Wragg 2004). Sports performance analysis is a growing branch within sport science. It is concerned with the investigation of actual sports performance in training or competition (O’Donoghue 2010). One of the most important issues in sport science is to identify outstanding athletes (or teams) based on their performance. In particular, the question of who the best players in a competition are is at the center of debates between sport managers and fans. There are lists and rankings, each with its own criteria and biases. A thorough analysis of players’ performance has direct consequences for the composition of the team and for transfer policies, because this evaluation is used to decide whether the team should recruit a player or extend a player’s contract. To that end, managers and scouts assess players based on their knowledge and experience. However, this process relies on subjective criteria: the observer has developed notions of what a good player should look like based on his/her previous experience (Shea and Baker 2013). The evaluation is therefore subjective and biased, which may lead to flawed or incomplete conclusions. Traditional means of evaluating players and teams are best used in conjunction with rigorous statistical methods.

One interesting approach to providing objective evidence about how well (or badly) players perform, based on the statistics collected for them, is described by Eugster (2012).
The author uses archetype analysis (AA) to identify outstanding athletes (both positively and negatively). These are the players who differ most from the rest in terms of their performance. It has been shown that extreme constituents (Davis and Love 2010) facilitate human understanding and interpretation of data because of the principle of opposites (Thurau et al. 2012). In other words, extremes are better than central points for human interpretation.

AA was first proposed by Cutler and Breiman (1994). Its aim is to find pure types (the archetypes) in such a way that the other observations are a mixture of them. Archetypes are data-driven extreme points. As Eugster (2012) rightly points out, in sports these extreme points correspond to positively or negatively prominent players.

However, AA has an important drawback: archetypes are convex combinations of the sampled individuals, but they are not necessarily observed individuals. Furthermore, there are situations where archetypes are fictitious; see, for example, Seiler and Wohlrabe (2013). In sports, this can cause interpretation problems for analysts. To cope with this limitation, a new archetypal concept was introduced: the archetypoid, which is a real (observed) archetypal case (Vinué et al. 2015; Vinué 2014). Archetypoids accommodate human cognition by focusing on extreme opposites (Thurau et al. 2012). Furthermore, they make the results easier to understand intuitively, even for non-experts (Vinué et al. 2015; Thurau et al. 2012), since archetypoid analysis (ADA) represents the data as mixtures of extreme cases, and not as mixtures of mixtures, as AA does.

In this paper, we propose using ADA to find real outstanding (extreme) players and teams based on their performance information in three different scenarios. Firstly, in the multivariate case, where several classical sport variables (features) are available.
Secondly, in combination with sparse functional data, for which archetypoids are defined for the first time in this work. Thirdly and finally, when only dissimilarities between observations are known (features are unavailable) and these dissimilarities are not metric, but asymmetric proximities.

Functional data analysis (FDA) is a modern branch of statistics that analyzes data drawn from continuous underlying processes, often indexed by time; that is, a whole function is a datum. An excellent overview of FDA can be found in Ramsay and Silverman (2005). Even though functions are measured discretely at certain points, a continuous curve or function lies behind these data. The sampling time points do not have to be equally spaced, and both the argument values and their cardinality can vary across cases, which makes the FDA framework highly flexible.

On the one hand, our approach is a natural extension and improvement of the methodology proposed by Eugster (2012) with regard to multivariate data. On the other hand, the methodology can also be used with other available information, such as asymmetric relations and sparse functional data. The main goal is to provide sport analysts with a statistical tool for objectively identifying extreme observations with certain noticeable features and for expressing the other observations as a mixture of them. Furthermore, a ranking of the observations based on their performance can also be obtained. The application of ADA focuses on two mass sports: basketball and soccer. However, it can be used with any other sports data.

The main novelties of this work consist of:
1. Introducing ADA to the sports analytics community, together with FDA;
2. Extending ADA to sparse functional data;
3. Proposing a methodology for computing archetypoids when asymmetric proximities are the only available information.

The outline of the paper is as follows: Sect. 2 is dedicated to preliminaries. In Sect. 3 related work is reviewed.
Section 4 reviews AA and ADA in the multivariate case, extends ADA to sparse functional data and introduces an ADA extension for data with asymmetric relations. We also present how a performance-based ranking can be obtained. In Sect. 5, ADA is used in three scenarios. In the multivariate case, ADA is applied to the same 2-D basketball data used by Eugster (2012) and to another data set of basic basketball player statistics, and compared with alternative methodologies and previous approaches. In the second scenario, ADA and FDA are applied to longitudinal basketball data. In the third scenario, ADA is applied to asymmetric proximities derived from soccer data. Finally, Sect. 6 ends the paper with some conclusions.

2 Preliminaries

2.1 Functional data analysis (FDA)

Many multivariate statistical methods, such as simple linear models, ANOVA, generalized linear models, PCA, clustering and classification, among others, have been adapted to the functional framework and have their functional counterpart. ADA has also been defined for functions by Epifanio (2016), where it was shown that functional archetypoids can be computed as in the multivariate case if the functions are expressed in an orthonormal basis, by applying ADA to the coefficients in that basis. However, in Epifanio (2016) functions are measured over a densely sampled grid.

When functions are measured over a relatively sparse set of points, we have sparse functional data. An excellent survey on sparsely sampled functions is provided by James (2010). In this case, alternative methodologies are required. Note that when functions are measured over a fine grid of time points, it is possible to fit a separate function for each case using any reasonable basis. However, in the sparse case, this approach fails and the information from all functions must be used to fit each function.
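The reason why a multivariate method can be applied to basis coefficients in the dense case can be illustrated numerically: in an orthonormal basis, the L2 distance between two functions equals the Euclidean distance between their coefficient vectors. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation (their code is in R): the Fourier basis on [0, 1), the uniform dense grid, the simulated curves and the helper name `fourier_basis` are all hypothetical choices, and no archetypoid computation is performed. It does not cover the sparse case, where curves cannot be fitted one by one.

```python
import numpy as np

def fourier_basis(t, n_basis):
    """Evaluate the first n_basis orthonormal Fourier functions on [0, 1) at t."""
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < n_basis:
        cols.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * k * t))
        if len(cols) < n_basis:
            cols.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * k * t))
        k += 1
    return np.column_stack(cols)  # shape (len(t), n_basis)

rng = np.random.default_rng(0)
m, n_basis, n_curves = 200, 5, 6

# Dense, equally spaced grid on [0, 1) (assumption: every curve is observed on it).
t = np.arange(m) / m
phi = fourier_basis(t, n_basis)

# Simulated curves built from known coefficients (illustrative data only).
true_coef = rng.normal(size=(n_curves, n_basis))
curves = true_coef @ phi.T  # shape (n_curves, m)

# Project each curve separately onto the basis: <f, phi_j> ~ grid average of f * phi_j.
coef = curves @ phi / m

# L2 distance between two curves versus Euclidean distance between coefficients.
l2_dist = np.sqrt(np.mean((curves[0] - curves[1]) ** 2))
coef_dist = np.linalg.norm(coef[0] - coef[1])
print(l2_dist, coef_dist)
```

Because the two distances agree, any method driven by Euclidean geometry on the rows of `coef` operates on the same geometry as the functions themselves, which is the property the dense-grid approach relies on.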
2.2 h-Plot representation

Recently, a multidimensional scaling methodology for representing asymmetric data was proposed by Epifanio (2013, 2014) (it improved on other alternatives). The dissimilarity matrix D is viewed as a data matrix and its variables are displayed with an h-plot. For computing the h-plot in two dimensions, the two largest eigenvalues (λ1 and λ2) of the variance-covariance matrix, S, of D are calculated, together with their corresponding unit eigenvectors, q1 and q2.
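The eigen-decomposition step described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `hplot_eigen` and the random asymmetric matrix are hypothetical, and the sample covariance with divisor n - 1 is an assumption because the text does not state the divisor. The final scaling of each eigenvector by the square root of its eigenvalue follows the usual h-plot construction and is likewise an assumption, since the passage above stops before giving the coordinates.

```python
import numpy as np

def hplot_eigen(D, n_dims=2):
    """Largest eigenpairs of the variance-covariance matrix of D.

    D is treated as a data matrix: rows are observations, columns are variables.
    """
    D = np.asarray(D, dtype=float)
    S = np.cov(D, rowvar=False)            # assumption: divisor n - 1
    eigvals, eigvecs = np.linalg.eigh(S)   # S is symmetric; eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_dims]
    lam = eigvals[order]                   # largest eigenvalues first
    Q = eigvecs[:, order]                  # matching unit eigenvectors as columns
    return S, lam, Q

# Illustrative asymmetric dissimilarity matrix (random values, zero diagonal).
rng = np.random.default_rng(1)
D = rng.uniform(0.0, 1.0, size=(5, 5))
np.fill_diagonal(D, 0.0)

S, lam, Q = hplot_eigen(D)

# Assumed h-plot coordinates: one row per variable (column of D),
# each eigenvector scaled by the square root of its eigenvalue.
H = Q * np.sqrt(np.clip(lam, 0.0, None))
print(lam)
print(H)
```

`np.linalg.eigh` is used rather than a general eigen-solver because S is symmetric, so the eigenvalues are real and the eigenvectors come back with unit norm.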
