Football and the dark side of cluster analysis

(and of exploratory multivariate analysis in general, really) Christian Hennig and Serhat Akhanli Department of Statistical Science, UCL email: [email protected],[email protected]

1 A principle for data preprocessing 4 Football players dataset 7 Standardisation

“The dark side of cluster analysis”: clustering and mapping mul- Football players characterised by 125 variables taken from Percentage variables, player age, goals, passes per 90 minutes tivariate data are strongly affected by preprocessing decisions whoscored.com (have > 2000 players but use only 75 prominent don’t have compatible variation. Standardisation is needed. such as variable transformations (“data cleaning”belongs to pre- ones for illustration). But different percentages at same level (shots left, right, processing but is not treated here). The variety of options is Variables: header) should be standardised by pooled variance, be- huge and guidance is scant. 12 position variables (binary) - indicating where a player can cause variations are compatible and relative sizes should be The framework here is the design of a dissimilarity measure, play. preserved. Bigger variation should have bigger implicit weight. used for multidimensional scaling and dissimilarity-based clus- Age, height, weight (ratio scale numbers) tering. Standardisation should not destroy implicit weighting Clustering and mapping are unsupervised; decisions cannot be Subjective data: Man of the match, media ratings by variance, where appropriate. made by optimising cross-validated prediction quality. Neither Appearance data of player and team, number of appear- is it a convincing rationale to transform data to standard ances, minutes played 8 Weighting distributional shapes such as the Gaussian. Count variables (top level): goals, tackles, shots, passes, fouls, clearances etc. Variable weights are useful if some variables seem more impor- General principle: Data should be preprocessed Count variables (lower level): subdivisions such as shots by tant than others. This influences the meaning of the results. in such a way that the resulting effective distance body parts, type of pass (long, corner, freekick etc.), suc- Weighted percentage distributions as “one variable”, e.g., left between observations matches how distance is inter- cessful/failed etc. foot, right foot, header shots each weighted 1/3. preted in the application of interest. Aim: provide mapping and classification of players that can be Correlation, shared information: if variables are used by managers and clubs looking for players. Corollary: Different ways of data preprocessing are correlated because of redundant information (e.g., percentages adding to 100), weight them down. not objectively“right”or“wrong”; they implicitly con- 5 Representation struct different interpretations of the data. If variables with complementary meanings are corre- • Standardise count variables by number of minutes played. lated, no reason not to give them full weight. Data driven principles such as optimising stability or “cluster- • Top level/lower level count variables: ability” are suspicious: can the data decide on their own how use total count (per 90 minutes), 9 Results they should be interpreted? use percentages on lower level, e.g., shots 5.5, shots right foot 3.8, left foot 0.8, header 0.9, MDS of distances as explained (pam clustering) use shots 5.5, 2 Overview of decisions ● Andrea Pirlo percentages right foot 69, left foot 14, header 16 ● Caner Erkin ● ● ● ● ● Angel Di Maria percentage profiles are complementary information to total. 100 Ricardo Rodriguez ● ● ● Gael Clichy ● ● ● Representation: decisions about how to represent the rele- ● ● ● • For binary position data use“geco coefficient”(Hennig and ● ● ● ● ● vant information in the variables properly; this may involve ● ● ● ● 50 ● ● ● Hausdorf(2006)) based on aggregating “geographical dis- ● ● ● ● excluding variables, defining new variables summarising or ● ● ●● ● ● ● ● ● tance”for every position to closest position of other player. ● ● 0 framing information in better ways, and certain kinds of ●

● • We decided to not use subjective variables. ● ● “interpretation-based”(as opposed to data-based) standardi- ● ● ● ● Sergio Aguero ● ● ●●● ● ● ● It’d be legitimate to use them - this is a decision about what −50 ● sation. ● ● ●

finalmds$points[,2] Naldo ● meaning the results should have. ● ● Transformation: variables should be transformed in such a ● Karim Benzema Martin Skrtel ● ● Diego Godin way that the resulting differences match appropriate “inter- −100 ● Robin Van Persie Diego Costa ● pretative distances”adapted to the meaning of the variables ● ● 6 Transformation Zlatan Ibrahimovic and the specific application. −150 Mario Mandzukic ● Robert Lewandiwski Standardisation: variables should be standardised in such a Some variables are very skewly distributed. ● −200 −200 −100 0 100 way that a difference in one variable can be traded off against Caught offsides Goals

the same difference in another variable when aggregating vari- 20 finalmds$points[,1] ables for computing dissimilarities. 25 15 Weighting: some variables may be more important/relevant 20 MDS of plain standardised Euclidean (pam clustering) 15 than others - weighting is about appropriately matching the 10 ● Caner Erkin Frequency Frequency ● 10 Andrea Pirlo importance of variables. 10 5

Mathematically, both standardisation and weighting are mul- 5 Ricardo Rodriguez ●

tiplications by a constant, but the rationales are quite differ- 0 0 ent. 0.0 0.5 1.0 1.5 0.0 0.5 1.0 1.5 Caught offsides per 90 minutes Goals per 90 minutes ● 5 ● ● ● Cesar Azpilicueta ● ● Variable selection and dimension reduction are special ● ● More variation at upper end, suggests that “interpretative dis- ● ● ● ● cases of representation and weighting. ● ● ● tance”between large values should be transformed down. ● ● ● ● ● ● Defining indexes summarising information guided by interpre- ● unmds$points[,2] ● ● ● ● ● ● But goals count (approximately) linearly in football. ● ● ● ● 0 ● ● tation is an alternative to data-based dimension reduction. ● ● ● ● ● Use concave transformation for“Caught offsides” ● ● ● ● ● ● ● ● ● ● ● ● ● ● but none for“Goals”. ● ● ● ● ● ● √ ● ● ● Cristiano Ronaldo ● Diego Godin ● ● Decide log(x + c) or x + c, value of c by looking at what it Martin Skrtel● 3 Basic ingredients Demba Ba● Phil Jones Zlatan Ibrahimovic ● ● Diego Costa ● ●

does to the values and what seems appropriate (subjective foot- −5 ● John Terry Mario MandzukicRobert Lewandiwski The framework here is the construction of a dissimilarity ball expertise). Explore by sensitivity analysis what difference measure by aggregating variable-wise distances, e.g., it makes. (Use plain square root here.) −10 −5 0 5 unmds$points[,1] Gower aggregation (Gower(1971)) d(x , x ) = Pp w d (x , x ) 1 2 i=1 i i 1i 2i Transformations: data dependent? “External validation”by football knowledge: Erkin and Pirlo are Euclidean aggregation quite different, but in the same cluster in plain Euclidean solu- pPp 2 Unfortunately, researchers have no clear formal idea d(x1, x2) = i=1 widi(x1i, x2i) . tion. Rodriguez and Clichy are expected to be similar, which Alternatives exist but are not treated here. about “interpretative distance”. Rationale of trans- they are with distances constructed here. formation is matching“interpretative distance”inde- Mapping is then done by Kruskal’s nonparametric multidimen- pendently of the data. But researchers may need to References sional scaling (Cox and Cox(1994)) and clustering by “parti- look at data for having clearer idea about “interpre- Cox, T. F. and M. A. A. Cox (1994). Multidimensional Scaling. Boca Raton: Chapman tioning around medoids”(Kaufman and Rousseeuw(1990)). tative distance”. and Hall. Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biomet- The general principle above also applies to the choice of map- rics 27, 857—-874. Hennig, C. and B. Hausdorf (2006). Design of dissimilarity measures: a new dissimilarity ping and clustering method, but not treated here. measure between species distribution ranges. In V. Batagelj, H.-H. Bock, A. Ferligoj, and A. Ziberna (Eds.), Data Science and Classification, pp. 29–38. Springer, Berlin. Kaufman, L. and P. Rousseeuw (1990). Finding Groups in Data. Wiley.

This work was supported by EPSRC grant EP/K033972/1.