Football and the dark side of cluster analysis (and of exploratory multivariate analysis in general, really)

Christian Hennig and Serhat Akhanli
Department of Statistical Science, UCL

1 A principle for data preprocessing

"The dark side of cluster analysis": clustering and mapping multivariate data are strongly affected by preprocessing decisions such as variable transformations ("data cleaning" belongs to preprocessing but is not treated here). The variety of options is huge and guidance is scant.

The framework here is the design of a dissimilarity measure, used for multidimensional scaling and dissimilarity-based clustering.

Clustering and mapping are unsupervised; decisions cannot be made by optimising cross-validated prediction quality. Neither is it a convincing rationale to transform data to standard distributional shapes such as the Gaussian.

4 Football players dataset

Football players characterised by 125 variables taken from whoscored.com (have > 2000 players but use only 75 prominent ones for illustration).

Variables:
- 12 position variables (binary), indicating where a player can play
- Age, height, weight (ratio scale numbers)
- Subjective data: Man of the match, media ratings
- Appearance data of player and team, number of appearances, minutes played

7 Standardisation

Percentage variables, player age, goals, passes per 90 minutes don't have compatible variation. Standardisation is needed.

But different percentages at same level (shots left, right, header) should be standardised by pooled variance, because variations are compatible and relative sizes should be preserved. Bigger variation should have bigger implicit weight.

Standardisation should not destroy implicit weighting by variance, where appropriate.

8 Weighting
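As an illustration of the pooled-variance idea from the Standardisation section, here is a minimal Python sketch. It is not the authors' implementation. The function name `pooled_standardise`, the toy numbers, and the choice to pool by averaging the sample variances within a group are all assumptions made for this example. Columns are only rescaled, not centred, since centring does not change distance-based dissimilarities between players.

```python
from math import sqrt
from statistics import variance


def pooled_standardise(rows, groups):
    """Rescale columns; columns listed in the same group share one pooled scale.

    rows   : list of equal-length lists of numbers (one list per player)
    groups : list of lists of column indices that belong together
    Assumption: the pooled variance is the mean of the sample variances
    of the columns in a group. Columns with zero variance are not handled.
    """
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    variances = [variance(col) for col in columns]

    # Default: every column is divided by its own standard deviation.
    scales = [sqrt(v) for v in variances]

    # Grouped columns are all divided by the same pooled standard deviation,
    # so their relative variation (their implicit weight) is preserved.
    for cols in groups:
        pooled_sd = sqrt(sum(variances[j] for j in cols) / len(cols))
        for j in cols:
            scales[j] = pooled_sd

    return [[row[j] / scales[j] for j in range(n_cols)] for row in rows]


# Hypothetical toy data: age, then % of shots with left foot, right foot, header.
rows = [
    [22, 10, 80, 10],
    [28, 30, 50, 20],
    [34, 20, 65, 15],
    [25, 40, 45, 15],
]

z = pooled_standardise(rows, groups=[[1, 2, 3]])
z_columns = [[row[j] for row in z] for j in range(4)]
for name, col in zip(["age", "left", "right", "header"], z_columns):
    print(name, round(variance(col), 3))
```

In this sketch age ends up with unit variance, while the three shot percentages keep their original variance ratios: the variable with more variation in the raw data still carries more weight after scaling.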