Predictive Modeling of Football Injuries

Predictive modeling of football injuries Stylianos Kampakis Supervisors Professor Philip Treleaven, Dr Ioannis Kosmidis A dissertation submitted in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy of University College London. Department of Computer Science University College London April 2016 Abstract The goal of this thesis is to investigate the potential of predictive modelling for football injuries. This work was conducted in close collaboration with Tottenham Hotspurs FC (THFC), the PGA European tour and the participation of Wolverhampton Wanderers (WW). Three investigations were conducted: 1. Predicting the recovery time of football injuries using the UEFA injury recordings: The UEFA recordings is a common standard for recording injuries in professional football. For this investigation, three datasets of UEFA injury recordings were available: one from THFC, one from WW and one that was constructed by merging both. Poisson, negative binomial and ordinal regression were used to model the recovery time after an injury and assess the significance of various injury-related covariates. Then, different machine learning algorithms (support vector machines, Gaussian processes, neural networks, random forests, naïve Bayes and k-nearest neighbours) were used in order to build a predictive model. The performance of the machine learning models is then improved by using feature selection conducted through correlation-based subset feature selection and random forests. 2. Predicting injuries in professional football using exposure records: The relationship between exposure (in training hours and match hours) in professional football athletes and injury incidence was studied. A common problem in football is understanding how the training schedule of an athlete can affect the chance of him getting injured. The task was to predict the number of days a player can train before he gets injured. The dataset consisted of the exposure records of professional footballers in Tottenham Hotspur Football Club from the season 2012-2013. The problem was approached by a Gaussian process model equipped with a dynamic time warping kernel that allowed the calculation of the similarity of exposure records of different lengths. 3. Predicting intrinsic injury incidence using in-training GPS measurements: A significant percentage of football injuries can be attributed to overtraining and fatigue. GPS data collected during training sessions might provide indicators of fatigue, or might be used to detect very intense training sessions which can lead to overtraining. This research used GPS data gathered during training sessions of the first team of THFC, in order to predict whether an injury would take place during a week. The data consisted of 69 variables in total. Two different binary classification approaches were followed and a variety of algorithms were applied (supervised principal component analysis, random forests, naïve Bayes, support vector machines, Gaussian process, neural networks, ridge logistic regression and k-nearest neighbours). Supervised principal component analysis shows the best results, while it also allows the extraction of components that reduce the total number of variables to 3 or 4 components which correlate with injury incidence. The first investigation contributes the following to the field: • It provides models based on the UEFA injury recordings, a standard used by many clubs, which makes it easier to replicate and apply the results. • It investigates which variables seem to be more highly related to the prediction of recovery after an injury. • It provides a comparison of models for predicting the time to return to play after injury. The second investigation contributes the following to the field: • It provides a model that can be used to predict the time when the first injury of the season will take place. • It provides a kernel that can be utilized by a Gaussian process in order to measure the similarity of training and match schedules, even if the time series involved are of different lengths. The third investigation contributes the following to the field: • It provides a model to predict injury on a given week based on GPS data gathered from training sessions. • It provides components, extracted through supervised principal component analysis, that correlate with injury incidence and can be used to summarize the large number of GPS variables in a parsimonious way. Acknowledgments First of all, I would like to thank my primary supervisor Philip Treleaven. His feedback when advising on this thesis, and his help in the collaboration with the football clubs was immense. I would also like to thank my second supervisor Dr. Ioannis Kosmidis. His feedback on the scientific aspects of the thesis helped clarify many points, and improve the methodology throughout the thesis. Also, his help was invaluable when preparing the papers that led to the publications related to this thesis. The medical team of THFC provided the data to conduct this thesis, but also provided useful feedback on the direction of this research and its potential applicability. Therefore, I would like to thank all the people in the team that helped me with their suggestions and support, but first and foremost I would like to thank Wayne Diesel, the head medic of THFC. His instructions and feedback where really valuable in making this work more solid. I would also like to thank my parents, family, friends and Tünde Tolvaj for their support and encouragement. Finally, I would like to thank Neil Jain from Wolverthampton Wanderers. The discussions we had helped me improve my understanding of sports science and football. Contents Acknowledgments .................................................................................................................... 5 List of Figures ........................................................................................................................... 4 List of Tables ............................................................................................................................ 7 1 Introduction .................................................................................................................... 10 1.1 Introduction and definitions ..................................................................................... 10 1.2 Research objectives .................................................................................................. 10 1.3 Research methods .................................................................................................... 11 1.4 Structure of this thesis .............................................................................................. 12 1.5 Contributions ........................................................................................................... 13 1.6 Reference style ......................................................................................................... 14 2 Background and literature review ............................................................................... 15 2.1 Current state of research in sports analytics ............................................................ 15 2.1.1 General overview of academic publications .................................................... 15 2.1.2 Current state of commercial solutions ............................................................. 16 2.2 Injuries and sports analytics ..................................................................................... 17 2.2.1 General research on football injuries ............................................................... 17 2.2.2 Quantifying football injuries: tests and risk factors ......................................... 18 2.3 The issue with current research ................................................................................ 19 3 Data in football and datasets in the current research ................................................ 21 3.1 The challenges of handling football medical data ................................................... 21 3.1.1 The challenges for data analysis ...................................................................... 22 3.1.2 Handling typical data challenges ..................................................................... 23 3.2 Datasets overview and problem explanation ........................................................... 24 4 Techniques and methods ............................................................................................... 26 4.1 Machine learning and statistical methods ................................................................ 26 4.1.1 Gaussian processes ........................................................................................... 26 4.1.2 Support vector machines .................................................................................. 28 4.1.3 Neural networks ............................................................................................... 30 4.1.4 Decision trees ................................................................................................... 32 4.1.5 Random forests ................................................................................................ 33 4.1.6 Naïve Bayes ..................................................................................................... 34 4.1.7 K-nearest neighbours ....................................................................................... 35 4.1.8 The generalized linear model

Load more