Prediction Models for Soccer Sports Analytics
Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Computer Science
2018 | LIU-IDA/LITH-EX-A--2018/021--SE

Prediction models for soccer sports analytics

Edward Nsolo

Supervisor: Niklas Carlson
Examiner: Patrick Lambrix

Linköpings universitet
SE-581 83 Linköping
+46 13 28 10 00, www.liu.se

Upphovsrätt (Swedish copyright notice, translated)

This document is kept available on the Internet, or its future replacement, for 25 years from the date of publication, provided that no extraordinary circumstances arise. Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the author's consent. Solutions of a technical and administrative nature are in place to guarantee authenticity, security, and accessibility. The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or distinctive character. For further information about Linköping University Electronic Press, see the publisher's home page http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission.
All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Edward Nsolo

Abstract

In recent years, research interest in soccer has increased substantially, driven by the growing availability of soccer statistics. With the help of data provider firms, access to historical soccer data has become simpler, and as a result data scientists have started to do research in the field. In this thesis, we develop prediction models that could be applied by data scientists and other soccer stakeholders. As a case study, we run several machine learning algorithms on historical data from five major European leagues and compare them. The study builds on the idea of investigating approaches that could simplify the models while maintaining their correctness and robustness. Such approaches include feature selection and the conversion of regression prediction problems into binary classification problems. Furthermore, our literature review did not reveal any research on generalizing binary classification predictions by applying target class upper boundaries other than 50% frequency binning. Thus, this thesis investigates the effects of such a generalization on the simplicity and performance of the models.
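To make the idea of a target class boundary other than 50% concrete, the following is a minimal sketch in Python. It is not the Weka workflow used in the thesis: the function name `binarize_by_top_fraction`, its `top_fraction` parameter, and the sample scores are all hypothetical and chosen only for illustration. It labels the highest-valued share of a continuous target as the positive class, so that `top_fraction=0.5` corresponds to the usual equal frequency split and smaller values give a narrower target class.

```python
from typing import List, Sequence


def binarize_by_top_fraction(values: Sequence[float], top_fraction: float = 0.5) -> List[int]:
    """Label the highest `top_fraction` of values as 1 (target class) and the rest as 0.

    Hypothetical helper for illustration; not taken from the thesis or from Weka.
    """
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be strictly between 0 and 1")
    n = len(values)
    # Number of instances that should fall into the target (top) class.
    n_top = max(1, round(n * top_fraction))
    # Indices ordered by value, highest first; ties keep their input order.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    labels = [0] * n
    for i in order[:n_top]:
        labels[i] = 1
    return labels


# Made-up performance scores, used only to show the effect of the boundary.
ratings = [6.1, 7.4, 6.8, 8.0, 6.5, 7.1, 6.9, 7.7]
print(binarize_by_top_fraction(ratings, 0.50))  # top half is the target class
print(binarize_by_top_fraction(ratings, 0.25))  # top quarter is the target class
```

Note that lowering the fraction makes the two classes increasingly imbalanced, which has to be taken into account when the resulting models are trained and evaluated.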
We aimed to extend the traditional discretization of classes by equal frequency binning, which is the standard way of converting regression problems into binary classification problems in many applications. Furthermore, we sought to identify the important player features in individual leagues that could help team managers devise cost-efficient transfer strategies. These features were selected successfully by applying wrapper and filter algorithms. Both methods turned out to be useful, as the time taken to build the models was minimal and the models were able to make good predictions. Furthermore, we noticed that different features matter for different leagues. This should therefore be kept in mind when assessing the performance of players. Different machine learning algorithms were found to behave differently under different conditions. However, Naïve Bayes was determined to be the best fit in most cases. Moreover, the results suggest that it is possible to generalize binary classification problems and maintain performance to a reasonable extent. It should be noted, however, that the early stages of generalizing binary classification models involve tedious work in training on the datasets, and this cost should be weighed as a tradeoff when considering the approach.

Acknowledgments

Firstly, I would like to express my sincere gratitude to my thesis examiner, Prof. Patrick Lambrix, and my supervisor, Prof. Niklas Carlson, of Linköping University for the opportunity to carry out this thesis project under their supervision. Their continuous support, guidance, and patience steered me in the right direction and led to the successful completion of this thesis. Secondly, I would like to extend my gratitude to fellow schoolmates, friends, and family for their company, advice, and encouragement throughout my years of study and through the process of researching and writing this thesis.
This accomplishment would not have been possible without them. Lastly, I would like to thank almighty God for good health and for the opportunity of a scholarship to study in Sweden. This publication was produced during my scholarship period at Linköping University; thus, I would like to give special thanks to the Swedish Institute scholarship.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Research questions
  1.3 Delimitations
2 Related work
3 Theory
  3.1 Software (Weka)
  3.2 Min-max normalization
  3.3 Feature selection methods
  3.4 Class imbalance
  3.5 SMOTE (Synthetic Minority Oversampling Technique)
  3.6 TigerJython with Weka
  3.7 Machine learning algorithms
  3.8 Evaluation of the prediction models
4 Research methods, techniques, and methodology
  4.1 Pre-study
  4.2 Experimental study
  4.3 Methodology
5 Data pre-processing
  5.1 Data collection
  5.2 Data rescaling, missing values, and duplicates
  5.3 Converting regression problem to binary classification problem
6 Feature selection
  6.1 Feature selection with wrapper method
  6.2 Feature selection with filter attribute evaluator
7 Performance of prediction models
  7.1 Accuracy results of the prediction models
  7.2 F1 Score results of the prediction models
  7.3 AUC-ROC results of the prediction models
8 Discussion and conclusion
  8.1 What are the best mechanisms for selecting essential features for predicting the performance of top players in European leagues?
  8.2 What are the essential features for developing prediction models for top players in European leagues?
  8.3 What are the useful classification models for predicting performance of top players in European leagues?
  8.4 How can binary prediction models be generalized?
9 Future research
Bibliography
A Wrapper method results of the combined-leagues
B Attributes selected by Wrapper method of the combined leagues
C Execution time of wrapper method for the combined leagues
D Aggregated results of filter method for the combined leagues
E Model accuracy results of wrapper datasets for the combined leagues
F Model accuracy of filter-datasets for the combined leagues
G F1 score results of wrapper datasets for the combined leagues
H F1 Score results of the filter-datasets for the combined leagues
I AUC-ROC results of the wrapper datasets for the combined leagues
J AUC-ROC results of the filter-datasets
K Accuracy results for individual leagues
L F1 score results for individual leagues
M AUC-ROC results for individual leagues

List of Figures

4.1 A procedure for analyzing soccer sport historical data
4.2 Data preparation model
4.3 Knowledge flow activities for data formatting process
4.4 Feature selection with wrapper method knowledge flow model
4.5 Feature selection with filter method knowledge flow model
6.1 Merit of subsets of attributes selected
6.2 Execution time of wrapper subset evaluator
7.1 Model accuracy results
7.2 Overall F1 Score results of the combined leagues
7.3 Overall AUC-ROC results
C.1 Execution time of wrapper attribute evaluator for defenders datasets
C.2 Execution time of wrapper method for goalkeepers datasets
C.3 Execution time of wrapper method for midfielders datasets
C.4 Execution time of wrapper method for forwards datasets
E.1 Prediction model accuracy for defenders wrapper-dataset
E.2 Prediction model accuracy for midfielders wrapper-dataset
E.3 Model accuracy for the goalkeepers wrapper-datasets
E.4 Prediction model accuracy for forwards wrapped datasets
F.1 Model accuracy for defenders filter-datasets
F.2 Model accuracy for midfielders filter-datasets
F.3 Model accuracy for goalkeepers filter-datasets
F.4 Model accuracy for forwards filter-datasets
G.1 F1 Score results of the defenders wrapper-datasets
G.2 F1 Score results of the midfielders wrapper-datasets
G.3 F1 Score results of the goalkeepers wrapped datasets
G.4 F1 Score results of the forwards wrapped datasets
H.1 F1 Score results of the defenders filter-datasets
H.2 F1 Score results of the midfielders filter-datasets