Sports Analytics Algorithms for Performance Prediction
Total Page:16
File Type:pdf, Size:1020Kb
Sports Analytics Algorithms for Performance Prediction Paschalis Koudoumas SID: 3308190012 SCHOOL OF SCIENCE & TECHNOLOGY A thesis submitted for the degree of Master of Science (MSc) in Data Science JANUARY 2021 THESSALONIKI – GREECE -i- Sports Analytics Algorithms for Performance Prediction Paschalis Koudoumas SID: 3308190012 Supervisor: Assoc. Prof. Christos Tjortjis Supervising Committee Mem- Assoc. Prof. Maria Drakaki bers: Dr. Leonidas Akritidis SCHOOL OF SCIENCE & TECHNOLOGY A thesis submitted for the degree of Master of Science (MSc) in Data Science JANUARY 2021 THESSALONIKI – GREECE -ii- Abstract This dissertation was written as a part of the MSc in Data Science at the International Hellenic University. Sports Analytics exist as a term and concept for many years, but nowadays, it is imple- mented in a different way that affects how teams, players, managers, executives, betting companies and fans perceive statistics and sports. Machine Learning can have various applications in Sports Analytics. The most widely used are for prediction of match outcome, player or team performance, market value of a player and injuries prevention. This dissertation focuses on the quintessence of foot- ball, which is match outcome prediction. The main objective of this dissertation is to explore, develop and evaluate machine learning predictive models for English Premier League matches’ outcome prediction. A comparison was made between XGBoost Classifier, Logistic Regression and Support Vector Classifier. The results show that the XGBoost model can outperform the other models in terms of accuracy and prove that it is possible to achieve quite high accuracy using Extreme Gradient Boosting. -iii- Acknowledgements At this point, I would like to thank my Supervisor, Professor Christos Tjortjis, for offer- ing his help throughout the process and providing me with essential feedback and valu- able suggestions to the issues that occurred. Paschalis Koudoumas 4-1-2021 -iv- Contents ABSTRACT ................................................................................................................. III ACKNOWLEDGEMENTS ......................................................................................... IV CONTENTS ................................................................................................................... V LIST OF FIGURES .................................................................................................... VII LIST OF TABLES .................................................................................................... VIII 1 INTRODUCTION ...................................................................................................... 1 2 BACKGROUND ....................................................................................................... 3 2.1 HISTORICAL BACKGROUND ................................................................................ 3 2.1.1 Baseball ................................................................................................ 3 2.1.2 Basketball ............................................................................................. 5 2.1.3 Football ................................................................................................. 7 2.2 LITERATURE REVIEW ......................................................................................... 9 2.2.1 Introduction........................................................................................... 9 2.2.2 Football Prediction Overview........................................................... 10 2.2.3 Rating Systems ................................................................................. 12 2.2.4 Factors that affect the result of a football game ........................... 13 2.2.5 Expected Goals (xG) Models .......................................................... 14 3 GENERAL TERMS ................................................................................................ 16 3.1 DATA MINING ................................................................................................... 16 3.2 MACHINE LEARNING ......................................................................................... 17 3.2.1 Machine Learning Algorithm ............................................................ 21 3.3 SPORTS ANALYTICS ......................................................................................... 22 4 METHODOLOGY ................................................................................................... 24 4.1 PROCESS DESCRIPTION .................................................................................. 24 4.1.1 Data Collection .................................................................................. 25 4.1.2 Data Cleaning .................................................................................... 26 -v- 4.1.3 Feature Extraction and Feature Engineering ................................ 27 5 PREDICTION MODELS........................................................................................ 32 5.1 EXPLANATORY VARIABLES .............................................................................. 32 5.2 CLASSIFICATION MODELS ................................................................................ 33 5.2.1 Boosting with XGBoost..................................................................... 33 5.2.2 Hyperparameter Tuning ................................................................... 33 5.2.3 Grid Search Cross Validation .......................................................... 34 5.2.4 Logistic Regression........................................................................... 35 5.2.5 Hyperparameter Tuning ................................................................... 35 5.2.6 Support Vector Machines................................................................. 36 5.2.7 Hyperparameter Tuning ................................................................... 36 6 RESULTS AND DISCUSSION ............................................................................ 37 6.1 XGBOOST CLASSIFIER .................................................................................... 37 6.2 LOGISTIC REGRESSION .................................................................................... 38 6.3 SUPPORT VECTOR CLASSIFIER ....................................................................... 39 6.4 COMPARATIVE RESULTS FOR CLASSIFIERS .................................................... 40 6.5 EVALUATION OF RESULTS ............................................................................... 41 7 CONCLUSION AND FUTURE WORK ............................................................... 43 7.1 CONCLUSIONS .................................................................................................. 43 7.2 FUTURE WORK ................................................................................................. 43 REFERENCES ............................................................................................................. 45 -vi- List of Figures Figure 1: Branch Rickey's famous formula[2]. ........................................................... 4 Figure 2: Team salaries in 2002 in MLB [4]. .............................................................. 5 Figure 3: SportVU camera in NBA [10]....................................................................... 6 Figure 4: Kyrie Irving's report from SportVU [11]. ..................................................... 7 Figure 5: Long-ball theory [15] ..................................................................................... 8 Figure 6: Score and the Expected Goals [17]. .......................................................... 9 Figure 7: Data Mining process. [49] .......................................................................... 17 Figure 8: The hierarchy of Machine Learning [54]. ................................................. 18 Figure 9: Machine Learning structure and the most important tasks [55]. .......... 19 Figure 10: Supervised Learning [58]. ........................................................................ 19 Figure 11:Unsupervised Learning [60]. .................................................................... 20 Figure 12: Reinforcement Learning [62]. .................................................................. 21 Figure 13: Machine Learning algorithms [64]. ......................................................... 22 Figure 14: Sports Analytics [67]. ................................................................................ 23 Figure 15: Jupyter notebook [69]. .............................................................................. 24 Figure 16: Dataset example and dataset shape. .................................................... 27 Figure 17: Table example after initial feature extraction. ....................................... 28 Figure 18: Transforming full-time result into numeric data. ................................... 28 Figure 19: Table with averages. ................................................................................ 30 Figure 20: First 5 rows of the table with the home team form features. .............. 31 Figure 21: Table with features related to form both for home and away team. 31 Figure 22: Main metrics results and confusion matrix for XGBoost Classifier. .. 38 Figure 23: Main metrics results and confusion matrix for Logistic Regression. 39 Figure 24: Main metrics results and confusion matrix for SVC ............................