SIFTING THROUGH THE NOISE IN FORMULA ONE: PREDICTIVE PERFORMANCE OF TREE-BASED MODELS

Word count: 34,854

Léon Sobrie
Student number: 01502643

Supervisor: Prof. Dr. Dirk Van den Poel
Co-supervisor: Bram Janssens

Master's Dissertation submitted to obtain the degree of:
Master in Business Engineering: Data Analytics

Academic year: 2019-2020


Confidentiality agreement

PERMISSION

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced, provided that the source is referenced.

Léon Sobrie, 02/06/2020


Preface

Dear reader,

This thesis is written to achieve my Master's Degree in Business Engineering at Ghent University. I have chosen Formula One as the field of research because of my interest in sports. I started playing soccer when I was 6 years old; nowadays, running and cycling are my sports of preference. My interest in analytics was stirred up during my 5-year education in Business Engineering at Ghent University. After examining the literature on different sports, I concluded that the literature regarding motorized sports, and more particularly Formula One, is rather limited and that additional research can add value to the literature. A Formula One Grand Prix is also one of my favorite sports to watch on a lazy Sunday afternoon. What inspires me is the fact that Formula One drivers push their bodies to the limit for their passion. They undergo extreme training and follow fine-tuned diets to reach a level of fitness that gives them a higher chance of surviving a crash. In the race, they need a jet fighter mindset to steer their cars at a mind-blowing pace.

The following persons deserve an acknowledgment of their contribution to this Master's Dissertation. I am profoundly grateful to Bram Janssens for his professional guidance during the past year. Next, I would like to thank Prof. Dr. Dirk Van den Poel, the promotor of this dissertation, who inspired me to pick 'Data Analytics' as my main subject. Lastly, my research would not have been possible without the support of my parents and my friends.

Enjoy reading my Master's Dissertation.

Léon Sobrie


Preamble

The COVID-19 virus has had a severe impact on the economy and on social life in 2020. All universities in Belgium were closed and had to switch to online learning. I would like to praise the flexibility of Ghent University in conveniently providing these online learning opportunities.

COVID-19 has had no major impact on the writing of this dissertation, since the period of the analysis is defined as 1950-2019. However, a connection with the Ergast API, the provider of the database, has been established, meaning the data sets can be updated with new races. This would lead to altering the basetable and augmenting the predictions with new test data. The implementation of this connection is part of the deployment phase of the SRP-CRISP-DM (Bunker & Thabtah, 2019). The purpose of this connection is to conveniently retrieve the Formula One data sets that are used to conduct this research. Thus, this connection can be considered an add-on to this dissertation.

With this being said, the conducted analyses, (1) top 3 performance, (2) race completion, and (3) qualifying ability, have been carried out in an undisturbed way. Hence, this dissertation has not been altered because of this exceptional situation.

This preamble is drawn up in consultation between the student and the supervisor and is approved by both.


Abstract

Formula One is one of the most highly anticipated sports and, yet, the literature regarding Formula One is rather limited. Hence, our research contributes to the literature via 3 analyses on publicly available F1 data. More particularly, tree-based models are used for our analyses. The first analysis studies factors of 'top 3 finish' performance, leading to additional insights into high performance in Formula One. Achieving high performance helps teams to boost their reputation, potentially leading to better sponsor deals. In this first analysis, success is a proxy for discovering top-performing drivers. The second analysis studies determinants of 'race completion', helping to make Formula One a safer environment for the drivers. Both technical and human errors occur in Formula One and we hope to contribute to reducing these. The third analysis comprises the 'qualifying ability' for the race, which determines whether teams have 2 drivers at the start. Our 3 analyses are conducted using the state-of-the-art SRP-CRISP-DM (Sport Result Prediction CRoss Industry Standard Process for Data Mining) methodology of Bunker and Thabtah (2019). First, the domain logic behind our analyses is explained and we elaborate on how the analyses can be used. Second, the structure of the publicly available data is assessed and, third, a comprehensive basetable is constructed with the relevant features for our analyses. In total, 33 features are crafted to be included in our analyses. Purposely, the number of features is kept considerably limited, since the best of both the 'data mining' and 'business logic' worlds is pursued. Fourth, the performance of tree-based models on our Formula One analyses is studied. Tree-based models are chosen because of their absence in previous Formula One studies. Furthermore, we study the impact of the class imbalance in the target output, and of its treatment, on the prediction accuracy of our 3 analyses. Fifth, the results are evaluated using the metrics available in a binary classification setting: AUC, accuracy, sensitivity, specificity, lift, and F1 score. Sixth, a connection with the open-source database is made that facilitates online learning. Conclusively, this dissertation can be classified as 'applying statistical learning in Formula One to predict top 3 performance, race completion, and qualifying ability'. The results show that the starting position in the race is the most important feature in the top 3 performance analysis. Yet, we see that the points accumulated in the season, the number of past top 3s, and the number of wins of the constructor are used by the models for predicting top 3s, whereas the starting position is more used to predict non-top 3s. The prediction results of the race completion analysis are poor, meaning that either race completion is hard to predict or we did not include enough relevant features. Qualifying ability is predicted accurately, with the number of past qualifications, the points in the season, and whether the driver has ever started in the top 3 as the most important features.


Table of contents

Confidentiality agreement
Preface
Preamble
Abstract
List of figures
List of tables
List of abbreviations
Part 1: Introduction
  Background – Formula One
  Impact of COVID-19 on Formula One
  Problem situation
  Research questions
Part 2: Literature
  Related sports
  NASCAR
  Formula One
  Contribution to literature
Part 3: Methodology and Research
  Considered existing methodologies
    KDD
    SEMMA
    CRISP-DM
    SRP-CRISP-DM
    ASD-DM
    Overview
  Selecting the methodology
  Research
    Analyses
    Implementing the methodology
  Domain Understanding
  Data Understanding
    Data dictionary
    Entity Relationship Diagram (ERD)
    Data exploration
    Missing values
  Data preparation & feature extraction
    Analysis 1: Top 3 performance
    Analysis 2: Race completion
    Analysis 3: Qualification ability
    Feature description
      Race-related features
      Experience-related features
      Dependent-related features
    Overview of the crafted features
    Feature scaling
    Timeline of features
    Imputing missing values
    Excluded features
  Modeling
    Training and test set
    Validation
    Tree-based models
      Decision trees classifier
      Bagging
      Random forest
      Boosting
        Adaptive boosting
        Gradient boosting
        XGBoost
    Interpretability/flexibility trade-off
    Class imbalance
  Model Evaluation
    Confusion matrix
    Accuracy, sensitivity and specificity
    AUC
    Lift
    F1 score
    Metric behavior in imbalanced classes
    Variable importance
    Individual Conditional Expectation (ICE)
    SHapley Additive exPlanation (SHAP)
  Model deployment
    Idea
    Implementation
  Back on track
Part 4: Discussion
  Results
    Class imbalance
    Performance of tree-based models
    Top 3 finish analysis
    Race completion analysis
    Qualifying ability analysis
    Tuned hyperparameters values
      Decision tree
      Random forest
      Adaptive boosting
      Gradient boosting
      XGBoost
  Conclusion
  Limitations
    Technical perspective
    Sports perspective
  Further research
    Augmenting this study
    Other Formula One study ideas
Reference list
Appendix
  Appendix 1: Literature Tables
  Appendix 2: Extended Data Dictionary
  Appendix 3: All Wet Affected Races
  Appendix 4: Current Formula One teams and their previous team names
  Appendix 5: Build-up code in R
  Appendix 6: Impact on AUC of class imbalance treatment


List of figures

Figure 1. World map containing all 22 Grand Prix host countries
Figure 2. SRP-CRISP-DM methodology applied on this dissertation
Figure 3. Entity Relationship Diagram
Figure 4. Number of races on the different F1 circuits
Figure 5. Races per season throughout the F1 history
Figure 6. Top 10 most occurring status in the 'Results' data set
Figure 7. Proportion of wet affected races
Figure 8. Proportion of hasQ1
Figure 9. Proportion of hasQ2
Figure 10. Proportion of hasQ3
Figure 11. Top 5 occurring driver's nationalities
Figure 12. Constructors per home country
Figure 13. All-time wins per driver
Figure 14. Drivers that won a race once in their career
Figure 15. Time analysis of the availability of features used in the analysis
Figure 16. 5-fold cross-validation
Figure 17. Decision Tree Classifier
Figure 18. Feature space DTC (James et al., 2013)
Figure 19. Bagging procedure
Figure 20. Random Forest procedure
Figure 21. Parallel vs sequential ensemble methods inspired by Xia, Liu, Li and Liu (2017)
Figure 22. Boosting procedure
Figure 23. Tree-based models and the flexibility/interpretability trade-off (James et al., 2013)
Figure 24. ROC curve
Figure 25. SRP-CRISP-DM as fundament for our analyses
Figure 26. All 'model'-'class imbalance' configurations of our analyses
Figure 27. Average rank per performance metric for top 3 analysis
Figure 28. Decision tree top 3 under-sampled
Figure 29. VIP DTC top 3 analysis under-sampled
Figure 30. ICE DTC top 3 analysis
Figure 31. VIP bagging top 3 under-sampled
Figure 32. VIP random forest top 3 under-sampled
Figure 33. VIP adaptive boosting top 3 under-sampled
Figure 34. VIP gradient boosting top 3 under-sampled
Figure 35. VIP XGBoost top 3 under-sampled
Figure 36. ICE XGBoost top 3 under-sampled with grid
Figure 37. ICE XGBoost top 3 under-sampled with pointsSeason
Figure 38. ICE XGBoost top 3 under-sampled with pastTop3
Figure 39. SHAP XGBoost top 3 under-sampled
Figure 40. VIP XGBoost top 3 under-sampled without grid
Figure 41. Average ranking models race completion analysis
Figure 42. Decision tree race completion over-sampled
Figure 43. VIP DTC race completion over-sampled
Figure 44. ICE DTC race completion over-sampled
Figure 45. VIP bagging race completion
Figure 46. VIP random forest race completion over-sampled
Figure 47. ICE plot random forest over-sampled with legacyConstructor
Figure 48. ICE plot random forest over-sampled with pointsSeason
Figure 49. ICE random forest over-sampled with pastHasQ2Prop
Figure 50. VIP adaptive boosting race completion
Figure 51. VIP adaptive boosting race completion for 2005-2019
Figure 52. VIP gradient boosting race completion over-sampled
Figure 53. VIP XGBoost race completion
Figure 54. SHAP XGBoost race completion
Figure 55. Average ranking models for qualifying ability analysis
Figure 56. DTC qualifying ability over-sampled
Figure 57. VIP DTC qualifying ability over-sampled
Figure 58. ICE DTC qualifying ability over-sampled
Figure 59. VIP bagging qualifying ability over-sampled
Figure 60. VIP random forest qualifying ability over-sampled
Figure 61. VIP adaptive boosting qualifying ability over-sampled
Figure 62. VIP gradient boosting qualifying ability over-sampled
Figure 63. ICE plot random forest qualifying ability with pointsSeason
Figure 64. ICE plot random forest qualifying ability with everStart3
Figure 65. ICE plot random forest qualifying ability with rainExp
Figure 66. VIP XGBoost qualifying ability over-sampled
Figure 67. SHAP XGBoost qualifying ability over-sampled
Figure 68. Popular non-verified source referring to Jones (1996) (source 4) for weather data
Figure 69. Source 1 (Autosport Forum) wet affected races
Figure 70. Source 2 (Reddit) wet affected races
Figure 71. Source 2 wet affected race part 2
Figure 72. Source 2 wet affected part 3
Figure 73. Source 3 (Wikipedia) as extra control instrument
Figure 74. Code layout step 1, 2 and 3 SRP-CRISP-DM
Figure 75. Code layout step 4 SRP-CRISP-DM
Figure 76. Code layout step 5 SRP-CRISP-DM
Figure 77. Code layout step 6 SRP-CRISP-DM and feature selection


List of tables

Table 1. Constructors and their drivers for the Formula One 2020 Championship
Table 2. Best performing algorithms per sport
Table 3. Steps of data mining process methodologies
Table 4. Datasets available in the Ergast API and their content
Table 5. Circuits location based on longitude and latitude
Table 6. Dependent variables in our binary classification analyses
Table 7. Categorizing the age
Table 8. Categorizing the years Exp
Table 9. Features used in analysis
Table 10. Learners used in the analyses
Table 11. Class imbalance in training set
Table 12. Confusion matrix
Table 13. Metrics, reasoning regarding class imbalance and reference
Table 14. Class imbalance in the training set of the top 3 performance analysis
Table 15. Class imbalance in the training set of the race completion analysis
Table 16. Class imbalance in the training set qualifying ability
Table 17. Focus of every model
Table 18. No class imbalance treatment 'top 3 analysis'
Table 19. Over-sampling 'top 3'
Table 20. Under-sampling 'top 3'
Table 21. ADASYN 'top 3'
Table 22. No class imbalance treatment 'race completion'
Table 23. Over-sampling 'race completion'
Table 24. Under-sampling 'race completion'
Table 25. ADASYN 'race completion'
Table 26. No class imbalance treatment 'qualifying ability'
Table 27. Over-sampling 'qualifying ability'
Table 28. Under-sampling 'qualifying ability'
Table 29. ADASYN 'qualifying ability'
Table 30. Complexity parameter Decision Tree Classifier
Table 31. Number of predictors Random Forest
Table 32. Minimum node size Random Forest
Table 33. Maximum depth tree Adaptive Boosting
Table 34. Maximum depth tree Gradient Boosting
Table 35. Maximum depth tree XGBoost
Table 36. Nrounds tree XGBoost
Table 37. Literature table of the related sports studies
Table 38. Literature table of the NASCAR studies
Table 39. Literature table of the F1 studies
Table 40. Literature table of the potential methodologies
Table 41. Circuits
Table 42. Constructor results
Table 43. Constructor standings
Table 44. Constructors
Table 45. Driver standings
Table 46. Drivers
Table 47. Lap times
Table 48. Pit stops
Table 49. Qualifying
Table 50. Races
Table 51. Results
Table 52. Seasons
Table 53. Status
Table 54. Weather
Table 55. Country
Table 56. Wet affected races per decade
Table 57. Constructor and their legacy
Table 58. Average AUC for different analysis with different class imbalance approaches


List of abbreviations

ACC: Accuracy
AdaBoost: Adaptive Boosting
ADASYN: Adaptive Synthetic Sampling Method for Imbalanced Data
ANN: Artificial Neural Network
ASD-DM: Agile Software Development for Data Mining
AUC/AUROC: Area Under the Receiver Operating Characteristic Curve
BAG: Bagging
CART: Classification And Regression Trees
CP: Complexity Parameter
CV: Cross-Validation
CRISP-DM: Cross-Industry Standard Process for Data Mining
DTC: Decision Tree Classifier
F1: Formula One
FIA: Fédération Internationale de l'Automobile
FN: False Negative
FP: False Positive
FPR: False Positive Rate
GB: Gradient Boosting
GP: Grand Prix
ICE: Individual Conditional Expectation
KDD: Knowledge Discovery in Databases
LASSO: Least Absolute Shrinkage and Selection Operator
LOOCV: Leave-One-Out Cross-Validation
MICE: Multiple Imputation using Chained Equations
MLP: Multilayer Perceptron
NASCAR: National Association for Stock Car Auto Racing
OLS: Ordinary Least Squares
PDP: Partial Dependence Plot
RBF: Radial Basis Function
RF: Random Forest
RFME: Random Forest with Mixed Effects
SEMMA: Sample, Explore, Modify, Model, and Assess
SEN: Sensitivity
SHAP: SHapley Additive exPlanations
SMOTE: Synthetic Minority Over-sampling Technique
SPC: Specificity
SRP-CRISP-DM: Sport Result Prediction Cross-Industry Standard Process for Data Mining
TN: True Negative
TP: True Positive
TPR: True Positive Rate
VIP: Variable Importance Plot
XGBoost: eXtreme Gradient Boosting


Part 1: Introduction

This dissertation applies principles of statistical learning in Formula One to predict (1) race outcomes, (2) race completion, and (3) qualifying ability. The race outcome is the finish position of a specific driver in a race. Race completion is defined as a driver neither having technical difficulties nor making a human error that halts their race. Qualifying ability indicates whether the driver was able to qualify to start the race. On rare occasions, drivers do not qualify to participate in a race, reducing the chances for their team to achieve a good finishing position. Four parts form the foundation of our study: 'Introduction', 'Literature', 'Methodology & Research', and 'Discussion'. The first part contains an introduction to the Formula One environment. Herein, the background of this sport, the tackled research problems, and the research questions are outlined. The second part of this dissertation encompasses a literature overview of F1, NASCAR and related sports. Related sports can be defined as 'sports with a ranking system that defines the outcome of the race'. Examples of related sports are swimming, cycling, horse races, skiing… The third part explains the methodology that is followed to conduct the research. In a nutshell, the 'Research' section crafts the framework that leads to our results regarding (1) the determinants of top performance, (2) the internal and external drivers of race completion, and (3) the underlying factors for qualifying ability. The last part of this dissertation is the 'Discussion', in which the obtained results are discussed. Next, the conclusion section answers the research question with the insights from the 3 analyses conducted: top 3 performance, race completion and qualifying ability. Furthermore, potential routes for further research are stipulated and the limitations and assumptions of this dissertation are debated. In this dissertation, a multitude of sources is used to construct a comprehensive study for predictions in Formula One. Three sources in particular are highly cited in this dissertation and deserve the acknowledgement of being on the first page:
1. Galar, Fernandez, Barrenechea, Bustince and Herrera (2011) for their contribution towards treating the class imbalance problem of our analyses.
2. James, Witten, Hastie and Tibshirani (2013) for their elaboration on tree-based models that steered the top 3 performance, race completion and qualifying ability predictions.
3. Bunker and Thabtah (2019) for their prediction framework that carried the documentation of our analyses.


Background – Formula One

The purpose of this 'Background' section is to highlight important aspects of Formula One and stress some remarkable events that happened during the rich history of this motorized sport. However, this section does not intend to describe the entire Formula One history. In essence, Formula One is a sport in which high-speed cars race against one another on a closed circuit. However, this one-liner does not fully grasp all the beauty, danger, and passion of this motorized sport. Formula One, the pinnacle of current motorized sports, goes back a long time. Formula One finds its origins in the Le Mans Grand Prix, which was organized by the Automobile Club de France in 1906 (Hughes, 2004). In the 1920s and 1930s, regulation efforts were made to create an annual competition in auto racing. After the Second World War, in 1947, the FIA (Fédération Internationale de l'Automobile) crafted the regulations to hold a World Drivers' Championship, which was called 'Formula One'. However, it took until 1950 to effectuate the first Formula One race. In 1950, the Grands Prix in which the Drivers' Championship was contested were held in Britain, Monaco, Switzerland, Belgium, France, Italy, and at the mythical Indianapolis Motor Speedway (USA) (Bekker & Lotz, 2009; Jenkins & Floyd, 2001). The first Drivers' Championship was won by the Italian Giuseppe Farina, who was driving for Alfa Romeo. In 1978, Formula One was broadcast worldwide for the first time, leading to global awareness for the sport and a drastic increase in team budgets. Nowadays, Formula One is a multibillion-dollar industry with a loyal fan base who can have, according to Rosenberger III and Donahay (2007), up to 3 times more brand loyalty compared to fans of other sports. In 2018, Formula One reached 490.2 million unique viewers (F1., 2019a), which stresses the omnipresence of Formula One in the sports world. This global presence that Formula One is pursuing comes with a price. Every week, every team has to transport their cars, equipment, and staff to another country. For some races, they even have to transport their entire set-up, from staff to equipment, to the other corner of the world, resulting in a lot of technical difficulties. The teams transport their belongings via a combination of three modes of transport: roadways, airways, and waterways (Iyengar, 2017). A fun fact is that every team carries enough spare parts to completely rebuild their cars (Davies, 2014). The usual course of a Formula One weekend is as follows. On Friday, every team gets the opportunity to test their cars on the circuit for 90 minutes to fine-tune the cars and take into account the characteristics of the circuit. On Saturday, the teams and their drivers fight for an as-high-as-possible starting position in the qualification rounds. The driver with the fastest lap during these qualifiers gets the privilege to start at the 'pole position'.


On Sunday, the actual race is played out, resulting in the top 3 drivers popping champagne on the podium. In Formula One, records are broken at a fast pace (F1., 2019b). In 2015, Max Verstappen became the youngest driver to start a race, aged 17 years and 166 days. In the Brazilian Grand Prix of 2019, the Red Bull Racing Team completed the fastest pit stop of all time, 1.82 seconds. Another record-breaker is Lewis Hamilton, winner of six championships, who holds the record for most pole positions, most consecutive race starts and the most races with a single manufacturer. Another record is specific to the 2020 World Championship, since this championship includes 22 Grands Prix, the highest number of races in one season ever. Unfortunately, COVID-19 did throw a spanner in the works, resulting in a cancellation of the first part of the season. The Formula One teams lean on technological innovation in the pursuit of finding the optimal race car in the competitive racing environment. The teams prepare their car for maximum performance in three ways: off-season testing (during the winter), pre-race testing, and adjustments during the race (Noble & Hughes, 2004). On the Formula One website, it is stated that technology originating from the F1 world has 'supercharged' the world, since this technology provides opportunities to change cities, save lives, and reduce emissions (F1., 2019c). Thus, the importance of technology in the field is evident, but it is not the only factor influencing results. A car, regardless of the implemented state-of-the-art technology, can only be as good as the driver that steers it. These drivers are extremely skilled at maneuvering their high-speed multi-million dollar F1 cars. It is due to the expertise of their team that the drivers can drive such high-performance cars. These teams, which are called 'constructors', are very influential in the world of F1, as they give decades of experience and development to the drivers. Consequently, the constructors have their own ranking system, which contains the accumulated points of their drivers. To make the competition between constructors fair, every constructor is allowed to have 2 drivers in the team. Table 1 contains the constructors and their drivers for the Formula One 2020 Championship.

Table 1. Constructors and their drivers for the Formula One 2020 Championship

Team | Drivers
Mercedes | Lewis Hamilton, Valtteri Bottas
Ferrari | Sebastian Vettel, Charles Leclerc
Red Bull | Alex Albon, Max Verstappen
McLaren | Lando Norris, Carlos Sainz
Renault | Daniel Ricciardo, Esteban Ocon
Alpha Tauri1 | Pierre Gasly, Daniil Kvyat
Racing Point | Sergio Perez, Lance Stroll
Alfa Romeo | Kimi Raikkonen, Antonio Giovinazzi
Haas | Romain Grosjean, Kevin Magnussen
Williams | George Russell, Nicholas Latifi

A remarkable statistic is that only 6 drivers in the current Formula One Championship have ever won a race. Hamilton is the undisputed number 1 with 84 wins in his career. Vettel comes second with 53 wins throughout his career. With 21 wins, Raikkonen, who started his first F1 race in 2001, completes the podium. Bottas, Hamilton's teammate at Mercedes, has won 7 races. The youngsters Verstappen and Leclerc show huge potential and have won 8 and 2 Grands Prix respectively. The Formula One Championship has a global allure due to its omnipresence in the sports world. This global presence is noticeable when looking into the locations at which a Grand Prix is organized. In the original 2020 season2, there are 9 races in Europe, 9 in Asia, 2 in North America, 1 in South America and 1 in Australia. All the countries that host a Grand Prix for the 2020 World Championship are shown in Figure 1.

1 Toro Rosso until last season, but the team switched its name for the 2020 season.
2 COVID-19 has canceled or postponed these races, leading to a potential change in this statistic.


Figure 1. World map containing all 22 Grand Prix host countries

Impact of COVID-19 on Formula One

The COVID-19 virus has a serious impact on events all over the world. Consequently, sports events from March until June or even later have been canceled, because safeguarding general human health is of paramount importance. The 2020 season will go down in history as a very special edition with no races in the first half of the season. This is because the FIA rightfully places the health of the drivers above the sport in these exceptional times. In the meantime, Formula One fans can spend their time re-watching videos of previous seasons. Bentley and Murray (2016) state that nostalgia can be found in re-watching such videos. These videos of old races can bring back good memories of glorious wins. Moreover, a nostalgic element was reintroduced in 2019 by awarding one extra point to the driver with the fastest lap in the race. This rule was also used during the first decade of the World Championship (1950-1959).


Problem situation

An F1 Grand Prix (GP) is continuously monitored by all constructors and thus a huge amount of data is produced during a GP. This is also the case for the qualifications, in which the drivers try to set the fastest lap to start at the pole position, which is the first position to start the race. This data can be used for further analysis of the progress of the race and to intervene when irregularities are discovered. The acquired data could also help to build a predictive model for the outcome of an F1 race. To construct such a model, features have to be created which could have predictive power towards the outcome of the GP. The problem is that so much data is created that a model is needed that can sift through the noise, resulting in insights and pattern recognition in an F1 setting. Here is where the title of the dissertation, 'Sifting through the noise', comes into play. This wordplay (Formula One race cars produce up to 140 dB of noise) refers to uncovering the most important features in our analyses. The 3 problems this dissertation tackles and their related analyses are postulated below.

Problem 1: discovering high performance
In the contemporary Formula One environment, 10 teams can participate in each race, each with 2 drivers competing for an as-high-as-possible finish position. Discovering what mattered in the past can lead to making predictions for future F1 race performance. However, there are a lot of internal and external factors to take into account. Internal factors could be past performance or years of experience; external factors could be the weather conditions in a particular race.

Problem 2: a safer environment
The technical flaws and human errors during the race are a less pleasant side of Formula One. Technical flaws comprise engine defects, braking issues, and related problems. Human errors are crashes between 2 (or more) drivers or a crash because of a driver misjudging the racing track. To make Formula One a safer environment, factors that contribute to a higher race completion rate will be studied. The eventual aim of the race completion analysis is to predict whether technical or human dropouts would occur in certain circumstances.

Problem 3: all eggs in one basket
By nature, a Formula One team can ride on 2 horses, since every team is allowed to start with 2 drivers in every race. This seems like a given but, in reality, the drivers have to qualify to participate in the race. If a driver is not able to qualify, the team is handicapped in the sense that it can only participate with 1 driver, a situation called 'putting all eggs in one basket'. If that one driver fails to achieve a noteworthy finish position, the team returns home empty-handed. A qualifying ability analysis can clarify the determinants of these rare occasions of non-participation.


Through solving these problems, this Master's Dissertation could be an aid for several groups of people. First of all, F1 teams can discover the important features for success in Formula One. Secondly, sports analysts can discover what drives success3 in Formula One. Lastly, F1 fans can enlarge their knowledge about Formula One by reading this dissertation. In short, this dissertation aims to solve some current managerial and academic issues in the field of Formula One.

Research questions

The main research question of this dissertation is:

RQ1: ‘What drives predictive accuracy in top 3 performance, race completion, and qualifying ability in Formula One?’

This dissertation is based on these 3 analyses and, thus, for every analysis a related sub-question is posed. The first model, the top 3 analysis, investigates 'top performance', defined as finishing in the top 3, and, hence, the related sub-question is postulated as follows:

‘What factors are most important to predict top 3 performance?’

Race completion, the research topic of our second analysis, concerns factors affecting human errors and technical dropouts. Thus, the related sub-question is as follows:

‘What factors are most important to predict race completion?’

The third analysis scrutinizes the qualifying ability of a specific driver. This analysis might seem trivial from a business perspective but having 2 drivers at the start of a race allows a team to put their eggs in two baskets.

‘What factors are most important to predict racing qualifications?’

3 Success is defined as (1) top 3 performance, (2) race completion, or (3) qualifying ability.


The title of this dissertation, 'Sifting through the noise in Formula One: predictive performance of tree-based models', suggests that tree-based models4 will play a vital role in our study. Thus, an additional research question can be posed:

RQ2: ‘How well do tree-based models perform at making predictions in the field of F1 top 3 performance, race completion, and qualifying ability?’

These tree-based models are approached using the interpretability-flexibility trade-off (James et al., 2013), which states that higher interpretability comes at the cost of lower flexibility and vice versa. Researching this trade-off applied to the tree-based models is one of the main drivers of this dissertation (cf. infra Contribution to literature). With this trade-off in mind, the following sub-question is formulated:

‘How well do interpretable tree-based models perform compared to flexible tree-based models at making predictions in Formula One?’

Class imbalance5 is a problem that occurs in many real-world applications (Galar et al., 2011). Sampling techniques can be used to increase the performance of statistical learning methods. Consequently, the following sub-question is posed:

‘What is the effect of sampling techniques on the predictive performance of the tree-based models?’

4 The tree-based models are decision trees, bagging, random forest, and boosting.

5 Class imbalance refers to the unequal occurrences of different classes for a variable. For example, a binary variable could have 5% 0s and 95% 1s. This variable has highly imbalanced classes.


Part 2: Literature

In this section, the literature6 about Formula One, NASCAR and related sports is discussed. The scope of the literature study is gradually narrowed towards Formula One. Firstly, the focus is on related sports analyses for which outcomes are predicted. Related sports, in this case, are defined as sports where multiple contestants are striving to win the race. Examples of these multiple-entry competitions are horse racing, swimming, hurdle racing, and skiing. Secondly, the scope is narrowed to other motorized sports, with NASCAR, the American counterpart of top-level racing, as the closest to Formula One. The focus will be on studying the already discovered performance drivers in NASCAR. However, NASCAR is only an intermediate step, since the end goal is Formula One. Lastly, the Formula One literature is discussed with the findings, limitations, suggestions and results of the papers. These studies can be classified into three broad categories:
1. Analyses in which drivers of performance are assessed
2. Studies of the technologies and their impact on the dominance of a team
3. Research into the impact of bad weather on technical issues

Related sports

Related sports to Formula One are sports where multiple contestants are striving to win a race. Papers related to performance assessment and predictions in the field of swimming, horse racing, hurdle racing, soccer, and cycling are discussed in this subsection. An overview of these studies can be found in Table 37 (Appendix 1). First of all, Edelmann-Nusser, Hohmann and Henneberg (2002) predict the performance in swimming using neural networks7. Specifically, their study is about predicting the performance of 200m-backstroke swimmers in the finals of the Olympic Games in Sydney. Their prediction is very accurate, with a prediction error of barely 0.05 s. They used the 'type of training exercises' as input to predict the performance of a swimmer using the leave-one-out method and multiple linear regressions. The added value to the literature is the fact that the performance of a swimmer can be predicted accurately using neural networks.

6 All the discussed papers with their content and author(s) can be found in the Appendix. The intention is to create a summary of all the used papers in the literature overview.

7 A neural network can be seen as an interconnected group of nodes with multiple layers that facilitate statistical learning. ANN is based on the neurons in a brain.


Another insight here is that the problem of having a limited amount of data can be solved by pre-training on the data of another swimmer. To conclude, this paper postulates that adapting the training intensity has a beneficial impact on the performance of a swimmer. The link between this paper and Formula One is the fact that drivers also follow a work-out regime to stay in shape. Stuart (2016) states that driver Daniel Ricciardo has an intensive work-out program for his neck, trunk, reaction, agility and cardio to remain in top condition. Another sport of interest in this literature overview is horse racing. The first paper about horse races is written by Lo and Bacon‐Shone (1994). In their study, the authors predict the ordering probabilities in a multiple-entry competition environment. One of the reasons for this research is the fact that bets on horse races mostly involve multiple positions. The connection with F1 is that the races are also a multiple-entry competition. The authors describe 2 proposed models in this study. The Harville model, as proposed by Harville (1973), uses the formula P_ij = P_i P_j / (1 - P_i) to predict the probability that horse i wins and horse j finishes second, where P_i and P_j are the win probabilities of horses i and j. The authors state that the ordering probabilities can be found when an underlying probability distribution for the running time of horses is assumed. The second model, the Henery model, as proposed by Henery (1981), assumes that the running times are independently and normally distributed with unit variance. This study states that computing the probability P_ij under the Henery model results in a non-closed-form solution. The authors perform a logit analysis, resulting in a systematic bias for the Harville model. On the contrary, the Henery model does not have such a systematic bias.
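To make the Harville model concrete, a minimal sketch in R (the language used for the analyses in this dissertation) is given below. The win probabilities are hypothetical and serve purely as an illustration.

# Harville (1973) model: the probability that horse i wins and horse j
# finishes second, given the win probabilities of all horses.
harville <- function(p, i, j) {
  p[i] * p[j] / (1 - p[i])
}

# Hypothetical win probabilities for a four-horse race (summing to 1)
p <- c(0.40, 0.30, 0.20, 0.10)

harville(p, 1, 2)  # 0.40 * 0.30 / (1 - 0.40) = 0.20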


In a second study regarding horse races, Davoodi and Khanteymoori (2010) apply Artificial Neural Networks (ANNs) to predict the outcome of these races. The authors distinguish 3 different types of algorithms for training a network: supervised, unsupervised and reinforcement learning. They apply Time Series Analysis, Regression and Neural Networks in this study. The different supervised learning algorithms for the Neural Networks result in an average accuracy of 77%. Davoodi and Khanteymoori define the learning algorithms as follows:
'1. Supervised learning is an approach where the neural network uses a training set and gets a clearly defined target.
2. Unsupervised learning gives more freedom to the algorithm in order to discover patterns in the data input without getting aid from external sources.
3. Reinforcement learning is learning where neural network components (neurons) are rewarded for good performance and punished for bad performance.' (p. 156)

However, there is a fourth way of training an algorithm in the literature: semi-supervised learning. Zhu (2005) defines semi-supervised learning as using both labeled and unlabeled data to build classifiers. This is done because labeled data is expensive to obtain or time-consuming to retrieve. The third study in the research domain of horse races is from Pudaruth, Medard and Dookhun (2013). The authors predict horse races using a weighted probabilistic approach, with the race track of 'Champ de Mars' in Mauritius as a case study. The authors propose 9 factors that ought to have predictive power for the outcome of horse races. Among the factors in this study are the jockey, the type of horse and the experience of the horse. Adding the scores on these 9 factors leads to a total value; the horse with the highest total score is the predicted winner of the race. Since every horse is included in this analysis, it would be possible to make a ranking system depending on the total score of the 9 factors. The success rate for predicting the winner was 58.33% in this study; on average, the winner was accurately predicted in 4.7 out of every 8 horse races. A success rate of 58.33% might not seem impressive, but it definitely is given that the predictions made by the best tipsters only have a success rate of 44%.
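A minimal sketch of such a weighted scoring approach is shown below; the factors, scores and horses are hypothetical and not those of Pudaruth et al. (2013).

# Each horse gets a score on a number of factors; the horse with the
# highest total score is the predicted winner.
scores <- data.frame(
  horse      = c("A", "B", "C"),
  jockey     = c(8, 6, 7),   # hypothetical factor scores
  horse_type = c(5, 9, 5),
  experience = c(7, 7, 9)
)

scores$total <- rowSums(scores[, c("jockey", "horse_type", "experience")])
scores$horse[which.max(scores$total)]  # predicted winner: "B" (total 22)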


The fourth and last treated paper about horse races is written by McGivney et al. (2019). The authors study the heredity of durability traits for Thoroughbred horses, a special breed for horse racing. In their study, a prediction is made for racecourse starts according to genetic data. The model used in this study is a random forest with mixed effects (RFME). The result is that racehorses with higher genetic potential have fewer non-participated races (27% for low genetic potential and 16% for high genetic potential) and have better race outcomes. Another sport of interest in this literature is cycling in a professional race setting. Karetnikov, Nuijten and Hassani (2019) assess the impact of a specific training scheme on the performance of the cyclist. They build a predictive model that predicts the position of a cyclist given the training scheme the cyclist has followed and the historical data of the cyclist's performance. In this study, 48 attributes are used to predict the MMP (Maximum Mean Power) and finish positions of the cyclists. The prediction models present in this study are Linear Regression, Lasso Regression, LSTM, Decision Tree, Random Forest Regression, CatBoost and XGBoost. In this study, CatBoost and XGBoost have the best predictive accuracy of the investigated models. The best performance is obtained when the models focus on mountain races, since there is more distinction between the mountain skills of professional cyclists: in the Tour de France, only the best 10 cyclists can climb a mountain at a top-level pace. This paper also contains a short rationale about which algorithm appears to be the best for a specific sport. The authors state that LASSO regression yields the best performance in hurdle races, neural networks in swimming and linear regression in walking. Since soccer is the most popular sport worldwide with billions of followers, predictive analyses about this sport are included in this literature overview. These studies give additional inspiration for conducting state-of-the-art sports predictions, which is the intention of this dissertation. Eryarsoy and Delen (2019) develop a predictive model for the outcome of a soccer game (win, draw or loss) and they develop factors with the potential to influence the outcome of the game. The reason this study is discussed is the fact that the outcome can take 3 values; when developing a predictive model, the fact that there are 3 possible outcomes should be treated appropriately. The methodology used is the CRISP-DM (Cross-Industry Standard Process for Data Mining), which is a popular data mining framework. Around 40 different variables are used to predict the outcome of the game. The prediction techniques used in this study are Naive Bayes, Decision Trees, and Ensemble models. The 2 Ensemble methods used are Gradient Boosting Trees and Random Forest. The accuracy was respectively 74% for the 'Win/Loss/Draw' prediction and 86% for the 'Point/NoPoint' prediction. The CRISP-DM methodology is the standard process for data mining research and provides a promising stepwise approach to structure a Master's Dissertation. Another paper concerning soccer is the paper of Ulmer, Fernandez and Peterson (2013). The authors discuss the prediction of soccer matches in the English Premier League using machine learning algorithms. The best performing algorithms in this paper are one-vs-all SGD, linear SVM, Random Forest and SVM with RBF kernel. In this study, draws were particularly hard to predict and it was not an option to weigh draws more, since this was detrimental to the predictive accuracy of the models. Tax and Joustra (2015) present a match-based prediction system for the Dutch Eredivisie. The authors create 3 different models: a public data model, a betting odds model, and a hybrid model. The public data model is best predicted with Naïve Bayes and a Multilayer Perceptron. For the betting odds model, the highest prediction accuracy is obtained with FURIA, and the highest predictive performance for the hybrid model is achieved when combining ReliefF8 and LogitBoost. The last two treated papers in this related sports literature overview are about skiing and hurdle races respectively. Abut, Akay, Daneshvar and Heil (2017) examine the predictive accuracy of artificial neural networks (ANN) for the racing time of cross-country skiers. The authors tackle their research by using 3 popular ANNs9, which all have comparable performance with acceptable error rates.

8 ReliefF is based on the study of Kira and Rendell (1992) and it serves as a practical approach to feature selection.

9 The 3 ANNs are MFANN, GRNN, RBFNN.


This study concludes that ANNs are fit for making predictions in a skiing environment. Przednowek, Iskra and Przednowek (2014) perform a predictive modeling study in the field of 400-meter hurdle races. The authors apply both linear and nonlinear multivariable models to predict the outcome of hurdle races. The linear methods include ordinary least squares (OLS), ridge and LASSO regression. The nonlinear methods consist of neural networks such as the multilayer perceptron (MLP) and the radial basis function (RBF) network. The best model is chosen based on leave-one-out cross-validation (LOOCV). In this study, LASSO shrinkage regression is the best performing method for predicting the outcome of 400-meter hurdle races. The focus of the studies discussed in this section is on predicting the outcome of a race or game in a sport that is related to Formula One. Only a handful of papers use linear methods and, hence, most authors opt for other methods like Neural Networks, Random Forest, or Naive Bayes. Thus, the literature of sports that are related to Formula One suggests that linear methods are insufficient for modeling race rankings. This can also be seen in Table 2, where the best-performing algorithms per sport are shown with their source. Neural networks, boosting algorithms, and LASSO regression yield the best performance in related sports.

Table 2. Best performing algorithms per sport

Sport | Best performing algorithm | Reference
Swimming | Neural network | Maszczyk et al. (2012)
Hurdle races | LASSO regression | Przednowek, Iskra and Przednowek (2014)
Walking | LASSO regression | Wiktorowicz, Przednowek, Lassota and Krzeszowski (2015)
Cycling | CatBoost for the flat stages, XGBoost for the mountain stages | Karetnikov, Nuijten and Hassani (2019)
Horse races | Neural network with backpropagation(*)(**) or decision tree(**) | (*) Davoodi and Khanteymoori (2010); (**) Chen, Rinde, She, Sutjahjo, Sommer and Neely (1994)
Soccer | Neural networks with backpropagation, LogitBoost10 | Hucaljuk and Rakipović (2011)

10 The LogitBoost method yields slightly worse results for soccer prediction, but is still worth noting.
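To illustrate the LASSO-with-LOOCV approach that performed best for the hurdle races of Przednowek, Iskra and Przednowek (2014), a minimal sketch in R follows. The data are simulated stand-ins for training features and race times; the glmnet package is assumed to be installed.

# LASSO regression tuned with leave-one-out cross-validation.
library(glmnet)

set.seed(1)
n <- 50
x <- matrix(rnorm(n * 10), ncol = 10)  # 10 hypothetical predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(n)    # simulated outcome, e.g. race time

# alpha = 1 selects the LASSO penalty; nfolds = n gives LOOCV
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
coef(cv_fit, s = "lambda.min")         # coefficients at the selected penalty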


NASCAR

Another discipline in motorized racing sports is NASCAR (National Association for Stock Car Auto Racing). Since this thesis is about Formula One, an overview of the NASCAR literature can give new insights into the prediction of racing activities. A potential route for this thesis could be comparing F1 with NASCAR and depicting the points of resemblance as well as the points of difference between these motorized sports. In the literature overview about NASCAR, a multitude of papers containing a predictive analysis is discussed. This section ends with a comparative analysis between F1 and NASCAR. An overview of the NASCAR studies can be found in Table 38 (Appendix 1). Probability models are constructed in a professional race setting by Graves, Reese and Fitzgerald (2003). The authors analyze the racing results of NASCAR using a Bayesian hierarchical framework as a model. This study compiles a probability model for finishing positions in NASCAR. The fundamental models in this analysis are the model of Luce (1959) and a modification of the model of Stern (1990). The authors assess the track-independent driver abilities and they predict future starts according to the rate of improvement of the driver's skills. The most important finding in this analysis is the considerable influence of the team's ability on the driver's ability. Moreover, there is evidence of the existence of track specialists; some drivers perform much better on specific tracks. The reliability of potential predictors for the outcome in NASCAR is assessed by Pfitzner and Rishel (2005). The authors acknowledge the importance of variables like the driver's skill and pit crew performance. However, some factors are not under the control of the constructor or driver, such as the weather, behavior of other drivers… Since there is a multitude of variables that are very hard to take into account during the progress of the race, the authors propose to only take into account variables that are known before the race starts. In their model, car speed, driver skills, and the related result are taken into account. Another important factor in their analysis is the presence of multi-car teams. Multi-car teams have more driver/car combinations available, have a steeper learning curve and, hence, a higher marginal return. Moreover, multi-car teams are more likely to acquire bigger sponsorship deals and thus more money to invest in their crew, technology, or driver(s). Other substantial advantages for multi-car teams are economies of scale and the availability of more data, since multi-car teams have more participants. This analysis was performed for 14 races in the 2003 NASCAR season. The driver's skill, measured via 'laps completed' and 'points accumulated', is a very important variable to determine the performance in NASCAR.


However, team-related variables exercise an influence on the result. There is a positive correlation between team size and better positions, and switching teams is correlated with worse positions. Allender (2008) studies the influence of driver experience in predicting the outcome of NASCAR races. This study performs a regression on the data from 38 races of the 2002 NASCAR season. The dependent variable is the outcome of the race and the independent variables are the starting position of each driver, the track length, the % of laps under caution (the number of laps in which car positions are frozen because of accidents), and the driver experience expressed in years. In the author's model, the starting position and the driver experience expressed in years are the significant variables. Next, interaction effects are included in the analysis to assess the significance of these effects. There are 2 significant interaction effects in this study. First, the more experienced the driver and the higher the starting position, the better the outcome of the driver will be. Secondly, the driver experience and the track length interact with one another and have a positive impact on the driver's outcome. This is because, as their careers progress, drivers gain experience and are able to deal with different track lengths more smoothly. Tulabandhula and Rudin (2014) discuss the challenges for predictive analytics in a professional racing environment. In their paper, the authors contribute to the literature by creating a real-time model that supports team captains in making vital tire-change decisions. In their in-race analysis, the entire knowledge process cycle is considered. The knowledge process in this study is composed of an exploratory analysis, feature engineering, modeling, data mining and decision making. The authors ask 3 research questions in their real-time decision-making analysis. The first research question asks whether it is possible to determine the further outcome of the race using the driver's racing history. Secondly, it is questioned whether re-fueling and tire-changing decisions can be optimized according to the predicted performance of the driver for the remainder of the race. Lastly, this study investigates whether a driver's past behavior holds valuable insights for the future. This paper addresses the complexity of racing, since tire decisions are crucial and drivers influence one another (the neighborhood effect). The following hypotheses are postulated in this research:

1. The momentum in the ranking, whether a racer is climbing in rank or not, has predictive power.
2. The neighborhood effect, which is the fact that drivers influence one another, is a predictive factor.
3. Aggregation can be done over races and thus performance is not only based upon one Grand Prix.


Two baseline models are created as a benchmark for predictive accuracy. The first baseline model uses the starting rank to predict the finish rank of the driver. The second baseline model predicts the finish rank with the average rank as input. These baseline models are compared with more advanced models, which are ridge regression, support vector regression (SVR), LASSO (least absolute shrinkage and selection operator), and random forests. The predictive accuracy measures are R2, RMSE (Root Mean Squared Error), and sign accuracy. The baseline models perform well on RMSE, since the starting rank, average rank, and finish rank of a racer are in almost all cases close to one another. However, these baseline models have no additional value for in-race decisions since they are static: the starting rank of a racer does not change during the race and hence no predictive power is found in such a measure. This translates into a slightly negative R2 and lower sign accuracy for both baseline models in this study. All advanced models perform better on R2 and sign accuracy than the baseline models. However, the model using the random forest algorithm has a lower R2 than the other advanced models.
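A minimal sketch of these two evaluation measures in R is given below. RMSE is standard; the definition of sign accuracy is one plausible reading (whether the predicted direction of rank change, improvement versus decline relative to the starting rank, matches the actual one), and all ranks are hypothetical.

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

sign_accuracy <- function(actual, predicted, start) {
  # proportion of racers for whom the predicted direction of rank change
  # (relative to the starting rank) matches the actual direction
  mean(sign(start - predicted) == sign(start - actual))
}

start_rank  <- c(1, 2, 3, 4, 5, 6)
finish_rank <- c(2, 1, 3, 6, 4, 5)
pred_rank   <- c(1, 1, 4, 5, 5, 4)  # hypothetical model predictions

rmse(finish_rank, pred_rank)
sign_accuracy(finish_rank, pred_rank, start_rank)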


F1 and NASCAR are both motorized sports, so it is useful to outline the similarities and the differences between them. Such a comparative analysis has been made by Silva and Silva (2010). In their study on 2009 data for both sports, the relationship between past success, defined as performance in the practice round, the qualifiers, and previous races, and the finish positions is investigated. The authors discuss the differences between Formula One and NASCAR. For example, NASCAR races start with over 2 times more drivers than Formula One races. Another point of difference is the number of races in one season: NASCAR has around twice as many races per season as Formula One. However, there are similarities between these 2 motorized sports. Both sports award points according to the finish position of the race; these points are accumulated over the season, and the racer with the most points at the end of the season wins the championship. In addition, both sports give the drivers the opportunity to practice on the circuit, and in both sports there are qualification laps that determine the starting positions for the race. In this research, 4 variables were created: 'Qualifying' to measure the qualification results, 'Practice' to quantify the practice times, 'Points' to account for past performance, and 'Result' to determine the finish position. In this paper, the variable 'Points' has the highest predictive power for NASCAR finish positions and the variable 'Qualifying' is best at predicting F1 results. There is a significant relationship between 'Practice' and 'Result' for NASCAR, but not for F1. The authors note the opportunity to assess the importance of the weather; unfortunately, only two F1 races took place in the rain and NASCAR races are canceled when it rains. In the analysis for NASCAR, the model is good at drawing relationships between the finish position and the explanatory variables. However, when the analysis is performed for the 'Top 20 racers', the model is not able to make a good prediction of the finish position for NASCAR. In NASCAR literature, most studies rely on linear regression as the method to analyze the outcome of NASCAR races, although some papers deploy more advanced models, e.g. random forests, as in the study of Tulabandhula and Rudin (2014). This dominance of linear analyses contrasts with the more advanced methods used in the related sports literature. Given this absence of more advanced methods, additional research could assess the performance of, for example, Neural Networks in NASCAR.


Formula One

There are a handful of studies with a specific focus on performance in Formula One. An overview of these studies can be found in Table 39 (Appendix 1). Starting the literature overview, two studies about the best Formula One driver of all time are discussed. Eichenberger and Stadelmann (2009) research the best Formula One driver of all time using data covering 57 years, from 1950 until 2006, of Formula One racing. The authors state that the performance of a driver is dependent on both talent and the quality of the car. Drivers will not finish a race when a human or technical dropout occurs. That is why the authors distinguish between human dropouts, like collisions, accidents and disqualification, and technical dropouts, like tire problems, engine failures… Furthermore, they only consider drivers that participated in more than 40 races. The search for the best Formula One driver is done via linear regression in this paper. The top 3 drivers in their linear regression analysis are Juan Manuel Fangio (1950-1958), Jim Clark (1960-1968), and Michael Schumacher (1991-2006 and 2010-2012). To establish a robust study, the authors included 2 control variables: the classification of the team partner driver and the home advantage. Home advantage applies when the GP is held in the country where the driver was born. Bell, Smith, Sabel and Jones (2016) research the formula for success in Formula One through multilevel modeling of driver and constructor performance. In their analysis, Formula One data from 1950-2014 is used. Firstly, the authors assess the best F1 driver of all time taking into account the designated team. According to their research, the best F1 driver of all time is Juan Manuel Fangio; Michael Schumacher comes in at an honorable eighth place since most of his races were won in a high-performing car. Secondly, this study researches the importance of both teams and drivers. Bell et al. (2016) conclude that team effects outweigh driver effects, since team contribution accounts for 86% of the variation in driver results. Lastly, this paper quantifies the change of driver and team effects over time under altered racing conditions. The study states that team effects appear to be steady over time, and the authors conclude there is a 'legacy effect'. This effect refers to the accumulated experience that teams acquire throughout the years. However, there is some disagreement in the literature regarding the contribution of the team and the driver to Formula One performance. In the study of Spurgeon (2009), Nico Rosberg proclaimed that the driver contributes 20% and, hence, the team 80% to the overall performance. On the contrary, Allen (2000) claimed the opposite, pointing to the unfortunate Michael Schumacher11 as an example where driver contribution overshadows team contribution.

11 Michael Schumacher had a terrible skiing accident in 2013 and is still recovering.


Technology plays a major role in the current Formula One landscape. The three following studies focus on technology and its relationship with competitive advantage. Judde, Booth and Brooks (2013) perform an analysis of competitive balance in Formula One racing using data from 1950-2010. The authors argue that changes in regulation have an impact on the uncertainty of the championship. However, they state that regulations do not significantly impact race uncertainty or long-term dominance within Formula One. Jenkins and Floyd (2001) study the trajectories of technology in a Formula One setting. They look into 3 time periods in which one technology defined the rules of the race. From 1967 until 1973, the Ford DFV engine was the winning factor. Next, Ferrari's 'Flat-12' engine gave Ferrari a competitive advantage from 1974 until 1977. The aerodynamics were redefined when Williams launched the ground-effect design, leading to dominance from 1978 until 1982. The authors stress the relevance of technological transparency and its impact on a firm. When transparency is low, firms compete to construct the dominant technology that results in a competitive advantage. In an environment where transparency is high, a dominant design is accepted by the system in which the firms operate, leading to the generation of complementary technological applications. Lastly, this paper states that every technological trajectory has its level of power, momentum and uncertainty. Jenkins (2010) investigates the relationship between technological discontinuities and competitive advantage applied to Formula One racing using data spanning 57 years (1950-2006). This study concludes that some teams are on some occasions unable to adapt to exogenous shocks such as a technological discontinuity. Vice versa, some teams can deal with a discontinuity in an outstanding way, leading to a competitive advantage. Moreover, a small number of teams are capable of maintaining this competitive advantage throughout successive technological discontinuities. In the study, two different capabilities are defined. The first one is a dynamic capability that allows teams to cope with technological discontinuities, leading to a competitive advantage. Secondly, a sustaining capability permits a team to keep its competitive edge over another team when a new discontinuity is successfully handled. The pursued business model in Formula One varies amongst the constructors. Aversa, Furnari and Haefliger (2015) study the relationship between the business model and racing performance. The authors define 4 different business models in F1 which propagate via resources and capabilities into racing performance. Their business models are the following:

1. Internal Knowledge Transfer
2. External Knowledge Transfer
3. Formula 1 Supply
4. Talent


The first model, 'Internal Knowledge Transfer', points to collaboration between automotive manufacturers and F1 teams. The second model, 'External Knowledge Transfer', indicates the sale of F1 technology to other industries. The third model, 'Formula 1 Supply', is about F1 teams supplying other F1 teams. The fourth and last business model, 'Talent', represents the investments in existing and future talent through the scouting system. In this study, the resources for an F1 team are gathered from 3 sources: financial, knowledge and human resources. There are 3 types of capabilities in their model: tech development, technical skills and driving skills. This research highlights the business models of 2 teams: Red Bull Racing and Williams. Red Bull Racing puts a lot of effort into talent and aims to discover talent as soon as possible; contracting racers at a younger age leads to cost savings which can be spent on technology, data analysis... Williams has focused on delivering solutions to a broad range of industries, known as the 'External Knowledge Transfer' business model, through the development of F1 technologies. This business approach had a negative impact on the overall F1 results of Williams. The authors conclude that the indicators for high performance are the 'Formula 1 Supply' and the 'Talent' business models. Hence, relationships between Formula One teams and investing in talent are 2 important factors to excel as an F1 team. The following study is hypothetical and looks into the decisive factors for holding an F1 race in a specific city. In their study, Büyükyazıcı and Sucu (2003) select the most appropriate city to hold a Formula One race. There was a choice between 3 Turkish cities (Antalya, Izmir and Istanbul), and 3 criteria were used to make a suitable decision: 'the adequacy of hotels', 'the availability of fully equipped hospitals' and 'the renown of the city'. In this study, 2 models to deal with such complex problems are discussed: the 'Analytic Hierarchy Process' (AHP) and the 'Analytic Network Process' (ANP). The authors applied the latter model to the complex problem of determining the appropriate city to hold an F1 race. The result of this study was Istanbul, which scored best on the criteria 'availability of fully equipped hospitals' and 'renown of the city'. Marino, Aversa, Mesquita and Anand (2015) devote attention to both the drivers of performance in Formula One and the impact of regulations on the Formula One landscape. Particular focus is devoted to a changing environment in an F1 setting. Concretely, they study what the impact of new regulations is on the outcome of F1 races. They research the constructors' capability to deal with architectural redesign and how well constructors can deal with time-based limitations in the rapidly changing F1 environment. In this study, both a quantitative and a qualitative analysis are performed to detect the factors driving performance in a changing


environment. In the quantitative research, the authors design the following measures: performance, change in technical regulations, the extent of firm exploration, and 3 control variables (change of drivers, change of engineers, and adaptation experience). The method used to regress this model was the generalized method of moments (GMM). In their qualitative study, the authors studied the impact of the introduction of KERS (Kinetic Energy Recovery System), a new energy-efficient technology, in the 2009 Formula One season. Sebastian Vettel, a 4-time Formula One World Champion, called this radical change in technology 'the biggest change in the history of F1'. After years of domination by big players like Ferrari, new players like Brawn GP, the current Mercedes team, started to win races since they dealt extremely well with this new technology. This study highlights the importance of the constructor's ability to handle radical changes in an appropriate way. The weather is an uncontrollable factor that has an impact on the progress of a Grand Prix, and the next study investigates the impact of changing weather conditions in Formula One. Almost all races take place in good weather conditions; only a minority are held in bad weather. On rare occasions, the weather conditions change during the race, and this leads to the purpose of the next study. Rosso and Rosso (2016) apply quantile regression12 to study the relationship between the weather, tire types and race stints in F1 using data from a race in the 2016 season. This race was chosen because of the drastic change in weather conditions: the qualifications were held in warm and dry weather while the race took place in rainy conditions. The authors note the superior performance of ultra-soft tires as the track was drying while the race proceeded. According to the authors, their analysis could be refined in 3 different ways: firstly, by taking tire degradation into account; secondly, by including the driver's skills through using the free practice times as input; lastly, by running a Monte Carlo simulation to determine the optimal number of pit stops in a race. The following study takes a predictive modeling approach using Artificial Neural Networks (ANNs). Stoppels (2017) applies ANNs to predict F1 racing outcomes. In this thesis, the first 17 races of the 2016 Formula One season are used to predict the last 4 races of that season. After an extensive explanation of ANNs, the thesis applies them to an F1 environment as experimental research. The predictions are made for 4 different racers: Lewis

12 Quantile regression can be used when the assumption of a standard regression model about independent and identically distributed (i.i.d.) residuals is violated. For example, Dimelis and Louri (2002) use this method to analyze the production efficiency gains in terms of technology transfer and labor productivity changes caused by diverse degrees of foreign ownership.


Hamilton (Mercedes), Max Verstappen (Toro Rosso/Red Bull Racing), Felipe Massa (Williams) and Jenson Button (McLaren). The team switch of Max Verstappen introduces a new insight since racers are able to switch constructors during the season. The author postulates 8 features that facilitate the predictions in this F1 setting. The following features were included in the analysis:

1. Circuit length
2. Number of laps
3. Weather
4. Start grid
5. Recent form racer
6. Recent form others
7. Best qualification
8. Race results (dependent variable)

The author compares the ANN with simple prediction methods, such as using the current form to predict the race outcome, and with the multiclass logistic regression method. The ANN has a predictive accuracy of 78% on the training data and 75% on the validation data. The author concludes that the ANN algorithm performs better at predicting Formula One results than the multiclass logistic regression method. Enlarging the dataset to 42 races results in predictive accuracies of 77% and 69% for the training and validation data set respectively. These predictive accuracies are lower than when using only 21 races, which the author ascribes to possible overtraining. In this case as well, the ANN model predicts the outcomes better than the simple models and the multiclass logistic regression model. Unfortunately, these claims are not supported by any plots, making it hard to follow this analysis. On the other hand, the author's focus is on the costs of running Neural Networks and on the learning rate of this algorithm.


Contribution to literature

In Formula One literature, most research focuses on linear regression or similar methods like quantile regression. Nevertheless, Stoppels (2017) performs an Artificial Neural Network analysis to predict outcomes in F1 racing. After assessing the different studies in Formula One and the methods they use to predict the outcome of F1 races, this dissertation will focus on tree-based methods in a Formula One setting, since, to our knowledge, no tree-based studies have been carried out in the field of Formula One. The tree-based models used in this dissertation are decision trees, bagging, random forest, adaptive boosting, gradient boosted trees and extreme gradient boosted trees (XGBoost). Notice that these methods are listed in increasing order of flexibility. As a result, this dissertation will thoroughly investigate the interpretability/flexibility trade-off postulated by James, Witten, Hastie and Tibshirani (2013, pp. 24-25). We expect a more flexible model to yield the highest predictive performance since it can generate a wide range of possible shapes to estimate the relationship between the features and the target value in our analyses. However, such a flexible model could capture too much noise in the training set and, consequently, could perform badly at making predictions on the test set. This phenomenon is called 'overfitting' and is related to the bias/variance trade-off as defined by the same authors: flexible models have higher variance and lower bias, whereas interpretable models have lower variance and higher bias. In this dissertation, three different aspects of Formula One will be modeled. Modeling performance alone would only tackle part of the problem (cf. supra Problem situation) and, hence, race completion and qualifying ability are modeled as well. These models could contribute to the literature by identifying high-performance drivers, creating a safer environment through discovering patterns of non-completion, and assessing qualifying abilities to make sure teams can start with 2 drivers. The research questions section provides a good overview of what the contribution of this dissertation could be. In essence, this dissertation is an addition to Formula One literature for the following 4 reasons:

1. The performance of tree-based models to predict performance, race completion, and qualifying ability in Formula One.
2. The trade-off between interpretable models (decision trees) and more flexible models (bagging and boosting) in a sports environment.
3. The effect of class imbalance and hyperparameter tuning on these tree-based models in an F1 setting.
4. A comprehensive study of performance, race completion, and qualifying ability from both a data mining and a business logic perspective.
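To make the interpretability/flexibility contrast concrete, the toy R sketch below fits both ends of the spectrum on simulated data; the data and feature names are hypothetical and only serve to illustrate the trade-off.

```r
library(rpart)           # single decision tree: the interpretable end
library(randomForest)    # bagged ensemble: the flexible end

# Simulated toy data: the chance of a top 3 finish decays with grid position.
set.seed(1)
toy <- data.frame(grid = sample(1:20, 500, replace = TRUE))
toy$top3 <- factor(rbinom(500, 1, plogis(2 - 0.4 * toy$grid)))

tree   <- rpart(top3 ~ grid, data = toy, method = "class")
forest <- randomForest(top3 ~ grid, data = toy, ntree = 200)

print(tree)    # human-readable split rules: interpretability
print(forest)  # only an aggregate OOB error: flexibility at the cost of insight
```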


Part 3: Methodology and Research

Considered existing methodologies

This subsection is dedicated to finding the appropriate methodology for this Master's Dissertation. Alnoukari and El Sheikh (2012) define all the Knowledge Discovery Process (KDP) models in their book. In this study, the focus will be on 5 modeling techniques. First of all, there are 2 cornerstone modeling techniques: KDD, the initial approach, and CRISP-DM, the centralized approach (do Nascimento and de Oliveira, 2012). Three other methodologies are also considered in this section: SEMMA, SRP-CRISP-DM, and ASD-DM. The content of these 5 methodologies can be found in Table 40 (Appendix 1).

KDD

First, the KDD (Knowledge Discovery in Databases) methodology, which is the initial approach to modeling in data mining, is discussed. Fayyad, Piatetsky-Shapiro and Smyth (1996) introduce the KDD methodology as a useful framework for pattern discovery in a dataset. The authors explain the 9 steps in the KDD process, which are listed below.

' 1. Understanding the application domain
2. Create a target dataset
3. Clean the data and pre-processing
4. Data reduction and transformation
5. Choosing the data-mining task.
6. Choosing the data-mining algorithm(s).
7. Data mining
8. Evaluating the output of Step 7
9. Incorporating this knowledge into the performance system ' (p.23)

Fayyad, Haussler and Stolorz (1996) discuss the main issues in exploring a dataset. Moreover, the authors apply the KDD methodology in 5 different fields. According to this study, the issues in data analysis are the lack of domain knowledge of the researcher, the inability to scale algorithms efficiently and the inability to craft valuable features. The KDD methodology was applied in fields where data collection abilities have reached the highest level. KDD helped scientists analyze huge amounts of data in fields like atmospheric science, geophysics and molecular biology.


SEMMA

Another promising methodology is SEMMA, a technique developed by SAS13 as a development tool for data mining. SEMMA (Fernandez, 2003) is a process that enables a smooth data mining workflow by executing the following steps:

' 1. Sample
2. Explore
3. Modify
4. Model
5. Assess ' (p.10)

This SEMMA procedure is carried out in 5 different steps, of which the first letters combined form the word SEMMA. First, sampling the data is needed to get an understanding of the dataset. Next, trends and anomalies are investigated to obtain further insight. Then, the variables are modified and new variables are created. This is followed by building a model using software that searches for variables with predictive power for the desired outcome. Lastly, the accuracy, usefulness and reliability of the findings are assessed. This methodology facilitates pattern discovery in the data mining analysis and allows analysts to focus on data visualization.

CRISP-DM

The third methodology is CRISP-DM (CRoss-Industry Standard Process for Data Mining), which was developed because of the lack of an industry standard for data mining. Wirth and Hipp (2000) propose CRISP-DM as a process model for executing data mining projects. The authors state that a standardized approach will have value for analysts, vendors and customers. Moreover, this methodology can be used as a reference point for the market to assess the potential of data mining projects. In this study, a generic CRISP-DM model is postulated in the following steps.

' 1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment ' (p. 5-7)

13 SAS is an American company providing analytics software and solutions.


The authors conclude that pursuing this procedure might take a considerable amount of time for documentation. However, they state that this procedure is worth the effort since appropriate documentation leads to a quicker response to flaws or irregularities in the model. CRISP-DM is a useful methodology in a complex environment where changes occur at a fast rate.

SRP-CRISP-DM

The Sports Results Prediction CRISP-DM framework is proposed by Bunker and Thabtah (2019). This methodology is based on the CRISP-DM framework with adaptations to a sports environment. The 'Business understanding' step of the CRISP-DM framework is renamed to 'Domain understanding' since a sports domain is studied instead of a specific business. The 'Data preparation' step from CRISP-DM is renamed to 'Data preparation & feature engineering' since additional attention is given to the division of features into subsets based on similarity. The authors advise, in the modeling phase, to split the data into a training and a test set to fit the models and objectively test their prediction accuracy. The 'Model deployment' step comprises the automation of data scraping from the web, adding this data to the database, adjusting the training and test set, retraining the models and predicting new sports results. The steps of the SRP-CRISP-DM framework are:

' 1. Domain understanding
2. Data understanding
3. Data preparation & feature engineering
4. Modeling
5. Model evaluation
6. Model deployment ' (p. 30-32)

ASD-DM

Lastly, ASD-DM (Adaptive Software Development Data Mining) is proposed by Alnoukari, Alzoabi and Hanna (2008) to conduct a predictive data mining analysis. In their paper, the three different steps of their framework are outlined below:

' 1. Speculation: business understanding, data understanding and data preparation
2. Modeling
3. Learning: implementation/testing, evaluation ' (p. 3)


Overview

Table 3 contains the five postulated methodologies with their designated steps. We notice in the table the lack of a business logic and deployment phase in the SEMMA approach.

Table 3. Steps of data mining process methodologies

KDD | CRISP-DM | SEMMA | SRP-CRISP-DM | ASD-DM
Developing an Understanding of the Application | Business Understanding | / | Domain understanding | Speculation
Creating a Target Data Set | Data Understanding | Sample | Data understanding | Speculation
Data Cleaning and Pre-processing | Data Preparation | Explore | Data preparation & feature engineering | Speculation
Data Transformation | Data Preparation | Modify | Data preparation & feature engineering | Speculation
Choosing the suitable Data Mining Task | Modeling | Model | Modeling | Modeling
Choosing the suitable Data Mining Algorithm | Modeling | Model | Modeling | Modeling
Employing Data Mining Algorithm | Modeling | Model | Modeling | Modeling
Interpreting Mined Patterns | Evaluation | Assessment | Model evaluation | Learning
Using Discovered Knowledge | Deployment | / | Model deployment | Learning


Selecting the methodology

The methodology followed in this dissertation is chosen from the 5 options listed in the 'Considered existing methodologies' section: KDD, SEMMA, CRISP-DM, SRP-CRISP-DM, and ASD-DM. Azevedo and Santos (2008) describe CRISP-DM as an implementation of the KDD methodology resulting in a better fit with real-time systems; thus, CRISP-DM is preferred over KDD as the methodology for this study. The SEMMA technique is another potential methodology for this dissertation. Unfortunately, this technique misses a crucial part covering the underlying business logic and the deployment of the analysis. Hence, CRISP-DM is chosen over SEMMA as well. Next, CRISP-DM is compared to SRP-CRISP-DM as the approach for our research. These 2 methodologies are similar in the respect that they both comprise 6 steps to tackle a data mining study. However, the SRP-CRISP-DM methodology is modified to a sports environment for the following 3 reasons (cf. supra Methodology):

1. Bunker and Thabtah (2019), the authors of SRP-CRISP-DM, name the first step 'Domain understanding' since the sports analyses concern a certain domain and not a business.
2. Extra focus is devoted to creating feature subsets, e.g. race-related features and external features.
3. The deployment phase is specifically focused on open-source data, which is not always the case for businesses.

Consequently, the SRP-CRISP-DM methodology seems more fit for this sports analytics dissertation. The last methodology in the literature study is the ASD-DM methodology, a modification of CRISP-DM to facilitate agile software development. Since this study is not focused on agile software development, we will not use this modification and, hence, opt for SRP-CRISP-DM as the methodology for this dissertation. In Figure 2, the SRP-CRISP-DM methodology is applied to this Formula One research, in which it will act as a compass guiding the research.


Figure 2. SRP-CRISP-DM methodology applied on this dissertation

Research

Analyses

In our research, the SRP-CRISP-DM methodology of Bunker and Thabtah (2019) will be used to perform 3 binary classification analyses. The first analysis tries to discover high performance in a Formula One setting; concretely, high performance is defined in this study as finishing in the top 3. The second analysis investigates what drives the predictive performance of completing a Formula One race. The third analysis studies the determinants of the qualifying ability for an F1 race. These analyses tackle the 3 problems mentioned before (cf. supra Introduction: Problem situation).

Implementing the methodology

The research will apply the SRP-CRISP-DM methodology to our 3 classification analyses. SRP-CRISP-DM is a methodology that requires a lot of documentation, and it is carried out in a highly iterative manner. The research will walk through the different steps of SRP-CRISP-DM as follows. First, the problems and their associated analyses are discussed from a


domain logic perspective in the 'Domain Understanding'. Second, the 'Data Understanding' part explores the open-source Formula One datasets through crafting a data dictionary, an Entity Relationship Diagram (ERD), a data exploration, and a missing value analysis. Third, the crafted features for our 3 analyses are described in the 'Data Preparation & Feature Engineering'. Furthermore, the missing values will be imputed and the features will be divided into three big categories. Fourth, the 'Modeling' stage consists of an explanation of the different tree-based models. Special attention is given here to the interpretability/flexibility trade-off (James et al., 2013). Decision trees will be considered a proxy for interpretable models due to their easily understood decision-making logic. Bagging and random forest will be considered models with increased complexity and reduced interpretability. Boosting will be regarded in this study as the most complex of the applied models. Moreover, a rationale about class imbalance treatment and its synergy with tree-based models (Galar et al., 2011) can be found in this section. Fifth, the performance of the models is analyzed in the 'Evaluation' section by using appropriate metrics for our binary classification analyses: confusion matrix, sensitivity, specificity, AUC, lift, and F1. Sixth, the models are deployed via establishing a connection with the open-source datasets, which enables us to take newer data into account and alter the predictions accordingly.

Domain Understanding

A first acquaintance with Formula One is made in the subsection '1.1 Background - Formula One'. The discussed papers are found on either 'Google Scholar' or 'Web of Science' using the access rights of a student of the University of Ghent. For the interested reader, all the latest F1 updates can be found on https://www.formula1.com/. Moreover, there is also a documentary about F1 titled 'Formula 1: Drive to Survive'. Another source is Nico Rosberg, a former Formula One driver and F1 World Champion in 2016, who has a YouTube channel in which he analyses different aspects of Formula One. Examples of his analyses are the influence of the new rules on racing outcomes and the fastest driver of the moment.

In the literature, the experience of the driver, the starting position, and past performance (Stoppels, 2017; Eichenberger & Stadelmann, 2009) were mentioned as important factors for determining the finish position in a Formula One race. The home advantage of drivers has been researched in the past, as well as the impact of rain on driver performance. However, there is a lack of an overarching study that takes all these factors into account in one analysis of driver performance. Good


performance is defined in our first analysis as finishing in the top 3 since these drivers are invited on stage after the race to celebrate their ‘top 3’ position.

‘ How will the top 3 finish model be used? ‘

The ‘top 3’ model will be used to discover the determinants of high-performance in Formula One. For example, we expect that good past performance is an important feature to finish in the top 3. Also, the starting position is expected to be important since overtaking is rather difficult in Formula One. After assessing the determinants of driver performance, the focus will be on completing the race. Reasons for not being able to finish a race are either human dropouts or technical dropouts (Eichenberger and Stadelmann, 2009).

‘ How will the race completion model be used? ‘

The race completion model will help to make Formula One a safer environment by finding patterns that signal higher race completion. For example, we expect that drivers with better qualification results are less prone to crashing. We hope that our model will aid teams in understanding what drives technical or, worse, human dropouts. Next, the drivers of qualifying for a race are studied. Even though it seems self-evident, drivers are not always able to qualify for the race. Reasons for non-qualification can be withdrawing from a race, the 107% rule… This 107% rule states that drivers who are not able to set a qualifying lap time within 107% of the fastest lap time are not allowed to start the race. For example, if the fastest qualifying lap takes 90 seconds, a driver needs a lap under 96.3 seconds (90 x 1.07) to qualify.

‘ How will the qualification model be used? ‘

Formula One is a multi-billion dollar industry with a large amount of prize money tied to the finishing position of the driver. The constructors can use this money to further consolidate their dominance or to improve their cars and become the dominant players. The teams need to maximize their chances of obtaining the highest possible result. One way to do this is to make sure both drivers are present at the start of the race. The qualification model will help teams understand how they can further improve their chances of making sure that the qualification of both of their drivers is beyond dispute.


Data Understanding

The Formula One data from 1950 until 201914 was retrieved from the open-source Ergast API. The data was already structured in different datasets that are described below. Furthermore, the data was augmented with weather data15 to include the impact of wet racing conditions. Also, a dataset that converts the country into the related nationality was added to include the potential home advantage for the driver. A remark here is that the exceptional conditions of 2020, COVID-19, do not influence the data considered in our analyses since the scope is defined between 1950 and 2019.
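As a minimal sketch of this retrieval step in R, the snippet below downloads the zipped CSV dump and reads the main tables; the exact archive URL is an assumption, the Ergast website documents the actual link (see footnote 14).

```r
# Download the zipped CSV dump of the Ergast database (URL is an assumption).
url <- "http://ergast.com/downloads/f1db_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "f1db")

# Read the tables used in the analyses below.
results  <- read.csv("f1db/results.csv")
races    <- read.csv("f1db/races.csv")
drivers  <- read.csv("f1db/drivers.csv")
status   <- read.csv("f1db/status.csv")
circuits <- read.csv("f1db/circuits.csv")

# Restrict to the 1950-2019 scope of the analyses.
races <- races[races$year <= 2019, ]
```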

Data dictionary

The data dictionary represents the content of these structured datasets, as shown in Table 4. We choose not to list the different variables within each dataset here; a full data dictionary of the 15 used datasets can be found in Appendix 2.

Table 4. Datasets available in the Ergast API and their content

Dataset | Content
Season List | All the seasons starting from 1950 and the link to their Wikipedia page.
Qualifying Results | Qualifying results per driver per race, determining the start position in the race.
Constructor Information | The constructors for each season and a link to their Wikipedia page.
Lap Times | Lap times per season, round, and specific lap. This data is available from 1996.
Race Schedule | The schedule of every season, containing the order in which the races are held.
Driver Standings | The driver standings at every race.
Constructor Standings | Similarly, the constructor standings at every race.

14 The data was retrieved in December 2019 and can be found online at http://ergast.com/mrd/db/#csv

15 The weather data explanation and its sources can be found in Appendix 3.


Circuit Information | The country and city in which the race takes place.
Pit Stops | The time of a pit stop for a specific combination of lap, driver, circuit, and year.
Race Results | The results of every race, including the finish results. This is the most important dataset and the starting point for merging all datasets into an initial basetable.
Driver Information | The driver's name, nationality, date of birth, and Wikipedia page.
Finishing Status | The status at the end of the race: whether the driver had issues with the engine, brakes, etc., or finished the race on the same lap as the winner or +1, 2, 3… laps behind the winner.
Constructor Results | The results per constructor per specific race.
Weather | All the wet-affected races.
Country | Converts country to nationality.

Entity Relationship Diagram (ERD)

To construct the basetable, we have to understand how the different datasets are linked with one another. Al-Masree (2015) defines relationship extraction rules to identify the functional dependence of different datasets. A relationship can be extracted when there is overlap between a primary key, a variable or set of variables that uniquely identifies a specific observation, and a foreign key, a primary key of another dataset that establishes a relationship. Another aspect of understanding the underlying relationships between the different datasets is the cardinality. A cardinality of 0..N means that the corresponding primary key can be used 0, 1 or multiple times in the relationship to combine rows between 2 datasets. A cardinality of 1 refers to a single match between rows of 2 different datasets. Lastly, a cardinality of 1..N indicates that there should be at least one matching row between the datasets. Note that there are 2 directions for a relationship between dataset X and dataset Y, namely X → Y and Y → X, and therefore 2 cardinalities are associated with every pair of datasets. Yeh, Li and Chu (2008) explain the extraction process of an ERD from a table-based legacy dataset, and this paper provides additional information on how to construct an ERD as shown in Figure 3. In this figure, the 'Results' dataset serves as the central table in our analysis. Moreover, the variables that facilitate the relationships between the different datasets and the cardinality of every relationship can also be found in this figure.


Figure 3. Entity Relationship Diagram (the 'Results' table is the central entity, linked to the other tables via keys such as raceId and constructorId)


To illustrate how to read the ERD, consider the relationship between the status and results datasets. The relationship 'results → status' has a cardinality of 1, meaning that every row in the results dataset has exactly one match with a statusId in the status dataset. The relationship 'status → results' has a cardinality of 1..N, indicating that every statusId of the status dataset appears at least once in the results dataset.
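A minimal R sketch of realising this relationship, assuming the tables read in earlier and the Ergast column names:

```r
# Join results to status on the shared key statusId: every result row gets
# exactly one status (the cardinality-1 direction of the relationship).
results_status <- merge(results, status, by = "statusId", all.x = TRUE)

# The reverse direction has cardinality 1..N: one status value, e.g.
# 'Finished', occurs for many result rows.
head(sort(table(results_status$status), decreasing = TRUE))
```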

Data exploration

The intention here is to get a first encounter with the data. We will plot some basic figures that visualize potential relationships in these initial tables. This part can be seen as descriptive analytics since we look into what happened in the past16.

Races

Which circuit has hosted the most Formula One Grands Prix throughout the years?

Figure 4. Number of races on the different F1 circuits

The British and Italian circuits have been part of all the F1 seasons from 1950 until 2019 as can be seen in Figure 4. The third most driven circuit in F1 history is the prestigious Monaco circuit.
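The count behind such a figure is a one-liner; a sketch assuming the races and circuits tables loaded earlier:

```r
library(dplyr)

# Number of Grands Prix hosted per circuit, descending (cf. Figure 4).
races %>%
  count(circuitId, name = "nRaces") %>%
  inner_join(circuits, by = "circuitId") %>%
  arrange(desc(nRaces)) %>%
  select(name, nRaces) %>%
  head(10)
```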

16 We will assign a blue color to the most occurring setting in a specific figure.


How did the number of races per season evolve over the years?

Figure 5. Races per season throughout the F1 history

In Figure 5, the evolution of the number of races per season is depicted. We notice a gradual increase, with the 2016, 2018, and 2019 seasons having the highest number of races, which follows the increasing global presence of the sport.

Status: the status table contains the different finishing statuses a driver can obtain. This is of great relevance for our analysis since the event of finishing a race is the dependent variable of the second study in this dissertation.

What are the 10 most occurring driver statuses?

Figure 6. Top 10 most occurring statuses in the 'Results' data set


In Figure 6, we observe the 'Finished' status as the most occurring one. This status is reached when a driver finishes on the same lap as the winner of the race. Furthermore, we notice '+1 Lap', '+2 Laps', and '+3 Laps' to be highly occurring statuses. The 'Finished' status together with all the '+… Lap(s)' statuses facilitate the creation of the dependent variable for the race completion analysis. The 'Did not qualify' status is the starting point for the qualifying ability analysis. Thus, the status of the driver steers two of the three analyses behind the scenes.

Weather: this table contains all the wet-affected F1 races.

How many races were affected by rainy conditions?

Figure 7. Proportion of wet-affected races

During the 7 decades of F1 racing, 1018 races have been completed. Only 149 of these races were affected by rain; percentage-wise, this is nearly 15% of all races, as displayed in Figure 7. The rain-affected races from 1950 until 1989 can originally be found in Jones (1996), and the rain-affected races from 1990 until 2019 are included based on comparing 3 online sources: Autosport Forum, Quora and Wikipedia. While the data is retrieved from popular non-verified sources, it shows high similarity across sources. The rain variable is crafted via majority voting between the 4 sources mentioned above. In the case of a tie, the ranking for the chosen weather variable is: Jones (1996) > Wikipedia > Autosport Forum > Quora. The rain data included in these sources can be found in Appendix 3.
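A hypothetical sketch of this majority vote in R; the per-source 0/1 flags (wetJones, wetWiki, wetAutosport, wetQuora) are assumed column names, not the actual basetable columns.

```r
# Majority vote over the four sources; ties (2-2) defer to the highest-ranked
# source in the ordering Jones > Wikipedia > Autosport Forum > Quora.
rain_vote <- function(jones, wiki, autosport, quora) {
  votes <- jones + wiki + autosport + quora
  if (votes >= 3) return(1)  # clear wet majority
  if (votes <= 1) return(0)  # clear dry majority
  jones                      # tie: follow the top-ranked source
}

weather$wet <- mapply(rain_vote, weather$wetJones, weather$wetWiki,
                      weather$wetAutosport, weather$wetQuora)
```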


Qualifications: the qualifications determine the starting position for the actual race.

What is the distribution of completed first, second, and third-round qualifiers?

In Figures 8, 9, and 10, the distributions of the Q1, Q2, and Q3 variables are shown. We observe an increase in zeros for Q2 compared to Q1 and for Q3 compared to Q2. This means that fewer drivers were able to complete the last qualifier compared to the second and first qualifiers. Consequently, the observations having a value for the Q3 variable could potentially be considered as more competent drivers. Hence, we will assess the impact of these variables on 'top 3 performance', 'race completion', and 'qualifying ability'.

Figure 8. Proportion of hasQ1

Figure 9. Proportion of hasQ2

Figure 10. Proportion of hasQ3


Drivers

What are the top 5 most occurring nationalities for F1 drivers?

Figure 11. Top 5 occurring drivers' nationalities

In total, 847 drivers have had the opportunity to drive an F1 car. In Figure 11, the number of drivers for the 5 most represented countries is shown. Of these drivers, 164 represented Great Britain. The podium is completed by the USA (157) and Italy (99). France (73) and Germany (49) take places 4 and 5 respectively.

Constructors

How many constructors have there been in history, and which country is considered their home country?

Figure 12. Constructors per home country

In history, 209 constructors have participated in F1 races. In Figure 12, the distribution of constructors per home country is displayed. Remarkably, 86 of those constructors were British. The podium is completed by the USA (39) and Italy (29). Notice that the podium is


identical to the nationality podium for the drivers. A potential explanation, purely suggestive, could be the chauvinism of constructors in choosing drivers with a 'home nationality', since a constructor can scout more thoroughly in its home country.

Circuits

How many of the circuits that ever hosted an F1 race are located in each quadrant based on latitude and longitude?

Table 5. Circuit locations based on longitude and latitude

      | West | East
North |   42 |   25
South |    4 |    3

In total, 74 circuits have had the privilege of hosting an F1 Grand Prix. Table 5 contains the locations of the circuits based on longitude and latitude: 42 circuits are located in the North-West quadrant, 25 in the North-East, 4 in the South-West and 3 in the South-East. This statistic shows that the circuits are mainly located in the Northern hemisphere. However, it may be misleading since Formula One is a sport with a global presence, as stated in the 'Background' section.

Wins

Which driver has the most F1 victories of all time?

Figure 13. All-time wins per driver

Figure 13 displays the 3 drivers with the most all-time F1 wins. Schumacher comes first with 91 all-time wins, closely followed by Hamilton with 84 wins. The podium is completed by Vettel with 53 wins.


How many distinct drivers have ever won an F1 race?

Figure 14. Drivers that won at least one race in their career

In Figure 14, the split between drivers who ever won a race and those who did not is shown. In total, 103 of the 847 drivers have ever won an F1 race, which is around 12%. This statistic shows that winning at least one Formula One race in a career is already a big achievement.

Missing values

Treating missing values is essential to improve the quality of the data. According to Little, Lang, Wu and Rhemtulla (2016), missing values can reduce the power of the model and can lead to biased results since the observed relationships between variables could be incorrect. There are 2 tables which have some missing value issues. The 'Qualifying' table has missing values for the q1, q2, and q3 variables, which contain the lap time of respectively the first, second, and third qualification round. Fortunately, there is an explanation for these missing values: a driver who crashes his car during practice or an earlier qualification round might not be able to drive in the following rounds. In the 'Results' table, there was one major problem: some drivers had 0 as starting position (grid) in the dataset. When inspecting the data more closely, it becomes clear that these observations were non-participations because of withdrawal, non-qualification, or similar statuses. These observations are classified as non-qualifications, which are the research topic of our third analysis: the 'qualifying ability analysis'. In conclusion, the data quality of the Ergast API is high since there are only 2 tables, 'Qualifying' and 'Results', that suffer from minor missing value issues. However, extra missing values can occur when the different datasets are merged into one comprehensive basetable. The imputation of the missing values for the created features in the basetable will be discussed later (cf. infra Data Preparation & Feature Engineering: Feature description).


Data preparation & feature engineering

The initial data preparation, which is the merging procedure of the different tables from the Ergast API, is done via R in the RStudio environment. In total, our basetable contains 23041 observations with 33 different features (X) and 3 dependent variables (Y) related to the 3 different analyses performed in our research. In the 3 analyses, success is defined as either (1) finishing in the top 3, (2) completing the race, or (3) qualifying to start the race.

Analysis 1: Top 3 performance

Firstly, success is defined as finishing in the top 3 of a Formula One race. The top 3 are invited onto the podium and therefore get a certain amount of recognition for their performance. Drivers that finish in the top 3 get a '1', otherwise they get a '0' in our basetable. The goal is to help teams identify top performance in the highly competitive Formula One environment.

Analysis 2: Race completion

In this analysis, success is defined as being able to complete a race. Reasons for failing to complete a race can be a human error or a technical dropout. The purpose is to obtain insights that could help F1 teams improve the safety of the racing environment. A regrettable situation such as the 1976 Nürburgring crash of the late Niki Lauda is an event that hopefully will never happen again. We assign a positive, '1', or negative, '0', outcome based on the finish status of a specific observation.

Analysis 3: Qualification ability

Success is defined as being able to start a Formula One race. Sometimes a driver is not even permitted to start a race because of disqualification, the 107% rule… As mentioned before, this rule forbids drivers to participate when their best qualification time is more than 7% slower than the fastest qualification time of all participants. However, it should be noted that the enforcement of this rule has been relaxed during the last couple of years. Drivers that were not able to qualify receive an outcome equal to '0', otherwise a '1' is assigned to the driver. Table 6 contains the different analyses carried out in this dissertation and the values of their corresponding dependent variables.

Table 6. Dependent variables in our binary classification analyses

Analysis 1: Top 3 performance | Analysis 2: Race completion | Analysis 3: Qualification ability
0 = not finished in the top 3 | 0 = not completed the race | 0 = not qualified for the race
1 = finished in the top 3 | 1 = finished the race | 1 = qualified for the race
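A minimal sketch of how these three 0/1 targets could be derived on a merged basetable `bt`; the column names (positionOrder, status, grid) follow the Ergast dump and are assumptions, not the dissertation's actual code.

```r
# Analysis 1: top 3 performance.
bt$top3 <- as.integer(bt$positionOrder <= 3)

# Analysis 2: race completion ('Finished' plus the '+n Lap(s)' statuses).
bt$completed <- as.integer(bt$status == "Finished" |
                           grepl("^\\+\\d+ Lap", bt$status))

# Analysis 3: qualifying ability (grid 0 marks a non-qualification).
bt$qualified <- as.integer(bt$grid > 0)
```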


Feature description

In our analysis, 33 features are crafted to predict 'top 3 performance', 'race completion', and 'qualifying ability'. The SRP-CRISP-DM methodology devotes specific attention to categorizing the features into related subsets. We divide the features into three big categories: experience-related, dependent-related, and race-related features. The experience-related features comprise the accumulated experience of the drivers and constructors throughout the years. Dependent-related features give insights into a driver obtaining a top 3 in the past, not completing a race, or not qualifying to start a race. The race-related features consist of the starting position and the points obtained in the season as a proxy for the current form of the driver. Moreover, the form of the driver's team partner is also included in the analyses.

Race-related features

An important feature for predicting the outcome in many analyses is the starting position of the driver. This starting position is named 'grid' in the Formula One jargon. Stoppels (2017) uses the starting position as a feature to predict the race outcome in Formula One by applying ANNs. To fit the predictions to the current Formula One environment, in which a maximum of 20 drivers can start the race, the grids will be rescaled between 1 and 20 for every single race using:

rescaled grid = round( (original grid in the race / maximum grid in the race) × 20 )

The rescaled grid is rounded to the nearest integer since a starting position of 12.89 is not among the possibilities for a grid; grid positions should be integers. Another option was to set all grids above 20 equal to 20, but this would have created a disproportionately large and unwanted number of 'grid equal to 20' occurrences. Deleting the observations with a grid larger than 20 might create bias in the data, which could influence the conclusions of our analyses. Thus, we opt for rescaling this feature with the formula mentioned above.
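A minimal dplyr sketch of this per-race rescaling, assuming the merged basetable `bt` with Ergast column names:

```r
library(dplyr)

# Rescale the grid per race to the modern 20-car field; grid 0 marks a
# non-qualification and simply stays 0 after rounding.
bt <- bt %>%
  group_by(raceId) %>%
  mutate(rescaledGrid = round(grid / max(grid) * 20)) %>%
  ungroup()
```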


The performance in the current season is measured via pointsSeason, which is also a sliding-time-window variable. This feature is normalized by the current round, meaning the points are divided by 5 in the 5th race of the season, by 11 in the 11th race of the season, and so on. Parallel to the sliding time window of nrWins, pointsSeason consists of the normalized points before the start of the race. To weigh the different races throughout the years equally, a modification has to be made to the pointsSeason variable: from 1950 until 2009, 8, 9, or 10 points were awarded to the winner, whereas 25 points are awarded to the winner from 2010 onwards. To keep it straightforward, the normalized pointsSeason variable is multiplied by 2.5 for the 1950-2009 period. Moreover, we include an integer for the round of the season via nrRaceSeason, because it is far more challenging to obtain the maximum number of points after 5 races of the season than after the second race. Eichenberger and Stadelmann (2009) also include the results of the team partner driver as a control variable in their analysis. Likewise, we construct a variable teammateDiffPointsSeason to include the difference in normalized points between the driver and his team partner. Some caution is needed here since in the first decade of Formula One constructors could participate with more than 2 drivers; this obstacle is overcome by taking the mean of teammateDiffPointsSeason. In the soccer prediction study of Eryarsoy and Delen (2019), the effect of the home crowd, the 12th man, is included to predict the outcome of a soccer game. A similar 'home crowd effect' for a driver in a Formula One race is used in the study of Eichenberger and Stadelmann (2009). Parallel to these studies, we craft a variable homeAdvantage that captures the home advantage of a driver in our 3 analyses. The weather is an uncontrollable variable but can have a high impact on the outcome of a race. The variable rain quantifies the weather conditions of a particular Grand Prix, with a race affected by rain classified as 0 and one not affected by rain as 1. An in-depth explanation of the creation of this feature can be found in Appendix 3. The number of laps of the race is used as well, since drivers get tired throughout the race and the race demands continued sharpness from the driver; we quantify this with lapsRace.

Experience-related features

To quantify previous performance regarding the starting position, we crafted 2 variables: pastStart3 and everStart3. pastStart3 is the number of past starting positions in the top 3 for a specific driver, and everStart3 registers whether a driver has ever started in one of the first 3 positions. In our analysis, the impact of previous successes should not be neglected and, hence, a variable nrWins is constructed to capture this. nrWins accounts for the all-time performance of a driver by capturing their all-time wins. Since we aim to make predictions for 'top 3 performance', 'race completion', and 'qualifying ability', features cannot contain information from the future. Therefore, a sliding window, also used in the study of Bastiaans (1985), is applied to the nrWins variable, meaning this feature contains the number of wins before the race starts. Technically, the feature is lagged one time period, i.e., to the previous race of that specific driver.
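A sketch of this sliding-window (one-race lag) construction for nrWins, assuming the basetable `bt` and a `date` column taken from the races table:

```r
library(dplyr)

# Cumulative number of wins *before* the current race, per driver: cumsum
# counts wins up to and including each race, lag shifts it back one race.
bt <- bt %>%
  arrange(driverId, date) %>%
  group_by(driverId) %>%
  mutate(nrWins = lag(cumsum(positionOrder == 1), default = 0L)) %>%
  ungroup()
```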


The impact of the weather was also included in the ANN analysis of Stoppels (2017) to predict racing outcomes for the 2016 Formula One season. Furthermore, Rosso and Rosso (2016) study driver performance in changing weather conditions during an F1 race. We include the experience in the rain by recording the number of past races on a wet circuit with the variable rainExp. Teams (constructors in Formula One lingo) play an important role in Formula One. In the literature, there is some disagreement regarding the contribution of teams to the overall performance of the drivers. As discussed in the literature review, Allen (2000) states that driver contribution can overshadow team contribution. On the contrary, Spurgeon (2009) attributes 80% of the performance to team contribution and 20% to driver contribution. Moreover, Bell et al. (2016) conclude that team effects outweigh driver effects for top performance in a race. The authors state that the constructor contributes to the driver's performance with the knowledge acquired throughout F1 history, which they call the 'legacy effect'. We include this legacy effect in our analyses with the variable legacyConstructor, which measures the all-time race participation of a constructor before the race starts. This variable is, like nrWins and pointsSeason, a sliding-time-window construction. The different constructors themselves are not included in the analyses since we opt to capture the experience of the constructor via legacyConstructor. This choice is supported by the research questions and the literature: the research question is not 'Who is the best constructor throughout Formula One history?', and thus we choose to make this study constructor-independent by quantifying constructors by their legacy. Moreover, there are 212 different constructors in the 'Constructor' dataset, meaning there are many categories to take into account for this variable, and Micci-Barreca (2001) states that high-cardinality categorical attributes pose a challenge for classification and regression problems. For these 2 reasons, we do not include the high-cardinality variable constructor but capture it with the legacy effect of the constructor. To measure the quality level of a specific constructor, we include the number of all-time wins of that constructor via nrWinsConstructor. This variable is also created via the sliding window approach of lagging the variable by one race. As mentioned in the introduction, Toro Rosso changed their name to Alpha Tauri for the 2020 season. This introduces an extra hiatus in the data since a team that alters its name would be regarded as a new team. To deal with this inconsistency, we looked into the previous names of the current Formula One teams and assigned all these name modifications to the appropriate constructor (Appendix 4).


Another way to measure a driver's ability to steer his Formula One car is to create 3 extra variables: hasQ1, hasQ2, and hasQ3¹⁷. 'hasQ1' captures whether a driver was able to complete the first qualifying round. 'hasQ2' captures whether a driver participated in the second qualifying round. A reason for non-participation is, for instance, that the driver crashes his car during the first qualifying round. 'hasQ3' is a similar variable, but it registers this (non-)participation for the third qualifying round. Instead of including the hasQx variables directly, we make them more experience-focused by dividing the past positive hasQx (pastHasQx) by the total participations of a specific driver. The formula for computing these variables is:

\[ \mathrm{pastHasQxProp} = \frac{\text{positive pastHasQx of driver}}{\text{number of participations of driver}}, \quad x \in \{1, 2, 3\} \]

This variable is created with the sliding window approach since we cannot be sure before the race starts that the driver will participate in it. There might be an acute technical failure leading to non-participation, or the race officials might bar the driver from starting the race. Age can be used to capture the maturity of the driving skills of the driver; hence, an older driver might perform better or prevent crashes thanks to more careful driving habits. Age will be used as one of the experience-related features for a driver and this feature is categorized as shown in Table 7.

Table 7. Categorizing age

Categories age: ageUnder21, age21_24, age25_28, age29_32, age33_36, age37_40, ageAbove40

17 These variables can be referred to as hasQx.


Another quantification of the experience of a Formula One driver is his years of experience, captured by the variable yearsExp. It is assumed that more experienced drivers will perform better with the same car than inexperienced drivers. We categorize yearsExp to understand which yearsExp category has the highest predictive power in our 3 binary classification analyses. The categories for the 'yearsExp' feature can be found in Table 8.

Table 8. Categorizing yearsExp

Categories yearsExp: yearsExpUnder3, yearsExp3_5, yearsExp6_8, yearsExp9_11, yearsExp12_15, yearsExpAbove15
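Both categorizations can be implemented with base R's cut(), as sketched below under the assumption of numeric columns age and yearsExp; the breaks follow Tables 7 and 8.

    results$ageCat <- cut(results$age,
                          breaks = c(-Inf, 20, 24, 28, 32, 36, 40, Inf),
                          labels = c("ageUnder21", "age21_24", "age25_28",
                                     "age29_32", "age33_36", "age37_40",
                                     "ageAbove40"))

    results$yearsExpCat <- cut(results$yearsExp,
                               breaks = c(-Inf, 2, 5, 8, 11, 15, Inf),
                               labels = c("yearsExpUnder3", "yearsExp3_5",
                                          "yearsExp6_8", "yearsExp9_11",
                                          "yearsExp12_15", "yearsExpAbove15"))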

Another feature considered in our analyses is the circuit, to grasp its influence on the outcomes. There are 41 different circuits in the dataset, of which some are much more technical than others and, thus, require a higher skill level from the drivers. Sulsters and Becker (2018) state that each circuit has its own characteristics and, thus, the driver is challenged every race to adapt to another circuit. For example, the Yas Marina circuit in Abu Dhabi is regarded as one of the more difficult circuits since it is hard to overtake there. An advantage of tree-based models could be the discovery of interesting driver-circuit interactions, which is why the circuit is considered in our analyses. Even though every circuit has its characteristics, we decide not to include the specific circuits since some circuits have become obsolete and are no longer part of the Championship races. Instead, the circuit is incorporated via the driver's experience on every circuit with the variable circuitExp. Moreover, we include the number of previous non-completions on a particular circuit via the feature pastNoCompCircuit. Both features are implemented using the sliding window approach.


Dependent-related features

Next, dependent-related variables are included in our analyses. For the top 3 performance analysis, the number of past top 3 performances is captured via pastTop3. Furthermore, everTop3 registers whether a driver has ever finished in the top 3. Parallel to the top 3 performance analysis, pastNoComp and everNoComp are created for the race completion analysis, and pastNoQual and everNoQual are constructed for the qualifying ability analysis. pastNoComp and pastNoQual quantify the number of non-completions and non-qualifications, respectively, for a particular driver. everNoComp and everNoQual register whether the driver ever had a non-completion or non-qualification, respectively. For every model, we include the 2 relevant features to capture this lagged component related to the dependent variable.

Overview of the crafted features

In the SRP-CRISP-DM methodology, extra attention is given to creating a subset of related features. As stated above, we have 3 big feature categories: experience-related, dependent-related, and race-related features. An overview of these 3 feature categories and their associated features can be found in Table 9. Moreover, we indicate whether these features have been used in previous studies.

Table 9. Features used in the analyses

Feature | Description | Used for analysis | Previous analysis

Experience-related features
rainExp | Amount of races driven in the rain | All 3 | /
age | The age of the driver to grasp his maturity; split up in 7 categories (cf. supra Table 7) | All 3 | /
yearsExp | Time span between the race of a specific observation and the first F1 race of a driver; split up in 6 categories (cf. supra Table 8) | All 3 | /
circuitExp | Amount of races driven on a particular circuit | All 3 | Sulsters and Becker (2018)
nrWins | Including previous successes in the analysis | All 3 | Eichenberger and Stadelmann (2009)
nrWinsConstructor | Number of all-time wins of the constructor | All 3 | /
legacyConstructor | Amount of previous participations of the constructor | All 3 | Bell et al. (2016); Marino et al. (2015)
pastStart3 | Number of times started in the first 3 starting positions | All 3 | /
everStart3 | Ever started in the first 3 starting positions | All 3 | /
pastHasQ1Prop | The driver's performance at the first qualifying round | All 3 | /
pastHasQ2Prop | The driver's performance at the second qualifying round | All 3 | /
pastHasQ3Prop | The driver's performance at the third qualifying round | All 3 | /
pastNoCompCircuit | Number of past crashes and technical errors on a particular circuit | All 3 | /

Dependent-related variables
pastTop3 | Amount of past top 3 finishes | Top 3 | /
everTop3 | Ever a top 3 finish | Top 3 | /
pastNoComp | Amount of past non-completions | Race completion | Sulsters and Becker (2018)
everNoComp | Ever a non-completion | Race completion | /
pastNoQual | Amount of past non-qualifications | Qualifying ability | /
everNoQual | Ever a non-qualification | Qualifying ability | /

Race-related features
pointsSeason | Total points of a driver in the season normalized by round of the season | All 3 | Stoppels (2017); Judde et al. (2013)
nrRaceSeason | Integer indicating the round of the season since it is harder to achieve high normalized points as the season proceeds | All 3 | /
teammateDiffPointsSeason | Difference between the total points of a driver and his team partner in the season, normalized by round of the season | All 3 | Stoppels (2017)
homeAdvantage | Home crowd effect on performance | All 3 | Eichenberger and Stadelmann (2009)
rain | Effect of bad weather. 0 = no rain; 1 = rain | All 3 | Rosso and Rosso (2016); Stoppels (2017); Judde et al. (2013)
grid | Starting position in the race | Top 3, race completion | Stoppels (2017); Sulsters and Becker (2018)
lapsRace | The number of laps for a specific circuit | All 3 | Stoppels (2017)


Feature scaling

Li, Jing, Ying and Yu (2017) state that tree-based models are scale-invariant, meaning they are not influenced by different scaling methods. Models like CART (Classification And Regression Trees) and Random Forest are based on an impurity measure instead of distance-based measures. These impurity measures are applied in the split selection method for classification trees (Loh & Shih, 1997). The authors state that this approach iterates over all possible binary splits of the data for all the features, resulting in the choice of the split that reduces the measure of node impurity the most. Thus, it is determined which feature, with its related threshold level, is used to make as many correct classifications as possible. Conclusively, feature scaling is not needed since tree-based models are based on a split selection method that uses the impurity or a related measure to determine the feature to make a binary split on.

Timeline of features

The availability of different features is of vital importance for our analysis since the unavailability of a feature for a specific period will lead to missing values. Some of these features are not available from 1950 onwards, as shown in Figure 15. The years 1950-1994 represent the minimum available dataset. The data is augmented with the qualifying round 1, round 2, and round 3 data in 1994, 2005, and 2006, respectively. This has to be taken into account and in the next subsection we discuss the imputation method for these missing values.

Figure 15. Time analysis of the availability of features used in the analysis


Imputing missing values

Four features suffer from missing values: pointsSeason, pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop. As mentioned in the Feature Description section, pointsSeason contains the points obtained by the driver in the season, normalized by the round of the season. This variable is based on the sliding time window principle, which ensures the feature does not include information from the future; as a consequence, pointsSeason is missing for the first race of the season. The other 3 features have limited availability in time (cf. supra 'Timeline of features'), leading to a substantial amount of missing data for those variables. There are thus 2 different missing value problems here: one is created by us to solve the flaw of the sliding time window for the first race of the season, whereas the other stems from limited availability in time. Li et al. (2004) describe 3 ways of dealing with missing data: deletion of the row or column, methods that can estimate cases of missing data, or imputing the missing values. As the title of this section suggests, we opt for the last way of treating missing data: imputation. Scheffer (2002) calls deletion bad practice but considers mean imputation much worse since the variance within the variable is destroyed. Moreover, Donders et al. (2006) conclude that overall mean imputation, the missing-indicator method, and complete case analysis produce biased results in most situations. An example of a method that can estimate cases of missing data is the EM algorithm (Dempster, Laird & Rubin, 1977). Unfortunately, Celeux and Diebolt (1992) state that the EM algorithm is dependent on the starting position and, in some cases, the convergence of the algorithm is intolerably slow. Thus, imputation is the chosen route to treat the missing values in this dissertation. Li et al. (2004) define 4 approaches to impute the variables: mean, regression, hot deck, and multiple imputation methods. As stated above, mean imputation is not an approach to use for imputing missing values. Von Hippel (2004) states that regression imputation, which uses a regression model to impute missing values, is biased since the regression parameters are derived using pairwise deletion. Thus, we focus on 'hot deck imputation' and 'multiple imputation' to treat missing values. Clustering is an example of hot deck imputation and, more particularly, Li et al. (2004) state that K-means is a potential method that treats the fuzzy relationship between the different variables as a clustering problem to estimate the value of the missing observation's characteristic. The K-means method is adapted with a fuzzy membership function to avoid the algorithm getting stuck in local minima.


Another imputation method is k-Nearest Neighbors (kNN)¹⁸ imputation, which is listed as one of the missing value treatment techniques by Acuna and Rodriguez (2004). These authors point to the downsides of this technique: the arbitrariness of choosing K, the choice of distance function, and the time-consuming nature of the process. The K-means process also depends on an arbitrarily chosen K and for that reason we take a different path, which is the path of 'multiple imputation'. A Multiple Imputation method (Rubin, 1996) is the proposed method for treating complex data with more than one variable containing missing data. An example is the Multiple Imputation by Chained Equations (MICE) method (van Buuren, Boshuizen & Knook, 1999) to craft imputed datasets. MICE performs well at capturing linear relationships via the chained equations used to impute missing values (van Buuren & Groothuis-Oudshoorn, 2010). However, non-linear relationships are not included in the default MICE models (Shah et al., 2014). These authors incorporate random forest into MICE to capture such non-linear relations when imputing the missing values. To map complex interactions and non-linear relations, a random forest approach with out-of-bag imputation can also be used (Stekhoven & Bühlmann, 2012). In general, random forest imputation yields good results for mixed-type data. Thus, the MICE method would be advised for linear relationships between the features and the random forest-based methods for non-linear relationships. Since the true relationships between the features are unknown, we opt for the MICE method. Fortunately, tree-based models perform well at capturing potential non-linear relationships within the analyses (Derrig & Francis, 2006). Thus, the MICE method, which captures linear relationships when imputing the missing values, is combined with tree-based models that achieve good results at estimating non-linear dependencies. Conclusively, the MICE method will be used to impute the missing values for the pointsSeason, pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop¹⁹ features.

18 K-means and kNN should not be confused: K-means is an unsupervised clustering method, whereas kNN is used for supervised classification problems (Quek, Woo & Logenthiran, 2016).

19 To be complete: pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop are features created by the sliding window approach. This creates another problem for the first-ever race a driver participates in, since the value will be non-existent. These features are approximated by 1 for the first-ever race of a driver. These values could be imputed as well, but imputing a variable that is crafted from another imputed variable, hasQx, seems a bit too far-fetched. Therefore, we use 1 as an approximation for the pastHasQxProp of a driver's first-ever race.
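To make the chosen route concrete, a minimal sketch of the MICE imputation with the R package mice is given below. It assumes the prepared data frame basetable; the number of imputed datasets (m = 5) and the seed are illustrative choices, not the exact settings of the dissertation.

    library(mice)

    # Chained equations over all variables that contain missing values
    imp <- mice(basetable, m = 5, seed = 123, printFlag = FALSE)

    # Take one completed dataset to continue the analysis with
    basetable_imputed <- complete(imp, 1)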


Excluded features

We pursue an approach that uses many different variables – data mining – while also thinking critically about which ones to include in the different analyses – domain logic. Approaching the analyses from a data mining perspective, we include all the crafted features in our 3 binary classification analyses. From a domain logic perspective, the question arises: 'Is it useful to include a specific variable to predict top 3 performance, race completion, or qualifying ability?' In the qualifying ability analysis, we cannot take the grid into account since we predict whether the driver will be able to start the race. Formulated differently, predicting whether the driver will obtain a grid position is the aim of this analysis. Consequently, the grid cannot be used as an independent variable here because it steers the classification of our label directly. After assessing the occurrence of these non-participations, we observe that they are present for the entire period of our qualifying ability analysis, 1950-2019. The analyses do not include the team budgets since a budget ceiling has been decided for the 2021 season. The past budget of the teams might be a reasonable explanatory variable for past behavior, but this budget ceiling will even out its effect for predictions since teams are obliged to keep their budget under a specific amount. Hence, the budget is not included in our prediction analyses. In general, no specific drivers, circuits, nationalities, or constructors are included in our analyses. The purpose of our analyses is not to identify the best driver of all time, the best driver per circuit... Furthermore, we want to keep the analyses independent of a specific driver, circuit, nationality, or constructor. Hence, the scope here is on:

1. specific characteristics of a driver: the experience in the rain, age, the home advantage...
2. the constructor, via the legacy effect of the constructor
3. the driver's team partner, via his points in the season
4. the circuit, via a driver's circuit experience and the laps of a circuit

The focus of our analyses is on predicting high-end performance (top 3), race completion, and qualifying ability. If the pit stops within a race were included, these analyses would take on an explanatory character and, consequently, this would jeopardize the idea of being a predictive study. A solution could be to take the mean of previous pit stops over the driver's all-time participation. Unfortunately, the pit stop data is only available for the last decade, and imputing the other 6 decades based on data from 1 decade seems too far-fetched. Hence, we do not include the pit stop data, both because we want to craft a predictive study and because of its limited availability in time.


Modeling

In the modeling section, the Formula One basetable created during the 'Data Preparation & Feature Engineering' step is split into a training and a test set. In the modeling step of the SRP-CRISP-DM, we focus on the training set. Hyperparameter tuning via 5-fold cross-validation, class imbalance, and tree-based models are the main concepts discussed in this section.

Training and test set

The basetable, which is the prepared data set, is transformed into a training and a test set with splitting percentages of 70% and 30%, respectively. The training set uses a sample of the data to fit the model (James et al., 2013). In our case, this training set is used to fit tree-based models – decision tree, bagging, random forest, and boosting. The test set, which is unseen data for the model, is used to make predictions with the fitted model and, subsequently, indicates the generalizability of the model to new data. The training and test data will be used to carry out our 3 analyses: (1) top 3 performance, (2) race completion, and (3) qualifying ability.

Validation

The hyperparameters of the models play an important role in our analyses. Hyperparameters are parameters defined exogenously while fitting the model and whose values cannot be estimated from the data. Technically, Bergstra, Bardenet, Bengio and Kégl (2011) define hyperparameter optimization as the problem of optimizing a loss function over a graph-structured configuration space. The tuned hyperparameter will be discussed thoroughly for every algorithm. The thought process is that the hyperparameter most closely related to the flexibility/interpretability trade-off and/or bias/variance trade-off of James et al. (2013) will be considered. In our analyses, a cross-validation procedure is performed to finetune the hyperparameters. Instead of using one training and one test set and neglecting the hyperparameters, we apply this cross-validation to adjust the hyperparameters to our research. Fushiki (2011) states that cross-validation is a widely used approach to estimate the prediction error. We consider leave-one-out cross-validation and k-fold cross-validation to cross-validate the models. Wong (2015) explains leave-one-out cross-validation (LOOCV) as a special case of k-fold cross-validation in which the number of folds is equal to the number of instances in the training set. The author states that k-fold cross-validation may be chosen from a computational viewpoint. On the other hand, Davison and Hinkley (1997) point to the problem of bias in k-fold cross-validation when k is small. Both LOOCV and k-fold cross-validation have advantages over one another. However, k-fold cross-validation is picked since it is recommended from a computational viewpoint. A value for k has to be decided to cross-validate the models in this study.


Rodriguez, Perez and Lozano (2010) recommend the use of k=5 or k=10 because of the lower bias compared to k=2 and the higher computational efficiency compared to k=n, which is LOOCV. Based on these statements in the literature, we conclude that either a 5- or 10-fold cross-validation would be advised in our analysis. However, Tax and Joustra (2015) state that cross-validation is not advised for sport prediction because of the time-ordered nature of the data. Bunker and Thabtah (2019) add that shuffling the data via cross-validation is not a good approach since the order of the instances should be preserved. We do not follow this reasoning of not performing cross-validation since it is essential to tune the hyperparameters of the models. Moreover, our crafted features are insensitive to the notion of time due to the applied sliding window and the independence of the features from a specific driver, team, or constructor. Thus, the k-fold cross-validation of Rodriguez, Perez and Lozano (2010) will be followed with k equal to either 5 or 10. To limit the computation time of our models, we use 5-fold cross-validation as shown in Figure 16.

Figure 16. 5-fold cross-validation

As mentioned before, the test set enables us to calculate an objective performance measure of the models. The training set is resampled 5 (=k) times, each time with another validation fold, the blue fold in Figure 16, to validate the model with specific hyperparameter configurations. Hyperparameter configuration introduces the decision on the search strategy to use for hyperparameter tuning. Two options are considered: grid search and random search. Bergstra and Bengio (2012) define grid search as a procedure in which a set of values for specific hyperparameters is chosen and every possible combination of these values is formed via a set of trials. Random search refers to taking independent draws from a uniform density over the same configuration space that would be spanned by a regular grid. Concretely, the algorithm randomly picks several combinations to tune the hyperparameters. Bengio, Lamblin, Popovici and Larochelle (2007) optimize the hyperparameters of their neural network via the grid search method. Bergstra and Bengio (2012) suggest using random search since this method is more efficient than grid search, both from an empirical and a theoretical perspective. However, we follow a grid search approach since we would like to input the same grid of trials for our 3 analyses to increase the comparability between the analyses. Furthermore, notice the absence of the test set in this rationale since that set remains untouched during cross-validation. The test set is used to evaluate the predictive performance of the fitted models with tuned hyperparameters on new data.
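The split and the cross-validation set-up can be sketched with caret as follows. The sketch assumes a basetable with a binary factor target top3 with levels "no" and "yes"; the target name and seed are illustrative. The grid of trials itself is passed per model via tuneGrid, as shown in the later model sketches.

    library(caret)
    set.seed(123)

    # 70/30 stratified split into training and test set
    idx       <- createDataPartition(basetable$top3, p = 0.7, list = FALSE)
    train_set <- basetable[idx, ]
    test_set  <- basetable[-idx, ]

    # 5-fold cross-validation that reports the AUC (column "ROC")
    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)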


Tree-based models

Tree-based models are the fundament on which our dissertation is built, and they are therefore of paramount importance. It is not our intention to describe every mathematical aspect of these models. However, we will occasionally use a formula to explain the underlying forces of a model. The tree-based models discussed in this dissertation, inspired by the rationale of James et al. (2013), are decision trees, bagging, random forest, and boosting. The considered boosting models are Adaptive Boosting, Gradient Boosting, and XGBoost.

Decision tree classifier

The Decision Tree Classifier (DTC) is discussed first since it forms the basis on which the other models build extra complexity. Hence, the DTC is the most interpretable model in this dissertation. Safavian and Landgrebe (1991) describe the DTC, introduced as a modeling approach by Quinlan (1986), as an approach to multi-stage decision making. In essence, the DTC breaks a complex decision into several simpler decisions. The logic of the DTC can be assessed in Figure 17, where a fictitious example is presented.

Figure 17. Decision Tree Classifier


Figure 17 helps to classify a specific observation based on some features into category 1 or category 0. The second analysis, race completion, is used as a fictitious example to explain the interpretation of a DTC. In Figure 17, a driver with a circuit experience of 4 or more will be classified as 1 (R1), meaning this driver completed the race. Drivers with fewer than 4 races of experience on a circuit who enjoy the cheers of the home crowd are also classified as 1 (R2), and those who do not have the home advantage are classified as 0 (R3). The features on which the decision tree splits, circuitExp and homeAdvantage, are determined via splitting measures. We focus on the Gini index as the metric used at each node to determine which feature to split on. Next to entropy, this is one of the measures used in Classification and Regression Trees (CART) (James et al., 2013). The formula is:

\[ \text{Gini index: } G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right) \]

The Gini index is used as a measure of node impurity in CART, meaning the reduction in node impurity is measured for every candidate binary split and the split with the largest reduction is chosen (cf. supra Feature scaling). Technically, the feature space is partitioned to classify the outcome appropriately. The feature space partitioning is shown in Figure 18 (James et al., 2013), based on the DTC in Figure 17. We see the 2 variables, homeAdvantage and circuitExp, on which the splits are performed on the axes. If circuitExp is 4 or more (area R1), the driver will complete the race according to the DTC. Similar reasoning can be applied to the two other regions, resulting in R2 equal to 1 and R3 equal to 0.

Figure 18. Feature space DTC (James et al., 2013)


This model yields interpretable results for the complex problem of race completion by breaking it down, in this fictitious example, into one or two easier decisions. However, the DTC can be sensitive to noise in the training data, especially when the trees are grown very deep. This sensitivity to noise reduces the generalizability of the obtained model to new data. Consequently, other more flexible models will be used to increase the complexity that the model can deal with. The relationship between tree-based models and the interpretability/flexibility trade-off of James et al. (2013) is one of the main research topics of this dissertation (cf. 'Contribution to literature'). Therefore, the focus will be on tuning the hyperparameter of the decision tree algorithm that is directly related to this trade-off. A decision tree gets more complex to interpret when additional layers are created. Wehenkel, Pavella, Euxibie and Heilbronn (1994) identify the risk α of not growing the tree to lower layers with higher class purity. An α of 0 means shrinking the tree to its top node and an α of 1 refers to fully growing the tree. The trade-off here is that a high α will include extra noise from less important variables in the decision tree, while a low α will not make use of all the information available in the training set. A hyperparameter that is closely related to this tree complexity is the cost-complexity measure. This measure is used for tree pruning purposes and it can optimize the trade-off between the cost of misclassification and the complexity of the tree (Yan et al., 2016). Thus, the complexity parameter (cp) will be tuned for the decision tree classifier algorithm via a grid search.
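A minimal sketch of this grid search over cp with caret, reusing the trainControl object from the validation section; the cp grid itself is illustrative.

    dtc_fit <- train(top3 ~ ., data = train_set,
                     method    = "rpart",   # CART decision tree
                     metric    = "ROC",     # AUC as the tuning criterion
                     trControl = ctrl,
                     tuneGrid  = expand.grid(cp = c(0.0001, 0.001, 0.01, 0.1)))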


Bagging

The decision tree classifier suffers from a great deal of variance when the training set is altered and, thus, other tree-based approaches were created. One of these approaches is 'Bagging', introduced by Breiman (1996), who also calls it 'bootstrap aggregating' (p. 123). This approach generates a multitude of versions of a predictor and uses these versions to obtain an aggregated predictor. First, bootstrapping, a technique that draws multiple random samples with replacement from the observations, is performed. Bootstrapping is related to jackknifing, which finds its origins in the work of Quenouille (1949), where it was used to conduct approximate tests of correlation in time series. The jackknifing technique was introduced since it reduced the bias for the dataset and the standard error could be estimated (Quenouille, 1956; Hinkley, 1977). After the bootstrapping procedure, an aggregation step follows: an average is taken for predictions with numerical outcomes, while a plurality vote is performed to predict a class. Since our analyses are binary classifications, the bagging method classifies the targets based on the plurality voting mechanism. In Figure 19, the bagging procedure, based on Breiman (1996), is shown, consisting of 6 steps²⁰.

Figure 19. Bagging procedure

20 Instead of using mathematical expressions to explain the bagging procedure, we opt for a 6 step approach that gives a comprehensive overview of the mechanics of this algorithm. For the interested reader, we suggest Quenouille (1949, 1956) and Hinkley (1977) for the formulas behind bootstrapping and Breiman (1996) for the formulas behind bootstrap aggregating or bagging.


1. The training set is used as input.
2. Bootstrapping, resampling with replacement, is performed on the training set.
3. Multiple resampled datasets are the result of this bootstrapping procedure.
4. The decision tree classifier algorithm is applied to every resampled dataset. This DTC has high variance, leading to different predicted outcomes for the same feature input amongst the different trees.
5. To deal with these different predicted outcomes, plurality voting is applied in the case of a classification problem. The most occurring predicted outcome for a specific feature input is chosen as the aggregated predicted outcome. In the case of a tie in the majority voting, the estimated class with the lowest class label is chosen.
6. The result is the aggregated predicted outcome for the feature input.

Bagging is a method that receives a lot of support in the literature for multiple reasons. Bühlmann and Yu (2002) praise bagging for its computational efficiency in improving unstable predictors that suffer from high variance. Moreover, voting classification algorithms, such as bagging and adaptive boosting, are successful in improving the accuracy of classifiers for artificial and real-world datasets (Bauer & Kohavi, 1999). Furthermore, Breiman (1996) explains why bagging works for most predictors. A predictor that is correct at ordering most inputs can be aggregated into a high performing predictor. However, Breiman warns that poor predictors can be transformed into worse predictors. Thus, this reasoning suggests that the bagging procedure needs an underlying model that can perform reasonably well at classifying specific instances. Breiman states that bagging results in a higher accuracy since the perturbation of the training set is taken into account. On the other hand, building multiple trees on different bootstrapped datasets comes at the cost of losing a simple and interpretable structure. This loss of an interpretable structure can be linked to the interpretability/flexibility trade-off of James et al. (2013), which is one of the cornerstones of this dissertation. Galar et al. (2011) is another important study for this dissertation and we would like to assess their statement that tree-based models, especially bagging, perform well in combination with analyses with highly imbalanced classes (cf. infra Class imbalance). Therefore, we would like to assess the raw gain in prediction accuracy for the bagging procedure. Consequently, the hyperparameter grid is fixed for the bagging procedure, meaning the bagging models have the same hyperparameters in all analyses. This is done to make the results comparable and completely independent of the hyperparameter values. Hyperparameters should normally be tuned, but we decide not to tune them for the bagging procedure since we assess the raw performance of the models under different class imbalance treatment strategies (cf. infra Class imbalance).
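In line with footnote 23, the bagging model can be fit with caret's treebag method without a tuning grid, so its defaults are used; a minimal sketch:

    bag_fit <- train(top3 ~ ., data = train_set,
                     method    = "treebag",  # bagged CART via ipred
                     metric    = "ROC",
                     trControl = ctrl)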


Random forest

Breiman (2001) presents the Random Forest as a technique that adds another layer of randomness to the bagging procedure. In the aforementioned approaches, DTC and bagging, the nodes are split using the best among all predictors. Random forest randomly chooses a subset of predictors to be considered for the split at each node. This decorrelates the different trees, resulting in an overall variance reduction. Figure 20 shows the random forest procedure, which is similar to the bagging process with a modification to the fourth step, in which only a limited number of predictors are available as a split option.

Figure 20. Random Forest procedure

The number of features considered at every split (K) is by default set to the square root of the total number of features in the analysis (n): \( K = \sqrt{n} \). However, Bernard, Heutte and Adam (2009) state that this default K can be suboptimal in some cases and they indicate the relevance of finding the optimal setting of this K. Another hyperparameter of this modeling technique is the node size, which determines the minimum number of observations needed in a terminal node, a node at the lowest layer (Probst, Wright & Boulesteix, 2019). Tuning the minimum node size can have a positive influence on performance. Lin and Jeon (2006) show this positive performance boost by tuning the size of the terminal nodes in their random forest study. Moreover, Hastie, Tibshirani and Friedman (2009) use the minimum size of the terminal nodes to determine the depth of the tree, which is related to its complexity. Hence, the number of features considered at every split and the minimum node size will be tuned for the Random Forest model via 5-fold cross-validation. Were it not for the hyperparameter tuning, this algorithm, like bagging, offers an internal validation approach via the Out-Of-Bag estimates, which use the trees whose bootstrap samples do not include a specific observation to estimate the prediction performance (James et al., 2013).
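Since caret's built-in "rf" method tunes only the number of predictors per split, a small manual grid over both hyperparameters is one option. The sketch below uses the out-of-bag error for brevity, whereas the dissertation tunes via 5-fold cross-validation; the grid values are illustrative.

    library(randomForest)

    grid <- expand.grid(mtry = c(3, 5, 7), nodesize = c(1, 5, 10))
    oob  <- numeric(nrow(grid))

    for (i in seq_len(nrow(grid))) {
      rf <- randomForest(top3 ~ ., data = train_set,
                         mtry     = grid$mtry[i],      # predictors per split
                         nodesize = grid$nodesize[i],  # minimum terminal node size
                         ntree    = 500)
      oob[i] <- rf$err.rate[500, "OOB"]  # OOB error after 500 trees
    }

    best <- grid[which.min(oob), ]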


Boosting

Boosting finds its origin in the question posed by Kearns (1988) whether a set of weak learners can yield a strong learner. Schapire (1990) proves that a set of weak learners can compose a strong learner and, hereby, opens a new method for algorithm design in machine learning. We focus on boosting methods that use decision trees as weak learners: adaptive boosting (Freund & Schapire, 1995), gradient boosting (Friedman, 2001), and XGBoost (Chen & Guestrin, 2016). Thus, boosting is an ensemble learning method like bagging and random forest, but the trees are built in a sequential way to improve the performance of the learning algorithm. This is in contrast to the bagging and random forest procedures, in which the trees are grown in parallel. Figure 21 stresses the difference between the parallel bagging procedure and the sequential boosting procedure.

Figure 21. Parallel vs sequential ensemble methods inspired by Xia, Liu, Li and Liu (2017)

Freund and Schapire (1996) explain 2 interesting properties of the boosting algorithm. The first property is the generation of distributions on the harder parts of the sample space, challenging the learning algorithm to achieve high performance on these hard parts. The second property is closely related to the bias-variance trade-off of James et al. (2013). Freund and Schapire state that boosting takes a weighted majority over many hypotheses on different samples from the same training set, leading to a reduction in variance. Moreover, the authors state that boosting may reduce the bias of the learning algorithm, in contrast to bagging. Figure 22 demonstrates the sequential approach of the boosting algorithm²¹ as postulated by the authors.

21 Again, we use a figure to explain how the algorithm works. For the mathematics behind the algorithms, we refer to the studies for the 3 different boosting approaches used (Freund & Schapire, 1995; Friedman, 2001; Chen & Guestrin, 2016).


The focus is on the hard parts, which are the red symbols. Notice that in iteration 1 the bottom left square is classified correctly, while in iteration 2 it is classified wrongly, as shown by its red color. In this example, 3 criteria are needed to classify the symbols correctly as either a circle or a square.

Figure 22. Boosting procedure

We will focus on three boosting algorithms: Adaptive boosting, Gradient boosting, and XGBoost.

Adaptive boosting

Freund and Schapire (1995) introduce the adaptive boosting or AdaBoost algorithm, which focuses on dimensionality reduction of the features. Consequently, this algorithm only includes features that show predictive power in the model. Furthermore, the authors state that this boosting algorithm does not require prior knowledge of the weak learner. The Adaptive Boosting algorithm aims to minimize a cumulative loss function, which is done using the 'weighted majority' algorithm of Littlestone and Warmuth (1994).

Gradient boosting

Gradient boosting is a boosting method that sequentially grows trees by minimizing a cost function, which is a differentiable loss function. Concretely, the boosting approximates a specified loss function by an additive expansion of base learners that are fit to the training set in a stage-wise manner (Friedman, 2002). Thus, the gradient boosting algorithm is sequentially updated by taking the sum of previous base learners. This algorithm solves the cost function via a two-step approach (Friedman, 2002): computing the gradient of the loss function and determining the optimal scale coefficient given the optimal base learner.


The gradient boosting approach is more generic than adaptive boosting, since the gradient boosting method works with an arbitrary differentiable loss function whereas the adaptive boosting approach uses one specified loss function.

XGBoost

XGBoost, a gradient boosting algorithm introduced by Chen (2015), adds an extra layer of randomness by introducing a regularization term into the Gradient boosting approach. Concretely, the loss function, which is approximated with an additive expansion of base learners via the gradient boosting approach, is amplified with this regularization term. Chen and Guestrin (2016) state that XGBoost applies the "exact greedy algorithm for split finding" (p. 787) to find an optimal tree structure that solves the loss function. The XGBoost modeling approach is widely used in practice and is the algorithm of choice for a great number of Kaggle competitions²². An insight here is that, in some way, XGBoost is to Gradient Boosting what Random Forest is to Bagging: XGBoost adds a layer of randomness to the Gradient Boosting approach via a regularization term and Random Forest adds this extra layer to Bagging by only considering a subset of predictors. Lastly, the tuned hyperparameters for the boosting procedures are postulated based on 2 studies that devote specific attention to these hyperparameters. First, Xia et al. (2017) state that the tree complexity in a boosted decision tree approach can be controlled through the maximum depth of the tree. Therefore, we tune the maximum depth of the tree for the Adaptive and Gradient Boosting approaches. For the XGBoost model, Nguyen, Bui, Bui and Cuong (2019) decide to tune the maximum depth of the tree and nrounds since these are related to the complexity of the grown tree and the overfitting problem. In essence, this overfitting problem is the problem of having a too complex approach to modeling a specific problem, leading to capturing noise in the training set and, hence, reducing the generalizability of the fitted model to new data (James et al., 2013).

22 Kaggle is a platform offering coding competitions to grow your data science skills. Moreover, big companies like Walmart ask participants to solve a problem on this platform with rewards of up to $50,000 for the winning team.
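A minimal sketch of this tuning with the xgboost package, assuming a numeric feature matrix X and a 0/1 label vector y for the training set; the grids for max_depth and nrounds are illustrative.

    library(xgboost)

    dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)

    for (d in c(3, 6, 9)) {
      cv <- xgb.cv(params = list(objective   = "binary:logistic",
                                 eval_metric = "auc",
                                 max_depth   = d),
                   data = dtrain, nrounds = 200, nfold = 5, verbose = 0)
      # number of rounds with the highest cross-validated AUC for this depth
      best_nrounds <- which.max(cv$evaluation_log$test_auc_mean)
      cat("max_depth:", d, "- best nrounds:", best_nrounds, "\n")
    }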


Interpretability/flexibility trade-off

The decision tree is the most interpretable model since it is the single learner used in the bagging-based and boosting-based approaches. These approaches, also called ensemble learners, wrap extra levels of flexibility around it, which can grasp extra complexity in the training set when fitting the model. Bagging is more flexible because of the 'bootstrap aggregating' procedure (Breiman, 1996) that includes resampling with replacement and aggregating. Next, the random forest is even more flexible because of the extra randomness introduced by considering a subset of predictors at every split (Breiman, 2001). Boosting and bagging are positioned close to one another by James et al. (2013). However, we position boosting as more flexible than bagging and random forest. The parallel ensembles, bagging and random forest, grow the trees on the different resampled datasets independently and perform majority voting to obtain a target output. Boosting is a sequential ensemble approach that uses information from previous classifiers to eventually build a strong classifier. Boosting also uses resampling with replacement of observations but does this on weighted data, which introduces extra complexity in understanding the model. This weighted data approach helps to tackle problems more flexibly. We classify adaptive boosting as less flexible than gradient boosting for 2 reasons:

1. Gradient Boosting (Friedman, 2001) is a more generic approach than adaptive boosting for finding approximate solutions to the additive modeling problem of composing a strong learner out of a set of weak learners. Adaptive Boosting (Freund & Schapire, 1995) is a special case with a particular loss function and, hence, is less flexible than the generic approach of gradient boosting.
2. Adaptive boosting is based on high-weight data points whereas gradient boosting uses the logic of creating a strong learner by determining the steepest descent of the gradient.

eXtreme Gradient Boosting is even more flexible than gradient boosting since it adds another layer of flexibility to the gradient boosting method by wrapping a regularization term around it. Assessing the performance of tree-based models in Formula One is one of the aims of this dissertation (cf. supra Contribution to literature). Hence, considerable attention is given to tree-based models in the modeling step of the SRP-CRISP-DM methodology. As mentioned before, the trade-off between interpretability and flexibility (James et al., 2013) will be investigated using tree-based models. The used algorithms are shown in Figure 23 with their associated level of interpretability and flexibility.


Figure 23. Tree-based models and the flexibility/interpretability trade-off (James et al., 2013)

Table 10. Learners used in the analyses

Type | Model | Hyperparameter | Literature
Single learner | DTC | Complexity parameter | Wehenkel et al. (1994); Yan et al. (2016)
Ensemble learner: based on the bagging principle | BAG | / | Galar et al. (2011)²³
Ensemble learner: based on the bagging principle | RF | Number of predictors; minimum node size | Bernard, Heutte and Adam (2009); Hastie, Tibshirani and Friedman (2009); Lin and Jeon (2006)
Ensemble learner: based on the boosting principle | ADA | Maximum depth tree | Xia et al. (2017)
Ensemble learner: based on the boosting principle | GB | Maximum depth tree | Xia et al. (2017)
Ensemble learner: based on the boosting principle | XGB | Maximum depth tree; nrounds | Nguyen, Bui, Bui and Cuong (2019)

23 Galar et al. (2011) do not state that the hyperparameters of the bagging procedure should not be tuned. Our aim is to assess the impact of class imbalance treatment strategies independently of the hyperparameters. We want to study the synergy between bagging predictors and class imbalance treatment as stated by the authors. Thus, the default settings for the hyperparameters of the bagging procedure are used, meaning we will not go into detail on these values. Consequently, the tuning parameters for this model remain a black box (implemented in R via caret and treebag).


Class imbalance

The grounds for integrating class imbalance treatment in our study of tree-based models in Formula One can be found in the paper of Galar et al. (2011). The authors review the relationships between ensembles and the class imbalance problem. In their study, ensembles comprise bagging-, boosting- and hybrid-based approaches, and these show good behavior when combined with sampling strategies to treat the class imbalance. The high relevance of this study is reinforced by the fact that its focus is on two-class imbalanced datasets, which is the same situation as our 3 binary classification analyses. In a binary setting, the authors define class imbalance as one of the classes having a high occurrence and, consequently, the other class a low occurrence. Concretely, the target classes are unevenly distributed, leading to a disproportionately large share of one class that the learner can use to train the models. Hence, the learner has more observations from one class to learn from and, thus, more information to fit the model for this class. Class imbalance treatment approaches solve this issue by evening out the proportions of both classes. Moreover, the authors stress the positive synergy between the bagging procedure and these treatment strategies. Hence, we will investigate this positive synergy in Formula One (cf. supra Literature: Contribution to literature, cf. infra Discussion: Results). Looking at our analyses, we discover highly imbalanced classes, especially in the top 3 performance analysis and the qualifying ability analysis, as shown in Table 11.

Table 11. Class imbalance in training set

Analysis | Class | Occurrence in training set | Expressed in %
Top 3 performance | 0 | 13976 | 86.65%
Top 3 performance | 1 | 2154 | 13.35%
Race completion | 0 | 6804 | 42.18%
Race completion | 1 | 9325 | 57.82%
Qualifying ability | 0 | 718 | 4.26%
Qualifying ability | 1 | 16121 | 95.74%

Moreover, Chawla (2009) states that many problems in the field of machine learning are characterized by imbalanced data. Many algorithms suffer from reduced performance when the data contains imbalanced classes (Van Hulse, Khoshgoftaar & Napolitano, 2007). Batista, Prati and Monard (2004) note that the learning system might have difficulties learning the minority class in imbalanced data. Thus, highly imbalanced target values can be rather hard to predict.


Therefore, He, Bai, Garcia and Li (2008) define 5 different approaches to deal with class imbalance: (1) using sampling strategies, (2) generating synthetic data, (3) applying cost-sensitive learning, (4) utilizing active learning, and (5) integrating the imbalance in kernel-based methods. We focus on the first two techniques based on 2 contributions of the paper of Burez and Van den Poel (2009). The authors state that under-sampling the majority class, part of treatment technique (1), can lead to higher prediction accuracy, especially when the accuracy is evaluated with the AUC. Moreover, the authors suggest that an advanced technique like SMOTE, part of technique (2), might perform better than over-sampling the minority class, part of technique (1). To assess these statements in a Formula One setting, we include the use of (1) sampling techniques and (2) synthetic data to deal with the imbalance of the target classes in our analyses.

(1) The sampling strategy approach provides a way to cope with class imbalance by over-sampling²⁴ the minority class or under-sampling the majority class. The minority class is the least occurring binary outcome of the response and the majority class the most occurring one. Over-sampling applied to the training set of the top 3 performance analysis will result in both classes having 11919 observations (cf. supra Table 11) and under-sampling will result in both classes having 2142 observations. Barandela, Valdovinos, Sánchez and Ferri (2004) state that under-sampling the majority class is advised when the class imbalance is not severe; otherwise they propose over-sampling. The top 3 performance and qualifying ability analyses have a severe class imbalance whereas the race completion analysis has a moderate class imbalance. Consequently, we apply both sampling strategies to our 3 analyses and evaluate their results using the evaluation metrics (cf. Model evaluation). Another insight from Burez and Van den Poel (2009) is that there is no need to perform a sampling strategy when the training set has as many positives as negatives.

(2) Generating synthetic data can be a means to overcome the class imbalance. An example is the SMOTE algorithm, which shifts the learning bias to the minority class by generating an arbitrary number of synthetic minority examples. Drummond and Holte (2003) conclude that over-sampling is ineffective since little to no change in performance occurs. Therefore, Chawla, Bowyer, Hall & Kegelmeyer (2002) propose SMOTE (Synthetic Minority Over-sampling TEchnique) to deal with class imbalance instead of the (1) sampling strategies. The authors state that over-sampling will lead to overfitting in a decision tree since the algorithm learns more and more specific regions of the minority class.

24 Over-sampling and up-sampling as well as under-sampling and down-sampling are synonyms and will be used interchangeably.


Therefore, they create SMOTE, inspired by a technique in handwritten character recognition (Ha & Bunke, 1997), which over-samples the minority class by crafting synthetic examples instead of over-sampling with replacement. We apply ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data) (He, Bai, Garcia & Li, 2008), a technique in the SMOTE algorithm family, as a second measure to deal with class imbalance. ADASYN is an adaptive synthetic sampling approach for imbalanced learning that applies a weighted distribution based on the difficulty level in learning. The algorithm reduces the bias introduced by the class imbalance and shifts the classification decision boundary toward the more difficult examples. The authors describe their ADASYN procedure, which uses the training data set as input, in 6 steps (pp. 1323-1324):

1. Compute the ratio of minority to majority examples:
\[ d = \frac{m_s}{m_l} = \frac{\#\text{minority examples}}{\#\text{majority examples}} \]
2. Calculate the number of synthetic minority examples to generate:
\[ G = (m_l - m_s) \times \beta \]
with β equal to the desired level of class imbalance.
3. Locate the k-Nearest Neighbors of every minority example and calculate the r_i value, which refers to the dominance of the majority class in a neighborhood:
\[ r_i = \Delta_i / K \]
4. Normalize the r_i values so that they sum up to 1:
\[ \hat{r}_i = \frac{r_i}{\sum_{i=1}^{m_s} r_i} \]
5. Calculate the number of synthetic examples to generate for every neighborhood:
\[ g_i = \hat{r}_i \times G \]
6. Generate new data for each neighborhood:
\[ s_i = x_i + (x_{zi} - x_i) \times \lambda \]
with λ being a random number between 0 and 1.

Conclusively, the 4 different approaches we use for treating the class imbalance in the training set are:

1. No class imbalance treatment as the baseline approach
2. Over-sampling the minority class as a sampling strategy
3. Under-sampling the majority class as a sampling strategy
4. ADASYN, as part of the SMOTE family, which generates synthetic examples


Model Evaluation

The 'Model Evaluation' part measures the predictive performance of the tree-based methods in Formula One using 7 evaluation metrics and 3 plotting techniques. The aim of the evaluation section is threefold:

1. Applying threshold-dependent and threshold-independent metrics (accuracy, sensitivity, specificity, Area Under the Receiver Operator Curve, lift)
2. Discovering the relevant features (Variable Importance Plot)
3. Visualizing the relationship between relevant features and target on an individual level (ICE and SHAP plots)

Concretely, the seven evaluation metrics we use are the confusion matrix, accuracy, sensitivity, specificity, AUC, lift, and F1 score, and the three evaluation plots are VIPs, ICE, and SHAP plots.

Confusion matrix

The confusion matrix gives insight into the amount of correctly and wrongfully classified predictions based on the actual values. The confusion matrix is used in the study of Townsend (1971), in which the performance of a model for alphabetical recognition is assessed. Moreover, this evaluation instrument is applied in many other fields. For example, Inouye, van Dyck, Alessi, Balkin, Siegal and Horwitz (1990) apply the confusion matrix as an evaluation instrument for positive and negative prediction accuracy of delirium detection in high-risk settings. Since this technique is widely used in the literature as an evaluation metric, the confusion matrix is also used in our Formula One research as a measure of the amount of correctly predicted outcomes. The confusion matrix introduces the concepts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

TP = correctly classified positive²⁵ target
TN = correctly classified negative target
FP = predicted as positive, labeled as negative
FN = predicted as negative, labeled as positive

The positioning of these 4 metrics in the confusion matrix is shown in Table 12.

25 A positive target is a 1 and a negative target is a 0.


Table 12. Confusion matrix

Prediction \ Actual (label) | Positive (1) | Negative (0)
Positive (1) | True Positive (TP) | False Positive (FP)
Negative (0) | False Negative (FN) | True Negative (TN)

Accuracy, sensitivity and specificity

The TP, TN, FP, and FN of the confusion matrix introduce the following concepts: accuracy, sensitivity, and specificity. The accuracy (ACC) is the total proportion of correctly classified outcomes. Story and Congalton (1986) define the overall accuracy as the number of entries on the diagonal divided by the number of samples. The total number of observations is equal to the actual positives (P_o) plus the actual negatives (N_o).

\[ \text{Actual positives: } P_o = TP + FN \]
\[ \text{Actual negatives: } N_o = FP + TN \]

Thus, the accuracy is calculated as follows, based on Table 12:

\[ ACC = \frac{TP + TN}{P_o + N_o} \]

Sensitivity and specificity are 2 concepts that focus on positive and negative predicted outcomes, respectively. Parikh, Mathai, Parikh, Sekhar and Thomas (2008) discuss the essentials of sensitivity and specificity in a day-to-day clinical practice application; the formulas the authors apply are adapted here based on the confusion matrix in Table 12. Sensitivity (SEN) is defined as the proportion of positive labels correctly predicted:

\[ SEN = \frac{TP}{P_o} \]

Specificity (SPC) is defined as the proportion of negative labels correctly predicted:

\[ SPC = \frac{TN}{N_o} \]
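These three metrics follow directly from the confusion matrix, as the sketch below shows for factor vectors pred and actual with levels "0" and "1" (illustrative names).

    cm <- table(Prediction = pred, Actual = actual)
    TP <- cm["1", "1"]; TN <- cm["0", "0"]
    FP <- cm["1", "0"]; FN <- cm["0", "1"]

    ACC <- (TP + TN) / (TP + TN + FP + FN)
    SEN <- TP / (TP + FN)   # proportion of positive labels correctly predicted
    SPC <- TN / (FP + TN)   # proportion of negative labels correctly predicted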

The ACC, SEN, and SPC give a first indication of the prediction accuracy of the models. However, the accuracy of a model that always predicts the majority class is high in an imbalanced dataset. For example, suppose 85% of the target values in the test set equal 0 and 15% equal 1. A model that always predicts 0 will reach a test set accuracy of 85% based on the ACC formula. This is rather misleading since the model simply always predicts 0. SEN and SPC are better in the sense that they can indicate problems with predicting one class in a binary classification setting. Unfortunately, the SEN and SPC have to be combined to get an overall view of model performance, and this brings us back to the ACC. To get a better understanding of model performance, the AUC, a threshold-independent approach, is discussed.

AUC

The Area under the Receiver Operator Curve (AUROC, in short AUC) is a performance measure that has been used in many class-imbalance learning and cost-sensitive learning studies (Gao et al., 2013). Metz (1978) explains the basic principles of ROC analysis and points to the threshold-independence of the AUC. Thus, the AUC has a major advantage over the accuracy metric since it does not depend on the arbitrary selection of a decision threshold. In Figure 24, an example of a ROC curve is shown. The True Positive Rate (TPR) is represented by the y-axis and the False Positive Rate (FPR) by the x-axis. The TPR is equal to the sensitivity (SEN) and the FPR can be calculated as 1 – specificity (SPC).

Figure 24. ROC curve

Metz states that by varying the decision threshold a compromise between the TPR and the FPR can be made. Thus, the ROC can be interpreted as the TPR versus the FPR over all possible thresholds. The AUC is the integral under the orange ROC curve shown in Figure 24 and is calculated via the following formula:

\[ AUC = \int_0^1 \frac{TP}{P_o} \, d\!\left(\frac{FP}{N_o}\right) \]


The AUC is a value between 0.5 and 1 (Rosner et al., 2015), with 0.5 indicating no class distinction capability and 1 indicating perfect classification. An AUC of less than 0.5 means that the model is better at predicting the opposite class than the actual class, so its predictions can be flipped, yielding an AUC above 0.5. The AUC has support from literature in analyses where statistical performance is paramount (Backiel, Baesens & Claeskens, 2014). On the contrary, Lobo, Jiménez-Valverde and Real (2008) call the AUC a misleading metric for the performance of predictive models. Nevertheless, we will use the AUC as our most important metric since it will play an important part in the hyperparameter tuning. Concretely, the AUC will be the metric used to optimize the hyperparameter grid of our fitted tree-based models during the 5-fold cross-validation (cf. supra Modeling: Validation).
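
As an illustration, the AUC can be computed without choosing any threshold via the Wilcoxon/Mann-Whitney rank identity: the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal base-R sketch, assuming hypothetical 'scores' (predicted probabilities) and 'labels' (0/1 targets):

auc <- function(scores, labels) {
  r  <- rank(scores)                 # ties receive average ranks
  po <- sum(labels == 1)             # number of actual positives
  no <- sum(labels == 0)             # number of actual negatives
  (sum(r[labels == 1]) - po * (po + 1) / 2) / (po * no)
}

auc(c(0.9, 0.4, 0.7, 0.2), c(1, 0, 1, 0))   # perfect ranking, returns 1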

Lift

The lift evaluation metric describes the improvement of the model's selection compared to a random selection (Lo, 2002). We will discuss 3 different formulas related to the interpretation of lift: baseline lift, top N lift, and max lift. To compute the top N lift, we take the top N% of observations that are most likely to be 1 and, afterward, calculate the proportion of correct classifications against the actual labels. Van den Poel, De Schamphelaere and Wets (2004) introduce the lift concept in the do-it-yourself market to quantify the direct and indirect effects of retail promotions on sales and profits. The authors define lift as a measure of the strength of a complementary relationship, measured via:

"Lift = P(Y|X) / P(Y) = P(X ∧ Y) / (P(X) · P(Y))" (p. 56)

The formulas for top N lift, baseline lift, and the max lift are based on this formula.

top N lift = (top N proportion of 1) / (overall proportion of 1) = P(Y|top N) / P(Y)

The baseline lift is equal to 1 since no complementary relationship is introduced:

baseline lift = (overall proportion of 1) / (overall proportion of 1) = P(Y) / P(Y) = 1


The max lift is calculated to know how far a random selection from the dataset lies from the maximum, which is classifying all 1s correctly. Thus, the formula for max lift is as follows:

max lift = 1 / (overall proportion of 1) = 1 / P(Y)

We will focus on the top decile lift (N = 10), i.e. (top 10% proportion of 1) / (overall proportion of 1), as an evaluation metric to determine the augmented performance of the model's selection when a complementary relationship is known.
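
A minimal sketch of the top decile lift in base R, again with hypothetical 'scores' and 'labels' vectors:

top_decile_lift <- function(scores, labels, prop = 0.10) {
  n_top <- ceiling(prop * length(scores))
  top   <- order(scores, decreasing = TRUE)[1:n_top]   # observations most likely to be 1
  mean(labels[top]) / mean(labels)   # top N proportion of 1 / overall proportion of 1
}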

F1 score

The F1 score, inspired by the work of Van Rijsbergen (1979), can be calculated using the confusion matrix in Table 12. The F1 measure uses the recall, which is the sensitivity as defined above, and the precision, which is the number of true positives compared to the total number of predicted positives.

Recall = sensitivity = TP / P_o = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Powers (2011) states that the F1 metric neglects the True Negatives, making it a misleading metric, especially in cases of class imbalance. However, the F1 score has endured this criticism, as it is still used in many studies (Chicco & Jurman, 2020). Moreover, 'what would a Formula One study be without including the F1-score metric?'. Thus, we will include this metric in the results, but it will be interpreted with caution, just like the accuracy. This brings us to the behavior of the metrics in case of class imbalance.
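
Reusing the confusion-matrix counts from the earlier sketch, the F1 score follows in three lines of base R:

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)     # identical to the sensitivity
F1        <- 2 * precision * recall / (precision + recall)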


Metric behavior in imbalanced classes

Not all the metrics mentioned above have an equal impact on the interpretation of the results. Metrics like the accuracy will be mentioned, but a model that always predicts one of the classes will score well on accuracy in a highly imbalanced setting. The F1 score is also affected in case of severe class imbalance. The lift calculates the improved performance of the model if the relationship between X and Y is known. The sensitivity and specificity depict the prediction accuracy for the positive (P_o) and negative (N_o) labels, respectively. These metrics will be discussed together to get a parsimonious overview of predictive performance. However, the most important evaluation metric in our analyses is the threshold-independent AUC. In conclusion, we use 7 different metrics to measure prediction accuracy in our 3 analyses. The results in the discussion section will be discussed taking into account these 7 evaluation metrics to build a well-considered conclusion. These metrics, their relationship with class imbalance, and their references are postulated in Table 13.

Table 13. Metrics, reasoning regarding class imbalance and reference

Metric    Reasoning                                                                  Reference
ACC       A threshold-dependent metric, highly dependent on class imbalance         Story and Congalton (1986)
SEN/SPC   Used in combination to overcome the flaw of only considering one class    Parikh, Mathai, Parikh, Sekhar and Thomas (2008)
AUC       Threshold-independent metric, performs well in cases of class imbalance   Metz (1978)
Lift      Shows potential, more applicable in database marketing                    Van den Poel, De Schamphelaere and Wets (2004)
F1        Highly dependent on class imbalance, focus on positive predictions        Van Rijsbergen (1979)


Three evaluation plots support the discussion of the different models' performance in our 3 analyses: (1) the Variable Importance Plot, (2) the Individual Conditional Expectation plot, and (3) the SHAP plot.

Variable importance

A crucial aspect is finding out which variables play an important role in the predicted outcome. The feature permutation importance was initiated by Breiman (2001), who applied it to random forests. Molnar (2019) states that a feature can be classified as important if the model accuracy decreases when its values are randomly shuffled, meaning the model heavily depends on this feature to make predictions. On the contrary, the author denotes a feature as unimportant when permuting that feature does not have an impact on the model accuracy. Genuer, Poggi and Tuleau-Malot (2010) study the variable importance in variable selection using random forests. The authors investigate the sensitivity of the importance plots for random forests to altering the number of considered predictors at every split and the number of trees grown. They find that the variables are ranked the same when these hyperparameters are altered. However, increasing the number of trees reduces the confidence interval of the importance of a specific feature. Moreover, the differences between variable importances are bigger when more predictors are considered at every split.

We consider two ways of determining the feature importance: Gini importance and permutation importance. The Gini importance method looks at the mean decrease in Gini, i.e. the mean decrease in node impurity, when the variable is left out of the analysis. Thus, this measure indicates the purity the variable adds to the model. The permutation importance method stipulates the mean decrease in accuracy when the feature is permuted with random noise. Strobl et al. (2007) state that the Gini importance is not a reliable metric when features vary in their scale of measurement or their number of categories, especially in the fields of genomics and computational biology. In that study, permutation importance is advised as the metric to use when estimating feature importance. However, Strobl et al. (2008) postulate that the permutation importance overestimates the importance of correlated features. We pick the Gini importance as the metric for the variable importance plots in our analysis for 2 reasons:

1. In our study, we do not have many features with a lot of different categories, which can cause severe problems for classification studies (Micci-Barreca, 2001).
2. There might be a high correlation between some features, leading to overestimations by the permutation importance method (Strobl et al., 2008).


Concretely, a variable importance plot (VIP) displays the features on the x-axis and the mean decrease in Gini importance on the y-axis. This Gini importance measure is rescaled so that the most important feature equals 100, making it easier to compare the importance of the other features to it. These VIPs are bar plots, and the height of the bar for a specific feature represents its importance. Multiple VIPs will be used when discussing the results in Part 4 of this dissertation to discover the most important features for a specific analysis with a tree-based model using one of the 4 approaches to class imbalance.
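
The sketch below shows how such a Gini-importance VIP could be produced with the randomForest package; the 'basetable' data frame and 'top3' target are placeholders rather than the exact objects of our code base (cf. Appendix 5).

library(randomForest)

# Fit a forest on the (placeholder) basetable with a binary target
fit <- randomForest(factor(top3) ~ ., data = basetable, ntree = 500)

imp <- importance(fit, type = 2)        # type = 2: mean decrease in Gini impurity
imp <- 100 * imp / max(imp)             # rescale so the top feature equals 100
varImpPlot(fit, type = 2, n.var = 10)   # plot the 10 most important features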

Individual Conditional Expectation (ICE)

The Individual Conditional Expectation plot shows the relationship between a predictor, for example the 'homeAdvantage' feature, and the target output at the level of individual observations (Goldstein, Kapelner, Bleich & Pitkin, 2015). A related plot is the Partial Dependence Plot (PDP), which depicts the relationship between the outcome and the feature of interest in a low-dimensional way (Greenwell, 2017). The difference between ICE and PDP is that the PDP is a global approach, whereas ICE is instantiated per observation. We prefer the latter since ICE scrutinizes the relationship between the feature and the target per instance. Moreover, one could wrongly derive a non-existent relationship between feature and target from a flat line in a PDP, while a relationship could still be present at the level of specific instances (Molnar, 2019). We will investigate the ICE plots of the most important features of the considered model based on the feature importance measures of the VIP.
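
A minimal sketch of an ICE plot with the pdp package, reusing the placeholder 'fit' and 'basetable' from the previous sketch and the 'grid' feature as predictor of interest:

library(pdp)

# One curve per observation; prob = TRUE returns probabilities for classification
ice <- partial(fit, pred.var = "grid", ice = TRUE, prob = TRUE, train = basetable)
plotPartial(ice, alpha = 0.1)   # pale individual curves reveal instance-level heterogeneity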

SHapley Additive exPlanation (SHAP)

The SHAP (SHapley Additive exPlanation) values (Lundberg & Lee, 2017) find their origin in game theory (Shapley, 1953), where Shapley values were used to fairly distribute the gains and costs among actors working in a coalition. Lundberg and Lee use these SHAP values to facilitate the interpretation of prediction results for complex models. Hence, these values aid in interpreting flexible models, meaning they are closely related to the interpretability/flexibility trade-off (James et al., 2013). Moreover, these values could potentially help to mitigate this trade-off since a complex model could be made interpretable. We will test this by applying the SHAP values to the most flexible model included in this dissertation: XGBoost. Lundberg, Erion and Lee (2018) show that existing feature attribution methods are inconsistent and, hence, propose the SHAP values, which yield locally accurate attribution values.


Lundberg et al. (2019) state that the SHAP values can help to explain tree-based models by taking "black box" prediction models as input and generating "white box" local explanations. Combining these local explanations can lead to global model insights utilizing feature dependence, interaction effects, model monitoring, explanation embeddings, and model summarization.
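
For tree ensembles fitted with the xgboost package, SHAP values can be obtained directly from the predict function. A minimal sketch, assuming a numeric feature matrix 'X' and a 0/1 target 'y' (both placeholders):

library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y)
bst    <- xgb.train(params = list(objective = "binary:logistic"),
                    data = dtrain, nrounds = 100)

# One SHAP value per feature per observation; each row plus the bias column
# sums to that observation's raw (log-odds) prediction
shap <- predict(bst, dtrain, predcontrib = TRUE)
xgb.plot.shap(data = X, model = bst, top_n = 5)   # summary for the 5 top features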

Model deployment

Idea

The main idea is to establish a connection with the Ergast API to automate the SRP-CRISP-DM process based on the elaboration of Bunker and Thabtah (2019). This automation regularly updates the basetable with new data. The augmented basetable is used to adjust the training and the test set. The adjusted training set is then used to retrain the model and make predictions for new races. The prediction results are evaluated on the adjusted test set to discover the potential increase in prediction accuracy when new data is added to the basetable. An interesting study that could add complementary value here is the paper of Oza (2005). The author explains 'online learning' as an approach in which the models are not continuously retrained, because the data only arrives after some time. This is the case here, since the data becomes available only after a GP is completed and is then adapted in the Ergast API.

Implementation

This idea is implemented in RStudio by directly linking to the Ergast API's zip file, which contains the data sets used in our analyses. This file is unzipped and the data sets are loaded into the RStudio environment. Functions were constructed for the different steps of the SRP-CRISP-DM process to enable all sorts of readers to make predictions for top 3 performance, race completion, and qualifying ability. The build-up of the code can be found in Appendix 5. A side note here is that the 'rain', 'rainExp', and 'homeAdvantage' features are not included in these analyses since this information is not present in the Ergast API. However, we believe the other 34 features will give valuable insights into these 3 analyses.
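
A minimal sketch of this connection, assuming the Ergast CSV dump remains available at the URL below (the exact address may change over time):

url <- "http://ergast.com/downloads/f1db_csv.zip"   # assumed location of the CSV dump
tmp <- tempfile(fileext = ".zip")

download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "ergast_csv")                     # one CSV file per data set
results <- read.csv("ergast_csv/results.csv")        # e.g., the race results table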


Back on track

In Figure 25, a recap of the followed steps, as defined by SRP-CRISP-DM, is shown to bring the reader ‘back on track’ and ready for the discussion section.

Figure 25. SRP-CRISP-DM as fundament for our analyses


Part 4: Discussion

Results

Class imbalance

In our analysis, the imbalance in the classes of the target is treated via different methods: over-sampling the minority class, under-sampling the majority class, and ADASYN. These class imbalance treatment techniques have generated very positive outcomes for the performance of the applied models (Galar et al., 2011). However, applying these techniques to classes that are already perfectly balanced does not produce additional prediction accuracy for the model (Burez & Van den Poel, 2009). The class imbalance assessment (cf. supra Modeling: Class Imbalance) indicates some extent of imbalance in the 3 analyses, leading to an opportunity to increase the prediction accuracy by applying treatment approaches. Since the models are fit on the training set, we will only assess the class imbalance on this set. The distribution of the classes in the target outputs can be found in Tables 14, 15, and 16.

Table 14. Class imbalance in the training set of the top 3 performance analysis

Top 3   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0       13976 (86.65%)      13976 (50%)             2154 (50%)               13976 (50.42%)
1       2154 (13.35%)       13976 (50%)             2154 (50%)               13741 (49.58%)

Table 15. Class imbalance in the training set of the race completion analysis

Race completion   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0                 6804 (42.18%)       9325 (50%)              6804 (50%)               8492 (47.66%)
1                 9325 (57.82%)       9325 (50%)              6804 (50%)               9325 (52.34%)


Table 16. Class imbalance in the training set of the qualifying ability analysis

Qualifying ability   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0                    718 (4.26%)         16121 (50%)             718 (50%)                16187 (50.10%)
1                    16121 (95.74%)      16121 (50%)             718 (50%)                16121 (49.90%)

The 3 tables above show that the top 3 analysis and the qualifying ability analysis suffer from a higher class imbalance than the race completion analysis. We will assess whether these treatment techniques add more prediction accuracy to the top 3 and qualifying ability analyses compared to the race completion analysis. The ADASYN approach creates highly balanced classes by generating synthetic observations for the minority class.
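
A minimal sketch of the over- and under-sampling strategies using the caret package; 'train_x' (features) and 'train_y' (the target as a two-level factor) are placeholders:

library(caret)

over  <- upSample(x = train_x, y = train_y)     # duplicate minority cases until balanced
under <- downSample(x = train_x, y = train_y)   # drop majority cases until balanced
table(over$Class); table(under$Class)           # both are now perfectly balanced

# ADASYN instead generates synthetic minority observations; in R it is available
# in, for example, the smotefamily package (ADAS function).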

Performance of tree-based models

In this section, the predictive performance results for the tree-based models (cf. supra Part 3 Research: Modeling: Tree-based models), (1) Decision Tree Classifier, (2) Bagging, (3) Random Forest, (4) Adaptive Boosting, (5) Gradient Boosting, and (6) eXtreme Gradient Boosting, will be discussed. The evaluation metrics used to assess the prediction accuracy for our 3 analyses are (1) Confusion Matrix, (2) Accuracy, (3) Sensitivity, (4) Specificity, (5) AUC, (6) Lift, and (7) F1 score, and the 3 plots used to support the findings are (1) Variable Importance Plot, (2) Individual Conditional Expectation Plot, and (3) SHAP Plot (cf. supra Part 3: Evaluation). Specific attention will be given to three important influencing factors for our predictions:

1. The used tree-based model, its position on the interpretability/flexibility trade-off (James et al., 2013), and the impact on the prediction accuracy. We decide to focus on 6 evaluation metrics, each with a different importance (cf. supra Research: Model evaluation: Metric importance), as the ground of comparison for the different models.
2. The class imbalance treatment via a sampling strategy: (1) no sampling strategy, (2) over-sampling the minority class, (3) under-sampling the majority class, and (4) ADASYN. Furthermore, we will assess whether the bagging and boosting procedures create synergy together with these sampling strategies for the predictive performance of the tree-based models (Galar et al., 2011).


3. The relationship between the most important feature(s), as determined by the VIP, and the target output on a specific instance level is discussed via the ICE and the SHAP plots.

A remark here is that not all the VIPs, ICE plots, and SHAP plots will be included in our analyses. We use 6 different tree-based models in our 3 analyses with 4 approaches to class imbalance, leading to a total of 72 different configurations. Every single configuration is represented by a tree in Figure 26.

Figure 26. All ‘model’-‘class imbalance’ configurations of our analyses

The results of the evaluation for the top 3 performance can be found in Tables 18, 19, 20, and 21, filtered by class imbalance treatment strategy. Parallel to these results for 'top 3 performance', the results of the race completion and qualifying ability predictions can be found in Tables 22-25 and 26-29, respectively. A different approach is put forward for the evaluation plots; we will focus the evaluation plots for every analysis on one particular class imbalance treatment strategy. We will assess which class imbalance treatment strategy achieves the highest prediction accuracy for every analysis. Then, we will focus on the best performing class imbalance strategy to construct the VIPs, ICE, and SHAP plots for that analysis.


To reduce the repetitiveness of the same rationale for every model, a focus is defined for every model as shown in Table 17, with support from literature for this focus.

Table 17. Focus of every model

Decision tree: The decision trees (Quinlan, 1986), as the only single learner in our analyses, deserve considerable attention since they are at the basis of the ensemble models discussed.

Bagging: Galar et al. (2011) discuss the synergy between the bagging procedure and class imbalance treatment. Hence, we will focus on this synergy in a Formula One environment.

Random forest: Breiman (2001) introduces with the Random Forest an algorithm that can devote attention to the importance of the features.

Adaptive boosting: Chu and Zaniolo (2004) describe adaptive boosting as a fast, resource-light, and easily adaptable modeling approach. We will assess whether this model uses a limited number of features to make predictions.

Gradient boosting: The greedy approach of the Gradient Boosting (Friedman, 2002) model will be compared to the extra greedy approach of the XGBoost model (Chen, 2015).

XGBoost: Differently put, we will study whether the regularization term (Chen & Guestrin, 2016), which is the extra flexibility of the model, yields better or worse results in a Formula One setting. Better results can be obtained if the model captures extra complexity, or worse results if the model overfits the training set, resulting in extra noise in the model.


Top 3 finish analysis

Evaluation metrics

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the tree-based models for top 3 performance will be discussed. The results of the best performing model for every class imbalance approach are marked in blue in the tables below. Moreover, the results of the best model for every metric are put in bold.

Table 18. No class imbalance treatment 'top 3 analysis'

DTC BAG RF ADA GBM XGB

ACC 0.864 0.873 0.882 0.884 0.883 0.885

SEN 0.403 0.369 0.335 0.424 0.412 0.431

SPC 0.934 0.951 0.966 0.955 0.956 0.955

AUC 0.669 0.660 0.650 0.690 0.684 0.693

Lift 3.688 3.753 3.515 4.263 4.165 4.306

F1 0.441 0.437 0.430 0.494 0.485 0.501

Table 19. Over-sampling 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.825 0.865 0.858 0.810 0.810 0.843

SEN 0.581 0.470 0.664 0.834 0.856 0.720

SPC 0.863 0.926 0.888 0.807 0.803 0.861

AUC 0.722 0.698 0.776 0.821 0.830 0.791

Lift 2.81 3.645 3.460 3.026 3.124 3.243

F1 0.470 0.481 0.555 0.540 0.546 0.550


Table 20. Under-sampling 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.759 0.791 0.799 0.801 0.804 0.803

SEN 0.777 0.846 0.863 0.846 0.857 0.868

SPC 0.758 0.783 0.789 0.795 0.797 0.793

AUC 0.767 0.814 0.826 0.820 0.827 0.830

Lift 2.538 2.896 2.961 2.951 3.005 2.983

F1 0.463 0.519 0.534 0.532 0.539 0.540

Table 21. ADASYN 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.856 0.870 0.874 0.875 0.850 0.884

SEN 0.458 0.466 0.505 0.510 0.689 0.459

SPC 0.917 0.932 0.930 0.931 0.874 0.950

AUC 0.687 0.699 0.718 0.721 0.782 0.704

Lift 3.374 3.862 3.905 3.949 3.352 4.437

F1 0.459 0.489 0.516 0.523 0.550 0.514

The prediction accuracy of the baseline model for the 'top 3 performance' increases noticeably when class imbalance treatment strategies are applied. For example, the Random Forest model yields an AUC of 0.650 with no class imbalance treatment, whereas it reaches an AUC of at least 0.718 (ADASYN) and at most 0.826 (under-sampling). In general, the highest evaluation scores are obtained for the under-sampling approach, which achieves the highest sensitivity of the 4 approaches overall as well as the highest AUC (0.830, for Gradient Boosting with over-sampling, tied with XGBoost combined with under-sampling), both important metrics. This supports the statement that under-sampling techniques can lead to higher prediction accuracy when evaluated with the AUC (Burez & Van den Poel, 2009). The over-sampling approach combined with the boosting models yields similar prediction accuracy to under-sampling, but the under-sampling method achieves better results for the DTC, Bagging, and Random Forest than their over-sampling counterparts. Consequently, we will focus on the evaluation plots of the under-sampling strategy in the next section. In Figure 27, the ranking averaged over the 4 class imbalance treatment approaches for every model per performance metric is shown. This figure indicates the better performance of ensemble learners compared to the single learner. Moreover, bagging-based procedures perform worse than boosting-based procedures. Additionally, we observe that the XGBoost model obtains the highest average rank when eyeballing this plot. Hence, we will devote some extra attention to the XGBoost model in combination with the under-sampling approach.

Figure 27. Average rank per performance metric for top 3 analysis

Decision tree

In Figure 28, a decision tree can be found for the top 3 analysis, with the 'grid' feature determining the binary split. Thus, this decision tree suggests that one feature is enough to classify a driver as either in the top 3 or not. However, a single decision tree suffers from high variance (James et al., 2013) and, thus, we will look at either the cross-validated results for the DTC with tuned hyperparameters or the other models included in this dissertation.


Figure 28. Decision tree top 3 under-sampled

The Variable Importance Plot (VIP) in Figure 29 shows the importance of the 'grid' and 'pointsSeason' variables in our top 3 finish analysis. The top 10 most important features of the top 3 analysis with under-sampling the majority class are shown on the y-axis. The x-axis contains the importance value, which is the mean decrease in Gini impurity (cf. supra Research: Modeling: DTC).

Figure 29. VIP DTC top 3 analysis under-sampled


The Individual Conditional Expectation (ICE) plot displays the relationship between a predictor and the target output for specific instances. In Figure 30, the relationship between the grid, the most important feature for the DTC model, and the target 'top 3' is displayed. The yellow line on the plot is the aggregated result of the specific instances. We see that a higher grid position (starting further back) results in a lower probability of finishing in the top 3.

Figure 30. ICE DTC top 3 analysis

Bagging

We assess the synergy of the bagging procedure and class imbalance treatment approaches (Galar et al., 2011). The biggest improvement in prediction accuracy is noticed for the under-sampling approach based on AUC and SEN. The over-sampling method yields an extremely high SPC in combination with Bagging. In Figure 31, the VIP of Bagging in the case of under-sampling shows the importance of the 'grid' and 'pointsSeason' features to predict top 3 performance.

Figure 31. VIP bagging top 3 under-sampled


Random forest

Similar to Bagging, Random Forest yields the best prediction accuracy for top 3 performance in combination with under-sampling the majority class. Again, the 'grid' and 'pointsSeason' are the most important features in the Random Forest approach.

Figure 32. VIP random forest top 3 under-sampled

Adaptive boosting

This approach is known for using few resources, meaning the algorithm tries to use as few features as possible to make predictions, resulting in an adaptive and lightweight algorithm (Chu and Zaniolo, 2004). This can be seen in Figure 33, where the predictions are almost solely based on the 'grid' and the other features have little to no importance. Minor importance is found by this algorithm in 'pointsSeason', 'pastTop3', and 'teamMatePointsSeason'.

Figure 33. VIP adaptive boosting top 3 under-sampled


Gradient boosting

Similar to Adaptive boosting, we notice in Figure 34 the high importance of the 'grid' to predict top 3 performance using Gradient boosting compared to the other features. Moreover, the pointsSeason feature is twice as important as the remaining features.

Figure 34. VIP gradient boosting top 3 under-sampled

XGBoost

As for the two other boosting approaches, the high importance of the ‘grid’ catches the eye using the XGBoost model. Modest importance can be found in the ‘pointsSeason’ and ‘pastTop3’ features.

Figure 35. VIP XGBoost top 3 under-sampled


Since XGBoost is the best performing model in combination with under-sampling in the top 3 performance analysis, we will focus on the ICE plots of the 3 most important features: ‘grid’, ‘pointsSeason’, and ‘pastTop3’. In Figure 36, the relationship between the grid and ‘top 3 performance’ is shown. As expected, we notice that a better grid will result in a higher probability for an observation to be predicted as 1.

Figure 36. ICE XGBoost top 3 under-sampled with grid

In Figure 37, the relation between the normalized 'pointsSeason' and 'top 3 performance' is displayed. This figure exhibits a positive relationship between the normalized points obtained in the season and the predicted probability. However, for normalized points higher than 17, the probability of predicting an observation as finishing in the top 3 decreases. Obtaining a normalized pointsSeason of 17 or more is extremely hard, especially as the season continues to the end: it requires a high-performing driver with considerable consistency throughout the season. One explanation for the decrease in this ICE plot is that the number of drivers with more than 17 normalized pointsSeason is limited. Moreover, a driver with 17 or more normalized points per season who finishes fourth will get 12 points, lowering the normalized pointsSeason for the driver's next participation. Thus, it does make some sense in the way that a driver cannot easily correct a single miss of the top 3, since the driver carries a lower normalized pointsSeason into the next race.


Figure 37. ICE XGBoost top 3 under-sampled with pointsSeason

In Figure 38, the link between the past number of top 3s, which is a feature based on the dependent variable, and the top 3 performance is shown. We observe a higher probability for ‘top 3 performance’ when the past number of top 3s increases.

Figure 38. ICE XGBoost top 3 under-sampled with pastTop3

We will provide two extra plots related to the XGBoost model: a SHAP plot to interpret the results of this highly flexible model and a VIP for the XGBoost model without considering the 'grid'. In Figure 39, the SHAP plot is shown, which contains the 5 most important features in this analysis.


The SHAP plot demonstrates with color codes whether these features are important for predicting an outcome in a specific direction. One should focus on the purple color and its position relative to the '0.0' line: purple to the left of this line indicates importance for predicting zeros, and purple to the right indicates importance for predicting ones. Thus, we can see that the grid, legacyConstructor, pointsSeason, and nrWinsConstructor perform well at predicting negative outcomes, whereas pointsSeason, nrWinsConstructor, and pastTop3 are used by the XGBoost model to predict positive outcomes. Moreover, the high feature importance of the 'grid' and the feature's higher ability to predict zeros might be an explanation for the high specificity and lower sensitivity for the top 3 performance. For predicting a top 3, we should preferably look at pointsSeason, pastTop3, and nrWinsConstructor, which is a valuable insight from this SHAP plot26. Concretely, a bad grid makes it impossible to achieve a top 3 result, and the variables related to past performance differentiate the toppers from the subtoppers. Furthermore, we notice the importance of the constructor via 'legacyConstructor' and 'nrWinsConstructor'.

Figure 39. SHAP XGBoost top 3 under-sampled

26 The SHAP plots are constructed on a non-cross-validated XGBoost model. However, we used the tuned hyperparameters to construct these plots. This can lead to features having different importance scores when, in this case, Figure 35 and Figure 39 are compared.


Moreover, we would like to assess the robustness of the XGBoost model when the most important feature ‘grid’ is left out of the analysis. In Figure 40, the VIP of the XGBoost model with the grid excluded shows that the pointsSeason takes on the role of the most important feature.

Figure 40. VIP XGBoost top 3 under-sampled without grid

The resulting AUC is 0.800 compared to 0.830 with the grid included, meaning the other features, especially pointsSeason, are doing a great job of contributing to the prediction in the absence of the most important predictor. This is an indication of the robustness of the XGBoost model for top 3 performance predictions. Note that the grid, as shown in the SHAP plot in Figure 39, is a variable used for making predictions towards negative targets; excluding it from the model therefore results in a loss of information about drivers that will not reach the top 3. In general, under-sampling the majority class yields the highest prediction accuracy of the considered class imbalance treatments. The best performing model for 'top 3 performance', based on the average ranking over the 4 approaches, is the XGBoost model. All the models show that the grid is the most important feature to predict 'top 3 performance'. However, pointsSeason deserves an honorable mention since it is the second most important feature and it carries the model when the grid is removed. As mentioned before, the grid is a necessary but not sufficient condition for top 3 performance. To discover true 'top 3 performance', we should look into past performance (pastTop3, pointsSeason) and the prestige of the constructor (legacyConstructor, nrWinsConstructor).


Race completion analysis

Evaluation metrics

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the tree-based models for race completion will be discussed considering the 4 class imbalance approaches. The results of the best performing model based on the different metrics are shown in blue. As in the top 3 performance analysis, the results of the best model for every metric are put in bold.

Table 22. No class imbalance treatment 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.611 0.593 0.620 0.622 0.618 0.617

SEN 0.595 0.674 0.703 0.701 0.673 0.675

SPC 0.632 0.483 0.507 0.512 0.544 0.537

AUC 0.613 0.578 0.605 0.607 0.608 0.606

Lift 1.207 1.121 1.149 1.136 1.149 1.151

F1 0.639 0.657 0.682 0.682 0.671 0.671

Table 23. Over-sampling 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.561 0.596 0.610 0.609 0.605 0.601

SEN 0.595 0.641 0.583 0.526 0.511 0.586

SPC 0.515 0.534 0.647 0.723 0.734 0.621

AUC 0.555 0.587 0.615 0.624 0.623 0.603

Lift 1.101 1.114 1.189 1.239 1.234 1.184

F1 0.611 0.647 0.634 0.608 0.600 0.629


Table 24. Under-sampling 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.597 0.592 0.605 0.606 0.605 0.606

SEN 0.478 0.569 0.511 0.516 0.511 0.520

SPC 0.761 0.624 0.734 0.730 0.734 0.724

AUC 0.619 0.597 0.622 0.623 0.623 0.622

Lift 1.249 1.161 1.239 1.217 1.234 1.222

F1 0.578 0.618 0.599 0.602 0.600 0.604

Table 25. ADASYN 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.563 0.592 0.610 0.606 0.617 0.612

SEN 0.621 0.642 0.641 0.622 0.606 0.665

SPC 0.484 0.523 0.567 0.584 0.633 0.540

AUC 0.552 0.583 0.604 0.603 0.619 0.603

Lift 1.071 1.131 1.172 1.187 1.217 1.126

F1 0.622 0.645 0.655 0.646 0.647 0.665

A first assessment of the evaluation metrics of the race completion analysis leads to the insight that the prediction accuracy for race completion is not high. This could be because more or less27 the same features are used for all the analyses, thus potentially disregarding additional features that could grasp more prediction accuracy for race completion. Hence, the results for race completion could serve as a system that indicates early warnings of non-completion. Moreover, the AUC increases by only 0.011 when comparing the AUC of the baseline model (0.613), the DTC, with the best model under the class imbalance treatment strategies (0.624), Adaptive Boosting with over-sampling.

27 The lagged dependent variables, like pastNoComp and everNoComp, were only included in the related analyses, which is the race completion analysis.


Thus, the focus of the plots in the race completion analysis will be on the over-sampling approach. Before we consider these plots, the average ranking plot, shown in Figure 41, is discussed. In general, the Random Forest and the Adaptive Boosting models yield the highest performance for the race completion analysis. However, the Gradient Boosting model also performs well, based on the average ranking, on the AUC and specificity.

Figure 41. Average ranking models race completion analysis

Decision tree

In Figure 42, a decision tree for the race completion analysis with the over-sampling approach is shown. The 3 features that determine the binary splits of the algorithm are 'pastHasQ2Prop', 'pointsSeason', and 'pastHasQ3Prop'.

Figure 42. Decision tree race completion over-sampled


In Figure 43, the most important features of the DTC model are shown for the race completion analysis. These features are 'legacyConstructor', 'pointsSeason', and 'teamMateDiffPointsSeason'.

Figure 43. VIP DTC race completion over-sampled

The most important feature in Figure 43 is the legacyConstructor variable and, hence, this variable is shown on the ICE plot in Figure 44. This plot exhibits the relationship between this feature and the target output, showing an increase in race completion probability as the constructor participates in more and more races. These participations allow the constructor to accumulate experience, which Bell et al. (2016) refer to as the legacy effect.

Figure 44. ICE DTC race completion over-sampled


Bagging

For the Bagging procedure, there is some synergy between the model and the class imbalance approaches in the race completion analysis. The Bagging model performs worse than the decision tree model, which relates to the statement of Breiman (1996), who warns 'that bagging poor predictors can result in even worse ones in a classification setting' (p. 11). This can be perceived in the lower AUC of the baseline Bagging model (0.578) compared to the DTC (0.613). The most important feature of the bagging model in this analysis is the 'legacyConstructor', as shown in Figure 45.

Figure 45. VIP bagging race completion

Random forest

The Random Forest model yields better results than the DTC and the bagging procedure. Yet, these results are mediocre, with an AUC of 0.605 for the baseline approach. On the contrary, a sensitivity of 0.703 for the baseline approach is a bright spot. The Random Forest model combined with over-sampling reaches an AUC of 0.615, which is a minor improvement. In Figure 46, the VIP for the Random Forest model is shown with the same important features indicated by the DTC and Bagging models. We will focus on the 3 most important ones by constructing ICE plots that scrutinize their relationship with race completion.


Figure 46. VIP random forest race completion over-sampled

In Figure 47, the relation between legacyConstructor and race completion in the case of over-sampling is displayed. Based on this plot, the higher the experience of the constructor, the higher the probability of race completion. This relationship flattens out after about 600 participations, suggesting that it takes a long time for a constructor to reach an optimal configuration and learn from previous errors.

Figure 47. ICE plot random forest over-sampled with legacyConstructor


The link between the normalized points in the season and the race completion can be found in Figure 48. We see a relationship between the race completion and the normalized points in the season when the latter are low, between 0 and 3. For higher normalized points in the season, the link with the race completion is limited.

Figure 48. ICE plot random forest over-sampled with pointsSeason

In Figure 49, the effect of different values of the past ratio of 2nd-qualifier finishes to participations (the description of the 'pastHasQ2Prop' feature) on the probability of completing the race is shown.

Figure 49. ICE random forest over-sampled with pastHasQ2Prop


Adaptive Boosting

This model assigns some importance to 5 features to predict race completion, as shown in Figure 50. The most important features for this model are 'pastHasQ2Prop', 'pointsSeason', 'legacyConstructor', 'pastHasQ3Prop', and 'nrWinsConstructor'.

Figure 50. VIP adaptive boosting race completion

The most important feature, 'pastHasQ2Prop', is imputed via the MICE method and, thus, partly consists of synthetically imputed values. Consequently, we also consider only the complete data, the period 2005-2019, for this feature and the effect of this restriction on the VIP of the Adaptive Boosting model. The effect of using this limited period on the feature importance can be found in Figure 51. Herein, the importance of pastHasQ2Prop diminishes, but it retains moderate importance. We observe pointsSeason, teamMateDiffPointsSeason, and legacyConstructor as the most important features. These features also had high importance when the entire period, 1950-2019, was considered with the imputed pastHasQ2Prop feature.


Figure 51. VIP adaptive boosting race completion for 2005-2019

Gradient boosting

The Gradient Boosting model applied to the baseline class imbalance approach assigns a high amount of importance to the 'pastHasQ2Prop' and 'pointsSeason' features, as displayed in Figure 52. The difference with the Adaptive Boosting approach is that Gradient Boosting also gives some importance to the less important features.

Figure 52. VIP gradient boosting race completion over-sampled


XGBoost

In Figure 53, the importance of 'pastHasQ2Prop', 'legacyConstructor', and 'pointsSeason' catches the eye, similar to the other modeling approaches.

Figure 53. VIP XGBoost race completion

Comparing the results of the Gradient Boosting and XGBoost models leads to the conclusion that the difference between the prediction accuracies of both models for the baseline class imbalance treatment strategy is negligible. In general, the predictive performance of both models is mediocre at best, just as the accuracy of the other models. In Figure 54, the XGBoost model is interpreted at the level of specific instances for the most important features by means of the SHAP plot. This SHAP plot is constructed on an XGBoost model with the tuned hyperparameters (cf. infra Tuned hyperparameter values). This plot depicts teamMateDiffPointsSeason as the most important feature in the analysis. Furthermore, the high feature value of pastHasQ1Prop catches the eye. We will be cautious in making statements about the direction of feature importance. Based on this plot, the normalized points in the season aid at predicting zeros and the past number of non-completions contributes to predicting ones.


Figure 54. SHAP XGBoost race completion

We conclude this elaboration regarding the race completion analysis with two possible explanations for the low predictive performance of the tree-based models in this analysis:

1. We did not include enough relevant features and, hence, we were not able to grasp the relevant drivers of race completion.
2. A human error or a technical difficulty can hardly be predicted, resulting in poor performance of the tree-based models. However, the prediction accuracy of the models is higher than random (AUC of 0.5), meaning these models could still be used as a system for early warnings.


Qualifying ability analysis

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the different tree-based models for qualifying ability will be discussed considering the 4 class imbalance approaches. The results of the best performing model based on the different metrics are shown in blue. Parallel to the 2 previous analyses, the results of the best model for every metric are put in bold.

Table 26. No class imbalance treatment 'qualifying ability'

DTC BAG RF ADA GBM XGB

ACC 0.961 0.966 0.966 0.966 0.967 0.966

SEN 0.986 0.989 0.996 0.993 0.995 0.992

SPC 0.397 0.436 0.287 0.358 0.355 0.365

AUC 0.692 0.713 0.642 0.676 0.675 0.679

Lift 1.010 1.013 1.004 1.008 1.008 1.007

F1 0.980 0.982 0.983 0.982 0.983 0.982

Table 27. Over-sampling ‘qualifying ability’

DTC BAG RF ADA GBM XGB

ACC 0.939 0.958 0.960 0.878 0.865 0.944

SEN 0.955 0.979 0.976 0.876 0.862 0.955

SPC 0.596 0.485 0.583 0.912 0.932 0.694

AUC 0.775 0.732 0.780 0.894 0.897 0.824

Lift 1.021 1.013 1.018 1.043 1.040 1.027

F1 0.968 0.978 0.979 0.932 0.924 0.970


Table 28. Under-sampling ‘qualifying ability’

DTC BAG RF ADA GBM XGB

ACC 0.844 0.854 0.852 0.840 0.853 0.848

SEN 0.843 0.851 0.849 0.837 0.850 0.845

SPC 0.866 0.912 0.915 0.909 0.922 0.912

AUC 0.855 0.882 0.882 0.873 0.886 0.879

Lift 1.040 1.039 1.042 1.042 1.040 1.037

F1 0.912 0.918 0.917 0.910 0.917 0.914

Table 29. ADASYN 'qualifying ability'

DTC BAG RF ADA GBM XGB

ACC 0.933 0.948 0.947 0.948 0.900 0.958

SEN 0.958 0.975 0.971 0.970 0.910 0.986

SPC 0.362 0.332 0.404 0.463 0.678 0.326

AUC 0.660 0.654 0.688 0.716 0.794 0.656

Lift 1.014 1.016 1.017 1.017 1.029 1.007

F1 0.965 0.973 0.972 0.973 0.946 0.978

The highest AUC (0.897) amongst all configurations (cf. supra Figure 26) is obtained with the Gradient Boosting model using over-sampling of the minority class as a class imbalance treatment strategy. Moreover, this configuration has high scores on sensitivity (0.862) and, especially, specificity (0.932). The fact that the over-sampling approach is used in the best performing configuration follows the study of Barandela, Valdovinos, Sánchez and Ferri (2004) (cf. supra Class imbalance), in which the authors postulate that for highly imbalanced classes over-sampling the minority class is advised. Since we have highly imbalanced classes in the training set, 4.26% 0s and 95.74% 1s, over-sampling the minority class is needed to obtain enough 0s for the model to learn from. Consequently, we will specifically focus on the VIPs, ICE plots, and SHAP plots of the over-sampling scenario. However, it should be mentioned that the under-sampling approach


generates similar yet slightly worse prediction accuracy for the boosting approaches and better accuracy for the Bagging and Random Forest models. In Figure 55, the average ranking over the class imbalance approaches of the models for the qualifying ability analysis demonstrates a distinction between the single learner, the decision tree model, and the ensemble learners, bagging and boosting. Within the ensemble learners, the Gradient Boosting model seems to be ranked the highest on average, closely followed by the Random Forest model. Additionally, the lower average ranking of the scalable XGBoost model stands out.

Figure 55. Average ranking models for qualifying ability analysis

Decision tree

In Figure 56, a decision tree is portrayed for the qualifying ability analysis. For this particular decision tree in the case of over-sampling, the 'pointsSeason' variable determines the first split. The other features included in this decision tree are 'everStart3' and 'legacyConstructor'. As mentioned before, a single decision tree suffers from high variance and, hence, we should look at other models that diminish this variance (James et al., 2013). On the other hand, a decision tree is an interpretable instrument that breaks down a complex problem, like predicting qualifying ability, into simpler decisions based on binary splits.


Figure 56. DTC qualifying ability over-sampled

For the cross-validated Decision Tree Classifier, we observe an extremely high sensitivity, which could be related to the high class imbalance in the target output of the baseline configuration. The baseline DTC model achieves an AUC of 0.692 and makes better predictions than the ADASYN DTC model (AUC of 0.660). For this ADASYN model, the specificity, the prediction accuracy towards negatives, is worse than for the baseline model. In Figure 57, the most important features for the DTC model are shown. The number of constructor participations is the most important feature for the DTC model, followed by 'pastHasQ1Prop', which grasps the ratio of finished first qualifiers to the number of participations. We should be careful, since this feature was imputed using MICE for the period 1950-1994, and, thus, we also mention 'pastNoQual' and 'pointsSeason' as important features.

Figure 57. VIP DTC qualifying ability over-sampled


In Figure 58, the relationship between 'legacyConstructor', the most important feature for the DTC, and the qualifying ability is displayed. In this ICE plot, we notice a strong increase in the probability of qualifying, meaning the observation is more likely to be predicted as 1, when the constructor has participated in over 350 races. Based on this ICE plot, we can state that there could be 'a legacy effect', inspired by Bell et al. (2016), present for the qualifying ability: over its first 350 races, the constructor learns how to optimize the settings for the qualifying ability of its drivers.

Figure 58. ICE DTC qualifying ability over-sampled

Bagging

The focus for the bagging procedure is on the synergy between the model and the class imbalance treatment strategies (Galar et al., 2011). We find a great increase in specificity when the under-sampling technique is used and a decrease in the model's predictive performance when the ADASYN technique is applied. In Figure 59, the VIP indicates 'legacyConstructor' as the most important feature, parallel to the DTC, followed by the 'past performance'-related features 'everStart3', 'pointsSeason', and 'pastNoQual', and another constructor-related feature, 'nrWinsConstructor'.

Figure 59. VIP bagging qualifying ability over-sampled


Random forest

The Random Forest model has a lower baseline AUC than the DTC and Bagging. On the other hand, the prediction accuracy of the Random Forest is better when accompanied by the class imbalance treatment approaches. In Figure 60, the 'everStart3' and 'pointsSeason' features are the most important ones for the Random Forest model. Again, the features 'legacyConstructor' and 'pastHasQ1Prop' play an important role in the qualifying ability.

Figure 60. VIP random forest qualifying ability over-sampled

Adaptive boosting As demonstrated in Figure 61, the Adaptive boosting model identifies, just as the Random Forest model, ‘everStart3’ and ‘pointsSeason’ as the 2 most important features for the qualifying ability analysis. Furthermore, we notice the lesser importance of ‘pastHasQ1Prop’ for this model.

Figure 61. VIP adaptive boosting qualifying ability over-sampled


Gradient boosting

We will scrutinize the Gradient Boosting model since it yields the best results for predicting qualifying ability when combined with over-sampling, having the highest AUC and specificity in this case. In Figure 62, the VIP for this configuration is shown with the 2 most important features: 'pointsSeason' and 'everStart3'. The 'rainExp' feature will be included as a wildcard based on its moderate importance and our curiosity about the relationship between the driver's rain experience and its impact on his qualifying ability. Hence, ICE plots for these features will be included to understand their relationship with the qualifying ability.

Figure 62. VIP gradient boosting qualifying ability over-sampled

In Figures 63, 64, and 65, the relations between the qualifying ability and 'pointsSeason', 'everStart3', and 'rainExp' are shown. In Figure 63, we notice an increase in the qualifying probability for normalized points in the season between 0 and 3. The qualifying probability increases when a driver has ever started a race in the top 3, as can be found in Figure 64. This is a binary feature, which explains the fact that all observations, the dark grey dots, lie on 2 vertical lines. In Figure 65, a link between the qualifying probability and the rain experience is noticed when this experience is below 20 races. Since it rains, on average, in 15% of the races, it takes 133 races to gain this experience, which is around 6 seasons based on the current situation28 in Formula One.

28 Disregarding COVID-19, the 2020 season was planned to contain 22 races, and, consequently, we use this as a benchmark to calculate the number of seasons needed to gain this experience in the rain.


Figure 63. ICE plot random forest qualifying ability with pointsSeason


Figure 64. ICE plot random forest qualifying ability with everStart3

Figure 65. ICE plot random forest qualifying ability with rainExp


XGBoost

The 2 most important features for the qualifying ability analysis using the XGBoost model, 'pointsSeason' and 'everStart3', are similar to most of the other models. Moreover, the Random Forest, Adaptive Boosting, Gradient Boosting, and XGBoost models identify the same 2 features as the most important ones in the qualifying ability analysis.

Figure 66. VIP XGBoost qualifying ability over-sampled

Next, the performance of Gradient Boosting versus XGBoost is scrutinized (cf. supra Table 17). XGBoost performs worse than Gradient Boosting at predicting whether a driver qualifies for the race, meaning that XGBoost might be overfitting the analysis, leading to lower generalizability to new data. However, another influencing factor could be the impact of the hyperparameter tuning on the prediction accuracy. In general, we notice a high prediction accuracy for the qualifying ability analysis. Reflecting critically on this high accuracy might raise some questions: not qualifying for a race is an event that seldom happens and, hence, it might be better approached from another angle, like anomaly detection. To end the discussion of the results, we display one more SHAP plot in Figure 67 that demonstrates the most important features and their contribution to predicting positives and negatives. The 'everStart3' feature, which grasps whether a driver has ever had the chance to start a race in the top 3, is the most important feature. This feature has high value towards predicting negative outcomes, meaning a driver that has never started a previous race in the top 3 is much more likely to be predicted as a negative. The pointsSeason and pastStart3 also seem important


features for the qualifying ability analysis. However, these features have low feature value towards predicting zeros and mediocre value towards predicting ones. The past number of non-qualifications and the number of all-time participations of a specific constructor are used for predicting positive target values. Thus, a driver should have an experienced constructor and few to no previous non-qualifications in order to qualify for the race. Non-qualification risk can be spotted by looking at whether a driver has never started a race in the top 3.

Figure 67. SHAP XGBoost qualifying ability over-sampled

Lastly, the average impact of the class imbalance treatment approaches on the AUC for the three analyses can be found in Appendix 6. This closes the results section and opens the last parts of the 'Discussion', which are the conclusion, 'Limitations', and 'Further research'. The conclusion will focus on the research questions (cf. supra Research questions), the limitations will be considered from both a technical and a sports perspective, and the further research section will offer both potential routes to augment this study and ideas for other studies in the field of Formula One.


Tuned hyperparameter values

Decision tree

By default, the complexity parameter (cp) is set equal to 0.01, but we allow the model to choose between the values of a predefined grid {0; 0.005; 0.01; 0.02; 0.05}. The basic rule is that a smaller value of cp gives the model more freedom towards complexity (Yan et al., 2016). On the contrary, this extra freedom for the decision trees could result in overfitting the analysis, leading to lower generalizability (James et al., 2013). In Table 30, we notice low values for this complexity parameter, meaning the decision tree models opt for more freedom to grasp higher complexity in the analysis rather than guarding against overfitting, which would require a higher cp value.

Table 30. Complexity parameter Decision Tree Classifier

Complexity parameter   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                  0              0               0                0
Race completion        0.005          0               0.005            0
Qualifying ability     0              0               0                0
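
A minimal sketch of this tuning step with the caret package, optimizing the AUC in a 5-fold cross-validation; 'basetable' and the 'top3' factor target (with valid level names such as 'no'/'yes') are placeholders:

library(caret)

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)   # reports ROC (the AUC)
grid <- expand.grid(cp = c(0, 0.005, 0.01, 0.02, 0.05))

fit <- train(top3 ~ ., data = basetable, method = "rpart",
             metric = "ROC", trControl = ctrl, tuneGrid = grid)
fit$bestTune   # the cp value retained for the final model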

Random forest

For the Random Forest model, we consider the number of predictors used at every split and the minimum node size. In Table 31, the tuned number of predictors for every configuration is shown. The Random Forest model uses by default K = √n, with n being the total number of predictors (here n = 33). Thus, 6 predictors would have been used by default in this case (√33 = 5.745 ≈ 6). We define a grid for this hyperparameter as {2:10}. The tuned number of predictors is in most cases 8, 9, or 10, suggesting that the default K might not be ideal in our analysis.

Table 31. Number of predictors Random Forest

Number of predictors   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                  9              10              9                10
Race completion        3              8               3                10
Qualifying ability     8              8               10               6


The grid for the minimum number of observations in a leaf node is defined as {50; 100; 200}. This hyperparameter is implicitly linked to the depth of the tree (Hastie, Tibshirani & Friedman, 2009), and our tuned hyperparameters, as shown in Table 32, suggest that the trees of the Random Forest model embrace the offered complexity. This statement is based on the dominance of 50 as the most chosen option for this hyperparameter.

Table 32. Minimum node size Random Forest

Minimum node size    No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                200            50              100              50
Race completion      100            50              100              50
Qualifying ability   50             50              50               50

Adaptive boosting

The maximum tree depth hyperparameter for the Adaptive Boosting model was tuned within the range {1:4}. Table 33 demonstrates that all configurations bar one adopt the maximum flexibility level offered, which is a tree depth of 4.

Table 33. Maximum depth tree Adaptive Boosting

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                4              4               4                4
Race completion      4              4               4                4
Qualifying ability   4              4               3                4

Gradient boosting

Gradient Boosting offers the same hyperparameter within the same range {1:4}. Parallel to the tree depth of the Adaptive Boosting approach, we notice the embracement of high flexibility, since the tuned hyperparameter equals 4 in almost all cases, as can be found in Table 34.


Table 34. Maximum depth tree Gradient Boosting

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                4              4               3                4
Race completion      4              4               4                4
Qualifying ability   4              4               4                4
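Analogously, a candidate grid for Gradient Boosting via caret's 'gbm' method, where the interaction depth plays the role of the maximum tree depth in Table 34 (the remaining values are illustrative fixed settings, not necessarily those used in this study):

gbm_grid <- expand.grid(interaction.depth = 1:4, n.trees = 100,
                        shrinkage = 0.1, n.minobsinnode = 10)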

XGBoost

The options for the maximum depth hyperparameter are, similarly to the two other boosting approaches, between 1 and 4. However, the results differ in the case of XGBoost, with over-sampling and ADASYN using the offered complexity. The baseline and the under-sampling approach take a more careful stance with lower levels of maximum tree depth.

Table 35. Maximum depth tree XGBoost

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                1              4               2                4
Race completion      1              4               2                3
Qualifying ability   4              4               2                4

Another tuned hyperparameter influencing the model’s complexity is the number of rounds (nrounds) over which the model sequentially builds trees upon the insights of previous trees. The options for this hyperparameter are {100; 150; 200; 250}, with the over-sampling method choosing the highest offered complexity (250) and the under-sampling approach the lowest level of complexity (100), as shown in Table 36.

Table 36. Number of rounds (nrounds) XGBoost

Nrounds              No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                150            250             100              100
Race completion      250            250             100              150
Qualifying ability   100            250             100              250
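A sketch combining both XGBoost grids via caret's 'xgbTree' method, which requires the remaining hyperparameters to be fixed (here at illustrative defaults); 'train_set' and 'top3' are again illustrative names:

library(caret)

xgb_grid <- expand.grid(max_depth = 1:4,                  # cf. Table 35
                        nrounds = c(100, 150, 200, 250),  # cf. Table 36
                        eta = 0.3, gamma = 0, colsample_bytree = 1,
                        min_child_weight = 1, subsample = 1)

xgb_fit <- train(top3 ~ ., data = train_set, method = "xgbTree",
                 metric = "ROC",
                 trControl = trainControl(method = "cv", number = 5,
                                          classProbs = TRUE,
                                          summaryFunction = twoClassSummary),
                 tuneGrid = xgb_grid)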


Conclusion

In this dissertation, three analyses are conducted in a Formula One environment using the Sport Result Prediction-CRISP-DM methodology (Bunker & Thabtah, 2019) as the guiding compass. The first analysis quantifies high performance by investigating the determinants of prediction accuracy for ‘top 3 finishes’. Human or technical errors that force a driver to end the race early are the research topic of the second analysis; concretely, the predictors that steer race completion in Formula One are brought up herein. The third analysis focuses on predicting whether a driver can qualify for the race based on his and his team’s specific characteristics. Two main research questions were defined for this dissertation, and these are answered directly with the gained insights.

Research question 1 (RQ1): ‘What drives predictive accuracy in top 3 performance, race completion and qualifying ability in Formula One?’

For predicting top 3 performance, the starting position is a necessary but insufficient determinant. A good starting position gives the driver a certain edge over drivers with a worse starting position: a driver that starts the race in third position has a 75% predicted probability of finishing in the top 3. Hence, the grid position alone is insufficient to guarantee a top 3 finish and, therefore, we consider other features to support the prediction. The most important past-performance variables are the number of previous top 3 finishes and the normalized points obtained in previous races that season. Moreover, the experience of the constructor, the legacy effect (Bell et al., 2016), also plays a major role in distinguishing these top drivers. The ratio of the number of finished 2nd round qualifiers to the number of race participations for a driver, which serves as a proxy for the driver’s steering capability, is an important feature to predict race completion. Other relevant features are the normalized points in the season and the legacy of the constructor. The analysis shows that it can take up to 600 participations to reach an optimal setting for finishing the race. However, the accuracy of the predictions for this analysis is mediocre (AUC of 0.60, which is still better than random), meaning the results of this analysis are best used as an early-warning system for race completion. The features with high importance in the qualifying ability analysis are related to the past performance and the experience of the constructor. The legacy of the constructor and the past number of non-qualifications should be regarded to predict whether a driver will qualify for the race. The more experienced the constructor and the lower the number of non-qualifications, the higher


the probability that a driver will be predicted as ‘qualified for the race’. For predicting non-qualification, attention should be given to whether a driver has ever started in the top 3 of a race, as a quantification of the driver’s qualifying capabilities. Conclusively, the normalized points and the legacy of the constructor play a major role in all three analyses. The importance of both the normalized points in the season and the constructor’s legacy can be explained from a Formula One domain perspective. The normalized points are a proxy for the driver’s performance in the season and an indication of the driver’s skill level. Thus, drivers that have accumulated more points in the season have a higher probability of finishing in the top 3, completing the race, and qualifying for the race. The drivers are connected to a constructor that shares its expertise through its car, tactics, mental and physical training, and much more. Moreover, the importance of the constructor in all three analyses demonstrates that Formula One is a team effort, since these teams support their drivers in numerous ways.

Research question 2 (RQ2): ‘How well do tree-based models perform at making predictions in the field of F1 top 3 performance, race completion and qualifying ability?’

The more flexible tree-based models, like Gradient Boosting and XGBoost, perform better than the interpretable tree-based model, the Decision Tree Classifier, for the three conducted analyses. Moreover, the values of the tuned hyperparameters of the models indicate a certain level of complexity in the analyses. Additionally, we notice a diminishment of the interpretability/flexibility trade-off (James et al., 2013) between these models due to recently developed techniques, like SHAP plots (Lundberg & Lee, 2017), which make it convenient to interpret a flexible model such as XGBoost. Furthermore, the tree-based models perform well in a Formula One setting, especially when combined with class imbalance treatment approaches. Thus, there is a synergy between the tree-based models and these class imbalance treatment strategies (Galar et al., 2011). The over-sampling approach generates, on average, a considerable increase in prediction accuracy (increase in AUC of 0.06), whereas the under-sampling approach produces an even larger increase (increase in AUC of 0.12). Furthermore, ADASYN, an algorithm of the SMOTE family, also has a positive influence on the models’ accuracy, but this influence is less pronounced than that of the sampling approaches (increase in AUC of 0.02 on average).


Limitations

Technical perspective

Data quality
The number of missing values in the datasets is extremely limited since the data was already available online via the Ergast API. However, the ‘Qualifying’ and ‘Results’ datasets suffered from missing values, because of their limited availability over time, and these were imputed with MICE. Furthermore, the ‘Weather’ dataset is manually constructed, which potentially introduces small data inadequacies.

Feature selection
Reunanen (2003) states that only a subset of variables should be included in the model for the construction of the best predictor. However, the number of features in our analysis is limited and, thus, feature selection would merely serve the purpose of a robustness check. An interesting approach to feature selection is BART (BAckward Regression Trimming), explained by Baesens (2019), in which the most important features are selected.

Sports perspective

Domain expert’s insight
A Formula One expert can be consulted to create additional features that grasp aspects not captured by the current features. The knowledge regarding F1 was accumulated via (1) watching races on lazy Sunday afternoons, (2) watching YouTube videos of Nico Rosberg, (3) the F1 documentary ‘Formula 1: Drive to Survive’, and (4) a literature study of F1 predictive studies.

Top 3 performance analysis
Our choice to model top 3 performance might seem arbitrary, but drivers that finish in the top 3 get recognition when they are invited to celebrate their top 3 position on the podium. However, we understand that this analysis might draw some criticism since ‘second place is first of the losers’, as titled by Judde, Booth and Brooks (2013). Moreover, this analysis can be redefined as a multiclass classification by constructing multiple categories for different levels of performance.

Race completion analysis
The second analysis in this dissertation is a race completion analysis in which the determinants of finishing a race are studied. This analysis does not distinguish between human and technical errors since we investigate overall race completion, but a distinction could be made in the future.

Qualifying ability analysis
This analysis seems the most trivial one since qualifying for a race is often taken for granted. But in the unlikely scenario that a team cannot start a race with both drivers, teams might wonder which factors can explain this unlikely outcome. This is why we include this qualifying analysis in this dissertation.


Further research

Augmenting this study

Distinguish between technical and human dropouts
As mentioned in the ‘Limitations’ section, the race completion analysis purposely includes both human and technical errors. However, a future path might be the distinction between human and technical errors, leading to separate analyses for both types of dropouts.

MetaCost and ROSE
Another approach is building a cost matrix that facilitates learning from imbalanced datasets. Concretely, this approach, called MetaCost (Domingos, 1999), wraps a cost-minimization problem around the classification; this technique is called cost-sensitive learning. Experiments in that study show that MetaCost can systematically reduce cost compared to error-based classification and stratification. Furthermore, the MetaCost approach can be efficiently applied to a large dataset. Another approach for binary imbalanced learning is ROSE (Lunardon, Menardi & Torelli, 2014), which creates synthetic data to deal with class imbalance (a minimal sketch is given at the end of this section).

Other Formula One study ideas

Entertainment study
Formula One has a very loyal fan base, but also a demanding one. Most fans want to be entertained during their Formula One experience. Entertainment can be related to the circuit, the weather, and many other factors. Quantifying entertainment is a problem that deserves attention given the omnipresence of this sport. People want to be entertained, but ‘what drives entertainment?’. The dependent variable could be the rating of a Formula One race on IMDb, and features could be the number of turns per race, race during day/night, at least one home driver present, and so on. Related studies to this research topic are the studies of Remenyik and Molnár (2017) and Henderson, Foo, Lim, and Yip (2010).

In-race strategies
A potential route for in-race strategies is stipulated by Stoppels (2017): predicting lap times using ANN or researching the impact of pit stop strategy on the outcome of the race. The author indicates the added value of comparing Formula One races with NASCAR races. Concretely, this master’s thesis refers to the paper of Allender (2008) and the paper of Pfitzner and Rishel (2005) as studies in which inspiration can be found.
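A minimal sketch of the suggested ROSE approach, assuming a training set 'train_set' with a binary factor target 'top3' (both are illustrative names, not the exact objects used in this study):

library(ROSE)

set.seed(123)
balanced <- ROSE(top3 ~ ., data = train_set)$data  # synthetically balanced data
table(balanced$top3)                               # roughly balanced classes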


Reference list

Abut, F., Akay, M. F., Daneshvar, S., & Heil, D. (2017, September). Artificial neural networks for predicting the racing time of cross-country skiers from survey-based data. In 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN) (pp. 117-121). IEEE.
Al-Masree, H. K. (2015). Extracting Entity Relationship Diagram (ERD) from relational database schema. International Journal of Database Theory and Application, 8(3), 15-26.
Allen, J., & Schumacher, M. (2000). Michael Schumacher: driven to extremes. Bantam.
Allender, M. (2008). Predicting the outcome of NASCAR races: The role of driver experience. Journal of Business & Economics Research (JBER), 6(3).
Alnoukari, M., Alzoabi, Z., & Hanna, S. (2008, August). Applying adaptive software development (ASD) agile modeling on predictive data mining applications: ASD-DM methodology. In 2008 International Symposium on Information Technology (Vol. 2, pp. 1-6). IEEE.
Acuna, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier accuracy. In Classification, clustering, and data mining applications (pp. 639-647). Springer, Berlin, Heidelberg.
Aversa, P., Furnari, S., & Haefliger, S. (2015). Business model configurations and performance: A qualitative comparative analysis in Formula One racing, 2005–2013. Industrial and Corporate Change, 24(3), 655-676.
Azevedo, A. I. R. L., & Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: a parallel overview. IADS-DM.
Backiel, A., Baesens, B., & Claeskens, G. (2014, June). Mining telecommunication networks to enhance customer lifetime predictions. In International Conference on Artificial Intelligence and Soft Computing (pp. 15-26). Springer, Cham.
Baesens, B. (2019). BART: BAckward Regression Trimming. Big data, 7(3), 207-213.
Barandela, R., Valdovinos, R. M., Sánchez, J. S., & Ferri, F. J. (2004, August). The imbalanced training sample problem: Under or over sampling?. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR) (pp. 806-814). Springer, Berlin, Heidelberg.
Bastiaans, M. (1985). On the sliding-window representation in digital signal processing. IEEE transactions on acoustics, speech, and signal processing, 33(4), 868-873.
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning, 36(1-2), 105-139.
Bekker, J., & Lotz, W. (2009). Planning Formula One race strategies using discrete-event simulation. Journal of the Operational Research Society, 60(7), 952-961.
Bell, A., Smith, J., Sabel, C. E., & Jones, K. (2016). Formula for success: multilevel modelling of Formula One driver and constructor performance, 1950–2014. Journal of Quantitative Analysis in Sports, 12(2), 99-112.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153-160).
Bentley, F., & Murray, J. (2016, June). Understanding Video Rewatching Experiences. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video (pp. 69-75).
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems (pp. 2546-2554).
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(Feb), 281-305.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Bernard, S., Heutte, L., & Adam, S. (2009, June). Influence of hyperparameters on random forest accuracy. In International Workshop on Multiple Classifier Systems (pp. 171-180). Springer, Berlin, Heidelberg.
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4), 927-961.
Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied computing and informatics, 15(1), 27-33.
Burez, J., & Van den Poel, D. (2009). Handling class imbalance in customer churn prediction. Expert Systems with Applications, 36(3), 4626-4636.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.
Büyükyazıcı, M., & Sucu, M. (2003). The analytic hierarchy and analytic network processes. Hacettepe Journal of Mathematics and Statistics, 32, 65-73.
Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 875-886). Springer, Boston, MA.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Celeux, G., & Diebolt, J. (1992). A stochastic approximation type EM algorithm for the mixture problem. Stochastics: An International Journal of Probability and Stochastic Processes, 41(1-2), 119-134.
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).
Chen, T., He, T., Benesty, M., Khotilovich, V., & Tang, Y. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1-4.
Chen, H., Rinde, P. B., She, L., Sutjahjo, S., Sommer, C., & Neely, D. (1994). Expert prediction, symbolic learning, and neural networks. An experiment on greyhound racing. IEEE Expert, 9(6), 21-27.
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC genomics, 21(1), 6.
Choo, C. L. W. (2015). Real-time decision making in motorsports: analytics for improving professional car race strategy (Doctoral dissertation, Massachusetts Institute of Technology).
Chu, F., & Zaniolo, C. (2004, May). Fast and light boosting for adaptive mining of data streams. In Pacific-Asia conference on knowledge discovery and data mining (pp. 282-292). Springer, Berlin, Heidelberg.
Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In Advances in neural information processing systems (pp. 313-320).
Davies, A. (2014, November 21). This Is How You Ship an F1 Car Across the Globe in 36 Hours. Retrieved from https://www.wired.com/2014/11/ship-f1-car-across-globe-36-hours/
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application (Vol. 1). Cambridge university press.
Davoodi, E., & Khanteymoori, A. R. (2010). Horse racing prediction using artificial neural networks. Recent Advances in Neural Networks, Fuzzy Systems & Evolutionary Computing, 2010, 155-160.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
Derrig, R., & Francis, L. (2006, March). Distinguishing the forest from the TREES: a comparison of tree based data mining methods. In Casualty Actuarial Society Forum (pp. 1-49).
Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.
do Nascimento, G. S., & de Oliveira, A. A. (2012, November). An agile knowledge discovery in databases software process. In International Conference on Data and Knowledge Engineering (pp. 56-64). Springer, Berlin, Heidelberg.
Domingos, P. (1999, August). Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 155-164).
Donahay, B., & Rosenberger III, P. J. (2007). Using brand personality to measure the effectiveness of image transfer in formula one racing. Marketing Bulletin, 18.
Donders, A. R. T., Van Der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). A gentle introduction to imputation of missing values. Journal of clinical epidemiology, 59(10), 1087-1091.
Drummond, C., & Holte, R. C. (2003, August). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II (Vol. 11, pp. 1-8). Washington DC: Citeseer.
Edelmann-Nusser, J., Hohmann, A., & Henneberg, B. (2002). Modeling and prediction of competitive performance in swimming upon neural networks. European Journal of Sport Science, 2(2), 1-10.
Eichenberger, R., & Stadelmann, D. (2009). Who Is The Best Formula 1 Driver? An Economic Approach to Evaluating Talent. Economic Analysis & Policy, 39(3).
El Sheikh, A. A. R., & Alnoukari, M. (2012). Business intelligence and agile methodologies for knowledge-based organizations: Cross-disciplinary applications. Business Science Reference.
Eryarsoy, E., & Delen, D. (2019, January). Predicting the Outcome of a Football Game: A Comparative Analysis of Single and Ensemble Analytics Methods. In Proceedings of the 52nd Hawaii International Conference on System Sciences.
Fayyad, U. M., Haussler, D., & Stolorz, P. E. (1996, August). KDD for Science Data Analysis: Issues and Examples. In KDD (pp. 50-56).
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37-37.
Fernandez, G. (2003). SAS Applications.
F1. (2019, December 17). 10 records that have been broken in 2019: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.10-records-that-have-been-broken-in-2019.5TfXLot24fpHzRWikUTiy6.html
F1. (2019, January 18). Formula 1's TV and digital audiences grow for the second year running: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.formula-1s-tv-and-digital-audiences-grow-for-the-second-year-running.OqTPVNthtZKFbKqBaimKf.html
F1. (2019, November 07). How F1 technology has supercharged the world: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.how-f1-technology-has-supercharged-the-world.6Gtk3hBxGyUGbNH0q8vDQK.html
Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg.
Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. In icml (Vol. 96, pp. 148-156).
Frick, B., & Humphreys, B. R. (2011). Prize structure and performance: Evidence from NASCAR (No. 2011-12).
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational statistics & data analysis, 38(4), 367-378.
Fushiki, T. (2011). Estimation of prediction error by using K-fold cross-validation. Statistics and Computing, 21(2), 137-146.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern recognition letters, 31(14), 2225-2236.
Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1), 44-65.
Graves, T., Reese, C. S., & Fitzgerald, M. (2003). Hierarchical models for permutations: Analysis of auto racing results. Journal of the American Statistical Association, 98(462), 282-291.
Greenwell, B. M. (2017). pdp: An R package for constructing partial dependence plots. The R Journal, 9(1), 421-436.
Ha, T. M., & Bunke, H. (1997). Off-line, handwritten numeral recognition by perturbation method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 535-539.
Harville, D. A. (1973). Assigning probabilities to the outcomes of multi-entry competitions. Journal of the American Statistical Association, 68(342), 312-316.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Random forests. In The elements of statistical learning (pp. 587-604). Springer, New York, NY.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322-1328). IEEE.
Henderson, J. C., Foo, K., Lim, H., & Yip, S. (2010). Sports events and tourism: The Singapore formula one grand prix. International Journal of Event and Festival Management.
Henery, R. J. (1981). Permutation probabilities as models for horse races. Journal of the Royal Statistical Society: Series B (Methodological), 43(1), 86-91.
Hinkley, D. V. (1977). Jackknifing in unbalanced situations. Technometrics, 19(3), 285-292.
Hucaljuk, J., & Rakipović, A. (2011, May). Predicting football scores using machine learning techniques. In 2011 Proceedings of the 34th International Convention MIPRO (pp. 1623-1627). IEEE.
Hughes, M. (2004). The unofficial Formula One encyclopedia (Rev. ed.). London: Anness.
Inouye, S. K., van Dyck, C. H., Alessi, C. A., Balkin, S., Siegal, A. P., & Horwitz, R. I. (1990). Clarifying confusion: the confusion assessment method: a new method for detection of delirium. Annals of internal medicine, 113(12), 941-948.
Iyengar, R. (2017, November 24). The Logistics behind F1. Retrieved from https://medium.com/speedbox-is-typing/the-logistics-behind-f1-7537e445de20
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 3-7). New York: Springer.
Jenkins, M. (2010). Technological discontinuities and competitive advantage: A historical perspective on Formula 1 motor racing 1950–2006. Journal of Management Studies, 47(5), 884-910.
Jenkins, M., & Floyd, S. (2001). Trajectories in the evolution of technology: A multi-level study of competition in Formula 1 racing. Organization studies, 22(6), 945-969.
Jones, B. (1996). The ultimate encyclopedia of Formula One: the definitive illustrated guide to Grand Prix motor racing. MotorBooks International.
Judde, C., Booth, R., & Brooks, R. (2013). Second place is first of the losers: An analysis of competitive balance in Formula One. Journal of Sports Economics, 14(4), 411-439.
Karetnikov, A., Nuijten, W., & Hassani, M. (2019). Predicting Race Performance in Professional Cycling.
Kearns, M. J. (1988). Thoughts on hypothesis boosting. ML class project, 319, 320.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Machine Learning Proceedings 1992 (pp. 249-256). Morgan Kaufmann.
Li, D., Deogun, J., Spaulding, W., & Shuart, B. (2004, June). Towards missing data imputation: a study of fuzzy k-means clustering method. In International conference on rough sets and current trends in computing (pp. 573-579). Springer, Berlin, Heidelberg.
Li, T., Jing, B., Ying, N., & Yu, X. (2017). Adaptive Scaling. arXiv preprint arXiv:1709.00566.
Lin, Y., & Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474), 578-590.
Little, T. D., Lang, K. M., Wu, W., & Rhemtulla, M. (2016). Missing data. Developmental psychopathology, 1-37.
Littlestone, N., & Warmuth, M. K. (1989). The weighted majority algorithm (pp. 256-261). University of California, Santa Cruz, Computer Research Laboratory.
Liu, C., White, M., & Newell, G. (2009, July). Measuring the accuracy of species distribution models: a review. In Proceedings 18th World IMACs/MODSIM Congress. Cairns, Australia (pp. 4241-4247).
Lo, V. S. (2002). The true lift model: a novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2), 78-86.
Lo, V. S., & Bacon‐Shone, J. (1994). A Comparison between Two Models for Predicting Ordering Probabilities in Multiple‐Entry Competitions. Journal of the Royal Statistical Society: Series D (The Statistician), 43(2), 317-327.
Lobo, J. M., Jiménez‐Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography, 17(2), 145-151.
Lunardon, N., Menardi, G., & Torelli, N. (2014). ROSE: A Package for Binary Imbalanced Learning. R journal, 6(1).
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., ... & Lee, S. I. (2019). Explainable ai for trees: From local explanations to global understanding. arXiv preprint arXiv:1905.04610.
Lundberg, S. M., Erion, G. G., & Lee, S. I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in neural information processing systems (pp. 4765-4774).
Luce, R. D. (1959). Individual choice behavior. John Wiley and Sons.
Marino, A., Aversa, P., Mesquita, L., & Anand, J. (2015). Driving performance via exploration in changing environments: Evidence from formula one racing. Organization Science, 26(4), 1079-1100.
Maszczyk, A., Roczniok, R., Czuba, M., Zając, A., Waśkiewicz, Z., Mikołajec, K., & Stanula, A. (2012). Application of regression and neural models to predict competitive swimming performance. Perceptual and Motor Skills, 114(2), 610-626.
McGivney, B. A., Hernandez, B., Katz, L. M., MacHugh, D. E., McGovern, S. P., Parnell, A. C., ... & Hill, E. W. (2019). A genomic prediction model for racecourse starts in the Thoroughbred horse. Animal genetics, 50(4), 347-357.
Metz, C. E. (1978, October). Basic principles of ROC analysis. In Seminars in nuclear medicine (Vol. 8, No. 4, pp. 283-298). WB Saunders.
Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1), 27-32.
Molnar, C. (2019). Interpretable machine learning. Lulu.com.
Nguyen, H., Bui, X. N., Bui, H. B., & Cuong, D. T. (2019). Developing an XGBoost model to predict blast-induced peak particle velocity in an open-pit mine: a case study. Acta Geophysica, 67(2), 477-490.
Noble, J., & Hughes, M. (2004). Formula One Racing for Dummies. John Wiley & Sons.
Oza, N. C. (2005, October). Online bagging and boosting. In 2005 IEEE international conference on systems, man and cybernetics (Vol. 3, pp. 2340-2345). IEEE.
Parikh, R., Mathai, A., Parikh, S., Sekhar, G. C., & Thomas, R. (2008). Understanding and using sensitivity, specificity and predictive values. Indian journal of ophthalmology, 56(1), 45.
Pfitzner, B., & Rishel, T. D. (2005). Do reliable predictors exist for the outcomes of NASCAR races. The Sport Journal, 8(2).
Piatetsky-Shapiro, G., & Steingold, S. (2000). Measuring lift quality in database marketing. ACM SIGKDD Explorations Newsletter, 2(2), 76-80.
Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
Probst, P., Wright, M. N., & Boulesteix, A. L. (2019). Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3), e1301.
Przednowek, K., Iskra, J., & Przednowek, K. H. (2014, October). Predictive Modeling in 400-Metres Hurdles Races. In icSPORTS (pp. 137-144).
Pudaruth, S., Medard, N., & Dookhun, Z. B. (2013). Horse Racing Prediction at the Champ De Mars using a Weighted Probabilistic Approach. International Journal of Computer Applications, 72(5).
Quek, Y. T., Woo, W. L., & Logenthiran, T. (2016, November). DC equipment identification using K-means clustering and kNN classification techniques. In 2016 IEEE Region 10 Conference (TENCON) (pp. 777-780). IEEE.
Quenouille, M. H. (1949, July). Approximate tests of correlation in time-series 3. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 45, No. 3, pp. 483-484). Cambridge University Press.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353-360.
Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106.
Remenyik, B., & Molnár, C. (2017). The role of the Formula 1 Grand Prix in Hungary's tourism. Prosperitas, 4(3), 92-112.
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3(Mar), 1371-1382.
Rodriguez, J. D., Perez, A., & Lozano, J. A. (2009). Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE transactions on pattern analysis and machine intelligence, 32(3), 569-575.
Rosner, B., Tworoger, S., & Qiu, W. (2015). Correcting AUC for measurement error. Journal of biometrics & biostatistics, 6(5).
Rosso, G., & Rosso, A. F. (2016). Statistical Analysis of F1 Monaco Grand Prix 2016. Relations Between Weather, Tyre Type and Race Stints.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American statistical Association, 91(434), 473-489.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3), 660-674.
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.
Scheffer, J. (2002). Dealing with missing data.
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American journal of epidemiology, 179(6), 764-774.
Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307-317.
Silva, K. M., & Silva, F. J. (2010). A tale of two motorsports: A graphical-statistical analysis of how practice, qualifying, and past success relate to finish position in NASCAR and Formula One racing. Retrieved from http://newton.uor.edu/FacultyFolder/Silva/NASCARvF1.pdf
Spurgeon, B. (2009). Age Old Question of Whether it is the Car or the Driver that Counts. Formula 1, 101.
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
Stern, H. (1990). Models for distributions on permutations. Journal of the American Statistical Association, 85(410), 558-564.
Stoppels, E. (2017). Predicting race results using artificial neural networks (Master's thesis, University of Twente).
Story, M., & Congalton, R. G. (1986). Accuracy assessment: a user's perspective. Photogrammetric Engineering and remote sensing, 52(3), 397-399.
Sulsters, C., & Bekker, R. (2018). Simulating Formula One Race Strategies.
Stuart, G. (2016, January 27). How Daniel Ricciardo gets fit for F1. Retrieved from https://www.redbull.com/sg-en/daniel-ricciardo-f1-workout-training
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1), 307.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1), 25.
Tax, N., & Joustra, Y. (2015). Predicting the Dutch football competition using public data: A machine learning approach. Transactions on knowledge and data engineering, 10(10), 1-13.
Townsend, J. T. (1971). Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9(1), 40-50.
Tulabandhula, T., & Rudin, C. (2014). Tire changes, fresh air, and yellow flags: challenges in predictive analytics for professional racing. Big data, 2(2), 97-112.
Ulmer, B., Fernandez, M., & Peterson, M. (2013). Predicting soccer match results in the English Premier League (Doctoral dissertation, Stanford).
van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in medicine, 18(6), 681-694.
Van den Poel, D., De Schamphelaere, J., & Wets, G. (2004). Direct and indirect effects of retail promotions on sales and profits in the do-it-yourself market. Expert Systems with Applications, 27(1), 53-62.
Van der Heijden, G. J., Donders, A. R. T., Stijnen, T., & Moons, K. G. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of clinical epidemiology, 59(10), 1102-1109.
Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007, June). Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th international conference on Machine learning (pp. 935-942).
Van Rijsbergen, C. J. (1979). Information retrieval.
Von Hippel, P. T. (2004). Biases in SPSS 12.0 missing value analysis. The American Statistician, 58(2), 160-164.
Wehenkel, L., Pavella, M., Euxibie, E., & Heilbronn, B. (1994). Decision tree based transient stability method a case study. IEEE Transactions on Power Systems, 9(1), 459-469.
Wiktorowicz, K., Przednowek, K., Lassota, L., & Krzeszowski, T. (2015). Predictive modeling in race walking. Computational intelligence and neuroscience, 2015.
Wirth, R., & Hipp, J. (2000, April). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining (pp. 29-39). London, UK: Springer-Verlag.
Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9), 2839-2846.
Xia, Y., Liu, C., Li, Y., & Liu, N. (2017). A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78, 225-241.
Yan, R., Ma, Z., Zhao, Y., & Kokogiannakis, G. (2016). A decision tree based data-driven diagnostic strategy for air handling units. Energy and Buildings, 133, 37-45.
Yeh, D., Li, Y., & Chu, W. (2008). Extracting entity-relationship diagram from a table-based legacy database. Journal of Systems and Software, 81(5), 764-771.
Zhu, X. J. (2005). Semi-supervised learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.

Appendices

Appendix 1: Literature Tables

Table 37. Literature table of the related sports studies

RELATED SPORTS LITERATURE

Swimming

Edelmann-Nusser et al. (2002)
Content: Predicting the performance of a 200m backstroke swimmer at the Olympic Games of Sydney.
Model: neural networks; leave-one-out method; linear regression

Horse racing

Lo and Bacon‐Shone (1994)
Content: Predicting ordering probabilities in a multiple-entry competition.
Model: model of Harville (1973); model of Henery (1981)

Davoodi and Khanteymoori (2010)
Content: Predicting the outcome of horse races using 3 different types of algorithms.
Model: ANN (Artificial Neural Networks); time series analysis; regression (Engelbrecht, 2007)

Pudaruth, Medard and Dookhun (2013)
Content: Predictions of horse races using 9 factors; this analysis leads to a success rate of 58.33% for predicting the winner.
Model: weighted probabilistic model

McGivney et al. (2019)
Content: Assessing the relevance of heredity for the durability traits of Thoroughbred horses; horses with higher genetic potential have fewer non-participated races and better race outcomes.
Model: RFME (Random Forest with Mixed Effects)

Cycling

Karetnikov, Nuijten and Hassani (2019)
Content: This study includes the impact of a training schema on the performance of a cyclist; the XGBoost model is the best for predicting mountain stages and CatBoost for flat races.
Model: linear regression; lasso regression; long short-term memory (LSTM); decision tree; random forest regression; CatBoost; XGBoost

Soccer

Eryarsoy and Delen (2019)
Content: Predictive model for the outcome of a soccer game (a win, a loss or a draw) using the CRISP-DM methodology.
Model: naive Bayes; decision trees; gradient boosting trees; random forest; one-vs-all SGD

Ulmer, Fernandez and Peterson (2013)
Content: Prediction of soccer matches in the English Premier League using machine learning techniques.
Model: linear support vector machine; random forest; SVM with RBF kernel

Tax and Joustra (2015)
Content: Match-based predictions for soccer in the Dutch Eredivisie.
Model: naïve Bayes; LogitBoost; neural network; random forest; genetic programming

Skiing

Abut, Akay, Daneshvar and Heil (2017)
Content: The authors examine the predictive accuracy of artificial neural networks (ANN) for the racing time of cross-country skiers.
Model: artificial neural networks (ANN)

Hurdle races

Przednowek, Iskra and Przednowek (2014)
Content: A predictive modeling study in the field of 400-metre hurdle races.
Model: ordinary least squares (OLS) regression; ridge regression; LASSO regression; neural networks


Table 38. Literature table of the NASCAR studies

NASCAR LITERATURE

Graves, Reese and Fitzgerald (2003)
Content: Using finishing probabilities in NASCAR races; assessing the track-independent driver abilities and predicting future stars.
Model: probability model; model of Luce (1959); model of Stern (1990)

Pfitzner and Rishel (2005)
Content: Assessing the reliability of predictors for NASCAR race outcomes.
Model: correlation analysis

Allender (2008)
Content: Influence of driver experience on the predictions of NASCAR races, with a focus on the interaction effects.
Model: linear regression with focus on the interaction effects

Tulabandhula and Rudin (2014)
Content: Assessing the possibility to determine further race outcomes based upon past performance; optimizing re-fueling strategies and tire-change decisions.
Model: 2 linear baseline models; SVR (support vector regression); ridge regression; LASSO (least absolute shrinkage and selection operator) regression; random forest

Choo (2015)
Content: Further research on the paper of Tulabandhula and Rudin (2014); the author makes an in-race performance prediction software.
Model: machine learning approach cf. the previous paper of Tulabandhula and Rudin (2014); naïve predictions

Humphreys and Frick (2019)
Content: Analyzing the tournament theory in NASCAR; the impact of prize money on performance and driver speed.
Model: linear reduced form model

Silva and Silva (2010)
Content: Comparative analysis: NASCAR vs F1; looking into the differences and similarities of these motorized sports, with specific interest in the correlation coefficients.
Model: linear regression


Table 39. Literature table of the F1 studies

F1 LITERATURE

Eichenberger and Stadelmann (2009)
Content: The authors research the best F1 driver of all time. They conclude Juan Manuel Fangio is the best driver of all time, closely followed by Jim Clark and Michael Schumacher.
Model: linear regression

Bell, Smith, Sabel and Jones (2016)
Content: The authors assess the best F1 driver of all time taking into account their designated team, and the importance of both teams and drivers.
Model: linear regression

Stoppels (2017)
Content: Dissertation about predicting F1 results for the 2016 F1 season; predicting the outcome of the last 4 races based upon the results in the other races of 2016.
Model: ANN (Artificial Neural Network); multiclass logistic regression; linear regression

Aversa, Furnari and Haefliger (2015)
Content: Defines 4 different business models in F1 and their relationship with performance.
Model: QCA (Qualitative Comparative Analysis)

Büyükyazıcı and Sucu (2003)
Content: Determining the city to hold an F1 race depending upon hospitals, hotels and the renown of the city.
Model: AHP (Analytic Hierarchy Process); ANP (Analytic Network Process)

Marino, Aversa, Mesquita and Anand (2015)
Content: Changes in the F1 environment and the impact on the performance of the drivers.
Model: GMM (Generalized Method of Moments)

Judde, Booth and Brooks (2013)
Content: Analysis of competitive balance in Formula One racing using data from 1950-2010; changes regarding regulation have an impact on the uncertainty of the championship.
Model: linear regression

Jenkins and Floyd (2001)
Content: The dominance of a team due to technology is investigated. This study states that transparency is closely related with competitive advantage.
Model: theoretical sampling (see footnote 29)

Jenkins (2010)
Content: The relationship between technological discontinuities and competitive advantage is discussed.
Model: descriptive analysis conducted via in-depth interviews

Rosso and Rosso (2016)
Content: Impact of changing weather conditions during an F1 race, with the Monaco 2016 GP as example.
Model: quantile regression; POT analysis

Table 40. Literature table of the potential methodologies

METHODOLOGY LITERATURE

Fayyad (1996)
Content: KDD (Knowledge Discovery in Databases) as framework for pattern discovery in a dataset.
Methodology: KDD

Fayyad et al. (1996)
Content: Main issues in exploring datasets and the use of KDD in 5 examples.
Methodology: KDD

Fernandez (2003)
Content: SEMMA, developed by SAS, as a methodology which enables a smooth data mining process.
Methodology: SEMMA

Wirth and Hipp (2000)
Content: CRISP-DM (CRoss-Industry Standard Process for Data Mining) to fill the gap of a lack of industry standard for data mining.
Methodology: CRISP-DM

Azevedo and Santos (2008)
Content: Comparing these 3 methodologies and assessing their value for a data mining process.
Methodology: KDD, SEMMA, CRISP-DM

Bunker and Thabtah (2019)
Content: Proposition of an extension of CRISP-DM in a sport environment.
Methodology: SRP-CRISP-DM

Alnoukari, Alzoabi and Hanna (2008)
Content: Specifically designed methodology to be used in an agile environment.
Methodology: ASD-DM

29 The theoretical sampling approach finds its origin in the grounded theory approach by Glaser and Strauss (1967). This approach aims at establishing a theory via collecting and analyzing qualitative data.


Appendix 2: Extended Data Dictionary

The open-source data for this study can be found at http://ergast.com/downloads/f1db_csv.zip. These data sets and their variables are thoroughly explained below, together with the extra data sets ‘Weather’ (Appendix 3) and ‘Country’. The ‘Weather’ data set facilitated the creation of the ‘rain’ and ‘rainExp’ features, whereas the ‘Country’ data set enabled us to create the ‘homeAdvantage’ feature; a sketch of how such a feature can be derived from these data sets is given below.
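As an illustration, a sketch of how the ‘homeAdvantage’ feature can be derived by matching a driver’s nationality to the nationality of the circuit; the dplyr pipeline below follows the keys of the data dictionary, but the joins are a simplified assumption of the actual feature engineering code (the data frames results, races, circuits, drivers and country are assumed to be loaded):

library(dplyr)

home <- results %>%
  left_join(races,    by = "raceId") %>%      # adds circuitId
  left_join(circuits, by = "circuitId") %>%   # adds country of the circuit
  left_join(drivers,  by = "driverId") %>%    # adds nationality of the driver
  left_join(country,  by = "country") %>%     # adds nationalityCircuit
  mutate(homeAdvantage = as.integer(nationality == nationalityCircuit))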

Table 41. Circuits

Variable     Explanation
circuitId    ID to identify circuit
circuitRef   Short name of circuit
name         Full name of circuit
location     City of the circuit
country      Country of the circuit
lat          Latitude
lng          Longitude
alt          No added value, just a column with ‘/N’
url          Link to Wikipedia page of circuit

Table 42. Constructor results

Variable               Explanation
constructorResultsId   ID to identify result of constructor
raceId                 ID to identify race
constructorId          ID to identify constructor
points                 Points for constructor in particular race
status                 No added value, just a column with ‘/N’


Table 43. Constructor standings

Variable                 Explanation
constructorStandingsId   ID to identify standing of constructor
raceId                   ID to identify race
constructorId            ID to identify constructor
points                   Accumulated constructor points
position                 Standing of constructor in Championship
positionText             Position converted to character
wins                     Wins of constructor in season

Table 44. Constructors

Variable         Explanation
constructorId    ID to identify constructor
constructorRef   Short name of constructor
name             Full name of constructor
nationality      Nationality of constructor
url              Wikipedia link to constructor

Table 45. Driver standings

Variable            Explanation
driverStandingsId   ID to identify standing of driver
raceId              ID to identify race
driverId            ID to identify driver
points              Accumulated points of driver
position            Standing of driver in Championship
positionText        Position converted to character
wins                Wins of driver in season


Table 46. Drivers

Variable      Explanation
driverId      ID to identify driver
driverRef     Reference to the driver
number        Driver’s racing number
code          3 letter code for driver
forename      Driver’s forename
surname       Driver’s surname
dob           Date of birth
nationality   Driver’s nationality
url           Link to driver’s Wikipedia page

Table 47. Lap times

Variable       Explanation
raceId         ID to identify race
driverId       ID to identify driver
lap            The lap of which the time is shown
position       Position of driver in race
time           The actual time of the lap
milliseconds   Converted to milliseconds

Table 48. Pit stops

Variable       Explanation
raceId         ID to identify race
driverId       ID to identify driver
stop           Pit stop number in particular race
lap            Lap of pit stop
time           Time of pit stop
duration       Duration of pit stop
milliseconds   Converted to milliseconds


Table 49. Qualifying

Variable        Explanation
qualifyId       ID to identify qualification result
raceId          ID to identify race
driverId        ID to identify driver
constructorId   ID to identify constructor
number          Racing number of driver
position        Starting position based on qualifications
q1              Results of first qualifier
q2              Results of second qualifier
q3              Results of third qualifier

Table 50. Races

Variable    Explanation
raceId      ID to identify race
year        Year of race
round       Round number of the race within the season
circuitId   ID to identify circuit
name        Name of GP
date        Date of race
time        Starting time of the race
url         Wikipedia page of GP of particular season

Table 51. Results

Variable          Explanation
resultId          ID to identify result
raceId            ID to identify race
driverId          ID to identify driver
constructorId     ID to identify constructor
number            Driver’s racing number
grid              Starting position
position          Finish result
positionText      Converted to character
positionOrder     Finish result with non-finishing incorporated
points            Points for driver in a particular race
laps              Number of laps driven
time              Time of race for winner, time after winner for the others
milliseconds      Expressed in milliseconds
fastestLap        The fastest lap of race
rank              Rank of fastest lap
fastestLapTime    Time of the fastest lap of driver
fastestLapSpeed   Speed of the fastest lap of driver
statusId          Status of driver

Table 52. Seasons

Variable   Explanation
year       Year of season
url        Wikipedia link to season

Table 53. Status

Variable   Explanation
statusId   ID to identify status
status     Related status

Table 54. Weather

Variable   Explanation
year       Year of race
name       Name of circuit
rain       Wet affected situation


Table 55. Country

Variable             Explanation
country              Country
nationalityCircuit   Nationality of circuit


Appendix 3: All Wet Affected Races

The wet affected races come from different sources:
1. https://forums.autosport.com/topic/193061-list-of-all-wet-f1-grands-prix/
2. https://www.reddit.com/r/formula1/comments/91676l/a_list_of_the_most_recent_rain_affected_races/
3. Wikipedia
4. Jones (1996)

Source 1 refers to Jones (1996) for the wet affected races as follows:

Figure 68. Popular non-verified source referring to Jones (1996) (source 4) for weather data


Source 1 also contains other rain affected races, as can be found in Figure 69.

Figure 69. Source 1 (Autosport Forum) wet affected races

Source 2 consists of additional data for rain affected races as shown in Figures 70, 71 and 72.

Figure 70. Source 2 (Reddit) wet affected races


Figure 71. Source 2 wet affected race part 2

Figure 72. Source 2 wet affected races part 3

Lastly, we checked the claims of the sources above via Wikipedia. For example, the weather conditions for the 2000 Canadian GP were studied, as shown in Figure 73.


Figure 73. Source 3 (Wikipedia) as extra control instrument

The rain feature for this Canadian GP in 2000 was determined as follows. First, we check all sources for the availability of the raining data for the year of the request. Jones (1996) does not contain this information since 2000 lies beyond the coverage of this source. However, the other 3 sources confirm that this GP was affected by the rain. Thus, we include all the observations of this race with a ‘rain’ feature equal to 1. In the case of a disagreement between the sources, we perform majority voting (three 0s and one 1 result in a 0). In the case of a tie (2 vs 2), we follow Jones (1996), as this is a verified source. All wet affected races, constructed following the principles stated above, can be found in Table 56.
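Below is a minimal sketch of this voting rule, assuming one 0/1 vote per source for a given race (the function and source names are illustrative, not part of the actual code):

rain_vote <- function(votes) {
  # votes: named 0/1 vector, e.g. c(jones = 1, autosport = 0, reddit = 1, wiki = 1);
  # NA means the source has no coverage for that year (e.g. Jones (1996) after 1996)
  votes <- votes[!is.na(votes)]
  ones  <- sum(votes)
  zeros <- length(votes) - ones
  if (ones > zeros) return(1)
  if (zeros > ones) return(0)
  unname(votes["jones"])  # tie (2 vs 2): follow the verified source, Jones (1996)
}

rain_vote(c(jones = NA, autosport = 1, reddit = 1, wiki = 1))  # 2000 Canadian GP -> 1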

Table 56. Wet affected races per decade

Decade (number of wet affected races): wet affected races

50s (14): 1950 Indianapolis 500; 1951; 1952; 1952; 1952; 1953; 1954; 1954 French Grand Prix; 1954 British Grand Prix; 1954 Swiss Grand Prix; 1955 Dutch Grand Prix; 1956 Belgian Grand Prix; 1956; 1958

60s (16): 1960 Monaco Grand Prix; 1961 British Grand Prix; 1961; 1962 German Grand Prix; 1962 Italian Grand Prix; 1963 Belgian Grand Prix; 1963 French Grand Prix; 1965 Belgian Grand Prix; 1965; 1966 Belgian Grand Prix; 1966 British Grand Prix; 1966 German Grand Prix; 1967; 1968 Dutch Grand Prix; 1968 French Grand Prix; 1968 German Grand Prix

70s (21): 1971 Dutch Grand Prix; 1971 Canadian Grand Prix; 1972; 1972 Monaco Grand Prix; 1972 United States Grand Prix; 1973 Canadian Grand Prix; 1974 Brazilian Grand Prix; 1974 Spanish Grand Prix; 1974 German Grand Prix; 1975 Monaco Grand Prix; 1975 Dutch Grand Prix; 1975 British Grand Prix; 1975; 1976 German Grand Prix; 1976; 1977 Belgian Grand Prix; 1977 Austrian Grand Prix; 1977 United States Grand Prix West; 1978 Austrian Grand Prix; 1979; 1979 United States Grand Prix West

80s (17): 1981 Brazilian Grand Prix; 1981 Italian Grand Prix; 1981 Belgian Grand Prix; 1981 French Grand Prix; 1981 Canadian Grand Prix; 1981 German Grand Prix; 1983 Monaco Grand Prix; 1984 Monaco Grand Prix; 1985 Portuguese Grand Prix; 1985 Belgian Grand Prix; 1987; 1988 British Grand Prix; 1988 German Grand Prix; 1988 Japanese Grand Prix; 1989 Canadian Grand Prix; 1989 Belgian Grand Prix; 1989

90s (28): 1991 Brazilian Grand Prix; 1991 San Marino Grand Prix; 1991 Australian Grand Prix; 1991 Spanish Grand Prix; 1992 Spanish Grand Prix; 1992 French Grand Prix; 1992 Belgian Grand Prix; 1993 South African Grand Prix; 1993 Brazilian Grand Prix; 1993; 1993 Japanese Grand Prix; 1994 Japanese Grand Prix; 1995 San Marino Grand Prix; 1995 French Grand Prix; 1995 Belgian Grand Prix; 1995 European Grand Prix; 1995 Japanese Grand Prix; 1996 Brazilian Grand Prix; 1996 Spanish Grand Prix; 1996 Monaco Grand Prix; 1997 Monaco Grand Prix; 1997 Belgian Grand Prix; 1997 French Grand Prix; 1998 Belgian Grand Prix; 1998 Argentine Grand Prix; 1998 British Grand Prix; 1999 French Grand Prix; 1999 European Grand Prix

00s (28): 2000 German Grand Prix; 2000 Canadian Grand Prix; 2000 European Grand Prix; 2000 Belgian Grand Prix; 2000 United States Grand Prix; 2000 Japanese Grand Prix; 2001; 2001 Brazilian Grand Prix; 2002 British Grand Prix; 2003 Brazilian Grand Prix; 2003 Australian Grand Prix; 2004 Malaysian Grand Prix; 2004 Italian Grand Prix; 2004 Brazilian Grand Prix; 2005 Belgian Grand Prix; 2006; 2006; 2007 European Grand Prix; 2007 Japanese Grand Prix; 2007 Chinese Grand Prix; 2008 British Grand Prix; 2008 Monaco Grand Prix; 2008 Belgian Grand Prix; 2008 Italian Grand Prix; 2008 Brazilian Grand Prix; 2008 French Grand Prix; 2009 Malaysian Grand Prix; 2009 Chinese Grand Prix

10s (25): 2010 Australian Grand Prix; 2010; 2010 Chinese Grand Prix; 2010 Belgian Grand Prix; 2010 Brazilian Grand Prix; 2011 Canadian Grand Prix; 2011 British Grand Prix; 2011 German Grand Prix; 2011 Hungarian Grand Prix; 2011 Korean Grand Prix; 2012 Malaysian Grand Prix; 2012 Brazilian Grand Prix; 2013 Malaysian Grand Prix; 2014 Hungarian Grand Prix; 2014 Japanese Grand Prix; 2015 Belgian Grand Prix; 2015 United States Grand Prix; 2015 British Grand Prix; 2016 Monaco Grand Prix; 2016 British Grand Prix; 2016 Brazilian Grand Prix; 2017 Chinese Grand Prix; 2017; 2018 German Grand Prix; 2019 German Grand Prix


Appendix 4: Current Formula One teams and their previous team names

In Table 57, the constructors, the year of their first participation, and their previous team names are shown. To account for the acquired legacy (Bell et al., 2016), we consider these different team names to be the same constructor, since the acquired experience will be more or less preserved. Maybe the approach of the team changes, but they do not start from scratch.

Table 57. Constructors and their legacy

Current Formula One team   Year of first GP                                       Previous team names
Alfa Romeo                 1950                                                   Sauber (1993-2005, 2011-2018); BMW Sauber (2006-2010)
Alpha Tauri                / (due to COVID-19 no races in the first part of 2020) Minardi (1985-2005); Toro Rosso (2006-2019)
Ferrari                    1950                                                   /
Haas                       2016                                                   /
McLaren                    1966                                                   /
Mercedes                   1954                                                   Tyrrell (1970-1998); BAR (1999-2005); Honda (2006-2008); Brawn (2009)
Racing Point               2019                                                   Jordan (1991-2005); Midland (2006); Spyker (2007); Force India (2008-2018)
Red Bull                   2005                                                   Stewart (1997-1999); Jaguar (2000-2004)
Renault                    1977                                                   Toleman (1981-1985); Benetton (1986-2001); Lotus (2012-2015)
Williams                   1978                                                   /

Source: https://en.wikipedia.org/wiki/List_of_Formula_One_constructors


Appendix 5: Build-up code in R

The R code follows the SRP-CRISP-DM methodology. In the ‘Domain Understanding’ step, the intentions of the different models are defined. Furthermore, the 3 analyses are explained together with the related problem they try to solve.

In the ‘Data Understanding’ step, the required packages and the different Ergast API data tables are loaded in. Moreover, the data is explored and the data quality is assessed via a missing value analysis. An ERD is constructed to understand the relationships between the different datasets.

In the ‘Data Preparation & Feature Engineering’ step, the datasets are merged and the features are crafted. Also, the features are categorized in subsets: experience-related features, lagged features, and starting features. The end result of this step is the basetable containing all the features and the dependent variables for our 3 analyses.

The ‘Modeling’ step creates training and test sets from the basetable. The training set is fitted with decision trees, bagging, random forest, and boosting to make predictions for top 3 performance, race completion, and qualifying ability. 5-fold cross-validation is performed to tune the hyperparameters of the models with the grid search technique. Afterward, the fitted models with tuned hyperparameters are used to make predictions for unseen data.

In the ‘Model Evaluation’ phase, these predictions are compared to the real target values of the test set. We perform different evaluation techniques: threshold-based metrics, a threshold-independent metric, and graphical representations via variable importance and partial dependence. An addition here is the variable selection procedure, which acts as a robustness check for our fitted models.

In the ‘Model Deployment’ phase, a connection with the Ergast API is made to update the data in real-time.
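As a minimal sketch of the data loading in the ‘Data Understanding’ step, the Ergast CSV dump can be retrieved and read in as follows; the file names inside the archive are assumed to match the data set names of Appendix 2:

url <- "http://ergast.com/downloads/f1db_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "f1db")

# read a few of the data sets described in Appendix 2
results    <- read.csv("f1db/results.csv")
races      <- read.csv("f1db/races.csv")
qualifying <- read.csv("f1db/qualifying.csv")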

The structure of the code can be seen on the following pages and is shown in these figures: Figure 74: Domain Understanding, Data understanding, Data preparation & Feature Engineering Figure 75: Modeling Figure 76: Evaluation Figure 77: Deployment, Feature selection as robustness check (Extra)

Lastly, we would like to acknowledge the developers of the different R packages used: caret (Kuhn), mice (van Buuren), tibble (Müller), lift (Hoornaert, Ballings & Van den Poel), xgboost (Chen, He, Benesty & Khotilovich), AUC (Ballings & Van den Poel), dplyr (Wickham), lubridate (Spinu), smotefamily (Siriseriwan), stringr (Wickham), MLmetrics (Yan), and iml (Molnar).


Figure 74. Code layout step 1, 2 and 3 SRP-CRISP-DM


Figure 75. Code layout step 4 SRP-CRISP-DM


Figure 76. Code layout step 5 SRP-CRISP-DM


Figure 77. Code layout step 6 SRP-CRISP-DM and feature selection


Appendix 6: Impact on AUC of class imbalance treatment

Table 58. Average AUC for the different analyses with the different class imbalance approaches

                     Baseline   Over-sampling   Under-sampling   ADASYN
Top 3                0.674      0.714           0.818            0.718
Race completion      0.602      0.601           0.618            0.594
Qualifying ability   0.680      0.817           0.876            0.695

Average difference with baseline:
Baseline:       0
Over-sampling:  (0.040 − 0.001 + 0.137) / 3 = 0.059 ≈ 0.06
Under-sampling: (0.144 + 0.016 + 0.196) / 3 = 0.119 ≈ 0.12
ADASYN:         (0.044 − 0.008 + 0.015) / 3 = 0.017 ≈ 0.02
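The averages in Table 58 can be verified with a few lines of R (values taken from the table above):

auc <- data.frame(baseline = c(0.674, 0.602, 0.680),
                  over     = c(0.714, 0.601, 0.817),
                  under    = c(0.818, 0.618, 0.876),
                  adasyn   = c(0.718, 0.594, 0.695))
round(colMeans(auc - auc$baseline), 3)
# baseline 0.000, over 0.059, under 0.119, adasyn 0.017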
