SIFTING THROUGH THE NOISE IN FORMULA ONE: PREDICTIVE PERFORMANCE OF TREE-BASED MODELS

Word count: 34,854

Léon Sobrie
Student number: 01502643

Supervisor: Prof. Dr. Dirk Van den Poel
Co-supervisor: Bram Janssens

Master's Dissertation submitted to obtain the degree of:
Master in Business Engineering: Data Analytics

Academic year: 2019-2020


Confidentiality agreement

PERMISSION

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced, provided that the source is referenced.

Léon Sobrie, 02/06/2020


Preface

Dear reader,

This thesis is written to achieve my Master's Degree in Business Engineering at Ghent University. I have chosen Formula One as the field of research because of my interest in sports. I started playing soccer when I was 6 years old; nowadays, running and cycling are my sports of preference. My interest in analytics was stirred up during my 5-year education in Business Engineering at Ghent University. After examining the literature on different sports, I concluded that the literature regarding motorized sports, and more particularly Formula One, is rather limited and that additional research can add value to the literature. A Formula One Grand Prix is also one of my favorite sports to watch on a lazy Sunday afternoon. What inspires me is the fact that Formula One drivers push their bodies to the limit for their passion. They undergo extreme training and follow fine-tuned diets to reach a level of fitness that gives them a higher chance of surviving a crash. In the race, they need a jet fighter mindset to steer their cars at a mind-blowing pace.

The following persons deserve an acknowledgment of their contribution to this Master's Dissertation. I am profoundly grateful to Bram Janssens for his professional guidance during the past year. Next, I would like to thank Prof. Dr. Dirk Van den Poel, the promotor of this dissertation, who inspired me to pick 'Data Analytics' as my main subject. Lastly, my research would not have been possible without the support of my parents and my friends.

Enjoy reading my Master's Dissertation.

Léon Sobrie


Preamble

The COVID-19 virus has had a severe impact on the economy and on social life in 2020. All universities in Belgium were closed and had to switch to online learning. I would like to praise the flexibility of Ghent University in conveniently providing these online learning opportunities.

COVID-19 has had no major impact on the writing of this dissertation, since the period of the analysis is defined as 1950-2019. However, a connection with the Ergast API, the provider of the database, has been established, meaning the data sets can be updated with new races. This would lead to altering the basetable and augmenting the predictions with new test data. The implementation of this connection is part of the deployment phase of the SRP-CRISP-DM (Bunker & Thabtah, 2019). The purpose of this connection is to conveniently retrieve the Formula One data sets that are used to conduct this research. Thus, this connection can be considered an add-on to this dissertation.

With this being said, the conducted analyses, (1) top 3 performance, (2) race completion, and (3) qualifying ability, have been carried out in an undisturbed way. Hence, this dissertation has not been altered because of this exceptional situation.

This preamble is drawn up in consultation between the student and the supervisor and is approved by both.


Abstract

Formula One is one of the most highly anticipated sports and, yet, the literature regarding Formula One is rather limited. Hence, our research contributes to the literature via 3 analyses on publicly available F1 data. More particularly, tree-based models are used for our analyses. The first analysis studies factors of 'top 3 finish' performance, leading to additional insights into high performance in Formula One. Achieving high performance helps teams to boost their reputation, potentially leading to better sponsor deals. In this first analysis, success is a proxy for discovering top-performing drivers. The second analysis studies determinants of 'race completion', helping to make Formula One a safer environment for the drivers. Both technical and human errors occur in Formula One and we hope to contribute to reducing these. The third analysis comprises the 'qualifying ability' for the race, which determines whether teams have 2 drivers at the start. Our 3 analyses are conducted using the state-of-the-art SRP-CRISP-DM (Sport Result Prediction CRoss Industry Standard Process for Data Mining) methodology of Bunker and Thabtah (2019). First, the domain logic behind our analyses is explained and we elaborate on how the analyses can be used. Second, the structure of the publicly available data is assessed and, third, a comprehensive basetable is constructed with the relevant features for our analyses. In total, 33 features are crafted to be included in our analyses. Purposely, the number of features is kept considerably limited, since the best of both the 'data mining' and 'business logic' worlds is pursued. Fourth, the performance of tree-based models on our Formula One analyses is studied. Tree-based models are chosen because of their absence in previous Formula One studies. Furthermore, we study the impact of the class imbalance in the target output, and of its treatment, on the prediction accuracy of our 3 analyses. Fifth, the results are evaluated using the metrics available in a binary classification setting: AUC, accuracy, sensitivity, specificity, lift, and F1 score. Sixth, a connection with the open-source database is made that facilitates online learning. Conclusively, this dissertation can be classified as 'applying statistical learning in Formula One to predict top 3 performance, race completion, and qualifying ability'. The results show that the starting position in the race is the most important feature in the top 3 performance analysis. Yet, we see that the points accumulated in the season, the number of past top 3s, and the number of wins of the constructor are used by the models for predicting top 3s, whereas the starting position is more used to predict non-top 3s. The prediction results of the race completion analysis are poor, meaning that either race completion is hard to predict or we did not include enough relevant features. Qualifying ability is predicted accurately, with the number of past qualifications, the points in the season, and whether the driver has ever started in the top 3 as the most important features.


Table of contents

Confidentiality agreement
Preface
Preamble
Abstract
List of figures
List of tables
List of abbreviations
Part 1: Introduction
  Background – Formula One
  Impact of COVID-19 on Formula One
  Problem situation
  Research questions
Part 2: Literature
  Related sports
  NASCAR
  Formula One
  Contribution to literature
Part 3: Methodology and Research
  Considered existing methodologies
    KDD
    SEMMA
    CRISP-DM
    SRP-CRISP-DM
    ASD-DM
    Overview
  Selecting the methodology
  Research
    Analyses
    Implementing the methodology
  Domain Understanding
  Data Understanding
    Data dictionary
    Entity Relationship Diagram (ERD)
    Data exploration
    Missing values
  Data preparation & feature extraction
    Analysis 1: Top 3 performance
    Analysis 2: Race completion
    Analysis 3: Qualification ability
    Feature description
      Race-related features
      Experience-related features
      Dependent-related features
    Overview of the crafted features
    Feature scaling
    Timeline of features
    Imputing missing values
    Excluded features
  Modeling
    Training and test set
    Validation
    Tree-based models
      Decision trees classifier
      Bagging
      Random forest
      Boosting
        Adaptive boosting
        Gradient boosting
        XGBoost
    Interpretability/flexibility trade-off
    Class imbalance
  Model Evaluation
    Confusion matrix
    Accuracy, sensitivity and specificity
    AUC
    Lift
    F1 score
    Metric behavior in imbalanced classes
    Variable importance
    Individual Conditional Expectation (ICE)
    SHapley Additive exPlanation (SHAP)
  Model deployment
    Idea
    Implementation
  Back on track
Part 4: Discussion
  Results
    Class imbalance
    Performance of tree-based models
    Top 3 finish analysis
    Race completion analysis
    Qualifying ability analysis
    Tuned hyperparameters values
      Decision tree
      Random forest
      Adaptive boosting
      Gradient boosting
      XGBoost
  Conclusion
  Limitations
    Technical perspective
    Sports perspective
  Further research
    Augmenting this study
    Other Formula One study ideas
Reference list
Appendix
  Appendix 1: Literature Tables
  Appendix 2: Extended Data Dictionary
  Appendix 3: All Wet Affected Races
  Appendix 4: Current Formula One teams and their previous team names
  Appendix 5: Build-up code in R
  Appendix 6: Impact on AUC of class imbalance treatment


List of figures

Figure 1. World map containing all 22 Grand Prix host countries
Figure 2. SRP-CRISP-DM methodology applied on this dissertation
Figure 3. Entity Relationship Diagram
Figure 4. Number of races on the different F1 circuits
Figure 5. Races per season throughout the F1 history
Figure 6. Top 10 most occurring status in the 'Results' data set
Figure 7. Proportion of wet affected races
Figure 8. Proportion of hasQ1
Figure 9. Proportion of hasQ2
Figure 10. Proportion of hasQ3
Figure 11. Top 5 occurring driver's nationalities
Figure 12. Constructors per home country
Figure 13. All-time wins per driver
Figure 14. Drivers that won a race once in their career
Figure 15. Time analysis of the availability of features used in the analysis
Figure 16. 5-fold cross-validation
Figure 17. Decision Tree Classifier
Figure 18. Feature space DTC (James et al., 2013)
Figure 19. Bagging procedure
Figure 20. Random Forest procedure
Figure 21. Parallel vs sequential ensemble methods inspired by Xia, Liu, Li and Liu (2017)
Figure 22. Boosting procedure
Figure 23. Tree-based models and the flexibility/interpretability trade-off (James et al., 2013)
Figure 24. ROC curve
Figure 25. SRP-CRISP-DM as fundament for our analyses
Figure 26. All 'model'-'class imbalance' configurations of our analyses
Figure 27. Average rank per performance metric for top 3 analysis
Figure 28. Decision tree top 3 under-sampled
Figure 29. VIP DTC top 3 analysis under-sampled
Figure 30. ICE DTC top 3 analysis
Figure 31. VIP bagging top 3 under-sampled
Figure 32. VIP random forest top 3 under-sampled
Figure 33. VIP adaptive boosting top 3 under-sampled
Figure 34. VIP gradient boosting top 3 under-sampled
Figure 35. VIP XGBoost top 3 under-sampled
Figure 36. ICE XGBoost top 3 under-sampled with grid
Figure 37. ICE XGBoost top 3 under-sampled with pointsSeason
Figure 38. ICE XGBoost top 3 under-sampled with pastTop3
Figure 39. SHAP XGBoost top 3 under-sampled
Figure 40. VIP XGBoost top 3 under-sampled without grid
Figure 41. Average ranking models race completion analysis
Figure 42. Decision tree race completion over-sampled
Figure 43. VIP DTC race completion over-sampled
Figure 44. ICE DTC race completion over-sampled
Figure 45. VIP bagging race completion
Figure 46. VIP random forest race completion over-sampled
Figure 47. ICE plot random forest over-sampled with legacyConstructor
Figure 48. ICE plot random forest over-sampled with pointsSeason
Figure 49. ICE random forest over-sampled with pastHasQ2Prop
Figure 50. VIP adaptive boosting race completion
Figure 51. VIP adaptive boosting race completion for 2005-2019
Figure 52. VIP gradient boosting race completion over-sampled
Figure 53. VIP XGBoost race completion
Figure 54. SHAP XGBoost race completion
Figure 55. Average ranking models for qualifying ability analysis
Figure 56. DTC qualifying ability over-sampled
Figure 57. VIP DTC qualifying ability over-sampled
Figure 58. ICE DTC qualifying ability over-sampled
Figure 59. VIP bagging qualifying ability over-sampled
Figure 60. VIP random forest qualifying ability over-sampled
Figure 61. VIP adaptive boosting qualifying ability over-sampled
Figure 62. VIP gradient boosting qualifying ability over-sampled
Figure 63. ICE plot random forest qualifying ability with pointsSeason
Figure 64. ICE plot random forest qualifying ability with everStart3
Figure 65. ICE plot random forest qualifying ability with rainExp
Figure 66. VIP XGBoost qualifying ability over-sampled
Figure 67. SHAP XGBoost qualifying ability over-sampled
Figure 68. Popular non-verified source referring to Jones (1996) (source 4) for weather data
Figure 69. Source 1 (Autosport Forum) wet affected races
Figure 70. Source 2 (Reddit) wet affected races
Figure 71. Source 2 wet affected race part 2
Figure 72. Source 2 wet affected part 3
Figure 73. Source 3 (Wikipedia) as extra control instrument
Figure 74. Code layout step 1, 2 and 3 SRP-CRISP-DM
Figure 75. Code layout step 4 SRP-CRISP-DM
Figure 76. Code layout step 5 SRP-CRISP-DM
Figure 77. Code layout step 6 SRP-CRISP-DM and feature selection


List of tables

Table 1. Constructors and their drivers for the Formula One 2020 Championship
Table 2. Best performing algorithms per sport
Table 3. Steps of data mining process methodologies
Table 4. Datasets available in the Ergast API and their content
Table 5. Circuits location based on longitude and latitude
Table 6. Dependent variables in our binary classification analyses
Table 7. Categorizing the age
Table 8. Categorizing the years Exp
Table 9. Features used in analysis
Table 10. Learners used in the analyses
Table 11. Class imbalance in training set
Table 12. Confusion matrix
Table 13. Metrics, reasoning regarding class imbalance and reference
Table 14. Class imbalance in the training set of the top 3 performance analysis
Table 15. Class imbalance in the training set of the race completion analysis
Table 16. Class imbalance in the training set qualifying ability
Table 17. Focus of every model
Table 18. No class imbalance treatment 'top 3 analysis'
Table 19. Over-sampling 'top 3'
Table 20. Under-sampling 'top 3'
Table 21. ADASYN 'top 3'
Table 22. No class imbalance treatment 'race completion'
Table 23. Over-sampling 'race completion'
Table 24. Under-sampling 'race completion'
Table 25. ADASYN 'race completion'
Table 26. No class imbalance treatment 'qualifying ability'
Table 27. Over-sampling 'qualifying ability'
Table 28. Under-sampling 'qualifying ability'
Table 29. ADASYN 'qualifying ability'
Table 30. Complexity parameter Decision Tree Classifier
Table 31. Number of predictors Random Forest
Table 32. Minimum node size Random Forest
Table 33. Maximum depth tree Adaptive Boosting
Table 34. Maximum depth tree Gradient Boosting
Table 35. Maximum depth tree XGBoost
Table 36. Nrounds tree XGBoost
Table 37. Literature table of the related sports studies
Table 38. Literature table of the NASCAR studies
Table 39. Literature table of the F1 studies
Table 40. Literature table of the potential methodologies
Table 41. Circuits
Table 42. Constructor results
Table 43. Constructor standings
Table 44. Constructors
Table 45. Driver standings
Table 46. Drivers
Table 47. Lap times
Table 48. Pit stops
Table 49. Qualifying
Table 50. Races
Table 51. Results
Table 52. Seasons
Table 53. Status
Table 54. Weather
Table 55. Country
Table 56. Wet affected races per decade
Table 57. Constructor and their legacy
Table 58. Average AUC for different analysis with different class imbalance approaches


List of abbreviations

ACC: Accuracy
AdaBoost: Adaptive Boosting
ADASYN: Adaptive Synthetic Sampling Method for Imbalanced Data
ANN: Artificial Neural Network
ASD-DM: Agile Software Development for Data Mining
AUC/AUROC: Area Under the Receiver Operating Characteristic Curve
BAG: Bagging
CART: Classification And Regression Trees
CP: Complexity Parameter
CV: Cross-Validation
CRISP-DM: Cross-Industry Standard Process for Data Mining
DTC: Decision Tree Classifier
F1: Formula One
FIA: Fédération Internationale de l'Automobile
FN: False Negative
FP: False Positive
FPR: False Positive Rate
GB: Gradient Boosting
GP: Grand Prix
ICE: Individual Conditional Expectation
KDD: Knowledge Discovery in Databases
LASSO: Least Absolute Shrinkage and Selection Operator
LOOCV: Leave-One-Out Cross-Validation
MICE: Multiple Imputation using Chained Equations
MLP: Multilayer Perceptron
NASCAR: National Association for Stock Car Auto Racing
OLS: Ordinary Least Squares
PDP: Partial Dependence Plot
RBF: Radial Basis Function
RF: Random Forest
RFME: Random Forest with Mixed Effects
SEMMA: Sample, Explore, Modify, Model, and Assess
SEN: Sensitivity
SHAP: SHapley Additive exPlanations
SMOTE: Synthetic Minority Over-sampling Technique
SPC: Specificity
SRP-CRISP-DM: Sport Result Prediction Cross-Industry Standard Process for Data Mining
TN: True Negative
TP: True Positive
TPR: True Positive Rate
VIP: Variable Importance Plot
XGBoost: eXtreme Gradient Boosting


Part 1: Introduction

This dissertation applies principles of statistical learning in Formula One to predict (1) race outcomes, (2) race completion, and (3) qualifying ability. The race outcome is the finish position of a specific driver in a race. Race completion is defined as a driver neither having technical difficulties nor making a human error that halts their race. Qualifying ability indicates whether the driver was able to qualify to start the race. On rare occasions, drivers do not qualify to participate in a race, reducing the chances for their team to achieve a good finishing position. Four parts form the foundation of our study: 'Introduction', 'Literature', 'Methodology & Research', and 'Discussion'. The first part contains an introduction to the Formula One environment. Herein, the background of this sport, the tackled research problems, and the research questions are outlined. The second part of this dissertation encompasses a literature overview of F1, NASCAR and related sports. Related sports can be defined as 'sports with a ranking system that defines the outcome of the race'. Examples of related sports are swimming, cycling, horse races, skiing… The third part explains the methodology that is followed to conduct the research. In a nutshell, the 'Research' section crafts the framework that leads to our results regarding (1) the determinants of top performance, (2) the internal and external drivers of race completion, and (3) the underlying factors for qualifying ability. The last part of this dissertation is the 'Discussion', in which the obtained results are discussed. Next, the conclusion section answers the research question with the insights from the 3 analyses conducted: top 3 performance, race completion and qualifying ability. Furthermore, potential routes for further research are stipulated and the limitations and assumptions of this dissertation are debated. In this dissertation, a multitude of sources is used to construct a comprehensive study for predictions in Formula One. Three sources in particular are highly cited in this dissertation and deserve the acknowledgement of being on the first page:
1. Galar, Fernandez, Barrenechea, Bustince and Herrera (2011) for their contribution towards treating the class imbalance problem of our analyses.
2. James, Witten, Hastie and Tibshirani (2013) for their elaboration on tree-based models that steered the top 3 performance, race completion and qualifying ability predictions.
3. Bunker and Thabtah (2019) for their prediction framework that carried the documentation of our analyses.


Background – Formula One

The purpose of this 'Background' section is to highlight important aspects of Formula One and stress some remarkable events that happened during the rich history of this motorized sport. However, this section does not intend to describe the entire Formula One history. In essence, Formula One is a sport in which high-speed cars race against one another on a closed circuit. However, this one-liner does not fully grasp all the beauty, danger, and passion of this motorized sport. Formula One, the pinnacle of current motorized sports, goes back a long time. Formula One finds its origins in the Le Mans Grand Prix, which was organized by the Automobile Club de France in 1906 (Hughes, 2004). In the 1920s and 1930s, regulation efforts were made to create an annual competition in auto racing. After the Second World War, in 1947, the FIA (Fédération Internationale de l'Automobile) crafted the regulations to hold a World Drivers' Championship, which was called 'Formula One'. However, it took until 1950 to effectuate the first Formula One race. In 1950, the Grands Prix in which the Drivers' Championship was contested were held in Britain, Monaco, Switzerland, Belgium, France, Italy, and at the mythical Indianapolis Motor Speedway (USA) (Bekker & Lotz, 2009; Jenkins & Floyd, 2001). The first Drivers' Championship was won by the Italian Giuseppe Farina, who was driving for Alfa Romeo. In 1978, Formula One was broadcast worldwide for the first time, leading to global awareness for the sport and a drastic increase in team budgets. Nowadays, Formula One is a multibillion-dollar industry with a loyal fan base who can have, according to Rosenberger III and Donahay (2007), up to 3 times more brand loyalty compared to fans of other sports. In 2018, Formula One reached 490.2 million unique viewers (F1., 2019a), which stresses the omnipresence of Formula One in the sports world. This global presence that Formula One is pursuing comes with a price. Every week, every team has to transport their cars, equipment, and staff to another country. For some races, they even have to transport their entire set-up, from staff to equipment, to the other corner of the world, resulting in a lot of technical difficulties. The teams transport their belongings via a combination of three modes of transport: roadways, airways, and waterways (Iyengar, 2017). A fun fact is that every team carries enough spare parts to completely rebuild their cars (Davies, 2014). The usual course of a Formula One weekend is as follows. On Friday, every team gets the opportunity to test their cars on the circuit for 90 minutes to fine-tune the cars and take into account the characteristics of the circuit. On Saturday, the teams and their drivers fight for an as-high-as-possible starting position in the qualification rounds. The driver with the fastest lap during these qualifiers gets the privilege to start at the 'pole position'.


On Sunday, the actual race is played out, resulting in the top 3 drivers popping champagne on the podium. In Formula One, records are broken at a fast pace (F1., 2019b). In 2015, Max Verstappen became the youngest driver to start a race, aged 17 years and 166 days. In the Brazilian Grand Prix of 2019, the Red Bull Racing Team completed the fastest pit stop of all time, 1.82 seconds. Another record-breaker is Lewis Hamilton, winner of six championships, who holds the record for most pole positions, most consecutive race starts and the most races with a single manufacturer. Another record is specific to the 2020 World Championship, since this championship includes 22 Grands Prix, the highest number of races in one season ever. Unfortunately, COVID-19 did throw a spanner in the works, resulting in a cancellation of the first part of the season. The Formula One teams lean on technological innovation in the pursuit of finding the optimal race car in the competitive racing environment. The teams prepare their car for maximum performance in three ways: off-season testing (during the winter), pre-race testing, and adjustments during the race (Noble & Hughes, 2004). On the Formula One website, it is stated that technology originating from the F1 world has 'supercharged' the world, since this technology provides opportunities to change cities, save lives, and reduce emissions (F1., 2019c). Thus, the importance of technology in the field is evident, but it is not the only factor influencing results. A car, regardless of the implemented state-of-the-art technology, can only be as good as the driver that steers it. These drivers are extremely skilled at maneuvering their high-speed multi-million dollar F1 cars. It is due to the expertise of their team that the drivers can drive such high-performance cars. These teams, which are called 'constructors', are very influential in the world of F1, as they give decades of experience and development to the drivers. Consequently, the constructors have their own ranking system, which contains the accumulated points of their drivers. To make the competition between constructors fair, every constructor is allowed to have 2 drivers in the team. Table 1 contains the constructors and their drivers for the Formula One 2020 Championship.

Table 1. Constructors and their drivers for the Formula One 2020 Championship

Team | Drivers
Mercedes | Lewis Hamilton, Valtteri Bottas
Ferrari | Sebastian Vettel, Charles Leclerc
Red Bull | Alex Albon, Max Verstappen
McLaren | Lando Norris, Carlos Sainz
Renault | Daniel Ricciardo, Esteban Ocon
Alpha Tauri1 | Pierre Gasly, Daniil Kvyat
Racing Point | Sergio Perez, Lance Stroll
Alfa Romeo | Kimi Raikkonen, Antonio Giovinazzi
Haas | Romain Grosjean, Kevin Magnussen
Williams | George Russell, Nicholas Latifi

A remarkable statistic is that only 6 drivers in the current Formula One Championship have ever won a race. Hamilton is the undisputed number 1 with 84 wins in his career. Vettel comes second with 53 wins throughout his career. With 21 wins, Raikkonen, who started his first F1 race in 2001, completes the podium. Bottas, Hamilton's teammate at Mercedes, has won 7 races. The youngsters Verstappen and Leclerc show huge potential and have won 8 and 2 Grands Prix respectively. The Formula One Championship has a global allure due to its omnipresence in the sports world. This global presence is noticeable when looking into the locations at which a Grand Prix is organized. In the original 2020 season2, there are 9 races in Europe, 9 in Asia, 2 in North America, 1 in South America and 1 in Australia. All the countries that host a Grand Prix for the 2020 World Championship are shown in Figure 1.

1 Toro Rosso until last season, but the team switched its name for the 2020 season.
2 COVID-19 has canceled or postponed these races, leading to a potential change in this statistic.


Figure 1. World map containing all 22 Grand Prix host countries

Impact of COVID-19 on Formula One

The COVID-19 virus has a serious impact on events all over the world. Consequently, sports events from March until June or even later have been canceled, because safeguarding general human health is of paramount importance. The 2020 season will go down in history as a very special edition with no races in the first half of the season. This is because the FIA rightfully places the health of the drivers above the sport in these exceptional times. In the meantime, Formula One fans can spend their time re-watching videos of previous seasons. Bentley and Murray (2016) state that nostalgia can be found in re-watching such videos. These videos of old races can bring back good memories of glorious wins. Moreover, a nostalgic element was reintroduced in 2019 by awarding one extra point to the driver with the fastest lap in the race. This rule was also used during the first decade of the World Championship (1950-1959).


Problem situation

An F1 Grand Prix (GP) is continuously monitored by all constructors and thus a huge amount of data is produced during a GP. This is also the case for the qualifications, in which the drivers try to set the fastest lap to start at the pole position, which is the first position to start the race. This data can be used for further analysis of the progress of the race and to intervene when irregularities are discovered. The acquired data could also help to build a predictive model for the outcome of an F1 race. To construct such a model, features have to be created which could have predictive power towards the outcome of the GP. The problem is that so much data is created that a model is needed that can sift through the noise, resulting in insights and pattern recognition in an F1 setting. Here is where the title of the dissertation, 'Sifting through the noise', comes into play. This wordplay (Formula One race cars produce up to 140 dB of noise) refers to uncovering the most important features in our analyses. The 3 problems this dissertation tackles and their related analyses are postulated below.

Problem 1: discovering high performance
In the contemporary Formula One environment, 10 teams can participate in each race, each with 2 drivers competing for an as-high-as-possible finish position. Discovering what mattered in the past can lead to making predictions for future F1 race performance. However, there are a lot of internal and external factors to take into account. Internal factors could be past performance or years of experience; external factors could be the weather conditions in a particular race.

Problem 2: a safer environment
The technical flaws and human errors during the race are a less pleasant side of Formula One. Technical flaws comprise engine defects, braking issues, and related problems. Human errors are crashes between 2 (or more) drivers or a crash because of a driver misjudging the racing track. To make Formula One a safer environment, factors that contribute to a higher race completion rate will be studied. The eventual aim of the race completion analysis is to predict whether technical or human dropouts would occur in certain circumstances.

Problem 3: all eggs in one basket
By nature, a Formula One team can ride on 2 horses, since every team is allowed to start with 2 drivers in every race. This seems like a given but, in reality, the drivers have to qualify to participate in the race. If a driver is not able to qualify, the team is handicapped in the sense that it can only participate with 1 driver, a situation called 'putting all eggs in one basket'. If that one driver fails to achieve a noteworthy finish position, the team returns home empty-handed. A qualifying ability analysis can clarify the determinants of these rare occasions of non-participation.


Through solving these problems, this Master's Dissertation could be an aid for several groups of people. First of all, F1 teams can discover the important features for success in Formula One. Secondly, sports analysts can discover what drives success3 in Formula One. Lastly, F1 fans can enlarge their knowledge about Formula One by reading this dissertation. In short, this dissertation aims to solve some current managerial and academic issues in the field of Formula One.

Research questions

The main research question of this dissertation is:

RQ1: ‘What drives predictive accuracy in top 3 performance, race completion, and qualifying ability in Formula One?’

This dissertation is based on these 3 analyses and, thus, for every analysis a related sub-question is posed. The first model, the top 3 analysis, investigates 'top performance', defined as finishing in the top 3, and, hence, the related sub-question is postulated as follows:

‘What factors are most important to predict top 3 performance?’

Race completion, the research topic of our second analysis, concerns factors affecting human errors and technical dropouts. Thus, the related sub-question is as follows:

‘What factors are most important to predict race completion?’

The third analysis scrutinizes the qualifying ability of a specific driver. This analysis might seem trivial from a business perspective but having 2 drivers at the start of a race allows a team to put their eggs in two baskets.

‘What factors are most important to predict racing qualifications?’

3 Success is defined as (1) top 3 performance, (2) race completion, or (3) qualifying ability.


The title of this dissertation, 'Sifting through the noise in Formula One: predictive performance of tree-based models', suggests that tree-based models4 will play a vital role in our study. Thus, an additional research question can be posed:

RQ2: ‘How well do tree-based models perform at making predictions in the field of F1 top 3 performance, race completion, and qualifying ability?’

These tree-based models are approached using the interpretability-flexibility trade-off (James et al., 2013), which states that higher interpretability comes at the cost of lower flexibility and vice versa. Researching this trade-off applied to the tree-based models is one of the main drivers of this dissertation (cf. infra Contribution to literature). With this trade-off in mind, the following sub-question is formulated:

‘How well do interpretable tree-based models perform compared to flexible tree-based models at making predictions in Formula One?’

Class imbalance5 is a problem that occurs in many real-world applications (Galar et al., 2011). Sampling techniques can be used to increase the performance of statistical learning methods. Consequently, the following sub-question is posed:

‘What is the effect of sampling techniques on the predictive performance of the tree-based models?’

4 The tree-based models are decision trees, bagging, random forest, and boosting.

5 Class imbalance refers to the unequal occurrences of different classes for a variable. For example, a binary variable could have 5% 0s and 95% 1s. This variable has highly imbalanced classes.


Part 2: Literature

In this section, the literature6 about Formula One, NASCAR and related sports is discussed. The scope of the literature study is gradually narrowed towards Formula One. Firstly, the focus is on related sports analyses for which outcomes are predicted. Related sports, in this case, are defined as sports where multiple contestants are striving to win the race. Examples of these multiple-entry competitions are horse racing, swimming, hurdle racing, and skiing. Secondly, the scope is narrowed to other motorized sports, with NASCAR, the American counterpart of top-level racing, as the closest to Formula One. The focus will be on studying the already discovered performance drivers in NASCAR. However, NASCAR is only an intermediate step, since the end goal is Formula One. Lastly, the Formula One literature is discussed with the findings, limitations, suggestions and results of the papers. These studies can be classified into three broad categories:
1. Analyses in which drivers of performance are assessed
2. Studies of the technologies and their impact on the dominance of a team
3. Research into the impact of bad weather on technical issues

Related sports

Related sports to Formula One are sports where multiple contestants are striving to win a race. Papers related to performance assessment and predictions in the field of swimming, horse racing, hurdle racing, soccer, and cycling are discussed in this subsection. An overview of these studies can be found in Table 37 (Appendix 1). First of all, Edelmann-Nusser, Hohmann and Henneberg (2002) predict the performance in swimming using neural networks7. Specifically, their study is about predicting the performance of 200m-backstroke swimmers in the finals of the Olympic Games in Sydney. Their prediction is very accurate, with a prediction error of barely 0.05 s. They used the 'type of training exercises' as input to predict the performance of a swimmer using the leave-one-out method and multiple linear regressions. The added value to the literature is the fact that the performance of a swimmer can be predicted accurately using neural networks.

6 All the discussed papers with their content and author(s) can be found in the Appendix. The intention is to create a summary of all the used papers in the literature overview.

7 A neural network can be seen as an interconnected group of nodes with multiple layers that facilitate statistical learning. ANN is based on the neurons in a brain.


Another insight here is that the problem of having a limited amount of data can be solved by pre-training on the data of another swimmer. To conclude, this paper postulates that adapting the training intensity has a beneficial impact on the performance of a swimmer. The link between this paper and Formula One is the fact that drivers also follow a work-out regime to stay in shape. Stuart (2016) states that driver Daniel Ricciardo has an intensive work-out program for his neck, trunk, reaction, agility and cardio to remain in top condition. Another sport of interest in this literature overview is horse racing. The first paper about horse races is written by Lo and Bacon‐Shone (1994). In their study, the authors predict the ordering probabilities in a multiple-entry competition environment. One of the reasons for this research is the fact that bets on horse races mostly involve multiple positions. The connection with F1 is that the races are also a multiple-entry competition. The authors describe 2 proposed models in this study. The Harville model, as proposed by Harville (1973), uses the formula P_ij = P_i P_j / (1 - P_i) to predict the probability that horse i wins and horse j finishes second, where P_i and P_j are the win probabilities of horses i and j. The authors state that the ordering probabilities can be found when an underlying probability distribution for the running time of horses is assumed. The second model, the Henery model, as proposed by Henery (1981), assumes that the running times are independently and normally distributed with unit variance. This study states that computing the probability P_ij under the Henery model results in a non-closed-form solution. The authors perform a logit analysis, resulting in a systematic bias for the Harville model. On the contrary, the Henery model does not have such a systematic bias.
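To make the Harville model concrete, a minimal sketch in R (the language used for the analyses in this dissertation) is given below. The win probabilities are hypothetical and serve purely as an illustration.

# Harville (1973) model: the probability that horse i wins and horse j
# finishes second, given the win probabilities of all horses.
harville <- function(p, i, j) {
  p[i] * p[j] / (1 - p[i])
}

# Hypothetical win probabilities for a four-horse race (summing to 1)
p <- c(0.40, 0.30, 0.20, 0.10)

harville(p, 1, 2)  # 0.40 * 0.30 / (1 - 0.40) = 0.20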


In a second study regarding horse races, Davoodi and Khanteymoori (2010) apply Artificial Neural Networks (ANNs) to predict the outcome of these races. The authors distinguish 3 different types of algorithms for training a network: supervised, unsupervised and reinforcement learning. They apply Time Series Analysis, Regression and Neural Networks in this study. The different supervised learning algorithms for the Neural Networks result in an average accuracy of 77%. Davoodi and Khanteymoori define the learning algorithms as follows:
'1. Supervised learning is an approach where the neural network uses a training set and gets a clearly defined target.
2. Unsupervised learning gives more freedom to the algorithm in order to discover patterns in the data input without getting aid from external sources.
3. Reinforcement learning is learning where neural network components (neurons) are rewarded for good performance and punished for bad performance.' (p. 156)

However, there is a fourth way of training an algorithm in the literature: semi-supervised learning. Zhu (2005) defines semi-supervised learning as using both labeled and unlabeled data to build classifiers. This is done because labeled data is expensive to obtain or time-consuming to retrieve. The third study in the research domain of horse races is from Pudaruth, Medard and Dookhun (2013). The authors predict horse races using a weighted probabilistic approach, with the race track of 'Champ de Mars' in Mauritius as a case study. The authors propose 9 factors that ought to have predictive power for the outcome of horse races. Among the factors in this study are the jockey, the type of horse and the experience of the horse. Adding the scores on these 9 factors leads to a total value; the horse with the highest total score is the predicted winner of the race. Since every horse is included in this analysis, it would be possible to make a ranking system depending on the total score of the 9 factors. The success rate for predicting the winner was 58.33% in this study; on average, the winner was accurately predicted in 4.7 out of every 8 horse races. A success rate of 58.33% might not seem impressive, but it definitely is given that the predictions made by the best tipsters only have a success rate of 44%.
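A minimal sketch of such a weighted scoring approach is shown below; the factors, scores and horses are hypothetical and not those of Pudaruth et al. (2013).

# Each horse gets a score on a number of factors; the horse with the
# highest total score is the predicted winner.
scores <- data.frame(
  horse      = c("A", "B", "C"),
  jockey     = c(8, 6, 7),   # hypothetical factor scores
  horse_type = c(5, 9, 5),
  experience = c(7, 7, 9)
)

scores$total <- rowSums(scores[, c("jockey", "horse_type", "experience")])
scores$horse[which.max(scores$total)]  # predicted winner: "B" (total 22)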


The fourth and last treated paper about horse races is written by McGivney et al. (2019). The authors study the heredity of durability traits for Thoroughbred horses, a special breed for horse racing. In their study, a prediction is made for racecourse starts according to genetic data. The model used in this study is a random forest with mixed effects (RFME). The result is that racehorses with higher genetic potential have fewer non-participated races (27% for low genetic potential and 16% for high genetic potential) and have better race outcomes. Another sport of interest in this literature is cycling in a professional race setting. Karetnikov, Nuijten and Hassani (2019) assess the impact of a specific training scheme on the performance of the cyclist. They build a predictive model that predicts the position of a cyclist given the training scheme the cyclist has followed and the historical data of the cyclist's performance. In this study, 48 attributes are used to predict the MMP (Maximum Mean Power) and finish positions of the cyclists. The prediction models present in this study are Linear Regression, Lasso Regression, LSTM, Decision Tree, Random Forest Regression, CatBoost and XGBoost. In this study, CatBoost and XGBoost have the best predictive accuracy of the investigated models. The best performance is obtained when the models focus on mountain races, since there is more distinction between the mountain skills of professional cyclists: in the Tour de France, only the best 10 cyclists can climb a mountain at a top-level pace. This paper also contains a short rationale about which algorithm appears to be the best for a specific sport. The authors state that LASSO regression yields the best performance in hurdle races, neural networks in swimming and linear regression in walking. Since soccer is the most popular sport worldwide with billions of followers, predictive analyses about this sport are included in this literature overview. These studies give additional inspiration for conducting state-of-the-art sports predictions, which is the intention of this dissertation. Eryarsoy and Delen (2019) develop a predictive model for the outcome of a soccer game (win, draw or loss) and they develop factors with the potential to influence the outcome of the game. The reason this study is discussed is the fact that the outcome can take 3 values; when developing a predictive model, the fact that there are 3 possible outcomes should be treated appropriately. The methodology used is the CRISP-DM (Cross-Industry Standard Process for Data Mining), which is a popular data mining framework. Around 40 different variables are used to predict the outcome of the game. The prediction techniques used in this study are Naive Bayes, Decision Trees, and Ensemble models. The 2 Ensemble methods used are Gradient Boosting Trees and Random Forest. The accuracy was respectively 74% for the 'Win/Loss/Draw' prediction and 86% for the 'Point/NoPoint' prediction. The CRISP-DM methodology is the standard process for data mining research and provides a promising stepwise approach to structure a Master's Dissertation. Another paper concerning soccer is the paper of Ulmer, Fernandez and Peterson (2013). The authors discuss the prediction of soccer matches in the English Premier League using machine learning algorithms. The best performing algorithms in this paper are one-vs-all SGD, linear SVM, Random Forest and SVM with RBF kernel. In this study, draws were particularly hard to predict and it was not an option to weigh draws more, since this was detrimental to the predictive accuracy of the models. Tax and Joustra (2015) present a match-based prediction system for the Dutch Eredivisie. The authors create 3 different models: a public data model, a betting odds model, and a hybrid model. The public data model is best predicted with Naïve Bayes and a Multilayer Perceptron. For the betting odds model, the highest prediction accuracy is obtained with FURIA, and the highest predictive performance for the hybrid model is achieved when combining ReliefF8 and LogitBoost. The last two treated papers in this related sports literature overview are about skiing and hurdle races respectively. Abut, Akay, Daneshvar and Heil (2017) examine the predictive accuracy of artificial neural networks (ANN) for the racing time of cross-country skiers. The authors tackle their research by using 3 popular ANNs9, which all have comparable performance with acceptable error rates.

8 ReliefF is based on the study of Kira and Rendell (1992) and it serves as a practical approach to feature selection.

9 The 3 ANNs are MFANN, GRNN, RBFNN.


This study concludes that ANNs are fit for making predictions in a skiing environment. Przednowek, Iskra and Przednowek (2014) perform a predictive modeling study in the field of 400-meter hurdle races. The authors apply both linear and nonlinear multivariable models to predict the outcome of hurdle races. The linear methods include ordinary least squares (OLS), ridge and LASSO regression. The nonlinear methods consist of neural networks such as the multilayer perceptron (MLP) and the radial basis function (RBF) network. The best model is chosen based on leave-one-out cross-validation (LOOCV). In this study, LASSO shrinkage regression is the best performing method for predicting the outcome of 400-meter hurdle races. The focus of the studies discussed in this section is on predicting the outcome of a race or game in a sport that is related to Formula One. Only a handful of papers use linear methods and, hence, most authors opt for other methods like Neural Networks, Random Forest, or Naive Bayes. Thus, the literature of sports that are related to Formula One suggests that linear methods are insufficient for modeling race rankings. This can also be seen in Table 2, where the best-performing algorithms per sport are shown with their source. Neural networks, boosting algorithms, and LASSO regression yield the best performance in related sports.

Table 2. Best performing algorithms per sport

Sport | Best performing algorithm | Reference
Swimming | Neural network | Maszczyk et al. (2012)
Hurdle races | LASSO regression | Przednowek, Iskra and Przednowek (2014)
Walking | LASSO regression | Wiktorowicz, Przednowek, Lassota and Krzeszowski (2015)
Cycling | CatBoost for the flat stages, XGBoost for the mountain stages | Karetnikov, Nuijten and Hassani (2019)
Horse races | Neural network with backpropagation(*)(**) or decision tree(**) | (*) Davoodi and Khanteymoori (2010); (**) Chen, Rinde, She, Sutjahjo, Sommer and Neely (1994)
Soccer | Neural networks with backpropagation, LogitBoost10 | Hucaljuk and Rakipović (2011)

10 The LogitBoost method yields slightly worse results for soccer prediction, but is still worth noting.
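To illustrate the LASSO-with-LOOCV approach that performed best for the hurdle races of Przednowek, Iskra and Przednowek (2014), a minimal sketch in R follows. The data are simulated stand-ins for training features and race times; the glmnet package is assumed to be installed.

# LASSO regression tuned with leave-one-out cross-validation.
library(glmnet)

set.seed(1)
n <- 50
x <- matrix(rnorm(n * 10), ncol = 10)  # 10 hypothetical predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(n)    # simulated outcome, e.g. race time

# alpha = 1 selects the LASSO penalty; nfolds = n gives LOOCV
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
coef(cv_fit, s = "lambda.min")         # coefficients at the selected penalty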


NASCAR

Another discipline in motorized racing sports is NASCAR (National Association for Stock Car Auto Racing). Since this thesis is about Formula One, an overview of the NASCAR literature can give new insights into the prediction of racing activities. A potential route for this thesis could be comparing F1 with NASCAR and depicting the points of resemblance as well as the points of difference between these motorized sports. In the literature overview about NASCAR, a multitude of papers containing a predictive analysis is discussed. This section ends with a comparative analysis between F1 and NASCAR. An overview of the NASCAR studies can be found in Table 38 (Appendix 1). Probability models are constructed in a professional race setting by Graves, Reese and Fitzgerald (2003). The authors analyze the racing results of NASCAR using a Bayesian hierarchical framework as a model. This study compiles a probability model for finishing positions in NASCAR. The fundamental models in this analysis are the model of Luce (1959) and a modification of the model of Stern (1990). The authors assess the track-independent driver abilities and they predict future starts according to the rate of improvement of the driver's skills. The most important finding in this analysis is the considerable influence of the team's ability on the driver's ability. Moreover, there is evidence of the existence of track specialists; some drivers perform much better on specific tracks. The reliability of potential predictors for the outcome in NASCAR is assessed by Pfitzner and Rishel (2005). The authors acknowledge the importance of variables like the driver's skill and pit crew performance. However, some factors are not under the control of the constructor or driver, such as the weather, behavior of other drivers… Since there is a multitude of variables that are very hard to take into account during the progress of the race, the authors propose to only take into account variables that are known before the race starts. In their model, car speed, driver skills, and the related result are taken into account. Another important factor in their analysis is the presence of multi-car teams. Multi-car teams have more driver/car combinations available, have a steeper learning curve and, hence, a higher marginal return. Moreover, multi-car teams are more likely to acquire bigger sponsorship deals and thus more money to invest in their crew, technology, or driver(s). Other substantial advantages for multi-car teams are economies of scale and the availability of more data, since multi-car teams have more participants. This analysis was performed for 14 races in the 2003 NASCAR season. The driver's skill, measured via 'laps completed' and 'points accumulated', is a very important variable to determine the performance in NASCAR.


However, team-related variables exercise an influence on the result. There is a positive correlation between team size and better positions, and switching teams is correlated with worse positions. Allender (2008) studies the influence of driver experience in predicting the outcome of NASCAR races. This study performs a regression on the data from 38 races of the 2002 NASCAR season. The dependent variable is the outcome of the race and the independent variables are the starting position of each driver, the track length, the % of laps under caution (the number of laps in which car positions are frozen because of accidents), and the driver experience expressed in years. In the author's model, the starting position and the driver experience expressed in years are the significant variables. Next, interaction effects are included in the analysis to assess the significance of these effects. There are 2 significant interaction effects in this study. First, the more experienced the driver and the higher the starting position, the better the outcome of the driver will be. Secondly, the driver experience and the track length interact with one another and have a positive impact on the driver's outcome. This is because, as their careers progress, drivers gain experience and are able to deal with different track lengths more smoothly. Tulabandhula and Rudin (2014) discuss the challenges for predictive analytics in a professional racing environment. In their paper, the authors contribute to the literature by creating a real-time model that supports team captains in making vital tire-change decisions. In their in-race analysis, the entire knowledge process cycle is considered. The knowledge process in this study is composed of an exploratory analysis, feature engineering, modeling, data mining and decision making. The authors ask 3 research questions in their real-time decision-making analysis. The first research question asks whether it is possible to determine the further outcome of the race using the driver's racing history. Secondly, it is questioned whether re-fueling and tire-changing decisions can be optimized according to the predicted performance of the driver for the remainder of the race. Lastly, this study investigates whether a driver's past behavior holds valuable insights for the future. This paper addresses the complexity of racing, since tire decisions are crucial and drivers influence one another (the neighborhood effect). The following hypotheses are postulated in this research:

1. The momentum in the ranking, whether a racer is climbing in rank or not, has predictive power.
2. The neighborhood effect, which is the fact that drivers influence one another, is a predictive factor.
3. Aggregation can be done over races and thus performance is not only based upon one Grand Prix.


Two baseline models are created as a benchmark for predictive accuracy. The first baseline model uses the starting rank to predict the finish rank of the driver. The second baseline model predicts the finish rank with the average rank as input. These baseline models are compared with more advanced models, which are ridge regression, support vector regression (SVR), LASSO (least absolute shrinkage and selection operator), and random forests. The predictive accuracy measures are R2, RMSE (Root Mean Squared Error), and sign accuracy. The baseline models perform well on RMSE, since the starting rank, average rank, and finish rank of a racer are in almost all cases close to one another. However, these baseline models have no additional value for in-race decisions since they are static: the starting rank of a racer does not change during the race and hence no predictive power is found in such a measure. This translates into a slightly negative R2 and lower sign accuracy for both baseline models in this study. All advanced models perform better on R2 and sign accuracy than the baseline models. However, the model using the random forest algorithm has a lower R2 than the other advanced models.
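A minimal sketch of these two evaluation measures in R is given below. RMSE is standard; the definition of sign accuracy is one plausible reading (whether the predicted direction of rank change, improvement versus decline relative to the starting rank, matches the actual one), and all ranks are hypothetical.

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

sign_accuracy <- function(actual, predicted, start) {
  # proportion of racers for whom the predicted direction of rank change
  # (relative to the starting rank) matches the actual direction
  mean(sign(start - predicted) == sign(start - actual))
}

start_rank  <- c(1, 2, 3, 4, 5, 6)
finish_rank <- c(2, 1, 3, 6, 4, 5)
pred_rank   <- c(1, 1, 4, 5, 5, 4)  # hypothetical model predictions

rmse(finish_rank, pred_rank)
sign_accuracy(finish_rank, pred_rank, start_rank)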


F1 and NASCAR are both motorized sports, so it is useful to outline the similarities and the differences between them. Such a comparative analysis has been made by Silva and Silva (2010). In their study on 2009 data for both sports, the relationship between past success, defined as performance in the practice round, the qualifiers, and previous races, and the finish positions is investigated. The authors discuss the differences between Formula One and NASCAR. For example, NASCAR races start with over 2 times more drivers than Formula One races. Another point of difference is the number of races in one season: NASCAR has around twice as many races per season as Formula One. However, there are similarities between these 2 motorized sports. Both sports award points according to the finish position of the race; these points are accumulated over the season, and the racer with the most points at the end of the season wins the championship. In addition, both sports give the drivers the opportunity to practice on the circuit, and in both sports there are qualification laps that determine the starting positions for the race. In this research, 4 variables were created: 'Qualifying' to measure the qualification results, 'Practice' to quantify the practice times, 'Points' to account for past performance, and 'Result' to determine the finish position. In this paper, the variable 'Points' has the highest predictive power for NASCAR finish positions and the variable 'Qualifying' is best at predicting F1 results. There is a significant relationship between 'Practice' and 'Result' for NASCAR, but not for F1. The authors note the opportunity to assess the importance of the weather; unfortunately, only two F1 races took place in the rain and NASCAR races are canceled when it rains. In the analysis for NASCAR, the model is good at drawing relationships between the finish position and the explanatory variables. However, when the analysis is performed for the 'Top 20 racers', the model is not able to make a good prediction of the finish position for NASCAR. In NASCAR literature, most studies rely on linear regression as the method to analyze the outcome of NASCAR races, although some papers deploy more advanced models, e.g. random forests, as in the study of Tulabandhula and Rudin (2014). This dominance of linear analyses contrasts with the more advanced methods used in the related sports literature. Given this absence of more advanced methods, additional research could assess the performance of, for example, Neural Networks in NASCAR.


Formula One

There are a handful of studies with a specific focus on performance in Formula One. An overview of these studies can be found in Table 39 (Appendix 1). Starting the literature overview, two studies about the best Formula One driver of all time are discussed. Eichenberger and Stadelmann (2009) research the best Formula One driver of all time using data covering 57 years, from 1950 until 2006, of Formula One racing. The authors state that the performance of a driver is dependent on both talent and the quality of the car. Drivers will not finish a race when a human or technical dropout occurs. That is why the authors distinguish between human dropouts, like collisions, accidents and disqualification, and technical dropouts, like tire problems, engine failures… Furthermore, they only consider drivers that participated in more than 40 races. The search for the best Formula One driver is done via linear regression in this paper. The top 3 drivers in their linear regression analysis are Juan Manuel Fangio (1950-1958), Jim Clark (1960-1968), and Michael Schumacher (1991-2006 and 2010-2012). To establish a robust study, the authors included 2 control variables: the classification of the team partner driver and the home advantage. Home advantage applies when the GP is held in the country where the driver was born. Bell, Smith, Sabel and Jones (2016) research the formula for success in Formula One through multilevel modeling of driver and constructor performance. In their analysis, Formula One data from 1950-2014 is used. Firstly, the authors assess the best F1 driver of all time taking into account the designated team. According to their research, the best F1 driver of all time is Juan Manuel Fangio; Michael Schumacher comes in at an honorable eighth place since most of his races were won in a high-performing car. Secondly, this study researches the importance of both teams and drivers. Bell et al. (2016) conclude that team effects outweigh driver effects, since team contribution accounts for 86% of the variation in driver results. Lastly, this paper quantifies the change of driver and team effects over time under altered racing conditions. The study states that team effects appear to be steady over time, and the authors conclude there is a 'legacy effect'. This effect refers to the accumulated experience that teams acquire throughout the years. However, there is some disagreement in the literature regarding the contribution of the team and the driver to Formula One performance. In the study of Spurgeon (2009), Nico Rosberg proclaimed that the driver contributes 20% and, hence, the team 80% to the overall performance. On the contrary, Allen (2000) claimed the opposite, pointing to the unfortunate Michael Schumacher11 as an example where driver contribution overshadows team contribution.

11 Michael Schumacher had a terrible skiing accident in 2013 and is still recovering.


Technology plays a major role in the current Formula One landscape. The three following studies focus on technology and its relationship with competitive advantage. Judde, Booth and Brooks (2013) perform an analysis of competitive balance in Formula One racing using data from 1950-2010. The authors argue that changes in regulation have an impact on the uncertainty of the championship. However, they state that regulations do not significantly impact race uncertainty or long-term dominance within Formula One. Jenkins and Floyd (2001) study the trajectories of technology in a Formula One setting. They look into 3 time periods in which one technology defined the rules of the race. From 1967 until 1973, the Ford DFV engine was the winning factor. Next, Ferrari's 'Flat-12' engine gave Ferrari a competitive advantage from 1974 until 1977. The aerodynamics were redefined when Williams launched the ground-effect design, leading to dominance from 1978 until 1982. The authors stress the relevance of technological transparency and its impact on a firm. When transparency is low, firms compete to construct the dominant technology that results in a competitive advantage. In an environment where transparency is high, a dominant design is accepted by the system in which the firms operate, leading to the generation of complementary technological applications. Lastly, this paper states that every technological trajectory has its level of power, momentum and uncertainty. Jenkins (2010) investigates the relationship between technological discontinuities and competitive advantage applied to Formula One racing using data spanning 57 years (1950-2006). This study concludes that some teams are on some occasions unable to adapt to exogenous shocks such as a technological discontinuity. Vice versa, some teams can deal with a discontinuity in an outstanding way, leading to a competitive advantage. Moreover, a small number of teams are capable of maintaining this competitive advantage throughout successive technological discontinuities. In the study, two different capabilities are defined. The first one is a dynamic capability that allows teams to cope with technological discontinuities, leading to a competitive advantage. Secondly, a sustaining capability permits a team to keep its competitive edge over another team when a new discontinuity is successfully handled. The pursued business model in Formula One varies amongst the constructors. Aversa, Furnari and Haefliger (2015) study the relationship between the business model and racing performance. The authors define 4 different business models in F1 which propagate via resources and capabilities into racing performance. Their business models are the following:

1. Internal Knowledge Transfer
2. External Knowledge Transfer
3. Formula 1 Supply
4. Talent


The first model, 'Internal Knowledge Transfer', points to collaboration between automotive manufacturers and F1 teams. The second model, 'External Knowledge Transfer', indicates the sale of F1 technology to other industries. The third model, 'Formula 1 Supply', is about F1 teams supplying other F1 teams. The fourth and last business model, 'Talent', represents the investments in existing and future talent through the scouting system. In this study, the resources for an F1 team are gathered from 3 sources: financial, knowledge and human resources. There are 3 types of capabilities in their model: tech development, technical skills and driving skills. This research highlights the business models of 2 teams: Red Bull Racing and Williams. Red Bull Racing puts a lot of effort into talent and aims to discover talent as soon as possible; contracting racers at a younger age leads to cost savings which can be spent on technology, data analysis... Williams has focused on delivering solutions to a broad range of industries, known as the 'External Knowledge Transfer' business model, through the development of F1 technologies. This business approach had a negative impact on the overall F1 results of Williams. The authors conclude that the indicators for high performance are the 'Formula 1 Supply' and the 'Talent' business models. Hence, relationships between Formula One teams and investing in talent are 2 important factors to excel as an F1 team. The following study is hypothetical and looks into the decisive factors for holding an F1 race in a specific city. In their study, Büyükyazıcı and Sucu (2003) select the most appropriate city to hold a Formula One race. There was a choice between 3 Turkish cities (Antalya, Izmir and Istanbul), and 3 criteria were used to make a suitable decision: 'the adequacy of hotels', 'the availability of fully equipped hospitals' and 'the renown of the city'. In this study, 2 models to deal with such complex problems are discussed: the 'Analytic Hierarchy Process' (AHP) and the 'Analytic Network Process' (ANP). The authors applied the latter model to the complex problem of determining the appropriate city to hold an F1 race. The result of this study was Istanbul, which scored best on the criteria 'availability of fully equipped hospitals' and 'renown of the city'. Marino, Aversa, Mesquita and Anand (2015) devote attention to both the drivers of performance in Formula One and the impact of regulations on the Formula One landscape. Particular focus is devoted to a changing environment in an F1 setting. Concretely, they study what the impact of new regulations is on the outcome of F1 races. They research the constructors' capability to deal with architectural redesign and how well constructors can deal with time-based limitations in the rapidly changing F1 environment. In this study, both a quantitative and a qualitative analysis are performed to detect the factors driving performance in a changing


environment. In the quantitative research, the authors design the following measures: performance, change in technical regulations, the extent of firm exploration, and 3 control variables (change of drivers, change of engineers, and adaptation experience). The method used to regress this model was the generalized method of moments (GMM). In their qualitative study, the authors studied the impact of the introduction of KERS (Kinetic Energy Recovery System), a new energy-efficient technology, in the 2009 Formula One season. Sebastian Vettel, a 4-time Formula One World Champion, called this radical change in technology 'the biggest change in the history of F1'. After years of domination by big players like Ferrari, new players like Brawn GP, the current Mercedes team, started to win races since they dealt extremely well with this new technology. This study highlights the importance of the constructor's ability to handle radical changes in an appropriate way. The weather is an uncontrollable factor that has an impact on the progress of a Grand Prix, and the next study investigates the impact of changing weather conditions in Formula One. Almost all races take place in good weather conditions; only a minority are held in bad weather. On rare occasions, the weather conditions change during the race, and this leads to the purpose of the next study. Rosso and Rosso (2016) apply quantile regression12 to study the relationship between the weather, tire types and race stints in F1 using data from a race in the 2016 season. This race was chosen because of the drastic change in weather conditions: the qualifications were held in warm and dry weather while the race took place in rainy conditions. The authors note the superior performance of ultra-soft tires as the track was drying while the race proceeded. According to the authors, their analysis could be refined in 3 different ways: firstly, by taking tire degradation into account; secondly, by including the driver's skills through using the free practice times as input; lastly, by running a Monte Carlo simulation to determine the optimal number of pit stops in a race. The following study takes a predictive modeling approach using Artificial Neural Networks (ANNs). Stoppels (2017) applies ANNs to predict F1 racing outcomes. In this thesis, the first 17 races of the 2016 Formula One season are used to predict the last 4 races of that season. After an extensive explanation of ANNs, the thesis applies them to an F1 environment as experimental research. The predictions are made for 4 different racers: Lewis

12 Quantile regression can be used when the assumption of a standard regression model about independent and identically distributed (i.i.d.) residuals is violated. For example, Dimelis and Louri (2002) use this method to analyze the production efficiency gains in terms of technology transfer and labor productivity changes caused by diverse degrees of foreign ownership.


Hamilton (Mercedes), Max Verstappen (Toro Rosso/Red Bull Racing), Felipe Massa (Williams) and Jenson Button (McLaren). The team switch of Max Verstappen introduces a new insight since racers are able to switch constructors during the season. The author postulates 8 features that facilitate the predictions in this F1 setting. The following features were included in the analysis:

1. Circuit length
2. Number of laps
3. Weather
4. Start grid
5. Recent form racer
6. Recent form others
7. Best qualification
8. Race results (dependent variable)

The author compares the ANN with simple prediction methods, such as using the current form to predict the race outcome, and with the multiclass logistic regression method. The ANN has a predictive accuracy of 78% on the training data and 75% on the validation data. The author concludes that the ANN algorithm performs better at predicting Formula One results than the multiclass logistic regression method. Enlarging the dataset to 42 races results in predictive accuracies of 77% and 69% for the training and validation data set respectively. These predictive accuracies are lower than when using only 21 races, which the author ascribes to possible overtraining. In this case as well, the ANN model predicts the outcomes better than the simple models and the multiclass logistic regression model. Unfortunately, these claims are not supported by any plots, making it hard to follow this analysis. On the other hand, the author's focus is on the costs of running Neural Networks and on the learning rate of this algorithm.


Contribution to literature

In Formula One literature, most research focuses on linear regression or similar methods like quantile regression. Nevertheless, Stoppels (2017) performs an Artificial Neural Network analysis to predict outcomes in F1 racing. After assessing the different studies in Formula One and the methods they use to predict the outcome of F1 races, this dissertation will focus on tree-based methods in a Formula One setting, since, to our knowledge, no tree-based studies have been carried out in the field of Formula One. The tree-based models used in this dissertation are decision trees, bagging, random forest, adaptive boosting, gradient boosted trees and extreme gradient boosted trees (XGBoost). Notice that these methods are listed in increasing order of flexibility. As a result, this dissertation will thoroughly investigate the interpretability/flexibility trade-off postulated by James, Witten, Hastie and Tibshirani (2013, pp. 24-25). We expect a more flexible model to yield the highest predictive performance since it can generate a wide range of possible shapes to estimate the relationship between the features and the target value in our analyses. However, such a flexible model could capture too much noise in the training set and, consequently, could perform badly at making predictions on the test set. This phenomenon is called 'overfitting' and is related to the bias/variance trade-off as defined by the same authors: flexible models have higher variance and lower bias, whereas interpretable models have lower variance and higher bias. In this dissertation, three different aspects of Formula One will be modeled. Modeling performance alone would only tackle part of the problem (cf. supra Problem situation) and, hence, race completion and qualifying ability are modeled as well. These models could contribute to the literature by identifying high-performance drivers, creating a safer environment through discovering patterns of non-completion, and assessing qualifying abilities to make sure teams can start with 2 drivers. The research questions section provides a good overview of what the contribution of this dissertation could be. In essence, this dissertation is an addition to Formula One literature for the following 4 reasons:

1. The performance of tree-based models to predict performance, race completion, and qualifying ability in Formula One.
2. The trade-off between interpretable models (decision trees) and more flexible models (bagging and boosting) in a sports environment.
3. The effect of class imbalance and hyperparameter tuning on these tree-based models in an F1 setting.
4. A comprehensive study of performance, race completion, and qualifying ability from both a data mining and a business logic perspective.
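To make the interpretability/flexibility contrast concrete, the toy R sketch below fits both ends of the spectrum on simulated data; the data and feature names are hypothetical and only serve to illustrate the trade-off.

```r
library(rpart)           # single decision tree: the interpretable end
library(randomForest)    # bagged ensemble: the flexible end

# Simulated toy data: the chance of a top 3 finish decays with grid position.
set.seed(1)
toy <- data.frame(grid = sample(1:20, 500, replace = TRUE))
toy$top3 <- factor(rbinom(500, 1, plogis(2 - 0.4 * toy$grid)))

tree   <- rpart(top3 ~ grid, data = toy, method = "class")
forest <- randomForest(top3 ~ grid, data = toy, ntree = 200)

print(tree)    # human-readable split rules: interpretability
print(forest)  # only an aggregate OOB error: flexibility at the cost of insight
```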


Part 3: Methodology and Research

Considered existing methodologies

This subsection is dedicated to finding the appropriate methodology for this Master's Dissertation. Alnoukari and El Sheikh (2012) define all the Knowledge Discovery Process (KDP) models in their book. In this study, the focus will be on 5 modeling techniques. First of all, there are 2 cornerstone modeling techniques: KDD, the initial approach, and CRISP-DM, the centralized approach (do Nascimento and de Oliveira, 2012). Three other methodologies are also considered in this section: SEMMA, SRP-CRISP-DM, and ASD-DM. The content of these 5 methodologies can be found in Table 40 (Appendix 1).

KDD

First, the KDD (Knowledge Discovery in Databases) methodology, which is the initial approach to modeling in data mining, is discussed. Fayyad, Piatetsky-Shapiro and Smyth (1996) introduce the KDD methodology as a useful framework for pattern discovery in a dataset. The authors explain the 9 steps in the KDD process, which are listed below.

' 1. Understanding the application domain
2. Create a target dataset
3. Clean the data and pre-processing
4. Data reduction and transformation
5. Choosing the data-mining task.
6. Choosing the data-mining algorithm(s).
7. Data mining
8. Evaluating the output of Step 7
9. Incorporating this knowledge into the performance system ' (p.23)

Fayyad, Haussler and Stolorz (1996) discuss the main issues in exploring a dataset. Moreover, the authors apply the KDD methodology in 5 different fields. According to this study, the issues in data analysis are the lack of domain knowledge of the researcher, the inability to scale algorithms efficiently and the inability to craft valuable features. The KDD methodology was applied in fields where data collection abilities have reached the highest level. KDD helped scientists analyze huge amounts of data in fields like atmospheric science, geophysics and molecular biology.


SEMMA

Another promising methodology is SEMMA, a technique developed by SAS13 as a development tool for data mining. SEMMA (Fernandez, 2003) is a process that enables a smooth data mining workflow by executing the following steps:

' 1. Sample
2. Explore
3. Modify
4. Model
5. Assess ' (p.10)

This SEMMA procedure is carried out in 5 different steps, of which the first letters combined form the word SEMMA. First, sampling the data is needed to get an understanding of the dataset. Next, trends and anomalies are investigated to obtain further insight. Then, the variables are modified and new variables are created. This is followed by building a model using software that searches for variables with predictive power for the desired outcome. Lastly, the accuracy, usefulness and reliability of the findings are assessed. This methodology facilitates pattern discovery in the data mining analysis and allows analysts to focus on data visualization.

CRISP-DM

The third methodology is CRISP-DM (CRoss-Industry Standard Process for Data Mining), which was developed because of the lack of an industry standard for data mining. Wirth and Hipp (2000) propose CRISP-DM as a process model for executing data mining projects. The authors state that a standardized approach will have value for analysts, vendors and customers. Moreover, this methodology can be used as a reference point for the market to assess the potential of data mining projects. In this study, a generic CRISP-DM model is postulated in the following steps.

' 1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment ' (p. 5-7)

13 SAS is an American company providing analytics software and solutions.


The authors conclude that pursuing this procedure might take a considerable amount of time for documentation. However, they state that this procedure is worth the effort since appropriate documentation leads to a quicker response to flaws or irregularities in the model. CRISP-DM is a useful methodology in a complex environment where changes occur at a fast rate.

SRP-CRISP-DM

The Sports Results Prediction CRISP-DM framework is proposed by Bunker and Thabtah (2019). This methodology is based on the CRISP-DM framework with adaptations to a sports environment. The 'Business understanding' step of the CRISP-DM framework is renamed to 'Domain understanding' since a sports domain is studied instead of a specific business. The 'Data preparation' step from CRISP-DM is renamed to 'Data preparation & feature engineering' since additional attention is given to the division of features into subsets based on similarity. The authors advise, in the modeling phase, to split the data into a training and a test set to fit the models and objectively test their prediction accuracy. The 'Model deployment' step comprises the automation of data scraping from the web, adding this data to the database, adjusting the training and test set, retraining the models and predicting new sports results. The steps of the SRP-CRISP-DM framework are:

' 1. Domain understanding
2. Data understanding
3. Data preparation & feature engineering
4. Modeling
5. Model evaluation
6. Model deployment ' (p. 30-32)

ASD-DM

Lastly, ASD-DM (Adaptive Software Development Data Mining) is proposed by Alnoukari, Alzoabi and Hanna (2008) to conduct a predictive data mining analysis. In their paper, the three different steps of their framework are outlined below:

' 1. Speculation: business understanding, data understanding and data preparation
2. Modeling
3. Learning: implementation/testing, evaluation ' (p. 3)


Overview

Table 3 contains the five postulated methodologies with their designated steps. We notice in the table the lack of a business logic and deployment phase in the SEMMA approach.

Table 3. Steps of data mining process methodologies

KDD | CRISP-DM | SEMMA | SRP-CRISP-DM | ASD-DM
Developing an Understanding of the Application | Business Understanding | / | Domain understanding | Speculation
Creating a Target Data Set | Data Understanding | Sample | Data understanding | Speculation
Data Cleaning and Pre-processing | Data Preparation | Explore | Data preparation & feature engineering | Speculation
Data Transformation | Data Preparation | Modify | Data preparation & feature engineering | Speculation
Choosing the suitable Data Mining Task | Modeling | Model | Modeling | Modeling
Choosing the suitable Data Mining Algorithm | Modeling | Model | Modeling | Modeling
Employing Data Mining Algorithm | Modeling | Model | Modeling | Modeling
Interpreting Mined Patterns | Evaluation | Assessment | Model evaluation | Learning
Using Discovered Knowledge | Deployment | / | Model deployment | Learning


Selecting the methodology

The methodology followed in this dissertation is chosen from the 5 options listed in the 'Considered existing methodologies' section: KDD, SEMMA, CRISP-DM, SRP-CRISP-DM, and ASD-DM. Azevedo and Santos (2008) describe CRISP-DM as an implementation of the KDD methodology resulting in a better fit with real-time systems; thus, CRISP-DM is preferred over KDD as the methodology for this study. The SEMMA technique is another potential methodology for this dissertation. Unfortunately, this technique misses a crucial part covering the underlying business logic and the deployment of the analysis. Hence, CRISP-DM is chosen over SEMMA as well. Next, CRISP-DM is compared to SRP-CRISP-DM as the approach for our research. These 2 methodologies are similar in the respect that they both comprise 6 steps to tackle a data mining study. However, the SRP-CRISP-DM methodology is modified to a sports environment for the following 3 reasons (cf. supra Methodology):

1. Bunker and Thabtah (2019), the authors of SRP-CRISP-DM, name the first step 'Domain understanding' since the sports analyses concern a certain domain and not a business.
2. Extra focus is devoted to creating feature subsets, e.g. race-related features and external features.
3. The deployment phase is specifically focused on open-source data, which is not always the case for businesses.

Consequently, the SRP-CRISP-DM methodology seems more fit for this sports analytics dissertation. The last methodology in the literature study is the ASD-DM methodology, a modification of CRISP-DM to facilitate agile software development. Since this study is not focused on agile software development, we will not use this modification and, hence, opt for SRP-CRISP-DM as the methodology for this dissertation. In Figure 2, the SRP-CRISP-DM methodology is applied to this Formula One research, in which it will act as a compass guiding the research.


Figure 2. SRP-CRISP-DM methodology applied on this dissertation

Research

Analyses

In our research, the SRP-CRISP-DM methodology of Bunker and Thabtah (2019) will be used to perform 3 binary classification analyses. The first analysis tries to discover high performance in a Formula One setting; concretely, high performance is defined in this study as finishing in the top 3. The second analysis investigates what drives the predictive performance of completing a Formula One race. The third analysis studies the determinants of the qualifying ability for an F1 race. These analyses tackle the 3 problems mentioned before (cf. supra Introduction: Problem situation).

Implementing the methodology

The research will apply the SRP-CRISP-DM methodology to our 3 classification analyses. SRP-CRISP-DM is a methodology that requires a lot of documentation, and it is carried out in a highly iterative manner. The research will walk through the different steps of SRP-CRISP-DM as follows. First, the problems and their associated analyses are discussed from a


domain logic perspective in the 'Domain Understanding'. Second, the 'Data Understanding' part explores the open-source Formula One datasets through crafting a data dictionary, an Entity Relationship Diagram (ERD), a data exploration, and a missing value analysis. Third, the crafted features for our 3 analyses are described in the 'Data Preparation & Feature Engineering'. Furthermore, the missing values will be imputed and the features will be divided into three big categories. Fourth, the 'Modeling' stage consists of an explanation of the different tree-based models. Special attention is given here to the interpretability/flexibility trade-off (James et al., 2013). Decision trees will be considered a proxy for interpretable models due to their easily understood decision-making logic. Bagging and random forest will be considered models with increased complexity and reduced interpretability. Boosting will be regarded in this study as the most complex of the applied models. Moreover, a rationale about class imbalance treatment and its synergy with tree-based models (Galar et al., 2011) can be found in this section. Fifth, the performance of the models is analyzed in the 'Evaluation' section by using appropriate metrics for our binary classification analyses: confusion matrix, sensitivity, specificity, AUC, lift, and F1. Sixth, the models are deployed via establishing a connection with the open-source datasets, which enables us to take newer data into account and alter the predictions accordingly.

Domain Understanding

A first acquaintance with Formula One is made in the subsection '1.1 Background - Formula One'. The discussed papers are found on either 'Google Scholar' or 'Web of Science' using the access rights of a student of the University of Ghent. For the interested reader, all the latest F1 updates can be found on https://www.formula1.com/. Moreover, there is also a documentary about F1 titled 'Formula 1: Drive to Survive'. Another source is Nico Rosberg, a former Formula One driver and F1 World Champion in 2016, who has a YouTube channel in which he analyses different aspects of Formula One. Examples of his analyses are the influence of the new rules on racing outcomes and the fastest driver of the moment.

In the literature, the experience of the driver, the starting position, and past performance (Stoppels, 2017; Eichenberger & Stadelmann, 2009) were mentioned as important factors for determining the finish position in a Formula One race. The home advantage of drivers has been researched in the past, as well as the impact of rain on driver performance. However, there is a lack of an overarching study that takes all these factors into account in one analysis of driver performance. Good


performance is defined in our first analysis as finishing in the top 3 since these drivers are invited on stage after the race to celebrate their ‘top 3’ position.

‘ How will the top 3 finish model be used? ‘

The ‘top 3’ model will be used to discover the determinants of high-performance in Formula One. For example, we expect that good past performance is an important feature to finish in the top 3. Also, the starting position is expected to be important since overtaking is rather difficult in Formula One. After assessing the determinants of driver performance, the focus will be on completing the race. Reasons for not being able to finish a race are either human dropouts or technical dropouts (Eichenberger and Stadelmann, 2009).

‘ How will the race completion model be used? ‘

The race completion model will help to make Formula One a safer environment by finding patterns that signal higher race completion. For example, we expect that drivers with better qualification results are less prone to crashing. We hope that our model will aid teams in understanding what drives technical or, worse, human dropouts. Next, the drivers of qualifying for a race are studied. Even though it seems self-evident, drivers are not always able to qualify for the race. Reasons for non-qualification can be withdrawing from a race, the 107% rule… This 107% rule states that drivers who are not able to set a qualifying lap time within 107% of the fastest lap time are not allowed to start the race. For example, if the fastest qualifying lap takes 90 seconds, a driver needs a lap under 96.3 seconds (90 x 1.07) to qualify.

‘ How will the qualification model be used? ‘

Formula One is a multi-billion dollar industry with a large amount of prize money tied to the finishing position of the driver. The constructors can use this money to further consolidate their dominance or to improve their cars and become the dominant players. The teams need to maximize their chances of obtaining the highest possible result. One way to do this is to make sure both drivers are present at the start of the race. The qualification model will help teams understand how they can further improve their chances of making sure that the qualification of both of their drivers is beyond dispute.


Data Understanding

The Formula One data from 1950 until 201914 was retrieved from the open-source Ergast API. The data was already structured in different datasets that are described below. Furthermore, the data was augmented with weather data15 to include the impact of wet racing conditions. Also, a dataset that converts the country into the related nationality was added to include the potential home advantage for the driver. A remark here is that the exceptional conditions of 2020, COVID-19, do not influence the data considered in our analyses since the scope is defined between 1950 and 2019.
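As a minimal sketch of this retrieval step in R, the snippet below downloads the zipped CSV dump and reads the main tables; the exact archive URL is an assumption, the Ergast website documents the actual link (see footnote 14).

```r
# Download the zipped CSV dump of the Ergast database (URL is an assumption).
url <- "http://ergast.com/downloads/f1db_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "f1db")

# Read the tables used in the analyses below.
results  <- read.csv("f1db/results.csv")
races    <- read.csv("f1db/races.csv")
drivers  <- read.csv("f1db/drivers.csv")
status   <- read.csv("f1db/status.csv")
circuits <- read.csv("f1db/circuits.csv")

# Restrict to the 1950-2019 scope of the analyses.
races <- races[races$year <= 2019, ]
```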

Data dictionary

The data dictionary represents the content of these structured datasets, as shown in Table 4. We choose not to list the different variables within each dataset here; a full data dictionary of the 15 used datasets can be found in Appendix 2.

Table 4. Datasets available in the Ergast API and their content

Dataset | Content
Season List | All the seasons starting from 1950 and the link to their Wikipedia page.
Qualifying Results | Qualifying results per driver per race, determining the start position in the race.
Constructor Information | The constructors for each season and a link to their Wikipedia page.
Lap Times | Lap times per season, round, and specific lap. This data is available from 1996.
Race Schedule | The schedule of every season, containing the order in which the races are held.
Driver Standings | The driver standings at every race.
Constructor Standings | Similarly, the constructor standings at every race.

14 The data was retrieved in December 2019 and can be found online at http://ergast.com/mrd/db/#csv

15 The weather data explanation and its sources can be found in Appendix 3.


Circuit Information | The country and city in which the race takes place.
Pit Stops | The time of a pit stop for a specific combination of lap, driver, circuit, and year.
Race Results | The results of every race, including the finish results. This is the most important dataset and the starting point for merging all datasets into an initial basetable.
Driver Information | The driver's name, nationality, date of birth, and Wikipedia page.
Finishing Status | The status at the end of the race: whether the driver had issues with the engine, brakes, etc., or finished the race on the same lap as the winner or +1, 2, 3… laps behind the winner.
Constructor Results | The results per constructor per specific race.
Weather | All the wet-affected races.
Country | Converts country to nationality.

Entity Relationship Diagram (ERD)

To construct the basetable, we have to understand how the different datasets are linked with one another. Al-Masree (2015) defines relationship extraction rules to identify the functional dependence of different datasets. A relationship can be extracted when there is overlap between a primary key, a variable or set of variables that uniquely identifies a specific observation, and a foreign key, a primary key of another dataset that establishes a relationship. Another aspect of understanding the underlying relationships between the different datasets is the cardinality. A cardinality of 0..N means that the corresponding primary key can be used 0, 1 or multiple times in the relationship to combine rows between 2 datasets. A cardinality of 1 refers to a single match between rows of 2 different datasets. Lastly, a cardinality of 1..N indicates that there should be at least one matching row between the datasets. Note that there are 2 directions for a relationship between dataset X and dataset Y, namely X → Y and Y → X, and therefore 2 cardinalities are associated with every pair of datasets. Yeh, Li and Chu (2008) explain the extraction process of an ERD from a table-based legacy dataset, and this paper provides additional information on how to construct an ERD as shown in Figure 3. In this figure, the 'Results' dataset serves as the central table in our analysis. Moreover, the variables that facilitate the relationships between the different datasets and the cardinality of every relationship can also be found in this figure.


Figure 3. Entity Relationship Diagram (the 'Results' table is the central entity, linked to the other tables via keys such as raceId and constructorId)


To illustrate how to read the ERD, consider the relationship between the status and results datasets. The relationship 'results → status' has a cardinality of 1, meaning that every row in the results dataset has exactly one match with a statusId in the status dataset. The relationship 'status → results' has a cardinality of 1..N, indicating that every statusId of the status dataset appears at least once in the results dataset.
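A minimal R sketch of realising this relationship, assuming the tables read in earlier and the Ergast column names:

```r
# Join results to status on the shared key statusId: every result row gets
# exactly one status (the cardinality-1 direction of the relationship).
results_status <- merge(results, status, by = "statusId", all.x = TRUE)

# The reverse direction has cardinality 1..N: one status value, e.g.
# 'Finished', occurs for many result rows.
head(sort(table(results_status$status), decreasing = TRUE))
```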

Data exploration

The intention here is to get a first encounter with the data. We will plot some basic figures that visualize potential relationships in these initial tables. This part can be seen as descriptive analytics since we look into what happened in the past16.

Races

Which circuit has hosted the most Formula One Grands Prix throughout the years?

Figure 4. Number of races on the different F1 circuits

The British and Italian circuits have been part of all the F1 seasons from 1950 until 2019 as can be seen in Figure 4. The third most driven circuit in F1 history is the prestigious Monaco circuit.
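The count behind such a figure is a one-liner; a sketch assuming the races and circuits tables loaded earlier:

```r
library(dplyr)

# Number of Grands Prix hosted per circuit, descending (cf. Figure 4).
races %>%
  count(circuitId, name = "nRaces") %>%
  inner_join(circuits, by = "circuitId") %>%
  arrange(desc(nRaces)) %>%
  select(name, nRaces) %>%
  head(10)
```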

16 We will assign a blue color to the most occurring setting in a specific figure.


How did the number of races per season evolve over the years?

Figure 5. Races per season throughout the F1 history

In Figure 5, the evolution of the number of races per season is depicted. We notice a gradual increase, with the 2016, 2018, and 2019 seasons having the highest number of races, which follows the increasing global presence of the sport.

Status: the status table contains the different finishing statuses a driver can obtain. This is of great relevance for our analysis since the event of finishing a race is the dependent variable of the second study in this dissertation.

What are the 10 most occurring driver statuses?

Figure 6. Top 10 most occurring statuses in the 'Results' data set


In Figure 6, we observe the 'Finished' status as the most occurring one. This status is reached when a driver finishes on the same lap as the winner of the race. Furthermore, we notice '+1 Lap', '+2 Laps', and '+3 Laps' to be highly occurring statuses. The 'Finished' status together with all the '+… Lap(s)' statuses facilitate the creation of the dependent variable for the race completion analysis. The 'Did not qualify' status is the starting point for the qualifying ability analysis. Thus, the status of the driver steers two of the three analyses behind the scenes.

Weather: this table contains all the wet-affected F1 races.

How many races were affected by rainy conditions?

Figure 7. Proportion of wet-affected races

During the 7 decades of F1 racing, 1018 races have been completed. Only 149 of these races were affected by rain; percentage-wise, this is nearly 15% of all races, as displayed in Figure 7. The rain-affected races from 1950 until 1989 can originally be found in Jones (1996), and the rain-affected races from 1990 until 2019 are included based on comparing 3 online sources: Autosport Forum, Quora and Wikipedia. While the data is retrieved from popular non-verified sources, it shows high similarity across sources. The rain variable is crafted via majority voting between the 4 sources mentioned above. In the case of a tie, the ranking for the chosen weather variable is: Jones (1996) > Wikipedia > Autosport Forum > Quora. The rain data included in these sources can be found in Appendix 3.
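A hypothetical sketch of this majority vote in R; the per-source 0/1 flags (wetJones, wetWiki, wetAutosport, wetQuora) are assumed column names, not the actual basetable columns.

```r
# Majority vote over the four sources; ties (2-2) defer to the highest-ranked
# source in the ordering Jones > Wikipedia > Autosport Forum > Quora.
rain_vote <- function(jones, wiki, autosport, quora) {
  votes <- jones + wiki + autosport + quora
  if (votes >= 3) return(1)  # clear wet majority
  if (votes <= 1) return(0)  # clear dry majority
  jones                      # tie: follow the top-ranked source
}

weather$wet <- mapply(rain_vote, weather$wetJones, weather$wetWiki,
                      weather$wetAutosport, weather$wetQuora)
```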


Qualifications: the qualifications determine the starting position for the actual race.

What is the distribution of completed first, second, and third-round qualifiers?

In Figures 8, 9, and 10, the distributions of the Q1, Q2, and Q3 variables are shown. We observe an increase in zeros for Q2 compared to Q1 and for Q3 compared to Q2. This means that fewer drivers were able to complete the last qualifier compared to the second and first qualifiers. Consequently, the observations having a value for the Q3 variable could potentially be considered as more competent drivers. Hence, we will assess the impact of these variables on 'top 3 performance', 'race completion', and 'qualifying ability'.

Figure 8. Proportion of hasQ1

Figure 9. Proportion of hasQ2

Figure 10. Proportion of hasQ3


Drivers

What are the top 5 most occurring nationalities for F1 drivers?

Figure 11. Top 5 occurring drivers' nationalities

In total, 847 drivers have had the opportunity to drive an F1 car. In Figure 11, the number of drivers for the 5 most represented countries is shown. Of these drivers, 164 represented Great Britain. The podium is completed by the USA (157) and Italy (99). France (73) and Germany (49) take places 4 and 5 respectively.

Constructors

How many constructors have there been in history, and which country is considered their home country?

Figure 12. Constructors per home country

In history, 209 constructors have participated in F1 races. In Figure 12, the distribution of constructors per home country is displayed. Remarkably, 86 of those constructors were British. The podium is completed by the USA (39) and Italy (29). Notice that the podium is


identical to the nationality podium for the drivers. A potential explanation, purely suggestive, could be the chauvinism of constructors in choosing drivers with a 'home nationality', since a constructor can scout more thoroughly in its home country.

Circuits

How many of the circuits that ever hosted an F1 race are located in each quadrant based on latitude and longitude?

Table 5. Circuit locations based on longitude and latitude

      | West | East
North |   42 |   25
South |    4 |    3

In total, 74 circuits have had the privilege of hosting an F1 Grand Prix. Table 5 contains the locations of the circuits based on longitude and latitude: 42 circuits are located in the North-West quadrant, 25 in the North-East, 4 in the South-West and 3 in the South-East. This statistic shows that the circuits are mainly located in the Northern hemisphere. However, it may be misleading since Formula One is a sport with a global presence, as stated in the 'Background' section.

Wins

Which driver has the most F1 victories of all time?

Figure 13. All-time wins per driver

Figure 13 displays the 3 drivers with the most all-time F1 wins. Schumacher comes first with 91 all-time wins, closely followed by Hamilton with 84 wins. The podium is completed by Vettel with 53 wins.


How many distinct drivers have ever won an F1 race?

Figure 14. Drivers that won at least one race in their career

In Figure 14, the split between drivers who ever won a race and those who did not is shown. In total, 103 of the 847 drivers have ever won an F1 race, which is around 12%. This statistic shows that winning at least one Formula One race in a career is already a big achievement.

Missing values

Treating missing values is essential to improve the quality of the data. According to Little, Lang, Wu and Rhemtulla (2016), missing values can reduce the power of the model and can lead to biased results since the observed relationships between variables could be incorrect. There are 2 tables which have some missing value issues. The 'Qualifying' table has missing values for the q1, q2, and q3 variables, which contain the lap time of respectively the first, second, and third qualification round. Fortunately, there is an explanation for these missing values: a driver who crashes his car during practice or an earlier qualification round might not be able to drive in the following rounds. In the 'Results' table, there was one major problem: some drivers had 0 as starting position (grid) in the dataset. When inspecting the data more closely, it becomes clear that these observations were non-participations because of withdrawal, non-qualification, or similar statuses. These observations are classified as non-qualifications, which are the research topic of our third analysis: the 'qualifying ability analysis'. In conclusion, the data quality of the Ergast API is high since there are only 2 tables, 'Qualifying' and 'Results', that suffer from minor missing value issues. However, extra missing values can occur when the different datasets are merged into one comprehensive basetable. The imputation of the missing values for the created features in the basetable will be discussed later (cf. infra Data Preparation & Feature Engineering: Feature description).


Data preparation & feature engineering

The initial data preparation, which is the merging procedure of the different tables from the Ergast API, is done via R in the RStudio environment. In total, our basetable contains 23041 observations with 33 different features (X) and 3 dependent variables (Y) related to the 3 different analyses performed in our research. In the 3 analyses, success is defined as either (1) finishing in the top 3, (2) completing the race, or (3) qualifying to start the race.

Analysis 1: Top 3 performance

Firstly, success is defined as finishing in the top 3 of a Formula One race. The top 3 are invited onto the podium and therefore get a certain amount of recognition for their performance. Drivers that finish in the top 3 get a '1', otherwise they get a '0' in our basetable. The goal is to help teams identify top performance in the highly competitive Formula One environment.

Analysis 2: Race completion

In this analysis, success is defined as being able to complete a race. Reasons for failing to complete a race can be a human error or a technical dropout. The purpose is to obtain insights that could help F1 teams improve the safety of the racing environment. A regrettable situation such as the 1976 Nürburgring crash of the late Niki Lauda is an event that hopefully will never happen again. We assign a positive, '1', or negative, '0', outcome based on the finish status of a specific observation.

Analysis 3: Qualification ability

Success is defined as being able to start a Formula One race. Sometimes a driver is not even permitted to start a race because of disqualification, the 107% rule… As mentioned before, this rule forbids drivers to participate when their best qualification time is more than 7% slower than the fastest qualification time of all participants. However, it should be noted that the enforcement of this rule has been relaxed during the last couple of years. Drivers that were not able to qualify receive an outcome equal to '0', otherwise a '1' is assigned to the driver. Table 6 contains the different analyses carried out in this dissertation and the values of their corresponding dependent variables.

Table 6. Dependent variables in our binary classification analyses

Analysis 1: Top 3 performance | Analysis 2: Race completion | Analysis 3: Qualification ability
0 = not finished in the top 3 | 0 = not completed the race | 0 = not qualified for the race
1 = finished in the top 3 | 1 = finished the race | 1 = qualified for the race
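A minimal sketch of how these three 0/1 targets could be derived on a merged basetable `bt`; the column names (positionOrder, status, grid) follow the Ergast dump and are assumptions, not the dissertation's actual code.

```r
# Analysis 1: top 3 performance.
bt$top3 <- as.integer(bt$positionOrder <= 3)

# Analysis 2: race completion ('Finished' plus the '+n Lap(s)' statuses).
bt$completed <- as.integer(bt$status == "Finished" |
                           grepl("^\\+\\d+ Lap", bt$status))

# Analysis 3: qualifying ability (grid 0 marks a non-qualification).
bt$qualified <- as.integer(bt$grid > 0)
```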


Feature description

In our analysis, 33 features are crafted to predict 'top 3 performance', 'race completion', and 'qualifying ability'. The SRP-CRISP-DM methodology devotes specific attention to categorizing the features into related subsets. We divide the features into three big categories: experience-related, dependent-related, and race-related features. The experience-related features comprise the accumulated experience of the drivers and constructors throughout the years. Dependent-related features give insights into a driver obtaining a top 3 in the past, not completing a race, or not qualifying to start a race. The race-related features consist of the starting position and the points obtained in the season as a proxy for the current form of the driver. Moreover, the form of the driver's team partner is also included in the analyses.

Race-related features

An important feature for predicting the outcome in many analyses is the starting position of the driver. This starting position is named 'grid' in the Formula One jargon. Stoppels (2017) uses the starting position as a feature to predict the race outcome in Formula One by applying ANNs. To fit the predictions to the current Formula One environment, in which a maximum of 20 drivers can start the race, the grids will be rescaled between 1 and 20 for every single race using:

rescaled grid = round( (original grid in the race / maximum grid in the race) × 20 )

The rescaled grid is rounded to the nearest integer since a starting position of 12.89 is not among the possibilities for a grid; grid positions should be integers. Another option was to set all grids above 20 equal to 20, but this would have created a disproportionately large and unwanted number of 'grid equal to 20' occurrences. Deleting the observations with a grid larger than 20 might create bias in the data, which could influence the conclusions of our analyses. Thus, we opt for rescaling this feature with the formula mentioned above.
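A minimal dplyr sketch of this per-race rescaling, assuming the merged basetable `bt` with Ergast column names:

```r
library(dplyr)

# Rescale the grid per race to the modern 20-car field; grid 0 marks a
# non-qualification and simply stays 0 after rounding.
bt <- bt %>%
  group_by(raceId) %>%
  mutate(rescaledGrid = round(grid / max(grid) * 20)) %>%
  ungroup()
```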


The performance in the current season is measured via pointsSeason, which is also a sliding-time-window variable. This feature is normalized by the current round, meaning the points are divided by 5 in the 5th race of the season, by 11 in the 11th race of the season, and so on. Parallel to the sliding time window of nrWins, pointsSeason consists of the normalized points before the start of the race. To weigh the different races throughout the years equally, a modification has to be made to the pointsSeason variable: from 1950 until 2009, 8, 9, or 10 points were awarded to the winner, whereas 25 points are awarded to the winner from 2010 onwards. To keep it straightforward, the normalized pointsSeason variable is multiplied by 2.5 for the 1950-2009 period. Moreover, we include an integer for the round of the season via nrRaceSeason, because it is far more challenging to obtain the maximum number of points after 5 races of the season than after the second race. Eichenberger and Stadelmann (2009) also include the results of the team partner driver as a control variable in their analysis. Likewise, we construct a variable teammateDiffPointsSeason to include the difference in normalized points between the driver and his team partner. Some caution is needed here since in the first decade of Formula One constructors could participate with more than 2 drivers; this obstacle is overcome by taking the mean of teammateDiffPointsSeason. In the soccer prediction study of Eryarsoy and Delen (2019), the effect of the home crowd, the 12th man, is included to predict the outcome of a soccer game. A similar 'home crowd effect' for a driver in a Formula One race is used in the study of Eichenberger and Stadelmann (2009). Parallel to these studies, we craft a variable homeAdvantage that captures the home advantage of a driver in our 3 analyses. The weather is an uncontrollable variable but can have a high impact on the outcome of a race. The variable rain quantifies the weather conditions of a particular Grand Prix, with a race affected by rain classified as 0 and one not affected by rain as 1. An in-depth explanation of the creation of this feature can be found in Appendix 3. The number of laps of the race is used as well, since drivers get tired throughout the race and the race demands continued sharpness from the driver; we quantify this with lapsRace.

Experience-related features

To quantify previous performance regarding the starting position, we crafted 2 variables: pastStart3 and everStart3. pastStart3 is the number of past starting positions in the top 3 for a specific driver, and everStart3 registers whether a driver has ever started in one of the first 3 positions. In our analysis, the impact of previous successes should not be neglected and, hence, a variable nrWins is constructed to capture this. nrWins accounts for the all-time performance of a driver by capturing their all-time wins. Since we aim to make predictions for 'top 3 performance', 'race completion', and 'qualifying ability', features cannot contain information from the future. Therefore, a sliding window, also used in the study of Bastiaans (1985), is applied to the nrWins variable, meaning this feature contains the number of wins before the race starts. Technically, the feature is lagged one time period, i.e., to the previous race of that specific driver.
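A sketch of this sliding-window (one-race lag) construction for nrWins, assuming the basetable `bt` and a `date` column taken from the races table:

```r
library(dplyr)

# Cumulative number of wins *before* the current race, per driver: cumsum
# counts wins up to and including each race, lag shifts it back one race.
bt <- bt %>%
  arrange(driverId, date) %>%
  group_by(driverId) %>%
  mutate(nrWins = lag(cumsum(positionOrder == 1), default = 0L)) %>%
  ungroup()
```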


The impact of the weather was also included in the ANN analysis of Stoppels (2017) to predict racing outcomes for the 2016 Formula One season. Furthermore, Rosso and Rosso (2016) study driver performance in changing weather conditions during an F1 race. We include the experience in the rain by recording the number of past races on a wet circuit with the variable rainExp. Teams (constructors in Formula One lingo) play an important role in Formula One. In the literature, there is some disagreement regarding the contribution of teams to the overall performance of the drivers. As discussed in the literature review, Allen (2000) states that driver contribution can overshadow team contribution. On the contrary, Spurgeon (2009) attributes 80% of the performance to team contribution and 20% to driver contribution. Moreover, Bell et al. (2016) conclude that team effects outweigh driver effects for top performance in a race. The authors state that the constructor contributes to the driver's performance with the knowledge acquired throughout F1 history, which they call the 'legacy effect'. We include this legacy effect in our analyses with the variable legacyConstructor, which measures the all-time race participation of a constructor before the race starts. This variable is, like nrWins and pointsSeason, a sliding-time-window construction. The different constructors themselves are not included in the analyses since we opt to capture the experience of the constructor via legacyConstructor. This choice is supported by the research questions and the literature: the research question is not 'Who is the best constructor throughout Formula One history?', and thus we choose to make this study constructor-independent by quantifying constructors by their legacy. Moreover, there are 212 different constructors in the 'Constructor' dataset, meaning there are many categories to take into account for this variable, and Micci-Barreca (2001) states that high-cardinality categorical attributes pose a challenge for classification and regression problems. For these 2 reasons, we do not include the high-cardinality variable constructor but capture it with the legacy effect of the constructor. To measure the quality level of a specific constructor, we include the number of all-time wins of that constructor via nrWinsConstructor. This variable is also created via the sliding window approach of lagging the variable by one race. As mentioned in the introduction, Toro Rosso changed their name to Alpha Tauri for the 2020 season. This introduces an extra hiatus in the data since a team that alters its name would be regarded as a new team. To deal with this inconsistency, we looked into the previous names of the current Formula One teams and assigned all these name modifications to the appropriate constructor (Appendix 4).


Another way to measure a driver's ability to steer his Formula One car is to create 3 extra variables: hasQ1, hasQ2, and hasQ3¹⁷. 'hasQ1' captures whether a driver was able to complete the first qualifying round. 'hasQ2' captures whether a driver participated in the second qualifying round. A reason for non-participation is, for instance, that the driver crashes his car during the first qualifying round. 'hasQ3' is a similar variable, but it registers this (non-)participation for the third qualifying round. Instead of including the hasQx variables directly, we make them more experience-focused by dividing the past positive hasQx (pastHasQx) by the total participations of a specific driver. The formula for computing these variables is:

\[ \mathrm{pastHasQxProp} = \frac{\text{positive pastHasQx of driver}}{\text{number of participations of driver}}, \quad x \in \{1, 2, 3\} \]

This variable is created with the sliding window approach since we cannot be sure before the race starts that the driver will participate in it. There might be an acute technical failure leading to non-participation, or the race officials might bar the driver from starting the race. Age can be used to capture the maturity of the driving skills of the driver; hence, an older driver might perform better or prevent crashes thanks to more careful driving habits. Age will be used as one of the experience-related features for a driver and this feature is categorized as shown in Table 7.

Table 7. Categorizing age

Categories age: ageUnder21, age21_24, age25_28, age29_32, age33_36, age37_40, ageAbove40

17 These variables can be referred to as hasQx.


Another quantification of the experience of a Formula One driver is his years of experience, captured by the variable yearsExp. It is assumed that more experienced drivers will perform better with the same car than inexperienced drivers. We categorize yearsExp to understand which yearsExp category has the highest predictive power in our 3 binary classification analyses. The categories for the 'yearsExp' feature can be found in Table 8.

Table 8. Categorizing yearsExp

Categories yearsExp: yearsExpUnder3, yearsExp3_5, yearsExp6_8, yearsExp9_11, yearsExp12_15, yearsExpAbove15
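Both categorizations can be implemented with base R's cut(), as sketched below under the assumption of numeric columns age and yearsExp; the breaks follow Tables 7 and 8.

    results$ageCat <- cut(results$age,
                          breaks = c(-Inf, 20, 24, 28, 32, 36, 40, Inf),
                          labels = c("ageUnder21", "age21_24", "age25_28",
                                     "age29_32", "age33_36", "age37_40",
                                     "ageAbove40"))

    results$yearsExpCat <- cut(results$yearsExp,
                               breaks = c(-Inf, 2, 5, 8, 11, 15, Inf),
                               labels = c("yearsExpUnder3", "yearsExp3_5",
                                          "yearsExp6_8", "yearsExp9_11",
                                          "yearsExp12_15", "yearsExpAbove15"))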

Another feature considered in our analyses is the circuit, to grasp its influence on the outcomes. There are 41 different circuits in the dataset, of which some are much more technical than others and, thus, require a higher skill level from the drivers. Sulsters and Becker (2018) state that each circuit has its own characteristics and, thus, the driver is challenged every race to adapt to another circuit. For example, the Yas Marina circuit in Abu Dhabi is regarded as one of the more difficult circuits since it is hard to overtake there. An advantage of tree-based models could be the discovery of interesting driver-circuit interactions, which is why the circuit is considered in our analyses. Even though every circuit has its characteristics, we decide not to include the specific circuits since some circuits have become obsolete and are no longer part of the Championship races. Instead, the circuit is incorporated via the driver's experience on every circuit with the variable circuitExp. Moreover, we include the number of previous non-completions on a particular circuit via the feature pastNoCompCircuit. Both features are implemented using the sliding window approach.


Dependent-related features

Next, dependent-related variables are included in our analyses. For the top 3 performance analysis, the number of past top 3 performances is captured via pastTop3. Furthermore, everTop3 registers whether a driver has ever finished in the top 3. Parallel to the top 3 performance analysis, pastNoComp and everNoComp are created for the race completion analysis, and pastNoQual and everNoQual are constructed for the qualifying ability analysis. pastNoComp and pastNoQual quantify the number of non-completions and non-qualifications, respectively, for a particular driver. everNoComp and everNoQual register whether the driver ever had a non-completion or non-qualification, respectively. For every model, we include the 2 relevant features to capture this lagged component related to the dependent variable.

Overview of the crafted features

In the SRP-CRISP-DM methodology, extra attention is given to creating a subset of related features. As stated above, we have 3 big feature categories: experience-related, dependent-related, and race-related features. An overview of these 3 feature categories and their associated features can be found in Table 9. Moreover, we indicate whether these features have been used in previous studies.

Table 9. Features used in the analyses

Feature | Description | Used for analysis | Previous analysis

Experience-related features
rainExp | Amount of races driven in the rain | All 3 | /
age | The age of the driver to grasp his maturity; split up in 7 categories (cf. supra Table 7) | All 3 | /
yearsExp | Time span between the race of a specific observation and the first F1 race of a driver; split up in 6 categories (cf. supra Table 8) | All 3 | /
circuitExp | Amount of races driven on a particular circuit | All 3 | Sulsters and Becker (2018)
nrWins | Including previous successes in the analysis | All 3 | Eichenberger and Stadelmann (2009)
nrWinsConstructor | Number of all-time wins of the constructor | All 3 | /
legacyConstructor | Amount of previous participations of the constructor | All 3 | Bell et al. (2016); Marino et al. (2015)
pastStart3 | Number of times started in the first 3 starting positions | All 3 | /
everStart3 | Ever started in the first 3 starting positions | All 3 | /
pastHasQ1Prop | The driver's performance at the first qualifying round | All 3 | /
pastHasQ2Prop | The driver's performance at the second qualifying round | All 3 | /
pastHasQ3Prop | The driver's performance at the third qualifying round | All 3 | /
pastNoCompCircuit | Number of past crashes and technical errors on a particular circuit | All 3 | /

Dependent-related variables
pastTop3 | Amount of past top 3 finishes | Top 3 | /
everTop3 | Ever a top 3 finish | Top 3 | /
pastNoComp | Amount of past non-completions | Race completion | Sulsters and Becker (2018)
everNoComp | Ever a non-completion | Race completion | /
pastNoQual | Amount of past non-qualifications | Qualifying ability | /
everNoQual | Ever a non-qualification | Qualifying ability | /

Race-related features
pointsSeason | Total points of a driver in the season normalized by round of the season | All 3 | Stoppels (2017); Judde et al. (2013)
nrRaceSeason | Integer indicating the round of the season since it is harder to achieve high normalized points as the season proceeds | All 3 | /
teammateDiffPointsSeason | Difference between the total points of a driver and his team partner in the season, normalized by round of the season | All 3 | Stoppels (2017)
homeAdvantage | Home crowd effect on performance | All 3 | Eichenberger and Stadelmann (2009)
rain | Effect of bad weather. 0 = no rain; 1 = rain | All 3 | Rosso and Rosso (2016); Stoppels (2017); Judde et al. (2013)
grid | Starting position in the race | Top 3, race completion | Stoppels (2017); Sulsters and Becker (2018)
lapsRace | The number of laps for a specific circuit | All 3 | Stoppels (2017)


Feature scaling

Li, Jing, Ying and Yu (2017) state that tree-based models are scale-invariant, meaning they are not influenced by different scaling methods. Models like CART (Classification And Regression Trees) and Random Forest are based on an impurity measure instead of distance-based measures. These impurity measures are applied in the split selection method for classification trees (Loh & Shih, 1997). The authors state that this approach iterates over all possible binary splits of the data for all the features, resulting in the choice of the split that reduces the measure of node impurity the most. Thus, it is determined which feature, with its related threshold level, is used to make as many correct classifications as possible. Conclusively, feature scaling is not needed since tree-based models are based on a split selection method that uses the impurity or a related measure to determine the feature to make a binary split on.

Timeline of features

The availability of different features is of vital importance for our analysis since the unavailability of a feature for a specific period will lead to missing values. Some of these features are not available from 1950 onwards, as shown in Figure 15. The years 1950-1994 represent the minimum available dataset. The data is augmented with the qualifying round 1, round 2, and round 3 data in 1994, 2005, and 2006, respectively. This has to be taken into account and in the next subsection we discuss the imputation method for these missing values.

Figure 15. Time analysis of the availability of features used in the analysis


Imputing missing values

Four features suffer from missing values: pointsSeason, pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop. As mentioned in the Feature Description section, pointsSeason contains the points obtained by the driver in the season, normalized by the round of the season. This variable is based on the sliding time window principle, which ensures the feature does not include information from the future; as a consequence, pointsSeason is missing for the first race of the season. The other 3 features have limited availability in time (cf. supra 'Timeline of features'), leading to a substantial amount of missing data for those variables. There are thus 2 different missing value problems here: one is created by us to solve the flaw of the sliding time window for the first race of the season, whereas the other stems from limited availability in time. Li et al. (2004) describe 3 ways of dealing with missing data: deletion of the row or column, methods that can estimate cases of missing data, or imputing the missing values. As the title of this section suggests, we opt for the last way of treating missing data: imputation. Scheffer (2002) calls deletion bad practice but considers mean imputation much worse since the variance within the variable is destroyed. Moreover, Donders et al. (2006) conclude that overall mean imputation, the missing-indicator method, and complete case analysis produce biased results in most situations. An example of a method that can estimate cases of missing data is the EM algorithm (Dempster, Laird & Rubin, 1977). Unfortunately, Celeux and Diebolt (1992) state that the EM algorithm is dependent on the starting position and, in some cases, the convergence of the algorithm is intolerably slow. Thus, imputation is the chosen route to treat the missing values in this dissertation. Li et al. (2004) define 4 approaches to impute the variables: mean, regression, hot deck, and multiple imputation methods. As stated above, mean imputation is not an approach to use for imputing missing values. Von Hippel (2004) states that regression imputation, which uses a regression model to impute missing values, is biased since the regression parameters are derived using pairwise deletion. Thus, we focus on 'hot deck imputation' and 'multiple imputation' to treat missing values. Clustering is an example of hot deck imputation and, more particularly, Li et al. (2004) state that K-means is a potential method that treats the fuzzy relationship between the different variables as a clustering problem to estimate the value of the missing observation's characteristic. The K-means method is adapted with a fuzzy membership function to avoid the algorithm getting stuck in local minima.


Another imputation method is k-Nearest Neighbors (kNN)¹⁸ imputation, which is listed as one of the missing value treatment techniques by Acuna and Rodriguez (2004). These authors point to the downsides of this technique: the arbitrariness of choosing K, the choice of distance function, and the time-consuming nature of the process. The K-means process also depends on an arbitrarily chosen K and for that reason we take a different path, which is the path of 'multiple imputation'. A Multiple Imputation method (Rubin, 1996) is the proposed method for treating complex data with more than one variable containing missing data. An example is the Multiple Imputation by Chained Equations (MICE) method (van Buuren, Boshuizen & Knook, 1999) to craft imputed datasets. MICE performs well at capturing linear relationships via the chained equations used to impute missing values (van Buuren & Groothuis-Oudshoorn, 2010). However, non-linear relationships are not included in the default MICE models (Shah et al., 2014). These authors incorporate random forest into MICE to capture such non-linear relations when imputing the missing values. To map complex interactions and non-linear relations, a random forest approach with out-of-bag imputation can also be used (Stekhoven & Bühlmann, 2012). In general, random forest imputation yields good results for mixed-type data. Thus, the MICE method would be advised for linear relationships between the features and the random forest-based methods for non-linear relationships. Since the true relationships between the features are unknown, we opt for the MICE method. Fortunately, tree-based models perform well at capturing potential non-linear relationships within the analyses (Derrig & Francis, 2006). Thus, the MICE method, which captures linear relationships when imputing the missing values, is combined with tree-based models that achieve good results at estimating non-linear dependencies. Conclusively, the MICE method will be used to impute the missing values for the pointsSeason, pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop¹⁹ features.

18 K-means and kNN should not be confused: K-means is an unsupervised clustering method, whereas kNN is used for supervised classification problems (Quek, Woo & Logenthiran, 2016).

19 To be complete: pastHasQ1Prop, pastHasQ2Prop, and pastHasQ3Prop are features created by the sliding window approach. This creates another problem for the first-ever race a driver participates in, since the value will be non-existent. These features are approximated by 1 for the first-ever race of a driver. These values could be imputed as well, but imputing a variable that is crafted from another imputed variable, hasQx, seems a bit too far-fetched. Therefore, we use 1 as an approximation for the pastHasQxProp of a driver's first-ever race.
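To make the chosen route concrete, a minimal sketch of the MICE imputation with the R package mice is given below. It assumes the prepared data frame basetable; the number of imputed datasets (m = 5) and the seed are illustrative choices, not the exact settings of the dissertation.

    library(mice)

    # Chained equations over all variables that contain missing values
    imp <- mice(basetable, m = 5, seed = 123, printFlag = FALSE)

    # Take one completed dataset to continue the analysis with
    basetable_imputed <- complete(imp, 1)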


Excluded features

We pursue an approach that uses many different variables – data mining – while also thinking critically about which ones to include in the different analyses – domain logic. Approaching the analyses from a data mining perspective, we include all the crafted features in our 3 binary classification analyses. From a domain logic perspective, the question arises: 'Is it useful to include a specific variable to predict top 3 performance, race completion, or qualifying ability?' In the qualifying ability analysis, we cannot take the grid into account since we predict whether the driver will be able to start the race. Formulated differently, predicting whether the driver will obtain a grid position is the aim of this analysis. Consequently, the grid cannot be used as an independent variable here because it steers the classification of our label directly. After assessing the occurrence of these non-participations, we observe that they are present for the entire period of our qualifying ability analysis, 1950-2019. The analyses do not include the team budgets since a budget ceiling has been decided for the 2021 season. The past budget of the teams might be a reasonable explanatory variable for past behavior, but this budget ceiling will even out its effect for predictions since teams are obliged to keep their budget under a specific amount. Hence, the budget is not included in our prediction analyses. In general, no specific drivers, circuits, nationalities, or constructors are included in our analyses. The purpose of our analyses is not to identify the best driver of all time, the best driver per circuit... Furthermore, we want to keep the analyses independent of a specific driver, circuit, nationality, or constructor. Hence, the scope here is on:

1. specific characteristics of a driver: the experience in the rain, age, the home advantage...
2. the constructor, via the legacy effect of the constructor
3. the driver's team partner, via his points in the season
4. the circuit, via a driver's circuit experience and the laps of a circuit

The focus of our analyses is on predicting high-end performance (top 3), race completion, and qualifying ability. If the pit stops within a race were included, these analyses would take on an explanatory character and, consequently, this would jeopardize the idea of being a predictive study. A solution could be to take the mean of previous pit stops over the driver's all-time participation. Unfortunately, the pit stop data is only available for the last decade, and imputing the other 6 decades based on data from 1 decade seems too far-fetched. Hence, we do not include the pit stop data, both because we want to craft a predictive study and because of its limited availability in time.


Modeling

In the modeling section, the Formula One basetable created during the 'Data Preparation & Feature Engineering' step is split into a training and a test set. In the modeling step of the SRP-CRISP-DM, we focus on the training set. Hyperparameter tuning via 5-fold cross-validation, class imbalance, and tree-based models are the main concepts discussed in this section.

Training and test set

The basetable, which is the prepared data set, is transformed into a training and a test set with splitting percentages of 70% and 30%, respectively. The training set uses a sample of the data to fit the model (James et al., 2013). In our case, this training set is used to fit tree-based models – decision tree, bagging, random forest, and boosting. The test set, which is unseen data for the model, is used to make predictions with the fitted model and, subsequently, indicates the generalizability of the model to new data. The training and test data will be used to carry out our 3 analyses: (1) top 3 performance, (2) race completion, and (3) qualifying ability.

Validation

The hyperparameters of the models play an important role in our analyses. Hyperparameters are parameters defined exogenously while fitting the model and whose values cannot be estimated from the data. Technically, Bergstra, Bardenet, Bengio and Kégl (2011) define hyperparameter optimization as the problem of optimizing a loss function over a graph-structured configuration space. The tuned hyperparameter will be discussed thoroughly for every algorithm. The thought process is that the hyperparameter most closely related to the flexibility/interpretability trade-off and/or bias/variance trade-off of James et al. (2013) will be considered. In our analyses, a cross-validation procedure is performed to finetune the hyperparameters. Instead of using one training and one test set and neglecting the hyperparameters, we apply this cross-validation to adjust the hyperparameters to our research. Fushiki (2011) states that cross-validation is a widely used approach to estimate the prediction error. We consider leave-one-out cross-validation and k-fold cross-validation to cross-validate the models. Wong (2015) explains leave-one-out cross-validation (LOOCV) as a special case of k-fold cross-validation in which the number of folds is equal to the number of instances in the training set. The author states that k-fold cross-validation may be chosen from a computational viewpoint. On the other hand, Davison and Hinkley (1997) point to the problem of bias in k-fold cross-validation when k is small. Both LOOCV and k-fold cross-validation have advantages over one another. However, k-fold cross-validation is picked since it is recommended from a computational viewpoint. A value for k has to be decided to cross-validate the models in this study.


Rodriguez, Perez and Lozano (2010) recommend the use of k=5 or k=10 because of the lower bias compared to k=2 and the higher computational efficiency compared to k=n, which is LOOCV. Based on these statements in the literature, we conclude that either a 5- or 10-fold cross-validation would be advised in our analysis. However, Tax and Joustra (2015) state that cross-validation is not advised for sport prediction because of the time-ordered nature of the data. Bunker and Thabtah (2019) add that shuffling the data via cross-validation is not a good approach since the order of the instances should be preserved. We do not follow this reasoning of not performing cross-validation since it is essential to tune the hyperparameters of the models. Moreover, our crafted features are insensitive to the notion of time due to the applied sliding window and the independence of the features from a specific driver, team, or constructor. Thus, the k-fold cross-validation of Rodriguez, Perez and Lozano (2010) will be followed with k equal to either 5 or 10. To limit the computation time of our models, we use 5-fold cross-validation as shown in Figure 16.

Figure 16. 5-fold cross-validation

As mentioned before, the test set enables us to calculate an objective performance measure of the models. The training set is resampled 5 (=k) times, each time with another validation fold, the blue fold in Figure 16, to validate the model with specific hyperparameter configurations. Hyperparameter configuration introduces the decision on the search strategy to use for hyperparameter tuning. Two options are considered: grid search and random search. Bergstra and Bengio (2012) define grid search as a procedure in which a set of values for specific hyperparameters is chosen and every possible combination of these values is formed via a set of trials. Random search refers to taking independent draws from a uniform density over the same configuration space that would be spanned by a regular grid. Concretely, the algorithm randomly picks several combinations to tune the hyperparameters. Bengio, Lamblin, Popovici and Larochelle (2007) optimize the hyperparameters of their neural network via the grid search method. Bergstra and Bengio (2012) suggest using random search since this method is more efficient than grid search, both from an empirical and a theoretical perspective. However, we follow a grid search approach since we would like to input the same grid of trials for our 3 analyses to increase the comparability between the analyses. Furthermore, notice the absence of the test set in this rationale since that set remains untouched during cross-validation. The test set is used to evaluate the predictive performance of the fitted models with tuned hyperparameters on new data.
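The split and the cross-validation set-up can be sketched with caret as follows. The sketch assumes a basetable with a binary factor target top3 with levels "no" and "yes"; the target name and seed are illustrative. The grid of trials itself is passed per model via tuneGrid, as shown in the later model sketches.

    library(caret)
    set.seed(123)

    # 70/30 stratified split into training and test set
    idx       <- createDataPartition(basetable$top3, p = 0.7, list = FALSE)
    train_set <- basetable[idx, ]
    test_set  <- basetable[-idx, ]

    # 5-fold cross-validation that reports the AUC (column "ROC")
    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)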


Tree-based models

Tree-based models are the fundament on which our dissertation is built, and they are therefore of paramount importance. It is not our intention to describe every mathematical aspect of these models. However, we will occasionally use a formula to explain the underlying forces of a model. The tree-based models discussed in this dissertation, inspired by the rationale of James et al. (2013), are decision trees, bagging, random forest, and boosting. The considered boosting models are Adaptive Boosting, Gradient Boosting, and XGBoost.

Decision tree classifier

The Decision Tree Classifier (DTC) is discussed first since it forms the basis on which the other models build extra complexity. Hence, the DTC is the most interpretable model in this dissertation. Safavian and Landgrebe (1991) describe the DTC, introduced as a modeling approach by Quinlan (1986), as an approach to multi-stage decision making. In essence, the DTC breaks a complex decision into several simpler decisions. The logic of the DTC can be assessed in Figure 17, where a fictitious example is presented.

Figure 17. Decision Tree Classifier


Figure 17 helps to classify a specific observation based on some features into category 1 or category 0. The second analysis, race completion, is used as a fictitious example to explain the interpretation of a DTC. In Figure 17, a driver with a circuit experience of 4 or more will be classified as 1 (R1), meaning this driver completed the race. Drivers with fewer than 4 races of experience on a circuit who enjoy the cheers of the home crowd are also classified as 1 (R2), and those who do not have the home advantage are classified as 0 (R3). The features on which the decision tree splits, circuitExp and homeAdvantage, are determined via splitting measures. We focus on the Gini index as the metric used at each node to determine which feature to split on. Next to entropy, this is one of the measures used in Classification and Regression Trees (CART) (James et al., 2013). The formula is:

\[ \text{Gini index: } G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right) \]

The Gini index is used as a measure of node impurity in CART, meaning the reduction in node impurity is measured for every candidate binary split and the split with the largest reduction is chosen (cf. supra Feature scaling). Technically, the feature space is partitioned to classify the outcome appropriately. The feature space partitioning is shown in Figure 18 (James et al., 2013), based on the DTC in Figure 17. We see the 2 variables, homeAdvantage and circuitExp, on which the splits are performed on the axes. If circuitExp is 4 or more (area R1), the driver will complete the race according to the DTC. Similar reasoning can be applied to the two other regions, resulting in R2 equal to 1 and R3 equal to 0.

Figure 18. Feature space DTC (James et al., 2013)


This model yields interpretable results for the complex problem of race completion by breaking it down, in this fictitious example, into one or two easier decisions. However, the DTC can be sensitive to noise in the training data, especially when the trees are grown very deep. This sensitivity to noise reduces the generalizability of the obtained model to new data. Consequently, other more flexible models will be used to increase the complexity that the model can deal with. The relationship between tree-based models and the interpretability/flexibility trade-off of James et al. (2013) is one of the main research topics of this dissertation (cf. 'Contribution to literature'). Therefore, the focus will be on tuning the hyperparameter of the decision tree algorithm that is directly related to this trade-off. A decision tree gets more complex to interpret when additional layers are created. Wehenkel, Pavella, Euxibie and Heilbronn (1994) identify the risk α of not growing the tree to lower layers with higher class purity. An α of 0 means shrinking the tree to its top node and an α of 1 refers to fully growing the tree. The trade-off here is that a high α will include extra noise from less important variables in the decision tree, while a low α will not make use of all the information available in the training set. A hyperparameter that is closely related to this tree complexity is the cost-complexity measure. This measure is used for tree pruning purposes and it can optimize the trade-off between the cost of misclassification and the complexity of the tree (Yan et al., 2016). Thus, the complexity parameter (cp) will be tuned for the decision tree classifier algorithm via a grid search.
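A minimal sketch of this grid search over cp with caret, reusing the trainControl object from the validation section; the cp grid itself is illustrative.

    dtc_fit <- train(top3 ~ ., data = train_set,
                     method    = "rpart",   # CART decision tree
                     metric    = "ROC",     # AUC as the tuning criterion
                     trControl = ctrl,
                     tuneGrid  = expand.grid(cp = c(0.0001, 0.001, 0.01, 0.1)))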


Bagging

The decision tree classifier suffers from a great deal of variance when the training set is altered and, thus, other tree-based approaches were created. One of these approaches is 'Bagging', introduced by Breiman (1996), who also calls it 'bootstrap aggregating' (p. 123). This approach generates a multitude of versions of a predictor and uses these versions to obtain an aggregated predictor. First, bootstrapping, a technique that draws multiple random samples with replacement from the observations, is performed. Bootstrapping is related to jackknifing, which finds its origins in the work of Quenouille (1949), where it was used to conduct approximate tests of correlation in time series. The jackknifing technique was introduced since it reduced the bias for the dataset and the standard error could be estimated (Quenouille, 1956; Hinkley, 1977). After the bootstrapping procedure, an aggregation step follows: an average is taken for predictions with numerical outcomes, while a plurality vote is performed to predict a class. Since our analyses are binary classifications, the bagging method classifies the targets based on the plurality voting mechanism. In Figure 19, the bagging procedure, based on Breiman (1996), is shown, consisting of 6 steps²⁰.

Figure 19. Bagging procedure

20 Instead of using mathematical expressions to explain the bagging procedure, we opt for a 6 step approach that gives a comprehensive overview of the mechanics of this algorithm. For the interested reader, we suggest Quenouille (1949, 1956) and Hinkley (1977) for the formulas behind bootstrapping and Breiman (1996) for the formulas behind bootstrap aggregating or bagging.


1. The training set is used as input.
2. Bootstrapping, resampling with replacement, is performed on the training set.
3. Multiple resampled datasets are the result of this bootstrapping procedure.
4. The decision tree classifier algorithm is applied to every resampled dataset. This DTC has high variance, leading to different predicted outcomes for the same feature input amongst the different trees.
5. To deal with these different predicted outcomes, plurality voting is applied in the case of a classification problem. The most occurring predicted outcome for a specific feature input is chosen as the aggregated predicted outcome. In the case of a tie in the majority voting, the estimated class with the lowest class label is chosen.
6. The result is the aggregated predicted outcome for the feature input.

Bagging is a method that receives a lot of support in the literature for multiple reasons. Bühlmann and Yu (2002) praise bagging for its computational efficiency in improving unstable predictors that suffer from high variance. Moreover, voting classification algorithms, such as bagging and adaptive boosting, are successful in improving the accuracy of classifiers for artificial and real-world datasets (Bauer & Kohavi, 1999). Furthermore, Breiman (1996) explains why bagging works for most predictors. A predictor that is correct at ordering most inputs can be aggregated into a high performing predictor. However, Breiman warns that poor predictors can be transformed into worse predictors. Thus, this reasoning suggests that the bagging procedure needs an underlying model that can perform reasonably well at classifying specific instances. Breiman states that bagging results in a higher accuracy since the perturbation of the training set is taken into account. On the other hand, building multiple trees on different bootstrapped datasets comes at the cost of losing a simple and interpretable structure. This loss of an interpretable structure can be linked to the interpretability/flexibility trade-off of James et al. (2013), which is one of the cornerstones of this dissertation. Galar et al. (2011) is another important study for this dissertation and we would like to assess their statement that tree-based models, especially bagging, perform well in combination with analyses with highly imbalanced classes (cf. infra Class imbalance). Therefore, we would like to assess the raw gain in prediction accuracy for the bagging procedure. Consequently, the hyperparameter grid is fixed for the bagging procedure, meaning the bagging models have the same hyperparameters in all analyses. This is done to make the results comparable and completely independent of the hyperparameter values. Hyperparameters should normally be tuned, but we decide not to tune them for the bagging procedure since we assess the raw performance of the models under different class imbalance treatment strategies (cf. infra Class imbalance).
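In line with footnote 23, the bagging model can be fit with caret's treebag method without a tuning grid, so its defaults are used; a minimal sketch:

    bag_fit <- train(top3 ~ ., data = train_set,
                     method    = "treebag",  # bagged CART via ipred
                     metric    = "ROC",
                     trControl = ctrl)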


Random forest

Breiman (2001) presents the Random Forest as a technique that adds another layer of randomness to the bagging procedure. In the aforementioned approaches, DTC and bagging, the nodes are split using the best among all predictors. Random forest randomly chooses a subset of predictors to be considered for the split at each node. This decorrelates the different trees, resulting in an overall variance reduction. Figure 20 shows the random forest procedure, which is similar to the bagging process with a modification to the fourth step, in which only a limited number of predictors are available as a split option.

Figure 20. Random Forest procedure

The number of features considered at every split (K) is by default set to the square root of the total number of features in the analysis (n): \( K = \sqrt{n} \). However, Bernard, Heutte and Adam (2009) state that this default K can be suboptimal in some cases and they indicate the relevance of finding the optimal setting of this K. Another hyperparameter of this modeling technique is the node size, which determines the minimum number of observations needed in a terminal node, a node at the lowest layer (Probst, Wright & Boulesteix, 2019). Tuning the minimum node size can have a positive influence on performance. Lin and Jeon (2006) show this positive performance boost by tuning the size of the terminal nodes in their random forest study. Moreover, Hastie, Tibshirani and Friedman (2009) use the minimum size of the terminal nodes to determine the depth of the tree, which is related to its complexity. Hence, the number of features considered at every split and the minimum node size will be tuned for the Random Forest model via 5-fold cross-validation. Were it not for the hyperparameter tuning, this algorithm, like bagging, offers an internal validation approach via the Out-Of-Bag estimates, which use the trees whose bootstrap samples do not include a specific observation to estimate the prediction performance (James et al., 2013).
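Since caret's built-in "rf" method tunes only the number of predictors per split, a small manual grid over both hyperparameters is one option. The sketch below uses the out-of-bag error for brevity, whereas the dissertation tunes via 5-fold cross-validation; the grid values are illustrative.

    library(randomForest)

    grid <- expand.grid(mtry = c(3, 5, 7), nodesize = c(1, 5, 10))
    oob  <- numeric(nrow(grid))

    for (i in seq_len(nrow(grid))) {
      rf <- randomForest(top3 ~ ., data = train_set,
                         mtry     = grid$mtry[i],      # predictors per split
                         nodesize = grid$nodesize[i],  # minimum terminal node size
                         ntree    = 500)
      oob[i] <- rf$err.rate[500, "OOB"]  # OOB error after 500 trees
    }

    best <- grid[which.min(oob), ]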


Boosting

Boosting finds its origin in the question posed by Kearns (1988) whether a set of weak learners can yield a strong learner. Schapire (1990) proves that a set of weak learners can compose a strong learner and, hereby, opens a new method for algorithm design in machine learning. We focus on boosting methods that use decision trees as weak learners: adaptive boosting (Freund & Schapire, 1995), gradient boosting (Friedman, 2001), and XGBoost (Chen & Guestrin, 2016). Thus, boosting is an ensemble learning method like bagging and random forest, but the trees are built in a sequential way to improve the performance of the learning algorithm. This is in contrast to the bagging and random forest procedures, in which the trees are grown in parallel. Figure 21 stresses the difference between the parallel bagging procedure and the sequential boosting procedure.

Figure 21. Parallel vs sequential ensemble methods inspired by Xia, Liu, Li and Liu (2017)

Freund and Schapire (1996) explain 2 interesting properties of the boosting algorithm. The first property is the generation of distributions on the harder parts of the sample space, challenging the learning algorithm to achieve high performance on these hard parts. The second property is closely related to the bias-variance trade-off of James et al. (2013). Freund and Schapire state that boosting takes a weighted majority over many hypotheses on different samples from the same training set, leading to a reduction in variance. Moreover, the authors state that boosting may reduce the bias of the learning algorithm, in contrast to bagging. Figure 22 demonstrates the sequential approach of the boosting algorithm²¹ as postulated by the authors.

21 Again, we use a figure to explain how the algorithm works. For the mathematics behind the algorithms, we refer to the studies for the 3 different boosting approaches used (Freund & Schapire, 1995; Friedman, 2001; Chen & Guestrin, 2016).


The focus is on the hard parts, which are the red symbols. Notice that in iteration 1 the bottom left square is classified correctly, while in iteration 2 it is classified wrongly, as shown by its red color. In this example, 3 criteria are needed to classify the symbols correctly as either a circle or a square.

Figure 22. Boosting procedure

We will focus on three boosting algorithms: Adaptive boosting, Gradient boosting, and XGBoost.

Adaptive boosting

Freund and Schapire (1995) introduce the adaptive boosting or AdaBoost algorithm, which focuses on dimensionality reduction of the features. Consequently, this algorithm only includes features that show predictive power in the model. Furthermore, the authors state that this boosting algorithm does not require prior knowledge of the weak learner. The Adaptive Boosting algorithm aims to minimize a cumulative loss function, which is done using the 'weighted majority' algorithm of Littlestone and Warmuth (1994).

Gradient boosting

Gradient boosting is a boosting method that sequentially grows trees by minimizing a cost function, which is a differentiable loss function. Concretely, the boosting approximates a specified loss function by an additive expansion of base learners that are fit to the training set in a stage-wise manner (Friedman, 2002). Thus, the gradient boosting algorithm is sequentially updated by taking the sum of previous base learners. This algorithm solves the cost function via a two-step approach (Friedman, 2002): computing the gradient of the loss function and determining the optimal scale coefficient given the optimal base learner.


The gradient boosting approach is more generic than adaptive boosting, since the gradient boosting method works with an arbitrary differentiable loss function whereas the adaptive boosting approach uses one specified loss function.

XGBoost

XGBoost, a gradient boosting algorithm introduced by Chen (2015), adds an extra layer of randomness by introducing a regularization term into the Gradient boosting approach. Concretely, the loss function, which is approximated with an additive expansion of base learners via the gradient boosting approach, is amplified with this regularization term. Chen and Guestrin (2016) state that XGBoost applies the "exact greedy algorithm for split finding" (p. 787) to find an optimal tree structure that solves the loss function. The XGBoost modeling approach is widely used in practice and is the algorithm of choice for a great number of Kaggle competitions²². An insight here is that, in some way, XGBoost is to Gradient Boosting what Random Forest is to Bagging: XGBoost adds a layer of randomness to the Gradient Boosting approach via a regularization term and Random Forest adds this extra layer to Bagging by only considering a subset of predictors. Lastly, the tuned hyperparameters for the boosting procedures are postulated based on 2 studies that devote specific attention to these hyperparameters. First, Xia et al. (2017) state that the tree complexity in a boosted decision tree approach can be controlled through the maximum depth of the tree. Therefore, we tune the maximum depth of the tree for the Adaptive and Gradient Boosting approaches. For the XGBoost model, Nguyen, Bui, Bui and Cuong (2019) decide to tune the maximum depth of the tree and nrounds since these are related to the complexity of the grown tree and the overfitting problem. In essence, this overfitting problem is the problem of having a too complex approach to modeling a specific problem, leading to capturing noise in the training set and, hence, reducing the generalizability of the fitted model to new data (James et al., 2013).

22 Kaggle is a platform offering coding competitions to grow your data science skills. Moreover, big companies like Walmart ask participants to solve a problem on this platform with rewards of up to $50,000 for the winning team.
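A minimal sketch of this tuning with the xgboost package, assuming a numeric feature matrix X and a 0/1 label vector y for the training set; the grids for max_depth and nrounds are illustrative.

    library(xgboost)

    dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)

    for (d in c(3, 6, 9)) {
      cv <- xgb.cv(params = list(objective   = "binary:logistic",
                                 eval_metric = "auc",
                                 max_depth   = d),
                   data = dtrain, nrounds = 200, nfold = 5, verbose = 0)
      # number of rounds with the highest cross-validated AUC for this depth
      best_nrounds <- which.max(cv$evaluation_log$test_auc_mean)
      cat("max_depth:", d, "- best nrounds:", best_nrounds, "\n")
    }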


Interpretability/flexibility trade-off

The decision tree is the most interpretable model since it is the single learner used in the bagging-based and boosting-based approaches. These approaches, also called ensemble learners, wrap extra levels of flexibility around it, which can grasp extra complexity in the training set when fitting the model. Bagging is more flexible because of the 'bootstrap aggregating' procedure (Breiman, 1996) that includes resampling with replacement and aggregating. Next, the random forest is even more flexible because of the extra randomness introduced by considering a subset of predictors at every split (Breiman, 2001). Boosting and bagging are positioned close to one another by James et al. (2013). However, we position boosting as more flexible than bagging and random forest. The parallel ensembles, bagging and random forest, grow the trees on the different resampled datasets independently and perform majority voting to obtain a target output. Boosting is a sequential ensemble approach that uses information from previous classifiers to eventually build a strong classifier. Boosting also uses resampling with replacement of observations but does this on weighted data, which introduces extra complexity in understanding the model. This weighted data approach helps to tackle problems more flexibly. We classify adaptive boosting as less flexible than gradient boosting for 2 reasons:

1. Gradient Boosting (Friedman, 2001) is a more generic approach than adaptive boosting for finding approximate solutions to the additive modeling problem of composing a strong learner out of a set of weak learners. Adaptive Boosting (Freund & Schapire, 1995) is a special case with a particular loss function and, hence, is less flexible than the generic approach of gradient boosting.
2. Adaptive boosting is based on high-weight data points whereas gradient boosting uses the logic of creating a strong learner by determining the steepest descent of the gradient.

eXtreme Gradient Boosting is even more flexible than gradient boosting since it adds another layer of flexibility to the gradient boosting method by wrapping a regularization term around it. Assessing the performance of tree-based models in Formula One is one of the aims of this dissertation (cf. supra Contribution to literature). Hence, considerable attention is given to tree-based models in the modeling step of the SRP-CRISP-DM methodology. As mentioned before, the trade-off between interpretability and flexibility (James et al., 2013) will be investigated using tree-based models. The used algorithms are shown in Figure 23 with their associated level of interpretability and flexibility.


Figure 23. Tree-based models and the flexibility/interpretability trade-off (James et al., 2013)

Table 10. Learners used in the analyses

Type | Model | Hyperparameter | Literature
Single learner | DTC | Complexity parameter | Wehenkel et al. (1994); Yan et al. (2016)
Ensemble learner: based on the bagging principle | BAG | / | Galar et al. (2011)²³
Ensemble learner: based on the bagging principle | RF | Number of predictors; minimum node size | Bernard, Heutte and Adam (2009); Hastie, Tibshirani and Friedman (2009); Lin and Jeon (2006)
Ensemble learner: based on the boosting principle | ADA | Maximum depth tree | Xia et al. (2017)
Ensemble learner: based on the boosting principle | GB | Maximum depth tree | Xia et al. (2017)
Ensemble learner: based on the boosting principle | XGB | Maximum depth tree; nrounds | Nguyen, Bui, Bui and Cuong (2019)

23 Galar et al. (2011) do not state that the hyperparameters of the bagging procedure should not be tuned. Our aim is to assess the impact of class imbalance treatment strategies independently of the hyperparameters. We want to study the synergy between bagging predictors and class imbalance treatment as stated by the authors. Thus, the default settings for the hyperparameters of the bagging procedure are used, meaning we will not go into detail on these values. Consequently, the tuning parameters for this model remain a black box (implemented in R via caret and treebag).


Class imbalance

The grounds for integrating class imbalance treatment in our study of tree-based models in Formula One can be found in the paper of Galar et al. (2011). The authors review the relationships between ensembles and the class imbalance problem. In their study, ensembles comprise bagging-, boosting- and hybrid-based approaches, and these show good behavior when combined with sampling strategies to treat the class imbalance. The high relevance of this study is reinforced by the fact that its focus is on two-class imbalanced datasets, which is the same situation as our 3 binary classification analyses. In a binary setting, the authors define class imbalance as one of the classes having a high occurrence and, consequently, the other class a low occurrence. Concretely, the target classes are unevenly distributed, leading to a disproportionately large share of one class that the learner can use to train the models. Hence, the learner has more observations from one class to learn from and, thus, more information to fit the model for this class. Class imbalance treatment approaches solve this issue by evening out the proportions of both classes. Moreover, the authors stress the positive synergy between the bagging procedure and these treatment strategies. Hence, we will investigate this positive synergy in Formula One (cf. supra Literature: Contribution to literature, cf. infra Discussion: Results). Looking at our analyses, we discover highly imbalanced classes, especially in the top 3 performance analysis and the qualifying ability analysis, as shown in Table 11.

Table 11. Class imbalance in training set

Analysis | Class | Occurrence in training set | Expressed in %
Top 3 performance | 0 | 13976 | 86.65%
Top 3 performance | 1 | 2154 | 13.35%
Race completion | 0 | 6804 | 42.18%
Race completion | 1 | 9325 | 57.82%
Qualifying ability | 0 | 718 | 4.26%
Qualifying ability | 1 | 16121 | 95.74%

Moreover, Chawla (2009) states that many problems in the field of machine learning are characterized by imbalanced data. Many algorithms suffer from reduced performance when the data contains imbalanced classes (Van Hulse, Khoshgoftaar & Napolitano, 2007). Batista, Prati and Monard (2004) note that the learning system might have difficulties learning the minority class in imbalanced data. Thus, highly imbalanced target values can be rather hard to predict.


Therefore, He, Bai, Garcia and Li (2008) define 5 different approaches to deal with class imbalance: (1) using sampling strategies, (2) generating synthetic data, (3) applying cost-sensitive learning, (4) utilizing active learning, and (5) integrating the imbalance in kernel-based methods. We focus on the first two techniques based on 2 contributions of the paper of Burez and Van den Poel (2009). The authors state that under-sampling the majority class, part of treatment technique (1), can lead to higher prediction accuracy, especially when the accuracy is evaluated with the AUC. Moreover, the authors suggest that an advanced technique like SMOTE, part of technique (2), might perform better than over-sampling the minority class, part of technique (1). To assess these statements in a Formula One setting, we include the use of (1) sampling techniques and (2) synthetic data to deal with the imbalance of the target classes in our analyses.

(1) The sampling strategy approach provides a way to cope with class imbalance by over-sampling²⁴ the minority class or under-sampling the majority class. The minority class is the least occurring binary outcome of the response and the majority class the most occurring one. Over-sampling applied to the training set of the top 3 performance analysis will result in both classes having 11919 observations (cf. supra Table 11) and under-sampling will result in both classes having 2142 observations. Barandela, Valdovinos, Sánchez and Ferri (2004) state that under-sampling the majority class is advised when the class imbalance is not severe; otherwise they propose over-sampling. The top 3 performance and qualifying ability analyses have a severe class imbalance whereas the race completion analysis has a moderate class imbalance. Consequently, we apply both sampling strategies to our 3 analyses and evaluate their results using the evaluation metrics (cf. Model evaluation). Another insight from Burez and Van den Poel (2009) is that there is no need to perform a sampling strategy when the training set has as many positives as negatives.

(2) Generating synthetic data can be a means to overcome the class imbalance. An example is the SMOTE algorithm, which shifts the learning bias to the minority class by generating an arbitrary number of synthetic minority examples. Drummond and Holte (2003) conclude that over-sampling is ineffective since little to no change in performance occurs. Therefore, Chawla, Bowyer, Hall & Kegelmeyer (2002) propose SMOTE (Synthetic Minority Over-sampling TEchnique) to deal with class imbalance instead of the (1) sampling strategies. The authors state that over-sampling will lead to overfitting in a decision tree since the algorithm learns more and more specific regions of the minority class.

24 Over-sampling and up-sampling as well as under-sampling and down-sampling are synonyms and will be used interchangeably.


Therefore, they create SMOTE, inspired by a technique in handwritten character recognition (Ha & Bunke, 1997), which over-samples the minority class by crafting synthetic examples instead of over-sampling with replacement. We apply ADASYN (Adaptive Synthetic Sampling Method for Imbalanced Data) (He, Bai, Garcia & Li, 2008), a technique in the SMOTE algorithm family, as a second measure to deal with class imbalance. ADASYN is an adaptive synthetic sampling approach for imbalanced learning that applies a weighted distribution based on the difficulty level in learning. The algorithm reduces the bias introduced by the class imbalance and shifts the classification decision boundary toward the more difficult examples. The authors describe their ADASYN procedure, which uses the training data set as input, in 6 steps (pp. 1323-1324):

1. Compute the ratio of minority to majority examples:
\[ d = \frac{m_s}{m_l} = \frac{\#\text{minority examples}}{\#\text{majority examples}} \]
2. Calculate the number of synthetic minority examples to generate:
\[ G = (m_l - m_s) \times \beta \]
with β equal to the desired level of class imbalance.
3. Locate the k-Nearest Neighbors of every minority example and calculate the r_i value, which refers to the dominance of the majority class in a neighborhood:
\[ r_i = \Delta_i / K \]
4. Normalize the r_i values so that they sum up to 1:
\[ \hat{r}_i = \frac{r_i}{\sum_{i=1}^{m_s} r_i} \]
5. Calculate the number of synthetic examples to generate for every neighborhood:
\[ g_i = \hat{r}_i \times G \]
6. Generate new data for each neighborhood:
\[ s_i = x_i + (x_{zi} - x_i) \times \lambda \]
with λ being a random number between 0 and 1.

Conclusively, the 4 different approaches we use for treating the class imbalance in the training set are:

1. No class imbalance treatment as the baseline approach
2. Over-sampling the minority class as a sampling strategy
3. Under-sampling the majority class as a sampling strategy
4. ADASYN, as part of the SMOTE family, which generates synthetic examples


Model Evaluation

The 'Model Evaluation' part measures the predictive performance of the tree-based methods in Formula One using 7 evaluation metrics and 3 plotting techniques. The aim of the evaluation section is threefold:

1. Applying threshold-dependent and threshold-independent metrics (accuracy, sensitivity, specificity, Area Under the Receiver Operator Curve, lift)
2. Discovering the relevant features (Variable Importance Plot)
3. Visualizing the relationship between relevant features and target on an individual level (ICE and SHAP plots)

Concretely, the seven evaluation metrics we use are the confusion matrix, accuracy, sensitivity, specificity, AUC, lift, and F1 score, and the three evaluation plots are VIPs, ICE, and SHAP plots.

Confusion matrix

The confusion matrix gives insight into the amount of correctly and wrongfully classified predictions based on the actual values. The confusion matrix is used in the study of Townsend (1971), in which the performance of a model for alphabetical recognition is assessed. Moreover, this evaluation instrument is applied in many other fields. For example, Inouye, van Dyck, Alessi, Balkin, Siegal and Horwitz (1990) apply the confusion matrix as an evaluation instrument for positive and negative prediction accuracy of delirium detection in high-risk settings. Since this technique is widely used in the literature as an evaluation metric, the confusion matrix is also used in our Formula One research as a measure of the amount of correctly predicted outcomes. The confusion matrix introduces the concepts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

TP = correctly classified positive²⁵ target
TN = correctly classified negative target
FP = predicted as positive, labeled as negative
FN = predicted as negative, labeled as positive

The positioning of these 4 metrics in the confusion matrix is shown in Table 12.

25 A positive target is a 1 and a negative target is a 0.


Table 12. Confusion matrix

Prediction \ Actual (label) | Positive (1) | Negative (0)
Positive (1) | True Positive (TP) | False Positive (FP)
Negative (0) | False Negative (FN) | True Negative (TN)

Accuracy, sensitivity and specificity

The TP, TN, FP, and FN of the confusion matrix introduce the following concepts: accuracy, sensitivity, and specificity. The accuracy (ACC) is the total proportion of correctly classified outcomes. Story and Congalton (1986) define the overall accuracy as the number of entries on the diagonal divided by the number of samples. The total number of observations is equal to the actual positives (P_o) plus the actual negatives (N_o).

\[ \text{Actual positives: } P_o = TP + FN \]
\[ \text{Actual negatives: } N_o = FP + TN \]

Thus, the accuracy is calculated as follows, based on Table 12:

\[ ACC = \frac{TP + TN}{P_o + N_o} \]

Sensitivity and specificity are 2 concepts that focus on positive and negative predicted outcomes, respectively. Parikh, Mathai, Parikh, Sekhar and Thomas (2008) discuss the essentials of sensitivity and specificity in a day-to-day clinical practice application; the formulas the authors apply are adapted here based on the confusion matrix in Table 12. Sensitivity (SEN) is defined as the proportion of positive labels correctly predicted:

\[ SEN = \frac{TP}{P_o} \]

Specificity (SPC) is defined as the proportion of negative labels correctly predicted:

\[ SPC = \frac{TN}{N_o} \]
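These three metrics follow directly from the confusion matrix, as the sketch below shows for factor vectors pred and actual with levels "0" and "1" (illustrative names).

    cm <- table(Prediction = pred, Actual = actual)
    TP <- cm["1", "1"]; TN <- cm["0", "0"]
    FP <- cm["1", "0"]; FN <- cm["0", "1"]

    ACC <- (TP + TN) / (TP + TN + FP + FN)
    SEN <- TP / (TP + FN)   # proportion of positive labels correctly predicted
    SPC <- TN / (FP + TN)   # proportion of negative labels correctly predicted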

The ACC, SEN, and SPC give a first indication of the prediction accuracy of the models. However, the accuracy of a model that always predicts the majority class is high in an imbalanced dataset. For example, suppose 85% of the target values in the test set equal 0 and 15% equal 1. A model that always predicts 0 will reach a test set accuracy of 85% based on the ACC formula. This is rather misleading since the model simply always predicts 0. SEN and SPC are better in the sense that they can indicate problems with predicting one class in a binary classification setting. Unfortunately, the SEN and SPC have to be combined to get an overall view of model performance, and this brings us back to the ACC. To get a better understanding of model performance, the AUC, a threshold-independent approach, is discussed.

AUC

The Area under the Receiver Operator Curve (AUROC, in short AUC) is a performance measure that has been used in many class-imbalance learning and cost-sensitive learning studies (Gao et al., 2013). Metz (1978) explains the basic principles of ROC analysis and points to the threshold-independence of the AUC. Thus, the AUC has a major advantage over the accuracy metric since it does not depend on the arbitrary selection of a decision threshold. In Figure 24, an example of a ROC curve is shown. The True Positive Rate (TPR) is represented by the y-axis and the False Positive Rate (FPR) by the x-axis. The TPR is equal to the sensitivity (SEN) and the FPR can be calculated as 1 – specificity (SPC).

Figure 24. ROC curve

Metz states that by varying the decision threshold a compromise between the TPR and the FPR can be made. Thus, the ROC can be interpreted as the TPR versus the FPR over all possible thresholds. The AUC is the integral under the orange ROC curve shown in Figure 24 and is calculated via the following formula:

\[ AUC = \int_0^1 \frac{TP}{P_o} \, d\!\left(\frac{FP}{N_o}\right) \]


The AUC is a value between 0.5 and 1 (Rosner et al., 2015), with 0.5 indicating no class distinction capability and 1 indicating perfect classification. An AUC of less than 0.5 means that the model is better at predicting the opposite class than the actual class, so its predictions can be flipped, yielding an AUC above 0.5. The AUC has support from literature in analyses where statistical performance is paramount (Backiel, Baesens & Claeskens, 2014). On the contrary, Lobo, Jiménez-Valverde and Real (2008) call the AUC a misleading metric for the performance of predictive models. Nevertheless, we will use the AUC as our most important metric since it will play an important part in the hyperparameter tuning. Concretely, the AUC will be the metric used to optimize the hyperparameter grid of our fitted tree-based models during the 5-fold cross-validation (cf. supra Modeling: Validation).
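
As an illustration, the AUC can be computed without choosing any threshold via the Wilcoxon/Mann-Whitney rank identity: the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A minimal base-R sketch, assuming hypothetical 'scores' (predicted probabilities) and 'labels' (0/1 targets):

auc <- function(scores, labels) {
  r  <- rank(scores)                 # ties receive average ranks
  po <- sum(labels == 1)             # number of actual positives
  no <- sum(labels == 0)             # number of actual negatives
  (sum(r[labels == 1]) - po * (po + 1) / 2) / (po * no)
}

auc(c(0.9, 0.4, 0.7, 0.2), c(1, 0, 1, 0))   # perfect ranking, returns 1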

Lift

The lift evaluation metric describes the improvement of the model's selection compared to a random selection (Lo, 2002). We will discuss 3 different formulas related to the interpretation of lift: baseline lift, top N lift, and max lift. To compute the top N lift, we take the top N% of observations that are most likely to be 1 and, afterward, calculate the proportion of correct classifications against the actual labels. Van den Poel, De Schamphelaere and Wets (2004) introduce the lift concept in the do-it-yourself market to quantify the direct and indirect effects of retail promotions on sales and profits. The authors define lift as a measure of the strength of a complementary relationship, measured via:

"Lift = P(Y|X) / P(Y) = P(X ∧ Y) / (P(X) · P(Y))" (p. 56)

The formulas for top N lift, baseline lift, and the max lift are based on this formula.

top N lift = (top N proportion of 1) / (overall proportion of 1) = P(Y|top N) / P(Y)

The baseline lift is equal to 1 since no complementary relationship is introduced:

baseline lift = (overall proportion of 1) / (overall proportion of 1) = P(Y) / P(Y) = 1


The max lift is calculated to know how far a random selection from the dataset lies from the maximum, which is classifying all 1s correctly. Thus, the formula for max lift is as follows:

max lift = 1 / (overall proportion of 1) = 1 / P(Y)

We will focus on the top decile lift (N = 10), i.e. (top 10% proportion of 1) / (overall proportion of 1), as an evaluation metric to determine the augmented performance of the model's selection when a complementary relationship is known.
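
A minimal sketch of the top decile lift in base R, again with hypothetical 'scores' and 'labels' vectors:

top_decile_lift <- function(scores, labels, prop = 0.10) {
  n_top <- ceiling(prop * length(scores))
  top   <- order(scores, decreasing = TRUE)[1:n_top]   # observations most likely to be 1
  mean(labels[top]) / mean(labels)   # top N proportion of 1 / overall proportion of 1
}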

F1 score

The F1 score, inspired by the work of Van Rijsbergen (1979), can be calculated using the confusion matrix in Table 12. The F1 measure uses the recall, which is the sensitivity as defined above, and the precision, which is the number of true positives compared to the total number of predicted positives.

Recall = sensitivity = TP / P_o = TP / (TP + FN)

Precision = TP / (TP + FP)

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Powers (2011) states that the F1 metric neglects the True Negatives, making it a misleading metric, especially in cases of class imbalance. However, the F1 score has endured this criticism, as it is still used in many studies (Chicco & Jurman, 2020). Moreover, 'what would a Formula One study be without including the F1-score metric?'. Thus, we will include this metric in the results, but it will be interpreted with caution, just like the accuracy. This brings us to the behavior of the metrics in case of class imbalance.
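
Reusing the confusion-matrix counts from the earlier sketch, the F1 score follows in three lines of base R:

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)     # identical to the sensitivity
F1        <- 2 * precision * recall / (precision + recall)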


Metric behavior in imbalanced classes

Not all the metrics mentioned above have an equal impact on the interpretation of the results. Metrics like the accuracy will be mentioned, but a model that always predicts one of the classes will score well on accuracy in a highly imbalanced setting. The F1 score is also affected in case of severe class imbalance. The lift calculates the improved performance of the model if the relationship between X and Y is known. The sensitivity and specificity depict the prediction accuracy for the positive (P_o) and negative (N_o) labels, respectively. These metrics will be discussed together to get a parsimonious overview of predictive performance. However, the most important evaluation metric in our analyses is the threshold-independent AUC. In conclusion, we use 7 different metrics to measure prediction accuracy in our 3 analyses. The results in the discussion section will be discussed taking into account these 7 evaluation metrics to build a well-considered conclusion. These metrics, their relationship with class imbalance, and their references are postulated in Table 13.

Table 13. Metrics, reasoning regarding class imbalance and reference

Metric    Reasoning                                                                  Reference
ACC       A threshold-dependent metric, highly dependent on class imbalance         Story and Congalton (1986)
SEN/SPC   Used in combination to overcome the flaw of only considering one class    Parikh, Mathai, Parikh, Sekhar and Thomas (2008)
AUC       Threshold-independent metric, performs well in cases of class imbalance   Metz (1978)
Lift      Shows potential, more applicable in database marketing                    Van den Poel, De Schamphelaere and Wets (2004)
F1        Highly dependent on class imbalance, focus on positive predictions        Van Rijsbergen (1979)


Three evaluation plots support the discussion of the different models' performance in our 3 analyses: (1) the Variable Importance Plot, (2) the Individual Conditional Expectation plot, and (3) the SHAP plot.

Variable importance

A crucial aspect is finding out which variables play an important role in the predicted outcome. The feature permutation importance was initiated by Breiman (2001), who applied it to random forests. Molnar (2019) states that a feature can be classified as important if the model accuracy decreases when its values are randomly shuffled, meaning the model heavily depends on this feature to make predictions. On the contrary, the author denotes a feature as unimportant when permuting that feature does not have an impact on the model accuracy. Genuer, Poggi and Tuleau-Malot (2010) study the variable importance in variable selection using random forests. The authors investigate the sensitivity of the importance plots for random forests to altering the number of considered predictors at every split and the number of trees grown. They find that the variables are ranked the same when these hyperparameters are altered. However, increasing the number of trees reduces the confidence interval of the importance of a specific feature. Moreover, the differences between variable importances are bigger when more predictors are considered at every split.

We consider two ways of determining the feature importance: Gini importance and permutation importance. The Gini importance method looks at the mean decrease in Gini, i.e. the mean decrease in node impurity, when the variable is left out of the analysis. Thus, this measure indicates the purity the variable adds to the model. The permutation importance method stipulates the mean decrease in accuracy when the feature is permuted with random noise. Strobl et al. (2007) state that the Gini importance is not a reliable metric when features vary in their scale of measurement or their number of categories, especially in the fields of genomics and computational biology. In that study, permutation importance is advised as the metric to use when estimating feature importance. However, Strobl et al. (2008) postulate that the permutation importance overestimates the importance of correlated features. We pick the Gini importance as the metric for the variable importance plots in our analysis for 2 reasons:

1. In our study, we do not have many features with a lot of different categories, which can cause severe problems for classification studies (Micci-Barreca, 2001).
2. There might be a high correlation between some features, leading to overestimations by the permutation importance method (Strobl et al., 2008).


Concretely, a variable importance plot (VIP) displays the features on the x-axis and the mean decrease in Gini importance on the y-axis. This Gini importance measure is rescaled so that the most important feature equals 100, making it easier to compare the importance of the other features to it. These VIPs are bar plots, and the height of the bar for a specific feature represents its importance. Multiple VIPs will be used when discussing the results in Part 4 of this dissertation to discover the most important features for a specific analysis with a tree-based model using one of the 4 approaches to class imbalance.
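
The sketch below shows how such a Gini-importance VIP could be produced with the randomForest package; the 'basetable' data frame and 'top3' target are placeholders rather than the exact objects of our code base (cf. Appendix 5).

library(randomForest)

# Fit a forest on the (placeholder) basetable with a binary target
fit <- randomForest(factor(top3) ~ ., data = basetable, ntree = 500)

imp <- importance(fit, type = 2)        # type = 2: mean decrease in Gini impurity
imp <- 100 * imp / max(imp)             # rescale so the top feature equals 100
varImpPlot(fit, type = 2, n.var = 10)   # plot the 10 most important features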

Individual Conditional Expectation (ICE)

The Individual Conditional Expectation plot shows the relationship between a predictor, for example the 'homeAdvantage' feature, and the target output at the level of individual observations (Goldstein, Kapelner, Bleich & Pitkin, 2015). A related plot is the Partial Dependence Plot (PDP), which depicts the relationship between the outcome and the feature of interest in a low-dimensional way (Greenwell, 2017). The difference between ICE and PDP is that the PDP is a global approach, whereas ICE is instantiated per observation. We prefer the latter since ICE scrutinizes the relationship between the feature and the target per instance. Moreover, one could wrongly derive a non-existent relationship between feature and target from a flat line in a PDP, while a relationship could still be present at the level of specific instances (Molnar, 2019). We will investigate the ICE plots of the most important features of the considered model based on the feature importance measures of the VIP.
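
A minimal sketch of an ICE plot with the pdp package, reusing the placeholder 'fit' and 'basetable' from the previous sketch and the 'grid' feature as predictor of interest:

library(pdp)

# One curve per observation; prob = TRUE returns probabilities for classification
ice <- partial(fit, pred.var = "grid", ice = TRUE, prob = TRUE, train = basetable)
plotPartial(ice, alpha = 0.1)   # pale individual curves reveal instance-level heterogeneity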

SHapley Additive exPlanation (SHAP)

The SHAP (SHapley Additive exPlanation) values (Lundberg & Lee, 2017) find their origin in game theory (Shapley, 1953), where Shapley values were used to fairly distribute the gains and costs among actors working in a coalition. Lundberg and Lee use these SHAP values to facilitate the interpretation of prediction results for complex models. Hence, these values aid in interpreting flexible models, meaning they are closely related to the interpretability/flexibility trade-off (James et al., 2013). Moreover, these values could potentially help to mitigate this trade-off since a complex model could be made interpretable. We will test this by applying the SHAP values to the most flexible model included in this dissertation: XGBoost. Lundberg, Erion and Lee (2018) show that existing feature attribution methods are inconsistent and, hence, propose the SHAP values, which yield locally accurate attribution values.


Lundberg et al. (2019) state that the SHAP values can help to explain tree-based models by taking "black box" prediction models as input and generating "white box" local explanations. Combining these local explanations can lead to global model insights utilizing feature dependence, interaction effects, model monitoring, explanation embeddings, and model summarization.
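
For tree ensembles fitted with the xgboost package, SHAP values can be obtained directly from the predict function. A minimal sketch, assuming a numeric feature matrix 'X' and a 0/1 target 'y' (both placeholders):

library(xgboost)

dtrain <- xgb.DMatrix(data = X, label = y)
bst    <- xgb.train(params = list(objective = "binary:logistic"),
                    data = dtrain, nrounds = 100)

# One SHAP value per feature per observation; each row plus the bias column
# sums to that observation's raw (log-odds) prediction
shap <- predict(bst, dtrain, predcontrib = TRUE)
xgb.plot.shap(data = X, model = bst, top_n = 5)   # summary for the 5 top features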

Model deployment

Idea

The main idea is to establish a connection with the Ergast API to automate the SRP-CRISP-DM process based on the elaboration of Bunker and Thabtah (2019). This automation regularly updates the basetable with new data. The augmented basetable is used to adjust the training and the test set. The adjusted training set is then used to retrain the model and make predictions for new races. The prediction results are evaluated on the adjusted test set to discover the potential increase in prediction accuracy when new data is added to the basetable. An interesting study that could add complementary value here is the paper of Oza (2005). The author explains 'online learning' as an approach in which the models are not continuously retrained, because the data only arrives after some time. This is the case here, since the data becomes available only after a GP is completed and is then adapted in the Ergast API.

Implementation

This idea is implemented in RStudio by directly linking to the Ergast API's zip file, which contains the data sets used in our analyses. This file is unzipped and the data sets are loaded into the RStudio environment. Functions were constructed for the different steps of the SRP-CRISP-DM process to enable all sorts of readers to make predictions for top 3 performance, race completion, and qualifying ability. The build-up of the code can be found in Appendix 5. A side note here is that the 'rain', 'rainExp', and 'homeAdvantage' features are not included in these analyses since this information is not present in the Ergast API. However, we believe the other 34 features will give valuable insights into these 3 analyses.
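
A minimal sketch of this connection, assuming the Ergast CSV dump remains available at the URL below (the exact address may change over time):

url <- "http://ergast.com/downloads/f1db_csv.zip"   # assumed location of the CSV dump
tmp <- tempfile(fileext = ".zip")

download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "ergast_csv")                     # one CSV file per data set
results <- read.csv("ergast_csv/results.csv")        # e.g., the race results table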


Back on track

In Figure 25, a recap of the followed steps, as defined by SRP-CRISP-DM, is shown to bring the reader ‘back on track’ and ready for the discussion section.

Figure 25. SRP-CRISP-DM as fundament for our analyses


Part 4: Discussion

Results

Class imbalance

In our analysis, the imbalance in the classes of the target is treated via different methods: over-sampling the minority class, under-sampling the majority class, and ADASYN. These class imbalance treatment techniques have generated very positive outcomes for the performance of the applied models (Galar et al., 2011). However, applying these techniques to classes that are already perfectly balanced does not produce additional prediction accuracy for the model (Burez & Van den Poel, 2009). The class imbalance assessment (cf. supra Modeling: Class Imbalance) indicates some extent of imbalance in the 3 analyses, leading to an opportunity to increase the prediction accuracy by applying treatment approaches. Since the models are fit on the training set, we will only assess the class imbalance on this set. The distribution of the classes in the target outputs can be found in Tables 14, 15, and 16.

Table 14. Class imbalance in the training set of the top 3 performance analysis

Top 3   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0       13976 (86.65%)      13976 (50%)             2154 (50%)               13976 (50.42%)
1       2154 (13.35%)       13976 (50%)             2154 (50%)               13741 (49.58%)

Table 15. Class imbalance in the training set of the race completion analysis

Race completion   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0                 6804 (42.18%)       9325 (50%)              6804 (50%)               8492 (47.66%)
1                 9325 (57.82%)       9325 (50%)              6804 (50%)               9325 (52.34%)


Table 16. Class imbalance in the training set of the qualifying ability analysis

Qualifying ability   # in training set   # in over-sampled set   # in under-sampled set   # in ADASYN set
0                    718 (4.26%)         16121 (50%)             718 (50%)                16187 (50.10%)
1                    16121 (95.74%)      16121 (50%)             718 (50%)                16121 (49.90%)

The 3 tables above show that the top 3 analysis and the qualifying ability analysis suffer from a higher class imbalance than the race completion analysis. We will assess whether these treatment techniques add more prediction accuracy to the top 3 and qualifying ability analyses compared to the race completion analysis. The ADASYN approach creates highly balanced classes by generating synthetic observations for the minority class.
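
A minimal sketch of the over- and under-sampling strategies using the caret package; 'train_x' (features) and 'train_y' (the target as a two-level factor) are placeholders:

library(caret)

over  <- upSample(x = train_x, y = train_y)     # duplicate minority cases until balanced
under <- downSample(x = train_x, y = train_y)   # drop majority cases until balanced
table(over$Class); table(under$Class)           # both are now perfectly balanced

# ADASYN instead generates synthetic minority observations; in R it is available
# in, for example, the smotefamily package (ADAS function).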

Performance of tree-based models

In this section, the predictive performance results for the tree-based models (cf. supra Part 3 Research: Modeling: Tree-based models), (1) Decision Tree Classifier, (2) Bagging, (3) Random Forest, (4) Adaptive Boosting, (5) Gradient Boosting, and (6) eXtreme Gradient Boosting, will be discussed. The evaluation metrics used to assess the prediction accuracy for our 3 analyses are (1) Confusion Matrix, (2) Accuracy, (3) Sensitivity, (4) Specificity, (5) AUC, (6) Lift, and (7) F1 score, and the 3 plots used to support the findings are (1) Variable Importance Plot, (2) Individual Conditional Expectation Plot, and (3) SHAP Plot (cf. supra Part 3: Evaluation). Specific attention will be given to three important influencing factors for our predictions:

1. The used tree-based model, its position on the interpretability/flexibility trade-off (James et al., 2013), and the impact on the prediction accuracy. We decide to focus on 6 evaluation metrics, each with a different importance (cf. supra Research: Model evaluation: Metric importance), as the ground of comparison for the different models.
2. The class imbalance treatment via a sampling strategy: (1) no sampling strategy, (2) over-sampling the minority class, (3) under-sampling the majority class, and (4) ADASYN. Furthermore, we will assess whether the bagging and boosting procedures create synergy together with these sampling strategies for the predictive performance of the tree-based models (Galar et al., 2011).


3. The relationship between the most important feature(s), as determined by the VIP, and the target output on a specific instance level is discussed via the ICE and the SHAP plots.

A remark here is that not all the VIPs, ICE plots, and SHAP plots will be included in our analyses. We use 6 different tree-based models in our 3 analyses with 4 approaches to class imbalance, leading to a total of 72 different configurations. Every single configuration is represented by a tree in Figure 26.

Figure 26. All ‘model’-‘class imbalance’ configurations of our analyses

The results of the evaluation for the top 3 performance can be found in Tables 18, 19, 20, and 21, filtered by class imbalance treatment strategy. Parallel to these results for 'top 3 performance', the results of the race completion and qualifying ability predictions can be found in Tables 22-25 and 26-29, respectively. A different approach is put forward for the evaluation plots; we will focus the evaluation plots for every analysis on one particular class imbalance treatment strategy. We will assess which class imbalance treatment strategy achieves the highest prediction accuracy for every analysis. Then, we will focus on the best performing class imbalance strategy to construct the VIPs, ICE, and SHAP plots for that analysis.


To reduce the repetitiveness of the same rationale for every model, a focus is defined for every model as shown in Table 17, with support from literature for this focus.

Table 17. Focus of every model

Decision tree: The decision trees (Quinlan, 1986), as the only single learner in our analyses, deserve considerable attention since they are at the basis of the ensemble models discussed.

Bagging: Galar et al. (2011) discuss the synergy between the bagging procedure and class imbalance treatment. Hence, we will focus on this synergy in a Formula One environment.

Random forest: Breiman (2001) introduces with the Random Forest an algorithm that can devote attention to the importance of the features.

Adaptive boosting: Chu and Zaniolo (2004) describe adaptive boosting as a fast, resource-light, and easily adaptable modeling approach. We will assess whether this model uses a limited number of features to make predictions.

Gradient boosting: The greedy approach of the Gradient Boosting (Friedman, 2002) model will be compared to the extra greedy approach of the XGBoost model (Chen, 2015).

XGBoost: Differently put, we will study whether the regularization term (Chen & Guestrin, 2016), which is the extra flexibility of the model, yields better or worse results in a Formula One setting. Better results can be obtained if the model captures extra complexity, or worse results if the model overfits the training set, resulting in extra noise in the model.


Top 3 finish analysis

Evaluation metrics

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the tree-based models for top 3 performance will be discussed. The results of the best performing model for every class imbalance approach are marked in blue in the tables below. Moreover, the results of the best model for every metric are put in bold.

Table 18. No class imbalance treatment 'top 3 analysis'

DTC BAG RF ADA GBM XGB

ACC 0.864 0.873 0.882 0.884 0.883 0.885

SEN 0.403 0.369 0.335 0.424 0.412 0.431

SPC 0.934 0.951 0.966 0.955 0.956 0.955

AUC 0.669 0.660 0.650 0.690 0.684 0.693

Lift 3.688 3.753 3.515 4.263 4.165 4.306

F1 0.441 0.437 0.430 0.494 0.485 0.501

Table 19. Over-sampling 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.825 0.865 0.858 0.810 0.810 0.843

SEN 0.581 0.470 0.664 0.834 0.856 0.720

SPC 0.863 0.926 0.888 0.807 0.803 0.861

AUC 0.722 0.698 0.776 0.821 0.830 0.791

Lift 2.81 3.645 3.460 3.026 3.124 3.243

F1 0.470 0.481 0.555 0.540 0.546 0.550


Table 20. Under-sampling 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.759 0.791 0.799 0.801 0.804 0.803

SEN 0.777 0.846 0.863 0.846 0.857 0.868

SPC 0.758 0.783 0.789 0.795 0.797 0.793

AUC 0.767 0.814 0.826 0.820 0.827 0.830

Lift 2.538 2.896 2.961 2.951 3.005 2.983

F1 0.463 0.519 0.534 0.532 0.539 0.540

Table 21. ADASYN 'top 3'

DTC BAG RF ADA GBM XGB

ACC 0.856 0.870 0.874 0.875 0.850 0.884

SEN 0.458 0.466 0.505 0.510 0.689 0.459

SPC 0.917 0.932 0.930 0.931 0.874 0.950

AUC 0.687 0.699 0.718 0.721 0.782 0.704

Lift 3.374 3.862 3.905 3.949 3.352 4.437

F1 0.459 0.489 0.516 0.523 0.550 0.514

The prediction accuracy of the baseline model for the 'top 3 performance' increases noticeably when class imbalance treatment strategies are applied. For example, the Random Forest model yields an AUC of 0.650 with no class imbalance treatment, whereas it reaches an AUC of at least 0.718 (ADASYN) and at most 0.826 (under-sampling). In general, the highest evaluation scores are obtained for the under-sampling approach, which achieves the highest sensitivity of the 4 approaches overall as well as the highest AUC (0.830, for Gradient Boosting with over-sampling, tied with XGBoost combined with under-sampling), both important metrics. This supports the statement that under-sampling techniques can lead to higher prediction accuracy when evaluated with the AUC (Burez & Van den Poel, 2009). The over-sampling approach combined with the boosting models yields similar prediction accuracy to under-sampling, but the under-sampling method achieves better results for the DTC, Bagging, and Random Forest than their over-sampling counterparts. Consequently, we will focus on the evaluation plots of the under-sampling strategy in the next section. In Figure 27, the ranking averaged over the 4 class imbalance treatment approaches for every model per performance metric is shown. This figure indicates the better performance of ensemble learners compared to the single learner. Moreover, bagging-based procedures perform worse than boosting-based procedures. Additionally, we observe that the XGBoost model obtains the highest average rank when eyeballing this plot. Hence, we will devote some extra attention to the XGBoost model in combination with the under-sampling approach.

Figure 27. Average rank per performance metric for top 3 analysis

Decision tree

In Figure 28, a decision tree can be found for the top 3 analysis, with the 'grid' feature determining the binary split. Thus, this decision tree suggests that one feature is enough to classify a driver as either in the top 3 or not. However, a single decision tree suffers from high variance (James et al., 2013) and, thus, we will look at either the cross-validated results for the DTC with tuned hyperparameters or the other models included in this dissertation.


Figure 28. Decision tree top 3 under-sampled

The Variable Importance Plot (VIP) in Figure 29 shows the importance of the 'grid' and 'pointsSeason' variables in our top 3 finish analysis. The top 10 most important features of the top 3 analysis with under-sampling the majority class are shown on the y-axis. The x-axis contains the importance value, which is the mean decrease in Gini impurity (cf. supra Research: Modeling: DTC).

Figure 29. VIP DTC top 3 analysis under-sampled


The Individual Conditional Expectation (ICE) plot displays the relationship between a predictor and the target output for specific instances. In Figure 30, the relationship between the grid, the most important feature for the DTC model, and the target 'top 3' is displayed. The yellow line on the plot is the aggregated result of the specific instances. We see that a higher grid position (starting further back) results in a lower probability of finishing in the top 3.

Figure 30. ICE DTC top 3 analysis

Bagging

We assess the synergy of the bagging procedure and class imbalance treatment approaches (Galar et al., 2011). The biggest improvement in prediction accuracy is noticed for the under-sampling approach based on AUC and SEN. The over-sampling method yields an extremely high SPC in combination with Bagging. In Figure 31, the VIP of Bagging in the case of under-sampling shows the importance of the 'grid' and 'pointsSeason' features to predict top 3 performance.

Figure 31. VIP bagging top 3 under-sampled


Random forest

Similar to Bagging, Random Forest yields the best prediction accuracy for top 3 performance in combination with under-sampling the majority class. Again, the 'grid' and 'pointsSeason' are the most important features in the Random Forest approach.

Figure 32. VIP random forest top 3 under-sampled

Adaptive boosting

This approach is known for using few resources, meaning the algorithm tries to use as few features as possible to make predictions, resulting in an adaptive and lightweight algorithm (Chu and Zaniolo, 2004). This can be seen in Figure 33, where the predictions are almost solely based on the 'grid' and the other features have little to no importance. Minor importance is found by this algorithm in 'pointsSeason', 'pastTop3', and 'teamMatePointsSeason'.

Figure 33. VIP adaptive boosting top 3 under-sampled


Gradient boosting

Similar to Adaptive boosting, we notice in Figure 34 the high importance of the 'grid' to predict top 3 performance using Gradient boosting compared to the other features. Moreover, the pointsSeason feature is twice as important as the remaining features.

Figure 34. VIP gradient boosting top 3 under-sampled

XGBoost

As for the two other boosting approaches, the high importance of the ‘grid’ catches the eye using the XGBoost model. Modest importance can be found in the ‘pointsSeason’ and ‘pastTop3’ features.

Figure 35. VIP XGBoost top 3 under-sampled


Since XGBoost is the best performing model in combination with under-sampling in the top 3 performance analysis, we will focus on the ICE plots of the 3 most important features: ‘grid’, ‘pointsSeason’, and ‘pastTop3’. In Figure 36, the relationship between the grid and ‘top 3 performance’ is shown. As expected, we notice that a better grid will result in a higher probability for an observation to be predicted as 1.

Figure 36. ICE XGBoost top 3 under-sampled with grid

In Figure 37, the relation between the normalized 'pointsSeason' and 'top 3 performance' is displayed. This figure exhibits a positive relationship between the normalized points obtained in the season and the predicted probability. However, for normalized points higher than 17, the probability of predicting an observation as finishing in the top 3 decreases. Obtaining a normalized pointsSeason of 17 or more is extremely hard, especially as the season continues to the end: it requires a high-performing driver with considerable consistency throughout the season. One explanation for the decrease in this ICE plot is that the number of drivers with more than 17 normalized pointsSeason is limited. Moreover, a driver with 17 or more normalized points per season who finishes fourth will get 12 points, lowering the normalized pointsSeason for the driver's next participation. Thus, it does make some sense in the way that a driver cannot easily correct a single miss of the top 3, since the driver carries a lower normalized pointsSeason into the next race.


Figure 37. ICE XGBoost top 3 under-sampled with pointsSeason

In Figure 38, the link between the past number of top 3s, which is a feature based on the dependent variable, and the top 3 performance is shown. We observe a higher probability for ‘top 3 performance’ when the past number of top 3s increases.

Figure 38. ICE XGBoost top 3 under-sampled with pastTop3

We will provide two extra plots related to the XGBoost model: a SHAP plot to interpret the results of this highly flexible model and a VIP for the XGBoost model without considering the 'grid'. In Figure 39, the SHAP plot is shown, which contains the 5 most important features in this analysis.


The SHAP plot demonstrates with color codes whether these features are important for predicting an outcome in a specific direction. One should focus on the purple color and its position relative to the '0.0' line: purple to the left of this line indicates importance for predicting zeros, and purple to the right indicates importance for predicting ones. Thus, we can see that the grid, legacyConstructor, pointsSeason, and nrWinsConstructor perform well at predicting negative outcomes, whereas pointsSeason, nrWinsConstructor, and pastTop3 are used by the XGBoost model to predict positive outcomes. Moreover, the high feature importance of the 'grid' and the feature's higher ability to predict zeros might be an explanation for the high specificity and lower sensitivity for the top 3 performance. For predicting a top 3, we should preferably look at pointsSeason, pastTop3, and nrWinsConstructor, which is a valuable insight from this SHAP plot26. Concretely, a bad grid makes it impossible to achieve a top 3 result, and the variables related to past performance differentiate the toppers from the subtoppers. Furthermore, we notice the importance of the constructor via 'legacyConstructor' and 'nrWinsConstructor'.

Figure 39. SHAP XGBoost top 3 under-sampled

26 The SHAP plots are constructed on a non-cross-validated XGBoost model. However, we used the tuned hyperparameters to construct these plots. This can lead to features having different importance scores when, in this case, Figure 35 and Figure 39 are compared.


Moreover, we would like to assess the robustness of the XGBoost model when the most important feature ‘grid’ is left out of the analysis. In Figure 40, the VIP of the XGBoost model with the grid excluded shows that the pointsSeason takes on the role of the most important feature.

Figure 40. VIP XGBoost top 3 under-sampled without grid

The resulting AUC is 0.800 compared to 0.830 with the grid included, meaning the other features, especially pointsSeason, are doing a great job of contributing to the prediction in the absence of the most important predictor. This is an indication of the robustness of the XGBoost model for top 3 performance predictions. Note that the grid, as shown in the SHAP plot in Figure 39, is a variable used for making predictions towards negative targets; excluding it from the model therefore results in a loss of information about drivers that will not reach the top 3. In general, under-sampling the majority class yields the highest prediction accuracy of the considered class imbalance treatments. The best performing model for 'top 3 performance', based on the average ranking over the 4 approaches, is the XGBoost model. All the models show that the grid is the most important feature to predict 'top 3 performance'. However, pointsSeason deserves an honorable mention since it is the second most important feature and it carries the model when the grid is removed. As mentioned before, the grid is a necessary but not sufficient condition for top 3 performance. To discover true 'top 3 performance', we should look into past performance (pastTop3, pointsSeason) and the prestige of the constructor (legacyConstructor, nrWinsConstructor).


Race completion analysis

Evaluation metrics

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the tree-based models for race completion will be discussed considering the 4 class imbalance approaches. The results of the best performing model based on the different metrics are shown in blue. As in the top 3 performance analysis, the results of the best model for every metric are put in bold.

Table 22. No class imbalance treatment 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.611 0.593 0.620 0.622 0.618 0.617

SEN 0.595 0.674 0.703 0.701 0.673 0.675

SPC 0.632 0.483 0.507 0.512 0.544 0.537

AUC 0.613 0.578 0.605 0.607 0.608 0.606

Lift 1.207 1.121 1.149 1.136 1.149 1.151

F1 0.639 0.657 0.682 0.682 0.671 0.671

Table 23. Over-sampling 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.561 0.596 0.610 0.609 0.605 0.601

SEN 0.595 0.641 0.583 0.526 0.511 0.586

SPC 0.515 0.534 0.647 0.723 0.734 0.621

AUC 0.555 0.587 0.615 0.624 0.623 0.603

Lift 1.101 1.114 1.189 1.239 1.234 1.184

F1 0.611 0.647 0.634 0.608 0.600 0.629


Table 24. Under-sampling 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.597 0.592 0.605 0.606 0.605 0.606

SEN 0.478 0.569 0.511 0.516 0.511 0.520

SPC 0.761 0.624 0.734 0.730 0.734 0.724

AUC 0.619 0.597 0.622 0.623 0.623 0.622

Lift 1.249 1.161 1.239 1.217 1.234 1.222

F1 0.578 0.618 0.599 0.602 0.600 0.604

Table 25. ADASYN 'race completion'

DTC BAG RF ADA GBM XGB

ACC 0.563 0.592 0.610 0.606 0.617 0.612

SEN 0.621 0.642 0.641 0.622 0.606 0.665

SPC 0.484 0.523 0.567 0.584 0.633 0.540

AUC 0.552 0.583 0.604 0.603 0.619 0.603

Lift 1.071 1.131 1.172 1.187 1.217 1.126

F1 0.622 0.645 0.655 0.646 0.647 0.665

A first assessment of the evaluation metrics of the race completion analysis leads to the insight that the prediction accuracy for race completion is not high. This could be because more or less27 the same features are used for all the analyses, thus potentially disregarding additional features that could grasp more prediction accuracy for race completion. Hence, the results for race completion could serve as a system that indicates early warnings of non-completion. Moreover, the AUC increases by only 0.011 when comparing the AUC of the baseline model (0.613), the DTC, with the best model under the class imbalance treatment strategies (0.624), Adaptive Boosting with over-sampling.

27 The lagged dependent variables, like pastNoComp and everNoComp, were only included in the related analyses, which is the race completion analysis.


Thus, the focus of the plots in the race completion analysis will be on the over-sampling approach. Before we consider these plots, the average ranking plot, shown in Figure 41, is discussed. In general, the Random Forest and the Adaptive Boosting models yield the highest performance for the race completion analysis. However, the Gradient Boosting model also performs well, based on the average ranking, on the AUC and specificity.

Figure 41. Average ranking models race completion analysis

Decision tree

In Figure 42, a decision tree for the race completion analysis with the over-sampling approach is shown. The 3 features that determine the binary splits of the algorithm are 'pastHasQ2Prop', 'pointsSeason', and 'pastHasQ3Prop'.

Figure 42. Decision tree race completion over-sampled


In Figure 43, the most important features of the DTC model are shown for the race completion analysis. These features are 'legacyConstructor', 'pointsSeason', and 'teamMateDiffPointsSeason'.

Figure 43. VIP DTC race completion over-sampled

The most important feature in Figure 43 is the legacyConstructor variable and, hence, this variable is shown on the ICE plot in Figure 44. This plot exhibits the relationship between this feature and the target output, showing an increase in race completion probability as the constructor participates in more and more races. These participations allow the constructor to accumulate experience, which Bell et al. (2016) refer to as the legacy effect.

Figure 44. ICE DTC race completion over-sampled


Bagging

For the Bagging procedure, there is some synergy between the model and the class imbalance approaches in the race completion analysis. The Bagging model performs worse than the decision tree model, which relates to the statement of Breiman (1996), who warns 'that bagging poor predictors can result in even worse ones in a classification setting' (p. 11). This can be perceived in the lower AUC of the baseline Bagging model (0.578) compared to the DTC (0.613). The most important feature of the bagging model in this analysis is the 'legacyConstructor', as shown in Figure 45.

Figure 45. VIP bagging race completion

Random forest

The Random Forest model yields better results than the DTC and the bagging procedure. Yet, these results are mediocre, with an AUC of 0.605 for the baseline approach. On the contrary, a sensitivity of 0.703 for the baseline approach is a bright spot. The Random Forest model combined with over-sampling reaches an AUC of 0.615, which is a minor improvement. In Figure 46, the VIP for the Random Forest model is shown with the same important features indicated by the DTC and Bagging models. We will focus on the 3 most important ones by constructing ICE plots that scrutinize their relationship with race completion.


Figure 46. VIP random forest race completion over-sampled

In Figure 47, the relation between legacyConstructor and race completion in the case of over-sampling is displayed. Based on this plot, the higher the experience of the constructor, the higher the probability of race completion. This relationship flattens out after about 600 participations, suggesting that it takes a long time for a constructor to reach an optimal configuration and learn from previous errors.

Figure 47. ICE plot random forest over-sampled with legacyConstructor


The link between the normalized points in the season and the race completion can be found in Figure 48. We see a relationship between the race completion and the normalized points in the season when the latter are low, between 0 and 3. For higher normalized points in the season, the link with the race completion is limited.

Figure 48. ICE plot random forest over-sampled with pointsSeason

In Figure 49, the effect of different values of the past ratio of 2nd-qualifier finishes to participations (the description of the 'pastHasQ2Prop' feature) on the probability of completing the race is shown.

Figure 49. ICE random forest over-sampled with pastHasQ2Prop


Adaptive Boosting

This model assigns some importance to 5 features to predict race completion, as shown in Figure 50. The most important features for this model are 'pastHasQ2Prop', 'pointsSeason', 'legacyConstructor', 'pastHasQ3Prop', and 'nrWinsConstructor'.

Figure 50. VIP adaptive boosting race completion

The most important feature, 'pastHasQ2Prop', is imputed via the MICE method and, thus, partly consists of synthetically imputed values. Consequently, we also consider only the complete data, the period 2005-2019, for this feature and the effect of this restriction on the VIP of the Adaptive Boosting model. The effect of using this limited period on the feature importance can be found in Figure 51. Herein, the importance of pastHasQ2Prop diminishes, but it retains moderate importance. We observe pointsSeason, teamMateDiffPointsSeason, and legacyConstructor as the most important features. These features also had high importance when the entire period, 1950-2019, was considered with the imputed pastHasQ2Prop feature.


Figure 51. VIP adaptive boosting race completion for 2005-2019

Gradient boosting

The Gradient Boosting model applied to the baseline class imbalance approach assigns a high amount of importance to the 'pastHasQ2Prop' and 'pointsSeason' features, as displayed in Figure 52. The difference with the Adaptive Boosting approach is that Gradient Boosting also gives some importance to the less important features.

Figure 52. VIP gradient boosting race completion over-sampled


XGBoost

In Figure 53, the importance of 'pastHasQ2Prop', 'legacyConstructor', and 'pointsSeason' catches the eye, similar to the other modeling approaches.

Figure 53. VIP XGBoost race completion

Comparing the results of the Gradient Boosting and XGBoost models leads to the conclusion that the difference between the prediction accuracies of both models for the baseline class imbalance treatment strategy is negligible. In general, the predictive performance of both models is mediocre at best, just as the accuracy of the other models. In Figure 54, the XGBoost model is interpreted at the level of specific instances for the most important features by means of the SHAP plot. This SHAP plot is constructed on an XGBoost model with the tuned hyperparameters (cf. infra Tuned hyperparameter values). This plot depicts teamMateDiffPointsSeason as the most important feature in the analysis. Furthermore, the high feature value of pastHasQ1Prop catches the eye. We will be cautious in making statements about the direction of feature importance. Based on this plot, the normalized points in the season aid at predicting zeros and the past number of non-completions contributes to predicting ones.


Figure 54. SHAP XGBoost race completion

We conclude this elaboration regarding the race completion analysis with two possible explanations for the low predictive performance of the tree-based models in this analysis:

1. We did not include enough relevant features and, hence, we were not able to grasp the relevant drivers of race completion.
2. A human error or a technical difficulty can hardly be predicted, resulting in poor performance of the tree-based models. However, the prediction accuracy of the models is higher than random (AUC of 0.5), meaning these models could still be used as a system for early warnings.


Qualifying ability analysis

In the 4 tables below, the 6 evaluation metrics for the prediction accuracy of the different tree-based models for qualifying ability will be discussed considering the 4 class imbalance approaches. The results of the best performing model based on the different metrics are shown in blue. Parallel to the 2 previous analyses, the results of the best model for every metric are put in bold.

Table 26. No class imbalance treatment 'qualifying ability'

DTC BAG RF ADA GBM XGB

ACC 0.961 0.966 0.966 0.966 0.967 0.966

SEN 0.986 0.989 0.996 0.993 0.995 0.992

SPC 0.397 0.436 0.287 0.358 0.355 0.365

AUC 0.692 0.713 0.642 0.676 0.675 0.679

Lift 1.010 1.013 1.004 1.008 1.008 1.007

F1 0.980 0.982 0.983 0.982 0.983 0.982

Table 27. Over-sampling ‘qualifying ability’

DTC BAG RF ADA GBM XGB

ACC 0.939 0.958 0.960 0.878 0.865 0.944

SEN 0.955 0.979 0.976 0.876 0.862 0.955

SPC 0.596 0.485 0.583 0.912 0.932 0.694

AUC 0.775 0.732 0.780 0.894 0.897 0.824

Lift 1.021 1.013 1.018 1.043 1.040 1.027

F1 0.968 0.978 0.979 0.932 0.924 0.970


Table 28. Under-sampling ‘qualifying ability’

DTC BAG RF ADA GBM XGB

ACC 0.844 0.854 0.852 0.840 0.853 0.848

SEN 0.843 0.851 0.849 0.837 0.850 0.845

SPC 0.866 0.912 0.915 0.909 0.922 0.912

AUC 0.855 0.882 0.882 0.873 0.886 0.879

Lift 1.040 1.039 1.042 1.042 1.040 1.037

F1 0.912 0.918 0.917 0.910 0.917 0.914

Table 29. ADASYN 'qualifying ability'

DTC BAG RF ADA GBM XGB

ACC 0.933 0.948 0.947 0.948 0.900 0.958

SEN 0.958 0.975 0.971 0.970 0.910 0.986

SPC 0.362 0.332 0.404 0.463 0.678 0.326

AUC 0.660 0.654 0.688 0.716 0.794 0.656

Lift 1.014 1.016 1.017 1.017 1.029 1.007

F1 0.965 0.973 0.972 0.973 0.946 0.978

The highest AUC (0.897) amongst all configurations (cf. supra Figure 26) is obtained with the Gradient Boosting model using over-sampling of the minority class as a class imbalance treatment strategy. Moreover, this configuration has high scores on sensitivity (0.862) and, especially, specificity (0.932). The fact that the over-sampling approach is used in the best performing configuration follows the study of Barandela, Valdovinos, Sánchez and Ferri (2004) (cf. supra Class imbalance), in which the authors postulate that for highly imbalanced classes over-sampling the minority class is advised. Since we have highly imbalanced classes in the training set, 4.26% 0s and 95.74% 1s, over-sampling the minority class is needed to obtain enough 0s for the model to learn from. Consequently, we will specifically focus on the VIPs, ICE plots, and SHAP plots of the over-sampling scenario. However, it should be mentioned that the under-sampling approach


generates similar yet slightly worse prediction accuracy for the boosting approaches and better accuracy for the Bagging and Random Forest models. In Figure 55, the average ranking over the class imbalance approaches of the models for the qualifying ability analysis demonstrates a distinction between the single learner, the decision tree model, and the ensemble learners, bagging and boosting. Within the ensemble learners, the Gradient Boosting model seems to be ranked the highest on average, closely followed by the Random Forest model. Additionally, the lower average ranking of the scalable XGBoost model stands out.

Figure 55. Average ranking models for qualifying ability analysis

Decision tree

In Figure 56, a decision tree is portrayed for the qualifying ability analysis. For this particular decision tree in the case of over-sampling, the 'pointsSeason' variable determines the first split. The other features included in this decision tree are 'everStart3' and 'legacyConstructor'. As mentioned before, a single decision tree suffers from high variance and, hence, we should look at other models that diminish this variance (James et al., 2013). On the other hand, a decision tree is an interpretable instrument that breaks down a complex problem, like predicting qualifying ability, into simpler decisions based on binary splits.


Figure 56. DTC qualifying ability over-sampled

For the cross-validated Decision Tree Classifier, we observe an extremely high sensitivity, which could be related to the high class imbalance in the target output of the baseline configuration. The baseline DTC model achieves an AUC of 0.692 and makes better predictions than the ADASYN DTC model (AUC of 0.660). For this ADASYN model, the specificity, the prediction accuracy towards negatives, is worse than for the baseline model. In Figure 57, the most important features for the DTC model are shown. The number of constructor participations is the most important feature for the DTC model, followed by 'pastHasQ1Prop', which grasps the ratio of finished first qualifiers to the number of participations. We should be careful, since this feature was imputed using MICE for the period 1950-1994, and, thus, we also mention 'pastNoQual' and 'pointsSeason' as important features.

Figure 57. VIP DTC qualifying ability over-sampled


In Figure 58, the relationship between 'legacyConstructor', the most important feature for the DTC, and the qualifying ability is displayed. In this ICE plot, we notice a strong increase in the probability of qualifying, meaning the observation is more likely to be predicted as 1, when the constructor has participated in over 350 races. Based on this ICE plot, we can state that there could be 'a legacy effect', inspired by Bell et al. (2016), present for the qualifying ability: over its first 350 races, the constructor learns how to optimize the settings for the qualifying ability of its drivers.

Figure 58. ICE DTC qualifying ability over-sampled

Bagging

The focus for the bagging procedure is on the synergy between the model and the class imbalance treatment strategies (Galar et al., 2011). We find a great increase in specificity when the under-sampling technique is used and a decrease in the model's predictive performance when the ADASYN technique is applied. In Figure 59, the VIP indicates 'legacyConstructor' as the most important feature, parallel to the DTC, followed by the 'past performance'-related features 'everStart3', 'pointsSeason', and 'pastNoQual', and another constructor-related feature, 'nrWinsConstructor'.

Figure 59. VIP bagging qualifying ability over-sampled


Random forest

The Random Forest model has a lower baseline AUC than the DTC and Bagging. On the other hand, the prediction accuracy of the Random Forest is better when accompanied by the class imbalance treatment approaches. In Figure 60, the 'everStart3' and 'pointsSeason' features are the most important ones for the Random Forest model. Again, the features 'legacyConstructor' and 'pastHasQ1Prop' play an important role in the qualifying ability.

Figure 60. VIP random forest qualifying ability over-sampled

Adaptive boosting As demonstrated in Figure 61, the Adaptive boosting model identifies, just as the Random Forest model, ‘everStart3’ and ‘pointsSeason’ as the 2 most important features for the qualifying ability analysis. Furthermore, we notice the lesser importance of ‘pastHasQ1Prop’ for this model.

Figure 61. VIP adaptive boosting qualifying ability over-sampled


Gradient boosting

We will scrutinize the Gradient Boosting model since it yields the best results for predicting qualifying ability when combined with over-sampling, having the highest AUC and specificity in this case. In Figure 62, the VIP for this configuration is shown with the 2 most important features: 'pointsSeason' and 'everStart3'. The 'rainExp' feature will be included as a wildcard based on its moderate importance and our curiosity about the relationship between the driver's rain experience and its impact on his qualifying ability. Hence, ICE plots for these features will be included to understand their relationship with the qualifying ability.

Figure 62. VIP gradient boosting qualifying ability over-sampled

In Figures 63, 64, and 65, the relations between the qualifying ability and 'pointsSeason', 'everStart3', and 'rainExp' are shown. In Figure 63, we notice an increase in the qualifying probability for normalized points in the season between 0 and 3. The qualifying probability increases when a driver has ever started a race in the top 3, as can be found in Figure 64. This is a binary feature, which explains the fact that all observations, the dark grey dots, lie on 2 vertical lines. In Figure 65, a link between the qualifying probability and the rain experience is noticed when this experience is below 20 races. Since it rains, on average, in 15% of the races, it takes 133 races to gain this experience, which is around 6 seasons based on the current situation28 in Formula One.

28 Disregarding COVID-19, the 2020 season was planned to contain 22 races, and, consequently, we use this as a benchmark to calculate the number of seasons needed to gain this experience in the rain.


Figure 63. ICE plot random forest qualifying ability with pointsSeason


Figure 64. ICE plot random forest qualifying ability with everStart3

Figure 65. ICE plot random forest qualifying ability with rainExp


XGBoost

The 2 most important features for the qualifying ability analysis using the XGBoost model, 'pointsSeason' and 'everStart3', are similar to most of the other models. Moreover, the Random Forest, Adaptive Boosting, Gradient Boosting, and XGBoost models identify the same 2 features as the most important ones in the qualifying ability analysis.

Figure 66. VIP XGBoost qualifying ability over-sampled

Next, the performance of Gradient Boosting versus XGBoost is scrutinized (cf. supra Table 17). XGBoost performs worse than Gradient Boosting at predicting whether a driver qualifies for the race, meaning that XGBoost might be overfitting the analysis, leading to lower generalizability to new data. However, another influencing factor could be the impact of the hyperparameter tuning on the prediction accuracy. In general, we notice a high prediction accuracy for the qualifying ability analysis. Reflecting critically on this high accuracy might raise some questions: not qualifying for a race is an event that seldom happens and, hence, it might be better approached from another angle, like anomaly detection. To end the discussion of the results, we display one more SHAP plot in Figure 67 that demonstrates the most important features and their contribution to predicting positives and negatives. The 'everStart3' feature, which grasps whether a driver has ever had the chance to start a race in the top 3, is the most important feature. This feature has high value towards predicting negative outcomes, meaning a driver that has never started a previous race in the top 3 is much more likely to be predicted as a negative. The pointsSeason and pastStart3 also seem important


features for the qualifying ability analysis. However, these features have low feature value towards predicting zeros and mediocre value towards predicting ones. The past number of non-qualifications and the number of all-time participations of a specific constructor are used for predicting positive target values. Thus, a driver should have an experienced constructor and few to no previous non-qualifications in order to qualify for the race. Non-qualification risk can be spotted by looking at whether a driver has never started a race in the top 3.

Figure 67. SHAP XGBoost qualifying ability over-sampled

Lastly, the average impact of the class imbalance treatment approaches on the AUC for the three analyses can be found in Appendix 6. This closes the results section and opens the last parts of the 'Discussion', which are the conclusion, 'Limitations', and 'Further research'. The conclusion will focus on the research questions (cf. supra Research questions), the limitations will be considered from both a technical and a sports perspective, and the further research section will offer both potential routes to augment this study and ideas for other studies in the field of Formula One.


Tuned hyperparameter values

Decision tree

By default, the complexity parameter (cp) is set equal to 0.01, but we allow the model to choose between the values of a predefined grid {0; 0.005; 0.01; 0.02; 0.05}. The basic rule is that a smaller value of cp gives the model more freedom towards complexity (Yan et al., 2016). On the contrary, this extra freedom for the decision trees could result in overfitting the analysis, leading to lower generalizability (James et al., 2013). In Table 30, we notice low values for this complexity parameter, meaning the decision tree models opt for more freedom to grasp higher complexity in the analysis rather than guarding against overfitting, which would require a higher cp value.

Table 30. Complexity parameter Decision Tree Classifier

Complexity parameter   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                  0              0               0                0
Race completion        0.005          0               0.005            0
Qualifying ability     0              0               0                0
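
A minimal sketch of this tuning step with the caret package, optimizing the AUC in a 5-fold cross-validation; 'basetable' and the 'top3' factor target (with valid level names such as 'no'/'yes') are placeholders:

library(caret)

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)   # reports ROC (the AUC)
grid <- expand.grid(cp = c(0, 0.005, 0.01, 0.02, 0.05))

fit <- train(top3 ~ ., data = basetable, method = "rpart",
             metric = "ROC", trControl = ctrl, tuneGrid = grid)
fit$bestTune   # the cp value retained for the final model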

Random forest

For the Random Forest model, we consider the number of predictors used at every split and the minimum node size. In Table 31, the tuned number of predictors for every configuration is shown. The Random Forest model uses by default K = √n, with n being the total number of predictors (here n = 33). Thus, 6 predictors would have been used by default in this case (√33 = 5.745 ≈ 6). We define a grid for this hyperparameter as {2:10}. The tuned number of predictors is in most cases 8, 9, or 10, suggesting that the default K might not be ideal in our analysis.

Table 31. Number of predictors Random Forest

Number of predictors   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                  9              10              9                10
Race completion        3              8               3                10
Qualifying ability     8              8               10               6


The grid for the minimum number of observations in a leaf node is defined as {50; 100; 200}. This hyperparameter is implicitly linked to the depth of the tree (Hastie, Tibshirani & Friedman, 2009), and our tuned hyperparameters, as shown in Table 32, suggest that the trees of the Random Forest model embrace the offered complexity. This statement is based on the dominance of 50 as the most chosen option for this hyperparameter.

Table 32. Minimum node size Random Forest

Minimum node size    No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                200            50              100              50
Race completion      100            50              100              50
Qualifying ability   50             50              50               50

Adaptive boosting

The maximum tree depth hyperparameter for the Adaptive Boosting model was tuned within the range {1:4}. Table 33 demonstrates that all configurations bar one adopt the maximum flexibility level offered, which is a tree depth of 4.

Table 33. Maximum depth tree Adaptive Boosting

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                4              4               4                4
Race completion      4              4               4                4
Qualifying ability   4              4               3                4

Gradient boosting

Gradient Boosting offers the same hyperparameter within the same range {1:4}. Parallel to the tree depth of the Adaptive Boosting approach, we notice the embracement of high flexibility, since the tuned hyperparameter equals 4 in almost all cases, as can be found in Table 34.


Table 34. Maximum depth tree Gradient Boosting

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                4              4               3                4
Race completion      4              4               4                4
Qualifying ability   4              4               4                4
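Analogously, a candidate grid for Gradient Boosting via caret's 'gbm' method, where the interaction depth plays the role of the maximum tree depth in Table 34 (the remaining values are illustrative fixed settings, not necessarily those used in this study):

gbm_grid <- expand.grid(interaction.depth = 1:4, n.trees = 100,
                        shrinkage = 0.1, n.minobsinnode = 10)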

XGBoost

The options for the maximum depth hyperparameter are, similarly to the two other boosting approaches, between 1 and 4. However, the results differ in the case of XGBoost, with over-sampling and ADASYN using the offered complexity. The baseline and the under-sampling approach take a more careful stance with lower levels of maximum tree depth.

Table 35. Maximum depth tree XGBoost

Maximum tree depth   No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                1              4               2                4
Race completion      1              4               2                3
Qualifying ability   4              4               2                4

Another tuned hyperparameter influencing the model’s complexity is the number of rounds (nrounds) over which the model sequentially builds trees upon the insights of previous trees. The options for this hyperparameter are {100; 150; 200; 250}, with the over-sampling method choosing the highest offered complexity (250) and the under-sampling approach the lowest level of complexity (100), as shown in Table 36.

Table 36. Number of rounds (nrounds) XGBoost

Nrounds              No treatment   Over-sampling   Under-sampling   ADASYN
Top 3                150            250             100              100
Race completion      250            250             100              150
Qualifying ability   100            250             100              250
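A sketch combining both XGBoost grids via caret's 'xgbTree' method, which requires the remaining hyperparameters to be fixed (here at illustrative defaults); 'train_set' and 'top3' are again illustrative names:

library(caret)

xgb_grid <- expand.grid(max_depth = 1:4,                  # cf. Table 35
                        nrounds = c(100, 150, 200, 250),  # cf. Table 36
                        eta = 0.3, gamma = 0, colsample_bytree = 1,
                        min_child_weight = 1, subsample = 1)

xgb_fit <- train(top3 ~ ., data = train_set, method = "xgbTree",
                 metric = "ROC",
                 trControl = trainControl(method = "cv", number = 5,
                                          classProbs = TRUE,
                                          summaryFunction = twoClassSummary),
                 tuneGrid = xgb_grid)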


Conclusion

In this dissertation, three analyses are conducted in a Formula One environment using the Sport Result Prediction-CRISP-DM methodology (Bunker & Thabtah, 2019) as the guiding compass. The first analysis quantifies high performance by investigating the determinants of prediction accuracy for ‘top 3 finishes’. Human or technical errors that force a driver to end the race early are the research topic of the second analysis; concretely, the predictors that steer race completion in Formula One are brought up herein. The third analysis focuses on predicting whether a driver can qualify for the race based on his and his team’s specific characteristics. Two main research questions were defined for this dissertation, and these are answered directly with the gained insights.

Research question 1 (RQ1): ‘What drives predictive accuracy in top 3 performance, race completion and qualifying ability in Formula One?’

For predicting top 3 performance, the starting position is a necessary but insufficient determinant. A good starting position gives the driver a certain edge over drivers with a worse starting position: a driver that starts the race in third position has a 75% predicted probability of finishing in the top 3. Hence, the grid position alone is insufficient to guarantee a top 3 finish and, therefore, we consider other features to support the prediction. The most important past-performance variables are the number of previous top 3 finishes and the normalized points obtained in previous races that season. Moreover, the experience of the constructor, the legacy effect (Bell et al., 2016), also plays a major role in distinguishing these top drivers. The ratio of the number of finished 2nd round qualifiers to the number of race participations for a driver, which serves as a proxy for the driver’s steering capability, is an important feature to predict race completion. Other relevant features are the normalized points in the season and the legacy of the constructor. The analysis shows that it can take up to 600 participations to reach an optimal setting for finishing the race. However, the accuracy of the predictions for this analysis is mediocre (AUC of 0.60, which is still better than random), meaning the results of this analysis are best used as an early-warning system for race completion. The features with high importance in the qualifying ability analysis are related to the past performance and the experience of the constructor. The legacy of the constructor and the past number of non-qualifications should be regarded to predict whether a driver will qualify for the race. The more experienced the constructor and the lower the number of non-qualifications, the higher


the probability that a driver will be predicted as ‘qualified for the race’. For predicting non-qualification, attention should be given to whether a driver has ever started in the top 3 of a race, as a quantification of the driver’s qualifying capabilities. Conclusively, the normalized points and the legacy of the constructor play a major role in all three analyses. The importance of both the normalized points in the season and the constructor’s legacy can be explained from a Formula One domain perspective. The normalized points are a proxy for the driver’s performance in the season and an indication of the driver’s skill level. Thus, drivers that have accumulated more points in the season have a higher probability of finishing in the top 3, completing the race, and qualifying for the race. The drivers are connected to a constructor that shares its expertise through its car, tactics, mental and physical training, and much more. Moreover, the importance of the constructor in all three analyses demonstrates that Formula One is a team effort, since these teams support their drivers in numerous ways.

Research question 2 (RQ2): ‘How well do tree-based models perform at making predictions in the field of F1 top 3 performance, race completion and qualifying ability?’

The more flexible tree-based models, like Gradient Boosting and XGBoost, perform better than the interpretable tree-based model, the Decision Tree Classifier, for the three conducted analyses. Moreover, the values of the tuned hyperparameters of the models indicate a certain level of complexity in the analyses. Additionally, we notice a diminishment of the interpretability/flexibility trade-off (James et al., 2013) between these models due to recently developed techniques, like SHAP plots (Lundberg & Lee, 2017), which make it convenient to interpret a flexible model such as XGBoost. Furthermore, the tree-based models perform well in a Formula One setting, especially when combined with class imbalance treatment approaches. Thus, there is a synergy between the tree-based models and these class imbalance treatment strategies (Galar et al., 2011). The over-sampling approach generates, on average, a considerable increase in prediction accuracy (increase in AUC of 0.06), whereas the under-sampling approach produces an even larger increase (increase in AUC of 0.12). Furthermore, ADASYN, an algorithm of the SMOTE family, also has a positive influence on the models’ accuracy, but this influence is less pronounced than that of the sampling approaches (increase in AUC of 0.02 on average).


Limitations

Technical perspective

Data quality
The number of missing values in the datasets is extremely limited since the data was already available online via the Ergast API. However, the ‘Qualifying’ and ‘Results’ datasets suffered from missing values, because of their limited availability over time, and these were imputed with MICE. Furthermore, the ‘Weather’ dataset is manually constructed, which potentially introduces small data inadequacies.

Feature selection
Reunanen (2003) states that only a subset of variables should be included in the model for the construction of the best predictor. However, the number of features in our analysis is limited and, thus, feature selection would merely serve the purpose of a robustness check. An interesting approach to feature selection is BART (BAckward Regression Trimming), explained by Baesens (2019), in which the most important features are selected.

Sports perspective

Domain expert’s insight
A Formula One expert can be consulted to create additional features that grasp aspects not captured by the current features. The knowledge regarding F1 was accumulated via (1) watching races on lazy Sunday afternoons, (2) watching YouTube videos of Nico Rosberg, (3) the F1 documentary ‘Formula 1: Drive to Survive’, and (4) a literature study of F1 predictive studies.

Top 3 performance analysis
Our choice to model top 3 performance might seem arbitrary, but drivers that finish in the top 3 get recognition when they are invited to celebrate their top 3 position on the podium. However, we understand that this analysis might draw some criticism since ‘second place is first of the losers’, as titled by Judde, Booth and Brooks (2013). Moreover, this analysis can be redefined as a multiclass classification by constructing multiple categories for different levels of performance.

Race completion analysis
The second analysis in this dissertation is a race completion analysis in which the determinants of finishing a race are studied. This analysis does not distinguish between human and technical errors since we investigate overall race completion, but a distinction could be made in the future.

Qualifying ability analysis
This analysis seems the most trivial one since qualifying for a race is often taken for granted. But in the unlikely scenario that a team cannot start a race with both drivers, teams might wonder which factors can explain this unlikely outcome. This is why we include this qualifying analysis in this dissertation.


Further research

Augmenting this study

Distinguish between technical and human dropouts
As mentioned in the ‘Limitations’ section, the race completion analysis purposely includes both human and technical errors. However, a future path might be the distinction between human and technical errors, leading to separate analyses for both types of dropouts.

MetaCost and ROSE
Another approach is building a cost matrix that facilitates learning from imbalanced datasets. Concretely, this approach, called MetaCost (Domingos, 1999), wraps a cost-minimization problem around the classification; this technique is called cost-sensitive learning. Experiments in that study show that MetaCost can systematically reduce cost compared to error-based classification and stratification. Furthermore, the MetaCost approach can be efficiently applied to a large dataset. Another approach for binary imbalanced learning is ROSE (Lunardon, Menardi & Torelli, 2014), which creates synthetic data to deal with class imbalance (a minimal sketch is given at the end of this section).

Other Formula One study ideas

Entertainment study
Formula One has a very loyal fan base, but also a demanding one. Most fans want to be entertained during their Formula One experience. Entertainment can be related to the circuit, the weather, and many other factors. Quantifying entertainment is a problem that deserves attention given the omnipresence of this sport. People want to be entertained, but ‘what drives entertainment?’. The dependent variable could be the rating of a Formula One race on IMDb, and features could be the number of turns per race, race during day/night, at least one home driver present, and so on. Related studies to this research topic are the studies of Remenyik and Molnár (2017) and Henderson, Foo, Lim, and Yip (2010).

In-race strategies
A potential route for in-race strategies is stipulated by Stoppels (2017): predicting lap times using ANN or researching the impact of pit stop strategy on the outcome of the race. The author indicates the added value of comparing Formula One races with NASCAR races. Concretely, this master’s thesis refers to the paper of Allender (2008) and the paper of Pfitzner and Rishel (2005) as studies in which inspiration can be found.
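A minimal sketch of the suggested ROSE approach, assuming a training set 'train_set' with a binary factor target 'top3' (both are illustrative names, not the exact objects used in this study):

library(ROSE)

set.seed(123)
balanced <- ROSE(top3 ~ ., data = train_set)$data  # synthetically balanced data
table(balanced$top3)                               # roughly balanced classes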


Reference list

Abut, F., Akay, M. F., Daneshvar, S., & Heil, D. (2017, September). Artificial neural networks for predicting the racing time of cross-country skiers from survey-based data. In 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN) (pp. 117-121). IEEE.
Al-Masree, H. K. (2015). Extracting Entity Relationship Diagram (ERD) from relational database schema. International Journal of Database Theory and Application, 8(3), 15-26.
Allen, J., & Schumacher, M. (2000). Michael Schumacher: driven to extremes. Bantam.
Allender, M. (2008). Predicting the outcome of NASCAR races: The role of driver experience. Journal of Business & Economics Research (JBER), 6(3).
Alnoukari, M., Alzoabi, Z., & Hanna, S. (2008, August). Applying adaptive software development (ASD) agile modeling on predictive data mining applications: ASD-DM methodology. In 2008 International Symposium on Information Technology (Vol. 2, pp. 1-6). IEEE.
Acuna, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier accuracy. In Classification, clustering, and data mining applications (pp. 639-647). Springer, Berlin, Heidelberg.
Aversa, P., Furnari, S., & Haefliger, S. (2015). Business model configurations and performance: A qualitative comparative analysis in Formula One racing, 2005–2013. Industrial and Corporate Change, 24(3), 655-676.
Azevedo, A. I. R. L., & Santos, M. F. (2008). KDD, SEMMA and CRISP-DM: a parallel overview. IADS-DM.
Backiel, A., Baesens, B., & Claeskens, G. (2014, June). Mining telecommunication networks to enhance customer lifetime predictions. In International Conference on Artificial Intelligence and Soft Computing (pp. 15-26). Springer, Cham.
Baesens, B. (2019). BART: BAckward Regression Trimming. Big data, 7(3), 207-213.
Barandela, R., Valdovinos, R. M., Sánchez, J. S., & Ferri, F. J. (2004, August). The imbalanced training sample problem: Under or over sampling?. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR) (pp. 806-814). Springer, Berlin, Heidelberg.
Bastiaans, M. (1985). On the sliding-window representation in digital signal processing. IEEE transactions on acoustics, speech, and signal processing, 33(4), 868-873.
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning, 36(1-2), 105-139.
Bekker, J., & Lotz, W. (2009). Planning Formula One race strategies using discrete-event simulation. Journal of the Operational Research Society, 60(7), 952-961.
Bell, A., Smith, J., Sabel, C. E., & Jones, K. (2016). Formula for success: multilevel modelling of Formula One driver and constructor performance, 1950–2014. Journal of Quantitative Analysis in Sports, 12(2), 99-112.
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in neural information processing systems (pp. 153-160).
Bentley, F., & Murray, J. (2016, June). Understanding Video Rewatching Experiences. In Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video (pp. 69-75).
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems (pp. 2546-2554).
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(Feb), 281-305.
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Bernard, S., Heutte, L., & Adam, S. (2009, June). Influence of hyperparameters on random forest accuracy. In International Workshop on Multiple Classifier Systems (pp. 171-180). Springer, Berlin, Heidelberg.
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4), 927-961.
Bunker, R. P., & Thabtah, F. (2019). A machine learning framework for sport result prediction. Applied computing and informatics, 15(1), 27-33.
Burez, J., & Van den Poel, D. (2009). Handling class imbalance in customer churn prediction. Expert Systems with Applications, 36(3), 4626-4636.
Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.
Büyükyazıcı, M., & Sucu, M. (2003). The analytic hierarchy and analytic network processes. Hacettepe Journal of Mathematics and Statistics, 32, 65-73.
Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook (pp. 875-886). Springer, Boston, MA.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Celeux, G., & Diebolt, J. (1992). A stochastic approximation type EM algorithm for the mixture problem. Stochastics: An International Journal of Probability and Stochastic Processes, 41(1-2), 119-134.
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).
Chen, T., He, T., Benesty, M., Khotilovich, V., & Tang, Y. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1-4.
Chen, H., Rinde, P. B., She, L., Sutjahjo, S., Sommer, C., & Neely, D. (1994). Expert prediction, symbolic learning, and neural networks. An experiment on greyhound racing. IEEE Expert, 9(6), 21-27.
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC genomics, 21(1), 6.
Choo, C. L. W. (2015). Real-time decision making in motorsports: analytics for improving professional car race strategy (Doctoral dissertation, Massachusetts Institute of Technology).
Chu, F., & Zaniolo, C. (2004, May). Fast and light boosting for adaptive mining of data streams. In Pacific-Asia conference on knowledge discovery and data mining (pp. 282-292). Springer, Berlin, Heidelberg.
Cortes, C., & Mohri, M. (2004). AUC optimization vs. error rate minimization. In Advances in neural information processing systems (pp. 313-320).
Davies, A. (2014, November 21). This Is How You Ship an F1 Car Across the Globe in 36 Hours. Retrieved from https://www.wired.com/2014/11/ship-f1-car-across-globe-36-hours/
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application (Vol. 1). Cambridge university press.
Davoodi, E., & Khanteymoori, A. R. (2010). Horse racing prediction using artificial neural networks. Recent Advances in Neural Networks, Fuzzy Systems & Evolutionary Computing, 2010, 155-160.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
Derrig, R., & Francis, L. (2006, March). Distinguishing the forest from the TREES: a comparison of tree based data mining methods. In Casualty Actuarial Society Forum (pp. 1-49).
Dietterich, T. G. (2000, June). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1-15). Springer, Berlin, Heidelberg.
do Nascimento, G. S., & de Oliveira, A. A. (2012, November). An agile knowledge discovery in databases software process. In International Conference on Data and Knowledge Engineering (pp. 56-64). Springer, Berlin, Heidelberg.
Domingos, P. (1999, August). Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 155-164).
Donahay, B., & Rosenberger III, P. J. (2007). Using brand personality to measure the effectiveness of image transfer in formula one racing. Marketing Bulletin, 18.
Donders, A. R. T., Van Der Heijden, G. J., Stijnen, T., & Moons, K. G. (2006). A gentle introduction to imputation of missing values. Journal of clinical epidemiology, 59(10), 1087-1091.
Drummond, C., & Holte, R. C. (2003, August). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II (Vol. 11, pp. 1-8). Washington DC: Citeseer.
Edelmann-Nusser, J., Hohmann, A., & Henneberg, B. (2002). Modeling and prediction of competitive performance in swimming upon neural networks. European Journal of Sport Science, 2(2), 1-10.
Eichenberger, R., & Stadelmann, D. (2009). Who Is The Best Formula 1 Driver? An Economic Approach to Evaluating Talent. Economic Analysis & Policy, 39(3).
El Sheikh, A. A. R., & Alnoukari, M. (2012). Business intelligence and agile methodologies for knowledge-based organizations: Cross-disciplinary applications. Business Science Reference.
Eryarsoy, E., & Delen, D. (2019, January). Predicting the Outcome of a Football Game: A Comparative Analysis of Single and Ensemble Analytics Methods. In Proceedings of the 52nd Hawaii International Conference on System Sciences.
Fayyad, U. M., Haussler, D., & Stolorz, P. E. (1996, August). KDD for Science Data Analysis: Issues and Examples. In KDD (pp. 50-56).
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37-37.
Fernandez, G. (2003). SAS Applications.
F1. (2019, December 17). 10 records that have been broken in 2019: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.10-records-that-have-been-broken-in-2019.5TfXLot24fpHzRWikUTiy6.html
F1. (2019, January 18). Formula 1's TV and digital audiences grow for the second year running: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.formula-1s-tv-and-digital-audiences-grow-for-the-second-year-running.OqTPVNthtZKFbKqBaimKf.html
F1. (2019, November 07). How F1 technology has supercharged the world: Formula 1®. Retrieved from https://www.formula1.com/en/latest/article.how-f1-technology-has-supercharged-the-world.6Gtk3hBxGyUGbNH0q8vDQK.html
Freund, Y., & Schapire, R. E. (1995, March). A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23-37). Springer, Berlin, Heidelberg.
Freund, Y., & Schapire, R. E. (1996, July). Experiments with a new boosting algorithm. In icml (Vol. 96, pp. 148-156).
Frick, B., & Humphreys, B. R. (2011). Prize structure and performance: Evidence from NASCAR (No. 2011-12).
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational statistics & data analysis, 38(4), 367-378.
Fushiki, T. (2011). Estimation of prediction error by using K-fold cross-validation. Statistics and Computing, 21(2), 137-146.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
Genuer, R., Poggi, J. M., & Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern recognition letters, 31(14), 2225-2236.
Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1), 44-65.
Graves, T., Reese, C. S., & Fitzgerald, M. (2003). Hierarchical models for permutations: Analysis of auto racing results. Journal of the American Statistical Association, 98(462), 282-291.
Greenwell, B. M. (2017). pdp: An R package for constructing partial dependence plots. The R Journal, 9(1), 421-436.
Ha, T. M., & Bunke, H. (1997). Off-line, handwritten numeral recognition by perturbation method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 535-539.
Harville, D. A. (1973). Assigning probabilities to the outcomes of multi-entry competitions. Journal of the American Statistical Association, 68(342), 312-316.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Random forests. In The elements of statistical learning (pp. 587-604). Springer, New York, NY.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008, June). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence) (pp. 1322-1328). IEEE.
Henderson, J. C., Foo, K., Lim, H., & Yip, S. (2010). Sports events and tourism: The Singapore formula one grand prix. International Journal of Event and Festival Management.
Henery, R. J. (1981). Permutation probabilities as models for horse races. Journal of the Royal Statistical Society: Series B (Methodological), 43(1), 86-91.
Hinkley, D. V. (1977). Jackknifing in unbalanced situations. Technometrics, 19(3), 285-292.
Hucaljuk, J., & Rakipović, A. (2011, May). Predicting football scores using machine learning techniques. In 2011 Proceedings of the 34th International Convention MIPRO (pp. 1623-1627). IEEE.
Hughes, M. (2004). The unofficial Formula One encyclopedia (Rev. ed.). London: Anness.
Inouye, S. K., van Dyck, C. H., Alessi, C. A., Balkin, S., Siegal, A. P., & Horwitz, R. I. (1990). Clarifying confusion: the confusion assessment method: a new method for detection of delirium. Annals of internal medicine, 113(12), 941-948.
Iyengar, R. (2017, November 24). The Logistics behind F1. Retrieved from https://medium.com/speedbox-is-typing/the-logistics-behind-f1-7537e445de20
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, pp. 3-7). New York: Springer.
Jenkins, M. (2010). Technological discontinuities and competitive advantage: A historical perspective on Formula 1 motor racing 1950–2006. Journal of Management Studies, 47(5), 884-910.
Jenkins, M., & Floyd, S. (2001). Trajectories in the evolution of technology: A multi-level study of competition in Formula 1 racing. Organization studies, 22(6), 945-969.
Jones, B. (1996). The ultimate encyclopedia of Formula One: the definitive illustrated guide to Grand Prix motor racing. MotorBooks International.
Judde, C., Booth, R., & Brooks, R. (2013). Second place is first of the losers: An analysis of competitive balance in Formula One. Journal of Sports Economics, 14(4), 411-439.
Karetnikov, A., Nuijten, W., & Hassani, M. (2019). Predicting Race Performance in Professional Cycling.
Kearns, M. J. (1988). Thoughts on hypothesis boosting. ML class project, 319, 320.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Machine Learning Proceedings 1992 (pp. 249-256). Morgan Kaufmann.
Li, D., Deogun, J., Spaulding, W., & Shuart, B. (2004, June). Towards missing data imputation: a study of fuzzy k-means clustering method. In International conference on rough sets and current trends in computing (pp. 573-579). Springer, Berlin, Heidelberg.
Li, T., Jing, B., Ying, N., & Yu, X. (2017). Adaptive Scaling. arXiv preprint arXiv:1709.00566.
Lin, Y., & Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474), 578-590.
Little, T. D., Lang, K. M., Wu, W., & Rhemtulla, M. (2016). Missing data. Developmental psychopathology, 1-37.
Littlestone, N., & Warmuth, M. K. (1989). The weighted majority algorithm (pp. 256-261). University of California, Santa Cruz, Computer Research Laboratory.
Liu, C., White, M., & Newell, G. (2009, July). Measuring the accuracy of species distribution models: a review. In Proceedings 18th World IMACs/MODSIM Congress. Cairns, Australia (pp. 4241-4247).
Lo, V. S. (2002). The true lift model: a novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2), 78-86.
Lo, V. S., & Bacon‐Shone, J. (1994). A Comparison between Two Models for Predicting Ordering Probabilities in Multiple‐Entry Competitions. Journal of the Royal Statistical Society: Series D (The Statistician), 43(2), 317-327.
Lobo, J. M., Jiménez‐Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography, 17(2), 145-151.
Lunardon, N., Menardi, G., & Torelli, N. (2014). ROSE: A Package for Binary Imbalanced Learning. R journal, 6(1).
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., ... & Lee, S. I. (2019). Explainable ai for trees: From local explanations to global understanding. arXiv preprint arXiv:1905.04610.
Lundberg, S. M., Erion, G. G., & Lee, S. I. (2018). Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in neural information processing systems (pp. 4765-4774).
Luce, R. D. (1959). Individual choice behavior. John Wiley and Sons.
Marino, A., Aversa, P., Mesquita, L., & Anand, J. (2015). Driving performance via exploration in changing environments: Evidence from formula one racing. Organization Science, 26(4), 1079-1100.
Maszczyk, A., Roczniok, R., Czuba, M., Zając, A., Waśkiewicz, Z., Mikołajec, K., & Stanula, A. (2012). Application of regression and neural models to predict competitive swimming performance. Perceptual and Motor Skills, 114(2), 610-626.
McGivney, B. A., Hernandez, B., Katz, L. M., MacHugh, D. E., McGovern, S. P., Parnell, A. C., ... & Hill, E. W. (2019). A genomic prediction model for racecourse starts in the Thoroughbred horse. Animal genetics, 50(4), 347-357.
Metz, C. E. (1978, October). Basic principles of ROC analysis. In Seminars in nuclear medicine (Vol. 8, No. 4, pp. 283-298). WB Saunders.
Micci-Barreca, D. (2001). A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1), 27-32.
Molnar, C. (2019). Interpretable machine learning. Lulu.com.
Nguyen, H., Bui, X. N., Bui, H. B., & Cuong, D. T. (2019). Developing an XGBoost model to predict blast-induced peak particle velocity in an open-pit mine: a case study. Acta Geophysica, 67(2), 477-490.
Noble, J., & Hughes, M. (2004). Formula One Racing for Dummies. John Wiley & Sons.
Oza, N. C. (2005, October). Online bagging and boosting. In 2005 IEEE international conference on systems, man and cybernetics (Vol. 3, pp. 2340-2345). IEEE.
Parikh, R., Mathai, A., Parikh, S., Sekhar, G. C., & Thomas, R. (2008). Understanding and using sensitivity, specificity and predictive values. Indian journal of ophthalmology, 56(1), 45.
Pfitzner, B., & Rishel, T. D. (2005). Do reliable predictors exist for the outcomes of NASCAR races. The Sport Journal, 8(2).
Piatetsky-Shapiro, G., & Steingold, S. (2000). Measuring lift quality in database marketing. ACM SIGKDD Explorations Newsletter, 2(2), 76-80.
Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
Probst, P., Wright, M. N., & Boulesteix, A. L. (2019). Hyperparameters and tuning strategies for random forest. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(3), e1301.
Przednowek, K., Iskra, J., & Przednowek, K. H. (2014, October). Predictive Modeling in 400-Metres Hurdles Races. In icSPORTS (pp. 137-144).
Pudaruth, S., Medard, N., & Dookhun, Z. B. (2013). Horse Racing Prediction at the Champ De Mars using a Weighted Probabilistic Approach. International Journal of Computer Applications, 72(5).
Quek, Y. T., Woo, W. L., & Logenthiran, T. (2016, November). DC equipment identification using K-means clustering and kNN classification techniques. In 2016 IEEE Region 10 Conference (TENCON) (pp. 777-780). IEEE.
Quenouille, M. H. (1949, July). Approximate tests of correlation in time-series 3. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 45, No. 3, pp. 483-484). Cambridge University Press.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3/4), 353-360.
Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106.
Remenyik, B., & Molnár, C. (2017). The role of the Formula 1 Grand Prix in Hungary's tourism. Prosperitas, 4(3), 92-112.
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3(Mar), 1371-1382.
Rodriguez, J. D., Perez, A., & Lozano, J. A. (2009). Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE transactions on pattern analysis and machine intelligence, 32(3), 569-575.
Rosner, B., Tworoger, S., & Qiu, W. (2015). Correcting AUC for measurement error. Journal of biometrics & biostatistics, 6(5).
Rosso, G., & Rosso, A. F. (2016). Statistical Analysis of F1 Monaco Grand Prix 2016. Relations Between Weather, Tyre Type and Race Stints.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American statistical Association, 91(434), 473-489.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE transactions on systems, man, and cybernetics, 21(3), 660-674.
Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5(2), 197-227.
Scheffer, J. (2002). Dealing with missing data.
Shah, A. D., Bartlett, J. W., Carpenter, J., Nicholas, O., & Hemingway, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American journal of epidemiology, 179(6), 764-774.
Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2(28), 307-317.
Silva, K. M., & Silva, F. J. (2010). A tale of two motorsports: A graphical-statistical analysis of how practice, qualifying, and past success relate to finish position in NASCAR and Formula One racing. Retrieved from http://newton.uor.edu/FacultyFolder/Silva/NASCARvF1.pdf
Spurgeon, B. (2009). Age Old Question of Whether it is the Car or the Driver that Counts. Formula 1, 101.
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118.
Stern, H. (1990). Models for distributions on permutations. Journal of the American Statistical Association, 85(410), 558-564.
Stoppels, E. (2017). Predicting race results using artificial neural networks (Master's thesis, University of Twente).
Story, M., & Congalton, R. G. (1986). Accuracy assessment: a user's perspective. Photogrammetric Engineering and remote sensing, 52(3), 397-399.
Sulsters, C., & Bekker, R. (2018). Simulating Formula One Race Strategies.
Stuart, G. (2016, January 27). How Daniel Ricciardo gets fit for F1. Retrieved from https://www.redbull.com/sg-en/daniel-ricciardo-f1-workout-training
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9(1), 307.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1), 25.
Tax, N., & Joustra, Y. (2015). Predicting the Dutch football competition using public data: A machine learning approach. Transactions on knowledge and data engineering, 10(10), 1-13.
Townsend, J. T. (1971). Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9(1), 40-50.
Tulabandhula, T., & Rudin, C. (2014). Tire changes, fresh air, and yellow flags: challenges in predictive analytics for professional racing. Big data, 2(2), 97-112.
Ulmer, B., Fernandez, M., & Peterson, M. (2013). Predicting soccer match results in the English Premier League (Doctoral dissertation, Stanford).
van Buuren, S., Boshuizen, H. C., & Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in medicine, 18(6), 681-694.
Van den Poel, D., De Schamphelaere, J., & Wets, G. (2004). Direct and indirect effects of retail promotions on sales and profits in the do-it-yourself market. Expert Systems with Applications, 27(1), 53-62.
Van der Heijden, G. J., Donders, A. R. T., Stijnen, T., & Moons, K. G. (2006). Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. Journal of clinical epidemiology, 59(10), 1102-1109.
Van Hulse, J., Khoshgoftaar, T. M., & Napolitano, A. (2007, June). Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th international conference on Machine learning (pp. 935-942).
Van Rijsbergen, C. J. (1979). Information retrieval.
Von Hippel, P. T. (2004). Biases in SPSS 12.0 missing value analysis. The American Statistician, 58(2), 160-164.
Wehenkel, L., Pavella, M., Euxibie, E., & Heilbronn, B. (1994). Decision tree based transient stability method a case study. IEEE Transactions on Power Systems, 9(1), 459-469.
Wiktorowicz, K., Przednowek, K., Lassota, L., & Krzeszowski, T. (2015). Predictive modeling in race walking. Computational intelligence and neuroscience, 2015.
Wirth, R., & Hipp, J. (2000, April). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining (pp. 29-39). London, UK: Springer-Verlag.
Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9), 2839-2846.
Xia, Y., Liu, C., Li, Y., & Liu, N. (2017). A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78, 225-241.
Yan, R., Ma, Z., Zhao, Y., & Kokogiannakis, G. (2016). A decision tree based data-driven diagnostic strategy for air handling units. Energy and Buildings, 133, 37-45.
Yeh, D., Li, Y., & Chu, W. (2008). Extracting entity-relationship diagram from a table-based legacy database. Journal of Systems and Software, 81(5), 764-771.
Zhu, X. J. (2005). Semi-supervised learning literature survey. University of Wisconsin-Madison Department of Computer Sciences.

Appendices

Appendix 1: Literature Tables

Table 37. Literature table of the related sports studies

RELATED SPORTS LITERATURE

Swimming

Edelmann-Nusser et al. (2002)
Content: Predicting the performance of a 200m backstroke swimmer at the Olympic Games of Sydney.
Model: neural networks; leave-one-out method; linear regression

Horse racing

Lo and Bacon‐Shone (1994)
Content: Predicting ordering probabilities in a multiple-entry competition.
Model: model of Harville (1973); model of Henery (1981)

Davoodi and Khanteymoori (2010)
Content: Predicting the outcome of horse races using 3 different types of algorithms.
Model: ANN (Artificial Neural Networks); time series analysis; regression (Engelbrecht, 2007)

Pudaruth, Medard and Dookhun (2013)
Content: Predictions of horse races using 9 factors; this analysis leads to a success rate of 58.33% for predicting the winner.
Model: weighted probabilistic model

McGivney et al. (2019)
Content: Assessing the relevance of heredity for the durability traits of Thoroughbred horses; horses with higher genetic potential have fewer non-participated races and better race outcomes.
Model: RFME (Random Forest with Mixed Effects)

Cycling

Karetnikov, Nuijten and Hassani (2019)
Content: This study includes the impact of a training schema on the performance of a cyclist; the XGBoost model is the best for predicting mountain stages and CatBoost for flat races.
Model: linear regression; lasso regression; long short-term memory (LSTM); decision tree; random forest regression; CatBoost; XGBoost

Soccer

Eryarsoy and Delen (2019)
Content: Predictive model for the outcome of a soccer game (a win, a loss or a draw) using the CRISP-DM methodology.
Model: naive Bayes; decision trees; gradient boosting trees; random forest; one-vs-all SGD

Ulmer, Fernandez and Peterson (2013)
Content: Prediction of soccer matches in the English Premier League using machine learning techniques.
Model: linear support vector machine; random forest; SVM with RBF kernel

Tax and Joustra (2015)
Content: Match-based predictions for soccer in the Dutch Eredivisie.
Model: naïve Bayes; LogitBoost; neural network; random forest; genetic programming

Skiing

Abut, Akay, Daneshvar and Heil (2017)
Content: The authors examine the predictive accuracy of artificial neural networks (ANN) for the racing time of cross-country skiers.
Model: artificial neural networks (ANN)

Hurdle races

Przednowek, Iskra and Przednowek (2014)
Content: A predictive modeling study in the field of 400-metre hurdle races.
Model: ordinary least squares (OLS) regression; ridge regression; LASSO regression; neural networks


Table 38. Literature table of the NASCAR studies

NASCAR LITERATURE

Graves, Reese and Fitzgerald (2003)
Content: Using finishing probabilities in NASCAR races; assessing the track-independent driver abilities and predicting future stars.
Model: probability model; model of Luce (1959); model of Stern (1990)

Pfitzner and Rishel (2005)
Content: Assessing the reliability of predictors for NASCAR race outcomes.
Model: correlation analysis

Allender (2008)
Content: Influence of driver experience on the predictions of NASCAR races, with a focus on the interaction effects.
Model: linear regression with focus on the interaction effects

Tulabandhula and Rudin (2014)
Content: Assessing the possibility to determine further race outcomes based upon past performance; optimizing re-fueling strategies and tire-change decisions.
Model: 2 linear baseline models; SVR (support vector regression); ridge regression; LASSO (least absolute shrinkage and selection operator) regression; random forest

Choo (2015)
Content: Further research on the paper of Tulabandhula and Rudin (2014); the author makes an in-race performance prediction software.
Model: machine learning approach cf. the previous paper of Tulabandhula and Rudin (2014); naïve predictions

Humphreys and Frick (2019)
Content: Analyzing the tournament theory in NASCAR; the impact of prize money on performance and driver speed.
Model: linear reduced form model

Silva and Silva (2010)
Content: Comparative analysis: NASCAR vs F1; looking into the differences and similarities of these motorized sports, with specific interest in the correlation coefficients.
Model: linear regression


Table 39. Literature table of the F1 studies

F1 LITERATURE

Eichenberger and Stadelmann (2009)
Content: The authors research the best F1 driver of all time. They conclude Juan Manuel Fangio is the best driver of all time, closely followed by Jim Clark and Michael Schumacher.
Model: linear regression

Bell, Smith, Sabel and Jones (2016)
Content: The authors assess the best F1 driver of all time taking into account their designated team, and the importance of both teams and drivers.
Model: linear regression

Stoppels (2017)
Content: Dissertation about predicting F1 results for the 2016 F1 season; predicting the outcome of the last 4 races based upon the results in the other races of 2016.
Model: ANN (Artificial Neural Network); multiclass logistic regression; linear regression

Aversa, Furnari and Haefliger (2015)
Content: Defines 4 different business models in F1 and their relationship with performance.
Model: QCA (Qualitative Comparative Analysis)

Büyükyazıcı and Sucu (2003)
Content: Determining the city to hold an F1 race depending upon hospitals, hotels and the renown of the city.
Model: AHP (Analytic Hierarchy Process); ANP (Analytic Network Process)

Marino, Aversa, Mesquita and Anand (2015)
Content: Changes in the F1 environment and the impact on the performance of the drivers.
Model: GMM (Generalized Method of Moments)

Judde, Booth and Brooks (2013)
Content: Analysis of competitive balance in Formula One racing using data from 1950-2010; changes regarding regulation have an impact on the uncertainty of the championship.
Model: linear regression

Jenkins and Floyd (2001)
Content: The dominance of a team due to technology is investigated. This study states that transparency is closely related with competitive advantage.
Model: theoretical sampling (see footnote 29)

Jenkins (2010)
Content: The relationship between technological discontinuities and competitive advantage is discussed.
Model: descriptive analysis conducted via in-depth interviews

Rosso and Rosso (2016)
Content: Impact of changing weather conditions during an F1 race, with the Monaco 2016 GP as example.
Model: quantile regression; POT analysis

Table 40. Literature table of the potential methodologies

METHODOLOGY LITERATURE

Fayyad (1996)
Content: KDD (Knowledge Discovery in Databases) as framework for pattern discovery in a dataset.
Methodology: KDD

Fayyad et al. (1996)
Content: Main issues in exploring datasets and the use of KDD in 5 examples.
Methodology: KDD

Fernandez (2003)
Content: SEMMA, developed by SAS, as a methodology which enables a smooth data mining process.
Methodology: SEMMA

Wirth and Hipp (2000)
Content: CRISP-DM (CRoss-Industry Standard Process for Data Mining) to fill the gap of a lack of industry standard for data mining.
Methodology: CRISP-DM

Azevedo and Santos (2008)
Content: Comparing these 3 methodologies and assessing their value for a data mining process.
Methodology: KDD, SEMMA, CRISP-DM

Bunker and Thabtah (2019)
Content: Proposition of an extension of CRISP-DM in a sport environment.
Methodology: SRP-CRISP-DM

Alnoukari, Alzoabi and Hanna (2008)
Content: Specifically designed methodology to be used in an agile environment.
Methodology: ASD-DM

29 The theoretical sampling approach finds its origin in the grounded theory approach by Glaser and Strauss (1967). This approach aims at establishing a theory via collecting and analyzing qualitative data.


Appendix 2: Extended Data Dictionary

The open-source data for this study can be found at http://ergast.com/downloads/f1db_csv.zip. These data sets and their variables are thoroughly explained below, together with the extra data sets ‘Weather’ (Appendix 3) and ‘Country’. The ‘Weather’ data set facilitated the creation of the ‘rain’ and ‘rainExp’ features, whereas the ‘Country’ data set enabled us to create the ‘homeAdvantage’ feature; a sketch of how such a feature can be derived from these data sets is given below.
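As an illustration, a sketch of how the ‘homeAdvantage’ feature can be derived by matching a driver’s nationality to the nationality of the circuit; the dplyr pipeline below follows the keys of the data dictionary, but the joins are a simplified assumption of the actual feature engineering code (the data frames results, races, circuits, drivers and country are assumed to be loaded):

library(dplyr)

home <- results %>%
  left_join(races,    by = "raceId") %>%      # adds circuitId
  left_join(circuits, by = "circuitId") %>%   # adds country of the circuit
  left_join(drivers,  by = "driverId") %>%    # adds nationality of the driver
  left_join(country,  by = "country") %>%     # adds nationalityCircuit
  mutate(homeAdvantage = as.integer(nationality == nationalityCircuit))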

Table 41. Circuits

Variable     Explanation
circuitId    ID to identify circuit
circuitRef   Short name of circuit
name         Full name of circuit
location     City of the circuit
country      Country of the circuit
lat          Latitude
lng          Longitude
alt          No added value, just a column with ‘/N’
url          Link to Wikipedia page of circuit

Table 42. Constructor results

Variable               Explanation
constructorResultsId   ID to identify result of constructor
raceId                 ID to identify race
constructorId          ID to identify constructor
points                 Points for constructor in particular race
status                 No added value, just a column with ‘/N’


Table 43. Constructor standings

Variable                 Explanation
constructorStandingsId   ID to identify standing of constructor
raceId                   ID to identify race
constructorId            ID to identify constructor
points                   Accumulated constructor points
position                 Standing of constructor in Championship
positionText             Position converted to character
wins                     Wins of constructor in season

Table 44. Constructors

Variable         Explanation
constructorId    ID to identify constructor
constructorRef   Short name of constructor
name             Full name of constructor
nationality      Nationality of constructor
url              Wikipedia link to constructor

Table 45. Driver standings

Variable            Explanation
driverStandingsId   ID to identify standing of driver
raceId              ID to identify race
driverId            ID to identify driver
points              Accumulated points of driver
position            Standing of driver in Championship
positionText        Position converted to character
wins                Wins of driver in season


Table 46. Drivers

Variable      Explanation
driverId      ID to identify driver
driverRef     Reference to the driver
number        Driver’s racing number
code          3 letter code for driver
forename      Driver’s forename
surname       Driver’s surname
dob           Date of birth
nationality   Driver’s nationality
url           Link to driver’s Wikipedia page

Table 47. Lap times

Variable       Explanation
raceId         ID to identify race
driverId       ID to identify driver
lap            The lap of which the time is shown
position       Position of driver in race
time           The actual time of the lap
milliseconds   Converted to milliseconds

Table 48. Pit stops

Variable       Explanation
raceId         ID to identify race
driverId       ID to identify driver
stop           Pit stop number in particular race
lap            Lap of pit stop
time           Time of pit stop
duration       Duration of pit stop
milliseconds   Converted to milliseconds


Table 49. Qualifying

Variable        Explanation
qualifyId       ID to identify qualification result
raceId          ID to identify race
driverId        ID to identify driver
constructorId   ID to identify constructor
number          Racing number of driver
position        Starting position based on qualifications
q1              Results of first qualifier
q2              Results of second qualifier
q3              Results of third qualifier

Table 50. Races

Variable    Explanation
raceId      ID to identify race
year        Year of race
round       Round number of the race within the season
circuitId   ID to identify circuit
name        Name of GP
date        Date of race
time        Starting time of the race
url         Wikipedia page of GP of particular season

Table 51. Results

Variable          Explanation
resultId          ID to identify result
raceId            ID to identify race
driverId          ID to identify driver
constructorId     ID to identify constructor
number            Driver’s racing number
grid              Starting position
position          Finish result
positionText      Converted to character
positionOrder     Finish result with non-finishing incorporated
points            Points for driver in a particular race
laps              Number of laps driven
time              Time of race for winner, time after winner for the others
milliseconds      Expressed in milliseconds
fastestLap        The fastest lap of race
rank              Rank of fastest lap
fastestLapTime    Time of the fastest lap of driver
fastestLapSpeed   Speed of the fastest lap of driver
statusId          Status of driver

Table 52. Seasons

Variable   Explanation
year       Year of season
url        Wikipedia link to season

Table 53. Status

Variable   Explanation
statusId   ID to identify status
status     Related status

Table 54. Weather

Variable   Explanation
year       Year of race
name       Name of circuit
rain       Wet affected situation


Table 55. Country

Variable             Explanation
country              Country
nationalityCircuit   Nationality of circuit


Appendix 3: All Wet Affected Races

The wet affected races come from different sources:
1. https://forums.autosport.com/topic/193061-list-of-all-wet-f1-grands-prix/
2. https://www.reddit.com/r/formula1/comments/91676l/a_list_of_the_most_recent_rain_affected_races/
3. Wikipedia
4. Jones (1996)

Source 1 refers to Jones (1996) for the wet affected races as follows:

Figure 68. Popular non-verified source referring to Jones (1996) (source 4) for weather data


Source 1 also contains other rain affected races, as can be found in Figure 69.

Figure 69. Source 1 (Autosport Forum) wet affected races

Source 2 consists of additional data for rain affected races as shown in Figures 70, 71 and 72.

Figure 70. Source 2 (Reddit) wet affected races


Figure 71. Source 2 wet affected race part 2

Figure 72. Source 2 wet affected races part 3

Lastly, we checked the claims of the sources above via Wikipedia. For example, the weather conditions for the 2000 Canadian GP were studied, as shown in Figure 73.


Figure 73. Source 3 (Wikipedia) as extra control instrument

The rain feature for this Canadian GP in 2000 was determined as follows. First, we check all sources for the availability of the raining data for the year of the request. Jones (1996) does not contain this information since 2000 lies beyond the coverage of this source. However, the other 3 sources confirm that this GP was affected by the rain. Thus, we include all the observations of this race with a ‘rain’ feature equal to 1. In the case of a disagreement between the sources, we perform majority voting (three 0s and one 1 result in a 0). In the case of a tie (2 vs 2), we follow Jones (1996), as this is a verified source. All wet affected races, constructed following the principles stated above, can be found in Table 56.
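Below is a minimal sketch of this voting rule, assuming one 0/1 vote per source for a given race (the function and source names are illustrative, not part of the actual code):

rain_vote <- function(votes) {
  # votes: named 0/1 vector, e.g. c(jones = 1, autosport = 0, reddit = 1, wiki = 1);
  # NA means the source has no coverage for that year (e.g. Jones (1996) after 1996)
  votes <- votes[!is.na(votes)]
  ones  <- sum(votes)
  zeros <- length(votes) - ones
  if (ones > zeros) return(1)
  if (zeros > ones) return(0)
  unname(votes["jones"])  # tie (2 vs 2): follow the verified source, Jones (1996)
}

rain_vote(c(jones = NA, autosport = 1, reddit = 1, wiki = 1))  # 2000 Canadian GP -> 1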

Table 56. Wet affected races per decade

Decade (number of wet affected races): wet affected races

50s (14): 1950 Indianapolis 500; 1951; 1952; 1952; 1952; 1953; 1954; 1954 French Grand Prix; 1954 British Grand Prix; 1954 Swiss Grand Prix; 1955 Dutch Grand Prix; 1956 Belgian Grand Prix; 1956; 1958

60s (16): 1960 Monaco Grand Prix; 1961 British Grand Prix; 1961; 1962 German Grand Prix; 1962 Italian Grand Prix; 1963 Belgian Grand Prix; 1963 French Grand Prix; 1965 Belgian Grand Prix; 1965; 1966 Belgian Grand Prix; 1966 British Grand Prix; 1966 German Grand Prix; 1967; 1968 Dutch Grand Prix; 1968 French Grand Prix; 1968 German Grand Prix

70s (21): 1971 Dutch Grand Prix; 1971 Canadian Grand Prix; 1972; 1972 Monaco Grand Prix; 1972 United States Grand Prix; 1973 Canadian Grand Prix; 1974 Brazilian Grand Prix; 1974 Spanish Grand Prix; 1974 German Grand Prix; 1975 Monaco Grand Prix; 1975 Dutch Grand Prix; 1975 British Grand Prix; 1975; 1976 German Grand Prix; 1976; 1977 Belgian Grand Prix; 1977 Austrian Grand Prix; 1977 United States Grand Prix West; 1978 Austrian Grand Prix; 1979; 1979 United States Grand Prix West

80s (17): 1981 Brazilian Grand Prix; 1981 Italian Grand Prix; 1981 Belgian Grand Prix; 1981 French Grand Prix; 1981 Canadian Grand Prix; 1981 German Grand Prix; 1983 Monaco Grand Prix; 1984 Monaco Grand Prix; 1985 Portuguese Grand Prix; 1985 Belgian Grand Prix; 1987; 1988 British Grand Prix; 1988 German Grand Prix; 1988 Japanese Grand Prix; 1989 Canadian Grand Prix; 1989 Belgian Grand Prix; 1989

90s (28): 1991 Brazilian Grand Prix; 1991 San Marino Grand Prix; 1991 Australian Grand Prix; 1991 Spanish Grand Prix; 1992 Spanish Grand Prix; 1992 French Grand Prix; 1992 Belgian Grand Prix; 1993 South African Grand Prix; 1993 Brazilian Grand Prix; 1993; 1993 Japanese Grand Prix; 1994 Japanese Grand Prix; 1995 San Marino Grand Prix; 1995 French Grand Prix; 1995 Belgian Grand Prix; 1995 European Grand Prix; 1995 Japanese Grand Prix; 1996 Brazilian Grand Prix; 1996 Spanish Grand Prix; 1996 Monaco Grand Prix; 1997 Monaco Grand Prix; 1997 Belgian Grand Prix; 1997 French Grand Prix; 1998 Belgian Grand Prix; 1998 Argentine Grand Prix; 1998 British Grand Prix; 1999 French Grand Prix; 1999 European Grand Prix

00s (28): 2000 German Grand Prix; 2000 Canadian Grand Prix; 2000 European Grand Prix; 2000 Belgian Grand Prix; 2000 United States Grand Prix; 2000 Japanese Grand Prix; 2001; 2001 Brazilian Grand Prix; 2002 British Grand Prix; 2003 Brazilian Grand Prix; 2003 Australian Grand Prix; 2004 Malaysian Grand Prix; 2004 Italian Grand Prix; 2004 Brazilian Grand Prix; 2005 Belgian Grand Prix; 2006; 2006; 2007 European Grand Prix; 2007 Japanese Grand Prix; 2007 Chinese Grand Prix; 2008 British Grand Prix; 2008 Monaco Grand Prix; 2008 Belgian Grand Prix; 2008 Italian Grand Prix; 2008 Brazilian Grand Prix; 2008 French Grand Prix; 2009 Malaysian Grand Prix; 2009 Chinese Grand Prix

10s (25): 2010 Australian Grand Prix; 2010; 2010 Chinese Grand Prix; 2010 Belgian Grand Prix; 2010 Brazilian Grand Prix; 2011 Canadian Grand Prix; 2011 British Grand Prix; 2011 German Grand Prix; 2011 Hungarian Grand Prix; 2011 Korean Grand Prix; 2012 Malaysian Grand Prix; 2012 Brazilian Grand Prix; 2013 Malaysian Grand Prix; 2014 Hungarian Grand Prix; 2014 Japanese Grand Prix; 2015 Belgian Grand Prix; 2015 United States Grand Prix; 2015 British Grand Prix; 2016 Monaco Grand Prix; 2016 British Grand Prix; 2016 Brazilian Grand Prix; 2017 Chinese Grand Prix; 2017; 2018 German Grand Prix; 2019 German Grand Prix


Appendix 4: Current Formula One teams and their previous team names

In Table 57, the constructors, the year of their first participation, and their previous team names are shown. To account for the acquired legacy (Bell et al., 2016), we consider these different team names to be the same constructor, since the acquired experience will be more or less preserved. Maybe the approach of the team changes, but they do not start from scratch.

Table 57. Constructors and their legacy

Current Formula One team   Year of first GP                                       Previous team names
Alfa Romeo                 1950                                                   Sauber (1993-2005, 2011-2018); BMW Sauber (2006-2010)
Alpha Tauri                / (due to COVID-19 no races in the first part of 2020) Minardi (1985-2005); Toro Rosso (2006-2019)
Ferrari                    1950                                                   /
Haas                       2016                                                   /
McLaren                    1966                                                   /
Mercedes                   1954                                                   Tyrrell (1970-1998); BAR (1999-2005); Honda (2006-2008); Brawn (2009)
Racing Point               2019                                                   Jordan (1991-2005); Midland (2006); Spyker (2007); Force India (2008-2018)
Red Bull                   2005                                                   Stewart (1997-1999); Jaguar (2000-2004)
Renault                    1977                                                   Toleman (1981-1985); Benetton (1986-2001); Lotus (2012-2015)
Williams                   1978                                                   /

Source: https://en.wikipedia.org/wiki/List_of_Formula_One_constructors


Appendix 5: Build-up code in R

The R code follows the SRP-CRISP-DM methodology. In the ‘Domain Understanding’ step, the intentions of the different models are defined. Furthermore, the 3 analyses are explained together with the related problem they try to solve.

In the ‘Data Understanding’ step, the required packages and the different Ergast API data tables are loaded in. Moreover, the data is explored and the data quality is assessed via a missing value analysis. An ERD is constructed to understand the relationships between the different datasets.

In the ‘Data Preparation & Feature Engineering’ step, the datasets are merged and the features are crafted. Also, the features are categorized in subsets: experience-related features, lagged features, and starting features. The end result of this step is the basetable containing all the features and the dependent variables for our 3 analyses.

The ‘Modeling’ step creates training and test sets from the basetable. The training set is fitted with decision trees, bagging, random forest, and boosting to make predictions for top 3 performance, race completion, and qualifying ability. 5-fold cross-validation is performed to tune the hyperparameters of the models with the grid search technique. Afterward, the fitted models with tuned hyperparameters are used to make predictions for unseen data.

In the ‘Model Evaluation’ phase, these predictions are compared to the real target values of the test set. We perform different evaluation techniques: threshold-based metrics, a threshold-independent metric, and graphical representations via variable importance and partial dependence. An addition here is the variable selection procedure, which acts as a robustness check for our fitted models.

In the ‘Model Deployment’ phase, a connection with the Ergast API is made to update the data in real-time.
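As a minimal sketch of the data loading in the ‘Data Understanding’ step, the Ergast CSV dump can be retrieved and read in as follows; the file names inside the archive are assumed to match the data set names of Appendix 2:

url <- "http://ergast.com/downloads/f1db_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")
unzip(tmp, exdir = "f1db")

# read a few of the data sets described in Appendix 2
results    <- read.csv("f1db/results.csv")
races      <- read.csv("f1db/races.csv")
qualifying <- read.csv("f1db/qualifying.csv")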

The structure of the code can be seen on the following pages and is shown in these figures: Figure 74: Domain Understanding, Data understanding, Data preparation & Feature Engineering Figure 75: Modeling Figure 76: Evaluation Figure 77: Deployment, Feature selection as robustness check (Extra)

Lastly, we would like to acknowledge the developers of the different R packages used: caret (Kuhn), mice (van Buuren), tibble (Müller), lift (Hoornaert, Ballings & Van den Poel), xgboost (Chen, He, Benesty & Khotilovich), AUC (Ballings & Van den Poel), dplyr (Wickham), lubridate (Spinu), smotefamily (Siriseriwan), stringr (Wickham), MLmetrics (Yan), and iml (Molnar).


Figure 74. Code layout step 1, 2 and 3 SRP-CRISP-DM


Figure 75. Code layout step 4 SRP-CRISP-DM


Figure 76. Code layout step 5 SRP-CRISP-DM


Figure 77. Code layout step 6 SRP-CRISP-DM and feature selection


Appendix 6: Impact on AUC of class imbalance treatment

Table 58. Average AUC for the different analyses with the different class imbalance approaches

                     Baseline   Over-sampling   Under-sampling   ADASYN
Top 3                0.674      0.714           0.818            0.718
Race completion      0.602      0.601           0.618            0.594
Qualifying ability   0.680      0.817           0.876            0.695

Average difference with baseline:
Baseline:       0
Over-sampling:  (0.040 − 0.001 + 0.137) / 3 = 0.059 ≈ 0.06
Under-sampling: (0.144 + 0.016 + 0.196) / 3 = 0.119 ≈ 0.12
ADASYN:         (0.044 − 0.008 + 0.015) / 3 = 0.017 ≈ 0.02
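The averages in Table 58 can be verified with a few lines of R (values taken from the table above):

auc <- data.frame(baseline = c(0.674, 0.602, 0.680),
                  over     = c(0.714, 0.601, 0.817),
                  under    = c(0.818, 0.618, 0.876),
                  adasyn   = c(0.718, 0.594, 0.695))
round(colMeans(auc - auc$baseline), 3)
# baseline 0.000, over 0.059, under 0.119, adasyn 0.017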
