Analysis and Development of Methods for Automotive Market Segmentation
Total Page:16
File Type:pdf, Size:1020Kb
POLITECNICO DI TORINO Master degree course in Computer Engineering Master Degree Thesis Analysis and Development of Methods for Automotive Market Segmentation Supervisor Candidate prof. Paolo Garza Marco Angelo Guttadauria Internship Tutor Dott. Ing. Silvia Delsanto Accademic year 2020-2021 This work is subject to the Creative Commons Licence Summary In the automotive sector in recent years the use of data has become increas- ingly important to allow car manufacturers to analyse the market, competing vehicles, their specifications and their prices so that they could use this in- formation to create a product offering that adapts as much as possible to the market. Some of the more significant analyses require vehicle segmentation. Segmentation represents the division of vehicles into commercially homo- geneous groups. The market analysis required by car manufacturers are typ- ically conducted using as comparison vehicles from the same segment. This type of analysis is called basket analysis, where the basket would be the set of vehicles considered most similar to the vehicle being compared. Jato has developed its own more general definition of segmentation and based on this subdivision catalogues the models present in the databases. The definition of the segments and the choice of the segment to be assigned to the vehicles, however, are not immediate processes. The first part of the thesis is aimed at defining the characteristics that allow the segmentation to be carried out and to develop a tool capable of carrying out an automatic segmentation of the models that can be used as a support for the segmentation phase. The model developed is based on a random forest algorithm that allows to obtain an automatic classifier to support the choice of the segment. The second part of the work is aimed at allowing to systematically define the vehicles that are considered similar to a given vehicle in order to be able to define more precise baskets for basket analysis. Two similarity measures were developed to identify similar vehicles at model level and similar vehicles at version level. Given the presence of categorical and continuous features, the proposed solution is based on the Gower distance measurement that allows to manage both types of variables by combining the use of the Dice coefficient for categorical features and the Manhattan distance for continuous features. 3 Acknowledgements Con queste poche righe colgo l’occasione per ringraziare tutte le persone che mi sono state vicine in questo percorso. Desidero ringraziare il mio relatore, il Prof. Paolo Garza, per avermi seguito con competenza e disponibilità e la Dott.ssa Silvia Delsanto, per avermi guidato in questo percorso di tirocinio in Jato e per avermi aiutato a crescere con i suoi preziosi suggerimenti. Vorrei ringraziare particolarmente la mia famiglia che mi ha sempre sup- portato e permesso di raggiungere questo importante obiettivo. So che in alcuni periodi non è stato facile starmi vicino ma anche in quei momenti siete stati pazienti e mi siete stati di grande supporto. Un pensiero speciale va ai miei “colleghi”, ma soprattutto amici, Salvo C., Salvo P., Andrei, Luigi, Giuseppe e Vincenzo, con cui ho condiviso questo periodo universitario fatto di alti e bassi ma sempre tra mille risate, senza di loro non sarebbe stato lo stesso. E per ultimi, ma non per importanza, vorrei ringraziare i miei amici di una vita, in particolare Aldo, Giorgio Khaleesi, Giorgio Prì, Laura, Monica e Noemi con cui ho condiviso alcune delle esperienze più belle della mia vita. Grazie per esserci sempre anche a distanza, con voi le giornate sono sempre un pò più leggere. P.S. quasi dimenticavo di ringraziare Ermenegildo per gli ottimi Spritz offerti. 4 Contents List of Tables 7 List of Figures 8 1 Introduction 9 1.1 JATO role in automotive sector.................9 1.2 Automotive data complexity................... 10 1.2.1 Model Vs Version and their data representation diffi- culties........................... 11 1.2.2 Options, features and their data representation diffi- culties........................... 12 1.3 Jato data............................. 13 1.3.1 Data collection complexity................ 13 1.3.2 Vehicle Segmentation................... 13 1.3.3 Data structure, schema_id hierarchy.......... 15 1.3.4 Databases......................... 15 2 Vehicle segment classifier 19 2.1 Introduction to classifiers..................... 19 2.1.1 Classifiers explained................... 20 2.1.2 Steps for building the classifier.............. 21 2.1.3 Introduction to Decision tree............... 28 2.1.4 Introduction to Random Forest............. 31 2.2 Data collection and cleaning................... 32 2.3 Decision Tree Classifier...................... 36 2.3.1 Decision tree implementation.............. 36 2.3.2 Decision Tree training and hyperparameter tuning us- ing cross-validation.................... 36 2.3.3 Decision Tree evaluation................. 37 5 2.3.4 Decision Tree structure analysis............. 38 2.4 Random Forest Classifier..................... 40 2.4.1 Implementation...................... 41 2.4.2 Training phase...................... 41 2.4.3 Classifier evaluation.................... 43 2.4.4 Features importance analysis............... 47 2.4.5 Random Forest implementation in other markets... 49 2.5 Classification algorithms comparison.............. 52 2.5.1 Naive Bayes classifier implementation.......... 52 2.5.2 Support Vector Machine implementation........ 52 2.5.3 Results and Comparisons................. 53 3 Vehicle similarity 55 3.1 Use case.............................. 55 3.2 Introduction to similarity measures............... 56 3.3 Features Selection......................... 58 3.3.1 Model features...................... 58 3.3.2 Version features...................... 60 3.4 Selection of distance metrics................... 61 3.4.1 Distance metrics per model............... 62 3.4.2 Distance metrics per version............... 62 3.5 Distance algorithm application.................. 62 3.5.1 Model distance algorithm................ 62 3.5.2 Version distance algorithm................ 64 3.6 Validation............................. 66 3.6.1 Model Validation..................... 66 3.6.2 Version Validation.................... 69 4 Conclusions and future works 75 4.1 Vehicle segment classifier conclusions.............. 75 4.2 Vehicle similarity conclusions.................. 76 4.2.1 Model similarity validation................ 77 4.2.2 Version similarity validation............... 77 4.3 Future works........................... 77 Bibliography 79 6 List of Tables 2.1 Decision Tree evaluation metrics................. 38 2.2 Random Forest segments evaluation metrics.......... 45 2.3 Random Forest evaluation metrics................ 45 2.4 Features importance values.................... 48 2.5 Classification algorithms validation measures.......... 53 3.1 Top5, Top 10, Top 15 training accuracies in model similarity measure.............................. 64 3.2 Model Validation......................... 69 3.3 Model Validation Accuracy.................... 69 3.4 Version Validation Top 5..................... 71 3.5 Version Validation Top 10.................... 72 3.6 Version Validation Top 15.................... 72 4.1 DT and RF classification results in the analysed market.... 76 4.2 Random Forest results on dataset subsets............ 76 7 List of Figures 2.1 Multiclass Confusion Matrix................... 25 2.2 Cross validation procedure.................... 28 2.3 Decision tree structure...................... 30 2.4 Decision tree models accuracy after cross validation...... 37 2.5 Root node with its branches of the best decision tree model.. 39 2.6 First decision tree structure detail................ 39 2.7 Second decision tree structure detail............... 40 2.8 Validation table detail...................... 44 2.9 Frequently misclassified vehicle dataset extract......... 46 3.1 Trim level distance matrix.................... 65 8 Chapter 1 Introduction In recent years, it has increasingly been heard that the most important assets owned by companies are the data they acquire, manage and store. Nowa- days most companies have databases containing various types of data, these databases constitute a potential mine of useful information for various pur- poses. Analyzing this data correctly can give the company support in decision making and produce a competitive advantage for the company on the mar- ket. The type of data and the use made of it changes as the sector of interest changes. The sector of interest for this thesis is the automotive one. We will see how data is important in this area and how Jato plays an important role in collecting and analyzing it. 1.1 JATO role in automotive sector During the 1980s the automotive sector was in its phase of maximum expan- sion. In those years the economy was changing and the concept of global- ization was gaining ground and, as in all sectors, the automotive sector was also involved. Car manufacturers were starting to expand into the interna- tional market, to do this they therefore needed to learn more about the global market, competing vehicles, their specifications and their prices so that they could use this information to create a product offering that adapts as much as possible to new markets. Jato was therefore born to harness the power of information in this new environment by making data collection and analysis his business. Over the years It has collected various types of data to be able to provide different information and different tools to its custom. The