Masaryk University Faculty of Informatics

Car Insurance Recommendation System

Bachelor’s Thesis

Jakub Smatana

Brno, Spring 2021

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jakub Smatana

Advisor: Ing. Lukáš Grolig

Acknowledgements

I would like to thank my advisor for guidance throughout the whole process of writing this thesis. I would also like to thank my family and friends for their moral support.

Abstract

This thesis aims to help insurance agents with product decisions. The goal was to construct a car insurance recommendation system using machine learning and to compare the results of two different algorithms. The predictions focus on recommending the insurance company and front glass insurance. The models were trained on data provided by the Able company. The thesis presents an application that accepts input information about the driver and the car and provides a recommendation. The system can be extended with more features, with the aim of predicting complete insurance coverage.

Keywords

CatBoost, decision trees, LightGBM, machine learning, motor insurance, recommender system, supervised learning

Contents

1 Introduction

2 State of the Art
 2.1 Car Insurance
 2.2 Machine Learning Recommendations
 2.3 Car Insurance Recommendation System

3 Insurance
 3.1 Motor insurance
 3.2 Front glass insurance

4 Machine Learning
 4.1 Machine Learning Approaches
  4.1.1 Supervised Learning
  4.1.2 Unsupervised Learning
  4.1.3 Semi-supervised Learning
  4.1.4 Reinforcement Learning
 4.2 Classification algorithms
  4.2.1 K-Nearest Neighbor
  4.2.2 Support Vector Machine
  4.2.3 Logistic Regression
  4.2.4 Decision Trees

5 Used Technologies
 5.1 CatBoost
 5.2 LightGBM
 5.3 REST API
 5.4 Docker
 5.5 Pandas

6 Methodology
 6.1 Data Collection
 6.2 Data Preparation
  6.2.1 Formatting data
  6.2.2 Resulting file structure
  6.2.3 Choosing relevant data
  6.2.4 Flattening data
  6.2.5 Dealing with missing data
  6.2.6 Creating new variables
  6.2.7 Updating current variables
  6.2.8 Splitting data
 6.3 Choose an Algorithm
 6.4 Train the Model
 6.5 Evaluate the Model
 6.6 Parameter Tuning
 6.7 Make Predictions

7 Research
 7.1 Comparing Results
  7.1.1 Model Producer
  7.1.2 Model FrontGlass
 7.2 Model Analysis
 7.3 Potential improvements

8 Conclusion

A An appendix

Bibliography

List of Tables

6.1 Evaluation results after the first CatBoost model training predicting the "FrontGlass".
6.2 Evaluation results after changing the random seed used for training.
6.3 Results after training the model predicting "FrontGlass" on datasets with different numbers of negative rows.
7.1 Evaluation results from CatBoost and LightGBM classifiers for "Producer".
7.2 Evaluation results from the LightGBM model predicting "FrontGlass" with different numbers of negative rows.


1 Introduction

One of the many steps in purchasing a vehicle is insuring it. Insuring a vehicle protects the driver from financial loss in case of an accident. Many companies offer an insurance package that includes basic insurance, and the customer can decide to extend this contract with other optional packages. With insurance package customization, companies try to offer the best packages for their clients while maximizing the company's income. An insurer that receives information about the driver, including their personal data and vehicle registration details, must decide which package is the best to offer, so that both sides are satisfied.

The goal of this thesis is to build a Car Insurance Recommendation System trained on actual data from insurance companies, provided by the Able company1. I compare the results of the machine learning algorithms that the system is trained on: CatBoost (see section 5.1) and LightGBM (see section 5.2). The system should help agents select the right insurance company for their customers and suggest to the agent whether they should recommend insuring the front glass as an addition.

In the second chapter, the state of the art is presented. In the third chapter, I explain the basic principles of insurance with a focus on the insurance of vehicles. In chapter four, essential information about machine learning and its main approaches is provided. In the following chapter, I describe the technologies that were used in the process of making the prediction system. In the sixth chapter, I summarize the steps taken in the process of making this recommendation system.

The final application makes a prediction from the necessary input information, helping an agent choose the right insurance company for the customer or suggesting that the agent offer front glass insurance as an addition.

1. https://www.able.cz

2 State of the Art

In this chapter, I describe the technologies used and the background of this thesis.

2.1 Car Insurance

Nowadays, motor vehicles have become an everyday part of our lives. The number of cars is increasing [1], and with growing popularity, the chance of being involved in an accident is higher. That is why car insurance was invented: to protect people financially after accidents, or at least to compensate for property damage or injuries. In Europe, although the countries require a driver to have car insurance, there are differences in minimal coverage [2]. There are lots of companies that offer insurance packages, but their products differ. The best product selection depends mainly on vehicle parameters such as power, weight, engine volume and category, and on the driver's personal information, such as the number of years the person has held a driving licence, the city they live in, or whether they were registered in a traffic accident [3].

2.2 Machine Learning Recommendations

Machine learning as a term was first introduced in 1959 by Arthur Samuel. It has been popularized and used for everyday purposes since roughly 2006. Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy [4]. Machine learning algorithms are divided into categories such as unsupervised learning, supervised learning, semi-supervised learning and reinforcement learning. In supervised learning, the algorithm works with data carrying input and output labels; in other words, it can be described as learning with a teacher.

A recommender system is software that predicts the rating or preference of an item that a user or customer can choose. Recommender systems are used in various applications, from predicting videos to watch or things to buy to suggesting a specific product configuration to use. Recommender systems are also used to assist insurance agents with picking the right product for customers. These systems use supervised machine learning algorithms (see section 4.1.1) that learn from large datasets of existing contracts while such datasets are constantly updated.

2.3 Car Insurance Recommendation System

The car insurance recommendation system developed in this thesis is based on real data from two insurance companies. An existing system similar to the one developed in this thesis focuses on choosing the right product to recommend to a customer; its targets are customers who are likely to subscribe to an additional cover [5]. My system, in contrast, predicts the most fitting insurance company based on what customers with similar personal and vehicle data have chosen. The system also suggests to the user whether they should consider insuring the front glass in addition to the recommended insurance company.

The two machine learning frameworks used to train this system are CatBoost1 and LightGBM2. Both of these algorithms are based on gradient boosting on decision trees. Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models in functional space. It achieves state-of-the-art results in many practical tasks [6]. Both of these frameworks support categorical features, i.e., variables that can take on one of a limited number of possible values (e.g., City or Vehicle Category).

The comparison of these two frameworks is based on the accuracy score, which is calculated as the fraction of correctly predicted samples out of all predicted samples. After selecting the model with the better score, the application is deployed in a Docker3 container. It provides a REST API4 accepting input information and providing the recommendation.

1. https://catboost.ai/
2. https://lightgbm.readthedocs.io/en/latest/
3. https://www.docker.com/
4. https://en.wikipedia.org/wiki/Representational_state_transfer


Data redundancy and data inconsistency are known problems when using such data for any business model, primarily when the data are used in machine learning systems. They can lead to misleading and inefficient models that cannot be repaired by changing the parameters of a learning function. Although filling in missing data or changing the data structure can lead to promising results, this step is beyond the scope of this thesis.

3 Insurance

Insurance is a means of protection against financial loss. In the modern world, where roads are full of cars, we risk our health and possessions; that is why insurance companies were created, to help people in need. The first section covers the essentials of motor insurance. The following section looks at front glass insurance, which is not part of most companies' basic insurance packages.

3.1 Motor insurance

Motor insurance is the type of insurance that covers financial loss in the event of an incident involving a vehicle. In many countries, motor insurance is mandatory, although the requirements vary from one country or jurisdiction to another [2]. The availability and price of the insurance coverage depend on multiple factors.

1. Driving record — the driver's participation in any recent accidents or traffic violations affects the price of the insurance package

2. Car usage — the distance covered by the driver is a large factor because of the increased possibility of being part of an accident

3. Location — larger cities have higher rates of vandalism and a higher chance for the driver to be involved in an accident, so living in a bigger city means a higher insurance premium

4. Age — mature drivers are known to have fewer accidents than drivers with less experience, so drivers below the age of 25 pay more

5. Car specification — the more expensive the car, the more has to be paid; the same applies to a bigger engine, the safety record of the car model, and how well it protects the driver in an accident.

All of these factors together determine the price of the insurance coverage.

3.2 Front glass insurance

The front glass, also called a windshield or windscreen, is the front window of a vehicle that provides visibility while also protecting the driver from outer elements, mostly rocks or other solid objects. A windshield must meet important conditions: it must provide structural support for the roof in case of a car rollover; it must be shatterproof so that, in case of an accident, shards of broken glass do not fly in and injure any passenger; and it must be tough enough that the glass is not damaged in a minor collision. The material that fulfils all of these requirements is called laminated safety glass [7]. This technique of making a resistant front glass also makes it quite expensive, so any damage to the windshield costs a lot of money.

To avoid these expenses in case of a damaged front glass, car owners can purchase front glass insurance. Front glass insurance is mainly divided into two kinds of insurance policies: comprehensive coverage and full glass coverage. Comprehensive coverage covers repairing or replacing the windshield after a collision with another object or a rock. It also covers damage from theft, fire, or another incident that breaks or damages the glass. Full glass coverage can be part of comprehensive coverage, or it can be an addition to it. Depending on the insurance company, a deductible may not have to be paid with full glass coverage [8].

4 Machine Learning

Machine learning (ML) is a tool that helps us solve problems involving large amounts of data. ML techniques are used to discover hidden patterns within a data structure that would otherwise be hard to find [9]. Computer programs that use these algorithms can learn from experience to improve performance without being explicitly programmed for such a task. These systems are widely used by high-traffic websites such as YouTube to provide video recommendations, to filter spam messages in Gmail, to predict high-traffic areas, and much more.

4.1 Machine Learning Approaches

Machine learning approaches can be divided into four groups, of which supervised learning and unsupervised learning are the most commonly used. Semi-supervised learning and reinforcement learning are newer than the first two, but they have shown great potential despite their complexity [9].

4.1.1 Supervised Learning

Supervised learning, otherwise known as learning with a teacher, is a method that consists of making a prediction model based on input and output labelled data. The algorithms learn a function from training examples, where each input has a dedicated output, to find the best possible replication of an unknown input function [10]. Supervised learning can be divided into two categories:

1. Classification — a method that deals with the prediction of the output value (the target) based on input values (features). We provide learning data for the algorithm to learn from and then label new unseen variables based on the learning experience. The target value in classification is discrete (e.g., nominal or ordinal). The most basic and common classification problem is filtering spam messages in email, where the set of input emails is labelled as "spam" or "no-spam".


2. Regression — very similar to classification, but the outputs are continuous instead of discrete. Predicting the value of an estate can be marked as a regression problem.

4.1.2 Unsupervised Learning

Unsupervised learning is an approach where only input data is provided. The goal is to find patterns or similarities in the data without labelling them. This style is mainly used for clustering, which is the process of forming groups whose shared attributes are not known beforehand [10]. The number of clusters is not predefined, which can lead to good but also bad results.

4.1.3 Semi-supervised Learning

Semi-supervised learning is a mixture of supervised and unsupervised learning. It uses a mix of labelled data and a more extensive set of unlabelled data. It is a well-proven method when labelling data requires too much time or too many resources. For example, a dataset can contain a few labelled brain-scan pictures that help make the most probable diagnoses for all the patients [10].

4.1.4 Reinforcement Learning

In reinforcement learning, the system is not provided with input/output data. Instead, the learning system receives so-called rewards or penalties after every action, and it tries to maximize the outcome of the whole process [10]. A prevalent reinforcement learning system that can play chess, shogi and Go is called AlphaZero [11].

4.2 Classification algorithms

In this section, a description of the algorithms most used in classification is provided. The description focuses on applying the algorithms to the data collected in section 6.1.


4.2.1 K-Nearest Neighbor

The KNN is a non-parametric algorithm used in classification and regression. A non-parametric algorithm does not make a strict assumption about the form of the mapping function. It uses the principle of clustering: it assumes similarity between already labelled data (clustered in groups) and new input data. The KNN is classified as a lazy learning algorithm — it does not learn from the dataset right away. Instead, it stores the data and uses them at the moment of classification. Each variable (column) is a new dimension in which the algorithm finds the K nearest neighbors using a user-chosen distance function. With each dimension, the complexity of the function grows. Apart from the cost of searching the N-dimensional space, where N is the number of variables, the algorithm does not yield good results for heterogeneous data with categorical variables [12].

4.2.2 Support Vector Machine

Support vector machine (SVM) is a classification and regression algorithm. The SVM objective is to find a hyperplane in N-dimensional space, where N is the number of variables. The algorithm classifies data points through the hyperplane, which divides the N-dimensional space into output classes. The SVM is a very effective algorithm when the number of training samples is lower than the number of dimensions. However, the SVM does not scale well, so large datasets are not suitable for this algorithm [13].

4.2.3 Logistic Regression

Logistic regression (LR) is a classification model similar to linear regression that predicts the probability of a binary variable from one or more response variables. LR uses a logistic function1 to predict the probability of an input belonging to the default class. As the probability ranges from 0 to 1, LR can be used as a classification algorithm. Even though LR is very fast, it does not function well in high-dimensional space [14].

1. https://en.wikipedia.org/wiki/Logistic_function


4.2.4 Decision Trees

A decision tree is a classification and regression algorithm. The tree is made out of three components:

1. Nodes — represent the features or variables in the dataset,

2. Branches — represent the decision rules,

3. Leaves — represent the outcome (the final predicted value).

Regression trees are the same as classification trees with the exception of predicting a continuous variable (e.g., price) instead of a discrete one. Like the typical tree data structure, a decision tree starts with a root node, and the nodes are connected through branches down to the last node, a leaf node. The model learns the decision rules from the training data and constructs a decision tree in which the most deciding rules initiate from the root node. Decision trees perform well with large amounts of data, categorical data and heterogeneous data, and they scale well [15]. A small illustration follows.
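To make the structure concrete, here is a minimal sketch using scikit-learn; the toy data and feature names are invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy features: [engine volume (l), vehicle weight (kg)] - invented values
X = [[1.0, 900], [1.6, 1200], [2.0, 1500], [3.0, 1900]]
y = ["7", "7", "3", "3"]  # anonymized producer labels, as in the thesis

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)

# Print the learned decision rules, branching from the root node
print(export_text(tree, feature_names=["EngineVolume", "Weight"]))
print(tree.predict([[1.8, 1400]]))  # classify a new, unseen vehicle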

5 Used Technologies

Many algorithms and technologies were used in this thesis, and they are described in this chapter. As mentioned in section 4.2.4, decision trees are the most suitable algorithm for heterogeneous and categorical data. As the obtained dataset fulfils these constraints, the frameworks chosen for the thesis implementation were CatBoost and LightGBM. Another factor for choosing these two frameworks is their support for the categorical data that the dataset contains.

5.1 CatBoost

CatBoost is a machine learning algorithm that uses gradient boosting on decision trees (GBDT). Gradient boosting is a powerful method that achieves decisive results in a variety of practical tasks. For years, heterogeneous and noisy data have been the primary source of learning problems for this method. Gradient boosting is essentially a process of constructing an ensemble predictor by performing gradient descent in a functional space [6]. As the name CatBoost suggests, this algorithm is mainly built around categorical features. Categorical features take values from a limited set of categories that cannot be meaningfully compared to each other. One popular technique for dealing with categorical features in boosted trees is using dummies. Dummies are binary variables derived from an original variable, one per category. However, this technique is ineffective when it comes to high-cardinality features (e.g., a "user ID" feature), as the sketch below shows.
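As a small illustration (with invented data), pandas can create dummies with get_dummies; note how a high-cardinality column explodes into one binary column per value:

import pandas as pd

# Invented example data
df = pd.DataFrame({"City": ["Brno", "Praha", "Brno"],
                   "UserID": ["u1", "u2", "u3"]})

print(pd.get_dummies(df["City"]))    # two binary columns - manageable
print(pd.get_dummies(df["UserID"]))  # one column per user - does not scale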

5.2 LightGBM

Light Gradient Boosting Machine (LGBM) is an open-source machine learning framework that uses gradient boosting on decision trees. GBDT, in its standard implementation, faces challenges especially with regard to efficiency and accuracy. The most time-consuming part is finding the best split points. The pre-sorting algorithm [16] is very accurate but also time and memory consuming. Another method, the histogram-based algorithm, is more efficient in both memory consumption and training time. LightGBM focuses on performance and scalability by using these two techniques:

1. Gradient-based One-Side Sampling (GOSS) — with the intent of finding the best split value, the GOSS method focuses on sampling instances with higher gradients. A gradient can be described as the slope of a function: a higher gradient means that the model can learn fast from the instance, and if the gradient is close to zero, the model stops learning from it. Keeping mainly higher-gradient instances and randomly dropping lower-gradient instances explains why the GOSS method is effective in terms of memory and time [17].

2. Exclusive Feature Bundling (EFB) — as the time complexity of histogram building is O(#data × #features) [17], the training time can be reduced by bundling features together. In high-dimensional data, many features are mutually exclusive (they never take non-zero values simultaneously), and such features can be bundled. This method reduces the time complexity to O(#data × #bundles).

LGBM also handles categorical features, as the training sketch below illustrates.
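The following sketch shows how a LightGBM classifier could be trained on data like the thesis dataset; the file path is an assumption, and the column names mirror those used later in chapter 6:

import lightgbm as lgb
import pandas as pd

df = pd.read_json("data/contracts_flat.json")  # hypothetical flattened dataset
for col in ["Category", "Model", "City"]:
    df[col] = df[col].astype("category")  # LightGBM needs converted types

X = df.drop(columns=["Producer"])
y = df["Producer"]

clf = lgb.LGBMClassifier(n_estimators=100)
clf.fit(X, y)  # columns with the category dtype are detected automatically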

5.3 REST API

The REST API is used in this thesis to connect the backend application (ML predictions) with the end user (the insurance agent). A Representational State Transfer Application Programming Interface (also known as a REST API or RESTful API) is an API that follows the constraints of the REST architectural style and allows interaction with RESTful web services.

An API is a set of rules that define interactions between multiple software applications. An API defines the kinds of calls or requests, in the proper form, that can be made to access data from an application. REST determines what the API looks like: it is a set of architectural constraints that is followed when the API is created. A web API works in an HTTP request-response context: a client who wants some data from the server sends a request and receives a response from the server, which contains the requested data [18] (a minimal request example follows the list of constraints below). REST defines six architectural constraints that a web service must follow.

1. Uniform interface — which is fundamental to the design of any RESTful system. This criterion helps to simplify overall system architecture and improves the visibility of interactions. It is di- vided into four constraints:

(a) Resource identification in requests, where resources are separated from the representations that are returned to the client.

(b) Self-descriptive messages, where every response contains enough information about how to process the received data.

(c) Hypermedia is available, which means that the client should be able to use hyperlinks to other available resources.

(d) Resources can be manipulated through their representation because it carries enough information to do so.

[18], section 5.1.5.

2. Client-server — separates the user interface concerns from the data storage concerns so that the API can be accessed from multiple platforms via client-server requests and responses over HTTP. It also improves scalability [18], section 5.1.2.

3. Stateless — each request from the client to a server contains only the information necessary to understand and perform the task, and the server does not store any client state between requests. This constraint helps with scalability, reliability and visibility [18], section 5.1.3.

4. Cacheable — data within a response to a request are labelled as cacheable or non-cacheable so that the cached data can be reused for further similar requests [18], section 5.1.4.

5. A layered system — each layer can interact with and see only the immediate layer above or below it. This helps with security and scalability, although it has the downside of performance issues, which can be solved by caching intermediaries [18], section 5.1.6.


6. Code on demand (optional) — servers can send executable code to extend client functionality [18], section 5.1.7.
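As a concrete illustration of the request-response cycle described above, the following Python sketch queries a RESTful service; the endpoint URL and payload fields are hypothetical:

import requests

response = requests.get(
    "http://localhost:5000/predict/producer",  # assumed local endpoint
    json={"City": "Brno", "Power": 85},        # request data as raw JSON
)
print(response.status_code)  # e.g. 200 when the request succeeds
print(response.json())       # the response body, parsed from JSON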

5.4 Docker

Docker was used in this thesis so that the application can be used in every environment, with the advantage of easy deployment and management. Docker is an open-source platform that can run, develop or ship any application. Docker can run an application in an isolated environment called a container. The application is stored in packages that can be deployed in any Linux, Windows or macOS environment. Containers contain everything that the application needs to run, so it does not depend on the host. Using Docker images can save time when shipping, testing and deploying code to production [19]. A sketch of a possible Dockerfile follows.
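The actual Dockerfile ships with the thesis archive (see appendix A); the following is only a sketch of what such a Dockerfile might look like, with the base image and port being assumptions (the copied file names match the appendix listing):

# A sketch of a Dockerfile for the prediction service; base image and
# exposed port are assumptions, not taken from the thesis archive.
FROM python:3.8-slim

WORKDIR /app
COPY requirement.txt .
RUN pip install -r requirement.txt

COPY . .
EXPOSE 5000

CMD ["python", "app.py"]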

5.5 Pandas

Pandas is a Python package that can quickly and intuitively manipulate data through a Series (1-dimensional) or a DataFrame (2-dimensional). It is built on top of NumPy with the goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language. Pandas has support for different kinds of data, such as tabular data (an SQL table or Excel spreadsheet), arbitrary matrix data with named rows and columns, ordered or unordered time series data, and many more. Pandas is suited for handling missing values, inserting and deleting columns and rows, merging, joining, group-by functionality, and much more. It is an open-source project released under the three-clause BSD licence [20].
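A quick illustration of the two core structures and a few of the operations mentioned above, with invented values:

import pandas as pd

s = pd.Series([85, 110, 66], name="Power")               # 1-dimensional
df = pd.DataFrame({"City": ["Brno", "Praha", None],
                   "Power": [85, 110, 66]})              # 2-dimensional

df["HasCity"] = df["City"].notna()         # inserting a new column
df = df.dropna(subset=["City"])            # dealing with missing values
print(df.groupby("City")["Power"].mean())  # group-by functionality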

6 Methodology

There are seven steps in the process of developing and deploying a machine learning model [21]:

1. Data Collection

2. Data Preparation

3. Choose an Algorithm

4. Train the Model

5. Evaluate the Model

6. Parameter Tuning

7. Make Predictions

6.1 Data Collection

The data used to train the machine learning model were imported from the internal systems of the Able company. The data were aggregated and anonymized to avoid disclosure of sensitive personal information such as names or addresses. Data protection is essential: even an accidental leak of personal information can be exploited by third parties for fraud, identity theft or phishing scams. The folder consisted of four anonymized JSON files.

6.2 Data Preparation

The goal of data preparation is to set up a high-quality representation of the collected data. These are the methods I used when cleaning the data for training:

1. Formatting data

2. Resulting file structure

3. Choosing relevant data


4. Flattening data

5. Dealing with missing data

6. Creating new variables

7. Updating current variables

8. Splitting data

6.2.1 Formatting data

The first step after looking at the data was to decide which files are essential to use. Raw JSON files are very hard to read in plain text, so the files had to be formatted using a Python script. JSON supports only two different outermost levels: an object (curly brackets) and an array (square brackets). Files in the dataset consisted of lines with objects; therefore, commas and square brackets had to be added to separate the root objects and put them into an array.
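The appendix lists a json_edit.py script for this purpose. A minimal sketch of the transformation (reading newline-separated JSON objects and writing them back as one JSON array) could look as follows; the exact implementation of the real script may differ:

import json

# Read one JSON object per line (the raw export format)
with open("data/Contracts.jsonAnonymized.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Write all objects back as a single, properly formatted JSON array
with open("data/Contracts.formatted.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)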

6.2.2 Resulting file structure

After formatting, the resulting file structure was the following:

• Contracts.jsonAnonymized.json — contained the records of in- surances with all necessary information for the training model,

• Debtors.jsonAnonymized.json — contained the information about the debt of a customer,

• Interventions.jsonAnonymized.json — contained the interven- tion data in case of a problem with a contract between the insur- ance company and the provider,

• CodeLists.jsonAnonymized.json — contained the coded infor- mation about products appearing in contracts.

6.2.3 Choosing relevant data

After analyzing the individual files, I concluded that the Interventions.json and Debtors.json files were not needed for this thesis. Contracts.json contained 235737 records of contracts, and CodeLists.json consisted of translations between coded products and product names, which was important for the FrontGlass recommendation prediction.

6.2.4 Flattening data

After further analysis, I noticed that some of the columns, such as vehicle details or subjects, were normalized. Data normalization is the process of storing data in different tables to reduce data redundancy and improve data integrity. In this case, however, the data used for training the model must have a separate column for each feature, so the normalized parts had to be flattened back, as sketched below.
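A sketch of this flattening with pandas; the nested record shown is an invented example, not the real contract schema:

import pandas as pd

contracts = [{"id": 1,
              "vehicle": {"Model": "Octavia", "Power": 85},
              "subject": {"City": "Brno"}}]

# Nested dictionaries become separate columns such as "vehicle.Power"
flat = pd.json_normalize(contracts)
print(flat.columns.tolist())
# ['id', 'vehicle.Model', 'vehicle.Power', 'subject.City']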

6.2.5 Dealing with missing data

After flattening and merging the data together, the number of records dropped to 82105 because records without vehicle parameters cannot be used in this data model. Analyzing the data again revealed that 60.7% of the cells were still missing. Features with a high missing percentage (above 80%), as well as features containing information established only after signing the contract, had to be excluded. This process reduced the column count from 115 to 9.

Duplicates can also be counted as part of the missing-data problem because they do not contribute to making the model better. After dropping duplicates, the overall number of records dropped to 53719. The data still had 247 rows with missing values. One way of dealing with missing data is initializing them with the median1, mode2 or mean3. However, in this case, with such a low percentage of missing data, deleting the rows is not going to affect the final model. A sketch of these cleaning steps follows.

1. The middle value in the set of data.
2. The value that appears most often in the set of data.
3. The average value of the set of data.
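A sketch of the cleaning steps described above, where df stands for the flattened contracts DataFrame and the threshold mirrors the 80% rule:

import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop columns with more than 80% missing cells
    df = df.loc[:, df.isna().mean() <= 0.8]
    # Duplicates do not contribute to a better model
    df = df.drop_duplicates()
    # The few remaining rows with missing values can simply be dropped
    return df.dropna()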

6.2.6 Creating new variables

As one of the mandatory features is a model that predicts whether the customer should insure the front glass as an addition, a new variable called FrontGlass was made. Making this new variable consisted of several steps. First, the coded counterparts were found in the corresponding file, where each Producer has its own code list; the result was an array of coded integers for each Producer. The next step was to create a function that goes through an array of coded risks and assigns the value one if at least one code matches and zero otherwise, as the sketch below illustrates.
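A sketch of such a function; the code list below reuses the two codes mentioned in section 7.2 but is otherwise an illustrative placeholder:

# Hypothetical mapping from producer to its front glass risk codes
front_glass_codes = {"7": {1806, 404}}

def has_front_glass(producer, coded_risks):
    # Return 1 if at least one coded risk matches a front glass code
    codes = front_glass_codes.get(producer, set())
    return int(any(risk in codes for risk in coded_risks))

# df["FrontGlass"] = df.apply(
#     lambda row: has_front_glass(row["Producer"], row["CodedRisks"]), axis=1)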

6.2.7 Updating current variables

Data types in the machine learning world can be divided into two groups – categorical and numerical. Numerical data types are represented by real numbers. When it comes to categorical data, computers have no method to compare the values as they are; the solution is to transform them into numbers. CatBoost (as described in section 5.1) has support for categorical features, with algorithms that turn them into numbers. However, the LightGBM framework only supports converted types. That is why the three string-typed variables of the remaining nine had to be categorized using the pandas .astype("category") function.

6.2.8 Splitting data

Another important part of the data preparation step is to split the data into two parts. The first part is the training dataset (around 75%), and the second part is the testing dataset (the remaining data). This step is critical because testing a model on the data that the model learned from would give biased results. The function train_test_split from sklearn.model_selection divides the data into random train and test subsets, as sketched below. As the front glass insurance prediction is an addition to the first model, the dataset for predicting the Producer must not contain the FrontGlass variable. This criterion resulted in deleting the FrontGlass variable from the Producer model dataset.
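A sketch of this step, wrapped in a function so that df stands for the cleaned dataset:

from sklearn.model_selection import train_test_split

def split_for_producer(df):
    # The Producer model must not see the FrontGlass label
    producer_df = df.drop(columns=["FrontGlass"])
    X = producer_df.drop(columns=["Producer"])
    y = producer_df["Producer"]
    # Roughly 75% of the data for training, the remainder for testing
    return train_test_split(X, y, test_size=0.25)

# x_train, x_test, y_train, y_test = split_for_producer(df)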

6.3 Choose an Algorithm

In this step, the correct algorithm was chosen. Based on the analysis of the data structure, the most suitable algorithm was the decision tree; the other algorithms described in section 4.2 do not perform well with heterogeneous data. The most popular algorithms in the gradient boosting decision tree class are CatBoost, LightGBM and XGBoost. As XGBoost does not support categorical variables, the CatBoost and LightGBM models were chosen. The first training cycle was done with CatBoost.

6.4 Train the Model

Training the CatBoost models was divided into two parts. The target value of the first model was Producer, and the other was FrontGlass.

Listing 6.1: Sample of code from the CatBoost classification training process

cat_features = ['Category', 'Model', 'City']

model_producer = CatBoostClassifier(iterations=100, eval_metric="Accuracy", verbose=True)

model_producer.fit(x_train, x_test, cat_features=cat_features, plot=True)

As Listing 6.1 shows, training the CatBoost model with categorical variables requires passing them as a list of column names to the training function. The other parameters of the fit function are the training dataset (x_train) and the respective target labels (here named x_test), together with a plotting flag.

6.5 Evaluate the Model

Classifier evaluation is an essential part of the whole process of making a machine learning application. It ultimately comes down to one question: how well does my classifier work? This question is answered by using evaluation metrics.


The first and most used evaluation metric for a classifier is accuracy, which is the fraction of correctly predicted samples out of all predicted input samples. Accuracy works very well when dealing with balanced datasets; the further from a 50/50 ratio the dataset gets, the more misleading the accuracy is. Before looking at other evaluation metrics, the confusion matrix should be explained. The confusion matrix for binary classification problems is a 2x2 matrix with rows as predicted classes and columns as true classes. There are four possible outcomes of a prediction:

1. True Positive (TP) — a true positive value is predicted as positive.

2. False Positive (FP) — a true negative value is predicted as positive.

3. False Negative (FN) — a true positive value is predicted as negative.

4. True Negative (TN) — a true negative value is predicted as negative.

There are other evaluation metrics that bring more information about the model efficiency or performance [22]:

Precision – the fraction of correctly predicted positive values out of all values predicted as positive:

    Precision = TP / (TP + FP)    (6.1)

Recall – the fraction of correctly predicted positive values out of all truly positive values:

    Recall = TP / (TP + FN)    (6.2)

F-Measure (FM) – the harmonic mean of precision and recall:

    FM = (2 × Precision × Recall) / (Precision + Recall)    (6.3)
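These metrics do not have to be computed by hand; a sketch with scikit-learn and placeholder label vectors:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # placeholder model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # the four outcome counts
print(accuracy_score(y_true, y_pred))   # (TP + TN) / all samples
print(precision_score(y_true, y_pred))  # TP / (TP + FP), equation 6.1
print(recall_score(y_true, y_pred))     # TP / (TP + FN), equation 6.2
print(f1_score(y_true, y_pred))         # harmonic mean, equation 6.3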


After the classification model was trained, I evaluated the model predicting the most fitting producer by its accuracy. The result was 0.9, which means that the model predicted the correct producer in 90% of the test instances. The model was predicting two anonymized values — 7 and 3. There are 53472 contracts in the dataset; 51.34% of all contracts belong to producer "7" and the rest to producer "3". Accuracy is a good evaluation metric in this case because of the even ratio between the two output values.

However, the evaluation metrics used for the second model, which predicts whether the agent should suggest insuring the front glass as an addition, had to be extended. The first reason to avoid using accuracy alone was the percentage of customers who have chosen to insure the front glass, which was approx. 7.36%. After the prediction, the model yielded the results shown in Table 6.1.

Table 6.1: Evaluation results after the first CatBoost model training predicting the "FrontGlass".

TP FP FN TN Precision Recall Accuracy FM
74 57 949 12288 0.564 0.072 0.924 0.128

Although the accuracy of 0.924 is very high, the recall metric and the number of false-negative predictions are fundamental. A false-negative prediction means that the model recommends that an agent not offer front glass insurance, even though, according to the data, it should. As the recall depends on the FN value, the smaller the recall, the more customers will not insure a front glass, resulting in less profit for the company.

6.6 Parameter Tuning

After the evaluation, the front glass prediction model had to be adjusted. CatBoost has very good default parameter settings, so parameter tuning in the prediction functions does not affect the final model in a significant way. As the results in Table 6.2 show, changing the random seed in the training function did not significantly impact the model. However, another part of the training process that can be changed is the input data.

Table 6.2: Evaluation results after changing the random seed used for training.

Random seed TP FP FN TN Precision Recall Accuracy FM
none 74 57 949 12288 0.564 0.072 0.924 0.128
30 68 67 955 12278 0.503 0.066 0.923 0.117
40 65 40 958 12305 0.619 0.063 0.925 0.115
55 73 51 950 12294 0.588 0.071 0.925 0.127
70 70 46 953 12299 0.603 0.068 0.925 0.122
80 59 43 964 12302 0.578 0.057 0.924 0.104

As the number of contracts where the subjects had insured the front glass was only 3939, the input data were constructed differently. Instead of splitting the input data randomly, all the rows with the front glass value of "1" (positive rows) were selected and combined with a specific number of rows with the front glass value of "0" (negative rows); a construction sketch is shown after Table 6.3. The evaluation results after training the model with the new datasets are summarized in Table 6.3. Training the model with a better ratio of positive and negative rows showed that the FM metric changed radically. With an increasing number of negative rows, precision and accuracy increased as well. The final step is for the business manager to decide which configuration should be deployed and used in their system according to their preferences. There are both positive and negative outcomes of increasing the number of negative rows. The positive outcomes are:

1. With the FP dropping, the agents do not have to suggest to customers products they probably do not want to purchase.

2. With the TN increasing, the model correctly predicts customers who are not purchasing the front glass.

However, there are also some negative outcomes:

1. With the FN increasing, the company would lose the opportunity to increase profit by recommending the product.

2. With the TP decreasing, fewer customers will be correctly identified as interested in the product.

Table 6.3: Results after training the model predicting "FrontGlass" on datasets with different numbers of negative rows.

NoNRa TP FP FN TN Precision Recall Accuracy FM
5000 3807 13867 132 35666 0.21 0.96 0.73 0.35
6000 3717 11685 222 37848 0.24 0.94 0.77 0.38
8000 3554 9001 385 40532 0.28 0.90 0.82 0.43
10000 3141 5767 798 43766 0.35 0.79 0.87 0.48
12000 2904 4736 1025 44797 0.38 0.73 0.89 0.50
14000 2674 3565 1265 45968 0.42 0.67 0.91 0.52
16000 2384 2997 1555 46536 0.44 0.6 0.91 0.51
18000 2189 2603 1750 46930 0.45 0.55 0.92 0.50
20000 2023 2213 1916 47320 0.47 0.51 0.92 0.49
22000 2023 2014 1913 47519 0.50 0.51 0.92 0.51
25000 1744 1623 2195 47910 0.51 0.44 0.93 0.47

a. Number of negative rows
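A sketch of how such a resampled training set can be constructed, where df stands for the prepared dataset and n_negative varies per experiment:

import pandas as pd

def build_dataset(df: pd.DataFrame, n_negative: int) -> pd.DataFrame:
    positives = df[df["FrontGlass"] == 1]  # all 3939 positive rows
    negatives = df[df["FrontGlass"] == 0].sample(n_negative, random_state=0)
    # Combine and shuffle so training does not see the classes in blocks
    return pd.concat([positives, negatives]).sample(frac=1)

# train_df = build_dataset(df, n_negative=12000)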

6.7 Make Predictions

The final step in the process of making any machine learning application is packaging and deployment. After selecting the correct model according to the business plan, the app is packaged into a Docker container. The container consists of the Flask application, the two CatBoost models with metadata, a Dockerfile for installation, and the requirements. The Flask application has two methods, one predicting the Producer and one predicting the FrontGlass variable. The GET request body must contain these variables as a raw JSON object — City, PostNumber, Model, EngineVolume, BuildYear, Weight, Power, Category. The FrontGlass variant also needs the Producer on top. After sending a GET request, the response is a JSON object containing the prediction. A sketch of such an endpoint follows.
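The actual app.py ships with the thesis archive (see appendix A); the following is only a sketch of what one of its two endpoints might look like, with the route name being an assumption:

from catboost import CatBoostClassifier
from flask import Flask, jsonify, request

app = Flask(__name__)
model = CatBoostClassifier().load_model("model_producer")

@app.route("/predict/producer", methods=["GET"])  # assumed route name
def predict_producer():
    data = request.get_json()  # City, PostNumber, Model, EngineVolume, ...
    features = [[data["City"], data["PostNumber"], data["Model"],
                 data["EngineVolume"], data["BuildYear"], data["Weight"],
                 data["Power"], data["Category"]]]
    return jsonify({"Producer": str(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)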

7 Research

In this chapter, I compare the results from CatBoost with the second classification algorithm – LightGBM. Moreover, I look at some potential problems in the data structure and possible solutions to make the models more precise.

7.1 Comparing Results

7.1.1 Model Producer

Although LightGBM has support for categorical variables, it differs from CatBoost's: the categorical variables have to be converted to numbers before entering the training function. This conversion can be done with the pandas .astype("category") function.

Table 7.1: Evaluation results from CatBoost and LightGBM classifiers for "Producer".

Framework TP FP FN TN Precision Recall Accuracy FM
CatBoost 5317 929 133 4316 0.851 0.975 0.900 0.909
LightGBM 5209 789 241 4456 0.868 0.955 0.903 0.910

As Table 7.1 shows, the evaluation results from both algorithms are almost the same. Accuracy and F-Measure are nearly equal, and the main difference is in the numbers of false-positive and false-negative occurrences. In this case, a false-positive value represents the scenario where the producer was predicted as "7" but should have been "3". The decision to choose either algorithm again depends on the business people in the company: the CatBoost model is better suited if the company prefers to insure vehicles under producer "7", and LightGBM otherwise.

7.1.2 Model FrontGlass

The FrontGlass model was harder to evaluate because of the low number of positive rows. However, the results were more distinct than for the other model.


Table 7.2: Evaluation results from the LightGBM model predicting "FrontGlass" with different numbers of negative rows.

NoNRa TP FP FN TN Precision Recall Accuracy FM
5000 3820 13634 119 35899 0.21 0.96 0.74 0.35
6000 3720 11840 219 37693 0.23 0.94 0.77 0.38
8000 3498 9300 441 40233 0.27 0.88 0.81 0.41
10000 3069 6438 870 43095 0.32 0.77 0.86 0.45
12000 2731 4913 1208 44620 0.35 0.69 0.88 0.47
14000 2424 3821 1515 45712 0.38 0.61 0.90 0.47
16000 2206 3252 1733 46281 0.40 0.56 0.90 0.46
18000 2050 2743 1889 46790 0.42 0.52 0.91 0.46
20000 1944 2391 1995 47142 0.44 0.49 0.91 0.46
22000 1808 2098 2131 47435 0.46 0.45 0.92 0.46
25000 1629 1767 2310 47766 0.47 0.41 0.92 0.44

a. Number of negative rows

I compare the results from the LightGBM classifier, shown in Table 7.2, with the results from the CatBoost classifier, shown in Table 6.3. As the results show, CatBoost outperforms the LightGBM model in every category apart from the false-positive and false-negative counts. With a lower ratio of positive rows, the difference between LightGBM and CatBoost grows.

7.2 Model Analysis

In this section, potential problems with the model are explained, together with possible changes to achieve better results. The first issue with the model is the anonymization of the birth number. Although the birth number unambiguously identifies a person and could lead to a leak of sensitive personal information, the subject's age would help the model make more precise predictions. The age can decide whether the customer gets a better offer, because drivers without much driving experience often have more expensive insurance.


As insurance companies differ in their products, some companies might have better coverage for less experienced drivers.

The next issue is that two subjects can be associated with the same contract without an indication of who owns the vehicle. As the data were flattened, two subjects with the same contract ID were assigned the exact same vehicle. This can lead to misinterpretation and a model trained on inaccurate data.

Another problem with the data is the missing records of vehicles. Out of 235737 contracts, only 82105 contained information about the vehicle. As the flattened files were merged in a way that avoids missing vehicle parameters, this problem is not very important for low amounts of data. Moving the data without vehicle parameters to a separate table to avoid aggregation could result in faster data preparation and easier deployment.

Although the CodeLists file described the coded risks in the vehicle contract details, the descriptions of the products were vague. For example, the coded product 1806 for Producer "7" was described as Pojištění čelního skla vozidla (vehicle front glass insurance) and the product 404 for the same Producer as pojištění čelního skla vozidla. Without an understanding of the individual coverages and their products, the results can be misguided. I would recommend adding more description to every risk to help understand the coverage of the risks better.

7.3 Potential improvements

As the CatBoost model has excellent default settings, the essential step to make this recommendation system achieve better results is to improve the quality of the input dataset. The problems with the data were discussed in the previous section; resolving them can considerably improve the accuracy of the prediction models. Another improvement of this recommendation system can be made with an increased amount of input data. As the "Producer" model only predicts two output values, a larger dataset with more values could achieve better results.

8 Conclusion

The goal of this thesis was to implement a car insurance recommendation system using a machine learning algorithm that helps insurance agents choose the right insurance company or product for customers. Other goals were to compare the results from two different frameworks and to discuss some potential problems with the data.

In the second chapter, an overview of car insurance, machine learning recommendations, and an already existing similar system is given. The third chapter provides a brief description of insurance with a focus on motor insurance. The fourth chapter provides an overview of machine learning approaches with a focus on supervised learning, along with a breakdown of the most used classification algorithms. Chapter five summarizes the technologies used in the process of implementing the recommendation system. The sixth chapter summarizes the implementation process into the seven steps that need to be done when making any machine learning application. The seventh chapter compares the two machine learning algorithms used to train the system, together with an overview of the discovered problems with the data.

As a result of this thesis, an application was made that recommends an insurance company and additionally predicts whether to recommend front glass insurance. The application is packaged in a Docker container to support deployment on any operating system. As the final product, the system has an API accepting input information and providing the recommendation.

The results of the two compared algorithms – CatBoost and LightGBM – were similar, with a slight edge towards CatBoost. With better accuracy and a significantly better F-Measure score, CatBoost outperformed the LightGBM model mainly in the front glass recommendation part of the system.

As mentioned in section 7.2, some problems with the data were encountered while making the application, and in that section, possible solutions to achieve better results were provided.

Currently, the predictions are only focused on the Producer and the FrontGlass. As an extension, the system could expand the diversity of predicted products. Insurance companies offer complete insurance packs built from individual smaller insurances. After expanding the system to predict a variety of different smaller products, a final model could be created to make a final prediction of a predefined coverage. The system could also be used in another insurance sector, such as real estate insurance.

A An appendix

The file bc_thesis.zip contains the following files and folders:

• model_producer – sample CatBoost model predicting the Producer

• model_frontglass – sample CatBoost model predicting the FrontGlass

• requirement.txt – list of necessary libraries for Docker image

• Dockerfile – text document containing all the instructions needed to assemble an image

• app.py – the REST API python script

• json_edit.py – python script for editing the data/Contracts.jsonAnonymized.json file

• Bachelors_thesis.ipynb – Jupyter Notebook containing the model training and comparison

• README.md – markdown file that contains prerequisites and setup information

• catboost_info – folder containing additional support files for the CatBoost models

• data – data folder with four anonymized JSON files

• .ipynb_checkpoints – metadata for the Jupyter Notebook

Bibliography

1. QUALMAN, Darrin. Happy motoring: Global automobile production 1900 to 2016 [online]. 2017-06-13 [visited on 2021-05-14]. Available from: https://www.darrinqualman.com/global-automobile-production/.
2. Auto Insurance In Different Countries [online]. International Driving Authority, 2021 [visited on 2021-04-16]. Available from: https://www.idaoffice.org/posts/auto-insurance-in-different-countries.
3. What determines the price of an auto insurance policy? [Online]. Insurance Information Institute, 1996/2021 [visited on 2021-04-16]. Available from: https://www.iii.org/article/what-determines-price-my-auto-insurance-policy.
4. Machine Learning [online]. IBM Cloud Education, 2020-07-15 [visited on 2021-05-14]. Available from: https://www.ibm.com/cloud/learn/machine-learning.
5. LESAGE, Laurent; DEACONU, Madalina; LEJAY, Antoine; MEIRA, Jorge Augusto; NICHIL, Geoffrey; STATE, Radu. A Recommendation System For Car Insurance. European Actuarial Journal. 2020. Available from doi: 10.1007/s13385-020-00236-z.
6. DOROGUSH, Anna Veronika; GULIN, Andrey; GUSEV, Gleb; KAZEEV, Nikita; OSTROUMOVA PROKHORENKOVA, Liudmila; VOROBEV, Aleksandr. Fighting biases with dynamic boosting. CoRR. 2017, vol. abs/1706.09516. Available from arXiv: 1706.09516.
7. TUCHINDA, Chanisada; SRIVANNABOON, Sabong; LIM, Henry W. Photoprotection by window glass, automobile glass, and sunglasses [online]. [N.d.] [visited on 2021-05-22]. Available from doi: 10.1016/j.jaad.2005.11.1082.
8. Does Car Insurance Cover Windshield Damage? [Online]. Allstate [visited on 2021-05-22]. Available from: https://www.allstate.com/tr/car-insurance/windshield-damage.aspx.


9. EDWARDS, Gavin. Machine Learning | An Introduction [online]. Towards Data Science, 2018-11-18 [visited on 2021-05-14]. Available from: https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0.
10. ŁAWRYNOWICZ, Agnieszka; TRESP, Volker. Introducing Machine Learning: Machine Learning Basics [online]. [N.d.] [visited on 2021-05-14]. Available from: https://www.researchgate.net/publication/268804320_Introducing_Machine_Learning.
11. SILVER, David; HUBERT, Thomas; SCHRITTWIESER, Julian; HASSABIS, Demis. AlphaZero: Shedding new light on chess, shogi, and Go [online]. DeepMind, 2018-12-06 [visited on 2021-05-14]. Available from: https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go.
12. K-Nearest Neighbor (KNN) Algorithm for Machine Learning [online]. JavaTpoint [visited on 2021-05-14]. Available from: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning.
13. GANDHI, Rohith. Support Vector Machine — Introduction to Machine Learning Algorithms [online]. Towards Data Science, 2018-06-07 [visited on 2021-05-14]. Available from: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47.
14. BROWNLEE, Jason. Logistic Regression for Machine Learning [online]. Machine Learning Mastery, 2020-08-15 [visited on 2021-05-14]. Available from: https://machinelearningmastery.com/logistic-regression-for-machine-learning/.
15. Decision Tree Classification Algorithm [online]. JavaTpoint [visited on 2021-05-14]. Available from: https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm.


16. MEHTA, Manish; AGRAWAL, Rakesh; RISSANEN, Jorma. SLIQ: A Fast Scalable Classifier for Data Mining [online]. [N.d.] [visited on 2021-05-14]. Available from: https://link.springer.com/content/pdf/10.1007%2FBFb0014141.
17. SHARMA, Abhishek. What makes LightGBM lightning fast? [Online]. Towards Data Science, 2018-10-15 [visited on 2021-05-14]. Available from: https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e.
18. FIELDING, Roy Thomas. Architectural Styles and the Design of Network-based Software Architectures [online]. University of California, Irvine, 2000 [visited on 2021-04-20]. Available from: https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.
19. Docker overview [online]. Docker Inc. [visited on 2021-04-20]. Available from: https://docs.docker.com/get-started/overview/.
20. MCKINNEY, W. Package overview [online] [visited on 2021-04-21]. Available from: https://pandas.pydata.org/docs/getting_started/overview.html.
21. MAYO, Matthew. Frameworks for Approaching the Machine Learning Process - KDnuggets [online]. KDnuggets, 2018 [visited on 2021-05-14]. Available from: https://www.kdnuggets.com/2018/05/general-approaches-machine-learning-process.html.
22. HOSSIN, M.; SULAIMAN, M.N. A Review on Evaluation Metrics for Data Classification Evaluations [online]. [N.d.] [visited on 2021-05-14]. Available from: https://www.researchgate.net/publication/275224157_A_Review_on_Evaluation_Metrics_for_Data_Classification_Evaluations.
