DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Modelling default probabilities: The classical vs. machine learning approach

FILIP JOVANOVIC

PAUL SINGH

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Financial Mathematics (30 ECTS credits)
Master's Programme in Industrial Engineering and Management
KTH Royal Institute of Technology, year 2020
Supervisor at Klarna Bank AB: Joel Silverberg
Supervisor at KTH: Boualem Djehiche
Examiner at KTH: Boualem Djehiche

TRITA-SCI-GRU 2020:053 MAT-E 2020:018

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Modelling default probabilities: The classical vs. machine learning approach

Abstract

Fintech companies that offer Buy Now, Pay Later products are heavily dependent on accurate default probability models. This is because the fintech companies bear the risk of customers not fulfilling their obligations. In order to minimize the losses incurred when customers default, several machine learning algorithms can be applied, but in an era in which machine learning is gaining popularity there is a vast number of algorithms to select from. This thesis aims to address this issue by applying three fundamentally different machine learning algorithms in order to find the best algorithm according to a selection of chosen metrics such as ROCAUC and precision-recall AUC. The algorithms that were compared are Logistic Regression, Random Forest and CatBoost. All these algorithms were benchmarked against Klarna's current XGBoost model. The results indicated that the CatBoost model is the optimal one according to the main metric of comparison, the ROCAUC-score. The CatBoost model outperformed the Logistic Regression model by seven percentage points, the Random Forest model by three percentage points and the XGBoost model by one percentage point.

Modellering av fallissemang: Klassisk metod vs. maskininlärning

Sammanfattning

Fintechbolag som erbjuder Köp Nu, Betala Senare-tjänster är starkt beroende av välfungerande fallissemangsmodeller. Detta då dessa fintechbolag bär risken av att kunder inte betalar tillbaka sina krediter. För att minimera förlusterna som uppkommer när en kund inte betalar tillbaka finns flera olika maskininlärningsalgoritmer att applicera, men i dagens explosiva utveckling på maskininlärningsfronten finns det ett stort antal algoritmer att välja mellan. Denna avhandling ämnar att testa tre olika maskininlärningsalgoritmer för att fastställa vilken av dessa som presterar bäst sett till olika prestationsmått såsom ROCAUC och precision-recall AUC. Algoritmerna som jämförs är Logistisk Regression, Random Forest och CatBoost. Samtliga algoritmers prestanda jämförs även med Klarnas nuvarande XGBoost-modell. Resultaten visar på att CatBoost-modellen är den mest optimala sett till det primära prestationsmåttet ROCAUC. CatBoost-modellen var överlägset bättre med sju procentenheter högre ROCAUC än Logistisk Regression, tre procentenheter högre ROCAUC än Random Forest och en procentenhet högre ROCAUC än Klarnas nuvarande XGBoost-modell.

Acknowledgements

We would like to express our deepest appreciation to our KTH supervisor Tatjana Pavlenko for providing us with constructive criticism and valuable advice throughout the thesis. We would also like to extend our deepest gratitude to Joel Silverberg at Klarna for making this thesis possible and offering us relentless support during the process. Many thanks to Anna Hedström and Lukas Kvissberg for helping us retrieve the data used in this thesis, and special thanks to Anna for the encouraging support. Many thanks to Adam Myrén for helping us get onboarded and assisting us with practical tasks. We would also like to thank Edvin Lundström for providing valuable feedback during the peer review. Lastly, we would like to say that it has been a great pleasure working in the US Financing team at Klarna.

Table of contents

1 Introduction
  1.1 Background
  1.2 Purpose and Research Question

2 Theory
  2.1 Binary Classification
  2.2 Logistic Regression
    2.2.1 Regularization
  2.3 Tree-Based Concepts
    2.3.1 Classification Trees
    2.3.2 Bagging
    2.3.3 Boosting
  2.4 Random Forest
  2.5 Categorical Boosting: CatBoost
    2.5.1 Ordered Boosting
    2.5.2 Ordered Target Statistics
  2.6 Feature Selection
    2.6.1 Recursive Feature Elimination: RFE
    2.6.2 Occam's Razor
    2.6.3 LASSO
  2.7 Hyper-Parameter Tuning
  2.8 Model Evaluation
    2.8.1 Confusion Matrix
    2.8.2 Receiver Operating Characteristic
    2.8.3 Precision-Recall Curve
    2.8.4 Area Under the Curve

3 Methodology
  3.1 Data
    3.1.1 Target Variable
    3.1.2 Preprocessing
    3.1.3 Balancing Data
  3.2 Model Selection
    3.2.1 Recursive Feature Elimination: RFE
    3.2.2 Logistic Regression
    3.2.3 CatBoost
    3.2.4 Random Forest

4 Results

5 Discussion
  5.1 Conclusion
    5.1.1 Future Studies

6 References

1 Introduction

1.1 Background

Shopping habits have changed a lot during the last five years and online shopping is, for the first time ever, overtaking retail [1]. The rapid progress of smartphones and the increased availability of high-speed internet have surely played a major part in this, but fintech companies have also made the online shopping experience much easier and more accessible. Buy Now, Pay Later (BNPL) is a service which allows customers to make a purchase and delay payment. This is attractive since it allows many customers to make purchases at any time, independent of their current financial situation. The most well-known Pay Later service in Sweden is the 14-day invoice, but other services are gaining popularity. In the US, more and more customers are choosing to split the purchase over four equal instalments paid every 14 days from the date of order. This gives the customer the opportunity to buy a product or a service which the customer does not currently have sufficient funds for, but will have in the foreseeable future.

To use this service, customers select Klarna during the checkout process. Klarna then makes an instant decision on whether the customer is approved or denied for the chosen payment method. This decision is made using a Probability of Default model (PD model) in which external bureau data and internal data are used. Since Klarna bears the risk of customers not fulfilling their obligations, a better model is of high interest, since it will reduce losses incurred due to customers defaulting on their payments as well as other losses such as fraud.

Financial institutions that offer BNPL without charging interest are not subject to the same regulations as traditional banks. In order to mitigate the losses that can be incurred due to credit risk, financial institutions have developed internal models that assess the default probability of each customer. The internal models can be developed using a vast selection of algorithms, and it is of interest for financial institutions to find the "best" algorithm.

1.2 Purpose and Research Question

The purpose of this thesis project is to compare different algorithms that predict the default probability of a customer. The methods that will be compared in this thesis are Logistic Regression, Random Forest and CatBoost. In order to assess the performance of these methods, they will be benchmarked against Klarna's current model, which uses XGBoost. Metrics such as the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve, together with the precision-recall curve, will be considered when the models are evaluated.

The reasoning behind the selected algorithms is that we want to compare models that implement different techniques. For example, Random Forest is a bagging algorithm and CatBoost is a boosting algorithm. These algorithms are relatively advanced machine learning algorithms which are more difficult to interpret than the more common Logistic Regression model. From an academic perspective it is therefore interesting to compare a somewhat "classical" approach using Logistic Regression to somewhat more novel machine learning algorithms. We have chosen CatBoost among other boosting algorithms because it has been reported to outperform XGBoost [2]. This leads to the following research questions to be studied:

• What are the benefits and drawbacks of using Logistic Regression, Random Forest and CatBoost in order to predict defaults?
• How does the optimal algorithm compare to the current XGBoost model?

2 Theory

In this section we explain the mathematical theory of the algorithms and concepts that will be applied in the thesis.

2.1 Binary Classification

To understand the concept of binary classification we introduce the following equation, which summarizes the general idea of a classification problem in the context of supervised learning:

$$Y = f(X) + \epsilon \qquad (1)$$

where Y ∈ {−1, 1}, X = (X_1, X_2, ..., X_m), E[ε] = 0 and Var(ε) = σ².

The main goal of binary classification is to find a function f̂ that approximates the unknown function f as well as possible. In the context of this thesis the target variable Y_i will be modelled as a binary random variable. The observed values of this variable will be denoted y_i and indicate whether customer i has defaulted on a payment; thus i represents an order placed by a specific customer. If customer i defaults, y_i = 1, otherwise y_i = −1. The random predictor variable X_i ∈ R^m contains information regarding customer i and consists of both qualitative and quantitative variables, e.g. a customer's geographical location, salary, payment history etc. The observed equivalent will be denoted x_i = [x_{i,1}, x_{i,2}, ..., x_{i,m}]. When we have found a function f̂ we call it a classifier, and given an input x_i the classifier gives a prediction ŷ_i. To assess the performance of a classifier we introduce the concept of a loss function, L(y_i, ŷ_i), which compares the prediction ŷ_i to the actual value y_i. When the classifier is trained the goal is to minimize the average loss over all observations in the training set, which thus includes minimizing the errors from the loss function. The training set is an observed set of target variables and their corresponding predictors, which we denote by D = {(x_i, y_i)}_{i=1}^N. The irreducible error, ε, is independent of the fitted function f̂. This means that even if one finds a perfect estimate where f = f̂, there will still be an error in the predictions [3]. An example of what the quantity ε may contain is variation that is not possible to measure, e.g. the risk of a default may vary for a given customer on a given day due to the customer's mood.

2.2 Logistic Regression

Logistic regression is a binary classification method which gained popularity in the 1970s as an alternative to Fisher's Linear Discriminant Analysis (LDA), roughly 50 years after being rediscovered in the 1920s [4]. The first signs of Logistic Regression trace all the way back to the early 19th century, when Pierre-François Verhulst published a paper in which he fitted curves to model a country's population growth but did not explain how they were fitted. A couple of years later, Verhulst published a more detailed paper on the subject and introduced the term "logistic" for the first time [4].

Logistic Regression has been the most popular method for modelling probability of default until recently, when more advanced machine learning algorithms surpassed the speed and accuracy of Logistic Regression in many applications [5]. But since the financial market is highly regulated, non-linear modelling for traditional credit issuing is often not allowed. Therefore, financial institutions have to rely on linear models, like Logistic Regression, which is why the method won't cease to exist in the near future.

In general, Logistic Regression tends to perform well on smaller datasets and when the number of explanatory variables is larger than or equal to the number of noise variables [6]. Also, Logistic Regression captures linear relationships in the data well. For larger and more complex datasets where nonlinear relationships exist, other machine learning methods might be preferable.

The general linear relationship between the log-odds and the predictor variables can be written in the following form [7]:

$$\begin{aligned}
\log \frac{P(G = 1 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{1,0} + \beta_1^T x \\
\log \frac{P(G = 2 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{2,0} + \beta_2^T x \\
&\;\;\vdots \\
\log \frac{P(G = K-1 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{(K-1),0} + \beta_{K-1}^T x
\end{aligned} \qquad (2)$$

where G is the response variable, K is the number of classes, β_{i,0} is the intercept, β_i is the coefficient vector, X ∈ R^m denotes a real-valued random input vector with m variables and x is the observed vector of m predictor variables. In our case we have binary responses with outcomes:

• 1: default
• 0: not default

instead of 1 and -1 as in the general case presented in the previous section.

The number of classes K is therefore two. With m explanatory variables x_i, the linear relationship (also known as the log-odds, or the logarithm of the odds) can be expressed as the linear function l:

$$l(\beta) = \mathrm{logit}(p_i) = \log\frac{p_i}{1-p_i} = \beta_0 + \beta_1^T x_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_m x_{i,m} \qquad (3)$$

where p_i is the probability of customer i defaulting and x_i is the input vector for customer i. With simple algebraic manipulation, we get that the probability is given by:

$$P(Y_i = 1 \mid X_i = x_i) = \frac{1}{1 + \exp\!\big(-(\beta_0 + \beta_1^T x_i)\big)} \qquad (4)$$

where Y_i = 1 corresponds to the outcome for customer i being default. Thus, we can use the probability to classify our prediction. The class prediction is defined as:

$$\hat{Y}_i = \begin{cases} 1, & \text{if } P(Y_i = 1 \mid X_i = x_i) \geq c \\ 0, & \text{if } P(Y_i = 1 \mid X_i = x_i) < c \end{cases} \qquad (5)$$

where c ∈ (0, 1) is the threshold value of the decision boundary, meaning that if the predicted probability is above or equal to the predefined threshold c, the observation is classified as a default.
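To make Eqs. (4)-(5) concrete, the following minimal Python sketch computes the default probability from a fitted coefficient vector and applies the threshold c. The coefficient and feature values are made-up placeholders, not numbers from the thesis.

```python
import numpy as np

def predict_default(beta0, beta, x, c=0.5):
    """Return (probability, class) for one customer, following Eqs. (4)-(5).

    beta0 : fitted intercept, beta : fitted coefficient vector,
    x : observed feature vector x_i, c : decision threshold.
    """
    log_odds = beta0 + beta @ x                  # linear predictor, Eq. (3)
    p_default = 1.0 / (1.0 + np.exp(-log_odds))  # Eq. (4)
    y_hat = 1 if p_default >= c else 0           # Eq. (5)
    return p_default, y_hat

# Illustrative call with made-up values
p, y = predict_default(beta0=-2.0, beta=np.array([0.8, -0.3]), x=np.array([1.5, 2.0]))
```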

In order to fit the model, maximum likelihood is often applied by using the conditional likelihood of Y given X [7]. The log-likelihood for N observations in the two-class case is defined as the function, L:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i \log p(x_i;\beta) + (1-y_i)\log\big(1 - p(x_i;\beta)\big) \Big\} = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} \qquad (6)$$

where β = {β_0, β_1} and the conditional probability p(x_i; β) = P(Y_i = 1 | X_i = x_i; β). The function L represents the negative log-loss function (i.e. the log-likelihood) and is optimized w.r.t. β by maximization.

In order to maximize the negative log-loss function, we take the derivatives and set them to zero:

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i\big(y_i - p(x_i;\beta)\big) = 0 \qquad (7)$$

This generates m + 1 equations that are nonlinear in β, which can be solved using the Newton-Raphson method. For this, the second-order derivatives are calculated:

$$\frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i;\beta)\big(1 - p(x_i;\beta)\big) \qquad (8)$$

Starting at the current β, a single iteration is given by:

$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left( \frac{\partial^2 L(\beta^{\mathrm{old}})}{\partial \beta^{\mathrm{old}}\, \partial (\beta^{\mathrm{old}})^T} \right)^{-1} \frac{\partial L(\beta^{\mathrm{old}})}{\partial \beta^{\mathrm{old}}} \qquad (9)$$

The iterations stop when the stopping criterion is reached, i.e. when the duality gap is less than the predefined tolerance. If the algorithm does not converge, the iterations stop when the predefined maximum number of iterations is reached.
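The Newton-Raphson updates (7)-(9) can be written compactly in a few lines of Python. The sketch below is illustrative only; library solvers add regularization, step-size control and more careful stopping rules.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Fit an unregularized logistic regression with Newton-Raphson (Eqs. (7)-(9)).

    X : (N, m) matrix of predictors, y : (N,) array of 0/1 targets.
    An intercept column is prepended so beta has m + 1 entries.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))    # p(x_i; beta)
        grad = Xb.T @ (y - p)                   # Eq. (7)
        W = p * (1.0 - p)
        hess = -(Xb * W[:, None]).T @ Xb        # Eq. (8)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                      # Eq. (9)
        if np.linalg.norm(step) < tol:          # simple convergence check
            break
    return beta
```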

2.2.1 Regularization

A potential problem with Logistic Regression is overfitting. An overfitted model fails to predict the outcome on data other than the data the model was trained on, due to a lack of generalization, and this leads to a decrease in model performance [8]. The general idea behind regularization is to give noise variables little or no weight in the model. There are several methods that deal with this problem; two of them are L1-regularization (also known as LASSO) and L2-regularization (also known as Ridge regression). Also, a linear combination of the two, the elastic net, can be used to draw benefits from both methods. Regularization reduces the variance of the model, and by reducing the variance, the degree of overfitting decreases as well.

L1-Regularization

An important feature of LASSO is the ability to both shrink and remove variables implying that it can also be used for variable selection. The regularization penalizes the model by decreasing the size of the coefficients, and this is done by adding the following term to the negative log-loss function (6):

$$-\lambda_{L_1} \sum_{i=1}^{m} |\beta_i| \qquad (10)$$

where λ_{L1} ≥ 0 is a shrinkage parameter and m is the number of coefficients. A larger value of λ_{L1} corresponds to a greater amount of shrinkage [7]. The procedure for finding the optimal shrinkage parameter is presented in section 2.7 and is further analysed in section 3.2.2. The negative log-loss function is now defined as follows:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda_{L_1}\sum_{i=1}^{m} |\beta_i| \qquad (11)$$

Let β̃ denote the current estimates of the coefficients, and let p̃(x_i) = p(x_i; β̃) and w_i = p̃(x_i)(1 − p̃(x_i)). Now we form the second-order Taylor expansion about the current estimates and obtain the following quadratic objective function [9]:

$$L_Q(\beta) = -\frac{1}{2}\sum_{i=1}^{N} w_i\big(z_i - \beta_0 - \beta_1^T x_i\big)^2 + C(\tilde{\beta})^2 - \lambda_{L_1}\sum_{i=1}^{m}|\beta_i| \qquad (12)$$

where z_i = β_0 + β_1^T x_i + (y_i − p̃(x_i)) / (p̃(x_i)(1 − p̃(x_i))) is the current working response [9]. The next update of the coefficients using Newton's method is obtained by maximizing L_Q(β):

$$\max_{\beta \in \mathbb{R}^{m+1}} \left\{ -\frac{1}{2}\sum_{i=1}^{N} w_i\big(z_i - \beta_0 - \beta_1^T x_i\big)^2 + C(\tilde{\beta})^2 - \lambda_{L_1}\sum_{i=1}^{m}|\beta_i| \right\} \qquad (13)$$

This optimization problem is a penalized weighted-least-squares problem and can be solved using the generalized Newton algorithm. Overall, the procedure consists of a sequence of nested loops, as follows [9]:

outer loop: decrement λ_{L1}.
middle loop: update L_Q(β) using the current coefficients β̃.
inner loop: run the generalized Newton algorithm on the optimization problem (13).

The iterations stop when it reaches the stopping criterion, i.e. when the duality gap is less than the predefined tolerance. If the algorithm does not converge, the iterations stop when the predefined maximum number of iterations is reached.
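In practice the L1-penalized model does not have to be implemented from scratch. A hedged scikit-learn sketch is shown below; the variables X_train and y_train are placeholders for a preprocessed training set, and C is the inverse of the shrinkage strength (C = 1/λ), so a smaller C means more shrinkage.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1-penalized (LASSO) logistic regression; the "saga" solver supports the L1 penalty.
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000),
)
l1_model.fit(X_train, y_train)   # X_train, y_train: placeholder training data

# Coefficients shrunk exactly to zero can be dropped, giving a variable selection
coefs = l1_model.named_steps["logisticregression"].coef_.ravel()
selected = [i for i, b in enumerate(coefs) if b != 0.0]
```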

Since the dataset that will be used in this thesis is relatively large we are interested in performing a dimensionality reduction. Therefore the L1-regularization will be used in this thesis. However, we

present an overview of the other regularization techniques below.

L2-Regularization

Ridge regression reduces the variance by penalizing the sum of the squared coefficients instead of the sum of their absolute values, as in LASSO. This comes at the price of bias, which has to be considered during implementation. The following term is added to the negative log-loss function (6):

$$-\lambda_{L_2} \sum_{i=1}^{m} \beta_i^2 \qquad (14)$$

where λ_{L2} ≥ 0 is, as before, the shrinkage parameter but for the L2-regularization [7]. The coefficients are shrunk toward, but never reach, zero. Thus, we get the following negative log-loss function:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda_{L_2}\sum_{i=1}^{m} \beta_i^2 \qquad (15)$$

The procedure of optimizing the negative log-loss function with the added L2-regularization term is the same as presented in the case of L1-regularization above.

Elastic Net

The elastic net is, as mentioned before, a linear combination of the two regularizations, giving the model the benefit of both decreased variance and variable selection. The following term is added to the negative log-loss function (6):

$$-\lambda \sum_{i=1}^{m} \big( \alpha\beta_i^2 + (1-\alpha)|\beta_i| \big) \qquad (16)$$

where α ∈ [0, 1] is the weight. A larger α corresponds to more weight being given to the L2-regularization [7]. Thus, the new negative log-loss function is as follows:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda\sum_{i=1}^{m} \big( \alpha\beta_i^2 + (1-\alpha)|\beta_i| \big) \qquad (17)$$

The procedure for optimizing the negative log-loss function with the added elastic net term is the same as presented in the case of L1-regularization above.

2.3 Tree-Based Concepts

In order to explain the Random Forest and CatBoost algorithms we will start by explaining essential concepts such as classification trees, bagging and boosting.

2.3.1 Classification Trees

The basic idea behind a classification tree is to stratify the feature space into J non-overlapping regions, R_1, R_2, ..., R_J [7]. If we only have two features then the feature space is easy to visualize. In figure (1) below, the feature space spanned by the features X_1 and X_2 has been stratified into five non-overlapping regions. The stratified region can also be expressed as a decision tree, as in figure (2), which is very useful for interpreting a tree-based model even when the feature space is too large to visualize [7]. Whenever a new observation (x_1, x_2) is to be classified, we check which region this observation belongs to. Then we assign this observation the class that is the most commonly occurring in that particular region [7]. Using mathematical rigour we introduce the following expression:

$$\hat{p}_{j,m} = \frac{1}{N_j} \sum_{x_i \in R_j} I\{y_i = m\} \qquad (18)$$

Figure 1: Non-overlapping regions [7]
Figure 2: Classification tree linked to the regions [7]

Here N_j denotes the number of observations in region R_j. Expression (18) gives us the proportion of the observations in region R_j that belong to class m. So for each region R_j we maximize p̂_{j,m} w.r.t. m. Thus the class in region R_j will be assigned according to the following equation:

$$\mathrm{class}(R_j) = \arg\max_{m} \hat{p}_{j,m} \qquad (19)$$

In order to evaluate the nodes for splitting, a criterion is needed. This criterion measures the node impurity, and the most commonly used node impurity measures are misclassification error,

8 Gini-index and deviance. In this thesis we will use the Gini-index since it is better suited for numerical optimization due to being differentiable [7]. We define the Gini-index, G, in the following equation:

$$G = \sum_{m=1}^{K} \hat{p}_{j,m}(1 - \hat{p}_{j,m}) = \{K = 2,\ \text{i.e. only two classes}\} = 2\hat{p}_{j,2}(1 - \hat{p}_{j,2}) \qquad (20)$$

A classification tree is easily interpretable, as can be seen from figure (2), but unfortunately classification trees run the risk of being overfitted, which leads to high variance. One remedy for this is to prune the trees, but we will focus on bagging and boosting strategies in order to decrease the variance of the classifiers.
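The two-class Gini-index of Eq. (20) is straightforward to compute for a candidate region. A minimal sketch, assuming the class labels in the region are coded 0/1:

```python
import numpy as np

def gini_index(y_region):
    """Gini impurity of Eq. (20) for a binary region with labels coded 0/1."""
    if len(y_region) == 0:
        return 0.0
    p_hat = np.mean(y_region)            # proportion of class 1 in the region
    return 2.0 * p_hat * (1.0 - p_hat)   # the K = 2 special case

print(gini_index(np.array([1, 1, 1, 1])))  # pure region -> 0.0
print(gini_index(np.array([0, 1, 0, 1])))  # 50/50 region -> 0.5 (maximum)
```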

2.3.2 Bagging

Bootstrap aggregating (bagging) is a method which can be used to decrease the variance of a classifier. Bagging accomplishes a variance reduction by using multiple trees built on different training sets and using the majority vote for the final classification. More formally, we use B bootstrapped sets from the training data. On each of these bootstrapped sets we build a classifier, and thus get B classifiers in total, f̂_1(x), f̂_2(x), ..., f̂_B(x). To make the final prediction we simply use a majority vote of all the classifiers, which gives the most commonly occurring class amongst all B predictions [3]. Note that the results of implementing bagging can vary a lot depending on feature characteristics. If some features are more prominent than others it can lead to classification trees that are correlated, and thus the concept of Wisdom of Crowds fails to apply, since it relies on the "crowd" being uncorrelated. This problem is solved by the Random Forest algorithm, which we present in section 2.4. When bagging is applied each classifier is trained using only the corresponding bootstrap sample; thus we can use the remaining data, the Out-Of-Bag (OOB) observations, to estimate the OOB error, which is a valid estimator of the test error [3].
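As an illustration of bagging with an OOB error estimate, the sketch below bags B = 100 classification trees with scikit-learn; X_train and y_train are placeholders for a training set.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is fitted on its own bootstrap sample; the observations left out of a
# sample (the OOB observations) are used to estimate the test error.
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           bootstrap=True, oob_score=True, random_state=0)
bagger.fit(X_train, y_train)
print("OOB accuracy estimate:", bagger.oob_score_)
```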

2.3.3 Boosting

Boosting methods are similar to bagging in the sense that multiple learners/trees are used to make a final prediction. But instead of growing the trees independently of each other, as in bagging, the boosting method grows the trees sequentially in order to learn from the previously built trees. For this the method uses M weak learners h_m(x; a_m), i.e. learners that are only slightly better than flipping a fair coin when making a classification. Note that learner m is characterized by the parameters a_m. In the case where the base learner is a classification tree we can formally write h(x) = Σ_{j=1}^{J} b_j · I{x ∈ R_j}, where b_j represents the majority class in region R_j [7]. These weak learners are then combined into a "committee" which has low bias due to the weak learners and low variance due to including many weak learners. More formally, we are trying to minimize the following expression when we use boosting:

$$\min_{F} \sum_{i=1}^{N} L\big(y_i, F(x_i)\big) \qquad (21)$$

Here F(x) follows the form of a basis expansion, i.e. F(x) = Σ_{m=1}^{M} β_m h(x; a_m), where β_m is the expansion coefficient for the basis function h(x; a_m). Using this we can rewrite equation (21) as follows:

$$\min_{\{\beta_m, a_m\}_{m=1}^{M}} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M}\beta_m h(x_i; a_m)\Big) \qquad (22)$$

The above optimization problem requires intensive numerical optimization techniques; therefore we instead approximate the solution to equation (22) by forward stagewise additive modelling. This essentially means that, in each iteration m, we find the optimal basis function h(x; a_m) and the corresponding β_m, and then set F_m(x) = F_{m-1}(x) + β_m h_m(x; a_m). We do not go back and modify previously added terms [7]. This means that we use the two-step process described below, which in machine learning is formally referred to as boosting [10].

1. For m = 1, ..., M:
$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a)\big)$$

2. Set: F_m(x) = F_{m-1}(x) + β_m h_m(x; a_m)

Here the function β_m h_m(x; a_m) can be seen as the best step towards the data-based estimate of the true function F(x). The only constraint we put on this step is that the base learner h_m(x; a_m) is a classification tree.

Steepest Descent

In this section we discuss how to practically perform the two steps presented in the previous section, which we formally referred to as boosting. To solve the minimization problem in step one we can use steepest descent. To do this we find the data-based negative gradient −g_m = {−g_m(x_i)}_{i=1}^N, where

$$g_m(x_i) = \left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \qquad (23)$$

which gives the best steepest descent direction from the current point F_{m-1}(x_i). But since this gradient is data-based we only have values for it at certain points {x_i}_{i=1}^N. Also note that this gradient is unconstrained and might therefore be difficult to replicate using a classification tree, which is our predetermined base learner. Thus we choose the best step, β_m h_m(x_i; a_m), such that h_m = {h(x_i; a_m)}_{i=1}^N is as parallel as possible to −g_m. Another way to see this is that we are trying to find the h(x; a), in our case a classification tree, that has the strongest correlation to the negative steepest descent direction −g_m(x) over the data distribution [10]. Formally this means that we solve for a_m such that:

$$a_m = \arg\min_{a,\beta} \sum_{i=1}^{N} \big[-g_m(x_i) - \beta\, h(x_i; a)\big]^2 \qquad (24)$$

This will give us the direction of our descent step. In order to find the size of our step ρm, we perform a ”line search” which is given by the expression below:

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big) \qquad (25)$$

Mimicking step two in the previous section, we update our approximation, which has now been influenced by the unconstrained data-based gradient: F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m).

Summing up all these steps, we arrive at the following algorithm, which describes gradient boosting [10]:

Algorithm 1: Gradient Boosting

Result: F_M(x)
F_0(x) ← arg min_ρ Σ_{i=1}^N L(y_i, ρ)
for m ← 1 : M do
    ỹ_i ← −[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x)=F_{m−1}(x)},  i = 1 : N
    a_m ← arg min_{a,β} Σ_{i=1}^N [ỹ_i − β h(x_i; a)]²
    ρ_m ← arg min_ρ Σ_{i=1}^N L(y_i, F_{m−1}(x_i) + ρ h(x_i; a_m))
    F_m(x) ← F_{m−1}(x) + ρ_m h(x; a_m)
end
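A stripped-down version of Algorithm 1 for the binary log-loss is sketched below. It replaces the line search for ρ_m with a fixed shrinkage factor, which is how most library implementations behave, and assumes the targets are coded 0/1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Simplified gradient boosting for binary log-loss (targets coded 0/1)."""
    # F_0: constant log-odds of the base rate
    F = np.full(len(y), np.log(y.mean() / (1.0 - y.mean())))
    trees = []
    for _ in range(M):
        p = 1.0 / (1.0 + np.exp(-F))
        residuals = y - p                        # negative gradient of the log-loss, Eq. (23)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # weak learner h_m fitted to -g_m, Eq. (24)
        F += learning_rate * tree.predict(X)     # F_m = F_{m-1} + shrunken step
        trees.append(tree)
    return F, trees
```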

2.4 Random Forest

As mentioned earlier in the report, a Random Forest uses bagging but also addresses the issues that come with applying bagging when the trees are correlated. The Random Forest algorithm de-correlates the trees by actively restricting the number of variables that are considered in each split when a tree is built. Thus, each time a split is made, a randomly generated subset of the full set of predictors is considered. The reasoning behind this is that if we have a strong predictor, then this predictor will most likely be the first split in all of the trees, thus leading to correlated trees [3]. Formally, a Random Forest classifier with B classification trees can be defined as follows:

$$\hat{f}_{RF}^{B}(x) = \mathrm{sign}\left[ \sum_{m=1}^{B} T(x; \Theta_m) \right] \qquad (26)$$

Here T(x; Θ_m) denotes a classification tree, where Θ_m denotes the parameters of the m:th tree in the forest. These parameters have to be chosen by the analyst before the model is trained, and we present the parameters that we will tune below:

• n_estimators: the number of trees in the forest.
• min_samples_split: the minimum number of observations required in a node in order to split it.
• max_depth: the maximum depth of a tree. If none is specified, the tree is grown until min_samples_split is reached for every leaf.
• min_samples_leaf: the minimum number of observations required in a node for it to be a leaf. If a split does not lead to at least min_samples_leaf observations in each daughter node, the split is not made.
• max_features: the number of features considered when finding the "optimal" split.

Just like in bagging, the final prediction is made using a majority vote. Below we describe how a Random Forest is built once the above parameters have been chosen; choosing them can typically be done by implementing a grid search.

Creating a Random Forest

1. Create a bootstrap sample B* of size N from the training set.
   • Create a classification tree using the bootstrap sample. We repeat the steps below until the minimum node size n_min is reached; then we stop the splitting process.
     (a) Randomly select p features from all the m features.
     (b) Use the Gini-index to find the best splitting feature among the p selected in the previous step.
     (c) Split the data into two daughter nodes using the selected feature.
     (d) Restart from step (a) in each daughter node from the previous step.
2. Repeat the above process B times to get B classification trees, {T_i(x; Θ_i)}_{i=1}^B, which constitute the Random Forest.

The final classifier is then given by equation (26).

To create a Balanced Random Forest, we draw a random sample from the minority class, then draw an equally sized random sample from the majority class, and let these constitute our bootstrap sample B* in step one [11]. The rest of the steps are the same. This ensures that each tree is built using a balanced dataset.
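For reference, a Random Forest with the hyper-parameters listed above can be instantiated in scikit-learn as in the sketch below. The parameter values are placeholders rather than the tuned values used in the thesis, and X_train_bal, y_train_bal stand for a balanced (e.g. undersampled) training set.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # number of trees B in the forest
    max_depth=None,          # grow until min_samples_split is reached
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",     # number of features p considered at each split
    random_state=0,
)
rf.fit(X_train_bal, y_train_bal)
```

The imbalanced-learn package also provides a BalancedRandomForestClassifier with a similar interface, which draws the per-tree balanced bootstrap samples described above.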

2.5 Categorical Boosting: CatBoost

The CatBoost algorithm aims to address the target leakage that arises from encoding categorical variables using target statistics, but also from the boosting steps in the gradient boosting algorithm. Essentially, target leakage arises when we encode categorical variables using their respective targets to calculate the mean, median or some other numerical quantity involving the target. CatBoost implements ordered target encoding in order to prevent the target leakage that arises from general target encoding. To prevent target leakage from the boosting step, the algorithm implements a modification of general gradient boosting referred to as ordered boosting, which we explain in the next section. This is important because target leakage leads to biased gradient estimates, which affect the learning task negatively [12]. This stems from the fact that when the gradient is calculated for a point x_i in equation (23), it is based on a model F which has been built using that particular point and its corresponding target. This leads to residuals having smaller absolute values on the training data compared to the test data [12].

2.5.1 Ordered Boosting

We let D = {(x_i, y_i)}_{i=1}^N denote our training set. The idea is to use a random permutation of this dataset, i.e. we shuffle the rows randomly, and build N − 1 different models F. More specifically, model F_i denotes a model built on the first i observations of the shuffled dataset. Thus, when the gradient g(x_j) is calculated, the model F_{j−1} is used. Essentially this means that

12 for each boosting step I we need to build N −1 different models. The algorithm for this is given below:

Algorithm 2: Ordered Boosting

Result: F_1, F_2, ..., F_N
F_i ← 0 for i = 1 : N
for iter ← 1 : I do
    for i ← 1 : N do
        for j ← 1 : i − 1 do
            g_j ← (d/da) Loss(y_j, a) |_{a = F_{j−1}(x_j)}
        end
        F ← LearnOneTree((x_j, g_j) for j = 1 : i − 1)
        F_i ← F_i + F
    end
end

This is computationally inefficient and therefore CatBoost implements a modification of the described algorithm in order to speed up computations but the underlying concept is the same.
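In practice the CatBoost library handles both ordered boosting and the categorical encoding internally. A hedged usage sketch is given below; the parameter values are placeholders, and categorical_indices is a placeholder list of the categorical column indices.

```python
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    iterations=1000,           # number of boosting steps
    depth=6,
    learning_rate=0.1,
    loss_function="Logloss",
    eval_metric="AUC",
    verbose=False,
)
# cat_features tells CatBoost which columns to encode with ordered target statistics
cat_model.fit(X_train, y_train, cat_features=categorical_indices)
```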

2.5.2 Ordered Target Statistics

The CatBoost algorithm converts categorical features to numerical ones using a technique referred to as ordered target statistics. We start by explaining general target statistics to give the reader an intuition of the concept. Assume that we have a categorical feature k and that the values of this feature are given by x_{i,k}, where i denotes the observation. Using target statistics, our goal is to replace x_{i,k} for every i, i.e. for all rows, with a numerical value, x̂_{i,k}, which is based on the target for that observation. A greedy approach can be applied in which x̂_{i,k} is set to the average value of the targets y, taken over all training examples which have the same value as x_{i,k}. Introducing a prior p, a parameter a and some mathematical notation, we get:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} I\{x_{j,k} = x_{i,k}\} \cdot y_j + ap}{\sum_{j=1}^{n} I\{x_{j,k} = x_{i,k}\} + a} \qquad (27)$$

But since this is a greedy approach it leads to target leakage, since x̂_{i,k} is computed using y_i [13]. To prevent this, ordered target statistics aims NOT to use y_i when x̂_{i,k} is calculated. This is done by introducing an "artificial" time to the dataset in the form of a random permutation, which is practically achieved by randomly shuffling the rows of the dataset. Then, for each example, the target statistics rely only on the preceding observations. In the figure below we can see an example of this, where an arbitrary categorical feature, "Occupation", and its values are depicted. To get the target statistics for the fifth row, which has the value "Manager", we only use the historical data illustrated by the pink rows. This is then repeated for every observation, and it can be noted that observations located at the top after the permutation will have a higher variance since their available history is small. To prevent this, multiple random permutations are used [13]. Thus, after the ordered target statistics have been calculated, the depicted values that the feature "Occupation" takes are replaced by numerical values.

Figure 3: Ordered Target Statistics [14]
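A toy implementation of the idea (one permutation only, whereas CatBoost averages over several) is sketched below; the column names are placeholders.

```python
import numpy as np
import pandas as pd

def ordered_target_statistics(df, cat_col, target_col, prior=0.5, a=1.0, seed=0):
    """Encode one categorical column with Eq. (27) restricted to preceding rows."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    encoded = np.empty(len(shuffled))
    for i in range(len(shuffled)):
        history = shuffled.iloc[:i]                                   # "artificial time": rows before i
        same = history[history[cat_col] == shuffled.loc[i, cat_col]]  # same category, earlier rows only
        encoded[i] = (same[target_col].sum() + a * prior) / (len(same) + a)
    shuffled[cat_col + "_ots"] = encoded
    return shuffled
```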

2.6 Feature Selection

2.6.1 Recursive Feature Elimination: RFE

To decide how many variables to use in each model, Recursive Feature Elimination (RFE) with built-in cross-validation can be applied. The RFE algorithm starts by creating five folds which will be used to cross-validate the score. The next step is to fit a user-specified classifier, e.g. a Random Forest, to all available features. Then the user-defined score is calculated by applying cross-validation. After this, the feature with the lowest feature importance is removed and a classifier is fitted using the remaining variables. This process is repeated until the number of features left reaches the lower limit defined by the user. The number of variables that yielded the highest validation score is then chosen as the optimal one [15]. Thus the concept behind RFE selection is relatively simple but computationally intensive. More details regarding the user-defined parameters are given in the methodology section.
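With scikit-learn this corresponds to the RFECV estimator; the sketch below shows the idea with a Random Forest as the underlying classifier, ROC AUC as the score and 40 as the lower limit (the settings described in the methodology section). X_train and y_train are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=1,                       # remove one feature per iteration
    cv=5,                         # five-fold cross-validation of the score
    scoring="roc_auc",
    min_features_to_select=40,    # user-defined lower limit
)
selector.fit(X_train, y_train)
print("Optimal number of features:", selector.n_features_)
```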

2.6.2 Occam's Razor

Another concept that will be applied when selecting the features is Occam's Razor, also known as the Law of Parsimony. This concept simply states that if all else is held equal, the simplest solution should be chosen. In our setting this implies that if the number of features can be reduced without sacrificing performance, one should always opt for this. So even if the RFE implies that the best model uses p features, it may still be possible to achieve the same performance with less complexity.

2.6.3 LASSO

For the Logistic Regression model there exists another feature selection method, which utilizes the L1-penalty described previously in the theory section. Since this penalty has the ability to shrink the coefficients to zero, one can utilize this and keep only the variables that have non-zero

coefficients. But to apply this one needs to pick the λL1 that yields the optimal model in terms

of ROCAUC-score. In the plot below, the profiles of the LASSO coefficients are presented for an arbitrary example to illustrate the idea.

Figure 4: LASSO Path [16]

In the figure above each colored line represents a feature and the y-axis represents its coefficient. We can see how the coefficients of the features are affected by the shrinkage parameter C, which is given on the x-axis. The parameter C has an inverse relation to λ_{L1}, which means that when C increases, the penalty λ_{L1} exerts less effect on the coefficients. This is depicted in the figure by more variables being added into the model, illustrated by the vertical dotted lines, as C increases.

In order to utilize this feature selection method one needs to optimize C according to some predefined metric. This approach will further be discussed in the methodology section.

2.7 Hyper-Parameter Tuning

All the models have parameters that need to be decided. One can ignore this and simply use the default parameters, but then one runs the risk of inferior model performance. For example, these parameters can be the depth of the trees used in the CatBoost and Random Forest models, or the shrinkage parameter λ in the Logistic Regression. In order to tune these parameters one can implement a grid search or a randomized search, which we present below. Some of the parameters will be discarded from the grid since there are more efficient ways to determine their value. This is further presented in the methodology section.

Grid Search/Randomized Search

A grid search is a relatively simple concept in which a predefined grid of parameters are used to build several different models. For example, if we have three different numerical parameters that we

need to set, A, B and C, we can define a grid as follows:

• A: [1, 2, 3, 4, 5]
• B: [100, 200, 300]
• C: [0, 0.25, 0.5, 0.75, 1]

Then the grid search algorithm will build models using all possible combinations of the above parameters. After this, the performance of each model is assessed using some predefined metric. In order to mitigate any sample bias, the grid search algorithm makes use of five-fold cross-validation [17]. Below we formally summarize the steps of a grid search:

1. Split the training data into five folds.
   • Let D_1, D_2, D_3, D_4, D_5 be the different folds.
2. Pick a possible combination from the grid.
   • For i = 1:5
     (a) Train a model using the chosen parameters on all folds except D_i.
     (b) Use the model built in the previous step and check its performance on D_i.
   • Average the performance metric over all five folds.
3. Repeat step two until all combinations have been tested.
4. The optimal parameters are given by the combination that yielded the best performing model according to the predefined metric.

For the randomized search, the combination in step two is picked randomly from the grid, unlike the grid search in which we sequentially test all combinations. In addition, step three is replaced by a limit that defines how many combinations to test [18]. Thus a randomized search will not exhaust the whole grid, since that would be too computationally expensive.
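Both procedures are available in scikit-learn; the sketch below mirrors the toy grid above and uses ROC AUC with five-fold cross-validation as the scoring rule. The estimator and the parameter values are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"max_depth": [1, 2, 3, 4, 5], "n_estimators": [100, 200, 300]}

# Exhaustive grid search over all combinations
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)                       # X_train, y_train: placeholder training data
print(grid.best_params_, grid.best_score_)

# Randomized search: only n_iter randomly drawn combinations are evaluated
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=10, cv=5, scoring="roc_auc", random_state=0)
rand.fit(X_train, y_train)
```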

More details regarding the parameters that will be tuned are given throughout the methodology section.

2.8 Model Evaluation

In order to evaluate a model's performance, different metrics are needed. Essentially, these metrics try to quantify a model's ability to generalize and will thus indicate how the model will perform on unseen test data. This makes it possible to compare different models based on these metrics, which is the goal of this thesis.

2.8.1 Confusion Matrix

One way of evaluating the performance is by inspecting the confusion matrix. The confusion matrix depicts the relationship between the predicted outcome and the true outcome and is the foundation of the performance measures used in this paper. The 2x2 matrix consists of the four outcomes True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), as illustrated in the figure below.

Figure 5: Confusion Matrix

In our case, the four different outcomes correspond to the following events:

• TP: The model predicts default, and the true outcome is default.
• FP: The model predicts default, but the true outcome is no-default.
• TN: The model predicts no-default, and the true outcome is no-default.
• FN: The model predicts no-default, but the true outcome is default.

Some basic performance measures that can be calculated from the matrix are the test error rate and the test accuracy (ACC), defined as follows:

$$\text{Error rate} = \frac{FN + FP}{TP + FP + FN + TN} \qquad \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (28)$$
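With scikit-learn the confusion matrix and the measures in Eq. (28) can be obtained as follows (y_test and y_pred are placeholders for the true labels and the class predictions):

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)     # Eq. (28)
error_rate = (fn + fp) / (tp + fp + fn + tn)
```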

2.8.2 Receiver Operating Characteristic

For a better overview of the performance, one can look at the Receiver Operating Characteristic (ROC) curve. The curve illustrates the relationship between the true positive rate (TPR) and the false positive rate (FPR). The closer the curve is to the upper left corner, the better the model is at distinguishing TP from FP. The TPR is the rate of correctly predicted defaults among all true defaults, and the FPR is the rate of incorrectly predicted defaults among all true non-defaults. The rates are defined as follows:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN} \qquad (29)$$
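Given predicted default probabilities (y_score, e.g. the positive-class column of predict_proba), the ROC curve and its AUC can be computed as in the sketch below.

```python
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # FPR and TPR of Eq. (29) per threshold
roc_auc = roc_auc_score(y_test, y_score)
```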

Figure 6

The dotted diagonal line represents the line of no-discrimination, meaning that a model performing along this line is as good a predictor as a random guess. Points under the diagonal represent negative predictive power, meaning that the model reverses the predictions and tends to predict default when the true outcome is non-default, and non-default when the true outcome is default.

2.8.3 Precision-Recall Curve

Another type of performance measure that will be considered in this project is the relationship between precision and recall. Precision is a measure of the positive predictive value, in our case the proportion of correctly classified defaults among all observations classified as default. Recall, on the other hand, is a measure of the true positive rate, i.e. the rate of correctly predicted defaults among all true defaults, as mentioned in the previous section. The interpretation of the two measures in this project is that low precision indicates a high opportunity cost, since many good customers are being declined. Low recall, on the other hand, indicates that our model is accepting customers that will most likely default. The optimal case would be to maximize both precision and recall, but in this case recall is slightly more important. The precision and recall are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad (30)$$

Figure 7

Unlike the case of a ROC curve where the no-skill line is fixed on the diagonal, the no-skill line in this case changes based on the default ratio in the dataset [19]. An imbalanced 90/10 class distribution would give a horizontal no-skill line at y = 0.1.

F-score

Since the precision and recall scores are both important, one can make use of the F-score, which is a weighted average of precision and recall [20]. Below we present the formula for the F-score when precision and recall are weighted equally. This is also the scenario for this thesis, since finding appropriate weights is out of scope.

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (31)$$
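The corresponding precision-recall quantities and the equally weighted F-score of Eq. (31) can be computed analogously; y_score and y_pred are placeholders as above.

```python
from sklearn.metrics import auc, f1_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
pr_auc = auc(recall, precision)      # area under the precision-recall curve
f1 = f1_score(y_test, y_pred)        # Eq. (31) with equal weights
```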

2.8.4 Area Under the Curve

One way of quantifying the performance using ROC- and PR-curves is to measure the Area Under the Curve (AUC). The AUC is simply a measure of how close the curve is to the upper left corner and is given on the interval [0, 1], where 1 corresponds to a perfect model predicting all outcomes correctly and 0 to a model predicting all outcomes reversely. A downside of the AUC is that it does not provide information about the characteristics of the curve. The ROC-curve could, for example, be skewed towards the left (see figure (8)) and have the same AUC as a curve skewed to the right (see figure (9)), yet the two models would perform very differently. The former corresponds to a higher TPR, meaning that the number of FN decreases; this is desirable from Klarna's perspective, since predicting non-default when the true outcome is default results in a credit loss. The latter corresponds to an opportunity cost in the form of a missed sale. Due to the importance of this measure, it will be the main metric of comparison when the models are analyzed.

Figure 8: Skewed to the left
Figure 9: Skewed to the right

The following figures illustrate the relationship between the ROC-curve and different values of AUC for a constant threshold value of 0.5:

Figures 10-17: ROC-curves corresponding to different values of AUC at a constant threshold of 0.5

3 Methodology

In this section we describe the process of how our results are achieved. The intention is to describe the process to such an extent that the project can be replicated by others.

3.1 Data

The data that will be used in this thesis consists of external credit bureau data and internal data. In total, the dataset that we will work with consists of approximately two million rows and 730 columns. Due to privacy issues and the sensitive nature of the data, a detailed description of the variables cannot be given in this thesis. However, we will give a detailed description of how the dataset is preprocessed in order for the different algorithms to be applied.

3.1.1 Target Variable

The target value indicates whether a customer has defaulted, and the results may vary depending on how a default is defined. For example, even if a customer does not pay by the due date, it might be because the customer has forgotten about the instalment and ends up paying it after the due date. If we define default as "not paid by due date" this customer will be seen as a "bad customer", but this will not be the case if our default definition is "unpaid 90 days after due date". Thus, the default definition will affect the characteristics of the dataset used for modelling, by filtering "false defaults" from "real defaults", and thereby affect the models. However, to be able to compare our models to the current model we will use the same default definition as the previous model. Thus, we consider a customer to have defaulted if the full order value has not been returned by the due date.

3.1.2 Preprocessing

Since the dataset is relatively large, performing an explorative data analysis is not feasible. Therefore we want to subset the dataset in a structured and logical way. The first step in this subsetting is to remove every row for which the boolean target value defaulted is missing. This step reduces the number of rows to approximately 500,000. The next step is to remove every column which contains only one value, since such variables are redundant in any ML-model. This step removes approximately 135 columns. After this, the high-cardinality variables not suitable for prediction are removed. For example, a customer's email is not relevant, and the dataset contains a unique email for each order. Thus this variable is a categorical variable with 500,000 classes, which renders it useless. To remove variables like this each column needs to be inspected, but since we have approximately 570 columns it is not viable to check each column manually. Therefore the following approach was used:

1. Check which columns have more than 1000 different values. (147 columns)
2. Find those which are NOT numeric or integer. (19 columns)
3. Remove the rest from the dataset after inspection. (18 columns)

The logic behind this is that if a non-numeric or non-integer column has more than 1000 different values, performing a one-hot encoding would mean adding 1000 variables for that particular column. This is not feasible and therefore these columns are removed. But in

order not to remove variables that should be included in the model, all variables in step three above are inspected before removal. As can be noted in step three above, one variable with more than 1000 categories is kept since it is believed to be important. However, this variable will only be used in the CatBoost model, since CatBoost does not rely on one-hot encoding.

After this we remove all columns for which more than 80% of the data is missing. This cut-off value provides approximately 100,000 observations, which is sufficient for modelling purposes, and further reduces the dataset by 107 columns. In the dataset a lot of boolean variables have duplicates with a 0/1 representation instead of True and False. Therefore one of the duplicates needs to be removed for each such variable, which results in 31 columns being removed. The next step of the preprocessing is to remove columns that have low variation and do not manage to distinguish between defaults and non-defaults. The cut-off value that we have used for this step is variation below 1%. For example, if a categorical variable contains two classes, A and B, and 99 percent of the observations belong to class A, then this column is redundant IF the target values for the classes don't differ. This means that the classes A and B are not good at distinguishing whether or not a customer has defaulted, and the column can therefore be removed. At this step extreme cases are treated a bit differently, since even if an extremely imbalanced column manages to distinguish the targets very well it might introduce unwanted bias. This can be illustrated using the previous example, but now class B contains fewer than 10 observations and all of these are defaults. Then the ML-algorithm will be biased based on these 10 observations, which is unwanted. This step removes 45 columns, after which we split the dataset into one numerical dataset and one categorical dataset, since these variables have to be analyzed differently.

For the categorical columns, which have now been reduced to a number feasible to analyze manually, we inspect each column and remove those that are not suitable for any ML-algorithm. After this step the categorical columns are preprocessed and ready, so we proceed to describe the remaining steps that were applied to the numerical variables. To further reduce the number of columns we remove numerical columns that have a high correlation with other columns. For this we compute the pairwise Pearson correlation for all numerical columns in the dataset. For example, if six columns correlate strongly, we keep the column that has the fewest missing elements and remove the rest. This step results in approximately 180 columns being removed. The last step of the preprocessing is to remove the observations for which 90% of the principal has been returned by the due date. The logic behind this comes from the structure of Pay Later in 4 Parts: since there are four equal instalments, a customer that did not pay the last instalment will have returned 75% of the principal, but if a customer has paid more than 90% this indicates that the customer might have accidentally filled out the invoice incorrectly.
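A few of the automatic preprocessing steps described above can be expressed as a short pandas sketch. The target column name and the correlation cut-off are illustrative placeholders; the manual inspections described in the text are not captured here, and the thesis keeps the least-missing column of each correlated group rather than simply the first.

```python
import numpy as np
import pandas as pd

def basic_preprocessing(df, target_col="defaulted", missing_cutoff=0.8, corr_cutoff=0.95):
    """Sketch of the row/column filters described above (placeholder thresholds)."""
    df = df.dropna(subset=[target_col])                  # drop rows with missing target
    df = df.loc[:, df.nunique(dropna=True) > 1]          # drop single-valued columns
    df = df.loc[:, df.isna().mean() <= missing_cutoff]   # drop columns with too much missing data

    # Drop one column from each pair of strongly correlated numerical columns
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_cutoff).any()]
    return df.drop(columns=to_drop)
```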

When all of the above steps have been completed the dataset has now been reduced to 218 columns and about 460 thousand rows. The reason for removing rows is simply that the target value is missing.

Since the algorithms work a bit differently the preprocessed dataset will have to be adapted for each model. The CatBoost algorithm can handle missing values and makes use of target statistics for categorical variables and therefore no one-hot encoding is needed. Therefore no further processing of the already preprocessed dataset is needed for the CatBoost model. For the Random Forest and Logistic Regression the situation is a bit different since both algorithms require categorical variables to be one-hot encoded and missing values to be either imputed or removed. In this thesis we will

handle missing values by using mean/median imputation for the numerical variables, whilst creating a new class, missing, for the missing categorical values.

3.1.3 Balancing Data

The dataset that is used in this thesis suffers from class imbalance: the ratio between defaults and non-defaults is approximately 1:10. The problem that arises due to this is that a classifier can achieve a prediction accuracy of 90% by classifying all observations as non-defaults. The effect of this is that the classifier will overtrain on the majority class, making it very accurate at classifying non-defaults, but undertrain on the minority class, which leads to low accuracy for detecting defaults. There are many ways to handle class imbalance, and considering that our dataset is relatively large we decided to apply random undersampling instead of oversampling, since undersampling still provides us with a relatively large dataset for modelling purposes. For the CatBoost algorithm and the Logistic Regression we also compared undersampling to the built-in class-weight parameter, which adjusts for class imbalance by adding weights to the minority class samples, increasing the log-loss incurred if the classification is incorrect. We noted that for the CatBoost algorithm this yielded better results, in terms of precision and recall, on the test set, but for the Logistic Regression the performance with class weights was worse than with undersampling. Therefore undersampling was used for the Logistic Regression and class weights for CatBoost. To solve the imbalance problem for the Random Forest algorithm we implemented a Balanced Random Forest and compared its performance to a regular Random Forest built on undersampled data. The performance of the two approaches was very similar, but due to the longer training time of the Balanced Random Forest we opted for a regular Random Forest with undersampling applied.
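The two balancing strategies that were compared can be sketched as follows; RandomUnderSampler comes from the imbalanced-learn package, and the weight below simply up-weights the minority (default) class in CatBoost's log-loss. X_train and y_train are placeholders.

```python
from catboost import CatBoostClassifier
from imblearn.under_sampling import RandomUnderSampler

# Option 1: random undersampling of the majority class (used for Logistic Regression
# and Random Forest), giving a roughly 1:1 training set
rus = RandomUnderSampler(random_state=0)
X_bal, y_bal = rus.fit_resample(X_train, y_train)

# Option 2: class weights (used for CatBoost) - keep all rows but let misclassified
# minority-class observations incur a larger log-loss
weight = (y_train == 0).sum() / (y_train == 1).sum()
weighted_model = CatBoostClassifier(class_weights=[1.0, weight], verbose=False)
```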

3.2 Model Selection

In this section we explain how we performed model selection for each of the different algorithms. By model selection we refer to the process of deciding how many variables to include and which parameters the model should have. These parameters and variables are then used to build the final models on which we base the analysis. Note that the variables used might differ between the models since there are structural differences between them.

3.2.1 Recursive Feature Elimination: RFE

The lower limit of variables to use is set to 40, since this is the number of variables used in the current XGBoost model. The metric that is calculated for each fitted model in the RFE is the ROCAUC, and a five-fold cross-validation is used when the metric is calculated. The RFE is only applied to the numerical features, since any categorical variables included would be one-hot-encoded; if one level of a categorical variable were removed by the RFE, the interpretation of the model would be skewed. This, in conjunction with the fact that we only have ten categorical variables in total, makes it more reasonable to apply the RFE only to the numerical variables. Since the tree-based models are fundamentally different from the Logistic Regression, we might risk adding bias by using an RFE based on a Random Forest classifier. To avoid this we perform two RFEs, one based on a Random Forest classifier and another based on a Logistic Regression classifier. The RFE results are presented in the figures below.

Figure 18: RFE using Random Forest

Figure 19: RFE using Logistic Regression

From figure (18) we can see that a model using 131 variables gives the best ROCAUC score, but we can also see a peak in the ROCAUC around 86 variables. Since fewer variables also imply less complexity, it is desirable to use as few variables as possible without compromising performance; this is also referred to as the principle of Occam's Razor (OR) [21]. Based on this we implement models containing all variables (218), the RFE-optimal number (131) and the OR-optimal number (86) for the tree-based algorithms, and models containing all variables (218), the RFE-optimal number (123) and the OR-optimal number (65) for the Logistic Regression model.

3.2.2 Logistic Regression

In order for a Logistic Regression to perform well we first had to standardize the variables by setting the mean to zero and scaling to unit variance. This is done in order to prevent variables with larger variance from dominating the objective function over variables with lower variance, which would reduce the predictive power [22].

The data is then split into training and test sets with an 80/20 % distribution. The training set then undergoes undersampling in order to deal with the class imbalance mentioned in earlier sections, while the test set is kept imbalanced to represent the real-world default ratio. Since the Logistic Regression algorithm cannot handle missing values in the data, we imputed each numerical feature individually with the median of the corresponding feature in the training set. As mentioned earlier, this was not done for the categorical features.
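A minimal sketch of these preprocessing steps, assuming a pandas DataFrame X, labels y and a hypothetical list numerical_cols; the random seed and variable names are illustrative, and the training set is subsequently undersampled as in Section 3.1.3.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 split; stratify to preserve the real-world default ratio in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Median imputation and standardisation are fitted on the training set only,
# then applied to the test set, to avoid information leakage.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train_num = scaler.fit_transform(imputer.fit_transform(X_train[numerical_cols]))
X_test_num = scaler.transform(imputer.transform(X_test[numerical_cols]))
```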

To build a regularized Logistic Regression we first implemented a randomized search algorithm, optimized on the ROCAUC-score, in order to obtain the optimal hyperparameter for our model. The algorithm builds a number of models, each with a randomly chosen value of the parameter from the defined grid. The models' ROCAUC-scores are then compared and the parameter of the model with the highest ROCAUC-score is returned. Note that, as in the case of the RFE and for the same reasons, this is done only for the numerical variables; thus LASSO is applied only to the numerical variables. The parameter that is searched over is the regularization strength C, which was presented in the theory section on LASSO. We remind the reader that a value closer to zero specifies stronger regularization and that C is the reciprocal of the shrinkage parameter λ, i.e. C = 1/λ.

The values for the parameter are as follows:

• C : the continuous uniform interval [0, 1], together with the values 10 and 100

The optimal parameter according to the ROCAUC-score from the randomized search algorithm is the following:

• C = 0.1764
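A minimal sketch of such a randomized search, assuming scikit-learn's RandomizedSearchCV on the undersampled, scaled numerical training data (X_train_num_balanced and y_train_balanced are hypothetical names); the number of sampled candidates and the number of cross-validation folds are assumptions.

```python
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# L1-regularized Logistic Regression; C is the reciprocal of the shrinkage parameter lambda.
lasso_logreg = LogisticRegression(penalty="l1", solver="liblinear")

# The search space: a continuous uniform interval on [0, 1] plus two discrete candidates.
param_distributions = [
    {"C": uniform(loc=0, scale=1)},
    {"C": [10, 100]},
]

search = RandomizedSearchCV(
    lasso_logreg,
    param_distributions=param_distributions,
    n_iter=50,            # assumption: number of randomly sampled candidates
    scoring="roc_auc",
    cv=5,                 # assumption: five-fold cross validation
    random_state=42,
)
search.fit(X_train_num_balanced, y_train_balanced)
print("Best C:", search.best_params_["C"])
```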

Now the regularized Logistic Regression model is built with the obtained parameter, using only the numerical variables. Due to the imposed L1-penalty some of the coefficients in this model are exactly zero. The corresponding variables are removed and we create a new dataset consisting of the remaining numerical variables together with the categorical variables that were excluded before LASSO. A Logistic Regression model is then trained on this new dataset; this model will be referred to as the L1 Logistic Regression, since it uses LASSO for feature selection. Its performance is then compared to the Logistic Regression models using RFE and OR as feature selection. Using LASSO with the obtained parameter C for feature selection resulted in 158 variables being selected. Presenting a LASSO path for this would be redundant, since there are more than 200 variables, which renders the plot unreadable. The ROC-curves with corresponding AUCs are visualized below, together with precision-recall curves on the training and test sets.
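The LASSO-based feature selection step can be sketched as follows, again with hypothetical variable names (numerical_cols, categorical_dummy_cols, X_train_balanced) and the C value taken from the search above; this is an illustrative sketch rather than the exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit the L1-penalized model on the scaled numerical variables with the C found above.
lasso = LogisticRegression(penalty="l1", C=0.1764, solver="liblinear")
lasso.fit(X_train_num_balanced, y_train_balanced)

# Keep only the numerical variables whose coefficients were not shrunk to zero.
kept_numerical = np.array(numerical_cols)[lasso.coef_.ravel() != 0]

# Rebuild the training data with the surviving numerical variables plus the
# one-hot-encoded categorical variables excluded from the LASSO step, and train
# the final "L1 Logistic Regression" on it.
selected_cols = list(kept_numerical) + list(categorical_dummy_cols)
l1_logreg = LogisticRegression(max_iter=1000)
l1_logreg.fit(X_train_balanced[selected_cols], y_train_balanced)
```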

Figure 20: ROC-curve: Test Set
Figure 21: PR-curve: Test Set

The ROCAUCs on the test set are marginally different from each other, indicating that the Logistic Regression model is insensitive to the number of variables. Also note that the PRAUCs are remarkably low. This is because the precision-recall curve is calculated on the imbalanced test set.
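For reference, a minimal sketch of how these two metrics can be computed with scikit-learn, assuming a fitted model and hypothetical test arrays X_test_final and y_test.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# Predicted probability of default (the positive class) on the imbalanced test set.
p_default = model.predict_proba(X_test_final)[:, 1]

roc_auc = roc_auc_score(y_test, p_default)

# The PR baseline equals the positive-class prevalence (about 0.1 here),
# which is why the PRAUC looks low even for a reasonable classifier.
precision, recall, _ = precision_recall_curve(y_test, p_default)
pr_auc = auc(recall, precision)

print(f"ROCAUC: {roc_auc:.2f}, PRAUC: {pr_auc:.2f}")
```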

Figure 22: ROC-curve: Training Set
Figure 23: PR-curve: Training Set

As can be observed from figure (22), the ROCAUCs on the training sets are at a reasonable level. A much higher score would have indicated overfitting and poor generalization of the classifier, which is not desirable. Here the PRAUCs are significantly higher, since the training set is balanced.

Based on these plots it is clear that a Logistic Regression with 65 variables will be used as the final model, since the performance differences are marginal and lower complexity is preferred.

3.2.3 CatBoost

As mentioned in the theory section, the CatBoost algorithm can be implemented using ordered boosting or plain boosting. To decide which boosting mode to use we built pairs of otherwise identical models, one per mode, and compared their performance in terms of ROCAUC and PRAUC. To build the CatBoost models we implemented a grid search to find the best hyperparameters from a predefined list. The following parameters were grid-searched:

• Tree depth
• L2 Leaf Regularization: penalizes large weights in each leaf [23].

For the tree depth it is recommended to use values in the range 4-10, as this is optimal in most cases [23]. For the L2 leaf regularization any positive value is allowed [23]. Since there were no recommendations on which values to use for the L2 leaf regularization we made our grid relatively large, ranging from 0 to 100, and shrunk the grid based on the results. The final grid on which the parameters were tuned was the following:

• Tree depth: [4, 5, 6, 7, 8]
• L2 Leaf Regularization: [18, 26, 35, 40, 50]

The number of trees, i.e. boosting steps, is determined by using the default value of 1000 combined with an overfitting detector which stops the boosting when the ROCAUC has not increased for 50 boosting steps. This prevents the model from overfitting [24]. Another parameter that is set by the algorithm itself is the learning rate, which governs the gradient step [23]. This value is chosen automatically based on properties of the dataset and the number of trees used, and should be close to the optimal one [23]. In the figures below we find the ROCAUC-score on the y-axis and the 30 different models built on the x-axis. The optimal models, yielding the highest ROCAUC-score on the validation set, were as follows for the two boosting modes:

• Ordered CatBoost: Tree depth = 7, L2 Leaf Reg = 40
• Plain CatBoost: Tree depth = 8, L2 Leaf Reg = 40

Figure 24: Grid Search: Ordered Boosting
Figure 25: Grid Search: Plain Boosting
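A minimal sketch of the grid search over the two boosting modes described above, assuming CatBoost's built-in grid_search method and hypothetical training objects X_train_cb, y_train_cb and categorical_cols; the class weights and the number of cross-validation folds are assumptions.

```python
from catboost import CatBoostClassifier

param_grid = {
    "depth": [4, 5, 6, 7, 8],
    "l2_leaf_reg": [18, 26, 35, 40, 50],
}

for boosting_type in ["Ordered", "Plain"]:
    model = CatBoostClassifier(
        boosting_type=boosting_type,
        iterations=1000,            # default number of boosting steps
        eval_metric="AUC",
        od_type="Iter",             # overfitting detector on iterations
        od_wait=50,                 # stop if the metric has not improved for 50 rounds
        class_weights=[1, 10],      # assumption: rough 1:10 imbalance from Section 3.1.3
        cat_features=categorical_cols,
        verbose=False,
    )
    result = model.grid_search(param_grid, X=X_train_cb, y=y_train_cb, cv=5)
    print(boosting_type, result["params"])
```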

This grid search was then repeated for the RFE-selected variables and for the 86 variables selected using figure (18). The grid search results for the RFE-selected variables and the variables selected using Occam's Razor are presented in the following figures for both boosting modes.

Figure 26: Grid Search: Ordered Boosting RFE
Figure 27: Grid Search: Plain Boosting RFE

Figure 28: Grid Search: Ordered Boosting Occam's Razor
Figure 29: Grid Search: Plain Boosting Occam's Razor

Below we present the optimal parameters for each model according to the grid search:

• Ordered CatBoost (RFE): Tree depth = 8, L2 Leaf Reg = 40
• Plain CatBoost (RFE): Tree depth = 8, L2 Leaf Reg = 40
• Ordered CatBoost (OR): Tree depth = 8, L2 Leaf Reg = 50
• Plain CatBoost (OR): Tree depth = 8, L2 Leaf Reg = 35

When the optimal parameters for every model have been found we train the models on the training set and evaluate their performance by plotting the ROCAUC-scores for the training and test sets.

Besides this we also look at the precision-recall curves for the same datasets. By assessing the scores on the training sets we can understand how well the different models generalize to the data. It also becomes possible to detect potential sample bias if the performance differs greatly across the sets. For example, if the test set happens to contain observations that are more "easily" classified as defaults, the performance on the test set might be higher than on the training set.

Below we find the ROC-curves with their respective AUC-values for each model and for each dataset.

Figure 30: ROC-curve: Test Set
Figure 31: PR-curve: Test Set

From the figures above we see that the plain boosting mode seems to outperform the ordered mode, both in terms of the AUC of the ROC-curve and the AUC of the PR-curve. One should also note that the feature elimination using RFE and Occam's Razor does not decrease the performance of the classifier. Hence the principle of Occam's Razor has been successfully applied, since we have removed excess complexity without sacrificing performance.

Now we analyze these plots on the training set:

Figure 32: ROC-curve: Training Set
Figure 33: PR-curve: Training Set

In these figures we see the effect of the ordered boosting. All versions of the ordered boosting mode have lower performance, both in terms of the ROC-curve and the PR-curve. It is, however, important to remember that a very high score on the training set is not desirable, since it would indicate overfitting and poor generalization of the classifier. We can also clearly see in these plots that the variables selected using Occam's Razor provide the best performance. Furthermore, the developers of the CatBoost algorithm compared the performance of the two boosting modes on different datasets; the results indicated that the ordered mode is preferred when the dataset is relatively small, i.e. fewer than 40k training observations [13]. This also speaks in favor of the plain boosting mode.

Based on all this it is clear that 86 variables should be used in the final model. Regarding the boosting mode, we argue that even if the plain boosting mode generalizes somewhat worse than the ordered mode, its performance is still very consistent across the different sets. This, combined with the fact that the plain boosting mode shows better performance, leads to the exclusion of the ordered boosting mode.

3.2.4 Random Forest

For the Random Forest algorithm more parameters had to be tuned, which makes a full grid search computationally inefficient. We therefore implemented a randomized search instead. The difference is that in a grid search every possible combination of the given parameters is tested; for the CatBoost algorithm there were in total 30 different combinations, meaning that 30 different models were created and tested systematically. For the Random Forest grid there are more than 400 combinations, which makes a full grid search infeasible. The randomized search covers a larger parameter space and picks the parameter combinations at random. In this thesis 100 randomly selected combinations were tried from the predefined grid, which is presented below:

• min samples split: [2, 5, 10, 15]
• max depth: [10, 37, 63, 90, 117, 143, 170, 197, 223, 250, None]
• min samples leaf : [1, 2, 4, 6, 8]

• bootstrap: [True, False]

To determine the number of trees to use in the Random Forest we used the following plot, in which the OOB-error is plotted against the number of trees used. Based on this plot we choose the number of trees where the decrease in the OOB-error seems to stabilize. From the plot we can see that the OOB-error seems to stabilize after 1000 trees, and thus 1000 trees will be used in each Random Forest. A sketch of the randomized search is given after figure (34).

Figure 34: OOB-errors vs. Number of Trees
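A minimal sketch of the randomized search described above, assuming scikit-learn's RandomizedSearchCV on the undersampled training set (X_train_rf and y_train_rf are hypothetical names); the number of cross-validation folds and the random seed are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "min_samples_split": [2, 5, 10, 15],
    "max_depth": [10, 37, 63, 90, 117, 143, 170, 197, 223, 250, None],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "bootstrap": [True, False],
}

rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=42)

search = RandomizedSearchCV(
    rf,
    param_distributions=param_distributions,
    n_iter=100,            # 100 randomly sampled combinations from the grid
    scoring="roc_auc",
    cv=5,                  # assumption: five-fold cross validation
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train_rf, y_train_rf)   # undersampled training set
print(search.best_params_)
```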

When the randomized search was completed it yielded the following parameters for the different Random Forest models:

• Random Forest: max depth = 250, min samples split = 2, min samples leaf = 1
• Random Forest (RFE): max depth = 250, min samples split = 2, min samples leaf = 1
• Random Forest (OR): max depth = 250, min samples split = 2, min samples leaf = 1

To decide which of the models to use we analyze the AUC-scores of the ROC-curve and the PR-curve.

Figure 35: ROC-curve: Test Set
Figure 36: PR-curve: Test Set

In the figures above we observe the same pattern as for the CatBoost classifier. The Random Forest model built by applying Occam's Razor performs on par with the RFE-optimal model and the full model. This was expected, since the RFE was applied using a Random Forest classifier. Here the RFE-selected variables seem to provide a slightly better model when we consider the AUC of the PR-curve, but considering that the RFE-selected model has 45 more features, the 0.01 gain in AUC is not worth the extra complexity. Analyzing the performance on the training set is redundant for a Random Forest classifier: when the maximum depth is relatively large, the forest grows trees until the leaves contain only one observation, so by construction the Random Forest will be overfitted on the training set. Since the randomized search was performed using the ROCAUC-score, however, the resulting models are believed to have the best performance.

By analyzing the above plots it is clear that the OR-optimal Random Forest should be used, since it performs on par with the other models but with lower complexity in terms of features.

4 Results

In this section the resulting models will be presented along with relevant measures to make a comparison possible.

Since there are different costs associated with false positives and false negatives, we have to use metrics that highlight this. The cost associated with a default is clearly greater than the cost of declining a customer who would not have defaulted: if a customer defaults the direct loss equals the order amount, whereas declining a customer only means that a small percentage of the order amount, the transaction fee, is lost. The optimal case would therefore be to put somewhat more emphasis on false negatives than on false positives. Determining the exact trade-off between the two situations would, however, require a rigorous cost analysis which is not in the scope of this thesis. Below we find the precision, recall, F and AUC-scores for the final models, together with the ROCAUC of Klarna's current XGBoost model:

Model           Precision   Recall   F      ROCAUC   PRAUC
XGBoost         -           -        -      0.77     -
CatBoost        0.18        0.65     0.29   0.78     0.29
Random Forest   0.16        0.71     0.26   0.75     0.24
LogReg          0.14        0.66     0.23   0.71     0.18

Table of model results

Below we present the ROC-curves and PR-curves for all models in order to visualize the results:

Figure 37: ROC-curves: Test Set
Figure 38: PR-curves: Test Set

5 Discussion

In this section the results presented in the previous section will be discussed. We also address our posed research questions.

We present our research questions to be addressed below:

• What are the benefits and drawbacks of using Logistic Regression, Random Forest and CatBoost in order to predict defaults?
• How does the optimal algorithm compare to the current XGBoost model?

In order to answer the research questions posed we use the results presented in the previous section. From the results we can see a distinct pattern for almost all of the presented metrics: the CatBoost algorithm outperforms both the Logistic Regression and the Random Forest. Compared to the Logistic Regression the difference in performance is relatively large. The only metric on which the CatBoost algorithm seems to underperform compared to the other algorithms is the recall, where the Random Forest outperforms the CatBoost by seven percentage points. Essentially this means that the Random Forest and the Logistic Regression models are better at finding the true positives, i.e. predicting actual defaults. One could therefore argue that the Random Forest model is better, but it is important to remember that the recall and precision scores need to be analyzed together. A relatively high recall-score is not very impressive if the precision-score is poor, since this in practice means that the false positive rate is relatively large. For example, if the algorithm simply classifies every observation as a default, the recall-score will be equal to 1 since every true default has been classified correctly, but this is not a good model since the alternative cost will be huge if the majority of the customers are declined. Therefore we consider the F-score and the PRAUC, since these combine precision and recall. Considering these metrics we can see that the CatBoost model outperforms the two other algorithms.

The main metric of interest for Klarna is the ROCAUC-score, since it summarizes the model's ability to separate defaults from non-defaults. Comparing this score we can see that the CatBoost algorithm significantly outperforms the Random Forest and the Logistic Regression, but only marginally outperforms Klarna's current XGBoost model. The reason behind the relatively large difference in performance can be related to the underlying structure of the models. Both the CatBoost and XGBoost algorithms are gradient boosting methods, which might be better suited for this kind of setting. In a similar study at Nordea Bank, where seven machine learning algorithms were compared, two of the top three models were gradient boosted [25]. Hence it is very possible that the nature of the models is responsible for the performance gap between the three different ML techniques.

However, even if the results indicate that the CatBoost outperforms the XGBoost, we need to address the fact that using different datasets for building the models can influence the results in either direction. The data used for the CatBoost model is from December 2019 to the beginning of January 2020. A majority of the observations are from December, and due to Christmas there is a possibility that the data is populated differently. The data used for the XGBoost model ranges from the start of May 2019 to the end of July 2019. Therefore, variables used in the XGBoost model might contain less information, i.e. more missing rows, than the exact same variables used in the CatBoost model. Thus the characteristics of the datasets used should also be considered when interpreting the results.

Another thing to consider is that the CatBoost algorithm is the only one of the algorithms that does not need any encoding of categorical variables, since this is built into the algorithm. Thus, in cases where there is a relatively large number of categorical variables with relatively high cardinality, the preprocessing needed for CatBoost is much simpler.
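For illustration, passing raw categorical columns to CatBoost only requires pointing out which features are categorical; the variable names below are hypothetical.

```python
from catboost import CatBoostClassifier, Pool

# CatBoost consumes the raw categorical columns directly; no one-hot encoding is
# needed, only the names (or indices) of the categorical features.
train_pool = Pool(data=X_train_raw, label=y_train, cat_features=categorical_cols)

model = CatBoostClassifier(verbose=False)
model.fit(train_pool)
```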

5.1 Conclusion

In this section a final recommendation is given regarding which model to use and why.

Our final conclusion, considering the results, is that it is of great interest for Klarna Bank to consider a CatBoost algorithm for a PD model, since it performs on par with their current XGBoost model. CatBoost is also preferable since it does not require categorical variables to be encoded, which XGBoost does.

5.1.1 Future Studies

Since this study aimed to compare different algorithms, the different encoding methods that can be used for categorical variables were not truly explored. It would therefore be of interest to see how different encoding methods affect the performance of a classifier. Also, since there are different costs associated with declining a "good" customer versus accepting a "bad" customer, it would be interesting to quantify this effect by analyzing the different costs and weighting the F-score accordingly.
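As a sketch of the last point, scikit-learn's F-beta score allows recall to be weighted more heavily than precision; the choice beta = 2 below is purely illustrative and y_test, y_pred are hypothetical arrays of true and predicted labels.

```python
from sklearn.metrics import fbeta_score

# beta > 1 weights recall (catching actual defaults) more heavily than precision,
# reflecting that a missed default costs the full order amount while a wrongly
# declined customer only costs the transaction fee. beta = 2 is purely illustrative;
# a real choice would follow from the cost analysis discussed above.
weighted_f = fbeta_score(y_test, y_pred, beta=2)
```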

6 References

[1] Kate Rooney. Online shopping overtakes a major part of retail for the first time ever, April 2019. URL: https://www.cnbc.com/2019/04/02/online-shopping-officially-overtakes-brick-and-mortar-retail-for-the-first-time-ever.html.
[2] CatBoost - state-of-the-art open-source gradient boosting with categorical features support. CatBoost, 2020. URL: https://catboost.ai/#benchmark.
[3] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 103 of Springer Texts in Statistics. Springer New York, New York, NY, 2013. ISBN 978-1-4614-7137-0, 978-1-4614-7138-7. doi: 10.1007/978-1-4614-7138-7. URL: http://link.springer.com/10.1007/978-1-4614-7138-7.
[4] J.S. Cramer. The Origins of Logistic Regression, November 2002. URL: https://papers.tinbergen.nl/02119.pdf.
[5] Vishal Morde. XGBoost Algorithm: Long May She Reign!, April 2019. URL: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d.
[6] Kaitlin Kirasich, Trace Smith, and Bivin Sadler. Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets. 1(3):25, 2018. URL: https://scholar.smu.edu/cgi/viewcontent.cgi?article=1041&context=datasciencereview.
[7] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learning - Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, second edition, 2009.
[8] Douglas M Hawkins. The problem of overfitting. pages 1-12, 2004.
[9] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity - The Lasso and Generalizations. Taylor & Francis Inc, August 2015. ISBN 1-4987-1216-9. URL: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf.
[10] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.
[11] Chao Chen, Andy Liaw, and Leo Breiman. Using Random Forest to Learn Imbalanced Data. URL: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf.
[12] Anna Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, and Aleksandr Vorobev. Fighting biases with dynamic boosting. June 2017. URL: https://www.researchgate.net/publication/318030603_Fighting_biases_with_dynamic_boosting/link/59b6742d0f7e9bd4a7fbef17/download.
[13] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. January 2019. URL: http://arxiv.org/abs/1706.09516.
[14] Daniel Chepenko. Introduction to gradient boosting on decision trees with Catboost, February 2019. URL: https://towardsdatascience.com/introduction-to-gradient-boosting-on-decision-trees-with-catboost-d511a9ccbd14.
[15] sklearn.feature_selection.RFE - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
[16] Lasso path using LARS - scikit-learn 0.11-git documentation. scikit-learn, 2020. URL: https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/linear_model/plot_lasso_lars.html.
[17] sklearn.model_selection.GridSearchCV - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
[18] sklearn.model_selection.RandomizedSearchCV - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html.
[19] Jason Brownlee. How to Use ROC Curves and Precision-Recall Curves for Classification in Python, August 2018. URL: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/.
[20] sklearn.metrics.f1_score - scikit-learn 0.23.0 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
[21] Brian Duignan. Occam's razor | Origin, Examples, & Facts, December 2018. URL: https://www.britannica.com/topic/Occams-razor.
[22] sklearn.preprocessing.StandardScaler - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
[23] Parameter tuning - CatBoost documentation. CatBoost, 2020. URL: https://catboost.ai/docs/concepts/parameter-tuning.html.
[24] Overfitting detector - CatBoost documentation. CatBoost, 2020. URL: https://catboost.ai/docs/concepts/overfitting-detector.html.
[25] Daria Granström and Johan Abrahamsson. Loan Default Prediction using Supervised Machine Learning Algorithms. Master's thesis, KTH Royal Institute of Technology, Stockholm, 2019. URL: http://kth.diva-portal.org/smash/get/diva2:1319711/FULLTEXT02.pdf.
