DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Modelling default probabilities: The classical vs. machine learning approach

FILIP JOVANOVIC

PAUL SINGH

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Degree Projects in Financial Mathematics (30 ECTS credits)
Master's Programme in Industrial Engineering and Management
KTH Royal Institute of Technology, year 2020
Supervisor at Klarna Bank AB: Joel Silverberg
Supervisor at KTH: Boualem Djehiche
Examiner at KTH: Boualem Djehiche

TRITA-SCI-GRU 2020:053 MAT-E 2020:018

Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci

Modelling default probabilities: The classical vs. machine learning approach

Abstract

Fintech companies that offer Buy Now, Pay Later products are heavily dependent on accurate default probability models. This is because the fintech companies bear the risk of customers not fulfilling their obligations. In order to minimize the losses incurred when customers default, several machine learning algorithms can be applied, but in an era in which machine learning is gaining popularity there is a vast number of algorithms to select from. This thesis aims to address this issue by applying three fundamentally different machine learning algorithms in order to find the best algorithm according to a selection of chosen metrics such as ROCAUC and precision-recall AUC. The algorithms that were compared are Logistic Regression, Random Forest and CatBoost. All these algorithms were benchmarked against Klarna's current XGBoost model. The results indicated that the CatBoost model is the optimal one according to the main metric of comparison, the ROCAUC-score. The CatBoost model outperformed the Logistic Regression model by seven percentage points, the Random Forest model by three percentage points and the XGBoost model by one percentage point.

Modellering av fallissemang: Klassisk metod vs. maskininlärning

Sammanfattning

Fintechbolag som erbjuder Köp Nu, Betala Senare-tjänster är starkt beroende av välfungerande fallissemangsmodeller. Detta då dessa fintechbolag bär risken av att kunder inte betalar tillbaka sina krediter. För att minimera förlusterna som uppkommer när en kund inte betalar tillbaka finns flera olika maskininlärningsalgoritmer att applicera, men i dagens explosiva utveckling på maskininlärningsfronten finns det ett stort antal algoritmer att välja mellan. Denna avhandling ämnar att testa tre olika maskininlärningsalgoritmer för att fastställa vilken av dessa som presterar bäst sett till olika prestationsmått såsom ROCAUC och precision-recall AUC. Algoritmerna som jämförs är Logistisk Regression, Random Forest och CatBoost. Samtliga algoritmers prestanda jämförs även med Klarnas nuvarande XGBoost-modell. Resultaten visar på att CatBoost-modellen är den mest optimala sett till det primära prestationsmåttet ROCAUC. CatBoost-modellen var överlägset bättre med sju procentenheter högre ROCAUC än Logistisk Regression, tre procentenheter högre ROCAUC än Random Forest och en procentenhet högre ROCAUC än Klarnas nuvarande XGBoost-modell.

Acknowledgements

We would like to express our deepest appreciation to our KTH supervisor Tatjana Pavlenko for providing us with constructive criticism and valuable advice throughout the thesis. We would also like to extend our deepest gratitude to Joel Silverberg at Klarna for making this thesis possible and offering us relentless support during the process. Many thanks to Anna Hedström and Lukas Kvissberg for helping us retrieve the data used in this thesis, and special thanks to Anna for the encouraging support. Many thanks to Adam Myrén for helping us get onboarded and assisting us with practical tasks. We would also like to thank Edvin Lundström for providing valuable feedback during the peer review. Lastly, we would like to say that it has been a great pleasure working in the US Financing team at Klarna.

Table of contents

1 Introduction
  1.1 Background
  1.2 Purpose and Research Question

2 Theory
  2.1 Binary Classification
  2.2 Logistic Regression
    2.2.1 Regularization
  2.3 Tree-Based Concepts
    2.3.1 Classification Trees
    2.3.2 Bagging
    2.3.3 Boosting
  2.4 Random Forest
  2.5 Categorical Boosting: CatBoost
    2.5.1 Ordered Boosting
    2.5.2 Ordered Target Statistics
  2.6 Feature Selection
    2.6.1 Recursive Feature Elimination: RFE
    2.6.2 Occam's Razor
    2.6.3 LASSO
  2.7 Hyper-Parameter Tuning
  2.8 Model Evaluation
    2.8.1 Confusion Matrix
    2.8.2 Receiver Operating Characteristic
    2.8.3 Precision-Recall Curve
    2.8.4 Area Under the Curve

3 Methodology
  3.1 Data
    3.1.1 Target Variable
    3.1.2 Preprocessing
    3.1.3 Balancing Data
  3.2 Model Selection
    3.2.1 Recursive Feature Elimination: RFE
    3.2.2 Logistic Regression
    3.2.3 CatBoost
    3.2.4 Random Forest

4 Results

5 Discussion
  5.1 Conclusion
    5.1.1 Future Studies

6 References

1 Introduction

1.1 Background

Shopping habits have changed a lot during the last five years and online shopping is, for the first time ever, overtaking retail [1]. The rapid progress of smartphones and the increased availability of high-speed internet have surely played a major part in this, but fintech companies have also made the online shopping experience much easier and more accessible. Buy Now, Pay Later (BNPL) is a service which allows customers to make a purchase and delay payment. This is attractive since it allows many customers to make purchases at any time, independent of their current financial situation. The most well-known Pay Later service in Sweden is the 14-day invoice, but other services are gaining popularity. In the US, more and more customers are choosing to split the purchase over four equal instalments paid every 14 days from the date of order. This gives the customer the opportunity to buy a product or a service which the customer does not currently have sufficient funds for, but will have in the foreseeable future.

To use this service, customers select Klarna during the checkout process. Klarna then makes an instant decision on whether the customer is approved or denied for the chosen payment method. This decision is made using a Probability of Default model (PD model) in which external bureau data and internal data are used. Since Klarna bears the risk of customers not fulfilling their obligations, a better model is of high interest, since it will reduce losses incurred due to customers defaulting on their payments as well as other losses such as fraud.

Financial institutions that offer BNPL without charging interest are not subject to the same regulations as traditional banks. In order to mitigate the losses that can be incurred due to credit risk, financial institutions have developed internal models that assess the default probability of each customer. The internal models can be developed using a vast selection of algorithms, and it is of interest for financial institutions to find the "best" algorithm.

1.2 Purpose and Research Question

The purpose of this thesis project is to compare different algorithms that predict the default probability of a customer. The methods that will be compared in this thesis are Logistic Regression, Random Forest and CatBoost. In order to assess the performance of these methods, they will be benchmarked against Klarna's current model, which uses XGBoost. Metrics such as the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve, together with the precision-recall curve, will be considered when the models are evaluated.

The reasoning behind the selected algorithms is that we want to compare models that implement different techniques. For example, Random Forest is a bagging algorithm and CatBoost is a boosting algorithm. These algorithms are relatively advanced machine learning algorithms which are more difficult to interpret than the more common Logistic Regression model. From an academic perspective it is therefore interesting to compare a somewhat "classical" approach using Logistic Regression to somewhat more novel machine learning algorithms. We have chosen CatBoost among other boosting algorithms because it has been reported to outperform XGBoost [2]. This leads to the following research questions to be studied:

• What are the benefits and drawbacks of using Logistic Regression, Random Forest and CatBoost in order to predict defaults?
• How does the optimal algorithm compare to the current XGBoost model?

2 Theory

In this section we explain the mathematical theory of the algorithms and concepts that will be applied in the thesis.

2.1 Binary Classification

To understand the concept of binary classification we introduce the following equation, which summarizes the general idea of a classification problem in the context of supervised learning:

$$Y = f(X) + \epsilon \qquad (1)$$

where Y ∈ {−1, 1}, X = (X_1, X_2, ..., X_m), E[ε] = 0 and Var(ε) = σ².

The main goal of binary classification is to find a function f̂ that approximates the unknown function f as well as possible. In the context of this thesis the target variable Y_i will be modelled as a binary random variable. The observed values of this variable will be denoted y_i and indicate whether customer i has defaulted on a payment; thus i represents an order placed by a specific customer. If customer i defaults, y_i = 1, otherwise y_i = −1. The random predictor variable X_i ∈ R^m contains information regarding customer i and consists of both qualitative and quantitative variables, e.g. a customer's geographical location, salary, payment history etc. The observed equivalent will be denoted x_i = [x_{i,1}, x_{i,2}, ..., x_{i,m}]. When we have found a function f̂ we call it a classifier, and given an input x_i the classifier gives a prediction ŷ_i. To assess the performance of a classifier we introduce the concept of a loss function, L(y_i, ŷ_i), which compares the prediction ŷ_i to the actual value y_i. When the classifier is trained the goal is to minimize the average loss over all observations in the training set, which thus includes minimizing the errors from the loss function. The training set is an observed set of target variables and their corresponding predictors, which we denote by D = {(x_i, y_i)}_{i=1}^N. The irreducible error, ε, is independent of the fitted function f̂. This means that even if one finds a perfect estimate where f = f̂, there will still be an error in the predictions [3]. An example of what the quantity ε may contain is variation that is not possible to measure, e.g. the risk of a default may vary for a given customer on a given day due to the customer's mood.

2.2 Logistic Regression

Logistic regression is a binary classification method which gained popularity in the 1970s as an alternative to Fisher's Linear Discriminant Analysis (LDA), roughly 50 years after being rediscovered in the 1920s [4]. The first signs of Logistic Regression trace all the way back to the early 19th century, when Pierre-François Verhulst published a paper in which he fitted curves to model a country's population growth but did not explain how they were fitted. A couple of years later, Verhulst published a more detailed paper on the subject and introduced the term "logistic" for the first time [4].

Logistic Regression has been the most popular method for modelling probability of default until recently, when more advanced machine learning algorithms surpassed the speed and accuracy of Logistic Regression in many applications [5]. But since the financial market is highly regulated, non-linear modelling for traditional credit issuing is often not allowed. Therefore, financial institutions have to rely on linear models, like Logistic Regression, which is why the method won't cease to exist in the near future.

In general, Logistic Regression tends to perform well on smaller datasets and when the number of explanatory variables is larger than or equal to the number of noise variables [6]. Also, Logistic Regression captures linear relationships in the data well. For larger and more complex datasets where nonlinear relationships exist, other machine learning methods might be preferable.

The general linear relationship between the log-odds and the predictor variables can be written in the following form [7]:

$$\begin{aligned}
\log \frac{P(G = 1 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{1,0} + \beta_1^T x \\
\log \frac{P(G = 2 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{2,0} + \beta_2^T x \\
&\;\;\vdots \\
\log \frac{P(G = K-1 \mid X = x)}{P(G = K \mid X = x)} &= \beta_{(K-1),0} + \beta_{K-1}^T x
\end{aligned} \qquad (2)$$

where G is the response variable, K is the number of classes, β_{i,0} is the intercept, β_i is the coefficient vector, X ∈ R^m denotes a real-valued random input vector with m variables and x is the observed vector of m predictor variables. In our case we have binary responses with outcomes:

• 1: default
• 0: not default

instead of 1 and -1 as in the general case presented in the previous section.

The number of classes K is therefore two. With m explanatory variables x_i, the linear relationship (also known as the log-odds, or the logarithm of the odds) can be expressed as the linear function l:

$$l(\beta) = \mathrm{logit}(p_i) = \log\frac{p_i}{1-p_i} = \beta_0 + \beta_1^T x_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \dots + \beta_m x_{i,m} \qquad (3)$$

where p_i is the probability of customer i defaulting and x_i is the input vector for customer i. With simple algebraic manipulation, we get that the probability is given by:

$$P(Y_i = 1 \mid X_i = x_i) = \frac{1}{1 + \exp\!\big(-(\beta_0 + \beta_1^T x_i)\big)} \qquad (4)$$

where Y_i = 1 corresponds to the outcome for customer i being default. Thus, we can use the probability to classify our prediction. The class prediction is defined as:

$$\hat{Y}_i = \begin{cases} 1, & \text{if } P(Y_i = 1 \mid X_i = x_i) \geq c \\ 0, & \text{if } P(Y_i = 1 \mid X_i = x_i) < c \end{cases} \qquad (5)$$

where c ∈ (0, 1) is the threshold value of the decision boundary, meaning that if the predicted probability is above or equal to the predefined threshold c, the observation is classified as a default.
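To make Eqs. (4)-(5) concrete, the following minimal Python sketch computes the default probability from a fitted coefficient vector and applies the threshold c. The coefficient and feature values are made-up placeholders, not numbers from the thesis.

```python
import numpy as np

def predict_default(beta0, beta, x, c=0.5):
    """Return (probability, class) for one customer, following Eqs. (4)-(5).

    beta0 : fitted intercept, beta : fitted coefficient vector,
    x : observed feature vector x_i, c : decision threshold.
    """
    log_odds = beta0 + beta @ x                  # linear predictor, Eq. (3)
    p_default = 1.0 / (1.0 + np.exp(-log_odds))  # Eq. (4)
    y_hat = 1 if p_default >= c else 0           # Eq. (5)
    return p_default, y_hat

# Illustrative call with made-up values
p, y = predict_default(beta0=-2.0, beta=np.array([0.8, -0.3]), x=np.array([1.5, 2.0]))
```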

In order to fit the model, maximum likelihood is often applied by using the conditional likelihood of Y given X [7]. The log-likelihood for N observations in the two-class case is defined as the function, L:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i \log p(x_i;\beta) + (1-y_i)\log\big(1 - p(x_i;\beta)\big) \Big\} = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} \qquad (6)$$

where β = {β_0, β_1} and the conditional probability p(x_i; β) = P(Y_i = 1 | X_i = x_i; β). The function L represents the negative log-loss function (i.e. the log-likelihood) and is optimized w.r.t. β by maximization.

In order to maximize the negative log-loss function, we take the derivatives and set them to zero:

$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i\big(y_i - p(x_i;\beta)\big) = 0 \qquad (7)$$

This generates m + 1 equations that are nonlinear in β, which can be solved using the Newton-Raphson method. For this, the second-order derivatives are calculated:

$$\frac{\partial^2 L(\beta)}{\partial \beta\, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T\, p(x_i;\beta)\big(1 - p(x_i;\beta)\big) \qquad (8)$$

Starting at the current β, a single iteration is given by:

$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left( \frac{\partial^2 L(\beta^{\mathrm{old}})}{\partial \beta^{\mathrm{old}}\, \partial (\beta^{\mathrm{old}})^T} \right)^{-1} \frac{\partial L(\beta^{\mathrm{old}})}{\partial \beta^{\mathrm{old}}} \qquad (9)$$

The iterations stop when the stopping criterion is reached, i.e. when the duality gap is less than the predefined tolerance. If the algorithm does not converge, the iterations stop when the predefined maximum number of iterations is reached.
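The Newton-Raphson updates (7)-(9) can be written compactly in a few lines of Python. The sketch below is illustrative only; library solvers add regularization, step-size control and more careful stopping rules.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Fit an unregularized logistic regression with Newton-Raphson (Eqs. (7)-(9)).

    X : (N, m) matrix of predictors, y : (N,) array of 0/1 targets.
    An intercept column is prepended so beta has m + 1 entries.
    """
    Xb = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))    # p(x_i; beta)
        grad = Xb.T @ (y - p)                   # Eq. (7)
        W = p * (1.0 - p)
        hess = -(Xb * W[:, None]).T @ Xb        # Eq. (8)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                      # Eq. (9)
        if np.linalg.norm(step) < tol:          # simple convergence check
            break
    return beta
```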

2.2.1 Regularization

A potential problem with Logistic Regression is overfitting. An overfitted model fails to predict the outcome on data other than the data the model was trained on, due to a lack of generalization, and this leads to a decrease in model performance [8]. The general idea behind regularization is to give noise variables little or no weight in the model. There are several methods that deal with this problem; two of them are L1-regularization (also known as LASSO) and L2-regularization (also known as Ridge regression). Also, a linear combination of the two, the elastic net, can be used to draw benefits from both methods. Regularization reduces the variance of the model, and by reducing the variance, the degree of overfitting decreases as well.

L1-Regularization

An important feature of LASSO is the ability to both shrink and remove variables implying that it can also be used for variable selection. The regularization penalizes the model by decreasing the size of the coefficients, and this is done by adding the following term to the negative log-loss function (6):

$$-\lambda_{L_1} \sum_{i=1}^{m} |\beta_i| \qquad (10)$$

where λ_{L1} ≥ 0 is a shrinkage parameter and m is the number of coefficients. A larger value of λ_{L1} corresponds to a greater amount of shrinkage [7]. The procedure for finding the optimal shrinkage parameter is presented in section 2.7 and is further analysed in section 3.2.2. The negative log-loss function is now defined as follows:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda_{L_1}\sum_{i=1}^{m} |\beta_i| \qquad (11)$$

Let β̃ denote the current estimates of the coefficients, and let p̃(x_i) = p(x_i; β̃) and w_i = p̃(x_i)(1 − p̃(x_i)). Now we form the second-order Taylor expansion about the current estimates and obtain the following quadratic objective function [9]:

$$L_Q(\beta) = -\frac{1}{2}\sum_{i=1}^{N} w_i\big(z_i - \beta_0 - \beta_1^T x_i\big)^2 + C(\tilde{\beta})^2 - \lambda_{L_1}\sum_{i=1}^{m}|\beta_i| \qquad (12)$$

where z_i = β_0 + β_1^T x_i + (y_i − p̃(x_i)) / (p̃(x_i)(1 − p̃(x_i))) is the current working response [9]. The next update of the coefficients using Newton's method is obtained by maximizing L_Q(β):

$$\max_{\beta \in \mathbb{R}^{m+1}} \left\{ -\frac{1}{2}\sum_{i=1}^{N} w_i\big(z_i - \beta_0 - \beta_1^T x_i\big)^2 + C(\tilde{\beta})^2 - \lambda_{L_1}\sum_{i=1}^{m}|\beta_i| \right\} \qquad (13)$$

This optimization problem is a penalized weighted-least-squares problem and can be solved using the generalized Newton algorithm. Overall, the procedure consists of a sequence of nested loops, as follows [9]:

outer loop: decrement λ_{L1}.
middle loop: update L_Q(β) using the current coefficients β̃.
inner loop: run the generalized Newton algorithm on the optimization problem (13).

The iterations stop when it reaches the stopping criterion, i.e. when the duality gap is less than the predefined tolerance. If the algorithm does not converge, the iterations stop when the predefined maximum number of iterations is reached.
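In practice the L1-penalized model does not have to be implemented from scratch. A hedged scikit-learn sketch is shown below; the variables X_train and y_train are placeholders for a preprocessed training set, and C is the inverse of the shrinkage strength (C = 1/λ), so a smaller C means more shrinkage.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1-penalized (LASSO) logistic regression; the "saga" solver supports the L1 penalty.
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000),
)
l1_model.fit(X_train, y_train)   # X_train, y_train: placeholder training data

# Coefficients shrunk exactly to zero can be dropped, giving a variable selection
coefs = l1_model.named_steps["logisticregression"].coef_.ravel()
selected = [i for i, b in enumerate(coefs) if b != 0.0]
```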

Since the dataset that will be used in this thesis is relatively large we are interested in performing a dimensionality reduction. Therefore the L1-regularization will be used in this thesis. However, we

present an overview of the other regularization techniques below.

L2-Regularization

Ridge regression reduces the variance by penalizing the sum of the squared coefficients instead of the sum of their absolute values, as in LASSO. This comes at the price of bias, which has to be considered during implementation. The following term is added to the negative log-loss function (6):

$$-\lambda_{L_2} \sum_{i=1}^{m} \beta_i^2 \qquad (14)$$

where λ_{L2} ≥ 0 is, as before, the shrinkage parameter but for the L2-regularization [7]. The coefficients are shrunk toward, but never reach, zero. Thus, we get the following negative log-loss function:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda_{L_2}\sum_{i=1}^{m} \beta_i^2 \qquad (15)$$

The procedure of optimizing the negative log-loss function with the added L2-regularization term is the same as presented in the case of L1-regularization above.

Elastic Net

The elastic net is, as mentioned before, a linear combination of the two regularizations, giving the model the benefit of both decreased variance and variable selection. The following term is added to the negative log-loss function (6):

$$-\lambda \sum_{i=1}^{m} \big( \alpha\beta_i^2 + (1-\alpha)|\beta_i| \big) \qquad (16)$$

where α ∈ [0, 1] is the weight. A larger α corresponds to more weight being given to the L2-regularization [7]. Thus, the new negative log-loss function is as follows:

$$L(\beta) = \sum_{i=1}^{N} \Big\{ y_i(\beta_0 + \beta_1^T x_i) - \log\big(1 + e^{\beta_0 + \beta_1^T x_i}\big) \Big\} - \lambda\sum_{i=1}^{m} \big( \alpha\beta_i^2 + (1-\alpha)|\beta_i| \big) \qquad (17)$$

The procedure for optimizing the negative log-loss function with the added elastic net term is the same as presented in the case of L1-regularization above.

2.3 Tree-Based Concepts

In order to explain the Random Forest and CatBoost algorithms we will start by explaining essential concepts such as classification trees, bagging and boosting.

2.3.1 Classification Trees

The basic idea behind a classification tree is to stratify the feature space into J non-overlapping regions, R_1, R_2, ..., R_J [7]. If we only have two features then the feature space is easy to visualize. In figure (1) below, the feature space spanned by the features X_1 and X_2 has been stratified into five non-overlapping regions. The stratified region can also be expressed as a decision tree, as in figure (2), which is very useful for interpreting a tree-based model even when the feature space is too large to visualize [7]. Whenever a new observation (x_1, x_2) is to be classified, we check which region this observation belongs to. Then we assign this observation the class that is the most commonly occurring in that particular region [7]. Using mathematical rigour we introduce the following expression:

$$\hat{p}_{j,m} = \frac{1}{N_j} \sum_{x_i \in R_j} I\{y_i = m\} \qquad (18)$$

Figure 1: Non-overlapping regions [7]
Figure 2: Classification tree linked to the regions [7]

Here N_j denotes the number of observations in region R_j. Expression (18) gives us the proportion of the observations in region R_j that belong to class m. So for each region R_j we maximize p̂_{j,m} w.r.t. m. Thus the class in region R_j will be assigned according to the following equation:

$$\mathrm{class}(R_j) = \arg\max_{m} \hat{p}_{j,m} \qquad (19)$$

In order to evaluate the nodes for splitting, a criterion is needed. This criterion measures the node impurity, and the most commonly used node impurity measures are misclassification error,

8 Gini-index and deviance. In this thesis we will use the Gini-index since it is better suited for numerical optimization due to being differentiable [7]. We define the Gini-index, G, in the following equation:

$$G = \sum_{m=1}^{K} \hat{p}_{j,m}(1 - \hat{p}_{j,m}) = \{K = 2,\ \text{i.e. only two classes}\} = 2\hat{p}_{j,2}(1 - \hat{p}_{j,2}) \qquad (20)$$

A classification tree is easily interpretable, as can be seen from figure (2), but unfortunately classification trees run the risk of being overfitted, which leads to high variance. One remedy for this is to prune the trees, but we will focus on bagging and boosting strategies in order to decrease the variance of the classifiers.
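The two-class Gini-index of Eq. (20) is straightforward to compute for a candidate region. A minimal sketch, assuming the class labels in the region are coded 0/1:

```python
import numpy as np

def gini_index(y_region):
    """Gini impurity of Eq. (20) for a binary region with labels coded 0/1."""
    if len(y_region) == 0:
        return 0.0
    p_hat = np.mean(y_region)            # proportion of class 1 in the region
    return 2.0 * p_hat * (1.0 - p_hat)   # the K = 2 special case

print(gini_index(np.array([1, 1, 1, 1])))  # pure region -> 0.0
print(gini_index(np.array([0, 1, 0, 1])))  # 50/50 region -> 0.5 (maximum)
```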

2.3.2 Bagging

Bootstrap aggregating (bagging) is a method which can be used to decrease the variance of a classifier. Bagging accomplishes a variance reduction by using multiple trees built on different training sets and using the majority vote for the final classification. More formally, we use B bootstrapped sets from the training data. On each of these bootstrapped sets we build a classifier, and thus get B classifiers in total, f̂_1(x), f̂_2(x), ..., f̂_B(x). To make the final prediction we simply use a majority vote of all the classifiers, which gives the most commonly occurring class amongst all B predictions [3]. Note that the results of implementing bagging can vary a lot depending on feature characteristics. If some features are more prominent than others it can lead to classification trees that are correlated, and thus the concept of Wisdom of Crowds fails to apply, since it relies on the "crowd" being uncorrelated. This problem is solved by the Random Forest algorithm, which we present in section 2.4. When bagging is applied each classifier is trained using only the corresponding bootstrap sample; thus we can use the remaining data, the Out-Of-Bag (OOB) observations, to estimate the OOB error, which is a valid estimator of the test error [3].
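As an illustration of bagging with an OOB error estimate, the sketch below bags B = 100 classification trees with scikit-learn; X_train and y_train are placeholders for a training set.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is fitted on its own bootstrap sample; the observations left out of a
# sample (the OOB observations) are used to estimate the test error.
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                           bootstrap=True, oob_score=True, random_state=0)
bagger.fit(X_train, y_train)
print("OOB accuracy estimate:", bagger.oob_score_)
```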

2.3.3 Boosting

Boosting methods are similar to bagging in the sense that multiple learners/trees are used to make a final prediction. But instead of growing the trees independently of each other, as in bagging, the boosting method grows the trees sequentially in order to learn from the previously built trees. For this the method uses M weak learners h_m(x; a_m), i.e. learners that are only slightly better than flipping a fair coin when making a classification. Note that learner m is characterized by the parameters a_m. In the case where the base learner is a classification tree we can formally write h(x) = Σ_{j=1}^{J} b_j · I{x ∈ R_j}, where b_j represents the majority class in region R_j [7]. These weak learners are then combined into a "committee" which has low bias due to the weak learners and low variance due to including many weak learners. More formally, we are trying to minimize the following expression when we use boosting:

$$\min_{F} \sum_{i=1}^{N} L\big(y_i, F(x_i)\big) \qquad (21)$$

Here F(x) follows the form of a basis expansion, i.e. F(x) = Σ_{m=1}^{M} β_m h(x; a_m), where β_m is the expansion coefficient for the basis function h(x; a_m). Using this we can rewrite equation (21) as follows:

$$\min_{\{\beta_m, a_m\}_{m=1}^{M}} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M}\beta_m h(x_i; a_m)\Big) \qquad (22)$$

The above optimization problem requires intensive numerical optimization techniques; therefore we instead approximate the solution to equation (22) by forward stagewise additive modelling. This essentially means that, in each iteration m, we find the optimal basis function h(x; a_m) and the corresponding β_m, and then set F_m(x) = F_{m-1}(x) + β_m h_m(x; a_m). We do not go back and modify previously added terms [7]. This means that we use the two-step process described below, which in machine learning is formally referred to as boosting [10].

1. For m = 1, ..., M:
$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \beta\, h(x_i; a)\big)$$

2. Set: F_m(x) = F_{m-1}(x) + β_m h_m(x; a_m)

Here the function β_m h_m(x; a_m) can be seen as the best step towards the data-based estimate of the true function F(x). The only constraint we put on this step is that the base learner h_m(x; a_m) is a classification tree.

Steepest Descent

In this section we discuss how to practically perform the two steps presented in the previous section, which we formally referred to as boosting. To solve the minimization problem in step one we can use steepest descent. To do this we find the data-based negative gradient −g_m = {−g_m(x_i)}_{i=1}^N, where

$$g_m(x_i) = \left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}, \qquad (23)$$

which gives the best steepest descent direction from the current point F_{m-1}(x_i). But since this gradient is data-based we only have values for it at certain points {x_i}_{i=1}^N. Also note that this gradient is unconstrained and might therefore be difficult to replicate using a classification tree, which is our predetermined base learner. Thus we choose the best step, β_m h_m(x_i; a_m), such that h_m = {h(x_i; a_m)}_{i=1}^N is as parallel as possible to −g_m. Another way to see this is that we are trying to find the h(x; a), in our case a classification tree, that has the strongest correlation to the negative steepest descent direction −g_m(x) over the data distribution [10]. Formally this means that we solve for a_m such that:

$$a_m = \arg\min_{a,\beta} \sum_{i=1}^{N} \big[-g_m(x_i) - \beta\, h(x_i; a)\big]^2 \qquad (24)$$

This will give us the direction of our descent step. In order to find the size of our step ρm, we perform a ”line search” which is given by the expression below:

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \rho\, h(x_i; a_m)\big) \qquad (25)$$

Mimicking step two in the previous section, we update our approximation, which has now been influenced by the unconstrained data-based gradient: F_m(x) = F_{m-1}(x) + ρ_m h(x; a_m).

Summing up all these steps, we arrive at the following algorithm, which describes gradient boosting [10]:

Algorithm 1: Gradient Boosting

Result: F_M(x)
F_0(x) ← arg min_ρ Σ_{i=1}^N L(y_i, ρ)
for m ← 1 : M do
    ỹ_i ← −[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F(x)=F_{m−1}(x)},  i = 1 : N
    a_m ← arg min_{a,β} Σ_{i=1}^N [ỹ_i − β h(x_i; a)]²
    ρ_m ← arg min_ρ Σ_{i=1}^N L(y_i, F_{m−1}(x_i) + ρ h(x_i; a_m))
    F_m(x) ← F_{m−1}(x) + ρ_m h(x; a_m)
end
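A stripped-down version of Algorithm 1 for the binary log-loss is sketched below. It replaces the line search for ρ_m with a fixed shrinkage factor, which is how most library implementations behave, and assumes the targets are coded 0/1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Simplified gradient boosting for binary log-loss (targets coded 0/1)."""
    # F_0: constant log-odds of the base rate
    F = np.full(len(y), np.log(y.mean() / (1.0 - y.mean())))
    trees = []
    for _ in range(M):
        p = 1.0 / (1.0 + np.exp(-F))
        residuals = y - p                        # negative gradient of the log-loss, Eq. (23)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # weak learner h_m fitted to -g_m, Eq. (24)
        F += learning_rate * tree.predict(X)     # F_m = F_{m-1} + shrunken step
        trees.append(tree)
    return F, trees
```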

2.4 Random Forest

As mentioned earlier in the report, a Random Forest uses bagging but also addresses the issues that come with applying bagging when the trees are correlated. The Random Forest algorithm de-correlates the trees by actively restricting the number of variables that are considered in each split when a tree is built. Thus, each time a split is made, a randomly generated subset of the full set of predictors is considered. The reasoning behind this is that if we have a strong predictor, then this predictor will most likely be the first split in all of the trees, thus leading to correlated trees [3]. Formally, a Random Forest classifier with B classification trees can be defined as follows:

$$\hat{f}_{RF}^{B}(x) = \mathrm{sign}\left[ \sum_{m=1}^{B} T(x; \Theta_m) \right] \qquad (26)$$

Here T(x; Θ_m) denotes a classification tree, where Θ_m denotes the parameters of the m:th tree in the forest. These parameters have to be chosen by the analyst before the model is trained, and we present the parameters that we will tune below:

• n_estimators: the number of trees in the forest.
• min_samples_split: the minimum number of observations required in a node in order to split it.
• max_depth: the maximum depth of a tree. If none is specified, the tree is grown until min_samples_split is reached for every leaf.
• min_samples_leaf: the minimum number of observations required in a node for it to be a leaf. If a split does not lead to at least min_samples_leaf observations in each daughter node, the split is not made.
• max_features: the number of features considered when finding the "optimal" split.

Just like in bagging, the final prediction is made using a majority vote. Below we describe how a Random Forest is built once the above parameters have been chosen; choosing them can typically be done by implementing a grid search.

Creating a Random Forest

1. Create a bootstrap sample B* of size N from the training set.
   • Create a classification tree using the bootstrap sample. We repeat the steps below until the minimum node size n_min is reached; then we stop the splitting process.
     (a) Randomly select p features from all the m features.
     (b) Use the Gini-index to find the best splitting feature among the p selected in the previous step.
     (c) Split the data into two daughter nodes using the selected feature.
     (d) Restart from step (a) in each daughter node from the previous step.
2. Repeat the above process B times to get B classification trees, {T_i(x; Θ_i)}_{i=1}^B, which constitute the Random Forest.

The final classifier is then given by equation (26).

To create a Balanced Random Forest, we draw a random sample from the minority class, then draw an equally sized random sample from the majority class, and let these constitute our bootstrap sample B* in step one [11]. The rest of the steps are the same. This ensures that each tree is built using a balanced dataset.
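For reference, a Random Forest with the hyper-parameters listed above can be instantiated in scikit-learn as in the sketch below. The parameter values are placeholders rather than the tuned values used in the thesis, and X_train_bal, y_train_bal stand for a balanced (e.g. undersampled) training set.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,        # number of trees B in the forest
    max_depth=None,          # grow until min_samples_split is reached
    min_samples_split=10,
    min_samples_leaf=5,
    max_features="sqrt",     # number of features p considered at each split
    random_state=0,
)
rf.fit(X_train_bal, y_train_bal)
```

The imbalanced-learn package also provides a BalancedRandomForestClassifier with a similar interface, which draws the per-tree balanced bootstrap samples described above.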

2.5 Categorical Boosting: CatBoost

The CatBoost algorithm aims to address the target leakage that arises from encoding categorical variables using target statistics, but also from the boosting steps in the gradient boosting algorithm. Essentially, target leakage arises when we encode categorical variables using their respective targets to calculate the mean, median or some other numerical quantity involving the target. CatBoost implements ordered target encoding in order to prevent the target leakage that arises from general target encoding. To prevent target leakage from the boosting step, the algorithm implements a modification of general gradient boosting referred to as ordered boosting, which we explain in the next section. This is important because target leakage leads to biased gradient estimates, which affect the learning task negatively [12]. This stems from the fact that when the gradient is calculated for a point x_i in equation (23), it is based on a model F which has been built using that particular point and its corresponding target. This leads to residuals having smaller absolute values on the training data compared to the test data [12].

2.5.1 Ordered Boosting

We let D = {(x_i, y_i)}_{i=1}^N denote our training set. The idea is to use a random permutation of this dataset, i.e. we shuffle the rows randomly, and build N − 1 different models F. More specifically, model F_i denotes a model built on the first i observations of the shuffled dataset. Thus, when the gradient g(x_j) is calculated, the model F_{j−1} is used. Essentially this means that

12 for each boosting step I we need to build N −1 different models. The algorithm for this is given below:

Algorithm 2: Ordered Boosting

Result: F_1, F_2, ..., F_N
F_i ← 0 for i = 1 : N
for iter ← 1 : I do
    for i ← 1 : N do
        for j ← 1 : i − 1 do
            g_j ← (d/da) Loss(y_j, a) |_{a = F_{j−1}(x_j)}
        end
        F ← LearnOneTree((x_j, g_j) for j = 1 : i − 1)
        F_i ← F_i + F
    end
end

This is computationally inefficient and therefore CatBoost implements a modification of the described algorithm in order to speed up computations but the underlying concept is the same.
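In practice the CatBoost library handles both ordered boosting and the categorical encoding internally. A hedged usage sketch is given below; the parameter values are placeholders, and categorical_indices is a placeholder list of the categorical column indices.

```python
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    iterations=1000,           # number of boosting steps
    depth=6,
    learning_rate=0.1,
    loss_function="Logloss",
    eval_metric="AUC",
    verbose=False,
)
# cat_features tells CatBoost which columns to encode with ordered target statistics
cat_model.fit(X_train, y_train, cat_features=categorical_indices)
```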

2.5.2 Ordered Target Statistics

The CatBoost algorithm converts categorical features to numerical ones using a technique referred to as ordered target statistics. We start by explaining general target statistics to give the reader an intuition of the concept. Assume that we have a categorical feature k and that the values of this feature are given by x_{i,k}, where i denotes the observation. Using target statistics, our goal is to replace x_{i,k} for every i, i.e. for all rows, with a numerical value, x̂_{i,k}, which is based on the target for that observation. A greedy approach can be applied in which x̂_{i,k} is set to the average value of the targets y, taken over all training examples which have the same value as x_{i,k}. Introducing a prior p, a parameter a and some mathematical notation, we get:

$$\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} I\{x_{j,k} = x_{i,k}\} \cdot y_j + ap}{\sum_{j=1}^{n} I\{x_{j,k} = x_{i,k}\} + a} \qquad (27)$$

But since this is a greedy approach it leads to target leakage, since x̂_{i,k} is computed using y_i [13]. To prevent this, ordered target statistics aims NOT to use y_i when x̂_{i,k} is calculated. This is done by introducing an "artificial" time to the dataset in the form of a random permutation, which is practically achieved by randomly shuffling the rows of the dataset. Then, for each example, the target statistics rely only on the preceding observations. In the figure below we can see an example of this, where an arbitrary categorical feature, "Occupation", and its values are depicted. To get the target statistics for the fifth row, which has the value "Manager", we only use the historical data illustrated by the pink rows. This is then repeated for every observation, and it can be noted that observations located at the top after the permutation will have a higher variance since their available history is small. To prevent this, multiple random permutations are used [13]. Thus, after the ordered target statistics have been calculated, the depicted values that the feature "Occupation" takes are replaced by numerical values.

Figure 3: Ordered Target Statistics [14]
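A toy implementation of the idea (one permutation only, whereas CatBoost averages over several) is sketched below; the column names are placeholders.

```python
import numpy as np
import pandas as pd

def ordered_target_statistics(df, cat_col, target_col, prior=0.5, a=1.0, seed=0):
    """Encode one categorical column with Eq. (27) restricted to preceding rows."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    encoded = np.empty(len(shuffled))
    for i in range(len(shuffled)):
        history = shuffled.iloc[:i]                                   # "artificial time": rows before i
        same = history[history[cat_col] == shuffled.loc[i, cat_col]]  # same category, earlier rows only
        encoded[i] = (same[target_col].sum() + a * prior) / (len(same) + a)
    shuffled[cat_col + "_ots"] = encoded
    return shuffled
```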

2.6 Feature Selection

2.6.1 Recursive Feature Elimination: RFE

To decide how many variables to use in each model, Recursive Feature Elimination (RFE) with built-in cross-validation can be applied. The RFE algorithm starts by creating five folds which will be used to cross-validate the score. The next step is to fit a user-specified classifier, e.g. a Random Forest, to all available features. Then the user-defined score is calculated by applying cross-validation. After this, the feature with the lowest feature importance is removed and a classifier is fitted using the remaining variables. This process is repeated until the number of features left reaches the lower limit defined by the user. The number of variables that yielded the highest validation score is then chosen as the optimal one [15]. Thus the concept behind RFE selection is relatively simple but computationally intensive. More details regarding the user-defined parameters are given in the methodology section.
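With scikit-learn this corresponds to the RFECV estimator; the sketch below shows the idea with a Random Forest as the underlying classifier, ROC AUC as the score and 40 as the lower limit (the settings described in the methodology section). X_train and y_train are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=1,                       # remove one feature per iteration
    cv=5,                         # five-fold cross-validation of the score
    scoring="roc_auc",
    min_features_to_select=40,    # user-defined lower limit
)
selector.fit(X_train, y_train)
print("Optimal number of features:", selector.n_features_)
```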

2.6.2 Occam's Razor

Another concept that will be applied when selecting the features is Occam's Razor, also known as the Law of Parsimony. This concept simply states that if all else is held equal, the simplest solution should be chosen. In our setting this implies that if the number of features can be reduced without sacrificing performance, one should always opt for this. So even if the RFE implies that the best model uses p features, it may still be possible to achieve the same performance with less complexity.

2.6.3 LASSO

For the Logistic Regression model there exists another feature selection method, which utilizes the L1-penalty described previously in the theory section. Since this penalty has the ability to shrink the coefficients to zero, one can utilize this and keep only the variables that have non-zero

coefficients. But to apply this one needs to pick the λL1 that yields the optimal model in terms

of ROCAUC-score. In the plot below, the profiles of the LASSO coefficients are presented for an arbitrary example to illustrate the idea.

Figure 4: LASSO Path [16]

In the figure above each colored line represents a feature and the y-axis represents its coefficient. We can see how the coefficients of the features are affected by the shrinkage parameter C, which is given on the x-axis. The parameter C has an inverse relation to λ_{L1}, which means that when C increases, the penalty λ_{L1} exerts less effect on the coefficients. This is depicted in the figure by more variables being added into the model, illustrated by the vertical dotted lines, as C increases.

In order to utilize this feature selection method one needs to optimize C according to some predefined metric. This approach will further be discussed in the methodology section.

2.7 Hyper-Parameter Tuning

All the models have parameters that need to be decided. One can ignore this and simply use the default parameters, but then one runs the risk of inferior model performance. For example, these parameters can be the depth of the trees used in the CatBoost and Random Forest models, or the shrinkage parameter λ in the Logistic Regression. In order to tune these parameters one can implement a grid search or a randomized search, which we present below. Some of the parameters will be discarded from the grid since there are more efficient ways to determine their value. This is further presented in the methodology section.

Grid Search/Randomized Search

A grid search is a relatively simple concept in which a predefined grid of parameters are used to build several different models. For example, if we have three different numerical parameters that we

need to set, A, B and C, we can define a grid as follows:

• A: [1, 2, 3, 4, 5]
• B: [100, 200, 300]
• C: [0, 0.25, 0.5, 0.75, 1]

Then the grid search algorithm will build models using all possible combinations of the above parameters. After this, the performance of each model is assessed using some predefined metric. In order to mitigate any sample bias, the grid search algorithm makes use of five-fold cross-validation [17]. Below we formally summarize the steps of a grid search:

1. Split the training data into five folds.
   • Let D_1, D_2, D_3, D_4, D_5 be the different folds.
2. Pick a possible combination from the grid.
   • For i = 1:5
     (a) Train a model using the chosen parameters on all folds except D_i.
     (b) Use the model built in the previous step and check its performance on D_i.
   • Average the performance metric over all five folds.
3. Repeat step two until all combinations have been tested.
4. The optimal parameters are given by the combination that yielded the best performing model according to the predefined metric.

For the randomized search, the combination in step two is picked randomly from the grid, unlike the grid search in which we sequentially test all combinations. In addition, step three is replaced by a limit that defines how many combinations to test [18]. Thus a randomized search will not exhaust the whole grid, since that would be too computationally expensive.
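Both procedures are available in scikit-learn; the sketch below mirrors the toy grid above and uses ROC AUC with five-fold cross-validation as the scoring rule. The estimator and the parameter values are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {"max_depth": [1, 2, 3, 4, 5], "n_estimators": [100, 200, 300]}

# Exhaustive grid search over all combinations
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)                       # X_train, y_train: placeholder training data
print(grid.best_params_, grid.best_score_)

# Randomized search: only n_iter randomly drawn combinations are evaluated
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=10, cv=5, scoring="roc_auc", random_state=0)
rand.fit(X_train, y_train)
```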

More details regarding the parameters that will be tuned are given throughout the methodology section.

2.8 Model Evaluation

In order to evaluate a model's performance, different metrics are needed. Essentially, these metrics try to quantify a model's ability to generalize and will thus indicate how the model will perform on unseen test data. This makes it possible to compare different models based on these metrics, which is the goal of this thesis.

2.8.1 Confusion Matrix

One way of evaluating the performance is by inspecting the confusion matrix. The confusion matrix depicts the relationship between the predicted outcome and the true outcome and is the foundation of the performance measures used in this paper. The 2x2 matrix consists of the four outcomes True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), as illustrated in the figure below.

Figure 5: Confusion Matrix

In our case, the four different outcomes correspond to the following events:

• TP: The model predicts default, and the true outcome is default.
• FP: The model predicts default, but the true outcome is no-default.
• TN: The model predicts no-default, and the true outcome is no-default.
• FN: The model predicts no-default, but the true outcome is default.

Some basic performance measures that can be calculated from the matrix are the test error rate and the test accuracy (ACC), defined as follows:

$$\text{Error rate} = \frac{FN + FP}{TP + FP + FN + TN} \qquad \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (28)$$
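With scikit-learn the confusion matrix and the measures in Eq. (28) can be obtained as follows (y_test and y_pred are placeholders for the true labels and the class predictions):

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)     # Eq. (28)
error_rate = (fn + fp) / (tp + fp + fn + tn)
```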

2.8.2 Receiver Operating Characteristic

For a better overview of the performance, one can look at the Receiver Operating Characteristic (ROC) curve. The curve illustrates the relationship between the true positive rate (TPR) and the false positive rate (FPR). The closer the curve is to the upper left corner, the better the model is at distinguishing TP from FP. The TPR is the rate of correctly predicted defaults among all true defaults, and the FPR is the rate of incorrectly predicted defaults among all true non-defaults. The rates are defined as follows:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN} \qquad (29)$$
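Given predicted default probabilities (y_score, e.g. the positive-class column of predict_proba), the ROC curve and its AUC can be computed as in the sketch below.

```python
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_score)   # FPR and TPR of Eq. (29) per threshold
roc_auc = roc_auc_score(y_test, y_score)
```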

Figure 6

The dotted diagonal line represents the line of no-discrimination, meaning that a model performing along this line is as good a predictor as a random guess. Points under the diagonal represent negative predictive power, meaning that the model reverses the predictions and tends to predict default when the true outcome is non-default, and non-default when the true outcome is default.

2.8.3 Precision-Recall Curve

Another type of performance measure that will be considered in this project is the relationship between precision and recall. Precision is a measure of the positive predictive value, in our case the proportion of correctly classified defaults among all observations classified as default. Recall, on the other hand, is a measure of the true positive rate, i.e. the rate of correctly predicted defaults among all true defaults, as mentioned in the previous section. The interpretation of the two measures in this project is that low precision indicates a high opportunity cost, since many good customers are being declined. Low recall, on the other hand, indicates that our model is accepting customers that will most likely default. The optimal case would be to maximize both precision and recall, but in this case recall is slightly more important. The precision and recall are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \qquad (30)$$

Figure 7

Unlike the case of a ROC curve where the no-skill line is fixed on the diagonal, the no-skill line in this case changes based on the default ratio in the dataset [19]. An imbalanced 90/10 class distribution would give a horizontal no-skill line at y = 0.1.

F-score

Since the precision and recall scores are both important, one can make use of the F-score, which is a weighted average of precision and recall [20]. Below we present the formula for the F-score when precision and recall are weighted equally. This is also the scenario for this thesis, since finding appropriate weights is out of scope.

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (31)$$
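The corresponding precision-recall quantities and the equally weighted F-score of Eq. (31) can be computed analogously; y_score and y_pred are placeholders as above.

```python
from sklearn.metrics import auc, f1_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
pr_auc = auc(recall, precision)      # area under the precision-recall curve
f1 = f1_score(y_test, y_pred)        # Eq. (31) with equal weights
```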

2.8.4 Area Under the Curve

One way of quantifying the performance using ROC- and PR-curves is to measure the Area Under the Curve (AUC). The AUC is simply a measure of how close the curve is to the upper left corner and is given on the interval [0, 1], where 1 corresponds to a perfect model predicting all outcomes correctly and 0 to a model predicting all outcomes reversely. A downside of the AUC is that it does not provide information about the characteristics of the curve. The ROC-curve could, for example, be skewed towards the left (see figure (8)) and have the same AUC as a curve skewed to the right (see figure (9)), yet the two models would perform very differently. The former corresponds to a higher TPR, meaning that the number of FN decreases; this is desirable from Klarna's perspective, since predicting non-default when the true outcome is default results in a credit loss. The latter corresponds to an opportunity cost in the form of a missed sale. Due to the importance of this measure, it will be the main metric of comparison when the models are analyzed.

Figure 8: Skewed to the left
Figure 9: Skewed to the right

The following figures illustrate the relationship between the ROC-curve and different values of AUC for a constant threshold value of 0.5:

Figures 10-17: ROC-curves corresponding to different values of AUC at a constant threshold of 0.5

3 Methodology

In this section we describe the process of how our results are achieved. The intention is to describe the process to such an extent that the project can be replicated by others.

3.1 Data

The data that will be used in this thesis consists of external credit bureau data and internal data. In total, the dataset that we will work with consists of approximately two million rows and 730 columns. Due to privacy issues and the sensitive nature of the data, a detailed description of the variables cannot be given in this thesis. However, we will give a detailed description of how the dataset is preprocessed in order for the different algorithms to be applied.

3.1.1 Target Variable

The target value indicates whether a customer has defaulted, and the results may vary depending on how a default is defined. For example, even if a customer does not pay by the due date, it might be because the customer has forgotten about the instalment and ends up paying it after the due date. If we define default as "not paid by due date" this customer will be seen as a "bad customer", but this will not be the case if our default definition is "unpaid 90 days after due date". Thus, the default definition will affect the characteristics of the dataset used for modelling, by filtering "false defaults" from "real defaults", and thereby affect the models. However, to be able to compare our models to the current model we will use the same default definition as the previous model. Thus, we consider a customer to have defaulted if the full order value has not been returned by the due date.

3.1.2 Preprocessing

Since the dataset is relatively large, performing an explorative data analysis is not feasible. Therefore we want to subset the dataset in a structured and logical way. The first step in this subsetting is to remove every row for which the boolean target value defaulted is missing. This step reduces the number of rows to approximately 500,000. The next step is to remove every column which contains only one value, since such variables are redundant in any ML-model. This step removes approximately 135 columns. After this, the high-cardinality variables not suitable for prediction are removed. For example, a customer's email is not relevant, and the dataset contains a unique email for each order. Thus this variable is a categorical variable with 500,000 classes, which renders it useless. To remove variables like this each column needs to be inspected, but since we have approximately 570 columns it is not viable to check each column manually. Therefore the following approach was used:

1. Check which columns have more than 1000 different values. (147 columns)
2. Find those which are NOT numeric or integer. (19 columns)
3. Remove the rest from the dataset after inspection. (18 columns)

The logic behind this is that if a non-numeric or non-integer column has more than 1000 different values, performing a one-hot encoding would mean adding 1000 variables for that particular column. This is not feasible and therefore these columns are removed. But in

order not to remove variables that should be included in the model, all variables in step three above are inspected before removal. As can be noted in step three above, one variable with more than 1000 categories is kept since it is believed to be important. However, this variable will only be used in the CatBoost model, since CatBoost does not rely on one-hot encoding.

After this we remove all columns for which more than 80% of the data is missing. This cut-off value provides approximately 100,000 observations, which is sufficient for modelling purposes, and further reduces the dataset by 107 columns. In the dataset a lot of boolean variables have duplicates with a 0/1 representation instead of True and False. Therefore one of the duplicates needs to be removed for each such variable, which results in 31 columns being removed. The next step of the preprocessing is to remove columns that have low variation and do not manage to distinguish between defaults and non-defaults. The cut-off value that we have used for this step is variation below 1%. For example, if a categorical variable contains two classes, A and B, and 99 percent of the observations belong to class A, then this column is redundant IF the target values for the classes don't differ. This means that the classes A and B are not good at distinguishing whether or not a customer has defaulted, and the column can therefore be removed. At this step extreme cases are treated a bit differently, since even if an extremely imbalanced column manages to distinguish the targets very well it might introduce unwanted bias. This can be illustrated using the previous example, but now class B contains fewer than 10 observations and all of these are defaults. Then the ML-algorithm will be biased based on these 10 observations, which is unwanted. This step removes 45 columns, after which we split the dataset into one numerical dataset and one categorical dataset, since these variables have to be analyzed differently.

For the categorical columns, which have now been reduced to a number feasible to analyze manually, we inspect each column and remove those that are not suitable for any ML-algorithm. After this step the categorical columns are preprocessed and ready, so we proceed to describe the remaining steps that were applied to the numerical variables. To further reduce the number of columns we remove numerical columns that have a high correlation with other columns. For this we compute the pairwise Pearson correlation for all numerical columns in the dataset. For example, if six columns correlate strongly, we keep the column that has the fewest missing elements and remove the rest. This step results in approximately 180 columns being removed. The last step of the preprocessing is to remove the observations for which 90% of the principal has been returned by the due date. The logic behind this comes from the structure of Pay Later in 4 Parts: since there are four equal instalments, a customer that did not pay the last instalment will have returned 75% of the principal, but if a customer has paid more than 90% this indicates that the customer might have accidentally filled out the invoice incorrectly.
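A few of the automatic preprocessing steps described above can be expressed as a short pandas sketch. The target column name and the correlation cut-off are illustrative placeholders; the manual inspections described in the text are not captured here, and the thesis keeps the least-missing column of each correlated group rather than simply the first.

```python
import numpy as np
import pandas as pd

def basic_preprocessing(df, target_col="defaulted", missing_cutoff=0.8, corr_cutoff=0.95):
    """Sketch of the row/column filters described above (placeholder thresholds)."""
    df = df.dropna(subset=[target_col])                  # drop rows with missing target
    df = df.loc[:, df.nunique(dropna=True) > 1]          # drop single-valued columns
    df = df.loc[:, df.isna().mean() <= missing_cutoff]   # drop columns with too much missing data

    # Drop one column from each pair of strongly correlated numerical columns
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_cutoff).any()]
    return df.drop(columns=to_drop)
```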

When all of the above steps have been completed the dataset has now been reduced to 218 columns and about 460 thousand rows. The reason for removing rows is simply that the target value is missing.

Since the algorithms work a bit differently the preprocessed dataset will have to be adapted for each model. The CatBoost algorithm can handle missing values and makes use of target statistics for categorical variables and therefore no one-hot encoding is needed. Therefore no further processing of the already preprocessed dataset is needed for the CatBoost model. For the Random Forest and Logistic Regression the situation is a bit different since both algorithms require categorical variables to be one-hot encoded and missing values to be either imputed or removed. In this thesis we will

handle missing values by using mean/median imputation for the numerical variables, whilst creating a new class, missing, for the missing categorical values.

3.1.3 Balancing Data

The dataset that is used in this thesis suffers from class imbalance: the ratio between defaults and non-defaults is approximately 1:10. The problem that arises due to this is that a classifier can achieve a prediction accuracy of 90% by classifying all observations as non-defaults. The effect of this is that the classifier will overtrain on the majority class, making it very accurate at classifying non-defaults, but undertrain on the minority class, which leads to low accuracy for detecting defaults. There are many ways to handle class imbalance, and considering that our dataset is relatively large we decided to apply random undersampling instead of oversampling, since undersampling still provides us with a relatively large dataset for modelling purposes. For the CatBoost algorithm and the Logistic Regression we also compared undersampling to the built-in class-weight parameter, which adjusts for class imbalance by adding weights to the minority class samples, increasing the log-loss incurred if the classification is incorrect. We noted that for the CatBoost algorithm this yielded better results, in terms of precision and recall, on the test set, but for the Logistic Regression the performance with class weights was worse than with undersampling. Therefore undersampling was used for the Logistic Regression and class weights for CatBoost. To solve the imbalance problem for the Random Forest algorithm we implemented a Balanced Random Forest and compared its performance to a regular Random Forest built on undersampled data. The performance of the two approaches was very similar, but due to the longer training time of the Balanced Random Forest we opted for a regular Random Forest with undersampling applied.
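The two balancing strategies that were compared can be sketched as follows; RandomUnderSampler comes from the imbalanced-learn package, and the weight below simply up-weights the minority (default) class in CatBoost's log-loss. X_train and y_train are placeholders.

```python
from catboost import CatBoostClassifier
from imblearn.under_sampling import RandomUnderSampler

# Option 1: random undersampling of the majority class (used for Logistic Regression
# and Random Forest), giving a roughly 1:1 training set
rus = RandomUnderSampler(random_state=0)
X_bal, y_bal = rus.fit_resample(X_train, y_train)

# Option 2: class weights (used for CatBoost) - keep all rows but let misclassified
# minority-class observations incur a larger log-loss
weight = (y_train == 0).sum() / (y_train == 1).sum()
weighted_model = CatBoostClassifier(class_weights=[1.0, weight], verbose=False)
```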

3.2 Model Selection

In this section we explain how we performed model selection for each of the different algorithms. By model selection we refer to the process of deciding how many variables to include and which parameters the model should have. These parameters and variables are then used to build the final models on which we base the analysis. Note that the variables used might differ between the models since there are structural differences between them.

3.2.1 Recursive Feature Elimination: RFE

The lower limit of variables to use is set to 40, since this is the number of variables used in the current XGBoost model. The metric that is calculated for each fitted model in the RFE is the ROCAUC, and a five-fold cross-validation is used when the metric is calculated. The RFE is only applied to the numerical features, since any categorical variables included would be one-hot-encoded; if one level of a categorical variable were removed by the RFE, the interpretation of the model would be skewed. This, in conjunction with the fact that we only have ten categorical variables in total, makes it more reasonable to apply the RFE only to the numerical variables. Since the tree-based models are fundamentally different from the Logistic Regression, we might risk adding bias by using an RFE based on a Random Forest classifier. To avoid this we perform two RFEs, one based on a Random Forest classifier and another based on a Logistic Regression classifier. The RFE results are presented in the figures below.

Figure 18: RFE using Random Forest

Figure 19: RFE using Logistic Regression

From figure (18) we can see that a model using 131 variables gives the best ROCAUC score, but we can also see a peak in the ROCAUC around 86 variables. Since fewer variables also imply less complexity, it is desirable to use as few variables as possible without compromising performance; this is also referred to as the principle of Occam's Razor (OR) [21]. Based on this we implement models containing all variables (218), the RFE-optimal number (131) and the OR-optimal number (86) for the tree-based algorithms, and models containing all variables (218), the RFE-optimal number (123) and the OR-optimal number (65) for the Logistic Regression model.

3.2.2 Logistic Regression

In order for a Logistic Regression to perform well we first had to standardize the variables by setting the mean to zero and scaling to unit variance. This is done in order to prevent variables with larger variance from dominating the objective function over variables with lower variance, which would reduce the predictive power [22].

The data is then split into training and test sets with an 80/20 % distribution. The training set then undergoes undersampling in order to deal with the class imbalance mentioned in earlier sections, while the test set is kept imbalanced to represent the real-world default ratio. Since the Logistic Regression algorithm cannot handle missing values in the data, we imputed each numerical feature individually with the median of the corresponding feature in the training set. As mentioned earlier, this was not done for the categorical features.
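A minimal sketch of these preprocessing steps, assuming a pandas DataFrame X, labels y and a hypothetical list numerical_cols; the random seed and variable names are illustrative, and the training set is subsequently undersampled as in Section 3.1.3.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 split; stratify to preserve the real-world default ratio in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Median imputation and standardisation are fitted on the training set only,
# then applied to the test set, to avoid information leakage.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

X_train_num = scaler.fit_transform(imputer.fit_transform(X_train[numerical_cols]))
X_test_num = scaler.transform(imputer.transform(X_test[numerical_cols]))
```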

To build a regularized Logistic Regression we first implemented a randomized search algorithm, optimized on the ROCAUC-score, in order to obtain the optimal hyperparameter for our model. The algorithm builds a number of models, each with a randomly chosen value of the parameter from the defined grid. The models' ROCAUC-scores are then compared and the parameter of the model with the highest ROCAUC-score is returned. Note that, as in the case of the RFE and for the same reasons, this is done only for the numerical variables; thus LASSO is applied only to the numerical variables. The parameter that is searched over is the regularization strength C, which was presented in the theory section on LASSO. We remind the reader that a value closer to zero specifies stronger regularization and that C is the reciprocal of the shrinkage parameter λ, i.e. C = 1/λ.

The values for the parameter are as follows:

• C : the continuous uniform interval [0, 1], together with the values 10 and 100

The optimal parameter according to the ROCAUC-score from the randomized search algorithm is the following:

• C = 0.1764
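A minimal sketch of such a randomized search, assuming scikit-learn's RandomizedSearchCV on the undersampled, scaled numerical training data (X_train_num_balanced and y_train_balanced are hypothetical names); the number of sampled candidates and the number of cross-validation folds are assumptions.

```python
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# L1-regularized Logistic Regression; C is the reciprocal of the shrinkage parameter lambda.
lasso_logreg = LogisticRegression(penalty="l1", solver="liblinear")

# The search space: a continuous uniform interval on [0, 1] plus two discrete candidates.
param_distributions = [
    {"C": uniform(loc=0, scale=1)},
    {"C": [10, 100]},
]

search = RandomizedSearchCV(
    lasso_logreg,
    param_distributions=param_distributions,
    n_iter=50,            # assumption: number of randomly sampled candidates
    scoring="roc_auc",
    cv=5,                 # assumption: five-fold cross validation
    random_state=42,
)
search.fit(X_train_num_balanced, y_train_balanced)
print("Best C:", search.best_params_["C"])
```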

Now the regularized Logistic Regression model is built with the obtained parameter, using only the numerical variables. Due to the imposed L1-penalty some of the coefficients in this model are exactly zero. The corresponding variables are removed and we create a new dataset consisting of the remaining numerical variables together with the categorical variables that were excluded before LASSO. A Logistic Regression model is then trained on this new dataset; this model will be referred to as the L1 Logistic Regression, since it uses LASSO for feature selection. Its performance is then compared to the Logistic Regression models using RFE and OR as feature selection. Using LASSO with the obtained parameter C for feature selection resulted in 158 variables being selected. Presenting a LASSO path for this would be redundant, since there are more than 200 variables, which renders the plot unreadable. The ROC-curves with corresponding AUCs are visualized below, together with precision-recall curves on the training and test sets.
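The LASSO-based feature selection step can be sketched as follows, again with hypothetical variable names (numerical_cols, categorical_dummy_cols, X_train_balanced) and the C value taken from the search above; this is an illustrative sketch rather than the exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit the L1-penalized model on the scaled numerical variables with the C found above.
lasso = LogisticRegression(penalty="l1", C=0.1764, solver="liblinear")
lasso.fit(X_train_num_balanced, y_train_balanced)

# Keep only the numerical variables whose coefficients were not shrunk to zero.
kept_numerical = np.array(numerical_cols)[lasso.coef_.ravel() != 0]

# Rebuild the training data with the surviving numerical variables plus the
# one-hot-encoded categorical variables excluded from the LASSO step, and train
# the final "L1 Logistic Regression" on it.
selected_cols = list(kept_numerical) + list(categorical_dummy_cols)
l1_logreg = LogisticRegression(max_iter=1000)
l1_logreg.fit(X_train_balanced[selected_cols], y_train_balanced)
```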

Figure 20: ROC-curve: Test Set
Figure 21: PR-curve: Test Set

The ROCAUCs on the test set are marginally different from each other, indicating that the Logistic Regression model is insensitive to the number of variables. Also note that the PRAUCs are remarkably low. This is because the precision-recall curve is calculated on the imbalanced test set.
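For reference, a minimal sketch of how these two metrics can be computed with scikit-learn, assuming a fitted model and hypothetical test arrays X_test_final and y_test.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# Predicted probability of default (the positive class) on the imbalanced test set.
p_default = model.predict_proba(X_test_final)[:, 1]

roc_auc = roc_auc_score(y_test, p_default)

# The PR baseline equals the positive-class prevalence (about 0.1 here),
# which is why the PRAUC looks low even for a reasonable classifier.
precision, recall, _ = precision_recall_curve(y_test, p_default)
pr_auc = auc(recall, precision)

print(f"ROCAUC: {roc_auc:.2f}, PRAUC: {pr_auc:.2f}")
```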

Figure 22: ROC-curve: Training Set
Figure 23: PR-curve: Training Set

As can be observed from figure (22), the ROCAUCs on the training sets are at a reasonable level. A much higher score would have indicated overfitting and poor generalization of the classifier, which is not desirable. Here the PRAUCs are significantly higher, since the training set is balanced.

Based on these plots it is clear that a Logistic Regression with 65 variables will be used as the final model, since the performance differences are marginal and lower complexity is preferred.

3.2.3 CatBoost

As mentioned in the theory section, the CatBoost algorithm can be implemented using ordered boosting or plain boosting. To decide which boosting mode to use we built pairs of otherwise identical models, one per mode, and compared their performance in terms of ROCAUC and PRAUC. To build the CatBoost models we implemented a grid search to find the best hyperparameters from a predefined list. The following parameters were grid-searched:

• Tree depth
• L2 Leaf Regularization: penalizes large weights in each leaf [23].

For the tree depth it is recommended to use values in the range 4-10, as this is optimal in most cases [23]. For the L2 leaf regularization any positive value is allowed [23]. Since there were no recommendations on which values to use for the L2 leaf regularization we made our grid relatively large, ranging from 0 to 100, and shrunk the grid based on the results. The final grid on which the parameters were tuned was the following:

• Tree depth: [4, 5, 6, 7, 8]
• L2 Leaf Regularization: [18, 26, 35, 40, 50]

The number of trees, i.e. boosting steps, is determined by using the default value of 1000 combined with an overfitting detector which stops the boosting when the ROCAUC has not increased for 50 boosting steps. This prevents the model from overfitting [24]. Another parameter that is set by the algorithm itself is the learning rate, which governs the gradient step [23]. This value is chosen automatically based on properties of the dataset and the number of trees used, and should be close to the optimal one [23]. In the figures below we find the ROCAUC-score on the y-axis and the 30 different models built on the x-axis. The optimal models, yielding the highest ROCAUC-score on the validation set, were as follows for the two boosting modes:

• Ordered CatBoost: Tree depth = 7, L2 Leaf Reg = 40
• Plain CatBoost: Tree depth = 8, L2 Leaf Reg = 40

Figure 24: Grid Search: Ordered Boosting
Figure 25: Grid Search: Plain Boosting
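A minimal sketch of the grid search over the two boosting modes described above, assuming CatBoost's built-in grid_search method and hypothetical training objects X_train_cb, y_train_cb and categorical_cols; the class weights and the number of cross-validation folds are assumptions.

```python
from catboost import CatBoostClassifier

param_grid = {
    "depth": [4, 5, 6, 7, 8],
    "l2_leaf_reg": [18, 26, 35, 40, 50],
}

for boosting_type in ["Ordered", "Plain"]:
    model = CatBoostClassifier(
        boosting_type=boosting_type,
        iterations=1000,            # default number of boosting steps
        eval_metric="AUC",
        od_type="Iter",             # overfitting detector on iterations
        od_wait=50,                 # stop if the metric has not improved for 50 rounds
        class_weights=[1, 10],      # assumption: rough 1:10 imbalance from Section 3.1.3
        cat_features=categorical_cols,
        verbose=False,
    )
    result = model.grid_search(param_grid, X=X_train_cb, y=y_train_cb, cv=5)
    print(boosting_type, result["params"])
```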

This grid search was then repeated for the RFE-selected variables and for the 86 variables selected using figure (18). The grid search results for the RFE-selected variables and the variables selected using Occam's Razor are presented in the following figures for both boosting modes.

Figure 26: Grid Search: Ordered Boosting RFE
Figure 27: Grid Search: Plain Boosting RFE

Figure 28: Grid Search: Ordered Boosting Occam's Razor
Figure 29: Grid Search: Plain Boosting Occam's Razor

Below we present the optimal parameters for each model according to the grid search:

• Ordered CatBoost (RFE): Tree depth = 8, L2 Leaf Reg = 40
• Plain CatBoost (RFE): Tree depth = 8, L2 Leaf Reg = 40
• Ordered CatBoost (OR): Tree depth = 8, L2 Leaf Reg = 50
• Plain CatBoost (OR): Tree depth = 8, L2 Leaf Reg = 35

When the optimal parameters for every model have been found we train the models on the training set and evaluate their performance by plotting the ROCAUC-scores for the training and test sets.

Besides this we also look at the precision-recall curves for the same datasets. By assessing the scores on the training sets we can understand how well the different models generalize to the data. It also becomes possible to detect potential sample bias if the performance differs greatly across the sets. For example, if the test set happens to contain observations that are more "easily" classified as defaults, the performance on the test set might be higher than on the training set.

Below we find the ROC-curves with their respective AUC-values for each model and for each dataset.

Figure 30: ROC-curve: Test Set
Figure 31: PR-curve: Test Set

From the figures above we see that the plain boosting mode seems to outperform the ordered mode, both in terms of the AUC of the ROC-curve and the AUC of the PR-curve. One should also note that the feature elimination using RFE and Occam's Razor does not decrease the performance of the classifier. Hence the principle of Occam's Razor has been successfully applied, since we have removed excess complexity without sacrificing performance.

Now we analyze these plots on the training set:

Figure 32: ROC-curve: Training Set
Figure 33: PR-curve: Training Set

In these figures we see the effect of the ordered boosting. All versions of the ordered boosting mode have lower performance, both in terms of the ROC-curve and the PR-curve. It is, however, important to remember that a very high score on the training set is not desirable, since it would indicate overfitting and poor generalization of the classifier. We can also clearly see in these plots that the variables selected using Occam's Razor provide the best performance. Furthermore, the developers of the CatBoost algorithm compared the performance of the two boosting modes on different datasets; the results indicated that the ordered mode is preferred when the dataset is relatively small, i.e. fewer than 40k training observations [13]. This also speaks in favor of the plain boosting mode.

Based on all this it is clear that 86 variables should be used in the final model. Regarding the boosting mode, we argue that even if the plain boosting mode generalizes somewhat worse than the ordered mode, its performance is still very consistent across the different sets. This, combined with the fact that the plain boosting mode shows better performance, leads to the exclusion of the ordered boosting mode.

3.2.4 Random Forest

For the Random Forest algorithm more parameters had to be tuned, which makes a full grid search computationally inefficient. We therefore implemented a randomized search instead. The difference is that in a grid search every possible combination of the given parameters is tested; for the CatBoost algorithm there were in total 30 different combinations, meaning that 30 different models were created and tested systematically. For the Random Forest grid there are more than 400 combinations, which makes a full grid search infeasible. The randomized search covers a larger parameter space and picks the parameter combinations at random. In this thesis 100 randomly selected combinations were tried from the predefined grid, which is presented below:

• min samples split: [2, 5, 10, 15]
• max depth: [10, 37, 63, 90, 117, 143, 170, 197, 223, 250, None]
• min samples leaf : [1, 2, 4, 6, 8]

• bootstrap: [True, False]

To determine the number of trees to use in the Random Forest we used the following plot, in which the OOB-error is plotted against the number of trees used. Based on this plot we choose the number of trees where the decrease in the OOB-error seems to stabilize. From the plot we can see that the OOB-error seems to stabilize after 1000 trees, and thus 1000 trees will be used in each Random Forest. A sketch of the randomized search is given after figure (34).

Figure 34: OOB-errors vs. Number of Trees
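A minimal sketch of the randomized search described above, assuming scikit-learn's RandomizedSearchCV on the undersampled training set (X_train_rf and y_train_rf are hypothetical names); the number of cross-validation folds and the random seed are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "min_samples_split": [2, 5, 10, 15],
    "max_depth": [10, 37, 63, 90, 117, 143, 170, 197, 223, 250, None],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "bootstrap": [True, False],
}

rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=42)

search = RandomizedSearchCV(
    rf,
    param_distributions=param_distributions,
    n_iter=100,            # 100 randomly sampled combinations from the grid
    scoring="roc_auc",
    cv=5,                  # assumption: five-fold cross validation
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train_rf, y_train_rf)   # undersampled training set
print(search.best_params_)
```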

When the randomized search was completed it yielded the following parameters for the different Random Forest models:

• Random Forest: max depth = 250, min samples split = 2, min samples leaf = 1
• Random Forest (RFE): max depth = 250, min samples split = 2, min samples leaf = 1
• Random Forest (OR): max depth = 250, min samples split = 2, min samples leaf = 1

To decide which of the models to use we analyze the AUC-scores of the ROC-curve and the PR-curve.

Figure 35: ROC-curve: Test Set
Figure 36: PR-curve: Test Set

In the figures above we observe the same pattern as for the CatBoost classifier. The Random Forest model built by applying Occam's Razor performs on par with the RFE-optimal model and the full model. This was expected, since the RFE was applied using a Random Forest classifier. Here the RFE-selected variables seem to provide a slightly better model when we consider the AUC of the PR-curve, but considering that the RFE-selected model has 45 more features, the 0.01 gain in AUC is not worth the extra complexity. Analyzing the performance on the training set is redundant for a Random Forest classifier: when the maximum depth is relatively large, the forest grows trees until the leaves contain only one observation, so by construction the Random Forest will be overfitted on the training set. Since the randomized search was performed using the ROCAUC-score, however, the resulting models are believed to have the best performance.

By analyzing the above plots it is clear that the OR-optimal Random Forest should be used, since it performs on par with the other models but with lower complexity in terms of features.

4 Results

In this section the resulting models will be presented along with relevant measures to make a comparison possible.

Since there are different costs associated with false positives and false negatives, we have to use metrics that highlight this. The cost associated with a default is clearly greater than the cost of declining a customer who would not have defaulted: if a customer defaults the direct loss equals the order amount, whereas declining a customer only means that a small percentage of the order amount, the transaction fee, is lost. The optimal case would therefore be to put somewhat more emphasis on false negatives than on false positives. Determining the exact trade-off between the two situations would, however, require a rigorous cost analysis which is not in the scope of this thesis. Below we find the precision, recall, F and AUC-scores for the final models, together with the ROCAUC of Klarna's current XGBoost model:

Model           Precision   Recall   F      ROCAUC   PRAUC
XGBoost         -           -        -      0.77     -
CatBoost        0.18        0.65     0.29   0.78     0.29
Random Forest   0.16        0.71     0.26   0.75     0.24
LogReg          0.14        0.66     0.23   0.71     0.18

Table of model results

Below we present the ROC-curves and PR-curves for all models in order to visualize the results:

Figure 37: ROC-curves: Test Set
Figure 38: PR-curves: Test Set

5 Discussion

In this section the results presented in the previous section will be discussed. We also address our posed research questions.

We present our research questions to be addressed below:

• What are the benefits and drawbacks of using Logistic Regression, Random Forest and CatBoost in order to predict defaults?
• How does the optimal algorithm compare to the current XGBoost model?

In order to answer the research questions posed we use the results presented in the previous section. From the results we can see a distinct pattern for almost all of the presented metrics: the CatBoost algorithm outperforms both the Logistic Regression and the Random Forest. Compared to the Logistic Regression the difference in performance is relatively large. The only metric on which the CatBoost algorithm seems to underperform compared to the other algorithms is the recall, where the Random Forest outperforms the CatBoost by seven percentage points. Essentially this means that the Random Forest and the Logistic Regression models are better at finding the true positives, i.e. predicting actual defaults. One could therefore argue that the Random Forest model is better, but it is important to remember that the recall and precision scores need to be analyzed together. A relatively high recall-score is not very impressive if the precision-score is poor, since this in practice means that the false positive rate is relatively large. For example, if the algorithm simply classifies every observation as a default, the recall-score will be equal to 1 since every true default has been classified correctly, but this is not a good model since the alternative cost will be huge if the majority of the customers are declined. Therefore we consider the F-score and the PRAUC, since these combine precision and recall. Considering these metrics we can see that the CatBoost model outperforms the two other algorithms.

The main metric of interest for Klarna is the ROCAUC-score, since it summarizes the model's ability to separate defaults from non-defaults. Comparing this score we can see that the CatBoost algorithm significantly outperforms the Random Forest and the Logistic Regression, but only marginally outperforms Klarna's current XGBoost model. The reason behind the relatively large difference in performance can be related to the underlying structure of the models. Both the CatBoost and XGBoost algorithms are gradient boosting methods, which might be better suited for this kind of setting. In a similar study at Nordea Bank, where seven machine learning algorithms were compared, two of the top three models were gradient boosted [25]. Hence it is very possible that the nature of the models is responsible for the performance gap between the three different ML techniques.

However, even if the results indicate that the CatBoost outperforms the XGBoost, we need to address the fact that using different datasets for building the models can influence the results in either direction. The data used for the CatBoost model is from December 2019 to the beginning of January 2020. A majority of the observations are from December, and due to Christmas there is a possibility that the data is populated differently. The data used for the XGBoost model ranges from the start of May 2019 to the end of July 2019. Therefore, variables used in the XGBoost model might contain less information, i.e. more missing rows, than the exact same variables used in the CatBoost model. Thus the characteristics of the datasets used should also be considered when interpreting the results.

Another thing to consider is that the CatBoost algorithm is the only one of the algorithms that does not need any encoding of categorical variables, since this is built into the algorithm. Thus, in cases where there is a relatively large number of categorical variables with relatively high cardinality, the preprocessing needed for CatBoost is much simpler.
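For illustration, passing raw categorical columns to CatBoost only requires pointing out which features are categorical; the variable names below are hypothetical.

```python
from catboost import CatBoostClassifier, Pool

# CatBoost consumes the raw categorical columns directly; no one-hot encoding is
# needed, only the names (or indices) of the categorical features.
train_pool = Pool(data=X_train_raw, label=y_train, cat_features=categorical_cols)

model = CatBoostClassifier(verbose=False)
model.fit(train_pool)
```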

5.1 Conclusion

In this section a final recommendation is given regarding which model to use and why.

Our final conclusion, considering the results, is that it is of great interest for Klarna Bank to consider a CatBoost algorithm for a PD model, since it performs on par with their current XGBoost model. CatBoost is also preferable since it does not require categorical variables to be encoded, which XGBoost does.

5.1.1 Future Studies

Since this study aimed to compare different algorithms, the different encoding methods that can be used for categorical variables were not truly explored. It would therefore be of interest to see how different encoding methods affect the performance of a classifier. Also, since there are different costs associated with declining a "good" customer versus accepting a "bad" customer, it would be interesting to quantify this effect by analyzing the different costs and weighting the F-score accordingly.
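As a sketch of the last point, scikit-learn's F-beta score allows recall to be weighted more heavily than precision; the choice beta = 2 below is purely illustrative and y_test, y_pred are hypothetical arrays of true and predicted labels.

```python
from sklearn.metrics import fbeta_score

# beta > 1 weights recall (catching actual defaults) more heavily than precision,
# reflecting that a missed default costs the full order amount while a wrongly
# declined customer only costs the transaction fee. beta = 2 is purely illustrative;
# a real choice would follow from the cost analysis discussed above.
weighted_f = fbeta_score(y_test, y_pred, beta=2)
```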

6 References

[1] Kate Rooney. Online shopping overtakes a major part of retail for the first time ever, April 2019. URL: https://www.cnbc.com/2019/04/02/online-shopping-officially-overtakes-brick-and-mortar-retail-for-the-first-time-ever.html.
[2] CatBoost - state-of-the-art open-source gradient boosting with categorical features support. CatBoost, 2020. URL: https://catboost.ai/#benchmark.
[3] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 103 of Springer Texts in Statistics. Springer New York, New York, NY, 2013. ISBN 978-1-4614-7137-0, 978-1-4614-7138-7. doi: 10.1007/978-1-4614-7138-7. URL: http://link.springer.com/10.1007/978-1-4614-7138-7.
[4] J.S. Cramer. The Origins of Logistic Regression, November 2002. URL: https://papers.tinbergen.nl/02119.pdf.
[5] Vishal Morde. XGBoost Algorithm: Long May She Reign!, April 2019. URL: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d.
[6] Kaitlin Kirasich, Trace Smith, and Bivin Sadler. Random Forest vs Logistic Regression: Binary Classification for Heterogeneous Datasets. 1(3):25, 2018. URL: https://scholar.smu.edu/cgi/viewcontent.cgi?article=1041&context=datasciencereview.
[7] Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learning - Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, second edition, 2009.
[8] Douglas M Hawkins. The problem of overfitting. pages 1-12, 2004.
[9] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity - The Lasso and Generalizations. Taylor & Francis Inc, August 2015. ISBN 1-4987-1216-9. URL: https://web.stanford.edu/~hastie/StatLearnSparsity_files/SLS.pdf.
[10] Jerome H Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189-1232, 2001.
[11] Chao Chen, Andy Liaw, and Leo Breiman. Using Random Forest to Learn Imbalanced Data. URL: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf.
[12] Anna Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, and Aleksandr Vorobev. Fighting biases with dynamic boosting. June 2017. URL: https://www.researchgate.net/publication/318030603_Fighting_biases_with_dynamic_boosting/link/59b6742d0f7e9bd4a7fbef17/download.
[13] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: unbiased boosting with categorical features. January 2019. URL: http://arxiv.org/abs/1706.09516.
[14] Daniel Chepenko. Introduction to gradient boosting on decision trees with Catboost, February 2019. URL: https://towardsdatascience.com/introduction-to-gradient-boosting-on-decision-trees-with-catboost-d511a9ccbd14.
[15] sklearn.feature_selection.RFE - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
[16] Lasso path using LARS - scikit-learn 0.11-git documentation. scikit-learn, 2020. URL: https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/linear_model/plot_lasso_lars.html.
[17] sklearn.model_selection.GridSearchCV - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
[18] sklearn.model_selection.RandomizedSearchCV - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html.
[19] Jason Brownlee. How to Use ROC Curves and Precision-Recall Curves for Classification in Python, August 2018. URL: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/.
[20] sklearn.metrics.f1_score - scikit-learn 0.23.0 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
[21] Brian Duignan. Occam's razor | Origin, Examples, & Facts, December 2018. URL: https://www.britannica.com/topic/Occams-razor.
[22] sklearn.preprocessing.StandardScaler - scikit-learn 0.22.2 documentation. scikit-learn, 2020. URL: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
[23] Parameter tuning - CatBoost documentation. CatBoost, 2020. URL: https://catboost.ai/docs/concepts/parameter-tuning.html.
[24] Overfitting detector - CatBoost documentation. CatBoost, 2020. URL: https://catboost.ai/docs/concepts/overfitting-detector.html.
[25] Daria Granström and Johan Abrahamsson. Loan Default Prediction using Supervised Machine Learning Algorithms. Master's thesis, KTH Royal Institute of Technology, Stockholm, 2019. URL: http://kth.diva-portal.org/smash/get/diva2:1319711/FULLTEXT02.pdf.
