DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Predicting House Prices on the Countryside using Boosted Decision Trees

WAR REVEND

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES

Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2020
Supervisor at Booli Search Technologies AB: Christopher Madsen
Supervisor at KTH: Joakim Andén-Pantera
Examiner at KTH: Joakim Andén-Pantera

TRITA-SCI-GRU 2020:302 MAT-E 2020:075

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

This thesis evaluates the feasibility of supervised learning models for predicting house prices on the countryside of South Sweden. Accurate housing valuation algorithms are essential for mortgage lenders, and the current model offered by Booli is not accurate enough when valuing residences on the countryside. Different types of boosted decision trees were implemented to address this issue and their performance was compared to traditional methods. These supervised learning models were implemented in order to find the best model with regard to relevant evaluation metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All these models were benchmarked against Booli's current housing valuation algorithm, which is based on a k-NN model. The results indicate that LightGBM is the optimal model, as it had the best overall performance with respect to the chosen evaluation metrics. Compared to the benchmark, LightGBM performed better overall, with an RMSE of 0.330 versus 0.358 for the Booli model, indicating that boosted decision trees have the potential to improve the predictive accuracy of residence prices on the countryside.

Sammanfattning

This thesis evaluates the feasibility of different supervised learning models for predicting house prices on the countryside of South Sweden. It is important for mortgage lenders to have accurate algorithms when valuing residences; the current model offered by Booli has poor precision when it comes to valuing residences on the countryside. Different types of boosted decision trees were implemented to address this issue and their performance was compared with traditional machine learning methods. These supervised learning models were implemented in order to find the best model with respect to relevant performance measures such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The supervised learning models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. The performance of all the algorithms was compared with Booli's current housing valuation algorithm, which is based on a k-NN model. The results of this thesis show that the LightGBM model is the optimal model for valuing houses on the countryside, since it had the best overall performance with respect to the chosen evaluation metrics. The LightGBM model was compared with the Booli model and its performance was better overall: the LightGBM model had an RMSE of 0.330 compared to 0.358 for the Booli model. This indicates that there is potential in using boosted decision trees to improve the accuracy of house price predictions on the countryside.

Acknowledgements

First, I wish to thank my supervisor at KTH, Joakim Andén-Pantera, for his excellent patience, guidance, and support in completing this thesis. I would also like to express my gratitude toward Christopher Madsen, my supervisor at Booli Search Technologies AB, for inspiration and support. Further, I want to thank Johan Mattsson and Olof Sjöbergh at Booli for additional advice and for providing the opportunity for this thesis.

Contents

Acknowledgements

1 Introduction
1.1 Background and problem formulation
1.2 Purpose
1.3 Research question
1.4 Scope
1.5 Limitation

2 Background
2.1 Supervised Learning
2.2 Regression
2.2.1 Simple linear regression
2.2.2 Multiple linear regression
2.3 Shrinkage Methods
2.3.1 Ridge regression
2.3.2 Lasso regression
2.4 k-Nearest Neighbors Algorithm
2.5 Decision Trees
2.5.1 Regression Trees
2.6 Random Forest
2.7 Boosting
2.7.1 AdaBoost
2.7.2 Gradient Boosting
2.7.3 Categorical Boosting: CatBoost
2.7.4 XGBoost
2.7.5 LightGBM
2.8 Hyper-Parameter Tuning
2.8.1 Cross-validation
2.9 Metrics of interest

3 Methods
3.1 Data
3.1.1 Overview of the available data
3.1.2 Preprocessing
3.2 Model Implementation
3.2.1 Hyper-Parameter Tuning of the Models

4 Results

5 Discussion
5.1 Results Evaluation
5.1.1 Model Comparison
5.1.2 Benchmark Comparison

6 Conclusion
6.1 Answering Research Questions
6.2 Future work

Bibliography

A Scatterplots of variables vs Absolute Percentage Error

List of Tables

3.1 Description of the variables available in the data set obtained from Booli
3.2 Description of the variables available in the data set obtained from SCB
3.3 Hyper-parameter set for lasso and ridge regression
3.4 Hyper-parameter set for gradient boosting regression
3.5 Hyper-parameter set for random forest
3.6 Hyper-parameter set for AdaBoost regression
3.7 Hyper-parameter set for LightGBM
3.8 Hyper-parameter set for XGBoost
3.9 Hyper-parameter set for CatBoost

4.1 Performance of the models evaluated with metrics used by Booli

List of Figures

2.1 Illustrating leaf-wise and level-wise tree growth
2.2 Illustrating 5-fold cross-validation on a data set

3.1 Illustrating which variables in the data set have NaN values and the percentage of missing values
3.2 Illustrating one-hot encoding
3.3 Distribution of the adjusted residence price together with Q-Q plot
3.4 Distribution of the log-transformed adjusted residence price together with Q-Q plot
3.5 Heat map between numerical variables in the data set
3.6 Heat map between the ten numerical variables that correlate most strongly with the residence price in the data set

4.1 Barplot of the RMSE score on the train set when evaluating different scaling and transformation methods
4.2 Barplot of the RMSE score on the test set when evaluating different scaling and transformation methods
4.3 Evaluating MSE and MAPE as loss function for LightGBM and CatBoost
4.4 Barplot of the RMSE score on the train set when evaluating sample weights
4.5 Barplot of the RMSE score on the test set when evaluating sample weights
4.6 Barplot of the RMSE score on the train and test set
4.7 Barplot of the MAPE score on the train and test set
4.8 Barplot of the MdAPE score on the train and test set
4.9 Histogram of the percentage error between the Booli and LightGBM model
4.10 Map of the mean percentage error in each DeSO area using the LightGBM and Booli model
4.11 Map of the mean percentage error in each DeSO area using XGBoost
4.12 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using LightGBM
4.13 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using the Booli model
4.14 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using lasso
4.15 Scatterplots of the variables assessedValue and assessedValueBuilding vs absolute percentage error, using the LightGBM model
4.16 Scatterplots of the variables assessedValuePlot and assessmentPoints vs absolute percentage error, using the LightGBM model

A.1 Scatterplots of the variables constructionYear and delta_Date vs absolute percentage error, using the LightGBM model
A.2 Scatterplots of the variables latitude, longitude, totalArea, livingArea, plotArea, and rooms vs absolute percentage error, using the LightGBM model
A.3 Scatterplots of the variables distanceToOceanFront and distanceToWater vs absolute percentage error, using the LightGBM model

Chapter 1

Introduction

This chapter provides an overview of the aim of the thesis. The topics discussed within this chapter are the background, purpose, research questions, scope, and limitations of the thesis.

1.1 Background and problem formulation

For the majority of people, purchasing a residence is the biggest financial commitment of their life. Ensuring that homeowners have a trusted way of monitoring the value of their assets is therefore important. Even companies such as Zillow, an American online real estate database company, offered a competition in 2018 with a one million dollar grand prize for improving their valuation algorithm [1]. The valuation algorithm is an important feature that Booli offers to its consumers, but it is also used by their owner, SBAB Bank AB, when they need to value residences across Sweden. Bank customers usually want a valuation of their residence when they want to take out a loan on the house or when they need a new mortgage for purchasing a new property. It would be tedious, time-consuming, and expensive for the bank if every residence they value had to be inspected manually, especially in a data-driven world. It is therefore essential to have methods that value residences as accurately as possible, since an estimated price that deviates too much from what the property is actually worth can affect both the bank and its customers negatively.

The housing valuation algorithm is an important feature that Booli offers. Their current model is based on a k-nearest neighbors algorithm (k-NN) that evaluates nearby transactions of similar residences. When a residence is being evaluated using k-NN, the input data for the algorithm are transactions of similar residences. The valuation algorithm works well in urban areas, for instance Stockholm City, but it performs poorly in rural areas. This is because k-NN depends on transaction data from neighboring residences. On the countryside, less transaction data is available, which makes it difficult for the algorithm to perform well. By using statistical methods and machine learning algorithms, one can improve on the current valuation algorithm.

Sweden is divided into six index areas: Greater Stockholm, Greater Gothenburg, Greater Malmö, North Sweden, Middle Sweden, and South Sweden. When looking at villas on the countryside, one considers the index areas North Sweden, Middle Sweden, and South Sweden. In this thesis, the focus will be on one index area, South Sweden, since the algorithm deviates the most in this area and South Sweden also has more data points than North Sweden and Middle Sweden.

1.2 Purpose

The objective of this thesis is to investigate which method from a chosen set of boosted decision trees produces the most accurate price predictions, and to compare the results to other machine learning algorithms. As mentioned earlier, the current valuation algorithm developed by Booli has poor precision when valuing residence prices on the countryside. The algorithms developed here will be benchmarked against the Booli model to see under which circumstances the different models outperform one another and the Booli model.

1.3 Research question

The goal of this thesis is to answer the following questions:

• For a chosen set of boosted decision trees, which boosted decision tree performs best when valuing residences on the countryside?

• How do the boosted decision trees perform compared to the more traditional methods ridge regression, lasso regression, and random forest?

• Can the boosted decision trees yield a better valuation algorithm than the valuation algorithm used by Booli? If so, to what extent?

1.4 Scope

The scope of this thesis is to implement and investigate how different boosted decision trees perform when valuing residences on the countryside. The performance of the boosted trees will be compared to a set of machine learning methods and benchmarked against the model developed by Booli. The model evaluation techniques considered in this thesis are limited to root-mean-squared error (RMSE), mean absolute percentage error (MAPE), median absolute percentage error (MdAPE), the number of valuations that have an absolute percentage error below fifteen percent, and the mean of the absolute percentage error excluding the one percent of most extreme errors. The boosted decision trees that will be implemented are the following: gradient boosting, AdaBoost, XGBoost, LightGBM, and CatBoost.

The traditional machine learning methods that the boosted decision trees will be compared to are the following: ridge regression, lasso regression, and random forest.

These methods have been chosen to evaluate how models that utilize different techniques perform in comparison to each other. For instance, ridge regression is a shrinkage method, random forest is a bagging algorithm, and LightGBM is a boosting algorithm. From an academic perspective, it is intriguing to evaluate how traditional machine learning methods compare to more advanced machine learning methods. The reason the above boosting algorithms were chosen over other boosting algorithms is that XGBoost has been widely incorporated in winning submissions in Kaggle competitions [2]. When LightGBM was released by Microsoft it showed promising results, outperforming XGBoost in a comparison experiment in several categories such as accuracy and speed during training and testing [3]. A few months later, the boosting algorithm CatBoost was released by Yandex. CatBoost outperformed both XGBoost and LightGBM in a benchmark where they were evaluated on different Kaggle competitions [4].

The thesis project will be conducted at Booli Search Technologies AB (henceforth Booli). The data provided by Booli has been collected over the years through web scraping, by purchasing data from Lantmäteriet, and by collecting data from SCB (Statistiska Centralbyrån).

1.5 Limitation

In order to narrow down the scope of this thesis, the following limitations had to be made:

• Only villas will be investigated and no other types of residences, such as apartments or office buildings, since villas are the type of residence with the highest median prediction error.

• Only villas on the countryside will be considered as the Booli model is accurate when it comes to evaluating residences in metropolitan areas.

• The data contains transactions from 2015-01-01 to 2020-03-31, as the data from this time period is sufficient for covering the index area of interest.

• Only villas in the index area of South Sweden will be considered; however, the methodology used is applicable to the rest of the index areas as well. The reason for choosing South Sweden is that it is one of the areas where the Booli model has the largest prediction error, and it also has more data points than North Sweden and Middle Sweden.

Chapter 2

Background

In this chapter, the theoretical aspects behind a chosen set of supervised learning models in the field of regression are discussed. The metrics chosen to evaluate the models are also presented in this chapter.

2.1 Supervised Learning

One can divide statistical learning algorithms into three types of learning problems: supervised learning, unsupervised learning, and reinforcement learning [5, p. 1–4]. In this thesis, only supervised learning will be considered. In supervised learning, there is a well-defined structure to the data, where the goal is to learn the relationship between the regressor variable(s) and the response variable(s). The variables are assumed to be random variables drawn from some unknown distribution [6]. A sample obtained from the joint distribution of the random variables, referred to as the training set, is used during the training process in order to learn the relationship.

In this thesis the focus will be on the regression task, where the response variable $y$ belongs to $\mathbb{R}$ and the regressor variables $X$ can be both categorical and numeric. The regressors are assumed to have $N$ observations, where each regressor has a $p$-dimensional feature space, so that $X = [X_1, X_2, \ldots, X_p]'$ has size $N \times p$, where the $i$th observation of the $j$th feature is denoted by $x_{ij}$. The relationship between the variables is defined by $y = f(X) + \epsilon$, where $f(X)$ is a fixed function capturing the relationship with the response variable and $\epsilon$ is a random error term that is independent of $X$ and has mean zero [7, p. 16]. The training set $\tau = \{(x_i, y_i)\}_{i=1}^{N}$, with $x_i = (x_{i1}, \ldots, x_{ip})$, is used when estimating the function $f(X)$. The prediction of $y$ is then modeled as


$\hat{y} = \hat{f}(X)$, where $\hat{f}$ represents the estimate of $f$ and $\hat{y}$ represents the estimate of $y$ [7, p. 21].

The function $f(\cdot)$ can be estimated with different methods; the main distinction in statistical learning is between parametric and non-parametric methods. A parametric method relies on assumptions about the functional form of $f$: a class $\mathcal{F}$ of possible functions is defined and the optimal one within that class is sought based on some optimality criterion. Under such assumptions, one trains the model by estimating the parameters of the model using the training set $\tau$ [7, p. 21–23]. A non-parametric method, however, does not rely on any explicit assumptions about the form of $f$. Instead, such methods try to find an estimate of $f$ that is as close to the points in $\tau$ as possible without being too rough or changing too rapidly [7, p. 23–24].

When the functional form of $f$ is assumed and a class $\mathcal{F}$ of possible functions is defined, the parameters of the model can be estimated using the training set $\tau$. A loss function $L(y, f(X))$ is defined in order to quantify the prediction performance of the function $f$, mapping it to a real number in $\mathbb{R}$. In the regression setting, the squared loss function is normally used to measure how well the function $f$ fits the data: $L(y, f(X)) = (y - f(X))^2$ [6]. The goal of supervised learning is to find a function $f(X) \in \mathcal{F}$ with the lowest average loss on the training data set, $\frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$ [6]. Hence, the solution to this optimization problem is the optimal estimator $\hat{f}(X)$ such that

$$\hat{f}(X) = \underset{f \in \mathcal{F}}{\arg\min} \; \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)). \qquad (2.1)$$

2.2 Regression

In this section, the theory of simple and multiple linear regression will be presented; it is based on the presentation by Montgomery, Peck, and Vining [8, Ch. 2–3].

2.2.1 Simple linear regression

The simple linear regression method [8, p. 12–15] is a straightforward method used for predicting a response variable $y$ on the basis of a single regressor $x$. It assumes a linear relationship between the response variable $y$ and the single regressor $x$, which is generally written in the form

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad (2.2)$$

where the regression coefficients $\beta_0$ and $\beta_1$ are unknown constants that represent the intercept and slope of the model, and $\varepsilon$ is a random error term. Furthermore, the errors are assumed to be uncorrelated and normally distributed with mean zero and unknown variance $\sigma^2$, i.e. $\varepsilon \sim N(0, \sigma^2)$.

The regression coefficients are estimated using the method of least squares, where the residual sum of squares (RSS) can now be expressed as

$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_i)^2. \qquad (2.3)$$

The estimators $\hat{\beta}_0$ of $\beta_0$ and $\hat{\beta}_1$ of $\beta_1$ solve the following minimization problem

$$\left(\hat{\beta}_0, \hat{\beta}_1\right) = \underset{\beta_0, \beta_1}{\arg\min} \; \mathrm{RSS}(\beta_0, \beta_1). \qquad (2.4)$$

By expanding equation (2.3) a quadratic expression is obtained, from which the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the objective function RSS can be derived [8, p. 12–15], giving

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad (2.5)$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}. \qquad (2.6)$$

The fitted simple linear regression is then modeled as

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0, \qquad (2.7)$$

where $\hat{y}_0$ is the prediction for a new, unseen observation $x_0$.

2.2.2 Multiple linear regression

The multiple linear regression method [8, p. 67–70] extends the simple linear regression as it involves more than one regressor variable. In general, one assumes that the response variable $y$ is related to $p$ regressors as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \varepsilon, \qquad (2.8)$$

where $\beta = [\beta_0, \beta_1, \ldots, \beta_p]$ are the regression coefficients. The regression coefficients can be estimated using the least squares method. Suppose that $N > p$ observations on the $p$ regressors are available, that the $i$th observed response is denoted by $y_i$, and that $x_{ij}$ is the $j$th regressor of the $i$th observation. As in the case of simple linear regression, the errors are assumed to be normally distributed with mean zero and unknown variance $\sigma^2$ [8, p. 67–70].

Equation (2.8) can then be written as

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, N. \qquad (2.9)$$

The goal of the least-squares method is to fit a hyperplane in $(p+1)$-dimensional space that minimizes the sum of squared residuals, where the RSS can be expressed as

$$\mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{N} \varepsilon_i^2 = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2. \qquad (2.10)$$

Equation (2.8) can be written in a more compact notation by formulating the model using matrix notation [8, p. 71–73],

$$y = X\beta + \varepsilon, \qquad (2.11)$$

where

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}, \qquad (2.12)$$

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}. \qquad (2.13)$$

In general, $y$ is an $N \times 1$ vector of the response variables, $X$ is an $N \times (p+1)$ matrix of the regressors, $\beta$ is a $(p+1) \times 1$ vector of the regression coefficients, and $\varepsilon$ is an $N \times 1$ vector of random errors. Applying the transpose operator $'$ to a matrix flips the matrix over its diagonal; specifically, it switches the row and column indices of the matrix, yielding another matrix denoted $A'$ [8, p. 577]. The RSS can now be expressed as

$$\mathrm{RSS}(\beta) = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta). \qquad (2.14)$$

The minimization problem used to estimate the regression coefficients is

$$\hat{\beta} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \; \mathrm{RSS}(\beta), \qquad (2.15)$$

which is a convex problem. A closed-form solution is obtained by differentiating equation (2.15) with respect to $\beta$ and setting the derivative equal to zero, which gives [8, p. 71–73]

$$\hat{\beta} = (X'X)^{-1} X'y. \qquad (2.16)$$

The fitted multiple linear regression is then modeled as

$$\hat{y}_0 = X_0\hat{\beta}, \qquad (2.17)$$

where $\hat{y}_0$ is the prediction for a new, unseen observation $X_0$.

Linear regression is one of the simplest models that can be used for solving regression problems, and it belongs to the class of parametric methods. The main advantages are that the method is very simple to understand and has good interpretability. Furthermore, it is fast both during the training and testing process, and its memory consumption is low [7, p. 104–109]. The disadvantages are that the model assumes that the data is normally distributed and can be fitted by a hyperplane in $(p+1)$-dimensional space [7, p. 22].
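To make the closed-form solution in equation (2.16) concrete, the sketch below estimates the coefficients of a multiple linear regression with NumPy. It is a minimal illustration rather than the implementation used in this thesis; the variable names and the synthetic data are only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p regressors (illustration only).
N, p = 200, 3
X_raw = rng.normal(size=(N, p))
true_beta = np.array([2.0, 0.5, -1.0, 3.0])        # intercept + p slopes
X = np.column_stack([np.ones(N), X_raw])            # add intercept column
y = X @ true_beta + rng.normal(scale=0.1, size=N)   # y = X beta + noise

# Least-squares estimate, equation (2.16): beta_hat = (X'X)^{-1} X'y.
# lstsq is used instead of an explicit inverse for numerical stability.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a new observation X_0, equation (2.17).
X0 = np.array([1.0, 0.2, -0.3, 1.5])
y0_hat = X0 @ beta_hat
print(beta_hat, y0_hat)
```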

2.3 Shrinkage Methods

The idea of shrinkage methods is to perform a linear regression while shrinking the estimates of the regression coefficients $\hat{\beta}$ toward zero. The shrinkage methods introduce bias but can remarkably reduce the variance of the estimates, which could decrease the test error of the model if the latter effect is larger [9, p. 61]. In this section the shrinkage methods ridge regression and lasso regression will be described.

2.3.1 Ridge regression

The ridge regression method [9, p. 61–64] minimizes the residual sum of squares subject to the limitation that the sum of squares of the regression coefficients is less than a constant $t$:

$$\hat{\beta}^{\mathrm{ridge}} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \; \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t. \qquad (2.18)$$

Equation (2.18) can be written in a closed form with the penalty term $\lambda$,

$$\hat{\beta}^{\mathrm{ridge}} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}, \qquad (2.19)$$

where there exists a one-to-one correspondence between the parameters $t$ and $\lambda$ [9, p. 63]. The features in the data set need to be standardized before solving equation (2.19) because the ridge solutions are not equivariant under scaling [8, p. 63]. The ridge regression method handles the problem of multicollinearity as it imposes a size constraint on the coefficients. Multicollinearity is a phenomenon in regression analysis where there exists linear dependence between a large number of the regressors. Multicollinearity leads to an unstable model, as the estimated regression coefficients will have large variances and covariances [8, p. 285–290].

Proceeding as in Section 2.2.2 for multiple linear regression, equation (2.19) can be written more compactly in matrix notation, and the minimization problem can then be expressed as

$$\hat{\beta}^{\mathrm{ridge}} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \; (y - X\beta)'(y - X\beta) + \lambda\beta'\beta. \qquad (2.20)$$

Equation (2.20) has the solution [9, p. 63]

$$\hat{\beta}^{\mathrm{ridge}} = \left( X'X + \lambda I \right)^{-1} X'y. \qquad (2.21)$$

The optimal $\lambda$ is typically found using cross-validation with RMSE as the error metric. The model with the lowest RMSE value yields the optimal value of $\lambda$.

2.3.2 Lasso regression

The lasso regression method [9, p. 68–69] minimizes the residual sum of squares subject to the limitation that the sum of the absolute values of the regression coefficients is less than a constant $t$:

$$\hat{\beta}^{\mathrm{lasso}} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \; \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t. \qquad (2.22)$$

Equation (2.22) can be written in a closed form with the penalty term $\lambda$,

$$\hat{\beta}^{\mathrm{lasso}} = \underset{\beta \in \mathbb{R}^{p+1}}{\arg\min} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (2.23)$$

where there exists a one-to-one correspondence between the parameters $t$ and $\lambda$ [9, p. 68]. A general closed-form solution for the lasso method does not exist [9, p. 69].

Comparing the penalty terms in equation (2.19) and equation (2.23), it can be noted that the two methods differ from each other. The ridge regression method has a quadratic penalty term, whereas the lasso regression method has an absolute-value penalty term. Using different penalty terms leads to different solutions. In the solution of the lasso regression model, some regressors will be set exactly to zero, while in the ridge regression model these regressors are generally slightly larger than zero. This makes lasso regression a better method for feature selection, because it reduces the feature space by setting some regressors to zero, creating a simpler model that does not include those regressors [9, p. 68–73].

The optimal $\lambda$ for the lasso regression method is found using the same technique described for the ridge regression method. Cross-validation is used to evaluate different models based on their $\lambda$-value.

The advantages of the shrinkage methods are mainly that they handle overfitting when working with a high-dimensional data set. Reducing the feature space makes the model more interpretable and at the same time reduces the dimension during training. Comparing the different shrinkage methods shows a trade-off between the models: lasso only selects a subset of the predictors, which makes it produce simpler and more interpretable models compared to ridge regression [7, p. 223–224]. The disadvantage of the shrinkage methods is that reducing the dimension of the feature space too much can introduce high bias in the model [7, p. 33–36].
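As a hedged illustration of how the penalty $\lambda$ can be selected by cross-validation, the sketch below fits ridge and lasso regression over a small grid of penalty values, assuming scikit-learn's `Ridge`, `Lasso`, and `GridSearchCV`; the grid values and feature matrix are placeholders, not the settings used in this thesis (note that scikit-learn calls the penalty `alpha`).

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))            # placeholder regressors
y = X[:, 0] * 3.0 - X[:, 1] + rng.normal(scale=0.5, size=300)

# Features are standardized, since ridge/lasso solutions are not
# equivariant under scaling of the regressors.
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    pipe = make_pipeline(StandardScaler(), model)
    grid = GridSearchCV(
        pipe,
        param_grid={f"{name}__alpha": [0.01, 0.1, 1.0, 10.0]},
        scoring="neg_root_mean_squared_error",   # RMSE via negative scoring
        cv=5,
    )
    grid.fit(X, y)
    print(name, grid.best_params_, -grid.best_score_)
```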

2.4 k-Nearest Neighbors Algorithm

The k-nearest neighbors algorithm (k-NN) [10] is a method that can be used to solve clustering, classification, and regression problems. Here the focus is on k-NN for solving regression problems. Given a query point $x_0$, the k-NN algorithm identifies the $k$ points in the training set that are closest to $x_0$, represented by the neighborhood $N_0$. The Euclidean distance metric is used for finding the nearest neighbors to the query point $x_0$, where the Euclidean distance is calculated as $d(x_i, x_0) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{0j})^2}$ [10]. If $d_{(1)}, \ldots, d_{(N)}$ are the distances sorted in increasing order, then the neighborhood is defined as $N_0 = \{ x_i : d(x_i, x_0) \le d_{(k)} \}$. The average of the target variables of the $k$ nearest neighbors is then computed, which is the output of the model, given by $\hat{y}_0 = \frac{1}{k}\sum_{i \in N_0} y_i$ [10].

The k-NN method belongs to the class of non-parametric methods. The advantages of the method are that, in contrast to linear regression, it does not make an assumption about the distribution of the data, it is simple to understand, and it can be used both for classification and regression problems. The disadvantages are that the method has a much higher memory consumption and is more computationally expensive than linear regression, as the algorithm must calculate the distances to, and sort, all the training data at every prediction [7, p. 104–109]. Furthermore, k-NN is known for not working well when handling large data sets and high-dimensional data because of the curse of dimensionality [7, p. 104–109]. The curse of dimensionality is a phenomenon that occurs when working in high-dimensional spaces. When using k-NN on high-dimensional data, there will not exist any close neighbors to the query point $x_0$. The volume of the feature space grows exponentially with the number of dimensions; this exponential growth creates high sparsity in the data set, which makes k-NN perform poorly compared to parametric methods [7, p. 104–109].
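A minimal NumPy sketch of k-NN regression as described above: compute Euclidean distances from the query point, keep the $k$ smallest, and average their targets. Array names and the value of $k$ are illustrative only.

```python
import numpy as np

def knn_regress(X_train, y_train, x0, k=5):
    """Predict y at query point x0 as the mean target of its k nearest neighbors."""
    # Euclidean distances from x0 to every training point.
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    # Indices of the k smallest distances define the neighborhood N0.
    neighborhood = np.argsort(dists)[:k]
    return y_train[neighborhood].mean()

rng = np.random.default_rng(2)
X_train = rng.uniform(size=(100, 2))
y_train = X_train.sum(axis=1) + rng.normal(scale=0.05, size=100)
print(knn_regress(X_train, y_train, x0=np.array([0.4, 0.6]), k=5))
```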

2.5 Decision Trees

Decision trees are a supervised non-parametric method that can be used both for classification and regression problems, also known as CART, which stands for classification and regression trees. The main idea behind decision trees is that the input data are partitioned into two non-overlapping regions in each split. The splitting process continues until a stopping criterion is reached, e.g., when no region contains more than six training samples. A tree-like structure is obtained in the splitting process, which makes the decision tree more explainable and interpretable compared to other statistical methods. However, this comes at the cost of competitiveness, as the predictive accuracy may not be as high. Methods such as random forest and boosting can improve the predictive accuracy of the trees to a large extent by making the trees more complex. Before delving into how random forest and boosting help to construct more powerful trees, the fundamentals of regression trees need to be explained.

2.5.1 Regression Trees

Given the input vector $X$, one would want to predict the response $y$. The regression tree works by partitioning the feature space into $M$ distinct regions $\{R_m\}_{m=1}^{M}$, and in each region $R_m$ the response $y$ is modeled as a constant $c_m$. The response variable $y$ can then be expressed as a function of $x$:

$$f(x; a) = \sum_{m=1}^{M} c_m I(x \in R_m), \qquad (2.24)$$

where $I(\cdot)$ is the indicator function, which takes the value 1 if the condition is true and 0 otherwise, and $a = \{(R_m, c_m)\}_{m=1}^{M}$. The decision tree is constructed by minimizing the residual sum of squares error,

$$\mathrm{RSS}(a) = \sum_{i=1}^{N} (y_i - f(x_i; a))^2 = \sum_{m=1}^{M} \sum_{i: x_i \in R_m} (y_i - c_m)^2. \qquad (2.25)$$

It can be observed that for each $m$, the estimator $\hat{c}_m$ of $c_m$ that minimizes each sum of squares $\sum_{i: x_i \in R_m} (y_i - c_m)^2$ is always the average of the responses $y_i$ in the region $R_m$ such that $x_i \in R_m$, which is given by $\hat{c}_m = S(R_m; y_1, \ldots, y_N)$, where the function $S$ is given by

$$S(R; y_1, \ldots, y_N) = \frac{1}{|R|} \sum_{i: x_i \in R} y_i. \qquad (2.26)$$

The regions $R_m$ minimizing the sum of squares are computationally infeasible to find because of the exponential size of the search space [11, p. 151].

Instead, a greedy approach is used to find the regions $R_m$ when constructing a decision tree. This method is known as recursive binary splitting. Hence, to find the regions $R_m$ minimizing

$$\sum_{m=1}^{M} \sum_{i: x_i \in R_m} (y_i - \hat{c}_m)^2, \qquad (2.27)$$

the greedy approach starts at the root node with all of the data. The feature space $X$ is partitioned into a pair of half-planes $R_1(j, s)$ and $R_2(j, s)$ by identifying the splitting variable $j$ and split point $s$:

$$R_1(j, s) = \{ (X_1, \ldots, X_p) \mid X_j \le s \} \quad \text{and} \quad R_2(j, s) = \{ (X_1, \ldots, X_p) \mid X_j > s \}. \qquad (2.28)$$

The optimal variables $j$ and $s$ are found by solving the optimization problem given by

$$\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]. \qquad (2.29)$$

For a given $j$ and $s \in \mathbb{R}$, the inner optimization of equation (2.29) is solved by $\hat{c}_1 = S(R_1(j,s); y_1, \ldots, y_N)$ and $\hat{c}_2 = S(R_2(j,s); y_1, \ldots, y_N)$. For a given splitting variable $j$, an optimal split point $s$ can be found by scanning through the data [9, p. 307]. Repeating this for all $j$ yields the optimal pair $(j, s)$. When the optimal pair is found, the feature space is partitioned into two new regions and the splitting process is continued in each of these new regions. The splitting process continues until some stopping criterion is met, e.g. reaching the maximum depth of the tree.
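The sketch below illustrates the greedy search over $(j, s)$ in equation (2.29) for a single split: for each feature $j$ and each candidate split point $s$, the two half-plane means serve as $\hat{c}_1$ and $\hat{c}_2$, and the pair with the lowest total squared error is kept. This is a simplified illustration with made-up data, not the tree implementation used in the thesis.

```python
import numpy as np

def best_split(X, y):
    """Return the (feature index j, split point s) minimizing equation (2.29)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # Candidate split points: midpoints between sorted unique feature values.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # c1_hat and c2_hat are the region means (the function S above).
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best[0], best[1]

rng = np.random.default_rng(3)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 10.0, 0.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))   # should pick feature 0 with a split near 0.5
```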

The question that then arises is: how large of a tree should one grow? A too-large tree will lead to overfitting (high variance), while a too-small tree yields high bias. The tree size controls the model's complexity and should therefore be decided adaptively from the data.

The process described above results in one large tree $T_0$ being built. The large tree $T_0$ is then pruned using cost-complexity pruning in order to obtain a sparser model. The idea of cost-complexity pruning is to calculate a cost function over the terminal nodes. The cost function is given by [12]

$$R_\alpha(T) = R(T) + \alpha |T|, \qquad (2.30)$$

where

$$R(T) = \sum_{m=1}^{|T|} \sum_{i: x_i \in R_m} (y_i - \hat{c}_m)^2. \qquad (2.31)$$

Thus, $R(T)$ is the training error, and if $\alpha$ is the complexity cost per terminal node, then $R_\alpha(T)$ is a linear combination of the cost of the training error and the cost penalty for complexity. For each value of $\alpha$, one finds a subtree $T_\alpha \subseteq T_0$ which minimizes $R_\alpha(T)$, i.e. $T_\alpha = \underset{T \subseteq T_0}{\arg\min} \, R_\alpha(T)$. Choosing a small value of $\alpha$ gives a small penalty for having a large number of terminal nodes; thus, the subtree $T_\alpha$ will be large. Conversely, as $\alpha$ increases, the minimizing subtree will have fewer terminal nodes. Choosing a relatively large $\alpha$ results in a subtree that only contains the root node, so that the large tree has been entirely pruned.

The optimal subtree $T_\alpha \subseteq T_0$ is found by creating a sequence of subtrees, successively eliminating the leaves whose elimination leads to the smallest increase in the training error $R(T)$. From this sequence of subtrees, the optimal $\alpha$ is found by five- or tenfold cross-validation [9, p. 305–308].

Algorithm 1 describes the process of building a regression tree; in the algorithm, the function StoppingCondition(D) is the stopping criterion used when building a tree.

Algorithm 1 BuildTree(D): Building a Regression Tree
1: Input: Training data $D = \{(x_i, y_i)\}_{i=1}^{N}$
2: if StoppingCondition(D) then
3:   Break
4: end if
5: Find the optimal splitting variable $j$ and split point $s$ by solving the optimization problem:
6:   $(j, s) = \underset{j, s}{\arg\min} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right]$, where $c_1$ and $c_2$ are estimated by
7:   $\hat{c}_1 = S(R_1(j,s); y_1, \ldots, y_N)$ and $\hat{c}_2 = S(R_2(j,s); y_1, \ldots, y_N)$
8: Partition the region into a pair of half-planes $D_L$ and $D_R$:
9:   $D_L = R_1(j,s) = \{ (X_1, \ldots, X_p) \mid X_j \le s \}$
10:  $D_R = R_2(j,s) = \{ (X_1, \ldots, X_p) \mid X_j > s \}$
11: BuildTree($D_L$)
12: BuildTree($D_R$)
13: Output: A decision tree

2.6 Random Forest

The decision tree explained in Section 2.5 suffers from high variance [7, p. 316–317], which is one substantial problem with decision trees [9, p. 312]. One way of reducing the variance of a statistical learning method is to use bootstrap aggregation (bagging) [9, p. 282], which will now be introduced before describing the random forest method. Bagging reduces the variance by using multiple trees built on different training sets; more specifically, bootstrap sampling with replacement is used to create $B$ samples from the training set. A decision tree is trained on each bootstrapped sample individually, and the result obtained is the average of the predictions made by each decision tree. For a given data point $x_0$, the bagging method can be expressed as

$$\hat{f}_{\mathrm{bag}}(x_0; a_1, \ldots, a_B) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}(x_0; a_b), \qquad (2.32)$$

where $B$ is the number of bootstrapped samples and $\hat{f}(x_0; a_b)$ is a decision tree trained on the $b$th bootstrapped sample. If there exists a strong feature in the data set together with moderately strong features, then the majority of the trees will use this strong feature in the top split; this leads to most trees looking

similar and being highly correlated [7, p. 316–317]. A strong feature is a feature that correlates with the response variable [13]. This is a problem when applying bagging to decision trees, as averaging highly correlated trees does not lead to a large reduction of the variance [7, p. 320].

A random forest has the same structure as bagging in that trees are constructed on bootstrapped samples, but it offers an improvement over bagged trees. It overcomes the problem of highly correlated trees by decorrelating the trees. For each split in a tree, a random sample of $m$ features from the full set of $p$ features is considered. Then, on average $(p - m)/p$ of the splits will not consider the strong feature, and other features will get a chance to be in the top split [7, p. 320].
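A hedged sketch of the bagging-versus-random-forest distinction using scikit-learn: `RandomForestRegressor` exposes the number of bootstrapped trees $B$ as `n_estimators` and the per-split feature subsample $m$ as `max_features`; letting every split consider all $p$ features makes it behave like plain bagged trees. The data and parameter values are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 8))
y = X[:, 0] * 5.0 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

# Bagging-like: every split may consider all p features.
bagged = RandomForestRegressor(n_estimators=200, max_features=None, random_state=0)

# Random forest: each split considers a random subset of m features,
# which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=200, max_features=3, random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    model.fit(X, y)
    print(name, model.score(X, y))   # in-sample R^2, for illustration only
```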

2.7 Boosting

Boosting is similar to bagging in that it creates one predictive model from a combination of $B$ weak learners (trees), where a weak learner is only slightly more accurate than a random guess. A major difference between boosting and bagging is that boosting grows the trees sequentially, with the objective to learn from the previously built trees. Each tree is fitted on a modified version of the original data set, so that information from the previously fitted trees is used when fitting the current tree [7, p. 321–322]. From Section 2.5.1, note that the parameters of the $B$ weak learners are $a = \{a_b\}_{b=1}^{B}$, $a_b = \{(R_m, c_m)\}_{m=1}^{M_b}$, where the $R_m$ are the non-overlapping regions in the feature space and the $c_m$ are the fitted constant values in each region.

Suppose that a training set with $N$ observations is used, where each observation consists of $p$ features $x_i \in \mathbb{R}^p$ and a corresponding response variable $y_i$, $\{x_i, y_i\}$, and where the boosted tree model is a sum of $B$ weak learners. Then the prediction $\hat{y}_i$ is given by the weighted combination of the weak learners

$$\hat{y}_i = F(x_i; P) = \sum_{b=1}^{B} \beta_b f(x_i; a_b), \qquad (2.33)$$

where $f(x_i; a_b)$ is a weak learner, given by equation (2.24), and $\beta = \{\beta_b\}_{b=1}^{B}$ are the weights associated with the corresponding weak learners. Furthermore, $P = (a, \beta)$ is a set of parameters that is normally estimated by

minimizing the empirical risk, which is given by

$$\left(\hat{\beta}, \hat{a}\right) = \underset{\beta, a}{\arg\min} \; \sum_{i=1}^{N} L\left( y_i, \sum_{b=1}^{B} \beta_b f(x_i; a_b) \right), \qquad (2.34)$$

where $L(\cdot)$ is the squared loss function. The optimization problem described above requires the use of intensive numerical optimization techniques; consequently, one rather approximates the solution to equation (2.34) by using forward stagewise additive modeling (forward stagewise boosting) [9, p. 341–343]. The intuition behind the algorithm is to estimate the parameters of the trees in a stepwise manner. In each iteration $b \in \{1, 2, \ldots, B\}$, the boosted tree $F_b(x)$ is greedily constructed as the sum of the trees up to stage $b-1$, $F_{b-1}(x)$, plus a new weak learner $f(x; a_b)$ added at stage $b$, which gives $F_b(x) = F_{b-1}(x) + \beta_b f(x; a_b)$. Hence, keeping all the parameters $\{\beta_i, a_i\}_{i=1}^{b-1}$ of $F_{b-1}(x)$ fixed, the weight $\beta_b$ and parameters $a_b$ associated with $f(x; a_b)$ are estimated by solving the optimization problem

$$\left(\hat{\beta}_b, \hat{a}_b\right) = \underset{\beta, a}{\arg\min} \; \sum_{i=1}^{N} L\left( y_i, F_{b-1}(x_i) + \beta f(x_i; a) \right). \qquad (2.35)$$

The algorithm is a two-step process which is formally referred to as boosting [14]. Below is the algorithm which describes forward stagewise additive modeling.

Algorithm 2 Forward Stagewise Additive Modeling

1: Initialize $F_0(x) = 0$
2: for $b = 1$ to $B$ do
3:   (a) Compute $\left(\hat{\beta}_b, \hat{a}_b\right) = \underset{\beta, a}{\arg\min} \sum_{i=1}^{N} L\left( y_i, F_{b-1}(x_i) + \beta f(x_i; a) \right)$
4:   (b) Set $F_b(x) = F_{b-1}(x) + \hat{\beta}_b f(x; \hat{a}_b)$
5: end for

A disadvantage of forward stagewise additive modeling is that it uses the squared error loss function [9, p. 342–343]; in general, the squared error loss function is not optimal for all types of problems, for example classification problems [9, p. 343]. Five years after forward stagewise additive modeling was introduced, the AdaBoost algorithm was presented, which is equivalent to a forward stagewise additive model with an exponential loss function [9, p. 343–345]. The AdaBoost algorithm will be presented in the next section.

2.7.1 AdaBoost

One of the first boosting algorithms for solving classification problems is adaptive boosting (AdaBoost) (specifically, AdaBoost.M1), which was proposed by Freund and Schapire in 1995 [15]; AdaBoost.M1 can be interpreted as a forward stagewise additive model with an exponential loss function [9, p. 343–345]. AdaBoost is still one of the most popular and widely used boosting algorithms and has been applied in numerous fields [16]. The intuition behind the algorithm is to maximize the efficiency of each weak learner by using adaptive boosting; here, adaptive refers to the idea that no prior knowledge of the accuracies of the weak learners is needed. Instead, the algorithm adapts to these accuracies and generates a weighted combination of the weak learners, where the weight of each weak learner is a function of its accuracy [15]. More specifically, in AdaBoost each training sample receives a weight $w_i$ that is used when each weak classification tree is fitted on a modified version of the original data set, in contrast to forward stagewise additive modeling [9, p. 337–339]. Let $w_i$ be the sample weight that represents the relative importance of each sample and is used to compute the training error in each fit. In statistical terms, the sample weight $w_i$ represents an estimate of the sample distribution [17]. In each iteration, the weights are recalculated: the weight is increased for those samples that were not correctly classified and decreased for those that were correctly classified. Thus, as the procedure continues, each new tree is forced to focus on the samples that are most difficult to classify [18].

For solving regression problems, a slightly more complex variant of AdaBoost.M1 was proposed by Drucker: the AdaBoost.R2 algorithm [19]. Drucker conducted experiments using AdaBoost.R2 and obtained good results when solving regression problems. The key to AdaBoost is the reweighting of the observations that were not correctly classified. In the regression setting, the response variable $y_i$ is continuous rather than categorical. Thus, the prediction error of a tree will be a real-valued error $e_i' = |y_i - f(x_i; a_b)|$. The prediction error $e_i'$ can be arbitrarily large, which means that a mapping of the error to an adjusted error $e_i$ is needed in the reweighting step of the method. More specifically, the adjusted error is obtained by expressing each error in relation to the largest error $D = \max_{i=1,\ldots,N} e_i'$, so that the adjusted error is bounded in the interval $[0, 1]$. Thus, the adjusted error is obtained using a linear loss function, $e_i = e_i'/D$ [19].

If a weak learner $f$ is slightly more accurate than a random guess ($\epsilon_b < 0.5$), then the sum of the weighted error terms $\epsilon_b = \sum_{i=1}^{N} e_i^b w_i^b$ will be less than 0.5, and the error of AdaBoost declines exponentially fast in the number of rounds $B$; otherwise, if the error $\epsilon_b \ge 0.5$, the iteration is stopped. The degree to which the data sample $x_i$ is reweighted depends on how large the error of $f(x_i; a_b)$ is relative to the largest error in the sample. More specifically, the weights are updated using the weight-updating parameter $\beta_b$, which is computed as $\frac{\epsilon_b}{1 - \epsilon_b}$ and measures the confidence in the regressor. The updated weights are given by

$$w_i^{b+1} = \frac{w_i^b \, \beta_b^{\,1 - e_i^b}}{Z_b}, \qquad (2.36)$$

where $Z_b$ is a normalizing constant defined by $Z_b = \sum_{k=1}^{N} w_k^b \beta_b^{\,1 - e_k^b}$. The weights are increased for data points with a large error, i.e. $e_k^b \ge 0.5$; otherwise the weights are decreased. The process is repeated until all $B$ learners are built or $\epsilon_b \ge 0.5$. For each input $x_i$, the associated learner makes a prediction $f(x; a_b)$, $b = 1, \ldots, B$. The cumulative prediction $\hat{y}$ is obtained from the $B$ learners by calculating the weighted median, using $\ln(1/\beta_b)$ as the weight for tree $f(x; a_b)$:

$$\hat{f}(x) = \inf \left\{ y \in Y : \sum_{b: f(x; a_b) \le y} \ln\!\left(\frac{1}{\beta_b}\right) \ge \frac{1}{2} \sum_{b} \ln\!\left(\frac{1}{\beta_b}\right) \right\}. \qquad (2.37)$$

The AdaBoost.R2 algorithm has the disadvantage of not being capable of dealing with weak learners that have a prediction error larger than 0.5. Additionally, the algorithm is sensitive to noise and outliers, since the reweighting expression is proportional to the prediction error [20]. The main advantage of AdaBoost compared to the other boosting algorithms is that it does not have any parameters that have to be calibrated [20]. The AdaBoost.R2 algorithm is given by Algorithm 3.

Algorithm 3 AdaBoost.R2
1: Initialize the weight vector $w^1$, i.e. $w_i^{1} = 1/N$ for $1 \le i \le N$
2: for $b = 1$ to $B$ do
3:   (a) Train the $b$th weak regression tree $f(x; a_b)$ using Algorithm 1 with training data weighted according to $w_i^b$
4:   (b) Compute the adjusted error $e_i^b$ for each training sample:
5:     let $D_b = \max_{j=1,\ldots,N} |y_j - f(x_j; a_b)|$
6:     then $e_i^b = \frac{|y_i - f(x_i; a_b)|}{D_b}$
7:   (c) Compute the adjusted error of $f$:
8:     $\epsilon_b = \sum_{i=1}^{N} e_i^b w_i^b$
9:   if $\epsilon_b \ge 0.5$ then
10:    Set $B = b - 1$
11:    Break
12:  end if
13:  (d) Let $\beta_b = \frac{\epsilon_b}{1 - \epsilon_b}$
14:  (e) Update the weight vector:
15:    $w_i^{b+1} = \frac{w_i^b \beta_b^{\,1 - e_i^b}}{Z_b}$, where $Z_b$ is the normalizing constant
16:    $Z_b = \sum_{k=1}^{N} w_k^b \beta_b^{\,1 - e_k^b}$
17: end for
18: Output: The boosted tree $\hat{f}(x) = \inf \left\{ y \in Y : \sum_{b: f(x; a_b) \le y} \ln(1/\beta_b) \ge \frac{1}{2} \sum_b \ln(1/\beta_b) \right\}$
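For reference, scikit-learn ships an AdaBoost.R2-style regressor; the hedged sketch below fits it with the linear loss described above. The number of rounds and the toy data are illustrative placeholders, not the thesis configuration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

# loss="linear" corresponds to the adjusted error e_i = |y_i - f(x_i)| / D.
# The default weak learner is a shallow regression tree.
model = AdaBoostRegressor(n_estimators=100, loss="linear", random_state=0)
model.fit(X, y)
print(model.predict(X[:5]))
```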

These are the residuals, the gradient boosting will fit f(x; ab) to the residual y − Fb(x). As in the previous boosting methods explained, each Fb+1 aims to correct the errors of its predecessor Fb. The residuals can be interpreted as the 22 CHAPTER 2. BACKGROUND

negative gradient of some specified loss function L(·) with respect to Fb(x) [14]. The residuals for the bth iteration is

$$r_{ib} = -\left[ \frac{\partial L(y_i, F(x))}{\partial F(x)} \right]_{F(x) = F_{b-1}(x_i)}, \qquad (2.39)$$

and subsequently the next learner $f(x; a_b)$ is trained on the residual set $\{(x_i, r_{ib})\}_{i=1}^{N}$, where the parameter $a_b$ is obtained by estimating

$$\hat{a}_b = \underset{a_b}{\arg\min} \; \sum_{i=1}^{N} L\left( r_{ib}, \beta_b f(x_i; a_b) \right), \qquad (2.40)$$

i.e. the learner $f(x; a_b)$ is built using Algorithm 1 with the training set $\{(x_i, r_{ib})\}_{i=1}^{N}$. The weight $\beta_b$ for the new learner is computed by solving

$$\hat{\beta}_b = \underset{\beta \in \mathbb{R}}{\arg\min} \; \sum_{i=1}^{N} L\left( y_i, F_{b-1}(x_i) + \beta \cdot f(x_i; a_b) \right). \qquad (2.41)$$

Thus, after the learner that corrects the residuals and its weight $\hat{\beta}_b$ have been estimated, the model is updated by adding the latest learner to the ensemble model [14],

$$F_b(x) = F_{b-1}(x) + \hat{\beta}_b f(x; a_b). \qquad (2.42)$$

Gradient boosting is initialized with a model consisting of a constant value, $F_0(x) = \underset{\beta}{\arg\min} \sum_{i=1}^{N} L(y_i, \beta)$, and incrementally expands it in a greedy way. Algorithm 4 describes the gradient boosting method.

In the boosting methods gradient boosting, AdaBoost, and forward stagewise boosting, a weak learner is introduced to compensate for the imperfections of the existing learners, i.e. multiple learners are combined to improve the prediction accuracy. In gradient boosting the imperfections are identified by the gradients of the specified loss function, and in AdaBoost the imperfections are identified by the highly weighted data points. These imperfections tell us how to improve the model, in contrast to forward stagewise boosting, which only adds the weak learner to the ensemble model. AdaBoost is more efficient than gradient boosting in both memory consumption and training speed, as no gradients need to be computed; it only reweights the data samples. Gradient boosting is more flexible than AdaBoost in the sense that AdaBoost can be interpreted as a forward stagewise additive model with an exponential loss function, as mentioned in Section 2.7.1.

Algorithm 4 Gradient Boosting
1: Input: Training set $\{(x_i, y_i)\}_{i=1}^{N}$ and a differentiable loss function $L(y_i, f(x))$
2: Initialize the model with a constant value: $F_0(x) = \underset{\beta}{\arg\min} \sum_{i=1}^{N} L(y_i, \beta)$
3: for $b = 1$ to $B$ do
4:   (a) For $i = 1, 2, \ldots, N$ compute $r_{ib} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{b-1}(x)}$
5:   (b) Fit a weak learner $f(x; a_b)$ to the training set $\{(x_i, r_{ib})\}_{i=1}^{N}$ using Algorithm 1
6:   (c) $\hat{\beta}_b = \underset{\beta}{\arg\min} \sum_{i=1}^{N} L(y_i, F_{b-1}(x_i) + \beta f(x_i; a_b))$
7:   (d) $F_b(x) = F_{b-1}(x) + \hat{\beta}_b f(x; a_b)$
8: end for
9: Output: $F_B(x)$
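A minimal NumPy sketch of Algorithm 4 for the squared loss, where the negative gradient reduces to the ordinary residual $y - F_{b-1}(x)$. Shallow trees built with scikit-learn stand in for the weak learners of Algorithm 1, and the constant shrinkage factor is an illustrative choice rather than part of the algorithm as stated.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

B, shrinkage = 100, 0.1
F = np.full_like(y, y.mean())      # F_0: constant minimizing squared loss
learners = []

for b in range(B):
    residuals = y - F                             # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2)     # weak learner fitted to residuals
    tree.fit(X, residuals)
    learners.append(tree)
    F += shrinkage * tree.predict(X)              # F_b = F_{b-1} + beta_b * f_b

print("training RMSE:", np.sqrt(np.mean((y - F) ** 2)))
```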

2.7.3 Categorical Boosting: CatBoost

Categorical boosting (CatBoost) was introduced by Prokhorenkova et al. [21] to address the bias that arises in gradient boosting. They noted that using the $i$th training sample to estimate the gradient $r_{ib}$ (step 4 in Algorithm 4) could make it biased with respect to the model $F_b(x)$ (step 7 in Algorithm 4), since when the gradient $r_{ib}$ is estimated for the sample $x_i$, it is based on a model $F_b(x)$ that was built in the previous steps using all the training samples, including the $i$th sample and its corresponding target variable. Thus, they propose a solution to the problem: the model $F_b(x)$ needs to be approximated without the $i$th sample, as this makes the gradient unbiased with respect to it. They present a modification of gradient boosting that addresses this issue, referred to as ordered boosting. For each iteration $b$, a separate model $F_b^i(x)$ is trained without the $i$th training sample. This yields a model $F_b^i(x)$ that is unbiased with respect to the $i$th training sample. Hence, step by step, each model is updated in every stage $b$, yielding $N$ final models $\{F_B^i(x)\}_{i=1}^{N}$, which are the output of the algorithm, as shown in Algorithm 5 below.

Algorithm 5 Ordered Boosting
1: Input: $\{(x_i, y_i)\}_{i=1}^{N}$
2: Initialize $F_1^1(x) = 0, \ldots, F_1^N(x) = 0$
3: for $b = 1$ to $B$ do
4:   for $i = 1$ to $N$ do
5:     for $j = 1$ to $i - 1$ do
6:       (a) $g_j = -\left. \frac{\partial L(y_j, u)}{\partial u} \right|_{u = F_{b-1}^i(x_j)}$
7:     end for
8:     (b) Fit a weak learner $f(x; a_b)$ to the training set $(x_j, g_j)$ for $j = 1, \ldots, i-1$ as described in Section 2.5.1
9:     (c) $F_b^i(x) = F_{b-1}^i(x) + f(x; a_b)$
10:  end for
11: end for
12: Output: $F_B^1(x), \ldots, F_B^N(x)$

Note that the ordered boosting algorithm estimates the gradients sequentially using the training samples. This would introduce high variance in the estimated gradients related to the samples appearing early in the training set. To address this issue, Prokhorenkova et al. suggested that one should instead use

$G$ permutations of the training set, denoted by $\sigma_1, \ldots, \sigma_G$, that are sampled at each stage $b$ when building the models $F_b^i$.

This algorithm is computationally infeasible because one needs to train $N$ different models, which increases the complexity and memory requirements by a factor of $N$ [21]. The authors introduce a more efficient strategy to make its execution time more similar to the popular boosting techniques XGBoost and LightGBM [21]; thus, CatBoost utilizes a more efficient modification of the ordered boosting algorithm [21]. The specifics of this computationally efficient strategy are not covered in this thesis.
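As a hedged usage example, the sketch below fits a CatBoost regressor and declares which column is categorical, which is the feature the library handles natively; the parameter values and the toy data are placeholders and not the configuration used in the thesis.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "region": rng.choice(["north", "middle", "south"], size=n),  # categorical feature
    "area": rng.uniform(40, 200, size=n),
    "rooms": rng.integers(1, 8, size=n),
})
price = df["area"] * 1000 + df["rooms"] * 5000 \
        + (df["region"] == "south") * 20000 + rng.normal(scale=5000, size=n)

model = CatBoostRegressor(iterations=300, learning_rate=0.1, depth=6, verbose=0)
# cat_features names the categorical column, so CatBoost encodes it internally.
model.fit(df, price, cat_features=["region"])
print(model.predict(df.head(3)))
```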

2.7.4 XGBoost

XGBoost [22] is an abbreviation of extreme gradient boosting and, as the name suggests, XGBoost is based on the gradient boosting method. One of the advantages of XGBoost is its ability to scale well with large data sets and its faster computational speed during training due to parallel and distributed computing. Compared to gradient boosting, XGBoost also adds a regularization term to the cost function [22]. The learning of the additive functions used in the model is carried out by minimizing the following regularized objective:

$$\mathcal{L} = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{b=1}^{B} \Omega(f(x; a_b)), \qquad (2.43)$$

where $\Omega$ is defined as $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2$. The function $\Omega(f)$ penalizes the complexity of the model through the parameter $\gamma$, which penalizes the number of leaves $T$. The second term penalizes large leaf weights: each individual tree has a weight associated with each leaf, thus assigning a score to a leaf. The final model, which sequentially adds all trees, makes a cumulative prediction; hence the weight penalty prevents any single tree from having too large an influence on the cumulative prediction. The loss function $L$ measures the difference between the prediction $\hat{y}_i$ and the target variable $y_i$ [22]. Furthermore, let $\hat{y}_i$ be the prediction of the $i$th observation at the $b$th iteration; then $f(x_i; a_b)$ needs to be added in order to minimize the objective

$$\mathcal{L}_b = \sum_{i=1}^{N} L\left( y_i, F_{b-1}(x_i) + f(x_i; a_b) \right) + \Omega(f(x; a_b)), \qquad (2.44)$$

where the model is improved the most by choosing $f(x; a_b)$ greedily [22]. A second-order approximation is used to quickly optimize the objective in the general setting:

$$\mathcal{L}_b \simeq \sum_{i=1}^{N} \left[ L(y_i, F_{b-1}(x_i)) + g_i f(x_i; a_b) + \frac{1}{2} h_i f^2(x_i; a_b) \right] + \Omega(f(x; a_b)), \qquad (2.45)$$

where $\simeq$ denotes the asymptotic equivalence between equation (2.44) and equation (2.45). The first- and second-order gradient statistics are $g_i = \partial_{F_{b-1}(x_i)} L(y_i, F_{b-1}(x_i))$ and $h_i = \partial^2_{F_{b-1}(x_i)} L(y_i, F_{b-1}(x_i))$, respectively. Removing the constant term $L(y_i, F_{b-1}(x_i))$, expanding the $\Omega$ function, and letting $I_j$ denote the set of instances belonging to leaf $j$ gives the following

simplified objective function:

$$\mathcal{L}_b \simeq \sum_{i=1}^{N} \left[ g_i f(x_i; a_b) + \frac{1}{2} h_i f^2(x_i; a_b) \right] + \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2 = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T. \qquad (2.46)$$

The expression for the optimal weight $w_j^*$ of leaf $j$ can now be computed:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}. \qquad (2.47)$$

This shows that the weight is related to the gradients used to fit an individual weak learner and to the regularization term. The XGBoost paper by Chen and Guestrin [22] goes through all the intermediate steps; only a few results have been summarized in this section. The algorithm uses different techniques to increase efficiency in both memory consumption and training speed when building a weak learner. When the weak learner is built, the ensemble model is updated by adding the latest learner to the ensemble in a similar way as gradient boosting, $F_b(x) = F_{b-1}(x) + \beta_b f(x; a_b)$, where $\beta_b$ is calculated using the weights $w_j^*$ of the leaves $I_j$.
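A hedged sketch of fitting this kind of model with the xgboost Python package: `reg_lambda` and `gamma` correspond to the $\lambda$ and $\gamma$ penalties in $\Omega$ above, while the remaining values and data are placeholders rather than the thesis settings.

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 4 + X[:, 1] ** 2 + rng.normal(scale=0.3, size=1000)

model = XGBRegressor(
    n_estimators=300,     # number of boosting rounds B
    learning_rate=0.1,
    max_depth=6,
    reg_lambda=1.0,       # lambda: L2 penalty on leaf weights in Omega
    gamma=0.0,            # gamma: penalty per leaf in Omega
)
model.fit(X, y)
print(model.predict(X[:3]))
```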

2.7.5 LightGBM Light gradient boosting machine (LightGBM) [23] is a decision tree that uti- lizes the gradient boosting framework. It was introduced by Microsoft re- searchers as they thought that the already existing XGBoost method was inad- equate regarding the efficiency and scalability of the method when the feature dimension is high and data size is large [23].

Finding the best split points in the learning process of growing a decision tree is the most time-consuming part [23]. Most gradient boosting decision tree (GBDT) methods use the greedy algorithm, which enumerates every possible split on all the features. The algorithm finds the optimal splits but consumes a lot of memory and is inefficient during training [23]. Another algorithm that can be used instead is the histogram-based algorithm [23], which groups continuous features into a set of discrete bins

in order to construct feature histograms during training [23]. This algorithm helps with increasing efficiency in both memory consumption and training speed [23].

LightGBM grows a tree vertically, whereas other GBDT methods grow a tree horizontally. Thus, LightGBM grows the tree leaf-wise whereas other GBDT methods grow it level-wise (depth-wise). Figure 2.1b shows a leaf-wise tree growth and Figure 2.1a shows a level-wise tree growth [24]. This different approach to growing a tree, together with the histogram-based algorithm, makes LightGBM an effective method when handling large-scale data and many features, as these two techniques reduce the memory consumption and speed up the training process [23].

The boosted tree model is built using the same techniques as in XGBoost, where the model is a sum of B additive functions. LightGBM aims to reduce the complexity of the feature histograms built during training by downsampling the data and the features using two methods: gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB) [23]. GOSS is used as a sampling algorithm which reduces the number of training samples by retaining all samples with large gradients and randomly sampling among the samples with small gradients, giving the retained small-gradient samples a constant weight to compensate for those that are excluded. Hence, GOSS puts more emphasis on the under-trained samples without changing the data distribution [23]. EFB is used to combine mutually exclusive features in order to reduce the number of features in a high-dimensional space. The reason for downsizing the feature space is that high-dimensional data is often sparse, so a high percentage of the feature values are zero. Combining mutually exclusive features, which are features that do not take nonzero values concurrently, reduces the dimensionality of the data and simultaneously improves the training time [23].
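The sketch below illustrates the idea behind GOSS on a vector of per-instance gradients. The retention fractions a and b, the helper name, and the toy gradients are chosen for illustration; this is not LightGBM's internal implementation.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Keep the a*N instances with the largest |gradient|, randomly sample
    b*N of the remaining instances, and amplify the sampled instances'
    weights by (1-a)/b so the data distribution is approximately preserved."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))      # sort by |gradient|, descending
    top_k, rand_k = int(a * n), int(b * n)

    top_idx = order[:top_k]                     # large-gradient instances: all kept
    sampled_idx = rng.choice(order[top_k:], size=rand_k, replace=False)

    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[top_k:] *= (1 - a) / b              # compensate the down-sampled instances
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx), w[:5])
```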


Figure 2.1: (a) Illustrating leaf-wise tree growth. (b) Illustrating level-wise tree growth [25].

2.8 Hyper-Parameter Tuning

While training the models, their hyper-parameters are initially set to default values but need to be configured to their “best” values. Using the default parameter values may not yield the most optimal model; for example, the parameters that control the depth and the number of leaves in LightGBM affect the model’s accuracy, and choosing optimal values for these parameters will improve the accuracy of the model [26]. The parameters of the models are optimized using grid search. The idea of the method is quite simple and is presented below.

2.8.1 Cross-validation

In order to evaluate the performance of a predictive model \hat{f}(X) without using the same information in both the training and evaluation stage, which would make the results less trustworthy, the data set can be divided into three parts: training data, validation data, and test data [7, p. 176]. The training data is used to train the model, and then the trained model is used to predict the responses

for the observations in the validation set. By evaluating a set of models using the training and validation data sets, one retrieves the best model according to a predefined metric computed on the validation data. The test data is then only used for estimating the prediction performance of the best model. This method is called holdout cross-validation as the test data is held out until the model is trained [27].

The holdout method is the simplest cross-validation method. Unfortunately, it has the disadvantage that when hiding some observations from the training data, one risks not including essential data that the model needs to properly learn the relationship between X and Y [27, p. 708]. An improvement of the holdout method is K-fold cross-validation, which is described below. The advantage of this method is that it is less important how the data gets divided: each observation gets the chance to be in the validation data exactly once and in the training data K − 1 times. But this comes with the disadvantage that the training procedure needs to be repeated K times, hence it takes K times longer to evaluate the model [27, p. 708].

The test data is fixed, i.e. excluded from the validation and training data. In K-fold cross-validation, the data set containing both the validation and training data is divided into K mutually exclusive subsets (the folds) of roughly equal size. One of the folds is selected as the validation set and the remaining folds form the training set [28]. This process is repeated K times with a different fold used for validation each time, computing the validation error in each iteration. The most typical choice of K is either 5 or 10 [9, p. 241–242]. Figure 2.2 illustrates the 5-fold cross-validation process, where K = 5.

Figure 2.2: Illustrating 5-fold cross-validation on a data set. The test data is held out, while each of the five folds of the remaining data serves as validation set in exactly one iteration, yielding the validation errors e_1, ..., e_5.

Let \kappa : \{1, \ldots, N\} \mapsto \{1, \ldots, K\} be a mapping function that denotes the index of the fold to which observation i is assigned by the randomization. If \hat{f}^{-k}(x) represents the model that is trained with the kth fold held out as validation set, then the cross-validation estimate of the prediction error is given by

CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{-\kappa(i)}(x_i)\big),    (2.48)

where L(\cdot) represents the loss function for the model. Considering a set of models f(x, \alpha), where \alpha indexes the tuning parameters of the model, an exhaustive search can be executed with the purpose of finding the optimal value of \alpha. Let \hat{f}^{-k}(x, \alpha) represent the model with tuning parameter \alpha fitted with the kth fold as validation set; then the cross-validation estimate of the prediction error becomes

CV(\hat{f}, \alpha) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)\big).    (2.49)

The goal is to find the \alpha which minimizes the validation error. The tuning parameter that minimizes the validation error is denoted by \hat{\alpha} = \arg\min_{\alpha} CV(\hat{f}, \alpha) [9, p. 241–242]. This is more commonly known as hyperparameter tuning, where the optimal parameters \hat{\alpha} for a model can be found by iterating through every combination of a chosen subset of hyperparameter values. This type of hyperparameter tuning is also known as grid search. Then \hat{\alpha} represents the optimal combination of hyperparameters, the best model is denoted \hat{f}(X, \hat{\alpha}), and the prediction performance of the best model is evaluated using the test set. Algorithm 6 describes the process of hyperparameter tuning using grid search.

Algorithm 6 Hyperparameter tuning using grid search
1: Input: Training data \{(x_i, y_i)\}_{i=1}^{N}, specified parameter subset values, a predefined metric, number of folds K
2: Divide the training data into K folds, fold_1, ..., fold_K
3: Choose one combination of parameters from the set of all possible combinations of parameters
4: for i = 1 to K do
5:     (a) Train a model with the chosen combination of parameters on all folds except fold_i
6:     (b) Evaluate the performance of the model on fold_i
7: end for
8: Calculate the mean of the results obtained over all K folds
9: Repeat from step 3 until all combinations have been tested
10: Output: Optimal parameters that represent the model that yields the highest score according to the performance metric
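As an illustration of Algorithm 6, the sketch below performs a grid search with K-fold cross-validation using scikit-learn's KFold and a generic ridge regressor; the synthetic data and the parameter grid are placeholders and not the ones used later in the thesis.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}        # step 1: parameter subset
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # step 2: K folds

best_score, best_params = np.inf, None
for values in product(*param_grid.values()):            # step 3: one combination at a time
    params = dict(zip(param_grid.keys(), values))
    fold_errors = []
    for train_idx, val_idx in kf.split(X):               # steps 4-7: train on K-1 folds,
        model = Ridge(**params).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])                 # evaluate on the held-out fold
        fold_errors.append(np.sqrt(mean_squared_error(y[val_idx], pred)))
    cv_error = np.mean(fold_errors)                      # step 8: average over the folds
    if cv_error < best_score:
        best_score, best_params = cv_error, params

print(best_params, best_score)                           # step 10: optimal parameters
```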

2.9 Metrics of interest

With the aim of determining the predictive performance of the regression models, evaluation metrics are adopted to assess to what extent a model’s predicted target variable coincides with the true target variable. The models are evaluated by the root-mean-squared error (RMSE) and the mean absolute percentage error (MAPE). Booli also evaluates their models on three other metrics: the median absolute percentage error (MdAPE), the number of valuations that have an absolute percentage error below fifteen percent, and the mean of the absolute percentage error excluding the one percent of most extreme errors. The RMSE measures the square root of the average squared difference between the actual

value y_i and the estimated value \hat{y}_i, as shown by

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}.    (2.50)

The MAPE measures the average absolute percentage value of the differences between the actual value y_i and the estimated value \hat{y}_i, as shown by

MAPE = \frac{100}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i}.    (2.51)

MdAPE measures the median absolute percentage value of the differences between the actual value y_i and the estimated value \hat{y}_i, as shown by

MdAPE = \underset{i=1,\ldots,N}{\mathrm{median}} \left( 100 \cdot \frac{|y_i - \hat{y}_i|}{y_i} \right).    (2.52)
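A minimal NumPy implementation of the three error measures above; the function names and toy values are illustrative only.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def mdape(y_true, y_pred):
    return np.median(100.0 * np.abs(y_true - y_pred) / y_true)

y_true = np.array([1_500_000.0, 2_300_000.0, 900_000.0])
y_pred = np.array([1_450_000.0, 2_500_000.0, 950_000.0])
print(rmse(y_true, y_pred), mape(y_true, y_pred), mdape(y_true, y_pred))
```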

Chapter 3

Methods

In this chapter, the process of how the results are achieved is discussed: the acquisition of the data, the preprocessing of the data, and the implementation of the models. The purpose of this chapter is to explain the process in enough detail that it can be replicated.

3.1 Data

The data used in this thesis is described in this section. We first describe how the data was acquired from two different sources, which variables are included in the data set, how the data is preprocessed, and provide an overview of the missing data in the data set.

3.1.1 Overview of the available data

The data provided by Booli has been collected over the years through web scraping, by buying data that represents registrations of ownership (sv. lagfart) from Lantmäteriet, and by collecting data from SCB (Statistiska Centralbyrån). The data set consists of both quantitative and qualitative data. Table 3.1 below describes all the variables available in the data set provided by Booli.


Variable                 Description
listingId                The ID number for a specified property ad
transactionId            The ID number for a specified registration of ownership
residenceId              The unique ID of the residence
soldDate                 The date the residence was sold
soldPrice                The price that the residence was sold for in SEK
price                    The asking price for the residence in SEK
objectType               Categorical variable for the type of residence, where the options are villa, radhus, parhus, kedjehus
displayObjectType        A categorical variable that only shows one type of residence in the real-estate advertisement rather than multiple ones as in objectType
latitude                 The north-south geographic coordinate of the residence
longitude                The east-west geographic coordinate of the residence
rooms                    Number of rooms in the residence
livingArea               Living area of the residence in m^2
otherArea                The floor area of the residence in m^2
plotArea                 The plot area of the residence in m^2
constructionYear         Year of construction of the residence
distanceToWater          The distance to the nearest water body
distanceToOceanFront     The distance to the nearest part of the ocean
ownShore                 Categorical variable, where 1 represents access to own shore and 0 otherwise
sewer                    Categorical variable (municipal, own or none)
water                    Categorical variable (municipal, own, summer or none)
assessedValue            The assessment value of the property
assessedValueBuilding    The assessment value of the building
assessedValuePlot        The assessment value of the plot
assessmentPoints         The number of assessment points, describing the condition of the property
assessmentYear           The year of the latest assessment
county                   The county in which the residence is located
municipality             The municipality in which the residence is located

Table 3.1: Description of the variables available in the data set obtained from Booli.

Data from SCB has also been collected through their website [29], where the data set contains multipolygon GeoJSON objects that correspond to DeSO areas (demographic statistical areas). Table 3.2 below describes all the variables available in the data set provided by SCB.

The DeSO areas were introduced by SCB [29] to divide Sweden into 5984 smaller regions, where each region contains between 700 and 2700 inhabitants. The DeSOs are divided into three different categories: A, B, or C. Category A represents DeSOs located on the countryside, B represents DeSOs in urban areas that are not located in the central area of their municipality, and C represents DeSOs located in the central area of their municipality.

In this thesis, the category of interest is A as it contains the DeSOs that define the residences located on the countryside. The purpose of using the data set from SCB is to identify the residences on the countryside by mapping each residence transaction, via the geographical coordinates latitude and longitude, to its corresponding DeSO code. Then, one can filter out residences that do not belong to the countryside based on the category of the DeSO.

Variable    Description
deso        The ID of every DeSO area
desoCat     Categorical variable discussed above (countryside, urban area or city center)
geometry    GeoJSON object representing the multipolygon area that defines the boundaries of the DeSO area

Table 3.2: Description of the variables available in the data set obtained from SCB.

3.1.2 Preprocessing

The data needs to be preprocessed before building the models, as the preprocessing has a significant impact on the performance of a supervised machine learning algorithm [30]. A step-by-step description of how the data set is preprocessed before training the models is given below.

Geographical Data

The geospatial data described by the geographical coordinates latitude and longitude is represented in the WGS84 coordinate reference system (CRS), which is the default reference system used if nothing else is specified. The Swedish authority Lantmäteriet has created another CRS that they use rather than WGS84, referred to as SWEREF99 (Swedish Reference Frame 1999) [31]. SWEREF99 was developed in order to improve the positional accuracy for areas in Sweden, since WGS84 differs 70–80 cm from SWEREF99 and the distance increases with time [31]. Using the built-in method to_crs() in the Python library GeoPandas [32], the coordinates in the data set are transformed into the SWEREF99 system. Then, from these coordinates a Point object is created using the method Point() from the Python library shapely [33].
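A minimal sketch of this transformation, assuming SWEREF99 TM (EPSG:3006) is the intended target projection and that the coordinates are available in a pandas DataFrame; the toy coordinates are illustrative.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Assumed to come from the Booli data set; columns as in Table 3.1.
df = pd.DataFrame({"latitude": [55.60, 56.05], "longitude": [13.00, 14.15]})

# Build Point objects from (longitude, latitude) and declare the source CRS as WGS84.
geometry = [Point(lon, lat) for lon, lat in zip(df["longitude"], df["latitude"])]
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")

# Transform the coordinates; EPSG:3006 (SWEREF99 TM) is assumed here.
gdf = gdf.to_crs("EPSG:3006")
print(gdf.geometry.head())
```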

Residence Type

Each residence in the data set now has a Point object corresponding to its coordinates. Using the sjoin() method in the GeoPandas library [32], one can find the intersection between the Points and the multipolygons that correspond to DeSO areas. When finding the intersection between residences and DeSO areas, the variables in the SCB data set are merged together with the Booli data set.

Now the residences that do not belong to the countryside can be filtered out using the desoCat variable, where residences that do not belong to category A (A represents the countryside) are removed from the data set. In a similar way, only residences of the type villa are kept by filtering out the other types of residences based on the variable displayObjectType: if displayObjectType is not equal to villa, then that residence is removed from the merged data set.
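A sketch of the spatial join and filtering steps with toy stand-ins for the Booli and SCB data; the DeSO code, the geometries, and the predicate keyword (used by recent GeoPandas versions) are assumptions made for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-ins; in the thesis these come from Booli and SCB respectively.
residences = gpd.GeoDataFrame(
    {"displayObjectType": ["villa", "radhus", "villa"]},
    geometry=[Point(1, 1), Point(2, 2), Point(8, 8)],
    crs="EPSG:3006",
)
deso = gpd.GeoDataFrame(
    {"deso": ["0114A0010"], "desoCat": ["A"]},   # hypothetical DeSO code
    geometry=[Polygon([(0, 0), (5, 0), (5, 5), (0, 5)])],
    crs="EPSG:3006",
)

# Spatial join: attach the DeSO attributes to every residence whose Point
# falls inside a DeSO multipolygon.
merged = gpd.sjoin(residences, deso, how="inner", predicate="within")

# Keep only countryside residences (DeSO category A) of the type villa.
countryside_villas = merged[
    (merged["desoCat"] == "A") & (merged["displayObjectType"] == "villa")
]
print(countryside_villas)
```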

KT-Filter

The data needs to be KT-filtered, which is a standard procedure that Booli uses when preprocessing the data; the purpose is to exclude residences that are under- or overvalued. The filtering is done by dividing soldPrice by assessedValue, and if this factor is less than 0.7 or greater than 4, then that residence is removed from the data set.
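A minimal sketch of the KT filter on a toy pandas DataFrame with the soldPrice and assessedValue columns.

```python
import pandas as pd

df = pd.DataFrame({
    "soldPrice": [2_000_000, 500_000, 9_000_000],
    "assessedValue": [1_500_000, 1_000_000, 2_000_000],
})

ratio = df["soldPrice"] / df["assessedValue"]
# Keep residences whose sold-price-to-assessed-value ratio lies in [0.7, 4].
df_filtered = df[(ratio >= 0.7) & (ratio <= 4)]
print(df_filtered)
```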

Adjusted residence price

The date when a residence is sold influences the residence price; e.g. a residence sold five years ago cannot be compared to a residence sold recently. Thus, in order to make it possible to compare residences sold in different time frames, one needs to calculate the index adjusted residence price. The index adjusted residence price is based on the soldDate and soldPrice variables, where the price of the residence is adjusted using Booli’s own price index (SBAB Booli Housing Market Index) [34]. The price index is obtained through the Booli API [35] using the Python library requests, which has the built-in method requests.get() that makes it very convenient for the user to send HTTP requests [36]. In order to adjust the historical residence prices, the most recent index value (Index) is divided by the historical index value corresponding to the month that the residence was sold (Index_historic). This yields the price changing factor that is then multiplied with the historical residence price. Hence, the index adjusted residence price is calculated by soldPriceDiscounted = (Index / Index_historic) · soldPrice. The variable soldPrice is then deleted from the data set as the response variable soldPriceDiscounted is added to the data set.
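A sketch of the index adjustment, assuming the monthly index values have already been retrieved from the Booli API (e.g. via requests.get()) and stored in a pandas Series keyed by month; the index values and dates are toy numbers.

```python
import pandas as pd

# Assumed to be retrieved from the Booli API; toy values here.
index_by_month = pd.Series({"2018-06": 210.0, "2019-11": 225.0, "2020-05": 240.0})
latest_index = index_by_month.iloc[-1]

df = pd.DataFrame({
    "soldPrice": [2_000_000, 3_500_000],
    "soldDate": ["2018-06-15", "2019-11-03"],
})

# soldPriceDiscounted = (Index / Index_historic) * soldPrice
historic_index = pd.to_datetime(df["soldDate"]).dt.strftime("%Y-%m").map(index_by_month)
df["soldPriceDiscounted"] = latest_index / historic_index * df["soldPrice"]
df = df.drop(columns=["soldPrice"])
print(df)
```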

Missing Data

An analysis of missing values in the data set was performed; the variables that contain not-a-number (NaN) values are shown in Figure 3.1 with

the corresponding percentage of missing values for each variable. It can be noted that the variables rooms, distanceToOceanFront, price, and listingId have a large percentage of missing values. The variables price and listingId are removed from the data set as they are redundant when building the models; in any case, the variable price is removed as it is strongly correlated with the response variable soldPriceDiscounted. The variables rooms and distanceToOceanFront have a large percentage of missing values, and the missing values in these variables are handled by replacing the NaN values with the number −1. The remaining variables that have NaN values are handled by replacing the NaN values with the mean value of that variable in the corresponding DeSO area.
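A sketch of the missing-value handling on a toy DataFrame; the column subset is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "deso": ["A1", "A1", "A2", "A2"],
    "rooms": [4.0, np.nan, 5.0, np.nan],
    "distanceToOceanFront": [np.nan, 1200.0, np.nan, 300.0],
    "livingArea": [120.0, np.nan, 95.0, 150.0],
})

# Flag the heavily missing variables with -1.
df[["rooms", "distanceToOceanFront"]] = df[["rooms", "distanceToOceanFront"]].fillna(-1)

# Replace the remaining NaN values with the mean of the variable within each DeSO area.
df["livingArea"] = df.groupby("deso")["livingArea"].transform(lambda s: s.fillna(s.mean()))
print(df)
```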

Derived Variables

From the soldDate variable a new variable is derived, because the current variable describes the date that a residence was sold in the date format (yyyy-mm-dd) as a datetime object, a data type that cannot be handled by the models. Thus, a new numerical variable delta_Date is obtained that describes the difference in days between when a residence was sold and the reference point, which is the oldest sale date in the data set. The variable soldDate is then removed as it no longer serves any purpose. A new variable totalArea that represents the total area of a residence is derived as the sum of the variables livingArea, otherArea, and plotArea.
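A sketch of the two derived variables on a toy DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "soldDate": pd.to_datetime(["2015-03-01", "2019-08-20", "2020-05-05"]),
    "livingArea": [120.0, 95.0, 150.0],
    "otherArea": [20.0, 0.0, 35.0],
    "plotArea": [900.0, 1500.0, 700.0],
})

# delta_Date: days since the oldest sale in the data set.
df["delta_Date"] = (df["soldDate"] - df["soldDate"].min()).dt.days
df = df.drop(columns=["soldDate"])

# totalArea: sum of living, other, and plot area.
df["totalArea"] = df["livingArea"] + df["otherArea"] + df["plotArea"]
print(df)
```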

Figure 3.1: Illustrating which variables in the data set have NaN values and the corresponding percentage of missing values.

Feature Scaling

Feature scaling is a method used to normalize the variables of the data set; it is also known as data normalization [37]. RobustScaler is used to scale the numerical variables in the data set; it is a scaling algorithm that is more robust to outliers compared to other scaling algorithms [38, p. 52]. The variables are scaled according to the formula (x_{ij} - Q_1(x_i)) / (Q_3(x_i) - Q_1(x_i)), which uses the interquartile range (IQR). The IQR is a measure of statistical dispersion between the upper and lower quartiles, defined as IQR = 75th percentile − 25th percentile [38, p. 52–53].
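A sketch of robust scaling with scikit-learn's RobustScaler. Note that RobustScaler centers on the median by default rather than on the first quartile, so its formula differs slightly from the one quoted above; the sketch shows the default behaviour on toy data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[120.0, 900.0],
              [95.0, 1500.0],
              [150.0, 700.0],
              [3000.0, 90000.0]])   # the last row is an outlier

# Default behaviour: subtract the median and divide by the interquartile range,
# so the outlier has limited influence on the scaling.
scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```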

Label Encoding

The categorical variables in the data set need to be converted into numerical variables before training the models. The most popular way of encoding categorical variables is through one-hot encoding. Each categorical variable is converted into one binary column per possible value, where each column indicates the presence of that value in the original variable, as shown in Figure 3.2.

Color     Green  Red  Yellow
Green     1      0    0
Green     1      0    0
Red       0      1    0
Yellow    0      0    1
Red       0      1    0

Figure 3.2: The values in the original categorical variable are Green, Red and Yellow. Distinct columns are created for each possible value, and in every place where the original value was e.g. Green, the number 1 is put in that column.
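A sketch of the encoding using pandas get_dummies, reproducing the layout of Figure 3.2.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Green", "Green", "Red", "Yellow", "Red"]})

# One binary column per distinct value of the original categorical variable.
one_hot = pd.get_dummies(df["Color"])
print(one_hot)
```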

Log-Transformation of the Response Variable

The distribution of the response variable soldPriceDiscounted is shown in Figure 3.3 together with the Q-Q (quantile-quantile) plot of the adjusted house price. The response variable soldPriceDiscounted is log-transformed as some models do not handle non-normally distributed data well (e.g. linear regression, which assumes normally distributed data). The response variable is also log-transformed because it is a standard technique that Booli uses when evaluating models. Figure 3.4 shows the distribution of the response variable after the log transformation, together with the Q-Q plot.


Figure 3.3: (a) Shows the plot of the distribution of the adjusted residence price soldPriceDiscounted and (b) shows the Q-Q plot of the adjusted residence price.


Figure 3.4: (a) Shows the plot of the normalized distribution of the log-transformed adjusted residence price log(soldPriceDiscounted) and (b) shows the Q-Q plot of the log-transformed adjusted residence price.

Data Transformation

The Box-Cox transformation transforms a continuous variable into an approximately normal distribution by mapping the variable using the following set of transformations [39, p. 53]:

y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(x), & \lambda = 0 \end{cases}    (3.1)

The value of \lambda is found iteratively by evaluating different values in the range from −3.0 to 3.0, where the optimal value is obtained when the transformed variable is as close as possible to the normal distribution [39, p. 53].
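A sketch of the Box-Cox transformation using SciPy, which estimates λ by maximum likelihood rather than by the grid search over [−3, 3] described above; the toy data is illustrative.

```python
import numpy as np
from scipy import stats

# Skewed, strictly positive data (Box-Cox requires positive values).
x = np.random.default_rng(0).lognormal(mean=14.0, sigma=0.5, size=1000)

# SciPy estimates lambda by maximum likelihood, an alternative to the
# grid search over [-3, 3] described in the text.
x_transformed, lam = stats.boxcox(x)
print(lam)
```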

Whitening is a linear transformation method that transforms a vector of random variables with a known covariance matrix into a set of new variables with identity covariance [40]. The main advantage of whitening is the decorrelation of the data set, which makes the random variables separable from each other [38, p. 61]. Assume that the data set has covariance matrix C; then the whitening transform W needs to satisfy W^T C W = I [40].
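A sketch of one possible whitening transform (ZCA whitening, W = C^{-1/2}), which satisfies the condition above; the toy data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: 500 samples, 3 features.
A = rng.normal(size=(3, 3))
X = rng.normal(size=(500, 3)) @ A.T

Xc = X - X.mean(axis=0)                       # center the data
C = np.cov(Xc, rowvar=False)                  # sample covariance matrix

# ZCA whitening: W = C^{-1/2}, so that W^T C W = I.
eigvals, eigvecs = np.linalg.eigh(C)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

Z = Xc @ W                                    # whitened data
print(np.round(np.cov(Z, rowvar=False), 3))   # approximately the identity matrix
```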

Correlation Analysis

Figure 3.5 shows the correlation between the continuous variables and the response variable in the data set. It can be noted that the variables assessedValue, assessedValueBuilding, and assessedValuePlot have a strong correlation, since assessedValue is just the sum of the other two variables. Figure 3.6 shows the ten continuous variables that have the strongest correlation with the response variable.

It was mentioned in Section 2.3.2 that some regressors will be set to zero, i.e. excluded from the model, which makes lasso regression a great method for feature selection. However, mainly decision trees are evaluated, and these have a “built-in” feature selection. It was also mentioned in Section 2.6 that strong features are used in the top split while redundant variables end up at the bottom of the tree. Hence, no separate feature selection is performed on the data set, because all models evaluated except ridge regression have feature selection embedded in the method [41].

Figure 3.5: Heat map between numerical variables in the data set.

Figure 3.6: Heat map between the ten numerical variables that correlate most strongly with the residence price in the data set.

3.2 Model Implementation

The data set was randomly split into a training set (70%) and a test set (30%) in a stratified manner to preserve the proportion of DeSO areas in both sets. A CSV file was received from Booli that contained the data they had used when they built their latest model and the performance of their model on each data instance. The test set was then filtered by taking the intersection of the data set and the CSV file based on the variable transactionId, so that one obtains the transactions of residences that occur in both our data set and the CSV file. A stratified random subset of 3000 rows was then drawn from the training set. This subset of the training set was used to identify the best hyper-parameters for the given supervised learning models through three-fold cross-validation. The entire training set and the best hyper-parameters were then used to obtain the final models. The performance of each model was then evaluated on the test set.
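A minimal sketch of the stratified 70/30 split on the DeSO code using scikit-learn's train_test_split; the toy DataFrame and column subset are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "deso": ["A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"],
    "totalArea": [900, 1100, 1000, 2000, 2100, 1900, 500, 600, 550],
    "soldPriceDiscounted": [1.2e6, 1.4e6, 1.3e6, 2.5e6, 2.6e6, 2.4e6, 0.9e6, 1.0e6, 0.95e6],
})

# Stratify on the DeSO code so both sets preserve the proportion of DeSO areas.
train_df, test_df = train_test_split(
    df, test_size=0.30, stratify=df["deso"], random_state=0
)
print(len(train_df), len(test_df))
```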

3.2.1 Hyper-Parameter Tuning of the Models

The hyper-parameters of the models are obtained using grid search with 3-fold cross-validation. The Python library scikit-learn has a built-in class GridSearchCV, which was used to find the optimal hyper-parameters of the models.
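As an illustration of this setup, the sketch below runs GridSearchCV with a LightGBM regressor over a reduced version of the grid in Table 3.7; the synthetic training data and the reduced grid are placeholders and not the ones used in the thesis.

```python
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_regression(n_samples=3000, n_features=10, noise=10.0, random_state=0)

# Reduced version of the grid in Table 3.7, purely for illustration.
param_grid = {
    "n_estimators": [3000, 5000],
    "learning_rate": [0.01, 0.05],
    "max_depth": [5, 10],
    "num_leaves": [5, 10],
}

search = GridSearchCV(
    estimator=LGBMRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,                       # 3-fold cross-validation, as in the thesis
)
search.fit(X_train, y_train)
print(search.best_params_)
```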

Both the lasso and ridge regression models have only one hyper-parameter to tune, which is the regularization term alpha for each model. Table 3.3 shows the type of hyperparameter that was tuned and the magnitude of the regularization term i.e. the size of the penalty term.

Hyper-parameter   Values                    Description
alpha             [0.01, 0.1, 1, 10, 100]   Magnitude of the regularization term

Table 3.3: Hyper-parameter set for lasso and ridge regression

For the gradient boosting model, the number of boosting iterations, step size of the gradient boosting algorithm and the maximum tree depth were tuned as shown in Table 3.4.

Hyper-parameter   Values               Description
n_estimators      [3000, 5000, 7000]   Number of boosting iterations that will be performed
learning_rate     [0.01, 0.05, 0.8]    Step size of the gradient boosting algorithm
max_depth         [5, 10, 15, 20]      Maximum depth of a tree

Table 3.4: Hyper-parameter set for gradient boosting regression

For the random forest model, the number of trees grown in the forest, the maximum tree depth, and the minimum number of samples that should be in a leaf before making a split were tuned as shown in Table 3.5.

Hyper-parameter     Values               Description
n_estimators        [3000, 5000, 7000]   Number of trees grown in the forest
max_depth           [5, 10, 15, 20]      Maximum depth of a tree
min_samples_split   [2, 5, 10, 15]       Minimum number of samples required for a split in a leaf

Table 3.5: Hyper-parameter set for random forest

For the AdaBoost regression model, the number of boosting iterations and the step size of the boosting algorithm were tuned as shown in Table 3.6.

Hyper-parameter   Values               Description
n_estimators      [3000, 5000, 7000]   Number of boosting iterations that will be performed
learning_rate     [0.01, 0.05, 0.8]    Step size of the boosting algorithm

Table 3.6: Hyper-parameter set for AdaBoost regression

For the LightGBM model, the number of boosting iterations, step size of the gradient boosting algorithm, maximum tree depth, maximum number of leaves in one tree, minimum number of samples in one leaf, and minimum Hessian sum in one leaf were tuned as shown in Table 3.7.

Hyper-parameter    Values               Description
n_estimators       [3000, 5000, 7000]   Number of boosting iterations that will be performed
learning_rate      [0.01, 0.05, 0.8]    Step size of the gradient boosting algorithm
max_depth          [5, 10, 15, 20]      Maximum depth of a tree
num_leaves         [5, 10, 15, 20]      Maximum number of leaves in one tree
min_data_in_leaf   [20, 50, 80]         Minimum number of samples in one leaf
min_child_weight   [0.01, 1, 5, 10]     Minimum Hessian sum in one leaf

Table 3.7: Hyper-parameter set for LightGBM

For the XGBoost model, the number of boosting iterations, step size of the gradient boosting algorithm, maximum tree depth, and maximum number of leaves in one tree were tuned as shown in Table 3.8.

Hyper-parameter   Values               Description
n_estimators      [3000, 5000, 7000]   Number of boosting iterations that will be performed
learning_rate     [0.01, 0.05, 0.8]    Step size of the gradient boosting algorithm
max_depth         [5, 10, 15, 20]      Maximum depth of a tree
num_leaves        [5, 10, 15, 20]      Maximum number of leaves in one tree

Table 3.8: Hyper-parameter set for XGBoost

For the CatBoost model, the number of boosting iterations, step size of the gradient boosting algorithm, maximum tree depth, and magnitude of the ridge regularization term were tuned as shown in Table 3.9.

Hyper-parameter   Values               Description
iterations        [3000, 5000, 7000]   Number of boosting iterations that will be performed
learning_rate     [0.01, 0.05, 0.8]    Step size of the gradient boosting algorithm
depth             [5, 10, 15, 20]      Maximum depth of the tree
l2_leaf_reg       [2, 10, 20]          Magnitude of the ridge regularization term

Table 3.9: Hyper-parameter set for CatBoost

Chapter 4

Results

In this chapter, the results of the models will be presented and discussed.

Figure 4.1 shows the RMSE score on the train set when evaluating different scaling and transformation methods. It can be noted that using no transformation at all yields approximately the same RMSE score as the other scaling and transformation methods: robust scaling, Box-Cox, and whitening transformation. A combination of the Box-Cox and whitening transformations was also evaluated and gave a better score for XGBoost, but simultaneously worsened the score for both AdaBoost and random forest. Similar results were obtained for the test set, as shown in Figure 4.2. Henceforward, no scaling or transformation is applied to the data set as it did not contribute to major improvements in decreasing the RMSE score.

Figure 4.1: Barplot of the RMSE score on the train set when evaluating different scaling and transformation methods.


Figure 4.2: Barplot of the RMSE score on the test set when evaluating different scaling and transformation methods.

Different loss functions can be used by the models. Figure 4.3 shows the MAPE score on both the train and test set when using the mean squared error (MSE) and the MAPE as loss functions for LightGBM and CatBoost. Both loss functions yield approximately the same MAPE score, although CatBoost had a marginally better training score with the MAPE loss function, a result that did not carry over to the test set. Henceforward, MSE is used as the loss function for the models as similar results are obtained for the different loss functions.


Figure 4.3: (a) Shows the MAPE score when using MSE and MAPE as loss function for LightGBM and CatBoost on the train set. (b) Shows the same as (a) but for the test set.

Figures 4.4 and 4.5 show the RMSE score on the train and test set, respectively, when

evaluating whether sample weights (SW) can improve the accuracy of the models. The sample weight is based on the number of transactions in each DeSO area, and it can be concluded that it does not matter whether sample weights are used. Hence, going forward, no sample weights are used when building or evaluating the models.

Figure 4.4: Barplot of the RMSE score on the train set when evaluating sample weights.

Figure 4.5: Barplot of the RMSE score on the test set when evaluating sample weights.

To evaluate the performance of the different models, the RMSE, MAPE, and

MdAPE metrics are used; Figures 4.6 to 4.8 show the performance of the models on both the train and test set for each respective metric. The best results are obtained for the LightGBM model, but similar results are obtained for both the XGBoost and gradient boosting models. In Figure 4.6 it can be noted that the Booli model performs worse than the top-performing models when evaluating the performance on the test set based on the RMSE metric. The performance of the LightGBM model was overall better than that of the Booli model, with an RMSE score of 0.330 compared to 0.358 for the Booli model. Nevertheless, when using the MdAPE metric the Booli model beats the rest of the models, as shown in Figure 4.8. In Table 4.1, it can be seen that the Booli model has the highest number of valuations that have an absolute percentage error below fifteen percent. Looking at the second metric in Table 4.1 (the mean of the absolute percentage error excluding the one percent of extreme errors), LightGBM has the best score.

Figure 4.6: Barplot of the RMSE score on the train and test set.

Figure 4.7: Barplot of the MAPE score on the train and test set.

Figure 4.8: Barplot of the MdAPE score on the train and test set.

Model                        Valuations with absolute       Mean absolute percentage error
                             percentage error below 15%     excluding the 1% most extreme errors
Ridge                        1919                           31.78%
Lasso                        1869                           32.79%
GradientBoostingRegressor    2305                           26.52%
RandomForestRegressor        2122                           28.98%
AdaBoostRegressor            2033                           29.13%
LGBMRegressor                2276                           26.11%
XGBRegressor                 2275                           26.24%
CatBoostRegressor            2240                           26.66%
Booli Model                  2374                           28.06%

Table 4.1: Performance of the models evaluated with metrics used by Booli.

When looking at the histogram shown in Figure 4.9, it can be noted that the Booli model has a larger number of predictions with a large percentage error compared to the LightGBM model, which has fewer large percentage errors. This results in the LightGBM model obtaining a better MAPE score, as shown in Figure 4.7, since the large errors in the Booli model increase its overall MAPE score.

Figure 4.9: Histogram of the percentage error for the Booli and LightGBM models, where the black vertical line represents 15%.

Figures 4.10 and 4.11 show the performance of LightGBM, the Booli model, and XGBoost on the different DeSO areas that represent the countryside in South Sweden. The results indicate that all three models have approximately the same percentage error when evaluating the residences on Gotland. Both LightGBM and XGBoost have fewer red-highlighted DeSO areas, which indicates that the models perform better on average, which Figure 4.7 also confirms.


Figure 4.10: a) Shows the mean percentage error in each DeSO area using LightGBM. b) Shows the mean percentage error in each DeSO area using the Booli model.

Figure 4.11: The map shows the mean percentage error in each DESO area using XGBoost.

Figure 4.12a shows the scatterplot of the spread of house prices in each DeSO area vs the MAPE when using LightGBM; the correlation coefficient is equal to −0.141. The spread of house prices is just the difference between the maximum and minimum house price in each DeSO area. Figure 4.12b shows the scatterplot of the normalized spread of house prices in each DeSO area vs the MAPE in that DeSO area using LightGBM; the correlation coefficient is equal to −0.135. The normalization is done by dividing the spread of house prices by the number of transactions in each DeSO area. Similar scatterplots are computed using the Booli model and lasso; Figures 4.13 and 4.14 show the scatterplots for the respective models. It can be seen that the Booli model has almost the same correlation coefficients as LightGBM in both the normalized and non-normalized case, while there is no clear relationship between the spread of house prices and the error, as the correlation coefficients are close to zero in both cases.


Figure 4.12: a) Shows the scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area. b) Shows the scatterplot of the normalized spread of house prices in DeSO area vs MAPE in DeSO area, where the LightGBM model is used.


Figure 4.13: a) Shows the scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area. b) Shows the scatterplot of the normalized spread of house prices in DeSO area vs MAPE in DeSO area, where the Booli model is used.


Figure 4.14: a) Shows the scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area. b) Shows the scatterplot of the normalized spread of house prices in DeSO area vs MAPE in DeSO area, where the lasso model is used.

Figures 4.15 and 4.16 and the figures in Appendix A show the scatterplots of a subset of variables vs the absolute percentage error using the LightGBM model. It can be noted from these figures that the absolute percentage error of the LightGBM model is fairly uncorrelated with these variables in the data set.


Figure 4.15: a) Shows the scatterplot of the variable assessedValue vs absolute percentage error. b) Shows the scatterplot of the variable assessedValueBuilding vs absolute percentage error, where the LightGBM model is used.


Figure 4.16: a) Shows the scatterplot of the variable assessedValuePlot vs absolute percentage error. b) Shows the scatterplot of the variable assessmentPoints vs absolute percentage error, where the LightGBM model is used.

Chapter 5

Discussion

In this chapter, the results from Chapter 4 will be discussed. The models will be compared against each other, and the optimal model will be compared to the benchmark.

5.1 Results Evaluation

The results are evaluated based on the evaluation metrics used and by comparing the optimal models to the benchmark.

5.1.1 Model Comparison

Comparing the different models implemented, one can clearly see that many of the boosted decision trees outperform the shrinkage methods; AdaBoost, however, seems to perform worse than the shrinkage methods when looking at Figures 4.4 and 4.5, which show the RMSE score on the train and test set, respectively. The best performing model is the LightGBM model when looking at the different evaluation metrics used, but Table 4.1 shows that the gradient boosting model performs better than the LightGBM model in terms of the number of valuations that have an absolute percentage error below fifteen percent. On the other hand, for the metric mean of the absolute percentage error excluding the one percent of extreme errors, the LightGBM model obtained a more favorable score than the gradient boosting model. XGBoost obtains results similar to LightGBM, but the LightGBM model provides slightly more accurate predictions.

It seems that the reason why the models do not give zero errors is partly


that not all information about the residences is available when training the model. One could have thought that the spread in residence prices could explain the errors that the models produce, but this turned out not to be the case. Furthermore, one could also have thought that expensive or cheap residences are more difficult to predict, but looking at Figures 4.15 and 4.16, the variables assessedValue, assessedValueBuilding, assessedValuePlot, and assessmentPoints, which describe the assessment value and the condition of the residence, appear to be uncorrelated with the errors. Thus, an expensive or cheap residence is as difficult to predict as the other residences. Similar conclusions can be drawn from the other scatterplots in Appendix A, where all the variables are uncorrelated with the errors.

5.1.2 Benchmark Comparison

The LightGBM model is benchmarked against the Booli model. The Booli model provides more predictions within the 15% percentage-error interval, as seen in the histogram in Figure 4.9, but at the same time the Booli model also yields larger erroneous predictions compared to LightGBM. LightGBM has smaller mean errors when evaluating the models on the RMSE and MAPE metrics, while for the MdAPE the models obtain a similar score. The LightGBM model produces fewer prediction errors of large magnitude compared to the Booli model, which gives it lower RMSE and MAPE scores, as the large prediction errors in the Booli model increase its overall mean error.

It does not seem that one can explain the errors by the spread in residence prices. Figures 4.12 to 4.14 showed the scatterplots of the normalized and non-normalized spread of residence prices in each DeSO area vs the MAPE in that DeSO area. The LightGBM and the Booli model obtained similar correlation coefficients for both the normalized and non-normalized case, while the shrinkage method, the lasso model, had no correlation between the spread and the MAPE in each DeSO area. Normalization can in principle help with explaining the errors in terms of the spread, but in all likelihood the errors are not due to the spread, whether normalized with regard to the number of transactions or not.

Looking at Figure 4.10b, it seems that the Booli model has difficulty with predicting the residence prices in some DeSO areas, as those DeSO areas in the map are highlighted in red, indicating that the MAPE score in those areas is

very high. The errors of the LightGBM and XGBoost models shown in Figures 4.10a and 4.11 seem more geographically spread out compared to the errors of the Booli model; there is something about the different red-highlighted areas that the Booli model has problems with when predicting the residence prices.

Chapter 6

Conclusion

In this chapter, the research questions will be addressed, together with a recommendation on which model to use and why.

6.1 Answering Research Questions

The research questions to be addressed are presented below:

• For a chosen set of boosted decision trees, which boosted decision tree performs best when valuing residences on the countryside?

• How is the performance of the boosted decision trees compared to the more traditional methods ridge regression, lasso regression, and ran- dom forest?

• Can the boosted decision trees yield a better valuation algorithm than the valuation algorithm used by Booli? If so, to what extent?

The overall results showed that there is not a single best boosted decision tree model, as the gradient boosting model obtains better results than the LightGBM model on one of the evaluation metrics. If a model had to be chosen as the best one, it would be the LightGBM model, as it had the best overall performance. All the boosted decision trees except AdaBoost obtained better performance than the traditional methods. The best boosting model yields a better valuation than the Booli model: the Booli model yields larger percentage errors than LightGBM, while at the same time having more valuations with an absolute percentage error below fifteen percent.


6.2 Future work

There is clearly future work to be done on further development of the models by integrating new variables, in order to make more accurate predictions. The data available for the scope of this thesis did not have any variables that described more specific information about the condition of the residences, such as the number of bedrooms, bathrooms, the presence of a garage, or the condition of the bathroom or kitchen. These types of variables could help increase the prediction accuracy of the models as they give more information about the residence. Derived variables, such as the distance from the closest city center or main road to a residence on the countryside, could also affect the quality of the models’ predictions.

The scope of this thesis was to benchmark different boosted decision tree models against Booli’s current valuation algorithm, which is a k-NN based model; it would be interesting to see how a deep neural network (DNN) would perform. A neural network is prone to overfitting, but regularization can be incorporated by using regularized neural networks. The only drawback of a DNN is that one would need more powerful machines to speed up the process of training the network.

Bibliography

[1] Kaggle. Zillow Prize: Zillow's Home Value Prediction (Zestimate). 2020. URL: https://www.kaggle.com/c/zillow-prize-1/overview (visited on 2020-06-16).
[2] Dan Becker. What is XGBoost. 2018. URL: https://www.kaggle.com/dansbecker/xgboost (visited on 2020-07-27).
[3] Experiments. 2020. URL: https://lightgbm.readthedocs.io/en/latest/Experiments.html (visited on 2020-07-27).
[4] CatBoost is a high-performance open source library for gradient boosting on decision trees. 2020. URL: https://catboost.ai/#benchmark (visited on 2020-07-27).
[5] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[6] Andrew G Barto and Thomas G Dietterich. "Reinforcement learning and its relationship to supervised learning". In: Handbook of learning and approximate dynamic programming 10 (2004), p. 9780470544785.
[7] Gareth James et al. An introduction to statistical learning. Vol. 112. Springer, 2013.
[8] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. Introduction to linear regression analysis. Vol. 821. John Wiley & Sons, 2012.
[9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. 10. Springer series in statistics New York, 2001.
[10] Naomi S Altman. "An introduction to kernel and nearest-neighbor nonparametric regression". In: The American Statistician 46.3 (1992), pp. 175–185.


[11] Pang-Ning Tan et al. Introduction to Data Mining (2nd Edition). 2nd ed. Pearson, 2018. ISBN: 0133128903.
[12] Leo Breiman et al. Classification and regression trees. CRC Press, 1984.
[13] Matthew A Olson and Abraham J Wyner. "Making sense of random forest probabilities: a kernel perspective". In: arXiv preprint arXiv:1812.05792 (2018).
[14] Jerome H Friedman. "Greedy function approximation: a gradient boosting machine". In: Annals of Statistics (2001), pp. 1189–1232.
[15] Yoav Freund and Robert E Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". In: European Conference on Computational Learning Theory. Springer, 1995, pp. 23–37.
[16] Robert E Schapire. "Explaining AdaBoost". In: Empirical Inference. Springer, 2013, pp. 37–52.
[17] Weiming Hu, Wei Hu, and Steve Maybank. "AdaBoost-based algorithm for network intrusion detection". In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38.2 (2008), pp. 577–583.
[18] David Pardoe and Peter Stone. "Boosting for regression transfer". In: Proceedings of the 27th International Conference on Machine Learning. 2010, pp. 863–870.
[19] Harris Drucker. "Improving regressors using boosting techniques". In: ICML. Vol. 97. 1997, pp. 107–115.
[20] Durga L Shrestha and Dimitri P Solomatine. "Experiments with AdaBoost.RT, an improved boosting scheme for regression". In: Neural Computation 18.7 (2006), pp. 1678–1710.
[21] Liudmila Prokhorenkova et al. "CatBoost: unbiased boosting with categorical features". In: Advances in Neural Information Processing Systems. 2018, pp. 6638–6648.
[22] Tianqi Chen and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System". In: CoRR abs/1603.02754 (2016). arXiv: 1603.02754. URL: http://arxiv.org/abs/1603.02754.
[23] Guolin Ke et al. "LightGBM: A highly efficient gradient boosting decision tree". In: Advances in Neural Information Processing Systems. 2017, pp. 3146–3154.

[24] Xiaolei Sun, Mingxi Liu, and Zeqian Sima. "A novel cryptocurrency price trend forecasting model based on LightGBM". In: Finance Research Letters 32 (2020), p. 101084.
[25] Features. 2020. URL: https://lightgbm.readthedocs.io/en/latest/Features.html (visited on 2020-07-25).
[26] Parameters Tuning. 2020. URL: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html (visited on 2020-08-26).
[27] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. USA: Prentice Hall Press, 2009. ISBN: 0136042597.
[28] Ron Kohavi et al. "A study of cross-validation and bootstrap for accuracy estimation and model selection". In: IJCAI. Vol. 14. 2. Montreal, Canada, 1995, pp. 1137–1145.
[29] DeSO – Demografiska statistikområden. 2020. URL: https://www.scb.se/hitta-statistik/regional-statistik-och-kartor/regionala-indelningar/deso---demografiska-statistikomraden/ (visited on 2020-07-30).
[30] SB Kotsiantis, Dimitris Kanellopoulos, and PE Pintelas. "Data preprocessing for supervised leaning". In: International Journal of Computer Science 1.2 (2006), pp. 111–117.
[31] SWEREF 99. 2020. URL: https://www.lantmateriet.se/en/maps-and-geographic-information/gps-geodesi-och-swepos/Referenssystem/Tredimensionella-system/SWEREF-99/ (visited on 2020-08-01).
[32] Reference. 2020. URL: https://geopandas.org/reference.html (visited on 2020-08-02).
[33] The Shapely User Manual. 2020. URL: https://shapely.readthedocs.io/en/latest/manual.html (visited on 2020-08-02).
[34] SBAB Booli Housing Market Index. 2020. URL: https://www.sbab.se/1/analys__rapporter/sbab_booli_housing_market_index.html (visited on 2020-07-30).
[35] Booli API. 2020. URL: https://www.booli.se/p/api/ (visited on 2020-07-30).
[36] Requests: HTTP for Humans™. 2020. URL: https://requests.readthedocs.io/en/master/ (visited on 2020-07-30).

[37] Feature scaling. 2020. URL: https://en.wikipedia.org/wiki/Feature_scaling (visited on 2020-08-03).
[38] Giuseppe Bonaccorso. Machine learning algorithms. Packt Publishing Ltd, 2017.
[39] Salvador García, Julián Luengo, and Francisco Herrera. Data preprocessing in data mining. Springer, 2015.
[40] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. "Optimal whitening and decorrelation". In: The American Statistician 72.4 (2018), pp. 309–314.
[41] Alan Jović, Karla Brkić, and Nikola Bogunović. "A review of feature selection methods with applications". In: 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 2015, pp. 1200–1205.

Appendix A

Scatterplots of variables vs Absolute Percentage Error


Figure A.1: a) Shows the scatterplot of the variable constructionYear vs absolute percentage error. b) Shows the scatterplot of the variable delta_Date vs absolute percentage error, where the LightGBM model is used.



Figure A.2: a) Shows the scatterplot of the variable latitude vs absolute percentage error. b) Shows the scatterplot of the variable longitude vs absolute percentage error. c) Shows the scatterplot of the variable totalArea vs absolute percentage error. d) Shows the scatterplot of the variable livingArea vs absolute percentage error. e) Shows the scatterplot of the variable plotArea vs absolute percentage error. f) Shows the scatterplot of the variable rooms vs absolute percentage error, where the LightGBM model is used.


Figure A.3: a) Shows the scatterplot of the variable distanceToOceanFront vs absolute percentage error. b) Shows the scatterplot of the variable distanceToWater vs absolute percentage error, where the LightGBM model is used.
