Predicting House Prices on the Countryside Using Boosted Decision Trees

DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020 Predicting House Prices on the Countryside using Boosted Decision Trees WAR REVEND KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES Predicting House Prices on the Countryside using Boosted Decision Trees WAR REVEND Degree Projects in Mathematical Statistics (30 ECTS credits) Degree Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2020 Supervisors at Booli Search Technologies AB: Christopher Madsen Supervisor at KTH: Joakim Andén-Pantera Examiner at KTH: Joakim Andén-Pantera TRITA-SCI-GRU 2020:302 MAT-E 2020:075 Royal Institute of Technology School of Engineering Sciences KTH SCI SE-100 44 Stockholm, Sweden URL: www.kth.se/sci Abstract This thesis intends to evaluate the feasibility of supervised learning models for predicting house prices on the countryside of South Sweden. It is essen- tial for mortgage lenders to have accurate housing valuation algorithms and the current model offered by Booli is not accurate enough when evaluating residence prices on the countryside. Different types of boosted decision trees were implemented to address this issue and their performances were compared to traditional machine learning methods. These different types of supervised learning models were implemented in order to find the best model with re- gards to relevant evaluation metrics such as root-mean-squared error (RMSE) and mean absolute percentage error (MAPE). The implemented models were ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. All these models were benchmarked against Booli’s current housing valuation algorithms which are based on a k- NN model. The results from this thesis indicated that the LightGBM model is the optimal one as it had the best overall performance with respect to the cho- sen evaluation metrics. When comparing the LightGBM model to the benchmark, the performance was overall better, the LightGBM model had an RMSE score of 0.330 compared to 0.358 for the Booli model, indicating that there is a potential of using boosted decision trees to improve the predictive accuracy of residence prices on the countryside. Sammanfattning Denna uppsats ämnar utvärdera genomförbarheten hos olika övervakade in- lärningsmodeller för att förutse huspriser på landsbygden i Södra Sverige. Det är viktigt för bostadslånsgivare att ha noggranna algoritmer när de värderar bostäder, den nuvarande modellen som Booli erbjuder har dålig precision när det gäller värderingar av bostäder på landsbygden. Olika typer av boostade beslutsträd implementerades för att ta itu med denna fråga och deras prestanda jämfördes med traditionella maskininlärningsmetoder. Dessa olika typer av övervakad inlärningsmodeller implementerades för att hitta den bästa modellen med avseende på relevanta prestationsmått som t.ex. root-mean-squared error (RMSE) och mean absolute percentage error (MAPE). De övervaka- de inlärningsmodellerna var ridge regression, lasso regression, random forest, AdaBoost, gradient boosting, CatBoost, XGBoost, and LightGBM. Samtliga algoritmers prestanda jämförs med Boolis nuvarande bostadsvärderingsalgo- ritm, som är baserade på en k-NN modell. Resultatet från denna uppsats vi- sar att LightGBM modellen är den optimala modellen för att värdera husen på landsbygden eftersom den hade den bästa totala prestandan med avseende på de utvalda utvärderingsmetoderna. LightGBM modellen jämfördes med Booli modellen där prestandan av LightGBM modellen var i överlag bättre, där LightGBM modellen hade ett RMSE värde på 0.330 jämfört med Booli modellen som hade ett RMSE värde på 0.358. Vilket indikerar att det finns en potential att använda boostade beslutsträd för att förbättra noggrannheten i förutsägelserna av huspriser på landsbygden. Acknowledgements First, I wish to thank my supervisor at KTH, Joakim Andén-Pantera, for his excellent patience, guidance, and support in completing this thesis. I would also like to express my gratitude toward Christopher Madsen, my supervisors at Booli Search Technologies AB for inspiration and support. Further, I want to thank Johan Mattsson and Olof Sjöbergh at Booli for additional advice and providing the opportunity for this thesis. Contents Acknowledgements 1 Introduction 1 1.1 Background and problem formulation . 1 1.2 Purpose . 2 1.3 Research question . 2 1.4 Scope . 3 1.5 Limitation . 4 2 Background 5 2.1 Supervised Learning . 5 2.2 Regression . 6 2.2.1 Simple linear regression . 6 2.2.2 Multiple linear regression . 7 2.3 Shrinkage Methods . 9 2.3.1 Ridge regression . 10 2.3.2 Lasso regression . 11 2.4 k-Nearest Neighbors Algorithm . 12 2.5 Decision Trees . 13 2.5.1 Regression Trees . 13 2.6 Random Forest . 16 2.7 Boosting . 17 2.7.1 AdaBoost . 19 2.7.2 Gradient Boosting . 21 2.7.3 Categorical Boosting: CatBoost . 23 2.7.4 XGBoost . 24 2.7.5 LightGBM . 26 2.8 Hyper-Parameter Tuning . 28 2.8.1 Cross-validation . 28 CONTENTS 2.9 Metrics of interest . 31 3 Methods 33 3.1 Data . 33 3.1.1 Overview of the available data . 33 3.1.2 Preprocessing . 36 3.2 Model Implementation . 43 3.2.1 Hyper-Parameter Tuning of the Models . 43 4 Results 47 5 Discussion 58 5.1 Results Evaluation . 58 5.1.1 Model Comparison . 58 5.1.2 Benchmark Comparison . 59 6 Conclusion 61 6.1 Answering Research Questions . 61 6.2 Future work . 62 Bibliography 63 A Scatterplots of variables vs Absolute Percentage Error 67 List of Tables 3.1 Description of the variables available in the data set obtained from Booli. 35 3.2 Description of the variables available in the data set obtained from SCB . 36 3.3 Hyper-parameter set for lasso and ridge regression . 43 3.4 Hyper-parameter set for gradient boosting regression . 44 3.5 Hyper-parameter set for random forest . 44 3.6 Hyper-parameter set for AdaBoost regression . 44 3.7 Hyper-parameter set for LightGBM . 45 3.8 Hyper-parameter set for XGBoost . 45 3.9 Hyper-parameter set for CatBoost . 46 4.1 Performance of the models evaluated with metrics used by Booli 52 List of Figures 2.1 Illustrating leaf-wise and level-wise tree growth . 28 2.2 Illustrating 5-fold cross-validation on a data set . 30 3.1 Illustrating which variables in the data set have NaN values and the percentage of missing values and have the percentage . 38 3.2 Illustrating one-hot encoding . 39 LIST OF FIGURES 3.3 Distribution of the adjusted residence price together with Q-Q plot . 40 3.4 Distribution of the log-transformed adjusted residence price together with Q-Q plot . 40 3.5 Heat map between numerical variables in the data set. 42 3.6 Heat map between the ten highest numerical variables that cor- relates to the residence price in the data set. 42 4.1 Barplot of the RMSE score on the train set when evaluating different scaling and transformation methods. 47 4.2 Barplot of the RMSE score on the test set when evaluating different scaling and transformation methods. 48 4.3 Evaluating MSE and MAPE as loss function for LightGBM and CatBoost . 48 4.4 Barplot of the RMSE score on the train set when evaluating sample weights. 49 4.5 Barplot of the RMSE score on the test set when evaluating sample weights. 49 4.6 Barplot of the RMSE score on the train and test set. 50 4.7 Barplot of the MAPE score on the train and test set. 51 4.8 Barplot of the MdAPE score on the train and test set. 51 4.9 Histogram of the percentage error between the Booli and Light- GBM model . 53 4.10 Map of the mean percentage error in each DESO area using LightGBM and Booli model . 53 4.11 Map of the mean percentage error in each DESO area using XGBoost . 54 4.12 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using LightGBM . 55 4.13 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using Booli model . 55 4.14 Scatterplot of the spread of house prices in DeSO area vs MAPE in DeSO area using lasso . 56 4.15 Scatterplots of the variables assessedValue respective as- sessedValueBuilding vs absolute percentage error, using the LightGBM model . 56 4.16 Scatterplots of the variables assessedValuePlot respective assessmentPoints vs absolute percentage error, using the LightGBM model . 57 LIST OF FIGURES A.1 Scatterplots of the variables constructionYear respective delta_Date vs absolute percentage error, using the Light- GBM model . 67 A.2 Scatterplots of the variables latitude, longitude, to- talArea , livingArea, plotArea respective rooms vs absolute percentage error, using the LightGBM model . 68 A.3 Scatterplots of the variables distanceToOceanFront respective distanceToWater vs absolute percentage error, using the LightGBM model . 69 Chapter 1 Introduction This chapter provides an overview of the aim of the thesis. The topics discussed within this chapter are the thesis’ background, purpose, scope, and limitations of the thesis. 1.1 Background and problem formulation For the majority of the people, purchasing a residence is the biggest financial commitment in their life. Ensuring that homeowners have a trusted way of monitoring the value of their assets is important. Even companies such as Zil- low, an American online real estate database company, offered a competition in 2018 with a one million dollar grand prize for improving their valuation algorithm [1]. The valuation algorithm is an important feature that Booli offers to its consumers, but it is also used by their owner, SBAB Bank AB, when they need to value residences across Sweden. Usually, the bank customers want to get a valuation on their residence when they want to get a loan on the house or when they need a new mortgage for purchasing a new property.

Predicting House Prices on the Countryside Using Boosted Decision Trees

Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms

A Hybrid Machine Learning/Deep Learning COVID-19 Severity Predictive Model from CT Images and Clinical Data

New Directions in Automated Traffic Analysis

ISBN # 1-60132-514-2; American Council on Science & Education / CSCE 2021

Catboost for Big Data: an Interdisciplinary Review

Xgboost Add-In for JMP Pro

Arxiv:2009.09993V3 [Q-Fin.TR] 14 May 2021 Formats Following Predetermined Protocols and Data Structures

Minimal Variance Sampling in Stochastic Gradient Boosting

Estimating the Pan Evaporation in Northwest China by Coupling Catboost with Bat Algorithm

Catboost for Big Data: an Interdisciplinary Review

Catboost: Unbiased Boosting with Categorical Features

Comparison of Gradient Boosting Decision Tree Algorithms for CPU Performance CPU Performansı Için Gradyan Artırıcı Karar A