Comparison of Machine Learning Techniques When Estimating Probability of Impairment
Estimating Probability of Impairment through Identification of Defaulting Customers One Year Ahead of Time

Authors: Alexander Eriksson, Jacob Långström
Supervisors: Prof. Oleg Seleznjev, Xun Su
June 13, 2019

Student Master thesis, 30 hp
Degree Project in Industrial Engineering and Management
Spring 2019

Abstract

Probability of Impairment, or Probability of Default, is the share of customers within a segment who are expected not to fulfil their debt obligations and instead go into Default. This is a key metric within banking for estimating the level of credit risk, and the current standard is to estimate Probability of Impairment using Linear Regression. In this paper we show how this metric can instead be estimated through a classification approach with machine learning. By using models based on Neural Networks and Gradient Boosting, trained to find which specific customers will go into Default within the upcoming year, the Probability of Impairment is shown to be estimated more accurately than when using Linear Regression. Additionally, these models have numerous practical applications within the banking sector. The new important features we found can be used to strengthen the models currently in use, and the ability to identify customers about to go into Default lets banks take necessary actions ahead of time to cover otherwise unexpected risks.

Key Words

Classification, Imbalanced Data, Machine Learning, Probability of Impairment, Risk Management

Sammanfattning (Summary)

The title of this report is A Comparison of Machine Learning Techniques for Estimating Probability of Impairment. The Probability of Impairment is estimated by identifying borrowers who will not fulfil their repayment obligations within one year.
Probability of Impairment, or Probability of Default, is the share of customers who are expected not to fulfil their obligations as borrowers, so that repayment does not occur. This is a key metric within the banking sector for calculating the level of credit risk, which under the current regulatory standard is estimated through Linear Regression. In this thesis we show how this metric can instead be estimated through classification with machine learning. By using models based on Neural Networks and Gradient Boosting, fitted to find which specific customers will not fulfil their repayment obligations within the coming year, Probability of Impairment is shown to be estimated better than through Linear Regression. In addition, these models have a large number of internal uses within the banking sector. The new variables of interest we have found can be used to strengthen the models in use today, and the ability to identify customers at risk of not being able to fulfil their obligations lets banks take necessary actions in good time to handle otherwise unexpected risks.

Key Words

Classification, Imbalanced Data, Machine Learning, Probability of Impairment, Risk Management

Acknowledgements

We want to thank Professor Oleg Seleznjev at Umeå University for his mentoring and input, which led us to the knowledge necessary to write this thesis; Xun Su at Nordea for her administrative work and for challenging us to explore issues we would otherwise not have considered; Nordea for providing data and letting us write our thesis at their Swedish headquarters in Stockholm; Alexander Ramström at Nordea for his administrative work in giving us access to both rooms and data; and finally the remaining people in the IFRS9 team and their team leader Andreas Wirenhammar for welcoming us and answering any questions we had.

Contents

1 Introduction
  1.1 Background
  1.2 Problem Definition
  1.3 Purpose and Aim
  1.4 Delimitations
  1.5 Data
  1.6 Approach and Outline
2 Theory
  2.1 Binary Classification
  2.2 Multiple Linear Regression
  2.3 Multiple Imputation by Chained Equations
  2.4 Principal Component Analysis
  2.5 Imbalanced Data
  2.6 Tree-Based Methods
  2.7 Artificial Neural Networks
  2.8 Model Selection
  2.9 Evaluation
3 Method
  3.1 Pre-Processing Data
    3.1.1 Creating the Target Variable: YearDefault
    3.1.2 Macro Data
    3.1.3 Initial Data Cleaning
    3.1.4 Missing Data
    3.1.5 Imputing Missing Values
    3.1.6 Grouping of Minority Categories
    3.1.7 Historical Customer Data
    3.1.8 Splitting the Data
    3.1.9 Oversampling, One-Hot Encoding and Standardization
    3.1.10 Data for Linear Regression
  3.2 Models
    3.2.1 ANN - Artificial Neural Network
    3.2.2 RF - Random Forest
    3.2.3 XGBoost - Extreme Gradient Boosting
    3.2.4 Ensemble of ANN and XGBoost
    3.2.5 Linear Regression
4 Results
  4.1 Linear Regression
  4.2 Classifiers
  4.3 Comparing all Models
5 Discussion
  5.1 Conclusion
  5.2 Classification Problem
  5.3 Complexity
  5.4 Important Features
  5.5 Non-Linear Risk Grade
  5.6 Development Opportunities
6 Reference List

Abbreviations

ACC          Accuracy
ACCE         Expected Accuracy
ANN          Artificial Neural Network
AP           Average Precision
AUC          Area Under Curve
AUPRC        Area Under Precision-Recall Curve
Default      Borrowers failing to fully meet their obligations to clear their debt
ECL          Expected Credit Loss
ENN          Wilson’s Edited Nearest Neighbours rule
Ensemble     Ensemble model of Artificial Neural Network and Extreme Gradient Boosting
FN           False Negative
FNR          False Negative Rate
FP           False Positive
FPR          False Positive Rate
G-mean       Geometric Mean
IFRS 9       International Financial Reporting Standard
Kappa        Cohen’s Kappa
LR           Linear Regression
Macro        Refers to data containing macroeconomic features
MCC          Matthews Correlation Coefficient
MICE         Multiple Imputation by Chained Equations
MLR          Multiple Linear Regression
NPV          Negative Predictive Value
PC           Principal Components
PCA          Principal Component Analysis
PD           Probability of Default
PI           Probability of Impairment
Precision    Positive Predictive Value
PRC          Precision-Recall Curve
Recall       True Positive Rate
RF           Random Forest
ROC          Receiver Operating Characteristic
Shipping     Refers to data containing features for customers whose business activities are related to shipping
SMOTE        Synthetic Minority Oversampling Technique
SMOTEENN     Synthetic Minority Oversampling Technique, followed by cleaning using Edited Nearest Neighbours
Specificity  True Negative Rate
TN           True Negative
TP           True Positive
XGBoost      Extreme Gradient Boosting
*model*_cla  Classifier version of *model*
*model*_reg  Regression version of *model*

1 Introduction

The purpose of this chapter is to provide the necessary background on the problem we examine throughout the report. This includes a description of the problem, how we approach it and what data we have access to. The data are provided by Nordea, and this report is written at their Swedish headquarters in Stockholm. We have not been awarded any monetary compensation or been made any promises of future gain and are thus unbiased when writing this report. The PI presented in this report is based on coded variables we have created so as not to leak any classified information such as customer identifications: whenever a customer appears an additional time within the data, we treat that additional observation as an entirely new customer. This also encodes Nordea’s true realised PI without losing any properties that are valuable to us.

1.1 Background

Probability of Impairment [PI] has the same meaning as Probability of Default [PD] or Risk of Default, and is defined as a theoretical percentage value describing how many borrowers out of a group are expected, for any reason, not to fulfil their debt payment obligations (Nordea, 2017, p. 45). This is a key parameter within banking when estimating Expected Credit Loss [ECL], which is the expected loss due to borrowers failing to fully meet their obligations to clear their debts. Estimating ECL is not only important from a business standpoint, to determine capital buffers and interest rates, but is also a legal requirement. The capital of banks within the European Union [EU] is subject to many legal frameworks such as CRD