Modeling the Approval Rates of Iowa Home Loan Applications

Iowa State University Capstones, Theses and Creative Components Dissertations Spring 2019 Modeling the approval rates of Iowa Home Loan Applications Yue Zhang Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents Part of the Applied Statistics Commons, Business Analytics Commons, and the Statistical Models Commons Recommended Citation Zhang, Yue, "Modeling the approval rates of Iowa Home Loan Applications" (2019). Creative Components. 289. https://lib.dr.iastate.edu/creativecomponents/289 This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Modeling the approval rates of Iowa Home Loan Applications By Yue Zhang A Creative Component submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Statistics Program of Study Committee: Li Wang, Major Professor Mark Kaiser Lei Gao Iowa State University Ames, Iowa 2019 Copyright © Yue Zhang, 2019. All rights reserved. Table of Contents Table of Contents ............................................................................................................................ 1 List of Figures ................................................................................................................................. 3 List of Tables .................................................................................................................................. 5 Acknowledgement .......................................................................................................................... 7 Abstract ........................................................................................................................................... 8 Chapter 1 Introduction .................................................................................................................. 9 Chapter 2 Data Description and Cleaning ................................................................................... 12 2.1 Data Source .................................................................................................................... 12 2.2 Data Cleaning ................................................................................................................. 12 2.3 Variable Descriptions ..................................................................................................... 15 Chapter 3 Methods ...................................................................................................................... 18 3.1 The χ2-test ....................................................................................................................... 18 3.2 Classifiers ....................................................................................................................... 19 3.2.1 Logistic Regression ................................................................................................. 19 3.2.2 Linear Discriminant Analysis (LDA) ..................................................................... 20 3.2.3 Generalized Additive Model (GAM) ...................................................................... 21 3.2.4 Random Forests ...................................................................................................... 24 3.3 Model Assessment Metrics ............................................................................................ 26 3.3.1 ROC curve and AUC .............................................................................................. 27 3.3.2 Cross-validation ...................................................................................................... 28 3.4 Model Selection Technique ............................................................................................ 32 3.4.1 Best Subset Selection .............................................................................................. 32 3.4.2 Group LASSO ......................................................................................................... 33 Chapter 4 Hypothesis Testing of the Denial Rates ..................................................................... 36 4.1 Testing the Denial Rates by Applicant’s Ethnicity ........................................................ 37 4.2 Testing the Denial Rates by Applicant’s Race ............................................................... 40 4.3 Testing the Denial Rates by Applicant’s Gender ........................................................... 43 4.4 Summary and Discussion of the Hypothesis Testing Results ........................................ 46 1 Chapter 5 Modeling the Approval/Denial Rates Using Ethnicity and Gender ........................... 48 5.1 Logistic Regression of the Full Model ........................................................................... 49 5.2 Model Selection Using Group LASSO .......................................................................... 53 5.3 Best Subset Approach in Logistic Regression, LDA and GAM .................................... 56 5.4 Random Forest and Performance Comparison............................................................... 59 5.5 Summary of Model Comparison .................................................................................... 68 Chapter 6 Modeling the Approval/Denial Rates Using Race and Gender .................................. 70 6.1 Logistic Regression of the Full Model ........................................................................... 71 6.2 Model Selection Using Group LASSO .......................................................................... 75 6.3 Best Subset Approach in Logistic Regression, LDA and GAM .................................... 78 6.4 Random Forest and Performance Comparison............................................................... 80 6.5 Summary of Model Comparison .................................................................................... 87 Chapter 7 Conclusions ................................................................................................................ 89 References ..................................................................................................................................... 91 Appendix ....................................................................................................................................... 94 Appendix A: R code for hypothesis testing .............................................................................. 94 Appendix B: R code for logistic regression ............................................................................ 104 Appendix C: R code for LDA ................................................................................................. 109 Appendix D: R code for GAM ................................................................................................ 114 Appendix E: R code for random forest ................................................................................... 120 Appendix F: R code for group LASSO in logistic regression................................................. 125 2 List of Figures Figure 2.1 The six metropolitan areas defined in this study (indicated by red circles) ................ 14 Figure 3.1 Example ROC curve, the red and blue points represent the perfect classification and limit of random guess, respectively. ............................................................................................. 28 Figure 5.1 The group LASSO's path plot for each variable .......................................................... 54 Figure 5.2 The evolution of the mean AUC in 10-fold cross-validation as a function of the penalty, the error bar was the standard deviation of the 10 AUCs calculated in the cross-validation ....... 55 Figure 5.3 The mean AUC of all the models fitted in the best subset approach (the variable included in the models are shaded in the corresponding cells) ................................................................... 56 Figure 5.4 The cubic splines fitted using back-fitting (top panel) and penalized likelihood (bottom panel) for the numeric variables: applicant income, number of 1-4 family units and DTI (all after log-transformation) ....................................................................................................................... 58 Figure 5.5 The AUC as a function of the Ntree parameter in random forest ............................... 60 Figure 5.6 The variable importance in the random forest using Ntree=512, the x-axis shows the number of cases that will be misclassified if the variable is excluded in the trees. There are about 7500 cases in each test fold in the 10-fold cross-validation. ........................................................ 61 Figure 5.7 Comparison of performance among models built using logistic regression, LDA, GAM and random forests (with Ntree=64,128,256,512) as a function of number of variables p, all the models with the same p also have the same set of variables ........................................................ 62 Figure 5.8 Comparison of performance among models built using logistic regression, LDA, GAM and random forests (with Ntree=64,128,256,512) as a function of number of variables p, all the

Modeling the Approval Rates of Iowa Home Loan Applications

Variable Selection Strategies and Its Importance in Clinical Prediction Modelling

A Multicentre Development and Validation Study of a Novel Lower Gastrointestinal Bleeding Score—The Birmingham Score

Start-Up Valuation in Switzerland: Analysis and Methods

An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics

Ph.D. Og Speciale

An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics

Predicting Presence of Macrovascular Causes in Non-Traumatic Intracerebral Hemorrhage; the DIAGRAM Prediction Score

Bleeding on Antithrombotic Treatment in Secondary Stroke Prevention

Author Comments to Editors Decision