Modeling the Approval Rates of Iowa Home Loan Applications

Total Page:16

File Type:pdf, Size:1020Kb

Modeling the Approval Rates of Iowa Home Loan Applications Iowa State University Capstones, Theses and Creative Components Dissertations Spring 2019 Modeling the approval rates of Iowa Home Loan Applications Yue Zhang Follow this and additional works at: https://lib.dr.iastate.edu/creativecomponents Part of the Applied Statistics Commons, Business Analytics Commons, and the Statistical Models Commons Recommended Citation Zhang, Yue, "Modeling the approval rates of Iowa Home Loan Applications" (2019). Creative Components. 289. https://lib.dr.iastate.edu/creativecomponents/289 This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Modeling the approval rates of Iowa Home Loan Applications By Yue Zhang A Creative Component submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Major: Statistics Program of Study Committee: Li Wang, Major Professor Mark Kaiser Lei Gao Iowa State University Ames, Iowa 2019 Copyright © Yue Zhang, 2019. All rights reserved. Table of Contents Table of Contents ............................................................................................................................ 1 List of Figures ................................................................................................................................. 3 List of Tables .................................................................................................................................. 5 Acknowledgement .......................................................................................................................... 7 Abstract ........................................................................................................................................... 8 Chapter 1 Introduction .................................................................................................................. 9 Chapter 2 Data Description and Cleaning ................................................................................... 12 2.1 Data Source .................................................................................................................... 12 2.2 Data Cleaning ................................................................................................................. 12 2.3 Variable Descriptions ..................................................................................................... 15 Chapter 3 Methods ...................................................................................................................... 18 3.1 The χ2-test ....................................................................................................................... 18 3.2 Classifiers ....................................................................................................................... 19 3.2.1 Logistic Regression ................................................................................................. 19 3.2.2 Linear Discriminant Analysis (LDA) ..................................................................... 20 3.2.3 Generalized Additive Model (GAM) ...................................................................... 21 3.2.4 Random Forests ...................................................................................................... 24 3.3 Model Assessment Metrics ............................................................................................ 26 3.3.1 ROC curve and AUC .............................................................................................. 27 3.3.2 Cross-validation ...................................................................................................... 28 3.4 Model Selection Technique ............................................................................................ 32 3.4.1 Best Subset Selection .............................................................................................. 32 3.4.2 Group LASSO ......................................................................................................... 33 Chapter 4 Hypothesis Testing of the Denial Rates ..................................................................... 36 4.1 Testing the Denial Rates by Applicant’s Ethnicity ........................................................ 37 4.2 Testing the Denial Rates by Applicant’s Race ............................................................... 40 4.3 Testing the Denial Rates by Applicant’s Gender ........................................................... 43 4.4 Summary and Discussion of the Hypothesis Testing Results ........................................ 46 1 Chapter 5 Modeling the Approval/Denial Rates Using Ethnicity and Gender ........................... 48 5.1 Logistic Regression of the Full Model ........................................................................... 49 5.2 Model Selection Using Group LASSO .......................................................................... 53 5.3 Best Subset Approach in Logistic Regression, LDA and GAM .................................... 56 5.4 Random Forest and Performance Comparison............................................................... 59 5.5 Summary of Model Comparison .................................................................................... 68 Chapter 6 Modeling the Approval/Denial Rates Using Race and Gender .................................. 70 6.1 Logistic Regression of the Full Model ........................................................................... 71 6.2 Model Selection Using Group LASSO .......................................................................... 75 6.3 Best Subset Approach in Logistic Regression, LDA and GAM .................................... 78 6.4 Random Forest and Performance Comparison............................................................... 80 6.5 Summary of Model Comparison .................................................................................... 87 Chapter 7 Conclusions ................................................................................................................ 89 References ..................................................................................................................................... 91 Appendix ....................................................................................................................................... 94 Appendix A: R code for hypothesis testing .............................................................................. 94 Appendix B: R code for logistic regression ............................................................................ 104 Appendix C: R code for LDA ................................................................................................. 109 Appendix D: R code for GAM ................................................................................................ 114 Appendix E: R code for random forest ................................................................................... 120 Appendix F: R code for group LASSO in logistic regression................................................. 125 2 List of Figures Figure 2.1 The six metropolitan areas defined in this study (indicated by red circles) ................ 14 Figure 3.1 Example ROC curve, the red and blue points represent the perfect classification and limit of random guess, respectively. ............................................................................................. 28 Figure 5.1 The group LASSO's path plot for each variable .......................................................... 54 Figure 5.2 The evolution of the mean AUC in 10-fold cross-validation as a function of the penalty, the error bar was the standard deviation of the 10 AUCs calculated in the cross-validation ....... 55 Figure 5.3 The mean AUC of all the models fitted in the best subset approach (the variable included in the models are shaded in the corresponding cells) ................................................................... 56 Figure 5.4 The cubic splines fitted using back-fitting (top panel) and penalized likelihood (bottom panel) for the numeric variables: applicant income, number of 1-4 family units and DTI (all after log-transformation) ....................................................................................................................... 58 Figure 5.5 The AUC as a function of the Ntree parameter in random forest ............................... 60 Figure 5.6 The variable importance in the random forest using Ntree=512, the x-axis shows the number of cases that will be misclassified if the variable is excluded in the trees. There are about 7500 cases in each test fold in the 10-fold cross-validation. ........................................................ 61 Figure 5.7 Comparison of performance among models built using logistic regression, LDA, GAM and random forests (with Ntree=64,128,256,512) as a function of number of variables p, all the models with the same p also have the same set of variables ........................................................ 62 Figure 5.8 Comparison of performance among models built using logistic regression, LDA, GAM and random forests (with Ntree=64,128,256,512) as a function of number of variables p, all the
Recommended publications
  • Variable Selection Strategies and Its Importance in Clinical Prediction Modelling
    Open access Methodology Fam Med Com Health: first published as 10.1136/fmch-2019-000262 on 16 February 2020. Downloaded from Variable selection strategies and its importance in clinical prediction modelling Mohammad Ziaul Islam Chowdhury,1 Tanvir C Turin1,2 To cite: Chowdhury MZI, ABSTRACT can be classified broadly into two catego- Turin TC. Variable selection Clinical prediction models are used frequently in clinical ries: mathematical/statistical modelling and strategies and its importance practice to identify patients who are at risk of developing computer- based modelling. Regardless of in clinical prediction modelling. an adverse outcome so that preventive measures can Fam Med Com Health the modelling technique used, one needs to be initiated. A prediction model can be developed in 2020;8:e000262. doi:10.1136/ apply appropriate variable selection methods a number of ways; however, an appropriate variable fmch-2019-000262 during the model building stage. Selecting selection strategy needs to be followed in all cases. Our purpose is to introduce readers to the concept of variable appropriate variables for inclusion in a model selection in prediction modelling, including the importance is often considered the most important and of variable selection and variable reduction strategies. We difficult part of model building. In this paper, will discuss the various variable selection techniques that we will discuss what is meant by variable selec- can be applied during prediction model building (backward tion, why variable selection is important, the elimination, forward selection, stepwise selection and all different methods for variable selection and possible subset selection), and the stopping rule/selection their advantages and disadvantages.
    [Show full text]
  • A Multicentre Development and Validation Study of a Novel Lower Gastrointestinal Bleeding Score—The Birmingham Score
    International Journal of Colorectal Disease (2020) 35:285–293 https://doi.org/10.1007/s00384-019-03459-z ORIGINAL ARTICLE A multicentre development and validation study of a novel lower gastrointestinal bleeding score—The Birmingham Score Samuel C. L. Smith1 & Alina Bazarova1 & Efe Ejenavi2 & Maria Qurashi2 & Uday N. Shivaji1,3 & Phil R. Harvey4 & Emma Slaney4 & Michael McFarlane5 & Graham Baker6 & Mohamed Elnagar6 & Sarah Yuzari6 & Georgios Gkoutos1 & Subrata Ghosh1,2,3 & Marietta Iacucci1,2,3,7 Accepted: 12 November 2019 /Published online: 16 December 2019 # The Author(s) 2019 Abstract Purpose Lower gastrointestinal bleeding (LGIB) is common and risk stratification scores can guide clinical decision-making. There is no robust risk stratification tool specific for LGIB, with existing tools not routinely adopted. We aimed to develop and validate a risk stratification tool for LGIB. Methods Retrospective review of LGIB admissions to three centres between 2010 and 2018 formed the derivation cohort. Using regressional analysis within a machine learning technique, risk factors for adverse outcomes were identified, forming a simple risk stratification score—The Birmingham Score. Retrospective review of an additional centre, not included in the derivation cohort, was performed to validate the score. Results Data from 469 patients were included in the derivation cohort and 180 in the validation cohort. Admission haemoglobin OR 1.07(95% CI 1.06–1.08) and male gender OR 2.29(95% CI 1.40–3.77) predicted adverse outcomes in the derivation cohort AUC 0.86(95% CI 0.82–0.90) which outperformed the Blatchford 0.81(95% CI 0.77–0.85), Rockall 0.60(95% CI 0.55–0.65) and AIM65 0.55(0.50–0.60) scores and in the validation cohort AUC 0.80(95% CI 0.73–0.87) which outperformed the Blatchford 0.77(95% CI 0.70–0.85), Rockall 0.67(95% CI 0.59–0.75) and AIM 65 scores 0.61(95% CI 0.53–0.69).
    [Show full text]
  • Start-Up Valuation in Switzerland: Analysis and Methods
    Start-up valuation in Switzerland: analysis and methods Master Thesis Candidate: Silvia Lama Supervisor: Prof. Dr. Didier Sornette ETH Zürich Department of Management, Technology and Economics (D-MTEC) Chair of Entrepreneurial Risks June 2019 – November 2019 1.1 Motivation and overview 2 0 Abstract 3 ABSTRACT The aim of this master thesis is to provide an overview of start-up valuations in Switzerland. The first part focuses on the analysis of funding rounds closed in Switzerland, between 2010 and 2019. The existence of patterns and trends is investigated, visualized, and commented. The second part selects the best model to estimate a range of pre-money valuation for a target start- up, as a fair benchmark. This could be used by investors and co-founders as a starting point in their investment negotiation process. Indeed, traditional valuation methods1 cannot be applied to start-ups, due to the uncertainty of the latter, their short history, absence of publicly available data on financials, comparable companies or transactions. As a consequence, new valuation methods have emerged and, in the conclusion chapter, they are compared to our approach, stressing their lack of objectivity, contextuality, accuracy, and precision. Finally, several traces for further research are recommended. 1 (e.g. the Discounted Cash Flow, the Valuation Multiples) 1.1 Motivation and overview 4 ACKNOWLEDGEMENTS I would like to express all my gratitude to Professor Didier Sornette, for the opportunity to conduct my Master Thesis at the Chair of Entrepreneurial Risks and for his guidance, and to Dr. Spencer Wheatley for the useful advice. Besides, I would like to sincerely thank Steffen Wagner and Michael Blank, for the trust demonstrated by choosing me to pursue this delicate and extremely interesting research project at Investiere.
    [Show full text]
  • An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
    HHS Public Access Author manuscript Author ManuscriptAuthor Manuscript Author Cell. Author Manuscript Author manuscript; Manuscript Author available in PMC 2019 April 05. Published in final edited form as: Cell. 2018 April 05; 173(2): 400–416.e11. doi:10.1016/j.cell.2018.02.052. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics Jianfang Liu1, Tara Lichtenberg2, Katherine A. Hoadley3, Laila M. Poisson4, Alexander J. Lazar5, Andrew D. Cherniack6, Albert J. Kovatich7, Christopher C. Benz8, Douglas A. Levine9, Adrian V. Lee10, Larsson Omberg11, Denise M. Wolf12, Craig D. Shriver13, Vesteinn Thorsson14, The Cancer Genome Atlas Research Network, and Hai Hu1,15,* 1Chan Soon-Shiong Institute of Molecular Medicine at Windber, Windber, PA 15963, USA 2Nationwide Children’s Hospital, Columbus, OH 43205, USA 3Department of Genetics, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). *Correspondence: [email protected]. 15Lead Contact AUTHOR CONTRIBUTIONS Conceptualization, H.H.; Methodology, J.L. and L.M.P.; Formal Analysis, J.L., L.M.P., D.A.L., and L.O.; Investigation, T.L., K.A.H., and A.D.C.; Resources, T.L., K.A.H., A.D.C., and TCGA Network; Data Curation, J.L., T.L., K.A.H., and TCGA Network; Writing – Original Draft, H.H., J.L., K.A.H., T.L., and A.J.K., Writing – Review & Editing, H.H., J.L., C.C.B., L.M.P., K.A.H., A.J.L., A.V.L., D.A.L., D.M.W., V.T., A.J.K., and C.D.S.; Visualization, J.L.; Funding Acquisition, C.D.S.
    [Show full text]
  • Ph.D. Og Speciale
    Survival after treatment of metastatic bone disease in the extremities Evaluation of factors influencing survival and subsequent development and validation of a prediction model for postoperative survival MICHALA SKOVLUND SØRENSEN, M.D. PhD thesis originated from Musculoskeletal Tumor Section, Department of Orthopedics, Rigshospitalet, University of Copenhagen, Denmark & Faculty of Health and Medical Sciences, University of Copenhagen ii Date of submission February 1st, 2018 Academic supervisors Principal supervisor: Michael Mørk Petersen, M.D., DMSc, professor Musculoskeletal Tumor Section, Department of Orthopedics, Rigshospitalet, University of Copenhagen, Denmark Co-supervisor: Klaus Hindsø, M.D., PhD, associate professor Section of Pediatrics, Department of Orthopedics Rigshospitalet, University of Copenhagen, Denmark Assessment committee Chair Jes Bruun Lauritzen, M.D., DMSc, professor Department of Orthopedic Surgery, Bispebjerg Hospital, University of Copenhagen, Denmark International assessor Henrik C. Bauer, M.D., PhD, professor Department Orthopedics/Oncology Service, Karolinska Hospital, Stockholm, Sweden National assessor Alma Becic Pedersen, M.D., PhD, DMSc, associate professor Department of Clinical Epidemiology - Department of Clinical Medicine - Aarhus University, Denmark. iii ACKNOWLEDGEMENT The deepest gratitude to all of the 164 patients who took their time and went the extra miles to participate in the prospective part of current thesis during a period of their life where the risk of ambulatory function and end of life was pending. I would like to thank my principal supervisor, Michael, for your time, your patience, for giving me elbow space, bearing with me during my stubbornness and for making all of this happening. It has been a project in the making for 6 years and I’m sorry for not providing you with a Dark score, but you compensate for the Hindsø score in so many other ways that you do not need one.
    [Show full text]
  • An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
    HHS Public Access Author manuscript Author ManuscriptAuthor Manuscript Author Cell. Author Manuscript Author manuscript; Manuscript Author available in PMC 2019 April 05. Published in final edited form as: Cell. 2018 April 05; 173(2): 400–416.e11. doi:10.1016/j.cell.2018.02.052. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics Jianfang Liu1, Tara Lichtenberg2, Katherine A. Hoadley3, Laila M. Poisson4, Alexander J. Lazar5, Andrew D. Cherniack6, Albert J. Kovatich7, Christopher C. Benz8, Douglas A. Levine9, Adrian V. Lee10, Larsson Omberg11, Denise M. Wolf12, Craig D. Shriver13, Vesteinn Thorsson14, The Cancer Genome Atlas Research Network, and Hai Hu1,15,* 1Chan Soon-Shiong Institute of Molecular Medicine at Windber, Windber, PA 15963, USA 2Nationwide Children’s Hospital, Columbus, OH 43205, USA 3Department of Genetics, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). *Correspondence: [email protected]. 15Lead Contact AUTHOR CONTRIBUTIONS Conceptualization, H.H.; Methodology, J.L. and L.M.P.; Formal Analysis, J.L., L.M.P., D.A.L., and L.O.; Investigation, T.L., K.A.H., and A.D.C.; Resources, T.L., K.A.H., A.D.C., and TCGA Network; Data Curation, J.L., T.L., K.A.H., and TCGA Network; Writing – Original Draft, H.H., J.L., K.A.H., T.L., and A.J.K., Writing – Review & Editing, H.H., J.L., C.C.B., L.M.P., K.A.H., A.J.L., A.V.L., D.A.L., D.M.W., V.T., A.J.K., and C.D.S.; Visualization, J.L.; Funding Acquisition, C.D.S.
    [Show full text]
  • Predicting Presence of Macrovascular Causes in Non-Traumatic Intracerebral Hemorrhage; the DIAGRAM Prediction Score
    Predicting presence of macrovascular causes in non-traumatic intracerebral hemorrhage; the DIAGRAM prediction score Nina A Hilkens, MD*1, Charlotte JJ van Asch, MD, PhD *2,3, David J Werring, MD, PhD*4, Duncan Wilson, MD4, Gabriël JE Rinkel, MD, FRCPE2, Ale Algra, MD, PhD1,2, Birgitta K Velthuis, MD, PhD5, Gérard AP de Kort, MD5, Theo D Witkamp, MD, PhD5, Koen M van Nieuwenhuizen, MD2, Frank-Erik de Leeuw, MD, PhD6, Wouter J Schonewille, MD, PhD7 Paul LM de Kort, MD, PhD8, Diederik WJ Dippel, MD, PhD9, Theodora WM Raaymakers, MD10, Jeannette Hofmeijer, MD, PhD11, Marieke JH Wermer, MD, PhD12, Henk Kerkhoff, MD, PhD13, Korné Jellema, MD, PhD14, Irene M Bronner, MD, PhD15, Michel JM Remmers, MD16, Henri Paul Bienfait, MD17, Ron JGM Witjes, MD18, H Rolf Jäger, MD, FRCR19, Jacoba P Greving, PhD1, Catharina JM Klijn, MD, PhD2,6; the DIAGRAM study group * These authors contributed equally to the manuscript 1. Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, the Netherlands 2. Department of Neurology and Neurosurgery, Brain Center Rudolf Magnus, University Medical Center, Utrecht, the Netherlands 3. Kempenhaeghe, Academic Centre for Epileptology, Heeze, the Netherlands 4. Stroke Research Centre, Department of Brain Repair and Rehabilitation, UCL Institute of Neurology and The National Hospital for Neurology and Neurosurgery, Queen Square, London, UK 5. Department of Radiology, University Medical Center Utrecht, Utrecht, the Netherlands 6. Department of Neurology, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, the Netherlands 7. Department of Neurology, St. Antonius Hospital, Nieuwegein, the Netherlands 8. Department of Neurology, St.
    [Show full text]
  • Bleeding on Antithrombotic Treatment in Secondary Stroke Prevention
    BLEEDING ON ANTITHROMBOTIC TREATMENT IN SECONDARY STROKE PREVENTION Nina Hilkens Bleeding on antithrombotic treatment in secondary stroke prevention ISBN:978-94-6295-940-8 Design: Paula Baggen, ProefschriftOntwerp.nl, Nijmegen Cover image: Divinity School, Oxford Printed by: Proefschriftmaken.nl Copyright 2018, Nina Hilkens, The Netherlands BLEEDING ON ANTITHROMBOTIC TREATMENT IN SECONDARY STROKE PREVENTION Bloedingen bij het gebruik van antitrombotica voor secundaire preventie na cerebrale ischemie (met een samenvatting in het Nederlands) Proefschrift ter verkrijging van de graad van doctor aan de Universiteit Utrecht ingevolge het besluit van het college voor promoties op gezag van de rector magnificus, prof.dr. H.R.B.M. Kummeling, in het openbaar te verdedigen op dinsdag 19 juni 2018 des middags te 12.45 uur door Nina Adriana Hilkens geboren op 5 juli 1990 te Vlaardingen Promotor: Prof.dr. A. Algra Copromotor: Dr.ir. J.P. Greving The research described in this thesis was supported by a grant of the Dutch Heart Foundation (2013T128). Financial support by the Dutch Heart Foundation for the publication of this thesis is gratefully . acknowledged. Publication of this thesis was also financially supported by Boehringer Ingelheim CONTENTS Chapter 1 General introduction 7 Chapter 2 Prediction models for intracranial haemorrhage or major bleeding in 17 patients on antiplatelet therapy: a systematic review and external validation study Chapter 3 Predicting major bleeding in patients with noncardioembolic stroke on 37 antiplatelets: S2TOP-BLEED
    [Show full text]
  • Author Comments to Editors Decision
    Author Comments to Editors Decision See author comments in green colour Comments to the Author: The manuscript has been revised thoroughly according to the reviewers’ comments. In the present form, this paper provides a comprehensible overview of post-processing routines run operationally at DWD, which is definitely relevant to the audience of this NPG special issue. Hence, I think the paper will be ready for publication subject to a very minor revision addressing the following issues: - L50: EMOS does not necessary rely on Gaussian distributions (e.g. EMOS for precipitation based on shifted GEV or shifted Gamma distributions). So I would either call the method NGR (non-homogenous Gaussian regression) here, or rephrase the sentence in order emphasize that depending on the variable of interest EMOS methods relying on different parametric distribution families exist. Thanks for the clarification. Method called NGR and sentence is rephrased appropriately. - Equation (1): Why is a subscript k attached to predictand y. This notation is somewhat unusual and needs to be clarified. The predictand is y. \hat y_k is the estimation using a number of k predictors. The explicit labelling with k is used in the prescription of the stepwise regression in the paragraph after Eq. 7, lines 277ff- Sentence is rephrased for clarification. - L193: is this still the case (considering the developments in statistical post-processing over the last decade)? Actually, I think it is. Nevertheless, statement changed to „a classical approach“. - L213: What do you mean by variable here? NWP output? Yes. „independent variables“ removed since redundant here. Some typos (list probably not exhaustive): all corrected, thanks - L48: forecast errors - L63/64: use of calibrated ensembles - L89: thunderstorms? - L137: Too many observations - L157: Bougeault et al.
    [Show full text]