Regularization Methods for Predicting an Ordinal Response Using Longitudinal High-Dimensional Genomic Data
Virginia Commonwealth University
VCU Scholars Compass

Theses and Dissertations
Graduate School
2013

Regularization Methods for Predicting an Ordinal Response using Longitudinal High-dimensional Genomic Data
Jiayi Hou, Virginia Commonwealth University

Follow this and additional works at: https://scholarscompass.vcu.edu/etd
Part of the Biostatistics Commons

© The Author

Downloaded from https://scholarscompass.vcu.edu/etd/3242

This Dissertation is brought to you for free and open access by the Graduate School at VCU Scholars Compass. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of VCU Scholars Compass. For more information, please contact [email protected].

© Jiayi Hou 2013
All Rights Reserved

REGULARIZATION METHODS FOR PREDICTING AN ORDINAL RESPONSE USING LONGITUDINAL HIGH-DIMENSIONAL GENOMIC DATA

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Virginia Commonwealth University

By Jiayi Hou
B.S. Mathematics, Sichuan University, 2008

Advisor: Kellie J. Archer
Associate Professor, Department of Biostatistics
Director, VCU Massey Cancer Center Biostatistics Shared Resource

Virginia Commonwealth University
Richmond, Virginia
December, 2013

Acknowledgement

I owe my sincere thanks to the countless people who have helped, supported and encouraged me during my long Ph.D. journey. Without your help, this thesis would not have been possible. First and foremost, I must thank my thesis advisor Dr. Kellie Archer for the generous support she gave to help pave the path to achievement. Dr. Archer has inspired me to devote myself to the area of statistical learning through her enthusiasm, vision and determination. Learning from Dr. Archer has been the greatest pleasure and will be a lifetime treasure. Thank you to the rest of my thesis committee: Dr. Chris Gennings and Dr. Robert Johnson, who are experts in the field of biostatistics; Dr.
Sam Chen, who has tremendous research experience in genetics and genomics; and Dr. Juan Lu, who contributes enormously to the area of epidemiology. Your constructive feedback, rich support and kind encouragement helped me grow both as a researcher and as a person. I truly appreciate the time, effort and energy you dedicated to help make this happen.

Many other great supervisors I got to know through various projects and positions have taught me a lot. Thank you to Dr. Mark Reimers for introducing me to the field of bioinformatics and offering me rigorous training. Thank you to Dr. Charlie Kish and Dr. Carol Summitt for providing a unique opportunity to gain real industrial experience and for broadening the areas where my knowledge can be applied. Thank you to Dr. Phillip Yates, Dr. Max Kuhn and other team members at Pfizer Inc for offering me an enjoyable, memorable and cool summer in Connecticut.

I owe a big thanks to my classmates Adam Sima, Caroline Carrico, Amber Wilk, Sarah Reese, Chunfeng Ren, Qing Zhou and Yan Jin, who grew, laughed, gossiped and enjoyed graduate school with me. There were other people in the department who provided generous support and helped me get through school: Yvonne Hargrove, Gayle Spivery, Helen Wang, Brian Bush, Russ Boyle, Dr. Donna McClish, Dr. Roy Sabo, Dr. Jessica Ketchum, Dr. Guimin Gao and Dr. Shumei Sun, I owe you a great debt of gratitude. I am grateful to all my other friends in Richmond, VA for your companionship, tolerance and encouragement, which made the journey wonderful.

Finally, I owe everything to my parents and family members, who unconditionally supported me in pursuing my dream. I feel blessed and will always cherish the memory of Virginia Commonwealth University.

Table of Contents

Table of Contents
List of Figures
List of Tables
Abstract
1 Introduction to Ordinal Model
  1.1 Ordinal Responses
  1.2 Model Framework for Ordinal Responses
    1.2.1 Cumulative Logit Model
    1.2.2 Adjacent Categories Model
    1.2.3 Continuation Ratio Model
  1.3 Estimation of the Coefficients
    1.3.1 Maximum Likelihood Estimate
    1.3.2 Optimization Technique
    1.3.3 Software Implementation
  1.4 NIMH Schizophrenia Example
2 Regularization Methods for High-dimensional Data
  2.1 Regularization Methods for Continuous Response
    2.1.1 LASSO
    2.1.2 Forward Stagewise Method
    2.1.3 LAR
  2.2 Regularization Methods for Dichotomous Responses
    2.2.1 LASSO for Logistic Regression
    2.2.2 Forward Stagewise for Logistic Regression
  2.3 Coordinate Descent for LASSO Regularization Paths
  2.4 Some Discussion
3 Statistical Models for Longitudinal Data
  3.1 Linear Mixed Model
    3.1.1 Linear Regression Model
    3.1.2 ANOVA and MANOVA Approaches for Repeated Measurement
    3.1.3 Linear Mixed Model
    3.1.4 Estimating Parameters for a Linear Mixed Model
  3.2 Nonlinear Mixed Model
    3.2.1 The Model Framework
    3.2.2 The Marginal Likelihood and its Approximation
    3.2.3 Estimating of the Parameters
    3.2.4 Orange Tree Example
  3.3 Generalized Linear Model
    3.3.1 Generalized Linear Model Framework
    3.3.2 Moments and Likelihood for GLM
    3.3.3 Maximum Likelihood Estimates for GLM
    3.3.4 Quasi-Likelihood Estimates for GLM
  3.4 Generalized Linear Mixed Model
    3.4.1 Generalized Equation Estimation for Marginal Model
    3.4.2 Penalized Quasi-likelihood for GLMM
4 Random Coefficient Model with Ordinal Response
  4.1 Random Coefficient Model with Ordinal Response
  4.2 The Marginal Likelihood and its Approximation
  4.3 Estimating Model Parameters
  4.4 Estimating the Random Effects
  4.5 NIMH Schizophrenia Example Revisited
  4.6 Health Services Research Example
5 Penalized Model for Traditional Longitudinal High-dimensional Data with an Ordinal Response
  5.1 Review of Forward Stagewise Method
  5.2 Regularization Method for High-dimensional Data with Ordinal Response
  5.3 Regularization Method for Longitudinal High-dimensional Data with an Ordinal Response
  5.4 Model Assessment and Selection
  5.5 Software Implementation
  5.6 Simulations to Evaluate the Proposed Model
    5.6.1 Simulation for High-dimensional Data
    5.6.2 Simulation for Longitudinal High-dimensional Data
  5.7 Some Discussion
6 Application of Proposed Methodology
  6.1 Application to the Smoking Study
  6.2 Application to the Glue Grant Study
    6.2.1 Marshall score for the renal system
    6.2.2 Marshall score for the central nervous system
    6.2.3 Aggregated Marshall score
  6.3 Discussion
7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Variable Selection using LAR type Algorithm
    7.2.2 Variable Selection with Consideration of the Correlations between Features
    7.2.3 Application to Other Genomic and Medical Data
Bibliography
Appendices
A NIMH Schizophrenia Data Code
  A.1 R code for NIMH Schizophrenia Data
  A.2 R code for NIMH Schizophrenia Data using VGAM package
  A.3 SAS code for NIMH Schizophrenia Data
B Orange Tree Example Code
  B.1 R code for Orange Tree Example
  B.2 R code for Orange Tree Example using lme4 package
  B.3 SAS code Orange Tree Example
  B.4 WinBUGS code for Orange Tree Example
C NIMH Schizophrenia Longitudinal Data Code
  C.1 R code for NIMH Schizophrenia Longitudinal Data
  C.2 SAS code for NIMH Schizophrenia Longitudinal Data
  C.3 R code for NIMH Schizophrenia Longitudinal Data using ordinal package
D NIMH Schizophrenia Longitudinal Data Additional Results
  D.1 Random Coefficient Model with Adjacent Categories Logit
  D.2 Random Coefficient Model with Backward Continuation Ratio
  D.3 Random Coefficient Model with Forward Continuation Ratio
E Health Service Research Example Code
  E.1 R code for Health Service Research Example
  E.2 SAS code for Health Service Research Example
F Health Service Research Example Additional Results
  F.1 Health Service Research Example output: Random Intercept Model with Adjacent-Category Logit
  F.2 Random Intercept Model with Backward Continuation Ratio
  F.3 Random Intercept Model with Forward Continuation Ratio Logit
G GSE10006 Smoking Study Additional Results
H Glue Grant Burn Injury Study Example Additional Results
I R code for R package ordinalmixed with Applications
  I.1 Source Code
  I.2 Application to NIMH Schizophrenia Longitudinal Data
  I.3 Application to Health Service Research Example
  I.4 Application to GSE10006 Smoking Study
  I.5 Application to Glue Grant Burn Injury Study
  I.6 High-dimensional Data Simulation
  I.7 Longitudinal High-dimensional Data Simulation

List of Figures

2.1 Estimation picture for the LASSO
2.2 Least square projection in linear regression model
3.1 Orange Tree Growth Curves
4.1 Summary of IMPS score (Normal, Mild, Marked, Severe) by Time in the Placebo Group
4.2 Summary of IMPS score (Normal, Mild, Marked, Severe) by Time in the Intervention Group
4.3 Summary of Housing Status by Time in Group with Section 8 Certificates
4.4 Summary of Housing Status by Time in Group without Section 8 Certificates
5.1 Flowchart for function FSPenFixed in R package ordinalmixed. The blue circle represents input/output and the cyan rectangle represents an R function. The FSPenFixed function first calls function forward.stagewise.cum to perform steps 1, 2 and 3 described in GMIFS for ordinal response with high-dimensional data.