Linear Mixed Model Selection by Partial Correlation
LINEAR MIXED MODEL SELECTION BY PARTIAL CORRELATION

Audry Alabiso

A Dissertation Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

May 2020

Committee:
Junfeng Shang, Advisor
Andy Garcia, Graduate Faculty Representative
Hanfeng Chen
Craig Zirbel

ABSTRACT

Junfeng Shang, Advisor

Linear mixed models (LMM) are commonly used when observations are no longer independent of each other but are instead clustered into two or more groups. In the LMM, the mean response for each subject is modeled by a combination of fixed effects and random effects. The fixed effects are characteristics shared by all individuals in the study; they are analogous to the coefficients of the linear model. The random effects are specific to each group or cluster and help describe the correlation structure of the observations. Because of this, linear mixed models are popular when multiple measurements are made on the same subject or when there is a natural clustering or grouping of observations.

Our goal in this dissertation is to perform fixed effect selection in the high-dimensional linear mixed model. We generally define data to be high-dimensional when the number of potential predictors is large relative to the sample size. High-dimensional data is common in genomic and other biological datasets. In the high-dimensional setting, selecting the fixed effect coefficients can be difficult due to the number of potential models to choose from. However, it is important to be able to do so in order to build models that are easy to interpret. Many current techniques for fixed effect selection in the high-dimensional LMM are based on the penalized log likelihood. However, adding a penalty term to the log likelihood results in a non-convex optimization problem, which requires numerical methods to solve and a data-dependent tuning parameter to select the amount of regularization.

In contrast to the penalized likelihood, the partial correlation is based on marginal measures of association between each predictor and the conditioned response. Techniques based on the partial correlation have two main advantages over those based on penalized likelihoods: no data-dependent tuning parameter is required to select the fixed effects, and the partial correlation is not influenced by strong correlation between covariates. In this dissertation we propose using the partial correlation between each predictor and the response variable conditioned on the random effects to select fixed effects in the LMM. This extends the variable selection approach based on partial correlation developed by Bühlmann, Kalisch, and Maathuis (2010) to the linear mixed model by conditioning the response variable on the random effects. At the time of this writing, selection methods using partial correlation have not been applied to fixed effect selection in the linear mixed model.

This dissertation proposes a two-stage procedure for selecting the fixed effects in the high-dimensional linear mixed model. In the first stage, we use the partial correlation to perform an initial screening of the fixed effect variables in order to estimate an initial linear mixed model. In the second stage, we use the initial linear mixed model to predict the values of the random effects using the Best Linear Unbiased Predictor (BLUP). These predicted values are used to condition the response variable by subtracting the group-specific random effects from the response.
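To make the two stages concrete, here is a minimal R sketch for a random-intercept model, using lme4 to obtain the BLUPs. It illustrates the idea rather than reproducing the TPCc implementation of Chapter 4 and Appendix A: the data frame dat (holding the response y, the grouping factor group, and the candidate predictors), the numeric predictor matrix X built from those columns, and the five-variable screening cutoff are all hypothetical placeholders, and a simple order-zero (marginal) correlation screen stands in for the full partial-correlation screening of the first stage.

```r
## Sketch of the two-stage procedure for a random-intercept model
##   y_ij = x_ij' beta + b_i + e_ij.
## Assumes dat contains y, group, and the predictor columns, and that
## X <- as.matrix(dat[, predictor_names]) is the numeric predictor matrix.
library(lme4)

## Stage 1: initial screening of the fixed effects (marginal correlations
## stand in here for the partial-correlation screening step).
marg_cor <- abs(cor(X, dat$y))
screened <- order(marg_cor, decreasing = TRUE)[1:5]  # keep 5 predictors (illustrative)

## Fit the initial linear mixed model with the screened fixed effects
## and a random intercept for each group.
form <- as.formula(paste("y ~", paste(colnames(X)[screened], collapse = " + "),
                         "+ (1 | group)"))
fit0 <- lmer(form, data = dat)

## Stage 2: predict the random effects with the BLUP and condition the
## response by subtracting each group's predicted random intercept.
blup   <- ranef(fit0)$group[as.character(dat$group), "(Intercept)"]
y_star <- dat$y - blup

## Sample partial correlation between the conditioned response and
## predictor j given a controlling set S, via residuals of two regressions.
partial_cor <- function(j, S, X, y) {
  if (length(S) == 0) return(cor(X[, j], y))
  cor(resid(lm(X[, j] ~ X[, S, drop = FALSE])),
      resid(lm(y ~ X[, S, drop = FALSE])))
}
```

Thresholding these partial correlations over growing controlling sets is what Chapter 4 develops into the TPCc algorithm.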
After conditioning on the random effects, the observations are effectively independent, and we select variables using the partial correlation between the covariates and the conditioned response. In this dissertation, we show that this procedure consistently selects fixed effects in the linear mixed model.

To use the partial correlation to select variables in the LMM, we require the assumption of partial faithfulness on the design matrix X. The partial faithfulness assumption in the LMM describes the relationship between the response conditioned on the random effects and the coefficients of the fixed effects of the LMM. Partial faithfulness in the LMM says that a fixed effect coefficient is equal to zero if and only if the partial correlation between the conditioned response and the corresponding predictor is equal to zero for some set of controlling variables. We present theoretical results demonstrating that when partial faithfulness holds for the LMM, this relationship between the partial correlations and the fixed effect coefficients holds.

We investigate the performance of this method in a variety of simulated high-dimensional scenarios, including non-normal distributions of the random effects. We find that the method is effective at selecting the active set of variables even in the presence of many covariates. Through these simulations, we observe that the proposed technique selects variables quickly and with few false positives, especially in the case where the covariates are highly correlated with each other. We also apply the method to a real high-dimensional dataset regarding the production of riboflavin.

ACKNOWLEDGMENTS

I would like to gratefully acknowledge my advisor, Dr. Junfeng Shang, for all the help she has given me throughout this process. I cannot begin to express my appreciation to her for encouraging my independence as a researcher, for helping me stay focused on the goal, and for the countless hours and feedback she has given me at all stages of this dissertation. This manuscript is immeasurably better because of her invaluable advice. Beyond this dissertation, I am also thankful for her help with my research statement for my job applications and for working with me remotely over the last few semesters.

A special thanks to my committee, Dr. Hanfeng Chen, Dr. Craig Zirbel, and Dr. Andy Garcia. Your feedback and insight have greatly helped me improve my dissertation. I would like to give additional thanks to Dr. Zirbel for his help and feedback on my job applications.

To Sima Sharghi, Kelsey Meyer, Kim Brooks, and Yi-Ching Lee, who have made my time in Bowling Green so much better: thanks for coming into my life and enriching it so much. Extra thanks to Sherri Stefanovic, whose constant encouragement and optimism kept me going.

To my dad, my siblings, Michele and Aaron, and my in-laws, Be and Cam: I hope I made you all proud. Thank you for all the support and encouragement; I don't have the words to express my appreciation for everything you all have done for me.

Finally, I would like to give my eternal gratitude to my husband, La. I could not have done any of this without you. Whenever I doubted, you always had faith. I could write an entire dissertation on how much you mean to me.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
  1.1 Motivation
  1.2 Objectives
  1.3 Outline

CHAPTER 2  LINEAR MIXED MODELS (LMMs)
  2.1 Introduction
  2.2 Linear Model
  2.3 Linear Mixed Model
    2.3.1 Notation and Estimation
    2.3.2 Prediction of Random Effects
    2.3.3 Random Intercepts and Random Slopes
    2.3.4 Covariance Structures for Longitudinal Data
    2.3.5 Random Effects with Skew t Distributions

CHAPTER 3  MODEL SELECTION IN THE LM AND LMM
  3.1 Introduction
    3.1.1 High-Dimensional and Ultrahigh-Dimensional Data
  3.2 Penalized Likelihood Methods
    3.2.1 LASSO
    3.2.2 ALASSO
  3.3 Marginal Methods
    3.3.1 Sure Independence Screening
    3.3.2 Multiple Hypothesis Testing
    3.3.3 Variable Selection via Partial Correlation

CHAPTER 4  VARIABLE SELECTION USING THE THRESHOLDED PARTIAL CORRELATION CONDITIONAL ALGORITHM
  4.1 Partial Faithfulness in the Linear Model
    4.1.1 Definitions
  4.2 Partial Faithfulness in the Linear Mixed Model
    4.2.1 Proof of Theorem 4.2.4
  4.3 The TPCc Algorithm for the LMM
  4.4 Asymptotic Theory of the TPCc
    4.4.1 Lemma 1
    4.4.2 Lemma 2
    4.4.3 Lemma 3

CHAPTER 5  SIMULATION STUDY
  5.1 Setup
  5.2 Normally Distributed Random Effects
    5.2.1 Compound Symmetric
    5.2.2 AR1
    5.2.3 Random Slope
  5.3 Random Effects From a skew t Distribution

CHAPTER 6  APPLICATION
  6.1 Description of Data
  6.2 Analysis
    6.2.1 A Permutation Test for Variable Selection

CHAPTER 7  CONCLUDING REMARKS AND FUTURE WORKS
  7.1 Concluding Remarks
  7.2 Future Works

BIBLIOGRAPHY

APPENDIX A  SELECTED R PROGRAMS
  A.1 Selected Code for Chapter 5: Simulation Study
    A.1.1 Code to Create Random Intercept or Random Slope Data
    A.1.2 Code to Analyze Random Intercept Data
    A.1.3 Code to Create AR1 Data
    A.1.4 Code to Analyze AR1 Data
    A.1.5 Code to Create skew t Data
    A.1.6 Code to Analyze skew t Data
  A.2 Selected Code for Chapter 6: Application
    A.2.1 Code for Riboflavin Analysis
    A.2.2 Code for Permutation Test

LIST OF FIGURES

2.1 Representation of Random Intercepts and Slopes
5.1 CS False Positive Number by Dimension, ρx = 0.0
5.2 CS Overfit by Dimension, ρx = 0.0
5.3 CS Median Model Error by Dimension, ρx = 0.0
5.4 CS Correct Fit by Dimension, ρx = 0.0
5.5 CS False Positive Number by Dimension, ρx = 0.3
5.6 CS Overfit by Dimension, ρx = 0.3