THEORETICAL AND COMPUTATIONAL STUDIES

OF BAYESIAN LINEAR MODELS

A Dissertation Submitted to the Faculty of

The Graduate School

Baylor College of Medicine

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

by

QUAN ZHOU

Houston, Texas

March 16, 2017

APPROVED BY THE DISSERTATION COMMITTEE

Signed Yongtao Guan, Ph.D., Chairman

Rui Chen, Ph.D.

Dennis Cox, Ph.D.

Chris Man, Ph.D.

Michael Schweinberger, Ph.D.

APPROVED BY THE STRUCTURAL AND COMPUTATIONAL

BIOLOGY & MOLECULAR BIOPHYSICS GRADUATE PROGRAM

Signed Aleksandar Milosavljevic, Ph.D., Director of SCBMB

APPROVED BY THE INTERIM DEAN OF

GRADUATE BIOMEDICAL SCIENCES

Signed Adam Kuspa, Ph.D.

Date

Acknowledgements

First of all, I want to express my deepest gratitude to my advisor Dr. Yongtao Guan and all the other members of our group for their tremendous help in every aspect of my life throughout the past five years. The completion of my degree projects should be attributed to Dr. Guan's amazing ideas, selfless support and unceasing hard work. From him I learned programming and statistical skills, enthusiasm for and devotion to science, and how to conduct academic research. My sincere thanks also go to the other group members, Hang Dai, Zhihua Qi, Hanli Xu and Liang Zhao, for their kindness, friendship and encouragement. Second, I want to thank my thesis committee, Dr. Rui Chen, Dr. Dennis Cox, Dr. Chris Man and Dr. Michael Schweinberger, for their amiable, generous and continuous academic help and guidance. Special thanks to Dr. Cox for his excellent courses (Mathematical Statistics I and II, Stochastic Process, Multivariate Analysis, Functional Data Analysis), which opened a new world for me. In addition, I want to thank Dr. Philip Ernst, who has collaborated with me on a probability paper and offered me priceless opportunities such as giving lectures for the course Mathematical Probability. Third, I am truly grateful to my program, SCBMB, and the Graduate School of BCM, which have always supported me in choosing my research and learning the skills that I like. Without the learning environment they have created, I could have achieved almost nothing. I also want to thank Rice University for the courses I took there over four years and for its hospitality to a visiting student like me. Last, but by no means least, I want to thank my parents and my friends and teachers in and outside of Houston, who have filled my PhD journey with hope and happiness.

Dedication

This thesis is dedicated to my family, especially my grandfather and grandmother, who passed away during my PhD study. To their love I shall be immensely and forever indebted.

Abstract

Statistical methods have been extensively applied to genome-wide association studies to demystify the genetic architecture of many common complex diseases in the past decade. Bayesian methods, though not as popular as traditional methods, have been used for various purposes, such as association testing, causal SNP identification, heritability estimation and genotype imputation. This work focuses on Bayesian methods based on linear regression.

Bayesian hypothesis testing reports a (null-based) Bayes factor instead of a p-value. For linear regression, it is shown in Chap. 2.1 that under the null model of no effect, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of independent χ²_1 random variables, where the weights are all between 0 and 1. Similarly, under the alternative model with some necessary conditions on the effect size, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of independent noncentral chi-squared random variables. An immediate benefit is that the p-values associated with the Bayes factors can be computed analytically rather than by permutation, which is of vital importance in genome-wide association studies: due to multiple testing, the significance threshold in whole-genome studies is extremely small, and permutation is in fact impractical. Furthermore, the asymptotic results help explain the behaviour of the Bayes factor and the origin of some well-known paradoxes, like Bartlett's paradox (Chap. 2.2). Lastly, in light of this null distribution, a new statistic named the scaled Bayes factor is proposed. It is defined via a rescaling of the Bayes factor so that the expectation of log(scaled Bayes factor) is fixed to zero (or some other constant). Its practical and theoretical benefits are discussed in Chap. 5.1, and Chap. 5.2 describes an application of the scaled Bayes factor to the analysis of a real whole-genome dataset for intraocular pressure.

For multiple linear regression, the computation of the p-value associated with the Bayes factor requires evaluating the distribution function of a weighted sum of independent χ²_1 random variables. We implemented in C++ a recent polynomial method of Bausch [2013], which appears to be the most efficient solution so far (Chap. 2.3.2 and 2.3.3). Simulation studies (Chap. 2.3.4) show that the p-values computed according to the asymptotic null distribution have very good calibration, even for very large Bayes factors, validating the use of this method in genome-wide association studies.

The expression of the Bayes factor for linear regression contains the posterior mean estimator for the regression coefficients, which is also called the ridge estimator by non-Bayesians. When X^tX is available (X denotes the design matrix), ridge estimators are usually computed via the Cholesky decomposition of the matrix X^tX + cI, which is efficient but still has cubic complexity in the number of regressors. A new iterative method, called ICF (iterative solutions using complex factorization), is proposed in Chap. 3. It assumes that the Cholesky decomposition of X^tX has already been obtained. Simulation (Chap. 3.5) shows that, when ICF is applicable, it is much better than the Cholesky decomposition and other iterative methods like the Gauss-Seidel algorithm.

The ICF algorithm fits perfectly with the Bayesian variable selection regression proposed by Guan and Stephens [2011], since in MCMC the Cholesky decomposition of X^tX can be obtained by efficient updating algorithms (but not for X^tX + cI if c is changing). A reimplementation of their method using ICF substantially improves the efficiency of posterior inference (Chap. 4.3). Simulation studies (Chap. 4.4) show that the new method can efficiently estimate the heritability of a quantitative trait and report well-calibrated posterior inclusion probabilities. Furthermore, compared with another popular software package, GCTA (Chap. 7.5), the new method has much better prediction performance (Chap. 4.4.3).

The use of the scaled Bayes factor for variable selection is discussed in Chap. 5.3. To achieve consistency, the scaling factor is calibrated using the data (Chap. 5.3.1). Simulation studies demonstrate that, after the calibration, the scaled Bayes factor performs at least as well as the unscaled Bayes factor in both heritability estimation and prediction (Chap. 5.4).

Contents

Approvals
Acknowledgements
Abstract
Symbols and Notations
Abbreviations

1 Introduction
  1.1 Genome-Wide Association Studies
    1.1.1 Some Genetic Concepts
    1.1.2 Some Statistical Concepts
  1.2 Bayesian Linear Regression
  1.3 Applications of Bayesian Linear Regression to GWAS
    1.3.1 Association Testing
    1.3.2 Variable Selection, Heritability Estimation and Prediction

2 Distribution and P-value of the Bayes Factor
  2.1 Distribution of Bayes Factors in Linear Regression
    2.1.1 Distributions of Quadratic Forms
    2.1.2 Asymptotic Distributions of log BFnull
    2.1.3 Asymptotic Results in Presence of Confounding Covariates
  2.2 Properties of the Bayes Factor and Its P-value
    2.2.1 Comparison with the P-values of the Frequentists' Tests
    2.2.2 Independent Normal Prior and Zellner's g-prior
    2.2.3 Behaviour of the Bayes Factor and Three Paradoxes
    2.2.4 Behaviour of the P-value Associated with the Bayes Factor
    2.2.5 More about Simple Linear Regression
  2.3 Computation of the P-values Associated with Bayes Factors
    2.3.1 Bartlett-type Correction
    2.3.2 Bausch's Method
    2.3.3 Implementation of Bausch's Method
    2.3.4 Calibration of the P-values

3 A Novel Algorithm for Computing Ridge Estimators
  3.1 Background
  3.2 Direct Methods for Computing Ridge Estimators
    3.2.1 Spectral Decomposition of X^tX
    3.2.2 Cholesky Decomposition of X^tX + Σ
    3.2.3 QR Decomposition of the Block Matrix [X^t Σ^{1/2}]^t
    3.2.4 Bidiagonalization Methods
  3.3 Iterative Methods for Computing Ridge Estimators
    3.3.1 Jacobi, Gauss-Seidel and Successive Over-Relaxation
    3.3.2 Steepest Descent and Conjugate Gradient
  3.4 A Novel Iterative Method Using Complex Factorization
    3.4.1 ICF and Its Convergence Properties
    3.4.2 Tuning the Relaxation Parameter for ICF
  3.5 Performance Comparison by Simulation
    3.5.1 Methods
    3.5.2 Wall-time Usage, Convergence Rate and Accuracy

4 Bayesian Variable Selection Regression
  4.1 Background and Literature Review
    4.1.1 Models for Bayesian Variable Selection
    4.1.2 Methods for the Model Fitting
  4.2 The BVSR Model of Guan and Stephens
    4.2.1 Model and Prior
    4.2.2 MCMC Implementation
  4.3 A Fast Novel MCMC Algorithm for BVSR using ICF
    4.3.1 The Exchange Algorithm
    4.3.2 Updating of the Cholesky Decomposition
    4.3.3 Summary of fastBVSR Algorithm
  4.4 GWAS Simulation
    4.4.1 Posterior Inference for the Heritability
    4.4.2 Calibration of Posterior Inclusion Probabilities
    4.4.3 Prediction Performance
    4.4.4 Wall-time Usage

5 Scaled Bayes Factors
  5.1 Motivations for Scaled Bayes Factors
  5.2 An Application to Intraocular Pressure GWAS Datasets
  5.3 Scaled Bayes Factors in Variable Selection
    5.3.1 Calibrating the Scaling Factors
    5.3.2 Prediction Properties
  5.4 Simulation Studies for Variable Selection

6 Summary and Future Directions
  6.1 Summary of This Work
  6.2 Specific Aims for Future Studies
    6.2.1 Bayesian Association Tests Based on Haplotype or Local Ancestry
    6.2.2 Application of ICF to Variational Methods for Variable Selection
    6.2.3 Extension of This Work to Categorical Phenotypes

7 Appendices
  7.1 Linear Algebra Results
    7.1.1 Some Matrix Identities
    7.1.2 Singular Value Decomposition and Pseudoinverse
    7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition
    7.1.4 Orthogonal Projection Matrices
  7.2 Bayesian Linear Regression
    7.2.1 Posterior Distributions for the Conjugate Priors
    7.2.2 Bayes Factors for Bayesian Linear Regression
    7.2.3 Controlling for Confounding Covariates
  7.3 Big-O and Little-O Notations
  7.4 Distribution of a Weighted Sum of χ²_1 Random Variables
    7.4.1 Davies' Method for Computing the Distribution Function
    7.4.2 Methods for Computing the Bounds for the P-values
  7.5 GCTA and Linear Mixed Model
    7.5.1 Restricted Maximum Likelihood Estimation
    7.5.2 Newton-Raphson's Method for Computing REML Estimates
    7.5.3 Details of GCTA's Implementation of REML Estimations
  7.6 Metropolis-Hastings Algorithm
  7.7 Used Real Datasets
    7.7.1 Merged Intraocular Pressure Dataset
    7.7.2 Height Dataset

Bibliography

List of Figures

2.1 Comparison between P_BF, P_F and P_LR for simple linear regression
2.2 Power comparison between P_BF and P_LR for p = 2
2.3 How BFnull changes with σ
2.4 Calibration of P_BF and P_LR for p = 10
2.5 Calibration of P_BF and P_LR for p = 20
3.1 The relationship between ρ(Ψ) and n
3.2 Distribution of optimal ρ(Ψ) in presence of multicollinearity
3.3 Wall time usage of ICF, Chol, GS, SOR and CG
3.4 Iterations used by ICF, SOR and CG
3.5 Accuracy of ICF, SOR and CG
4.1 Heritability estimation with 200 causal SNPs
4.2 Heritability estimation with 1000 causal SNPs
4.3 Posterior estimation for the model size
4.4 Calibration of the posterior inclusion probabilities
4.5 Calibration of the PIPs with insufficient MCMC iterations
4.6 Relative prediction gain of fastBVSR and GCTA
4.7 Wall time used by fastBVSR for 10K MCMC iterations
5.1 How BFnull and sBF change with σ
5.2 Distributions of BFnull and sBF in the IOP dataset
5.3 Heritability estimation using the scaled Bayes factor
5.4 Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF
5.5 Relative prediction gain of the scaled Bayes factor

List of Tables

3.1 Wall time usage of ICF, Chol, GS, SOR and CG under the null model
4.1 Heritability estimation with 200 causal SNPs
5.1 Top 20 single SNP associations by BFnull (σ = 0.2) in the IOP dataset
5.2 Top 20 single SNP associations by BFnull (σ = 0.5) in the IOP dataset
5.3 Taylor series approximations for the ideal scaling factors
5.4 Heritability estimation using the scaled Bayes factor

Symbols and Notations

R                     the real numbers
a                     a scalar
a (lowercase bold)    a vector
||a||_p               the ℓ_p-norm of a
A (uppercase bold)    a matrix
A^t                   transpose of A
A^*                   conjugate transpose of A
|A|                   determinant of A
tr(A)                 trace of A
I                     identity matrix
I_n                   identity matrix of size n × n
P(A)                  probability of event A
E[X]                  expected value of X
:=                    equal by definition
I_A(x)                indicator function
l.h.s.                left-hand side
r.h.s.                right-hand side
i.i.d.                independent and identically distributed
ind.                  independent

Abbreviations

BVSR Bayesian variable selection regression

BFnull null-based Bayes factor

FDR false discovery rate

GWAS genome-wide association study

ICF iterative solutions using complex factorization

LASSO least absolute shrinkage and selection operator

LD linkage disequilibrium

LRT likelihood ratio test

MAF minor allele frequency

MAP maximum a posteriori

MSE mean squared error

MSPE mean squared prediction error

MCMC Markov chain Monte Carlo

MVN multivariate normal distribution

PIP posterior inclusion probability

REML restricted/residual maximum likelihood

RPG relative prediction gain

sBF scaled Bayes factor

SNP single nucleotide polymorphism

SVD singular value decomposition

Chapter 1

Introduction

1.1 Genome-Wide Association Studies

Genome-wide association studies (GWASs) refer to analyses that use a dense set of SNPs across the whole genome with the ultimate aim of predicting disease risks and identifying the genetic foundations of complex diseases [Bush and Moore, 2012]. Unlike candidate-gene association analysis [Hirschhorn and Daly, 2005], a GWAS scans the whole genome, and a typical study may involve about one million SNPs. Since the discovery of complement factor H as a susceptibility gene for age-related macular degeneration (AMD) [Edwards et al., 2005, Haines et al., 2005, Klein et al., 2005], GWASs have had great success in demystifying the genetic architecture of some common complex diseases [McCarthy et al., 2008], such as breast cancer [Easton et al., 2007], prostate cancer [Thomas et al., 2008], type I and type II diabetes [Todd et al., 2007, Zeggini et al., 2008] and inflammatory bowel disease [Duerr et al., 2006]. A complete list of GWAS findings is available in the NHGRI-EBI (National Human Genome Research Institute and European Bioinformatics Institute) GWAS Catalog [Welter et al., 2014]. Besides, GWASs have also shed light on individual drug metabolism and given birth to the idea of personalized medicine (see Motsinger-Reif et al. [2013] for a review).

In this section I will explain several important genetic or statistical concepts that are indispensable to the understanding of GWAS. Some notions like heritability and the details of the statistical methods will be explicated in later sections.

1.1.1 Some Genetic Concepts

Single Nucleotide Polymorphism Commonly abbreviated as SNP, single nucleotide polymorphism refers to a single base-pair change in the DNA sequence that has a high prevalence, say greater than 1%, in some population. Most SNPs are biallelic, and the frequency of the less common allele is denoted by MAF (minor allele frequency). For example, suppose we know some SNP with major allele A and minor allele T has an MAF equal to 0.2 in some population (human DNA contains only four types of nucleotides: A, T, C, G). Then at that SNP locus, under certain conditions, we may estimate that about 64% of that population have genotype AA, about 4% have genotype TT, and the remaining 32% have genotype AT (the human genome is diploid). Here the Hardy-Weinberg equilibrium is implicitly assumed, which refers to a set of conditions under which the number of copies of the minor allele follows a binomial distribution. SNPs with more than two alleles are very rare and are often excluded from statistical analysis. Note that in meta-analysis, when we merge datasets from different studies, great care should be taken when dealing with A/T and C/G SNPs, since a flipping of the reference allele can easily cause a false positive in the association testing. Most SNPs do not pose any obvious harm to human health, since either they are located in the non-coding region, which occupies about 98% of the whole genome, or the allele change is synonymous, which means the amino acid coded remains the same. Some SNPs may cause amino acid changes and thereby directly alter protein functions; however, the overall effect on the human body is usually imperceptible, since otherwise they would probably have been eliminated by natural selection. SNPs with MAF lower than 5% (or sometimes 1%) are often called rare variants [Committee, 2009, Lee et al., 2014]. Extremely rare variants, which can be detrimental to protein functions, are often called mutations [Bush and Moore, 2012].
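The 64%/32%/4% figures above are just the Hardy-Weinberg (binomial) expansion of the allele frequencies. A minimal sketch of this calculation, assuming a biallelic SNP under random mating (the code is purely illustrative):

```python
# Genotype frequencies under Hardy-Weinberg equilibrium for a biallelic SNP.
# Assumes random mating; maf is the minor allele frequency (0.2 in the example above).
def hwe_genotype_freqs(maf):
    p, q = 1.0 - maf, maf          # major and minor allele frequencies
    return {"AA": p * p,           # homozygous major: 0.64 when maf = 0.2
            "AT": 2 * p * q,       # heterozygous:     0.32
            "TT": q * q}           # homozygous minor: 0.04

print(hwe_genotype_freqs(0.2))
```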

Quantitative and Categorical Traits In GWAS, the response variable y in a regression model is often referred to as the trait or the phenotype. A trait can be categorical, in fact often binary, for example the case/control status of a disease like cancer. In such situations, an association test between the trait and a SNP directly estimates how likely the SNP is to be disease-predisposing, together with its odds ratio. For some other diseases, there exists a quantitative trait that is known as a risk factor, for example the lipid level [Kathiresan et al., 2008]. Then the association test with that quantitative trait can also help us find the genetic variants that could be used for predicting disease risks. Other typical continuous traits used in GWASs include height [Weedon et al., 2008] and fat mass [Scuteri et al., 2007]. Height appears to be the best candidate trait for studying heritability (see Chap. 1.3.2).

Common Disease-Common Variant Hypothesis Underlying most of the GWAS methodologies targeting common complex diseases is the common disease-common variant (CD-CV) hypothesis [Lander, 1996, Reich and Lander, 2001], which predicts that the genetic risk factors for common diseases like diabetes and cancer should mostly be alleles with relatively high frequencies (> 1%). Hence traditional family-based genetic studies may fail because of the very small effects of the disease-predisposing variants (since the total number of causal variants is large), but a whole-genome scan and joint analysis of the genetic variants for hundreds or thousands of case and control subjects are believed to lead to some findings. Early evidence supporting this hypothesis includes the high-MAF variants in the APOE gene that are risk factors for Alzheimer's disease [Corder et al., 1993]. For some diseases, guided by this principle, exciting discoveries have been made, as mentioned at the beginning of this section, while for some other diseases like asthma and coronary heart disease, GWASs have been much less fruitful [McCarthy et al., 2008].

There are doubts about the CD-CV hypothesis [Pritchard and Cox, 2002]. As a complement, genome-wide rare-variant association studies have attracted increasing attention in the past decade and made impressive breakthroughs [Li and Leal, 2008]. But unlike common variant analysis, one major difficulty of rare variant analysis is low statistical power (to be explained shortly) due to the low frequencies of the variants (the effective sample size is small). See Asimit and Zeggini [2010] and Lee et al. [2014] for reviews of the methods for rare variant studies. In this thesis, rare variants are not given special treatment.

Linkage Disequilibrium Linkage disequilibrium (LD) may be defined as the degree to which an allele of one SNP is correlated with an allele of another SNP [Bush and Moore, 2012]. In meiosis, recombination can break a chromosome into several segments, and thus the genome of a child is a mosaic of his/her parental genomes. However, nearby SNPs on a paternal or a maternal chromosome are usually inherited together, since the recombination rate is low. Even after many generations, fixed combination patterns of nearby alleles are still very common across the whole genome, and they are often called haplotypes. When two SNPs have statistical correlation equal to 1, we say they are in perfect LD. When the correlation is zero, we say they are in linkage equilibrium. Note that for two biallelic SNPs, uncorrelatedness is equivalent to independence. For other commonly used measures of LD, see Devlin and Risch [1995]. Linkage disequilibrium brings both benefits and problems to GWASs. Thanks to LD, we do not need to genotype every SNP of the human genome to search for the "causal" variants, because a SNP in LD with the truly causal variant would also show association with the phenotype. This is called indirect association [Collins et al., 1997]. As a result, whether susceptibility genes can be detected largely depends on the degree of LD between the genotyped variant and the truly causal variant [Ohashi and Tokunaga, 2001]. On the other hand, the efficiency of GWAS is reduced by LD, since association tests for different SNPs are correlated. We will discuss this issue in greater detail later.

Imputation and Phasing Missing genotypes are common in GWAS datasets. If the missing rate is very low, one may simply replace each missing value with the mean or the median of that SNP, but this is clearly not the ideal solution. Furthermore, sometimes we need to combine datasets from different studies that were generated from different genotyping platforms and different SNP arrays. Because the intersection of the genotyped SNPs is often small, there can be a large proportion of missing values. One way to make full use of the data in such situations is imputation, which means inferring the missing SNPs from the neighbouring genotyped SNPs. The key idea behind imputation is linkage disequilibrium, i.e., the fact that SNPs located close together are not independently distributed. Existing software packages include BIMBAM [Guan and Stephens, 2008], IMPUTE [Howie et al., 2009], MACH [Biernacka et al., 2009] and BEAGLE [Browning and Browning].

Another very similar concept is phasing, which refers to the inference of haplotypes from genotypes. The human genome is diploid and contains two copies of the haploid genome. Hence the genotype of a SNP takes a value in {0, 1, 2}, which represents the count of minor (or major) alleles. Current genotyping technology can only report the genotypes but cannot distinguish which alleles come from the same chromosome. Phasing is then used to separate a sequence of genotypes into two copies called haplotypes. Just like imputation, phasing also relies on linkage disequilibrium; both use hidden Markov models to model the LD and make inferences. Indeed, most of the imputation tools can perform phasing as well. Software packages designed specifically for phasing include SHAPEIT [Delaneau et al., 2008] and fastPHASE [Scheet and Stephens, 2006]. The booming next-generation sequencing technology is generating large amounts of phased SNP data (sequencing data), and perhaps in the near future association tests using haplotypes instead of SNPs will become a standard strategy for GWAS.

1.1.2 Some Statistical Concepts

Statistical Significance and Power Consider an association test with a single SNP. The goal is to figure out whether the SNP has an effect (direct or indirect) on the phenotype. Regarding the testing result there are four possible scenarios: true positive, true negative, false positive and false negative. If a SNP that has no effect is called positive by the test, we call it a false positive or a type I error. If a causal SNP is not identified by the test, we call it a false negative or a type II error. In the language of hypothesis testing, the null hypothesis is that the SNP has no effect on the phenotype, and the probability of incorrectly rejecting a null hypothesis is called the type I error rate or the size of the test. In practice a type I error is usually much more harmful than a type II error, and thus when conducting a hypothesis test we would like to control the type I error rate under some threshold, which is called the significance level and often denoted by α. To this end, a statistic named the p-value is computed, such that by rejecting the null when the p-value is smaller than α, the type I error rate is controlled. In fact the p-value is equal to the "tail probability" under the null hypothesis, and it is sometimes referred to as the "observed significance level". A smaller p-value indicates a more significant association (it does not imply a larger effect size). Another important concept in hypothesis testing, power, is defined as the probability of correctly rejecting the null, i.e., one minus the type II error rate. It is the most critical metric when comparing different testing methods: at a given significance level, the method with the larger power is deemed better. The power of a GWAS testing procedure depends on four factors: the testing method, the significance level, the true effect size of the causal variant, and the information we have in the data. In general, when we have a larger sample size or the causal variant has a larger MAF, the test has greater power. Since in most cases the causal SNPs have only small effects on the trait, how to construct a more powerful test is a central question for the statisticians engaged in GWASs. For more discussion of power in GWAS, see de Bakker et al. [2005] among others.

Single SNP Test The single SNP test, which means testing every SNP separately for association with the phenotype, is the most common statistical strategy for detecting causal variants in GWAS. Here by "causal" we mean that the variant has either a direct or an indirect effect, i.e., the "causal variant" may not be the exact genetic cause of the disease but is truly correlated with the phenotype and could be used for predicting disease risk. See Morris and Kaplan [2002] and Martin et al. [2000] for both theoretical and empirical reasons why the single SNP test is preferred to the multi-locus test. From a practical point of view, the multi-locus test faces many computational difficulties, since the number of possible models from a GWAS dataset is enormous. Even if we only want to test all the possible two-SNP or three-SNP models, the total number is already much greater than what modern computers can handle.

For a binary trait (case/control status), the traditional single SNP test methods include logistic regression and contingency table tests. The Cochran-Armitage trend test [Agresti and Kateri, 2011, Chap. 5.3] is deemed the best choice in many settings [McCarthy et al., 2008]. For a quantitative trait, linear regression, generalized linear regression and ANOVA are often used. The genotype is often coded as 0, 1 or 2 to represent the number of copies of the minor allele, but when genotypes are imputed they may take any value within [0, 2]. The effect of the minor allele can be modelled in several ways: dominant, recessive, multiplicative and additive. A simple linear regression model with the {0, 1, 2} coding corresponds to the additive model.
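To make the additive coding concrete, the following sketch simulates a {0, 1, 2}-coded genotype and a quantitative trait and fits the corresponding simple linear regression; the data, sample size and effect size are hypothetical and serve only to illustrate the single SNP test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, maf = 1000, 0.3
g = rng.binomial(2, maf, size=n)        # additive {0,1,2} genotype coding
y = 0.2 * g + rng.normal(size=n)        # quantitative trait with a small additive effect

# Simple linear regression y = mu + beta*g + e; the slope test is the additive single-SNP test.
slope, intercept, r, pval, se = stats.linregress(g, y)
print(f"beta_hat = {slope:.3f}, p-value = {pval:.2e}")
```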

Bayesian approaches to single SNP testing are not as popular as non-Bayesian methods, mainly due to the difficulty of producing p-values, though the Bayesian substitute, the Bayes factor, has its own advantages. The theoretical work in Chap. 2 will offer a solution. Nevertheless, there are some successful attempts at Bayesian methods for single SNP testing. See, for example, Marchini et al. [2007] and Servin and Stephens [2007].

Confounding Covariates Apart from genetic variants, there can be other factors influencing the phenotype, for example age and sex. Theoretically speaking, we always need to control for these confounding variables in order to avoid spurious associations and increase testing power. For a somewhat contrived example, consider a sex-linked trait like red-green color blindness. If we do not control for sex when the study subjects contain equally many males and females, probably a large proportion of the SNPs located on the X chromosome would be tested positive.

The confounding factor that most needs to be worried about in practice is population stratification. Many complex diseases are known to have different prevalence rates in different populations (like Asian, African and European). If the population stratification is not appropriately accounted for when the dataset consists of subjects from different populations, the ethnicity-specific SNPs are very likely to be tested positive. In such cases, "inflated" p-values can often be observed. To explain this, recall that under the null hypothesis, the p-value is uniformly distributed on (0, 1) [Casella and Berger, 2002, Chap. 8.3]. Since in GWAS as many as 1 million SNPs may be tested for association, most of the tests are expected to be under the null and, consequently, the p-values should exhibit a uniform distribution on (0, 1) except at the tail. But if some confounding factor fails to be controlled for, the p-values (after ordering) can display a clear overall tendency to be smaller than their expected values, which is called inflation. See Gamazon et al. [2015] for a figure of inflated p-values; it seems that population stratification was not controlled for in that study. A simple method for correcting inflation is genomic control [Devlin and Roeder, 1999, Devlin et al., 2001]. Today people usually prefer to use principal component analysis [Price et al., 2006]. One can do the eigendecomposition oneself and add the first three to ten principal component scores as covariates, or use software like STRUCTURE [Pritchard et al., 2000] and EIGENSTRAT [Price et al., 2006]. The International HapMap Project [The International HapMap Consortium, 2010] provides samples from different ethnic groups that have been very densely genotyped and are often used as the reference panel in the principal component analysis.
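The inflation just described is often summarized by the genomic-control factor λ_GC of Devlin and Roeder [1999], the ratio of the median observed test statistic to the median expected under the null. A minimal sketch, assuming the p-values come from 1-d.f. chi-squared tests:

```python
import numpy as np
from scipy import stats

def genomic_control_lambda(pvalues):
    """Genomic-control inflation factor: convert p-values back to 1-d.f. chi-squared
    statistics and divide their median by the chi-squared_1 median (about 0.455)."""
    chi2_obs = stats.chi2.isf(np.asarray(pvalues), df=1)   # p-value -> chi-squared_1 statistic
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

# Under the null, p-values are Uniform(0,1) and lambda_GC should be close to 1.
rng = np.random.default_rng(1)
print(genomic_control_lambda(rng.uniform(size=100_000)))
```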

Correction for Multiple Testing For a single test, one simply compares the p-value with the significance threshold α. However, when one performs multiple tests and still wants to control the probability of making one or more type I errors below α, a more stringent cutoff for the p-values is needed. For instance, if two independent tests both have type I error rate equal to 0.05, the probability of making at least one type I error is as much as 0.098. The most widely used, and probably the most convenient, correction method is the Bonferroni correction, which is an approximation to Šidák's formula [Šidák, 1968, 1971]. Unfortunately, both the Bonferroni and the Šidák correction methods are derived assuming independence of the tests and turn out to be too stringent in GWAS, so that some true signals might have to be discarded. Thus many substitutes for the Bonferroni correction have been proposed, for example Nyholt [2004] and Conneely and Boehnke [2007]. A non-parametric method for calculating the necessary p-value threshold is permutation, which is implemented in software like PLINK [Purcell et al., 2007] and PERMORY [Pahl and Schäfer, 2010].
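For reference, the two classical corrections reduce to simple per-test thresholds; the sketch below computes both for m tests (the numbers are illustrative only).

```python
def bonferroni_threshold(alpha, m):
    # Reject when p < alpha / m; guarantees family-wise error rate <= alpha.
    return alpha / m

def sidak_threshold(alpha, m):
    # Reject when p < 1 - (1 - alpha)^(1/m); exact for m independent tests.
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

# With one million tests the two thresholds are nearly identical (about 5e-8),
# which is the origin of the usual genome-wide significance threshold.
print(bonferroni_threshold(0.05, 10**6), sidak_threshold(0.05, 10**6))
```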

A simple rule of thumb for determining whether a p-value is significant in GWAS is the so-called genome-wide significance threshold. Due to the LD between the SNPs, even if all the SNPs across the whole genome are genotyped, the effective number of independent tests is much smaller. It is estimated that most of the SNPs in the human genome could be expressed as a linear combination of 500,000 to 1,000,000 SNPs. Using the dataset from the Wellcome Trust Case-Control Consortium [Burton et al., 2007], Dudbridge and Gusnanto [2008] estimated the genome-wide significance threshold to be 7.2 × 10^{-8}, which corresponds to a 0.05 family-wise type I error rate, for GWASs with subjects of European descent. A more widely used threshold that can be applied to any GWAS is 5 × 10^{-8} [Barsh et al., 2012, Panagiotou and Ioannidis, 2012, Jannot et al., 2015]. It can be thought of as obtained from the Bonferroni correction with α = 0.05 assuming 1 million independent SNPs. Thus this threshold should only be used when the total number of tests is greater than 1 million.

Another approach is to control the false discovery rate (FDR) [Benjamini and Hochberg, 1995] instead of the type I error rate. In effect it allows a larger p-value cutoff, and thus more SNPs would be declared significant. With the rapid development of biological technologies, the validation of a causal variant by molecular experiment becomes easier, and thus scientists are willing to increase the test power at the cost of more type I errors. See Storey and Tibshirani [2003] for a discussion on the use of FDR in GWAS. See Sun and Cai [2009] for a review of different methods for controlling FDR.

1.2 Bayesian Linear Regression

The following Bayesian linear regression model is the main object of this work:

    y | β, τ ∼ MVN(Xβ, τ^{-1}I),
    β | τ, V ∼ MVN(0, τ^{-1}V),                                  (1.1)
    τ | κ_1, κ_2 ∼ Gamma(κ_1/2, κ_2/2),   κ_1, κ_2 → 0.

Here y = (y_1, ..., y_n) is the response vector, X is an n × p design matrix, and β is a p-vector called the regression coefficients. I denotes the identity matrix and MVN stands for the multivariate normal distribution. The first statement in (1.1) is equivalent to

    y = Xβ + ε,   ε | τ ∼ MVN(0, τ^{-1}I),

and thus implicitly the errors ε_1, ..., ε_n are assumed to be i.i.d. normal random variables with mean 0 and variance τ^{-1}. The second and third lines of (1.1) are called the normal-inverse-gamma prior, which is conjugate for the normal linear model. The only prior parameter that needs to be specified is the covariance matrix V; the other parameters, κ_1 and κ_2, are sent to 0 to represent a noninformative setting. See Chap. 7.2 for more details and variations of this model. Here is a summary of important points.

• All of our major results hold for the full regression model y = Wa + Xb + ε, where W represents the confounding covariates to be controlled for and X represents the variables of interest. The Bayes factor for this full model is equivalent to the Bayes factor for model (1.1) once y and X are replaced with their residuals after regressing out W. See Chap. 7.2.3 for proof and discussion.

• The intercept term does not need to be included in model (1.1) for the reason explained in the last remark: it is equivalent to centering both X and y. However, there is a slight difference between the following two statements: (a) y | β, τ ∼ MVN(Xβ, τ^{-1}I); (b) y | β, τ, µ ∼ MVN(Xβ + µ, τ^{-1}I). Because µ is unknown, when it is integrated out the errors "lose" one degree of freedom. This difference, nevertheless, has very little effect, as we will see in Chap. 2.1. The same rationale applies to regressing out W.

• The prior for τ is equivalent to the well-known Jeffreys prior, which is most commonly used in the literature. It is the standard choice in a noninformative setting. The posteriors for β and τ are still proper.

• In rare applications, it might be more desirable to assume τ^{-1} is known. Inference with known error variance is also discussed in Chap. 7.2. Note that as n goes to infinity, τ can be estimated precisely, in the sense that its posterior contracts to the true value. Therefore, the case of known error variance is again a special case, or rather the limiting case as n → ∞, of model (1.1). This intuition is very important in deriving the asymptotic distribution of the Bayes factor (Chap. 2.1).

The conditional posterior of β given τ is (see Chap. 7.2.1)

    β | y, τ, V ∼ MVN((X^tX + V^{-1})^{-1}X^ty, τ^{-1}(X^tX + V^{-1})^{-1}).

Hence, the maximum a posteriori (MAP) estimator for β is

    β̂ = (X^tX + V^{-1})^{-1}X^ty.                               (1.2)

The null-based Bayes factor for model (1.1) is given by (see Chap. 7.2.2)

    BFnull = |I + X^tXV|^{-1/2} ((y^ty − y^tX(X^tX + V^{-1})^{-1}X^ty) / y^ty)^{-n/2}.      (1.3)

It is straightforward to check that the BFnull defined in (1.3) is invariant to the scaling of y. Throughout this study, whenever we refer to a Bayes factor, the null model is used as the reference unless otherwise stated. For the covariance matrix V, we consider two choices:

    Independent normal prior:  V = σ²I;                          (1.4)
    Zellner's g-prior:         V = g(X^tX)^{-1}.                 (1.5)
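For moderate p, the quantities in (1.2)-(1.5) can be computed directly from these formulas. The sketch below uses simulated data and the independent normal prior (1.4) with a hypothetical σ = 0.5; it is only meant to illustrate the algebra, not the C++ implementation used later in this work.

```python
import numpy as np

def ridge_and_log_bf(X, y, V):
    """Posterior mean / ridge estimator (1.2) and log BFnull (1.3) for model (1.1)."""
    n, p = X.shape
    A = X.T @ X + np.linalg.inv(V)
    beta_hat = np.linalg.solve(A, X.T @ y)                 # (X^tX + V^{-1})^{-1} X^t y
    rss_ratio = 1.0 - y @ X @ np.linalg.solve(A, X.T @ y) / (y @ y)
    log_bf = -0.5 * np.linalg.slogdet(np.eye(p) + X.T @ X @ V)[1] \
             - 0.5 * n * np.log(rss_ratio)
    return beta_hat, log_bf

rng = np.random.default_rng(0)
n, p, sigma = 500, 3, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([0.3, 0.0, -0.2]) + rng.normal(size=n)
X -= X.mean(axis=0); y -= y.mean()                         # center X and y, as discussed above
V = sigma**2 * np.eye(p)                                   # independent normal prior (1.4)
beta_hat, log_bf = ridge_and_log_bf(X, y, V)
print(beta_hat, log_bf)
```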

1.3 Applications of Bayesian Linear Regression to GWAS

There is probably little doubt that linear regression is the most widely used statistical model. In genome-wide association studies (GWAS), Bayesian linear regression, though much less popular than its non-Bayesian counterpart, has been applied for various purposes in an extensive body of literature. For a review, see Balding [2006] and Stephens and Balding [2009].

1.3.1 Association Testing

Although the regression model appears to imply a prospective study design (y is random and X is fixed), it could be applied to retrospective studies as well, as justified by Seaman and Richardson [2004]. For a quantitative trait, the single SNP test could be performed using model (1.1) where X has only one column. Servin and Stephens [2007] proposed such a model and discussed how to choose a noninformative and improper prior that admits a proper Bayes factor, which will be used in the derivation of the distribution of Bayes factors in Chap. 2. For a binary trait, the Bayesian logistic regression model is the appropriate choice [Marchini et al., 2007]. Wakefield [2009] proposed an asymptotic method which is very efficient compared with most inference methods for logistic regression. On a side note, both Servin and Stephens [2007] and Marchini et al. [2007] discussed how to perform association testing using imputed genotypes.

For many non-statisticians, an uneasy feature of the Bayesian association test, or more generally Bayesian hypothesis testing, is that it produces a Bayes factor instead of a p-value. To be fair, each statistic has its merits and demerits; see Kass and Raftery [1995], Lavine and Schervish [1999], Katki [2008] and Goodman [1999] among many others for comparisons between p-values and Bayes factors. Bayesians would probably prefer the Bayes factor since it measures the evidence for the alternative hypothesis while the p-value does not. Another practical advantage of the Bayes factor is its convenience in combining multiple tests. The Bayes factor comparing a model M against the null model M_0 is defined by

    BFnull(M) := p(y | M) / p(y | M_0),

where p(y | ·) denotes the marginal likelihood of the model (see Chap. 7.2.2 for more details). Suppose we have a small candidate genomic region that contains K SNPs and we want to average the association test over these K SNPs. Then the Bayes factor for this SNP set, or for this region, is simply

    BFnull(M) = (1 / p(y | M_0)) Σ_{i=1}^{K} p(y | x_i, M) p(x_i | M).

Similarly we can also average over the four genetic models: dominant, recessive, additive and multiplicative. This method was implemented in Marchini et al. [2007].
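With equal prior weights p(x_i | M) = 1/K, the regional Bayes factor above is simply the average of the K single-SNP Bayes factors. A minimal sketch of this averaging, working on the log scale for numerical stability and assuming the single-SNP log Bayes factors have already been computed:

```python
import numpy as np

def region_log_bf(single_snp_log_bfs, prior_weights=None):
    """Log Bayes factor for a SNP set: a weighted average of single-SNP Bayes
    factors, computed with the log-sum-exp trick to avoid overflow."""
    log_bfs = np.asarray(single_snp_log_bfs, dtype=float)
    k = log_bfs.size
    w = np.full(k, 1.0 / k) if prior_weights is None else np.asarray(prior_weights)
    m = log_bfs.max()
    return m + np.log(np.sum(w * np.exp(log_bfs - m)))

# Example: three SNPs in a candidate region with equal prior weights.
print(region_log_bf([0.5, 4.2, 1.1]))
```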

The multi-locus association test with model (1.1) is often used to model the joint effect of the SNPs within a restricted region, for example a candidate gene. Servin and Stephens [2007] tested all the possible K-QTN (QTN: quantitative trait nucleotide) models within a given region for K = 1, 2, 3, 4. Another potential application of multiple linear regression is rare variant studies. The sequence kernel association test (SKAT) proposed by Wu et al. [2011] and Ionita-Laza et al. [2013] uses the variance component model, a non-Bayesian method, to identify the rare variants associated with the phenotype. Using the Bayesian linear regression model (1.1) with the independent normal prior (1.4), the idea of SKAT can be interpreted as testing the null hypothesis σ² = 0 against the alternative σ² > 0.

1.3.2 Variable Selection, Heritability Estimation and Prediction

A typical variable selection procedure with the regression model assumes

    y = Σ_{i=1}^{N} β_i x_i + ε,   ε ∼ MVN(0, τ^{-1}I),

where N is the total number of SNPs in the dataset and most of the β's are equal to 0. Variable selection simultaneously analyzes all the SNPs and identifies which β's are not zero. Unlike frequentists' methods, which aim to find a single optimal model, Bayesian variable selection tries to estimate the probability P(β_i ≠ 0) for every SNP. Variable selection is one of the central topics of applied Bayesian analysis. Chap. 4.1 gives an extensive review of the generic methods for Bayesian variable selection based on regression.

Early attempts at Bayesian variable selection in genetic studies usually involved up to a few hundred covariates. The most typical application was the mapping of quantitative trait loci (QTLs) [Uimari and Hoeschele, 1997, Sillanpää and Arjas, 1998, Broman and Speed, 2002, Kilpikari and Sillanpää, 2003, Meuwissen and Goddard, 2004]. Besides, Yi et al. [2005] and Hoti and Sillanpää [2006] studied epistatic interactions and genotype-expression interactions, respectively. Although the computational methods used in those studies were not designed for whole-genome datasets, some key ideas and techniques were employed later in GWASs: for instance, the Jeffreys shrinkage prior used by Xu [2003] and Wang et al. [2005] (see Ter Braak et al. [2005] for a discussion of the propriety of the posterior), the Laplace shrinkage prior used by Yi and Xu [2008], the reversible jump MCMC approach taken by Lunn et al. [2006], the stochastic search variable selection algorithm used by Yi et al. [2003] and the composite model space search of Yi [2004]. Recent years have witnessed an increasing number of rewarding applications of Bayesian variable selection to GWASs. Li et al. [2011] used the Bayesian LASSO to detect genes associated with body mass index; Ishwaran and Rao [2011] applied a Gibbs sampling scheme proposed in Ishwaran and Rao [2000] to the microarray data analysis for colon cancer; Stahl et al. [2012] proposed an approximate Bayesian computation method and studied a GWAS dataset for rheumatoid arthritis.

One particular application that needs emphasis is heritability estimation. The narrow-sense heritability refers to the proportion of the phenotypic variation that is due to additive genetic effects. It can be reliably estimated from close-relative data, especially twin studies [Gielen et al., 2008], using statistical methods that can be traced back to Galton [Galton, 1894] and Fisher [Fisher, 1919]. For example, the heritability of height was estimated to be between 0.77 and 0.91 [Macgregor et al., 2006]. But, in stark contrast, early GWASs on tens of thousands of individuals detected around 50 variants statistically associated with height, which in total could only explain about 5% of the phenotypic variance [Yang et al., 2010]. A later study increased the proportion of variance explained to 10% after identifying 180 associated loci from 183,727 individuals [Allen et al., 2010]. This huge gap is referred to as the "missing heritability". Two reasons immediately stood out. First, many causal variants may be neither genotyped nor in complete linkage disequilibrium with the genotyped ones. Second, most causal variants may only contribute a very small amount of variation and thus fail to reach the significance thresholds. Other theories, like rare variants and epistatic interactions, cannot explain the fact that most of the heritability is missing.

There were few methodological studies on heritability estimation before the advent of GWAS. But one of them, Meuwissen et al. [2001], compared by simulation the performance of Bayesian models and the linear mixed model, which in fact represent the current mainstream approaches to heritability estimation. The first sensible heritability estimate from real GWAS datasets was attained by GCTA [Yang et al., 2011], a package that implements restricted maximum likelihood inference for the linear mixed model (see Chap. 7.5.1). It estimated the heritability of height to be about 45% from 3,925 subjects [Yang et al., 2010]. The rationale of GCTA is in fact the same as that of the classical methods using relative data: if two individuals are similar in genotype, they should have similar phenotypes. GCTA uses all the SNPs in the dataset to calculate the genetic relatedness between individuals and infers the heritability using that genetic relationship matrix. The Bayesian approach was later proposed by Guan and Stephens [2011], where the heritability was modelled as a hyperparameter and the regression coefficients were given the standard spike-and-slab prior. The model was then generalized by Zhou et al. [2013], where the prior for the regression coefficients became a mixture of two normal distributions. Compared with GCTA, the Bayesian methods have the following advantages. First, GCTA assumes every SNP makes a small i.i.d. contribution to the phenotype, which could be seriously violated for some phenotypes. On the contrary, Bayesian methods do not rely on this assumption and can pick out the SNPs with larger effects by variable selection. Second, the heritability estimator of GCTA usually has a larger variance than the Bayesian estimators. Last but not least, the prediction performance of GCTA is poor, on which we want to make more comments.
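The genetic relationship matrix just mentioned is conceptually simple: standardize each SNP and average the cross-products over SNPs. The sketch below standardizes by sample means and standard deviations (GCTA's exact scaling, which is based on allele frequencies, differs slightly) and assumes a complete 0/1/2 genotype matrix with no monomorphic SNPs.

```python
import numpy as np

def genetic_relationship_matrix(G):
    """n x n relatedness matrix from an n x N genotype matrix (0/1/2 coding).
    Each SNP is standardized to mean zero and unit variance; the relatedness of two
    individuals is the average product of their standardized genotypes."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    return Z @ Z.T / G.shape[1]

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(100, 5000)).astype(float)   # 100 individuals, 5000 SNPs
A = genetic_relationship_matrix(G)
print(A.shape, A.diagonal().mean())                        # diagonal averages to about 1
```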

Prediction is one of the ultimate goals of both GWAS and variable selection. A model that perfectly fits the observed data is practically useless if it has no predictive power. This is why in variable selection, given so many potential predictors, we only want to select a small number of them and shrink their regression coefficients. Although GCTA is computationally very fast and can provide unbiased estimators for the regression coefficients when the model assumptions hold, its prediction performance, shown in [Zhou et al., 2013, Fig. 2], is much worse than that of the Bayesian approaches. For more general purposes, Guan and Stephens [2011] showed that BVSR outperformed the LASSO (least absolute shrinkage and selection operator) [Tibshirani, 1996], which is one of the most widely used non-Bayesian variable selection procedures. This advantage is partly due to model averaging [Raftery et al., 1997].

Chapter 2

Distribution and P-value of the Bayes Factor

2.1 Distribution of Bayes Factors in Linear Regression

The null-based Bayes factor for the linear regression model introduced in Section 1.2 is given by

    BFnull(X, V) = |I + X^tXV|^{-1/2} (1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty)^{-n/2},      (2.1)

where V is the normalized prior variance matrix for β (see Section 1.2 for the notation). In order to calculate the p-value for this Bayes factor, it is necessary to characterize its distribution under the null model, that is, y ∼ MVN(0, τ^{-1}I). To this end, the distribution of the quadratic form z^tX(X^tX + V^{-1})^{-1}X^tz is first identified, where z is a multivariate normal variable with covariance matrix I. It is then used to derive the asymptotic distribution of log BFnull. The distribution of log BFnull under the alternative model (β ≠ 0) is also discussed.

2.1.1 Distributions of Quadratic Forms

Throughout this chapter, we define

    H := X(X^tX + V^{-1})^{-1}X^t,                               (2.2)

which is symmetric and thus admits a spectral decomposition,

    H = UΛU^t = Σ_{i=1}^{p} λ_i u_i u_i^t,                       (2.3)

with eigenvalues (the diagonals of Λ) λ_1 ≥ ... ≥ λ_p ≥ 0 = λ_{p+1} = ... = λ_n and eigenvectors u_1, ..., u_n. We first prove a lemma about these eigenvalues. See Chap. 7.1.3 for the definition and properties of eigenvalues.

Proposition 2.1. Let the spectral decomposition of H be given by (2.3). Then,

(a) 0 ≤ λi < 1 for 1 ≤ i ≤ p;

(b) H and X have the same (left) null space;

(c) log(|I + X^tXV|) = −Σ_{i=1}^{p} log(1 − λ_i).

Proof. Suppose λ_i ≠ 0. Since we may write

    H = XV^{1/2}(V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^t,

by Lemma 7.8, λ_i is also an eigenvalue of (V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^tXV^{1/2}. By spectral decomposition, λ_i/(1 − λ_i) is then an eigenvalue of V^{1/2}X^tXV^{1/2}. Next we claim that V^{1/2}X^tXV^{1/2} and H have the same number of nonzero eigenvalues. This is because

    rank(V^{1/2}X^tXV^{1/2}) = rank(X^tX) = rank(X) = rank(H).

The first two equalities follow from the properties of rank. The last equality needs more explanation. If Hz = 0 for some vector z, then z^tHz = 0. Since (X^tX + V^{-1})^{-1} is positive definite, this implies that X^tz = 0. Hence H and X have the same (left) null space (the other direction is trivial), which implies rank(X) = rank(H). Since V^{1/2}X^tXV^{1/2} is positive semi-definite, its eigenvalues must be nonnegative and thus λ_i ∈ [0, 1). Lastly, by Sylvester's determinant formula, log(|I + X^tXV|) = log|I + V^{1/2}X^tXV^{1/2}|, which finishes the proof.
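Numerically, the weights λ_i can be obtained from the p × p matrix V^{1/2}X^tXV^{1/2} as in the proof above, since λ_i = µ_i/(1 + µ_i) where the µ_i are its eigenvalues. A minimal sketch (simulated X and an independent normal prior with a hypothetical σ = 0.5) that also checks Proposition 2.1 (c):

```python
import numpy as np

def bf_weights(X, V):
    """Nonzero eigenvalues of H = X (X^tX + V^{-1})^{-1} X^t, computed from the
    p x p matrix sharing H's nonzero eigenvalues (see the proof of Proposition 2.1)."""
    L = np.linalg.cholesky(V)             # any square root of V gives the same eigenvalues
    mu = np.linalg.eigvalsh(L.T @ X.T @ X @ L)   # eigenvalues of V^{1/2} X^tX V^{1/2}
    return mu / (1.0 + mu)                # lambda_i = mu_i / (1 + mu_i), all in [0, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
V = 0.25 * np.eye(5)                      # independent normal prior, sigma = 0.5
lam = bf_weights(X, V)
# Proposition 2.1 (c): -sum log(1 - lambda_i) equals log|I + X^tX V|.
print(lam, np.allclose(-np.log(1 - lam).sum(),
                       np.linalg.slogdet(np.eye(5) + X.T @ X @ V)[1]))
```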

Define the second term, which depends on y, in the expression of BFnull in (2.1) by R, i.e.,

    2 log R := −n log(1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty) = −n log(1 − y^tHy / y^ty).

By Proposition 2.1 (c),

    2 log BFnull = 2 log R + Σ_{i=1}^{p} log(1 − λ_i).

Since λ_1, ..., λ_p are constants that depend only on the data X and the prior V, to characterize the distribution of 2 log BFnull we only need to figure out the distribution of 2 log R. Instead of considering the null and the alternative separately, let us assume y ∼ MVN(Xβ, τ^{-1}I). Then the alternative model corresponds to the cases where β ≠ 0 (or follows a nondegenerate distribution), and the null model is the special case with β = 0. Define the standardized response variable z by

    z := τ^{1/2}y ∼ MVN(τ^{1/2}Xβ, I),

and rewrite the expression for 2 log R as

    2 log R = −n log(1 − z^tHz/z^tz).

The statistic 2 log R is closely related to the likelihood ratio test (LRT) statistic, which can be written as (see Chap. 2.2.1)

    2 log LR = −n log(1 − z^tH_0z/z^tz),                         (2.4)

where H_0 := X(X^tX)^{+}X^t. H_0 is the hat matrix in traditional least-squares linear regression (Chap. 7.1.4). Clearly the distributions of the two statistics are determined by the distributions of the quadratic forms z^tHz, z^tH_0z and z^tz.

Definition 2.1. A random variable Q is said to have a noncentral chi-squared distribution with 1 degree of freedom and noncentrality parameter ρ ≥ 0 if it has the same distribution as (Z + √ρ)², where Z ∼ N(0, 1). The distribution of Q is denoted by Q ∼ χ²_1(ρ). When ρ = 0, the distribution of Q reduces to the central chi-squared distribution and is denoted by Q ∼ χ²_1.

Proposition 2.2. Let the spectral decomposition of H be given by (2.3) and assume rank(X) = p′. Let z ∼ MVN(τ^{1/2}Xβ, I). Then,

(a) z^tH_0z = Σ_{i=1}^{p′} (u_i^tz)² and z^tHz = Σ_{i=1}^{p′} λ_i (u_i^tz)²;

(b) z^tz = z^tH_0z + Σ_{i=p′+1}^{n} (u_i^tz)²;

(c) for 1 ≤ i ≤ p′, (u_i^tz)² ∼ χ²_1(τ(u_i^tXβ)²), independently;

(d) Σ_{i=p′+1}^{n} (u_i^tz)² := Q_0 ∼ χ²_{n−p′} and is independent of (u_i^tz)² for 1 ≤ i ≤ p′.

Proof. By the spectral decomposition of H, z^tHz = (U^tz)^tΛ(U^tz). Since U is orthogonal, U^tz ∼ MVN(τ^{1/2}U^tXβ, I_n). The covariance matrix is diagonal, and thus the (u_i^tz)² are mutually independent by Bernstein's theorem [Lukacs and King, 1954]. Note that this result is not trivial at all (uncorrelatedness is not equivalent to independence), and it will be used frequently in the following proofs. Part (c) is then self-evident. Recall that H and X have the same rank and the same left null space by Proposition 2.1 (b). Thus for i > p′, λ_i = 0 and u_i^tX = 0. Part (d) is then proved. Part (b) is immediate from part (a) since z^tz = z^tUU^tz = Σ_{i=1}^{n} (u_i^tz)². So the only remaining task is to work out the spectral decomposition of H_0. By Proposition 7.15, H_0 has p′ eigenvalues equal to 1 and the rest are zero. We claim its spectral decomposition can be written as

    H_0 = UΛ_0U^t = Σ_{i=1}^{p′} u_i u_i^t,                      (2.5)

where Λ_0 = diag(1, ..., 1, 0, ..., 0) and U are the eigenvectors of H as defined in (2.3). To prove this, let the singular value decomposition (Chap. 7.1.2) of X be X = U_0D_0V_0^t. Then H_0 = U_0D_0D_0^{+}U_0^t, which implies that the null space of H_0 is the same as the left null space of X, and thus the same as the null space of H by Proposition 2.1 (b). Hence H_0 and H also have the same column space, and there exist orthogonal matrices E_1, E_2 such that U = U_0 diag(E_1, E_2).

Immediately we have the following corollary under the null.

Corollary 2.3. If z ∼ MVN(0, I_n), then z^tH_0z ∼ χ²_p and z^tHz = Σ_{i=1}^{p} λ_i(u_i^tz)², where the (u_i^tz)² are independent χ²_1 random variables.

Both H and H_0 are positive semi-definite matrices. By their spectral decompositions given in Proposition 2.2 (a), we have the following lemma.

Lemma 2.4. For any vector z ∈ R^n, 0 ≤ z^tHz ≤ z^tH_0z.

2.1.2 Asymptotic Distributions of log BFnull

In this section, it will be shown that, loosely speaking, when the sample size n is sufficiently large, 2 log R has approximately the same distribution as z^tHz, provided z^tHz does not grow or grows only slowly. To explain the ideas further, consider a sequence of datasets with sample size n tending to infinity. For simplicity (and to avoid confusion), we drop the subscript n from z, X, H, H_0, but keep in mind that they always depend on n. The limiting distribution of 2 log R_n is our interest. It might not exist, since z^tHz may grow very quickly. For example, consider the simplest (though not realistic) case with p = 1, β_n = 1 and X = 1. Then (u_1^tz)² grows at rate n, and thus 2 log R_n would eventually blow up. This is expected, since the evidence supporting the alternative model accumulates as n → ∞. Thus, in order to discuss the asymptotic distribution of 2 log R_n, we need some constraint on the growth rate of z^tHz. For this purpose, we assume z^tH_0z = O_p(1), where O_p(1) means stochastic boundedness. Later we will see that this assumption is indeed very convenient. For more explanation of the stochastic Big-O and Little-O notations, see Chap. 7.3. By Lemma 2.4, we have the following.

Lemma 2.5. If z^tH_0z = O_p(1), then z^tHz = O_p(1).

The main result is now given below. Without loss of generality, we assume X always has full rank. The results can be easily extended to the rank-deficient case but are of very little practical interest.

Proposition 2.6. If z^tH_0z = O_p(1), then

    2 log R_n = −n log(1 − z^tHz/z^tz) = z^tHz + o_p(1).

Proof. Assuming X has full rank, by Proposition 2.2 (b), z^tz = Q_0 + z^tH_0z = Q_0 + O_p(1), where Q_0 = Σ_{i=p+1}^{n} (u_i^tz)² follows a chi-squared distribution with n − p degrees of freedom. By direct calculation we can show E(Q_0/n) → 1 and Var(Q_0/n) → 0. Hence,

    z^tz/n = Q_0/n + O_p(1)/n = Q_0/n + o_p(1) → 1 in probability.

By the continuous mapping theorem [van der Vaart, 2000, Chap. 2.1], n/z^tz = 1 + o_p(1). Although z^tz and z^tHz are correlated, by Slutsky's theorem [van der Vaart, 2000, Chap. 2.1], we can write

    z^tHz/z^tz = (z^tHz/n)(1 + o_p(1)) = O_p(1)/n = O_p(1/n),
    1 − z^tHz/z^tz = 1 − (z^tHz/n)(1 + o_p(1)) = 1 − z^tHz/n + o_p(1/n),

since z^tHz = O_p(1). Another way to verify this is to use the fact that z^tHz/z^tz < z^tHz/Q_0; since z^tHz and Q_0 are independent, z^tHz/Q_0 can be shown to be O_p(1/n). By Taylor expansion with Peano's form of the remainder,

    −n log(1 − z^tHz/z^tz) = z^tHz + o_p(1) + n·o_p(1/n) = z^tHz + o_p(1).

Piecing together Proposition 2.1, Proposition 2.2 and Proposition 2.6, we arrive at the main result of this section.

40 def. t 2 t Theorem 2.7 (Asymptotic distribution of BFnull). Let Qi = (uiz) . If z H0z =

Op(1), then

p X ind. 2 t 2 2 log BFnull = (λiQi + log(1 − λi)) + op(1),Qi ∼ χ1((τ(uiXβ) ). i=1

t At last, we need figure out when the condition z H0z = Op(1) holds. It is

t 2 clearly true under the null since z H0z ∼ χp. Thus the following corollary is immediate.

t 2 Corollary 2.8. Let Qi = (uiz) . Under the null,

p X i.i.d 2 2 log BFnull = (λiQi + log(1 − λi)) + op(1),Qi ∼ χ1. i=1

This result can be further generalized to the case of non-normal errors.

Corollary 2.9. Consider the null model y = ε, where 1, . . . , n are i.i.d. random

−1 4 t 2 variables with E[i] = 0, Var(i) = τ , and E[i ] < ∞. Let Qi = (uiz) =

t 2 τ(uiε) . Then

p X 2 log BFnull = (λiQi + log(1 − λi)) + op(1). i=1

t Proof. Clearly z Hz is still Op(1). The proof of Proposition 2.6 tells us that n P P t 2 we only need to check Q0/n → 1 where Q0 = (uiz) , since the rest follows i=p+1 from Slutsky’s Theorem and Taylor’s expansion. Note that we don’t need the

t independence between z Hz and Q0 which is not true when the errors are not

41 1/2 t normally distributed. Since E[τ uiε] = 0, we have

n X 1/2 t E[Q0] = Var(τ uiε) = n − p, i=p+1 n X 1/2 t 4 Var(Q0) = E[(τ uiε) ] = O(2n). i=p+1

P Hence we conclude Q0/n → 1 and complete the proof.

Under the alternative model, we have to limit the growth rate of the noncen-

t 2 trality parameter of Qi, i.e., τ(uiXβ) , since

p t 2 X t 2 z H0z ∼ χp(τ (uiXβ) ) i=1

2 where χp(ρ) denotes a noncentral chi-squared random variable with degree of freedom p and noncentrality parameter ρ. We borrow the idea of local alternatives √ from frequentists’ theory and assume βn = β0/ nτ. Then,

p p n X 1 X 1 X τ (utXβ)2 = (utXβ )2 = (utXβ )2 i n i 0 n i 0 i=1 i=1 i=1 1 = (U tXβ )t(U tXβ ) n 0 0 1 = βt XtXβ . n 0 0

β0 is now fixed but it may still blow up if the entries of X grow with n. To solve this, we need additional constraint. Two practical choices would be that

XtX/n converges or that X is bounded entrywise. The second condition is

easy to check. To see that the first condition would work, recall that a weakly

convergent sequence of random variables is always stochastically bounded [Shao,

2003, Chap. 1.6 (127)]. To summarize,

t 2 Corollary 2.10. Let Qi = (uiz) . Under a sequence of local alternatives with

42 √ t β = β0/ nτ, either if X X/n converges of X is bounded entrywise,

p X ind. 1 2 log BF = (λ Q + log(1 − λ )) + o (1),Q ∼ χ2( (utXβ )2). null i i i p i 1 n i 0 i=1

2.1.3 Asymptotic Results in Presence of Confounding

Covariates

According to Chap. 7.2.3, if an n × q matrix W , representing confounding co- variates, need controlling for, it suffices to regress it out from both y and X and compute BFnull with the residuals. The resulting Bayes factor is exactly the Bayes factor for the full model (see (7.13)). It is tempting to jump to the conclusion that the asymptotic distribution of BFnull remains the same, which is true but requires additional arguments since the distribution of z (or y) has changed.

Some notations need to be redefined. Let P = I − W (W tW )−1W t be the projection matrix that maps a vector to its residuals after regressing out W . Let

L be the matrix representing the covariates of interest and redefine X def=. PL.

Let H and H0 still be as defined in (2.2) and (2.4). If we compare the full model y = W a + Lβ + ε against the null model y = W a + ε, then the expression for

2 log R becomes (see Chap. 7.2.3 for more details)

2 log R = −n log(1 − ztHz/ztP z). (2.6) where z = τ 1/2y. We may replace z by z˜ = τ 1/2P y, and 2 log R would have exactly the same expression as before since P is idempotent. However, we use form (2.6) as it is more convenient. Recall that we have defined the spectral de- p P t t t t composition H = λiuiui in (2.3). The distributions of z, z P z, z Hz, z H0z i=1 are given in the following lemma.

43 Lemma 2.11. Assume y ∼ MVN(W a + Lβ, τ −1I), then

(a) z ∼ MVN(τ 1/2Xβ, P ) ;

p p t P t 2 t P t 2 t 2 2 t 2 (b) z Hz = λi(uiz) and z H0z = (uiz) where (uiz) ∼ χ1(τ(uiXβ) ); i=1 i=1

p t P t 2 2 (c) z P z = (uiz) +Q0 where Q0 is some random variable that follows χn−p−q. i=1

Proof. Part (a) and part (b) follow from the fact that PL = X and PW = 0. To

prove part (c), we need figure out the spectral decomposition for matrix P . Since

P is a projection matrix, it has n − q unit eigenvalues and q zero eigenvalues. For

1 ≤ i ≤ p, ui must satisfy Hui = λiui, which implies P ui = ui since PX = X.

Hence, ui is also an eigenvector of P that corresponds to eigenvalue 1. Similarly, if P v = 0 for some vector v, then v is also an eigenvector for H with zero

eigenvalue. As a result, we may reorder and rotate up+1,..., un so that

n−q X t P = uiui, i=1

n−q t P t 2 2 and thus z P z has the given decomposition with Q0 = (uiz) ∼ χn−p−q. i=p+1

P Inspection shows that Proposition 2.6 still holds because Q0/n → 1. Corol-

t lary 2.8 and corollary 2.10 hold immediately since the distribution of z H0z hasn’t changed. Since the results with the existence of confounding covariates are con-

sistent with the simpler model y = Xβ + ε, in the remaining sections of this

chapter I only focus on the latter model unless otherwise stated. Because usually

the intercept term, 1, is treated as a confounding covariate, by using the simpler

model we are actually assuming both X and y are centered.

44 2.2 Properties of the Bayes Factor and Its

P-value

Using the asymptotic distribution of log BFnull, we can calculate an asymptotic

p-value associated with BFnull, which we denote by PBF, and study the properties

of BFnull and PBF theoretically. In Chap. 2.2.3, 2.2.4 and 2.2.5, we simplify the discussion by omitting the op(1) term in Theorem 2.7. We may safely do so because when the error variance τ −1 is known, we have exactly (see Chap. 7.2)

p X ind. 2 t 2 2 log BFnull = (λiQi + log(1 − λi)) ,Qi ∼ χ1(τ(uiXβ) ). i=1

For a sufficiently large sample size, τ can be estimated very accurately in the sense

that its posterior distribution contracts to the true value.

2.2.1 Comparison with the P-values of the Frequentists’

Tests

We compare PBF with two non-Bayesian p-values, the p-value of the F test, de-

1/2 noted by PF, and the p-value of the LRT, denoted by PLR. We still let z = τ y denote the standardized response variable. In frequentists’ language, usually yty

t t is denoted by SST (total sum of squares), y y − y H0y is denoted by SSE (sum

t of squares of errors) and y H0y is denoted by SSReg (sum of squares due to re- gression). Assuming both y and X have been centered, the F test statistic, by definition is

t SSReg/p z H0z/p F = = t ∼ F(p,n−p−1). (2.7) SSE/(n − p − 1) z (I − H0)z/(n − p − 1)

45 The test statistic of the LRT is derived as follows,

sup f(y|τ, β) 2 log LR = 2 log τ,β supτ f(y|τ, β = 0)

t t = −n log(1 − y H0y/y y) (2.8) t t = −n log(1 − z H0z/z z)

X D 2 = Qi → χp. i=1

This is a special case of Wilks’s [1938] theorem. Hence, F test, which is exact, and LRT are asymptotically equivalent. The Bayes factor, as an averaged (or penalized) likelihood ratio, has a very similar form to the LRT statistic.

A special case is simple linear regression. Recall from the result of Chap. 2.1.2 that when p = 1,

λ1Q1 2 log BFnull = −n log(1 − ) + log(1 − λ1) ≈ λ1Q1 + log(1 − λ1). Q1 + Q0

n−1 P t 2 2 t 2 2 where Q0 = (uiz) ∼ χn−2 and Q1 = (u1z) . Under the null, Q1 ∼ χ1. We i=2 make two observations.

Proposition 2.12. Let PBF be the asymptotic p-value calculated by Corollary 2.8,

PF be the p-value of the F-test and PLR be the p-value of the likelihood ratio test. If p = 1, then,

(a) PBF is asymptotically equivalent to PLR;

(b) PF is the true p-value for the Bayes factor.

Proof. PBF is asymptotically equal to PLR since

2 log BFnull = λ1(2 log LR) + log(1 − λ1) + op(1).

46 (BFnull is asymptotically a monotone function of LR.) The second statement is

because BFnull is a monotone function of F .

 λ  2 log BF = −n log 1 − 1 + log(1 − λ ). null 1 + (n − 2)/F 1

2 To compare the three p-values, we fix λ1 = 0.8, n = 100, Q0 = E[χ98] = 98

and try different values for Q1. The result is shown in Figure 2.1. It can be seen

that for a limited sample size n = 100, PLR and PBF are very close to each other and only deviate from the truth when the p-value is extremely small. Later in

Chap. 2.3.1 we will see how to correct the test statistics so that the asymptotic

p-values can have better calibration.

2.2.2 Independent Normal Prior and Zellner’s g-prior

Now consider two special cases of the prior covariance matrix for β, V . The first

is called the independent normal prior, which assumes βi’s to be i.i.d. normal variables a priori. The second choice is the well-known Zellner’s g-prior [Zellner,

1986].

Proposition 2.13. Consider the independent normal prior, V = σ2I, and Zell- ner’s g-prior, V = g(XtX)−1 with g > 0.

(a) Under both the independent normal prior and the g-prior, the columns of U

(the eigenvectors of H) are the left-singular vectors of X.

2 2 −2 (b) Under the independent normal prior, λi = di /(di + σ ) where di is the i-th

singular value of X (|d1| ≥ · · · ≥ |dp| ≥ 0); under the g-prior, λi = g/(g + 1) for 1 ≤ i ≤ p.

47 ●● ●●● ● ●●●● ● 0 ●● ●● ●●● ● ●●●● 1.0 ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●●

0.8 ● ● −2 ●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ) ● ● ●●● ●● ●●● ●● ●●● ●● ●●● R ● ●● ●●● ● ●●● −4 ● 0.6 ●● ●●●

● L ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● P ● R ● ●● ●●● ●● ●●● ● ●●● L ● ●● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ● ●●●

P ● ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● o ●●● ●● ●●●

−6 ● l ●● 0.4 ● ●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● −8 ●●● 0.2 ● ●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●● ●●●● ●●● ●●● ●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●●● ●●● ●●●● ●●● 0.0 −10 0.0 0.2 0.4 0.6 0.8 1.0 −10 −8 −6 −4 −2 0 ( ) PF log10 PF

●● ●●● ● ●●● ● 0 ●● ●● ●●● ● ●●● 1.0 ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● 0.8 ● ●● ● ●●● ● −2 ●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ● ) ●● ●● ●●● ●● ●●● ● ●●●

F ● ●● ●●● ● ●●● ● ●●● 0.6 ●● ●●● ●● B ●● ● ●●● ● ●●● ● −4 ●● ●● ●●● ●● ●●● F ● ●● P ● ●● ●●● ●● ●●● ● ●●● B ● ● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ●● ●●● P ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●●

o ● ● ●●● ●● ●●● l ●● 0.4 ● ● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● −6 ● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●●● ●● ●●●● 0.2 ●● ●●● ●● ●●●● ●●● ●●●● ●● ●●●● ●● ●●●● ●● ●●●● ●●● ●●●● ●●● −8 ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●●● ●●●● ●●● ●●●● ●●● 0.0 0.0 0.2 0.4 0.6 0.8 1.0 −10 −8 −6 −4 −2 0 ( ) PF log10 PF

●● ●●● ● ●●●● ● 0 ●● ●● ●●●● ● ●●●● 1.0 ● ●●● ● ●●●● ●● ●●●● ● ●●●● ● ●●●● ●● ●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●●

0.8 ● ● −2 ●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ) ● ● ●●● ●● ●●● ●● ●●● ●● ●●● R ● ●● ●●● ● ●●● −4 ● 0.6 ●● ●●●

● L ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● P ● R ● ●● ●●● ●● ●●● ● ●●● L ● ●● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ● ●●●

P ● ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● o ●●● ●● ●●●

−6 ● l ●● 0.4 ● ●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● −8 ●●● 0.2 ● ●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● 0.0 −10 0.0 0.2 0.4 0.6 0.8 1.0 −8 −6 −4 −2 0 ( ) PBF log10 PBF

Figure 2.1: Comparisons between PBF, PF, PLR for p = 1. We use n = 100, fix SSE = 98 (SSE: sum of squares of errors) and try different values for SSReg (sum of squares due to regression).

48 t 2 ind. (c) Under both the independent normal prior and the g-prior, Qi = (uiz) ∼

2 2 t 2 χ1(τdi (vi β) ) where vi is the i-th right-singular vector of X.

t Proof. Let the singular value decomposition of X be X = U0D0V0 . Then under

2 −2 −1 t the independent normal prior, H = U0D0(D0 +σ I) D0U0. Under the g-prior, g H = U U t. Part (a) and (b) then follows. To prove part (c), notice that g + 1 0 0

t t t t uiXβ = uiU0D0V0 β = divi β.

Under the independent normal prior, PBF differs from PLR since it assigns different weights to different directions (of β). The direction of the first (principal component) loading vector of the data matrix X, which corresponds to singular

value d1, has the biggest weight. In contrast, Under the g-prior, PBF acts just like

PLR since it also treats every direction equally.

Under the g-prior, the expression for BFnull can be simplified to

 g ytH y −n/2 BF = (g + 1)−p/2 1 − 0 = (g + 1)(n−p)/2 1 + g(1 − r2)−n/2 , null g + 1 yty (2.9)

where r2 is the coefficient of determination in traditional linear regression.

2.2.3 Behaviour of the Bayes Factor and Three Paradoxes

Using the asymptotic approximation,

p X t 2 2 log BFnull ≈ λiQi + log(1 − λi),Qi = (uiz) , (2.10) i=1

49 it is relatively easy to understand the genesis of three famous paradoxes of the

Bayes factor.

Jeffreys-Lindley’s paradox For any fixed significance level of α, one can al-

ways find a sample size such that an effect is statistically significant (i.e. the

p-value is less than α) however the posterior probability of the null model is

greater than 1 − α [Lindley, 1957, Naaman, 2016]. Consider our model with the independent normal prior. We may fix the values of Qi so that PLR is a constant

less than α and then let n goes to infinity. By Proposition 2.13, λ1, . . . , λp all go

to 1 and thus BFnull ↓ 0. Hence Bayesians would accept the null model.

Bartlett’s paradox When lacking prior information, people prefer to use the

noninformative prior, which assumes a flat shape of the parameter’s prior distri-

bution. However, using a noninformative prior can unintentionally favor the null

model [Bartlett, 1957, Liang et al., 2008]. In our model, the noninformative prior

should let the prior variance of β go to infinity. For the independent normal prior,

this means to let σ2 ↑ ∞; for the g-prior, it means to let g ↑ ∞. But in both cases,

assuming n is fixed, λ1, . . . , λp all go to 1 and thus BFnull ↓ 0.

Information paradox Under the g-prior, BFnull can be expressed using only g, n, p, and r2 (2.9). Suppose g, n, p are fixed and let r2 ↑ 1. We then have

(n−p)/2 2 BFnull ↑ (g + 1) . This is undesirable since as r ↑ 1, the evidence for the

alternative model becomes overwhelming and decisive. However, BFnull, which is assumed to measure the evidence, converges to a finite constant. Since r2 ↑ 1

represents the accumulation of the information, this paradox is called information

paradox [Liang et al., 2008].

50 In fact, none of the three paradoxes is a true paradox. But by investigating these paradoxes, we may better understand the nature of the Bayes factor. I

first explain why these paradoxes are indeed expected and how to solve them, and then perform a quantitative analysis to investigate whether they should be worried about for a finite sample size.

Jeffreys-Lindley’s paradox neglects the fact that as n ↑ ∞, the p-value of the LRT or the F test would also go to zero. It is impractical to assume that the observed significance level remains unchanged as n grows. Nevertheless, it is likely that when n is large, frequentists reject the null while Bayesians accept it.

But this is simply because the evidence is not strong enough, considering the large sample size. This is not paradoxical at all. It just reveals the different properties of the p-value and the Bayes factor.

The phenomenon of Bartlett’s paradox is expected as well. By letting σ2 or g go to infinity, we are actually assuming that the effect tends to be very large, which of course cannot be supported by the data. Hence the marginal likelihood of the alternative model, which is averaged over this noninformative prior, decreases and eventually becomes less than the marginal likelihood of the null model. However,

Bartlett’s paradox is a truly important observation. It reveals that there is no appropriate choice for σ2 when we have no prior information! To overcome this problem, people often put a hyperprior on σ2, which is referred to as the Bayesian random effect model (or Bayesian linear mixed model [Hobert and Casella, 1996]

). This is also the standard approach in variable selection, which will be discussed in the next chapter.

The information paradox, which is truly undesirable, arises from the nature of the g-prior. Bayesians usually prefer to use a prior that is independent of data.

But using the g-prior with a fixed g violates this rule. When n is fixed, the g-prior

51 is well-motivated. It is sometimes reasonable to assume that the prior precision

matrix of β is proportional to the covariance matrix of the data XtX since a

larger variance can be thought of as “more information”. Besides, the g-prior

is computationally convenient. However, when n grows and g is fixed, the prior covariance of β becomes smaller and smaller, and eventually vanishes! Just like

Bartlett’s paradox, one solution is to put a hyperprior on g ( see Liang et al.

[2008]).

Next let’s try to quantitatively characterize the behaviour pattern of the Bayes factor. We focus on the independent normal prior.

Proposition 2.14. Assume BFnull has the expression given in (2.10) and let

def. t 2 2 ρi = τ(uiXβ) so that Qi ∼ χ1(ρi). Then,

i.i.d. 2 (a) under the null, Qi ∼ χ1 and

p 1 X [log BF | β = 0] = (λ + log(1 − λ )), E null 2 i i i=1

E[BFnull | β = 0] = 1;

ind. 2 (b) under a fixed alternative, Qi ∼ χ1(ρi) and

p 1 X [log BF | β] = (λ (1 + ρ ) + log(1 − λ )), E null 2 i i i i=1 p Y λiρi [BF | β] = exp( ); E null 2(1 − λ ) i=1 i

52 −1 2 i.i.d. 2 (c) under the alternative β ∼ MVN(0, τ σ I), (1 − λi)Qi ∼ χ1 and

p 1 X λi [log BF | σ2] = ( + log(1 − λ )), E null 2 1 − λ i i=1 i   ∞ if λ1 ≥ 0.5 2  [BFnull | σ ] = p E Q 1 − λi  √ if λ1 < 0.5 i=1 1 − 2λi

Proof. By the independence between Q1,...,Qp, it is sufficient to calculate the 1 1 expectation of a single component (λ Q +log(1−λ )) or exp[ (λ Q +log(1−λ ))]. 2 i i i 2 i i i Part (a) and (b) then follow from routine calculations. For part (c), notice that

2 2 2 2 2 −1 by Proposition 2.13 (c), Qi/(1 + σ di ) ∼ χ1 and 1 + σ di = (1 − λi) .

Remark 2.14.1. In fact, the expectation of the null-based Bayes factor (with proper priors) under the null model is always 1, regardless of the models of the

problem. This can be checked from the definition of the Bayes factor.

E[log BFnull] Remark 2.14.2. By Jensen’s inequality, we always have e ≤ E[BFnull].

Since BFnull has a heavy-tailed distribution, the expected values of log BFnull ac- tually provide much more insight into the behaviour of BFnull than the expected values of BFnull.

2 Remark 2.14.3. As λ1 ↑ 1 (either the sample size n ↑ ∞ or σ ↑ ∞), the expected value of log BFnull goes to −∞ under setting (a) and goes to ∞ under setting (c).

2 Under setting (b), the limit depends on the growth rate of ρ1. Suppose σ is fixed,

2 ρ1 = O(n) and d1 = O(n), we still have the expected value of log BFnull goes to ∞ since log(1 − λ1) = O(log n).

A key message in the last remark is that the term log(1 − λ1) decreases quite slowly. In the usual case where we assume the observations are homogeneous, Q1 grows at a faster rate as n increases and thus the Bayes factor is consistent (the posterior probability of the alternative model goes to 1). In fact, log(1 − λ1) also

53 decreases very slowly as σ increases. It suffices to consider only p = 1 and thus

2 log BFnull = λ1Q1 + log(1 − λ1).

2 2 ∂2 log BFnull Q1d1 d1 2 = 2 2 2 − 2 2 ∂σ (σ d1 + 1) σ d1 + 1

2 2 2 Therefore, when σ ↓ 0, 2 log BFnull grows at rate (Q − 1)d1; when σ ↑ ∞,

−2 2 log BFnull decreases at a vanishing rate O(σ ).

2.2.4 Behaviour of the P-value Associated with the Bayes

Factor

Now consider the behaviour of PBF. Under the null, it follows a uniform distribu- tion (asymptotically), just like any other valid p-values. The calibration of it for a

finite sample size will be discussed later by simulation (Chap. 2.3.4). What would be more interesting is the power performance of PBF, and in particular, how it differs from PLR.

The power of the test given a fixed β is

p X 2 t 2 Power(PBF) = P( λiQi > Cα) ,Qi ∼ χ1(τ(uiXβ) ), (2.11) i=1

p P 2 where Cα is the critical value calculated from the null distribution, λiχ1. For i=1 comparison, the power of PLR is given by

p X 2 2 t 2 Power(PLR) = P( Qi > χp,1−α) ,Qi ∼ χ1(τ(uiXβ) ), (2.12) i=1

2 2 where χp,1−α denotes the (1−α) quantile of χp distribution. It should be noted that in (2.11), λ1, . . . , λp appear on both sides of the inequality inside the probability

54 term. Hence the scaling of λ1, . . . , λp does not affect the power. As a result, when p = 1, λ1 can be dropped from both sides and thus PBF and PLR have the same power.

Proposition 2.15. For simple linear regression, the power of PBF is equal to the power of PLR.

When p = 2 and λ1 6= λ2, the two p-values (PBF and PLR) have different power. This difference is illustrated by the following example. Choose τ = 1, and let the thin singular value decomposition of X be

    10 0 1 0     X = U     . 0 2 0 1

t 2 2 2 2 Denote β = (β1, β2) . By Proposition 2.13, Q1 ∼ χ1(100β1 ) and Q2 ∼ χ1(4β2 ).

Consider the independent normal prior with σ = 0.2. We can calculate λ1 = 0.8 and λ2 = 0.14. Then we use Monte Carlo sampling (10, 000 samples per β) to compute the power at α = 0.05. The result is shown in Figure 2.2. On the horizontal direction, PBF is better since, to achieve a given power, it requires a smaller value of |β1| than PLR. On the vertical direction, PLR is better. Note that both PLR and PBF have the largest power at direction (1, 0) (the first right-singular vector of X), because the data is most informative at that direction. However, this bias is exaggerated for PBF due to the weights λ1, λ2.

55 Figure 2.2: Power comparison between PBF and PLR for p = 2. The singular value of the design matrix X is set to 10 and 2, and we use σ = 0.2. The red contours represent the power of PLR and the blue contours are the power of PBF. We draw the contours at power = 0.1, 0.5, 0.8, 0.99.

2.2.5 More about Simple Linear Regression

For simple linear regression, it is possible to derive more quantitative results about the behaviour of BFnull and PBF. Recall that when p = 1,

2 log BFnull ≈ λ1Q1 + log(1 − λ1),

where Q1 ≈ 2 log LR. If Q1 is fixed, we observe that BFnull is maximized at some value of λ1 and the corresponding prior can be calculated analytically. Since the prior parameter V is now a 1 × 1 matrix, we may write

V = [σ2].

56 4.0

3.5 (BF) 3.0 10 g o l 2.5

n = 200 2.0 n = 500 n = 1000

0.0 0.5 1.0 1.5 2.0

σb

Figure 2.3: How BFnull changes as σ ranges from 0.05 to 2. We assume X has t unit variance and thus X X = n. Fix Q1 = 24, which corresponds to a p-value equal to 10−6.

Proposition 2.16. Consider simple linear regression. Assuming Q1 is given, then

max 2 log BFnull = Q1 − 1 − log Q1; σ2

Q1 − 1 arg max BFnull = n . σ2 P 2 xi i=1

Proof. By differentiating log BFnull w.r.t λ1, we get

Q1 − 1 = arg max BFnull. Q1 λ1

P 2 P 2 −2 Since λ1 = xi /( xi + σ ), we obtain the result.

Figure 2.3 shows how BFnull changes as σ ranges from 0.05 to 2 with Q1 = 24

−6 (PBF = 10 ).

Using this result, we can also quantify the numerical relationship between

PBF (which is equal to PLR by Proposition 2.12) and BFnull. Statisticians have

57 been very interested in this relationship because the Bayes factor can be used to calculate the posterior probability of the alternative model, which is often compared with the p-value. Certainly this numerical relationship depends on the model and the particular hypothesis testing method, but in most cases, the p-value is numerically “more significant” than the Bayes factor. For example, see

Good [1992], Berger and Sellke [1987] and Sellke et al. [2001], among others.

Proposition 2.17. For simple linear regression, let P = PBF = PLR. For suffi- ciently large Q1,

− log P ≈ max log BFnull + log Q1 + 0.73 σ2

Proof. From the last proposition, we have max 2 log BFnull = Q1 −1−log Q1. Thus we only need to establish the relationship between Q1 and P . Since Q1 is equal to the test statistic of the LRT,

Z ∞ 1 P = √ x−1/2e−x/2dx Q1 2π 2 Z ∞ x−1 −1/2 −Q1/2 −1/2 −x/2 = √ Q1 e − √ x e dx. 2π Q1 2π

Since 1/x ≤ 1/Q1 for x ≥ Q1, we have

2 1 2 −1/2 −Q1/2 −1/2 −Q1/2 √ Q1 e − P ≤ P ≤ √ Q1 e . 2π Q1 2π

2 Let f 2 be the density function of χ . After rearrangement we obtain χ1 1

2Q1 f 2 (Q ) ≤ P ≤ 2f 2 (Q ). χ1 1 χ1 1 Q1 + 1

Clearly, for sufficiently large Q , P ≈ 2f 2 (Q ). The result then follows by doing 1 χ1 1

58 some algebra. The constant in the formula corresponds to

1 (1 + log π − log 2) ≈ 0.73. 2

We may only use P ≤ 2f 2 (Q ) to derive the inequality χ1 1

− log P ≥ max log BFnull + log Q1 + 0.73. σ2

−1 This implies that P is always greater than BFnull. Besides, in practice, the value of σ2, which should be chosen before the testing, is usually not the “optimal” value

that maximizes the Bayes factor. Hence a p-value equal to 10−7 often corresponds to a Bayes factor around 105.

2.3 Computation of the P-values Associated

with Bayes Factors

By Theorem 2.7, we can calculate the asymptotic p-value for the Bayes factor.

However, there are two challenges. First, since this p-value, PBF, is asymptotic, can we still trust it for a finite sample size? In Chap. 2.3.1 a correction method

is introduced to improve the calibration of PBF when n is only moderate. Second, the numerical computation requires us to evaluate the distribution function of a

2 linear combination of χ1 random variables, which has been a difficult problem for a long time. We have implemented a new method, proposed by Bausch [Bausch,

2 2013], that has only polynomial complexity in the number of χ1 random variables.

59 2.3.1 Bartlett-type Correction

Denote the test statistic for calculating PBF by 2 log R which has asymptotically

2 the same distribution as a weighted sum of independent χ1 random variables with weights λ1, . . . , λp. For a moderate sample size, a correction method for 2 log R to

improve the calibration of PBF can be developed. We borrow the idea of Bartlett- type correction to the LRT statistic, which was first noticed by Bartlett [1937]

and later generalized by Box [1949] and Lawley [1956]. By Wilks’ theorem, the

2 likelihood ratio test statistic, denoted by Λ, converges weakly to a χp-distributed random variable at rate o(1) [Wilks, 1938]. For a small sample size, the calibration

of this p-value can be very poor. Suppose we have an estimator, E0[Λ], that esti-

−3/2 mates the expected value of Λ under the null with error as small as Op(n ). The

2 corrected test statistic, pΛ/ E0[Λ], converges weakly to a χp-distributed random variable at rate O(n−2), under very general conditions [Bickel and Ghosh, 1990].

This strategy can be used to introduce a heuristic correction for 2 log R.

Consider the general model with confounding covariates. This is because q

(the number of confounding covariates) has to appear in the correction term.

By Lemma 2.11, under the null, 2 log R can be expressed using independent chi-

squared random variables

p p P P λiQi λiQi i=1 i=1 2 log R = −n log(1 − p ) = n log(1 + p ), P P Q0 + Qi Q0 + (1 − λi)Qi i=1 i=1

2 Q0 ∼ χn−p−q,

2 Qi ∼ χ1 for 1 ≤ i ≤ p.

P Asymptotically 2 log R has the same distribution as i λiQi, of which the expec- P tation is i λi. To apply Bartlett-type correction, we need find a higher-order

60 approximation for E0[2 log R]. Define

p def. X A = λiQi i=1 p def. X B = (1 − λi)Qi i=1

Note that A and B are not independent. By Taylor expansion,

2 3 nA nA nA −2 2 log R = − 2 + 3 + oP (n ). Q0 + B 2(Q0 + B) 3(Q0 + B)

P Since B/Q0 → 0, we can apply Taylor expansion again,

2 nA nA B B −2 = (1 − + 2 ) + oP (n ), Q0 + B Q0 Q0 Q0 2 2 nA nA B −2 2 = − 2 (1 − 2 ) + oP (n )), 2(Q0 + B) 2Q0 Q0 3 3 nA nA −2 3 = 3 + oP (n ). 3(Q0 + B) 3Q0

We group the terms according to their orders and direct calculation gives

def. nA nα1 γ1 = E[ ] = , Q0 β1 2 def. n A −n(2α3 + α1α2) γ2 = E[ 2 (−AB − )] = , Q0 2 β1β2 3 def. n 2 2 A n 8 1 3 γ3 = E[ 3 (AB + A B + )] = [ α5 + 2α1α4 + α1 + (4 + p)α1α6 + (8 + 2p)α7], Q0 3 β1β2β3 3 3

where by abuse of notations,

p p p P P 1 P 1 βi = n − p − q − 2i , α1 = λi , α2 = (1 − λi) , α3 = λi(1 − λi) i=1 i=1 2 i=1 2 p p p p P 2 P 3 P P α4 = λi , , α5 = λi , α6 = (1 − λi) , α7 = λi(1 − λi). i=1 i=1 i=1 i=1 (2.13)

61 Combining them, we have, for k = 1, 2, 3,

k X −k+1 E0[2 log R] = γi + o(n ). (2.14) i=1

Hence, E0[2 log R] can be estimated by

k ˆ def. X E(k)[2 log R] = γi. i=1

ˆ In addition to 2 log R, we now have obtained three corrected test statistics, α1(2 log R)/E(k)[2 log R] (for k = 1, 2, 3). Similarly, the LRT statistic could be corrected by using

np np(p + 2) np(p + 2)(p + 4) −2 Eˆ[2 log LR] = − + + o(n ), (2.15) β1 2β1β2 3β1β2β3

where β1, β2, β3 are as defined in (2.13).

2.3.2 Bausch’s Method

The current most popular method is Davies’ method, which relies on the numerical inversion of the characteristic function. See Chap. 7.4.1 for a brief introduction. It is convenient, not difficult to implement, but involves the numerical integration of a highly oscillatory integrand which might produce only limited accuracy [Bausch,

2013].

Bausch [2013] proposed to calculate the distribution function of a linear com-

2 bination of independent χ1 random variables by Taylor expansion. His method

2 has only (at most) polynomial complexity in the number of χ1 random variables. Furthermore, the error bound can be explicitly evaluated and arbitrary accuracy can be obtained. I make a brief introduction of his algorithm in the current section and describe in details how we implemented it in C++ in the next section.

62 2 Let X1,...,Xp be i.i.d. χ1 random variables. We are interested in the distri- bution function of the weighted sum,

p ¯ X Q = λiXi, (2.16) i=1

where the weights λ1 ≥ · · · ≥ λp > 0. For the time being, let’s assume p is even and hence we may rewrite (2.16) as

p/2 ¯ X def. Q = Yk,Yk = λ2k−1X2k−1 + λ2kX2k. k=1

Bausch noticed that, by Kummer’s second transformation [Abramowitz and Ste- gun, 1964, Chap. 13],

1 λ1 + λ2 λ1 − λ2 fY1 (y) = 1/2 exp(− y)I0( y), (4λ1λ2) 4λ1λ2 4λ1λ2

where I0 is the modified Bessel function of the first kind with degree 0 [Abramowitz

and Stegun, 1964, Chap. 9]. The function I0 can be computed via Taylor expan- sion,

∞ X (x/2)2i I (x) = . 0 (i!)2 i=0

Hence, we may write

λ2k−1 − λ2k I0( y) = Tk(y) + Rk(y), 4λ2k−1λ2k

where Tk(y) is a finite power series and Rk(y) is the corresponding remainder term.

Multiply both Tk and Rk by the constant and the exponential term, and express

63 fYk by

˜ ˜ fYk (y) = Tk(y) + Rk(y), ˜ 1 λ2k−1 + λ2k Tk(y) = 1/2 exp(− y)Tk(y), (4λ2k−1λ2k) 4λ2k−1λ2k ˜ 1 λ2k−1 + λ2k Rk(y) = 1/2 exp(− y)Rk(y). (4λ2k−1λ2k) 4λ2k−1λ2k

A key observation made by Bausch is that the convolution of functions like e−βxxn is a “closed” operation:

n Z c n! X (βc)i e−βxxndx = [1 − e−βc ]. βn+1 i! 0 i=0

The integrand and the integral have the same form and the same highest degree. ˜ ˜ Therefore, we can perform the convolution of T1(y),..., Tp/2(y) algebraically. Such algebraic operations can be coded without much difficulty using mathematical programming language like Mathematica and Maple. However, we coded in C++ and hence had to define our own math objects. The complexity of this algorithm is polynomial in p.

To make the algorithm useful in practice, we need find an error bound. Let ∗ denote the convolution. By the associativity of convolution,

Z c ¯ ˜ ˜ ˜ ˜ P(Q ≤ c) = (T1 + R1) ∗ · · · ∗ (Tp/2 + Rp/2) 0 c Z    n 1−n  X ˜n1 ˜1−n1 ˜ p/2 ˜ p/2 = T1 · R1 ∗ · · · ∗ T1 · R1 . 0 nk∈{0,1}

By Young’s inequality for convolutions [Hardy et al., 1952, Chap. 4],

p/2    n 1−n  ˜n1 ˜1−n1 ˜ p/2 ˜ p/2 Y ˜nk ˜1−nk || T1 · R1 ∗ · · · ∗ T1 · R1 ||1 ≤ ||Tk · Rk ||1, k=1

64 1 where || · ||1 denotes the ` -norm. If p = 4, then we have,

Z c ¯ ˜ ˜ ˜ ˜ P(Q ≤ c) = (T1 + R1) ∗ (T2 + R2) 0 Z c  ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜  = T1 ∗ T2 + T1 ∗ R2 + R1 ∗ T2 + R1 ∗ R2 0 Z c  Z c Z c Z c Z c Z c Z c ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + T1 R2 + R1 T2 + R1 R2 0 0 0 0 0 0 0 Z c  Z c Z c Z c Z c ˜ ˜ ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + R2 + R1 + R1 R2 0 0 0 0 0 Z c  Z c Z c ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + (1 + R1)(1 + R2) − 1. 0 0 0

˜ The second last inequality follows from the fact that ||Tk||1 < 1 since fYk is a probability density function. This result could be easily generalized to any even p:

  Z c  p/2 Z c ¯ ˜ ˜ Y ˜  P(Q ≤ c) ≤ T1 ∗ · · · ∗ Tp/2 + (1 + Rk(y)dy) − 1. (2.17) 0 k=1 0 

Hence we may calculate P (Q¯ ≤ c) by

Z c ¯ ˜ ˜ P (Q ≤ c) ≈ T1 ∗ · · · ∗ Tp/2, 0 with error bound

  p/2 Z c Y ˜  (1 + Rk(y)dy) − 1. k=1 0 

˜ Note that in the derivation of (2.17), we actually don’t have to use the fact ||Tk||1 < 1, and the error bound would be

  p/2 Z c Z c p/2 Z c Y ˜ ˜  Y ˜ ( Tk(y)dy + Rk(y)dy) − Tk(y)dy. k=1 0 0  k=1 0

65 R c ˜ We might also use the fact 0 Tk < FYk (c) to derive another error bound,

  p/2 Z c p/2 Y ˜  Y (FYk (c) + Rk(y)dy) − FYk (c). (2.18) k=1 0  k=1

2 When the number of χ1 random variables is odd, we only need to first calculate

2 the distribution function of the weighted sum of the p − 1 χ1 random variables and then perform one numerical integration.

2.3.3 Implementation of Bausch’s Method

We implemented Bausch’s method in C++ to gain maximum speed. The source code and executables of our program BACH (Bausch’s Algorithm for CHi-square weighted sum) are freely available at http://haplotype.org. We used the GNU

Multiple Precision Arithmetic Library (GMP) so that we can use arbitrary-precision

floating-point numbers to accurately calculate extremely small p-values. By de- fault, the floating-point numbers in our program have 76 effective digits.

We now lay out the implementation details. Our goal is to use as few as possible Taylor expansion terms to achieve a desired precision. First, we sort the weights so that λ1 ≥ λ2 ≥ · · · ≥ λp. If p is odd, the smallest weight is held for the numerical integration at last, which is simply done by a weighted sampling scheme. The numerical integration almost never introduces additional noticeable error to the p-value due to its high precision and the ordering of the weights. The main reason for this ordering, nevertheless, is that we use Taylor expansion to approximate

λ2k−1 − λ2k I0( y) 4λ2k−1λ2k

66 and thus we want to make (λ2k−1 −λ2k)/4λ2k−1λ2k as small as possible (I0(x) grows fast as x increases). In fact, if we don’t order the weights, we are more likely to encounter the integration of e−βxxn with extremely small β in the convolution ˜ ˜ of T1,..., Tp/2. This would also make the algorithm numerically unstable. Now we explain how to find out the required Taylor expansion degrees. By Taylor’s

Theorem, we can control

Rk(y) ≤ δ(Tk(y) + Rk(y)), ∀0 < y < c.

(The method will be given later.) Then,

Z c Z c ˜ ˜ 1 ˜ FYk (c) = P(Yk < c) = [Tk(y) + Rk(y)]dy ≥ Rk(y)dy. 0 δ 0

By (2.18), the error bound is given by

Z c  def. ¯ ˜ ˜ Err(c) = P(Q ≤ c) − T1 ∗ · · · ∗ Tp/2 0 p/2 Z c p/2 Y ˜ Y ≤ (FYk (c) + Rk(y)dy) − FYk (c) k=1 0 k=1 p/2 p/2 Y Y ≤ (FYk (c) + δFYk (c)) − FYk (c) k=1 k=1 p/2  p/2  Y ≤ (1 + δ) − 1 FYk (c) k=1 p/2 pδ Y ≈ F (c) 2 Yk k=1

Hence, to control the error, we only need to choose δ such that

−1  p/2  2 Err(c) Y  δ ≥ F (c) . p Yk k=1 

67 Since usually we care more about the relative error instead of the absolute error, we

first calculate a lower bound for the p-value by the method described in Chap. 7.4.2

(Eq. (7.24)) and then use it to determine the value of Err(c).

In practice, small p-values are of most interest. For example, in GWAS, the significance threshold is typically 5 × 10−8, after correction for multiple compar- isons. Here we describe a trick that makes the calculation of extremely small p-values more efficiently. Note that the tail probability of Q¯ should have the form

p/2 ¯ X P(Q > c) = exp(−Akc)Gk(c), k=1

where Ak ∈ R and Gk(c) is an infinite power series. However, as we omit Rk(y) in the Taylor expansion of modified Bessel functions, we end up with

p/2 ¯ X 0 P(Q > c) ≈  + exp(−Akc)Gk(c), k=1

0 where Gk(c) is a finite power series and  6= 0. We might call  the limiting error since as c → ∞ (the p-value then should vanish), it is the error of the p-

value computed using our algorithm. In our implementation, we simply neglect

 when c is large. Note that we cannot always neglect  since its existence is to

offset the error introduced by Taylor expansion. From our tests, the extremely

small p-values obtained by omitting  appear to be very accurate. To see the

reason, note that when c is large, the amount we have omitted in the power series,

0 Gk(c) − Gk(c), has very little influence on the p-value, since the order of the

−Akc magnitude of exp(−Akc)Gk(c) is dominated by e .

In theory, it is likely that our algorithm fails to produce a correct or highly

accurate p-value. For instance, the maximum degree of the Taylor expansion of I0 is set to 160 in our software (it can be modified by the user), but when c is large, it

68 might not be enough to produce a desired relative error bound. Another possible

scenario is that the p-value is so small that we have to discard , as described in the last paragraph. In such cases, we calculate the lower and the upper bound of the p-value by (7.25) in Chap. 7.4.2. These bounds can always be quickly and exactly evaluated. When p is large or the p-value is very small, these bounds already meet most practical needs. If our p-value fails to be within the bounds, which never occurred in our simulation or tests, we use the bounds to estimate the p-value. Otherwise we trust our p-value but report the error bound by comparing the p-value with the bounds.

2.3.4 Calibration of the P-values

Using our asymptotic results, we can evaluate extremely small p-values for Bayes factors, which is an important advantage in applications such as GWAS compared to the permutation method described in Servin and Stephens [2007]. However, our PBF is an asymptotic p-value, and hence its calibration for moderate sample sizes need to be examined. Since LRT is one of the most widely used asymptotic tests, the calibration of PBF is compared with that of PLR.

A GWAS dataset (IOP) is used for simulation. The details of the IOP dataset are given in Chap. 7.7.1. Sample size n is chosen to be 100, 300, 1000 and the

number of covariates p is set to 10 or 20. For every combination of n and p,

a subset of genotypes of n individuals and p SNPs is randomly sampled, and y is simulated under the null model, y ∼ MVN(0, In). Then PLR and PBF are computed (σ = 0.2). This step is repeated for 107 times. Fig. 2.4 and Fig. 2.5 show that PBF is well calibrated, and the calibration is usually better than PLR at the tail. The performance of the corrected test statistics is also investigated. For both tests, a third-order approximation to the expected value of the test statistic under

69 Figure 2.4: Calibration of PBF and PLR for p = 10. The red dots represent PLR from the likelihood ratio test and the blue represent pB. The grey region indicates a 95% confidence band, which is calculated using the fact that a order statistic from a uniform distribution follows a Beta distribution. the null (see Eq. (2.14) and (2.15)) is used. Simulations show that the Bartlett- type correction improve the calibration of the p-values substantially on the linear scale but not much on the logarithmic scale. Overall, it can be concluded that PBF is well-calibrated when the sample size is more than a few hundred. Furthermore, as far as the tail calibration is concerned, the uncorrected test statistic can simply be used for computing PBF while for PLR the correction seems important for a small sample size.

70 Figure 2.5: Calibration of PBF and PLR for p = 20. The red dots represent PLR from the likelihood ratio test and the blue represent pB. The grey region indicates a 95% confidence band, which is calculated using the fact that a order statistic from a uniform distribution follows a Beta distribution.

71 Chapter 3

A Novel Algorithm for

Computing Ridge Estimators

This chapter introduces a novel algorithm for computing ridge regression estima- tors, which is a critical step in the calculation of Bayes factors given in Chap. 2.

This algorithm could be applied to the Bayesian variable selection and substan- tially boosts the MCMC sampling.

3.1 Background

Consider the linear regression with the response variable y = (y1, . . . , yn) and an n × p design matrix X,

y = Xβ + ε.

−1 The errors vector 1, . . . , n are i.i.d. with expectation 0 and variance τ . The ridge regression estimator for the coefficient vector β is obtained via a type of

72 Tikhonov regularization [Tikhonov and Arsenin, 1977],

ˆ 2 2 βR(λ) = arg min ||y − Xβ|| + c||β|| β (3.1) = (XtX + λI)−1Xty

where || · || denotes `2 norm. Comparing with the ordinary least squares estimator ˆ (denoted by βLS) given in Proposition. 7.14, it can be seen that the only difference in the objective function is the penalty term c||β||2. The constant c ≥ 0 is usually

referred to as the regularization or the shrinkage parameter since it forces βˆ to

be closer to 0. If c = 0, it reduces to the least-squares fitting. Initially the ridge

t ˆ regression was proposed for the cases where X X is ill-conditioned and βLS is unstable due to large variance or even numerically cannot be computed [Hoerl

and Kennard, 1970a,b]. The most important advantage of the ridge estimator can

be explained by the following decomposition of the mean squared error (MSE):

 2 2 −1 E (y − yˆ) = (E[y] − yˆ) + Var(ˆy) + τ .

ˆ The term, E[y] − yˆ, is called the bias. When βLS is used, the bias is clearly ˆ zero since βLS is the best linear unbiased estimator (BLUE) by Gauss Markov ˆ theorem [Plackett, 1950]. However, the MSE for βLS is not necessarily small due to the variance term. The ridge estimator, assuming c > 0, on the contrary is

always biased but the variance might be small if c is chosen appropriately. In fact, ˆ there always exists a c such that the ridge estimator βR(c) attains a smaller mean ˆ squared error than βLS [Hoerl and Kennard, 1970a], which is sometimes known as the bias-variance tradeoff.

Though ridge regression is a non-Bayesian method, the ridge estimator plays a

fundamental role in the Bayesian linear regression. Consider the Bayesian linear

73 regression model defined in (1.1) with the independent normal prior V = σ2I. ˆ First, βR is the maximum a posteriori (MAP) estimator (see Eq. (1.2)) and the posterior mean with the shrinkage parameter c = σ−2. Second, the calculation of ˆ βR is the “rate-determining” step in computing the Bayes factor (see Eq. (1.3)), which is central to a Bayesian variable selection procedure. When we have an extremely large sample space, the efficiency of a MCMC sampling procedure for

Bayesian variable selection hinges largely on whether the ridge estimators can be computed quickly. See Ishwaran and Rao [2005] for a discussion on the relationship between ridge regression and Bayesian variable selection.

3.2 Direct Methods for Computing Ridge

Estimators

The generalized ridge estimator [Draper and Van Nostrand, 1979], denoted by βˆ henceforth in this chapter (the subscript R dropped for simplicity), is obtained by solving

(XtX + Σ)βˆ = z, (3.2) where z = Xty and Σ is some diagonal matrix with nonnegative diagonals.

Clearly the ridge estimator defined in (3.1) is a special case with

Σ = σ−2I.

The length of βˆ is still denoted by p. Define

A def=. XtX + Σ, (3.3)

74 which is assumed to be invertible henceforth. The methods for solving (3.2) can be

divided into two groups: direct methods and iterative methods. In this section,

I first introduce the direct methods. Each of them has some advantages under

certain circumstances. Some methods make advantages of the structure of A

while some are just usual methods for solving systems of linear equations. Since

in our main application (variable selection for GWAS), the matrix X is usually

an n × p matrix with n  p, methods designed for rank-deficient XtX are not discussed.

3.2.1 Spectral Decomposition of XtX

Since XtX is positive semi-definite, it admits the spectral decomposition (Chap. 7.1.3),

XtX = UΛU t.

If Σ = σ−2I, we then have

XtX + Σ = U(Λ + σ−2I)U t.

Therefore, we can obtain the inverse of A and compute βˆ by

βˆ = U(Λ + σ−2I)−1U tz.

One advantage of this method is that if want to evaluate βˆ for many different

values of σ, we only need to perform the spectral decomposition once and every

new evaluation for βˆ costs ∼ 4p2 flops (floating-point operations). But later we

will see there is a better method for this purpose. Second, if we are computing

the Bayes factor defined in (1.3) with the independent normal prior, we can also

75 analytically evaluate its null distribution, since we have actually obtained the singular values of X .Moreover, if we obtain this spectral decomposition by the singular value decomposition (Chap. 7.1.2) of X, the behaviour of the Bayes factor under the alternatives can be quantified too. This is a unique advantage of this method.

Unfortunately, this method is probably the slowest. The standard approach to computing the spectral decomposition of XtX is to perform SVD of either X or XtX. For a square matrix, SVD has a time complexity cubic in p. The exact count of flops cannot be determined since it depends on both the algorithm and the accuracy. For example, for an n×p matrix, we need about 4np2 −4p3/3+O(p2)

flops by Golub-Kahan algorithm [Trefethen and Bau III, 1997, Lec. 31] .

3.2.2 Cholesky Decomposition of XtX + Σ

The Cholesky decomposition [Trefethen and Bau III, 1997, Chap. IV] is another standard way to solve systems of linear equations. Since A is positive definite, it can be decomposed into

A = LLt, where L is a lower triangular real matrix. Once we obtain L, we can quickly compute Ltβˆ by forward substitution and then βˆ by backward substitution. Both substitutions require ∼ p2 flops. The Cholesky decomposition, though has cubic time complexity, is much faster than the SVD and only needs ∼ p3/3 flops [Tre- fethen and Bau III, 1997, Lec. 23]. It is the fastest among the four methods introduced in this section.

76 t 3.2.3 QR Decomposition of the Block Matrix Xt Σ1/2

Any n × p real matrix can be factorized into an n × n orthogonal matrix and a n × p upper triangular matrix. This is called the QR decomposition. The QR decomposition is slower than the Cholesky decomposition but faster than the SVD.

Consider the QR decomposition of the block matrix Xt Σ1/2t,

  X     = QR, (3.4) Σ1/2 where Q is (n + p) × p and R is p × p. Notice that

  X  t 1/2   t t t t X Σ   = R Q QR = R R = X X + Σ = A. Σ1/2

Since R is upper triangular, now we can compute βˆ by one backward substitution and one forward substitution, just like in the last method using the Cholesky decomposition. The flop count for this QR decomposition is ∼ 2np2 − 4p3/3 by

Householder transformation [Trefethen and Bau III, 1997, Lec. 10]. In fact, this is acceptable even compared with the Cholesky decomposition since we don’t need to compute XtX, which requires ∼ np2 flops. But in practice usually XtX (or part of XtX) can be precomputed and stored in the memory, and the update of

XtX can be performed very efficiently.

In some situations, this method is most advantageous. Consider a variable selection procedure with a fixed regularization parameter, i.e., Σ is a diagonal matrix with a constant scaling factor. At every step, we may add or delete a column from X and βˆ need to be recomputed. There is no easy way to update the Cholesky decomposition of A and thus it must be recomputed at every step.

77 However, the QR decomposition of the block matrix in (3.4) is easy to obtain by updating. For example, the removal of the k-th column from X corresponds to the removal of the k-th column and the (n + k)-th row of the matrix Xt Σ1/2t. The new QR decomposition can be very efficiently computed using Givens rotation to introduce zeroes. See Golub and Van Loan [2012, Chap.12.5] for more details.

3.2.4 Bidiagonalization Methods

Numerically the singular value decomposition is often computed via a two-stage algorithm. The first step is called bidiagonalization, which is very similar to SVD except that in the middle of the decomposition is an upper bidiagonal matrix instead of a diagonal. This step can be done within a finite number of opera- tions. The second step is an iterative procedure for finding all the singular values.

Though each iteration is fast, we may need a very large number of iterations if the distribution of the singular values is extreme or a very high accuracy is needed.

Eld´en[1977] noticed that the second iterative stage could be avoided. He devel- oped an algorithm for computing βˆ using only the bidiagonalization, of which the number of operations is of the same order of magnitude as for the SVD-based algorithms (the first method introduced in this section). Therefore, this method might be preferred when we have a fixed X but want to compute βˆ for many different choices of Σ. Nevertheless, the bidiagonalization is still very slow. It requires ∼ 4np2 − 4p3/3 flops by Golub-Kahan algorithm and ∼ 2np2 + 2p3 flops by Lawson-Hanson-Chan algorithm [Trefethen and Bau III, 1997, Lec. 31].

78 3.3 Iterative Methods for Computing Ridge

Estimators

Iterative methods, in contrast to direct methods, produce a sequence of approxi-

mate solutions, βˆ(1),..., βˆ(k),... , such that lim βˆ(k) = βˆ under some conditions. k→∞ If the convergence is quick, iterative methods are much more efficient than direct methods since each iteration usually has time complexity O(p2).

3.3.1 Jacobi, Gauss-Seidel and Successive

Over-Relaxation

Recall our objective is to solve Aβˆ = z. Suppose we have the following decompo-

sition for A:

A = M + N.

where M is invertible. Then,

  Mβˆ = −Nβˆ + z ⇔ βˆ = M −1 −Nβˆ + z . (3.5)

This relationship inspires us to compute βˆ by an iterative procedure. We start

from an initial guess βˆ(0) and in each iteration we improve our guess by

  βˆ(k+1) = M −1 −Nβˆ(k) + z . (3.6)

Of course this approach does not necessarily work. It is possible that βˆ(k) becomes

worse and worse and eventually diverges to infinity. To study its convergence

79 properties, let’s define the error vector at the k-th iteration by

e(k) = βˆ(k) − βˆ. (3.7)

Combining (3.5), (3.6) and (3.7) yields to

e(k+1) = (−M −1N)e(k).

Since lim βˆ(k) = βˆ if and only iff lim e(k) = 0, we have the following theorem k→∞ k→∞ (see Golub and Van Loan [2012, Chap. 10.1.2] for a rigorous proof).

Theorem 3.1. (Convergence of standard iterations) Suppose both A and

M are invertible. Denote the spectral radius of −M −1N by

ρ(−M −1N) def=. max{|λ| : λ is an eigenvalue of − M −1N}.

If ρ(−M −1N) < 1, then for any initial guess βˆ(0), the sequence βˆ(k) defined by (3.6) converges to the true solution βˆ.

Furthermore, the smaller the spectral radius of M −1N, the faster the error

vanishes. Now we are ready to introduce three standard iterative methods. We

split the matrix A into three parts by

A = L + D + U,

where L is the strictly lower triangular component, U is the strictly upper trian-

gular component and D contains only the diagonals. Then we can define three

80 iterative procedures by

h i Jacobi method: βˆ(k+1) = D−1 −(L + U)βˆ(k) + z ;   Gauss-Seidel method: βˆ(k+1) = (D + L)−1 −Uβˆ(k) + z ; h i successive over-relaxation: βˆ(k+1) = (D + ωL)−1 −(ωU − (1 − ω)D)βˆ(k) + ωz .

The Jacobi method does not necessarily converge. One sufficient condition for the convergence of the Jacobi method is strict diagonal dominance, i.e.,

X |aii| > |aij|, ∀i = 1, . . . , p. j6=i

For the other two methods, we have a more convenient result.

Proposition 3.2. Suppose A is symmetric and positive definite. Then,

(a) the Gauss-Seidel method always converges to the true solution;

(b) the successive over-relaxation method converges for ω ∈ (0, 2).

See Golub and Van Loan [2012, Chap. 10.1.2] and Allaire et al. [2008, Chap. 8.2.3] for proofs. In fact, the successive over-relaxation can be seen as a generalization of the Gauss-Seidel method. It may be derived using the equality ωAβˆ = ωz.

In each iteration of the successive over-relaxation, we update our estimate by an weighted average of the last guess and the Gauss-Seidel update. The relaxation parameter, ω, acts as the weight. When ω is chosen appropriately, the succes- sive over-relaxation can achieves a faster convergence rate than the Gauss-Seidel method. However, unless the matrix A has a very nice structure, it is usually very difficult to find the optimal value for ω (clearly we don’t want to compute all the eigenvalues of A).

When implementing these methods, we should take advantage of the sparsity

81 of the matrices L, U and D. The updating equations for the three methods can be rewritten as

i−1 p ˆ(k+1) 1 P ˆ(k) P ˆ(k) Jacobi method: βi = (zi − aijβj − aijβj ); aii j=1 j=i+1 i−1 p ˆ(k+1) 1 P ˆ(k+1) P ˆ(k) Gauss-Seidel method: βi = (zi − aijβj − aijβj ); aii j=1 j=i+1 i−1 p ˆ(k+1) ω P ˆ(k+1) P ˆ(k) ˆ(k) successive over-relaxation: βi = (zi − aijβj − aijβj ) + (1 − ω)βi . aii j=1 j=i+1 (3.8)

Hence, for all the three methods, each iteration costs about 2p2 flops.

3.3.2 Steepest Descent and Conjugate Gradient

Another important class of iterative methods is called Krylov subspace meth-

ods [Trefethen and Bau III, 1997, Lec. 38]. The idea is to find an approximate

solution, βˆ(k), in the subspace

span{z, Az, A2z,..., Ak−1z}. (3.9)

Steepest descent is such an algorithm. Let

ˆ(k) r(k) = z − Aβ

be the “residual” at the k-th iteration. We update βˆ(k) and r(k) together by

rt r ˆ(k+1) ˆ(k) (k) (k) β = β + t r(k); r(k)Ar(k) ˆ(k+1) r(k+1) = z − Aβ .

82 Just like the name suggests, steepest descent searches for the next estimate along

the current gradient. In practice, it is usually used as an optimization algorithm to

find the local maximum (minimum). However, it is rarely used for solving systems

of linear equations due to its bad convergence properties. Instead, conjugate

gradient, which is another Krylov subspace method, is often preferred. Conjugate

ˆ(k) gradient relies on a conjugate sequence of vectors t(1),..., and updates β , r(k) and t(k) by

t r(k)r(k) t(k+1) = r(k) + t t(k); r(k−1)r(k−1) rt r ˆ(k+1) ˆ(k) (k) (k) β = β + t t(k+1); t(k+1)At(k+1) t r(k)r(k) r(k+1) = r(k) − t At(k+1). t(k+1)At(k+1)

The initial values for r and t are given by

ˆ(0) r(0) = t(1) = z − Aβ .

Conjugate gradient usually performs well when the matrix A is large and sparse.

Theoretically speaking, it can also be viewed as a direct method since it always con-

verges to the true solution within p iterations up to the rounding-off error. See Tre-

fethen and Bau III [1997, Lec. 38] and Golub and Van Loan [2012, Chap. 10.2] for

more information.

83 3.4 A Novel Iterative Method Using Complex

Factorization

In this section, a new method for solving (3.7) is proposed. It is iterative, just like the Gauss-Seidel method, and can be generalized by introducing a relaxation parameter. But unlike all the iterative methods discussed in the last section, this method relies on the Cholesky decomposition of $X^tX$ and makes use of the special structure of the matrix $A$. In some applications, like variable selection, the Cholesky decomposition of $X^tX$ can be obtained by updating very efficiently, whereas there is no easy way to update the Cholesky decomposition of $A$ if the diagonals of $\Sigma$ change. Our idea is based on a "complex factorization" of the matrix $A$, and thus we call our method ICF (Iterative solutions using Complex Factorization).

3.4.1 ICF and Its Convergence Properties

Assume the Cholesky decomposition of the Gram matrix $X^tX$ is available and given by $X^tX = R^tR$, where $R$ is an upper triangular matrix. Then we have $A = R^tR + \Sigma$. Define
$$D = R^t\Sigma^{1/2} - \Sigma^{1/2}R, \qquad H = (R^t - i\Sigma^{1/2})(R + i\Sigma^{1/2}). \qquad (3.10)$$

One can check that $A = H - iD$. According to (3.6), we may iteratively compute $\hat{\beta}$ by
$$\hat{\beta}^{(k+1)} = H^{-1}(iD\hat{\beta}^{(k)} + z).$$

Two important observations are made. First, since $H$ is a product of triangular matrices, calculating the right-hand side is quick by forward and backward substitutions. Second, since $\hat{\beta}$ is real, we may discard the imaginary part of the right-hand side in each iteration. Thus, the estimate for $\beta$ is updated by

$$\hat{\beta}^{(k+1)} = \mathrm{Re}\left[H^{-1}(iD\hat{\beta}^{(k)} + z)\right]. \qquad (3.11)$$

Discarding the imaginary part turns out to substantially expedite the convergence.

Just like the successive over-relaxation, we can define a more general iterative procedure by introducing a relaxation parameter ω ∈ (0, 1],

$$\hat{\beta}^{(k+1)} = \mathrm{Re}\left[(1 - \omega)\hat{\beta}^{(k)} + \omega H^{-1}(iD\hat{\beta}^{(k)} + z)\right]. \qquad (3.12)$$

When $\omega = 1$, (3.12) reduces to (3.11). The iterative method defined by (3.12) is referred to as ICF (Iterative solutions using Complex Factorization). Each iteration requires about $6p^2$ flops (a matrix-vector multiplication plus two complex backward/forward substitutions). We still use $e^{(k)}$ to denote the error at the $k$-th iteration. It can be shown that $e^{(k+1)} = \Psi(\omega)e^{(k)}$, where

$$\Psi(\omega) = \mathrm{Re}\left[(1 - \omega)I + i\omega H^{-1}D\right].$$
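To make the update (3.12) concrete, here is a minimal C++ sketch of one ICF iteration for the ridge case $\Sigma = sI$, so that $\Sigma^{1/2} = \sqrt{s}\,I$ and $D = \sqrt{s}(R^t - R)$; it applies $H^{-1}$ through one complex forward substitution and one complex backward substitution and keeps only the real part. The function name icf_step and the dense row-major storage are assumptions for illustration, not the fastBVSR implementation.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd  = std::complex<double>;
using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;  // dense, row-major; illustrative only

// One ICF update (3.12) assuming Sigma = s*I: beta <- Re[(1-w)beta + w*H^{-1}(i*D*beta + z)],
// with D = sqrt(s)*(R' - R) and H = (R' - i*sqrt(s)*I)(R + i*sqrt(s)*I),
// where R is the upper-triangular Cholesky factor of X'X.
void icf_step(const mat& R, double s, const vec& z, double w, vec& beta) {
    const std::size_t p = beta.size();
    const double sq = std::sqrt(s);

    // v = i*D*beta + z: real part z, imaginary part sqrt(s)*(R'beta - R*beta)
    vec Rb(p, 0.0), Rtb(p, 0.0);
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = i; j < p; ++j) {   // R is upper triangular
            Rb[i]  += R[i][j] * beta[j];
            Rtb[j] += R[i][j] * beta[i];
        }
    std::vector<cd> v(p), wv(p), u(p);
    for (std::size_t i = 0; i < p; ++i) v[i] = cd(z[i], sq * (Rtb[i] - Rb[i]));

    // forward substitution with the lower-triangular factor (R' - i*sqrt(s)*I)
    for (std::size_t i = 0; i < p; ++i) {
        cd acc = v[i];
        for (std::size_t j = 0; j < i; ++j) acc -= R[j][i] * wv[j];
        wv[i] = acc / cd(R[i][i], -sq);
    }
    // backward substitution with the upper-triangular factor (R + i*sqrt(s)*I)
    for (std::size_t ii = p; ii-- > 0; ) {
        cd acc = wv[ii];
        for (std::size_t j = ii + 1; j < p; ++j) acc -= R[ii][j] * u[j];
        u[ii] = acc / cd(R[ii][ii], sq);
    }
    // relaxed update; only the real part is kept
    for (std::size_t i = 0; i < p; ++i)
        beta[i] = (1.0 - w) * beta[i] + w * u[i].real();
}
```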

By Golub and Van Loan [2012, Theorem 10.1.1], we have the following proposition.

Proposition 3.3. The ICF method defined by (3.12) converges if and only if $\rho(\Psi(\omega)) < 1$, where $\rho$ denotes the spectral radius.

The next theorem provides the theoretical guarantee of the convergence of ICF.

Theorem 3.4. The convergence of ICF can always be obtained by choosing an appropriate relaxation parameter ω ∈ (0, 1].

Proof. First, since $H = A + iD$, the imaginary part of $H^{-1}$ can be computed explicitly. By Lemma 7.5,
$$\mathrm{Im}(H^{-1}) = -A^{-1}D(A + DA^{-1}D)^{-1}.$$

Then, by the fact that both $A$ and $D$ are real matrices and Lemma 7.2 (the Woodbury matrix identity),
$$\Psi(\omega) = I - \omega(I + A^{-1}DA^{-1}D)^{-1}. \qquad (3.13)$$

Inspection reveals that, given $\omega$, the spectrum of $\Psi(\omega)$ is fully determined by that of $A^{-1}D$. Because $A^{-1/2}DA^{-1/2}$ is skew-symmetric, the eigenvalues of the matrix $A^{-1}D$ must be conjugate pairs of purely imaginary numbers or zero by Proposition 7.11.

Let $\pm\eta i$ be such a pair with $\eta \geq 0$ and let $u$ be the eigenvector corresponding to the eigenvalue $\eta i$. We have
$$A^{-1}Du = i\eta u,$$
which can be rearranged to get $iu^*Du = -\eta u^*Au$, where $u^*$ denotes the conjugate transpose of $u$. From the definition in (3.10), $H$ is a Hermitian positive definite matrix, which yields
$$u^*Hu = u^*(A + iD)u = (1 - \eta)u^*Au > 0.$$

Since $A$ is also positive definite, we must have $\eta \in [0, 1)$. (This also implies that $(I + A^{-1}DA^{-1}D)$ is invertible.) Using (3.13), we can show that $\Psi(\omega)$ must have two eigenvalues equal to $(1 - \eta^2 - \omega)/(1 - \eta^2)$. Hence, to make $\rho(\Psi(\omega))$ smaller than $1$, we just need
$$\left|\frac{1 - \eta^2 - \omega}{1 - \eta^2}\right| < 1 \;\Longleftrightarrow\; \eta < \sqrt{1 - \omega/2} \qquad (3.14)$$
to hold for every possible $\eta$. Since $\eta$ is always strictly smaller than $1$, there must exist a positive $\omega$ that satisfies (3.14).

Consider the special case ω = 1. Then from the proof of the last theorem, the spectral radius of Ψ can be computed using that of A−1D.

Corollary 3.5. If $\omega = 1$, then $\rho(\Psi) = \rho^2(A^{-1}D)/(1 - \rho^2(A^{-1}D))$.

Proof. By definition, $\eta_{\max} = \rho(A^{-1}D)$. The claim is then proved by noticing that all the eigenvalues of $\Psi(1)$ must be non-positive.

When ω = 1, we can also formulate a sufficient condition for the convergence

of ICF.

Corollary 3.6. If $\Sigma = \sigma^{-2}I$ and $\omega = 1$, a sufficient condition for the convergence of ICF is $\max_{i<j} |R_{ij}| < 1/(\sigma\sqrt{2})$.

Proof. By the submultiplicativity of the matrix norm,
$$\rho(A^{-1}D) \leq \rho(A^{-1})\rho(D) \leq \sigma^2\rho(D) \leq \sigma^2\|D\|_{\max} = \sigma\max_{i<j}|R_{ij}|.$$
By (3.14), $\max_{i<j}|R_{ij}| < 1/(\sigma\sqrt{2})$ is therefore a sufficient condition for $\rho(\Psi(\omega)) < 1$.

3.4.2 Tuning the Relaxation Parameter for ICF

To make ICF generally applicable, it is necessary to work out a convenient way to choose the relaxation parameter $\omega$. From the proof of Theorem 3.4, it can be seen that there is a close relationship between $\rho(\Psi)$ and the spectrum of $A^{-1}D$. Indeed, if the latter is known, the optimal value for $\omega$ can be evaluated analytically.

Corollary 3.7. Let $\eta_{\min}$ and $\eta_{\max}$ denote the smallest and largest absolute values of the eigenvalues of $A^{-1}D$. Then the optimal value for $\omega$ is
$$\omega^* = 2\left(\frac{1}{1 - \eta_{\min}^2} + \frac{1}{1 - \eta_{\max}^2}\right)^{-1}.$$
If $\eta_{\min} = 0$, we have $\omega^* = (2 - 2\eta_{\max}^2)/(2 - \eta_{\max}^2)$ and
$$\rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2) = 1 - \omega^*.$$

Proof. From the proof of Theorem 3.4, we know that the smallest and largest eigenvalues of $\Psi(\omega)$ are $1 - \omega/(1 - \eta_{\max}^2)$ and $1 - \omega/(1 - \eta_{\min}^2)$. They can be positive or negative. Recall $\eta \in [0, 1)$. Since the optimal value of $\omega$ must minimize the spectral radius of $\Psi$, we have
$$\omega^* = \operatorname*{arg\,min}_{\omega \in (0,1]}\, \rho(\Psi(\omega)) = \operatorname*{arg\,min}_{\omega \in (0,1]}\, \max\left\{1 - \frac{\omega}{1 - \eta_{\min}^2},\ \frac{\omega}{1 - \eta_{\max}^2} - 1\right\}.$$
$\omega^*$ is attained when the two quantities in the braces on the right-hand side are equal.

By the properties of skew-symmetric matrices, $\eta_{\min} = 0$ when $p$ is odd. When $p$ is even, it is still extremely small for a moderate sample size. Therefore it is treated as $0$ in our discussion. Simulation shows that for a large design matrix that is not ill-conditioned, like in GWAS, $\eta_{\max}$ is usually close to zero. As a result, by simply choosing $\omega = 1$, which is in fact near optimal, ICF converges strikingly fast. In Figure 3.1, both $\rho(\Psi(1))$ and $\rho(\Psi(\omega^*))$ from simulated data are plotted against sample size $n$ ($p = 500$). When $n < p$, $\rho(\Psi(1)) \gg 1$ and thus ICF fails. Even if the optimal $\omega^*$ is used, the spectral radius is still very close to 1, which means the convergence cannot be attained within a reasonable number of iterations. But once $n$ grows greater than $p$, $\rho(\Psi(1))$ plummets. This phenomenon persists for other choices of $p$.


Figure 3.1: The relationship between ρ(Ψ(1)), ρ(Ψ(ω∗)) and n with p = 500 and Σ = I. For each n we simulate 100 datasets with independent predictors. The dots denote the mean and the grey bars indicate the 2.5% and 97.5% quantiles. Xij is sampled from Bin(2, fj) with fj ∼ U(0.01, 0.99) in the left panel and from N(0, 1) in the right.

If the design matrix $X$ has severe multicollinearity, even if $n \gg p$, it is likely that $\eta_{\max}^2 > 1/2$, so that ICF does not converge for $\omega = 1$. To show this, a sub-dataset containing the first 20,000 SNPs on chromosome 1 from the IOP dataset (see Chap. 7.7.1) is constructed. For given $n$ and $p$, $X$ is sampled from this sub-dataset and its corresponding $\eta_{\max}$ is computed. Then, by Corollary 3.7, assuming $\eta_{\min} = 0$, the optimal spectral radius $\rho(\Psi(\omega^*))$ can be computed. This is repeated 1,000 times. Since neighboring SNPs often have a high correlation due to linkage disequilibrium, when $p$ is large, say several hundred, the design matrix $X$ is susceptible to severe collinearity. Figure 3.2 displays the distribution of $\rho(\Psi(\omega^*))$. Recall that when $\eta_{\max}$ is very small, $\omega^*$ is close to 1 and thus $\rho(\Psi(\omega^*))$ is close to zero. But in Fig. 3.2, a large proportion of the $\rho(\Psi(\omega^*))$ values are away from zero, and for those cases using $\omega = 1$ cannot even achieve convergence (this can be shown using Corollaries 3.5 and 3.7). Fortunately, $\rho(\Psi(\omega^*))$ is still away from 1, which implies that the optimal convergence rate is always acceptable. An extremely interesting observation is the peak at $\rho(\Psi(\omega^*)) \approx 1/3$ in Fig. 3.2, which corresponds to $\omega^* = 2/3$ and $\eta_{\max}^2 = 1/2$, the boundary value for the convergence of ICF with $\omega = 1$.

Figure 3.2: The distribution of $\rho(\Psi(\omega^*))$ for the IOP-chr1 dataset. $\eta_{\max}$ is computed using the eigendecomposition of $A^{-1}D$. The optimal spectral radius is then computed by $\rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2)$, which is equal to $1 - \omega^*$.

From these simulations, it can be concluded that as long as $n \gg p$ and an appropriate value is chosen for $\omega$, ICF should converge rapidly. In fact, it is not difficult to adjust $\omega$ automatically during the iterations. We start from $\omega^{(0)} = 1$ and assume $\eta_{\min} = 0$. The idea is to compute an estimate of the spectral radius of $\Psi(\omega)$ in each iteration, which in turn helps us determine the value of $\omega$ for the next iteration. As the number of iterations grows, the choice of $\omega$ will tend to the optimal value.

Since overall $\omega^{(k)}$ has a decreasing trend, the spectral radius of $\Psi$ at the $k$-th iteration is
$$\rho(\Psi(\omega^{(k)})) = \max\left\{1 - \omega^{(k)},\ \frac{\omega^{(k)}}{1 - \eta_{\max}^2} - 1\right\} \approx \frac{\omega^{(k)}}{1 - \eta_{\max}^2} - 1.$$

Although sometimes $1 - \omega^{(k)}$ may be greater, this only occurs when $\omega^{(k)}$ is near $\omega^*$, and therefore $\rho(\Psi(\omega^{(k)}))$ can still be approximated by $\omega^{(k)}/(1 - \eta_{\max}^2) - 1$ by Corollary 3.7. A heuristic way to estimate the spectral radius of $\Psi$ is given by
$$\hat{\rho}^{(k)} = \frac{\|\hat{\beta}^{(k)} - \hat{\beta}^{(k-1)}\|_2}{\|\hat{\beta}^{(k-1)} - \hat{\beta}^{(k-2)}\|_2},$$
which leads to an update of $\omega$ by
$$\omega^{(k+1)} = \frac{2\omega^{(k)}}{1 + \omega^{(k)} + \hat{\rho}^{(k)}}.$$
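A minimal C++ sketch of this adaptive rule is given below; the function names estimate_rho and update_omega are illustrative, and the code simply mirrors the two formulas above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Estimate the spectral radius of Psi from the last three iterates
// (the ratio of successive differences in the 2-norm).
double estimate_rho(const std::vector<double>& beta_k,
                    const std::vector<double>& beta_km1,
                    const std::vector<double>& beta_km2) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < beta_k.size(); ++i) {
        num += (beta_k[i] - beta_km1[i]) * (beta_k[i] - beta_km1[i]);
        den += (beta_km1[i] - beta_km2[i]) * (beta_km1[i] - beta_km2[i]);
    }
    return std::sqrt(num / den);
}

// omega^(k+1) = 2*omega^(k) / (1 + omega^(k) + rho_hat^(k))
double update_omega(double omega, double rho_hat) {
    return 2.0 * omega / (1.0 + omega + rho_hat);
}
```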

Theoretically, this estimation may break down in some situations. A trivial case is $\hat{\beta}^{(0)} = \hat{\beta}$. A non-trivial situation is that the error vector may be orthogonal to the eigenvectors of $\Psi$ that correspond to the largest (in absolute value) eigenvalue. In practice such special situations are very rare, and this adaptive procedure works very well in our simulation studies.

3.5 Performance Comparison by Simulation

3.5.1 Methods

Simulation is used to compare the performance of seven methods: ICF, Cholesky

decomposition (of matrix A), Jacobi method, Gauss-Seidel method, successive

over-relaxation, steepest descent and conjugate gradient. Two sub-datasets are

constructed using the IOP dataset (see Chap. 7.7.1). The first sub-dataset contains

20, 000 SNPs evenly distributed across the whole genome, which henceforth is

referred to as IOP-ind. These SNPs can be regarded as independent from each

other due to their distant genomic locations. The second contains the first 20, 000

SNPs on chromosome 1, which henceforth is referred to as IOP-chr1. As explained

in the last section, severe multicollinearity is very likely to occur when the design

matrix is sampled from this sub-dataset. For given $n$ and $p$, $X$ is sampled from each sub-dataset and $y$ is simulated under both the null, $y \sim \mathrm{MVN}(0, I)$, and the alternative, $y \sim \mathrm{MVN}(0, I + 0.5^2XX^t)$. Then the seven methods are applied to compute the corresponding ridge estimator with $\sigma = 0.5$. The Jacobi method and

steepest descent are immediately excluded from the following experiments owing

to poor performance. Both methods easily fail to converge for $p = 200$ and fail every time for $p = 500$ on the IOP-ind dataset.

Five methods enter the final comparison: ICF, Cholesky decomposition (Chol),

Gauss-Seidel method (GS), successive over-relaxation (SOR), and conjugate gradient (CG). For SOR, we have to choose a value for the relaxation parameter, which is known to be very difficult. The most famous result concerning this problem is due to Young [1954] (see also Yang and Matthias [2007]). Unfortunately, the assumption of Young's rule is not met when $p$ grows greater than 500. Thus we did some tests and decided to use 1.2 for the relaxation parameter of SOR,

which appeared to produce a better overall performance than other choices. For all the iterative methods, we start from $\hat{\beta}^{(0)} = 0$ and stop if
$$\|\hat{\beta}^{(k)} - \hat{\beta}^{(k-1)}\|_\infty = \max_i |\hat{\beta}_i^{(k)} - \hat{\beta}_i^{(k-1)}| < 10^{-6} \qquad (3.15)$$
or the number of iterations exceeds $M$, where $M = 50$ for ICF and $M = 200$ for the other methods. Initializing $\hat{\beta}^{(0)}$ to the simple linear regression estimates was also tried, but all results remained essentially unchanged. The code was written in C++ and the Cholesky decomposition was implemented using GSL (the GNU Scientific Library) [Gough, 2009]. GS and SOR were implemented according to (3.8) to obtain maximum efficiency.
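As an illustration of the stopping rule (3.15), a small helper such as the following hypothetical has_converged checks the componentwise change of the iterates against the $10^{-6}$ threshold.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stopping criterion (3.15): the infinity norm of the change between two
// successive iterates must fall below tol. Illustrative sketch only.
bool has_converged(const std::vector<double>& beta_new,
                   const std::vector<double>& beta_old,
                   double tol = 1e-6) {
    double max_diff = 0.0;
    for (std::size_t i = 0; i < beta_new.size(); ++i)
        max_diff = std::max(max_diff, std::fabs(beta_new[i] - beta_old[i]));
    return max_diff < tol;
}
```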

3.5.2 Wall-time Usage, Convergence Rate and Accuracy

Simulation shows that ICF is the best in terms of wall-time usage, convergence rate and numerical accuracy.

Wall-time Usage Sample size n is fixed to 3000 and p ranges from 10 to 1200.

For each $p$, the simulation of $X$ and $y$ is repeated 1000 times. The total wall time used by the five methods is shown in Fig. 3.3, and some of the exact numerical values are provided in Table 3.1. It is very clear that as $p$ grows larger, ICF has an overwhelming advantage over all the other methods. In fact, in the simulation with the IOP-chr1 dataset, all the other iterative methods fail. The Gauss-Seidel method and successive over-relaxation fail to converge within 200 iterations in most cases when $p \geq 600$ (see Table 3.1). If we really wanted to use these two methods to achieve convergence, the time usage would be much greater than that of the Cholesky decomposition. Conjugate gradient, although it converges in all cases as ICF does, is always slower than the Cholesky decomposition. Since the Cholesky decomposition

93 is an exact direct method, there is no reason to use conjugate gradient either.

Figure 3.3: Comparison of the wall time usage of the five methods for computing 1, 000 ridge estimators.

p Dataset | Time in seconds: Chol ICF GS SOR CG | Convergence failures: ICF GS SOR CG

50 IOP-chr1 0.035 0.020 0.029 0.031 0.107 0 8 6 0

50 IOP-ind 0.034 0.019 0.016 0.025 0.104 0 0 0 0

200 IOP-chr1 1.39 0.45 2.13 1.69 2.66 0 125 93 0

200 IOP-ind 1.38 0.304 0.339 0.385 2.29 0 1 1 0

400 IOP-chr1 10.9 2.90 19.1 16.4 16.0 0 427 344 0

400 IOP-ind 11.0 1.64 2.05 1.76 11.5 0 1 0 0

600 IOP-chr1 35.6 10.3 58.0 53.6 51.3 0 754 639 0

600 IOP-ind 35.8 4.53 6.40 4.86 31.2 0 5 4 0

800 IOP-chr1 82.7 20.5 115.5 112 125 0 933 866 0

800 IOP-ind 83.0 7.85 15.2 10.9 65.1 0 10 8 0

1000 IOP-chr1 160 35.8 183 180 244 0 979 951 0

1000 IOP-ind 161 14.3 35.9 25.1 136 0 11 7 0

1200 IOP-chr1 270 63.6 290 287 450 0 998 992 0

1200 IOP-ind 269 20.3 62.5 43.1 202 0 26 22 0

Table 3.1: The wall time usage and the number of convergence failures of the five methods for computing 1, 000 ridge estimators under the null model. The “Time” columns correspond to the points in Figure 3.3. “Convergence failures” columns give the number of cases that fail to stop before M iterations where M = 50 for ICF and M = 200 for the other iterative methods. Simulation under the alternative produces very similar results.

Convergence Rate To further investigate the difference between the iterative methods, I compare the number of iterations used by ICF, successive over-relaxation and conjugate gradient. Gauss-Seidel is excluded since it is always poorer than successive over-relaxation. 10,000 pairs of $X$ and $y$ are simulated under the null for $p = 500, 1000$. The distributions of the number of iterations needed to stop are shown in Fig. 3.4. SOR only works for the IOP-ind dataset. It is excluded in the panels for the IOP-chr1 dataset since 50% of the cases use more than 200 iterations for $p = 500$ and 96% for $p = 1000$. Conjugate gradient, on the other hand, always uses many more iterations than ICF to converge.

Figure 3.4: Comparison of the number of iterations used by ICF, SOR and CG.

Accuracy In the simulation described in the last paragraph, the maximum absolute error for all the cases that have converged is also calculated. Recall our stopping criterion (3.15), which was designed with the aim of controlling this error below $10^{-6}$.

Figure 3.5 shows that ICF usually achieves this precision while the other two

almost never do so. Since X and y are simulated under the null, the entries of the

true βˆ are usually very small. The maximum relative error is found to be usually

two orders of magnitude greater than the maximum absolute error.

Figure 3.5: Comparison of the accuracy of ICF, SOR and CG. Each panel shows the distributions of the maximum absolute error on the $\log_{10}$ scale. Red bars stand for ICF, blue for CG and yellow for SOR. The non-convergent cases are excluded and thus for SOR the total number is much smaller than 10,000 for the IOP-chr1 dataset.

Chapter 4

Bayesian Variable Selection Regression

In genome-wide association studies, one central goal is to identify the causal SNPs, or rather, the SNPs that are associated with the phenotype, from millions of genotyped SNPs. This procedure is called variable selection by statisticians. The classical approach, which is simple but effective, is to use the regression model.

In this chapter, first the methods for Bayesian variable selection based on linear regression are reviewed in Chap. 4.1. Then a novel MCMC algorithm implementing the ICF method (introduced in the last chapter) is proposed, which substantially expedites the model fitting of Bayesian variable selection.

4.1 Background and Literature Review

Consider the regression model,

$$y = \mu\mathbf{1} + X\beta + \varepsilon = \mu\mathbf{1} + \sum_{i=1}^{N}\beta_i x_i + \varepsilon, \qquad \varepsilon \sim \mathrm{MVN}(0, \tau^{-1}I),$$

where $X$ is an $n \times N$ matrix with $N \gg n$ and $\varepsilon$ represents the errors, which are usually assumed to be i.i.d. normal random variables. Both $y$ and $X$ are assumed to be centered so that the intercept term is omitted. If there are other confounding covariates to be controlled for, they should also be regressed out from $y$ and $X$.

To avoid overfitting, we have to impose some constraints on β. For example, we may use ridge regression to shrink the estimates for βi to zero. However, it is often difficult to interpret such results. A more convenient way is to assume the sparseness of the model, i.e., to assume that most elements of β are zero. This assumption is indeed very desirable in most applications. For example, in GWAS, each column of X represents a SNP and only a small number of SNPs may be correlated with the phenotype y. The identification of these causal SNPs is of primary interest to most GWAS studies.

A wide variety of Bayesian methods for variable selection have emerged since the late 1980s. It is difficult to divide them into distinct categories due to the common features shared between different methods. For a comprehensive review structured in this way, see O'Hara and Sillanpää [2009]. I take a different approach. The review is separated into two parts, the formulation of the model and how to fit the model. They are the two defining elements of a Bayesian variable selection method.

For more introductory materials or surveys, see Miller [2002, Chap. 7] and Walli

[2010] among others. Innovations concerning the model selection criterion, such as the deviance information criterion [Spiegelhalter et al., 2002], the fractional

Bayes factor [O’Hagan, 1995] and the intrinsic Bayes factor [Berger and Pericchi,

1996a,b], are beyond the scope of this chapter and thus not to be discussed.

4.1.1 Models for Bayesian Variable Selection

Indicator Variable and “Spike and Slab” Prior Unlike frequentists’ model

selection methods, for example LASSO [Tibshirani, 1996] which outputs a single

best model, Bayesian variable selection procedures usually attempt to compute the

marginal posterior probability of every predictor’s being included in the model.

This can be achieved by introducing the auxiliary indicator variable γ ∈ {0, 1}N

and using the “spike and slab” prior [Ishwaran and Rao, 2005],

$$\beta_i \mid \gamma_i \overset{\text{i.i.d.}}{\sim} (1 - \gamma_i)F_0 + \gamma_i F_1. \qquad (4.1)$$

F0 is a distribution concentrated at or around zero (“spike”) and F1 is a flat distribution spread over a wide region (“slab”). Thus, the variable selection has been translated to the parameter estimation of γ. Typically, the prior for γ is

$$\gamma_i \mid \pi \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi).$$

$\pi$ may be chosen a priori or treated as a hyperparameter (and thus requires a hyperprior). In most applications, $\pi$ is chosen to be very small to reflect the prior belief that only a small number of the covariates are truly effective, implicitly introducing a penalty on model complexity into the marginal likelihood. $P(\gamma_i = 1 \mid y)$ is the posterior probability of a predictor being in association with the response

variable and is sometimes referred to as the posterior inclusion probability (PIP).

The estimation of the posterior inclusion probabilities is a major goal of Bayesian

variable selection.

The first use of this prior seems to be due to Mitchell and Beauchamp [1988], where

F0 is chosen to be the degenerate distribution with unit probability mass at 0,

denoted by δ0, and F1 is a uniform distribution on (−b, b) for some b > 0. A more

100 common choice for F1 is a normal distribution with mean 0. For instance see Chen and Dunson [2003] which also generalizes the “spike and slab” prior via a matrix factorization. Note that F0 and F1 may be allowed to change with i, for example, when the covariates are heterogeneous and form several distinct groups. Another important example of “spike and slab” prior is called stochastic search variable selection (SSVS), proposed by George and McCulloch [1993]. They considered

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i)N(0, \phi_i) + \gamma_i N(0, c_i\phi_i).$$

By letting φi be small and ci be large, this prior also takes “spike and slab” shape. This method was later generalized to non-conjugate settings in George and McCulloch [1997]. A practical difficulty in the implementation of SSVS is the tuning of the parameters ci, φi, which is crucial to the mixing of the Gibbs sampling chain and thus the accuracy of the posterior inferences.

Shrinkage Priors for β Instead of using the indicator variable γ, the model sparseness can also be attained by using a shrinkage prior for β. Such a shrinkage prior should approximate the “spike and slab” shape so that when the data displays no evidence for association between βi and y, the estimate for βi is shrunk to

zero. A common approach is to let $\beta_i \mid \phi_i \overset{\text{i.i.d.}}{\sim} N(0, \phi_i)$ and then put a shrinkage hyperprior on $\phi$. For example, we may let $p(\phi_i) \propto 1/\phi_i$, which is sometimes called Jeffreys' prior. See Xu [2003] for an application to gene mapping. However, the use of this model is very controversial because the joint posterior distribution is improper, though the full conditionals, which are used by Gibbs sampling, are proper. See Hobert and Casella [1996] for the proof. An alternative choice is to

101 use

$$p(\beta_i \mid \phi) = \frac{1}{\sqrt{2\phi}}\exp\left(-\sqrt{\frac{2}{\phi}}\,|\beta_i|\right),$$

i.e., to use a double exponential (Laplace) prior for βi. This prior is referred to as “Bayesian LASSO” [Park and Casella, 2008] since its maximum a posteriori (MAP)

estimation coincides with the LASSO method. φ may be chosen by an empirical

Bayesian method or given a shrinkage hyperprior. To do variable selection with

these models, one can simply set a threshold C and select the covariates with

|βi| > C.

The mixture of g-priors is another important example. Recall Zellner’s g-prior

$$\beta \mid g \sim \mathrm{MVN}(0,\ g\tau^{-1}(X^tX)^{-1}).$$

Liang et al. [2008] proposed to put the following hyperprior on g,

$$p(g) = \frac{a - 2}{2}(1 + g)^{-a/2}, \qquad g > 0.$$

In their simulation studies, they used $a = 3$. An alternative is the Zellner-Siow prior [Zellner and Siow, 1980], which they show is actually a special case of mixtures of g-priors with an inverse-gamma prior on $g$. Remarkably, Liang et al. [2008] proved the consistency of using this prior for both model selection and prediction.

To obtain the PIPs, for small N we may compute the marginal likelihood of every possible model (or equivalently, the null-based Bayes factors) and then compute the PIPs by model averaging.

4.1.2 Methods for the Model Fitting

Markov Chain Monte Carlo Methods There is usually no closed-form ex-

pression for the posterior distribution in Bayesian variable selection. The most

common strategy for posterior inference is to use Markov chain Monte Carlo

(MCMC) methods to generate samples from the posterior. For an introduction to

various MCMC algorithms, see Liu [2008] and Brooks et al. [2011] among others.

In early attempts, Gibbs sampling [Liu, 2008, Chap. 6] was often used to fit the

Bayesian variable selection models with “spike and slab” priors [George and Mc-

Culloch, 1993, Kuo and Mallick, 1998]. However, if F0 and F1 in (4.1) are very different, a proposal of flipping γi may be very unlikely to be accepted because a value of βi sampled from F1 may be regarded as very unlikely under F0. As a result, the mixing of the Markov chain can be very slow and never converge.

Dellaportas et al. [2002] used the idea of Carlin and Chib [1995] and tried to solve this problem by a “pseudo prior”. In their method, when γi = 0, the covariate xi is excluded from the model but βi, instead of being set to 0, is sampled from a prior that is close to its conditional given γi = 1. But this method still relies on very careful tuning that may be difficult to implement in practice.

Another class of MCMC methods uses the Metropolis-Hastings algorithm (see

Chap. 7.6). An early example was the reversible jump MCMC algorithm [Green,

1995, Sillanpää and Arjas, 1998], which required the computation of the Jacobian of the proposal to account for the change in the dimension of the parameter space.

Uimari and Hoeschele [1997] compared the performance of reversible jump MCMC and other Gibbs-sampling-based approaches. Later it was noticed that the MCMC could be more efficient if βi’s are integrated out in each iteration. We simply propose to add or remove a covariate but no longer sample βi’s. The acceptance ratio is then obtained by computing the null-based Bayes factor or the marginal

103 likelihood of the model. This method is used in a large amount of literature, e.g.,

Godsill [1998], Yi [2004], Guan and Stephens [2011], Zhou et al. [2013]. The model

and the MCMC algorithm of Guan and Stephens [2011] will be discussed in great

detail in the next section.

Approximate and Variational Methods The computational difficulty of Bayesian inference arises from an intractable integration. Hence we may approximate this integral by some numerical techniques, e.g. Sen and Churchill [2001], or find a tractable asymptotic estimate. Ball [2001] estimates the marginal likelihood of a model by a modification of the BIC score.

The variational Bayesian method aims to find an analytical approximation to the true joint posterior distribution by some known closed-form distribution functions. See Šmídl and Quinn [2006] for a full introduction to the theory. On the one hand, since a variational method produces a distributional approximation, various posterior inferences can be carried out very easily once we obtain the approximating distributions. On the other, variational methods are deterministic and thus very fast compared with MCMC. Carbonetto and Stephens [2012] describe a simple but very efficient iterative variational algorithm for Bayesian variable selection with the "spike and slab" prior. Though the individual PIP estimates are usually off, which is a known phenomenon for variational Bayesian methods [Fox and Roberts,

2012], the variational estimates for the posteriors of the hyperparameters turn out to be very accurate in a wide range of settings. A recent work by Huang et al.

[2016] improved this algorithm and proved its consistency when the total number of covariates grows exponentially fast with the sample size.

Approximate Bayesian computation (ABC) is another class of methods that directly approximates the likelihood function by simulation [Sunnåker et al., 2013].

In its simplest form of rejection sampling, when new values of the parameters are proposed, ABC decides whether to accept them by simulating data under these parameters and comparing them to the observed data. See Stahl et al. [2012] for a recent application in GWAS.

Searching for Maximum a Posteriori Estimates We can also perform all the statistical inferences using the MAP estimates, which might significantly re- duce the computational cost. Methods of optimization theory and machine learn- ing then can be applied [Tipping, 2004]. Hoggart et al. [2008] applied this strategy to analyzing the whole-genome SNP data with both the normal-inverse-gamma prior and the double exponential prior. In a similar vein, Segura et al. [2012] de- scribed a stepwise variable selection procedure using an asymptotic Bayes factor.

However, in general such approaches are not favored since a paramount advantage of Bayesian methods is the ability to make inferences by model averaging [Raftery et al., 1997, Broman and Speed, 2002].

4.2 The BVSR Model of Guan and Stephens

The Bayesian variable selection regression (BVSR) model proposed by Guan and

Stephens [2011] is the focus of this chapter. Their method has the following advantages.

• There is no need for parameter tuning, except in some very special applications. The method is adaptive and the posterior inclusion probabilities can be accurately estimated under a wide range of settings.

• The MCMC algorithm proposed is very efficient compared with other methods.

• The model is parametrized using a hyperparameter describing the proportion of variance explained by the covariates. The estimation of this hyperparameter often turns out to be very accurate.

Their method was probably motivated by heritability estimation in genome-wide association studies but could be applied to many other fields.

4.2.1 Model and Prior

The model starts with a conventional spike-and-slab setting with indicator variable

γ.

$$\begin{aligned}
y \mid \gamma, \beta, X, \tau &\sim \mathrm{MVN}(X_\gamma\beta_\gamma,\ \tau^{-1}I), \\
\tau &\sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \qquad \kappa_1, \kappa_2 \downarrow 0, \\
\gamma_j &\sim \mathrm{Bernoulli}(\pi), \\
\beta_j \mid \gamma_j = 1, \tau &\sim N(0, \sigma^2/\tau), \\
\beta_j \mid \gamma_j = 0 &\sim \delta_0.
\end{aligned} \qquad (4.2)$$

Consider the application to GWAS. γj = 1 means that the j-th SNP has a causal

effect on the phenotype y. δ0 denotes a point mass at 0. Xγ denotes the design

matrix with columns for which γj = 1 and βγ denotes the corresponding sub-

vector of β. By letting hyperparameters κ1, κ2 (shape and rate parameters) go to zero, we are actually using the prior p(τ) ∝ 1/τ, which might be called Jeffreys’

prior (for $\tau$). Since $\beta_j = 0$ whenever $\gamma_j = 0$, we may also write

$$y \mid \gamma, \beta, X, \tau \sim \mathrm{MVN}(X\beta,\ \tau^{-1}I).$$

For the hyperparameters $\sigma$ and $\pi$, BVSR puts the following hyperpriors:

$$\begin{aligned}
\log\pi &\sim U(\log(\pi_{\min}), \log(\pi_{\max})), \\
\sigma^2 &= \frac{h}{1 - h}\,\frac{1}{\sum_j s_j\gamma_j}, \\
h &\sim U(0, 1),
\end{aligned} \qquad (4.3)$$

where sj denotes the variance of the j-th SNP. πmin (πmax) is chosen small (large) enough to ensure that the true value of π is included. For example, we may use

πmin = 1/N where N is the total number of SNPs in the dataset and πmax = 1. The uniform prior on log π is very critical because it brings about the penalization

on the model complexity that is needed to obtain sparseness. The introduction

of the parameter h is a major novelty of the BVSR model. h has a similar flavor

to the R squared statistic in traditional linear regression and can be interpreted

as the proportion of the variance of y that is due to the additive effects of the

covariates. To see this, notice that τ −1 is the error variance and

$$\mathrm{Var}(X_\gamma\beta_\gamma) = \sum_{\gamma_j = 1}\mathrm{Var}(x_j\beta_j) = \frac{\sigma^2}{\tau}\sum_{\gamma_j = 1}s_j,$$

assuming independence between xj (centered) and βj. Thus,

$$h = \frac{\mathrm{Var}(X_\gamma\beta_\gamma)}{\mathrm{Var}(X_\gamma\beta_\gamma) + \tau^{-1}}.$$

In GWAS, $h$ corresponds to the narrow-sense heritability, which is defined to be the proportion of the phenotypic variance that is due to additive genetic effects. See Chap. 1.3.2 for a short review on heritability estimation. Here are three comments.

1. Just like we did in Chapter 2, we assume X and y are centered and thus

drop the intercept term, µ, from the model. Note that if we explicitly include

µ in the model, the prior would change to p(µ, τ) ∝ τ −1/2 (see Chap. 7.2.3).

When there are other variables to be controlled for, we also regress them

out from X and y, which is equivalent to including them in the model and

putting a non-informative prior on their regression coefficients [George and

McCulloch, 1993].

2. The full posterior of the BVSR model specified by (4.2) and (4.3) is proper.

First, $p(y|\gamma, h) < \infty$ since the distribution of $\tau$ given $y, \gamma, h$ is still a gamma distribution. Second, since $p(h)$ is proper, we have $p(y|\gamma) < \infty$. Third, $p(\gamma) = \int p(\gamma|\pi)p(\pi)\,d\pi < \infty$ even if we use an improper prior for $\pi$ by setting $\pi_{\min} = 0$. Lastly, $\gamma$ can only take a finite number of possible values, and therefore the full posterior for $(\beta, \tau, h, \pi, \gamma)$ is proper.

3. The most important advantage of this model is its flexibility. Both h and

π are estimated from the data and the simulation studies showed that both

of them, especially h, can be accurately estimated in various settings. How-

ever, this parametrization also gives rise to identifiability issues when the

data contains almost no signals. In such cases, the data provides no infor-

mation for estimating σ and h. The estimates for σ concentrate near 0 and

consequently, the posterior for both π and h may spread over a wide range.

For practical purposes, this problem is of little concern since it can be easily

diagnosed from the posterior of σ and the large variability of the posterior

of π and h. If we really want to solve it, we may put a uniform prior on log h

instead of h (choose an hmin > 0 to make it proper) such that the posterior for h is shrunk to zero when the data contains no information for h.

4.2.2 MCMC Implementation

In the original work of Guan and Stephens [2011], the posterior inference is done by a Metropolis-Hastings algorithm with mixed proposals. In each iteration, a new value for $h$ is proposed by a random walk around the current value; $\pi$ is

proposed from the full conditional distribution p(π|γ); γ is proposed by adding or

deleting a predictor from the current model. The proposal distribution for adding

a SNP is a mixture of a uniform and a geometric distribution. Lastly, a long-

range proposal is made for γ with probability 0.3, which is to compound the local proposals randomly many times. Below are the details for each step.

Proposal for h and π By convention let q(θ∗ | θ) denote the proposal distri- bution for a new value θ∗ given the current value θ. The proposal for h is simply a random walk,

h∗ | h ∼ Unif(h − 0.1, h + 0.1).

When an invalid value (outside of (0, 1)) is proposed, the proposed value is then reflected about the boundary. Therefore the proposal ratio for h is always equal to 1.

For π, first notice that it actually can be integrated out. Let’s assume a more general form for the prior of π,

$$p(\pi) \propto \pi^{a-1}(1 - \pi)^{b-1}\,I_{(\pi_{\min},\pi_{\max})}(\pi),$$
that is, a truncated beta prior with $a, b \geq 0$. The prior specified in (4.3) is then a special case with $a = 0$ and $b = 1$. Define the beta function and the distribution

function of the beta distribution by
$$B(a, b) \overset{\text{def.}}{=} \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}, \qquad F^B_{(a,b)}(x_1, x_2) \overset{\text{def.}}{=} P(x_1 < X < x_2), \quad \text{where } X \sim \mathrm{Beta}(a, b).$$

Then we may write

$$p(\pi) = \frac{\pi^{a-1}(1 - \pi)^{b-1}}{B(a, b)\,F^B_{(a,b)}(\pi_{\min}, \pi_{\max})}\,I_{(\pi_{\min},\pi_{\max})}(\pi),$$

and the marginal prior probability for γ is

$$p(\gamma) = \int p(\gamma|\pi)p(\pi)\,d\pi = \frac{B(a + |\gamma|,\ b + N - |\gamma|)\,F^B_{(a+|\gamma|,\,b+N-|\gamma|)}(\pi_{\min}, \pi_{\max})}{B(a, b)\,F^B_{(a,b)}(\pi_{\min}, \pi_{\max})}, \qquad (4.4)$$

where $|\gamma| = \|\gamma\|_0 = \sum_j \gamma_j$ is the number of covariates in the model. Clearly the denominator is a constant independent of $\gamma$. Since the beta function and the cumulative distribution function of the beta distribution are both easy to calculate, there is actually no need to sample $\pi$ in the MCMC sampling. Besides, the posterior for $\pi$ is not of much interest, or at least is of less interest than the posterior of $|\gamma|$. However, if one insists on sampling $\pi$, it could be proposed from the conditional distribution,

$$\pi^* \mid \gamma^* \sim I_{(\pi_{\min},\pi_{\max})}\,\mathrm{Beta}(a + |\gamma^*|,\ b + N - |\gamma^*|),$$

after we have sampled $\gamma^*$. When $\pi_{\min} = 0$ and $\pi_{\max} = 1$, the sampling is easy.

But when (πmin, πmax) is a region of very small probability, the sampling could be time-consuming. In our new implementation, the algorithm of Damien and

Walker [2001] is used to tackle this difficulty.
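Since the thesis implementation already links against GSL, the log of the marginal prior (4.4) can be evaluated, up to its $\gamma$-independent constant, with GSL's log-beta and regularized incomplete beta functions. The routine below is an illustrative sketch (the name log_prior_gamma is not the fastBVSR API), and it assumes GSL's gsl_sf_lnbeta and gsl_sf_beta_inc.

```cpp
#include <cmath>
#include <gsl/gsl_sf_gamma.h>   // gsl_sf_lnbeta, gsl_sf_beta_inc

// log p(gamma) from (4.4), up to the constant denominator that does not depend
// on gamma. size_gamma = |gamma|, N = total number of covariates, (a, b) are the
// truncated-beta hyperparameters and (pi_min, pi_max) the truncation range.
// Illustrative sketch; uses the regularized incomplete beta I_x(a,b).
double log_prior_gamma(int size_gamma, int N, double a, double b,
                       double pi_min, double pi_max) {
    double a1 = a + size_gamma;
    double b1 = b + N - size_gamma;
    double trunc = gsl_sf_beta_inc(a1, b1, pi_max) - gsl_sf_beta_inc(a1, b1, pi_min);
    return gsl_sf_lnbeta(a1, b1) + std::log(trunc);
}
```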

Proposal for γ The local proposal for $\gamma$ includes adding a covariate and deleting a covariate. The two types are equally likely. For deletion, the proposal is a uniform distribution on all the covariates in the model. For addition, the proposal is a mixture distribution. With probability 0.7, we randomly choose a SNP from all the SNPs that are not in the current model with equal probability. With probability 0.3, the SNP to be added is selected according to a truncated geometric distribution on all the candidate SNPs for which $\gamma_j = 0$. For this geometric proposal, the SNPs are ordered such that those with larger single-SNP Bayes factors (from Bayesian simple linear regression) are more likely to be proposed. This mixture proposal significantly improves the efficiency of the MCMC chain when we have a very large number of SNPs. Without the geometric proposal, the addition may have a very small acceptance probability since the majority of the SNPs are not correlated with the phenotype.

Great care must be taken when calculating the proposal ratio $q(\gamma \mid \gamma^*)/q(\gamma^* \mid \gamma)$. Conditioning on the current value of $\gamma$, the proposal probability for adding the $k$-th SNP is
$$0.7\,\frac{1}{N - |\gamma|} + 0.3\,\frac{s(1 - s)^{R(k;\gamma)}}{1 - (1 - s)^{N - |\gamma|}},$$
where $s$ is the success probability of the geometric proposal and $R(k; \gamma)$ is the rank of the single-SNP Bayes factor of the $k$-th SNP among all the SNPs for which $\gamma_j = 0$, which has to be recomputed every time.
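A small sketch of this addition-proposal probability follows; it uses the mixture weights stated in the prose above (0.7 uniform, 0.3 geometric), and the function name and 0-based rank convention are assumptions for illustration.

```cpp
#include <cmath>

// Probability of proposing to add the k-th SNP, given the current model size.
// N = total number of SNPs, model_size = |gamma|, s = geometric success
// probability, rank = R(k; gamma) among SNPs currently outside the model.
// Illustrative sketch; weights follow the description in the text.
double add_proposal_prob(int N, int model_size, double s, int rank) {
    int n_out = N - model_size;                        // SNPs available to add
    double uniform   = 1.0 / n_out;
    double geometric = s * std::pow(1.0 - s, rank) /
                       (1.0 - std::pow(1.0 - s, n_out));
    return 0.7 * uniform + 0.3 * geometric;
}
```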

111 Computing the Acceptance Ratio The marginal likelihood of the current

model is given by (see (1.3) and Chap. 7.2)

$$p(y \mid \gamma, \sigma^2(h, \gamma)) \propto \frac{1}{\sigma^{|\gamma|}\,|X_\gamma^tX_\gamma + \sigma^{-2}I|^{1/2}}\left(1 - \frac{y^tX_\gamma(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty}{y^ty}\right)^{-n/2}. \qquad (4.5)$$

Then, for the prior specified in (4.3), the Metropolis-Hastings ratio can be calculated by
$$\begin{aligned}
\alpha((h, \gamma), (h^*, \gamma^*)) &= \frac{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q(h, \gamma \mid h^*, \gamma^*)}{p(y \mid h, \gamma)\,p(h, \gamma)\,q(h^*, \gamma^* \mid h, \gamma)} \\
&= \frac{p(y \mid h^*, \gamma^*)}{p(y \mid h, \gamma)}\,\frac{q(\gamma \mid \gamma^*)}{q(\gamma^* \mid \gamma)}\,\frac{\Gamma(|\gamma^*|)\,\Gamma(N - |\gamma^*| + 1)\,F^B_{(|\gamma^*|,\,N-|\gamma^*|+1)}(\pi_{\min}, \pi_{\max})}{\Gamma(|\gamma|)\,\Gamma(N - |\gamma| + 1)\,F^B_{(|\gamma|,\,N-|\gamma|+1)}(\pi_{\min}, \pi_{\max})}.
\end{aligned}$$

Compounding the Local Proposals In every MCMC iteration, with prob-

ability 0.3, we propose a long-range proposal for γ. To this end, we sample an

integer K from a uniform distribution on {2, 3,..., 10} and then propose K local

proposals. The proposal ratio is simply calculated by multiplying the ratio of each

local proposal. To see that this calculation leaves the posterior invariant, by the

detailed balance condition (see Chap. 7.6), we only need to show

$$\begin{aligned}
&\sum_{\varrho \in \mathcal{P}(\gamma \to \gamma^*)} p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)\, \min\left\{1,\ \frac{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)}{p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)}\right\} \\
&\quad = \sum_{\bar\varrho \in \mathcal{P}(\gamma^* \to \gamma)} p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)\, \min\left\{1,\ \frac{p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)}{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)}\right\},
\end{aligned}$$
where $\mathcal{P}(\gamma \to \gamma^*)$ denotes the set of all possible proposal paths from $\gamma$ to $\gamma^*$.

The key observation is that every path $\varrho \in \mathcal{P}(\gamma \to \gamma^*)$ can be reversed to become another path, denoted by $\bar\varrho$, from $\gamma^*$ to $\gamma$. Hence there exists a one-to-one mapping between the sets $\mathcal{P}(\gamma \to \gamma^*)$ and $\mathcal{P}(\gamma^* \to \gamma)$, and for each pair the detailed balance condition holds. Thus, by summing over these paths, the detailed balance condition is still satisfied. The mixing of the MCMC chain can be greatly expedited by using the long-range proposal. Intuitively speaking, this is because it makes the sampling chain jump rapidly across the whole sample space and helps the chain overcome local traps. See Guan and Krone [2007] for a theoretical argument.

Rao-Blackwellization Rao-Blackwellization [Blackwell, 1947] is a common tech-

nique used in sampling schemes to reduce the variance of the estimates [Casella

and Robert, 1996, Douc and Robert, 2011]. Guan and Stephens [2011] also pro-

posed a Rao-Blackwellization procedure to estimate γ and β by computing

E[γj | y, γ−j, β−j, τ, h, π] and E[βj | γj = 1, y, γ−j, β−j, τ, h, π], (4.6)

where $\gamma_{-j}$ and $\beta_{-j}$ denote the corresponding sub-vectors with the $j$-th element removed. To calculate $\mathrm{E}[\gamma_j \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi]$, we only need to figure out
$$\frac{p(\gamma_j = 1 \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi)}{p(\gamma_j = 0 \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi)} = \frac{p(\gamma_j = 1 \mid \gamma_{-j}, \tau, h, \pi)}{p(\gamma_j = 0 \mid \gamma_{-j}, \tau, h, \pi)}\, \frac{p(\beta_{-j} \mid \gamma_j = 1, \gamma_{-j}, \tau, h, \pi)}{p(\beta_{-j} \mid \gamma_j = 0, \gamma_{-j}, \tau, h, \pi)}\, \frac{p(y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h)}{p(y \mid \gamma_j = 0, \gamma_{-j}, \beta_{-j}, \tau, h)}.$$

The first ratio term on the r.h.s. is simply equal to π/(1 − π). The second can be calculated by using the fact that

$$\beta_{-j} \mid \gamma, \tau, h, \pi \sim \mathrm{MVN}\left(0,\ \frac{\sigma^2(h, \gamma)}{\tau}I\right).$$

Note that the second ratio is not equal to 1 because σ2(h, γ) changes with the

value of γj. Let Xγ−j be the sub-matrix of X with all the columns, except the

$j$-th column, for which $\gamma_i = 1$, and let $\beta_{\gamma-j}$ be the corresponding sub-vector. Then,
$$y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h \sim N\!\left(X_{\gamma-j}\beta_{\gamma-j},\ \tau^{-1}(\sigma^2 X_jX_j^t + I)\right).$$

Therefore, we can compute the third ratio term by

$$\frac{p(y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h)}{p(y \mid \gamma_j = 0, \gamma_{-j}, \beta_{-j}, \tau, h)} = \sigma^{-1}(X_j^tX_j + \sigma^{-2})^{-1/2}\exp\left\{\frac{\tau}{2}\,\frac{\tilde{y}^tX_jX_j^t\tilde{y}}{\sigma^{-2} + X_j^tX_j}\right\},$$
where $\tilde{y} = y - X_{\gamma-j}\beta_{\gamma-j}$ and $\sigma^2$ is calculated assuming $\gamma_j = 1$. Although these quantities are very fast to evaluate, we cannot afford Rao-Blackwellization in every

MCMC iteration because we have to go through every SNP. In our implementation by default we perform Rao-Blackwellization every 1000 iterations.

4.3 A Fast Novel MCMC Algorithm for BVSR using ICF

For a whole genome dataset that typically contains millions of genetic variants, we have to run a very large number of MCMC iterations in order to obtain good- quality posterior estimates. Filtering the variants before running MCMC is not recommended in light of the small effect size and the collinearity of the data. The

Bayesian methods then become less favorable since an MCMC algorithm could take a week to produce sensible results on a state-of-the-art computer.

The main reason is that when $|\gamma|$ is large, evaluating (4.5) is very time-consuming due to the calculations of

1. the determinant of a $|\gamma| \times |\gamma|$ matrix, $|X_\gamma^tX_\gamma + \sigma^{-2}I|$;

2. the inverse of a $|\gamma| \times |\gamma|$ matrix, $(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}$.

In this section, I describe a novel MCMC algorithm that bypasses the calculation

of the determinant by using the exchange algorithm and efficiently evaluates the

inverse using ICF algorithm introduced in Chap. 3.

4.3.1 The Exchange Algorithm

The exchange algorithm was proposed by Murray et al. [2012], which is actually a variation of the auxiliary method of Møller et al. [2006]. It was devised for sampling from “doubly intractable” posterior distributions where the marginal likelihood also has an intractable normalizing constant, which in our problem corresponds to the determinant term that we want to avoid computing. To illustrate their ideas, define

$$Z(\gamma, \sigma^2(h, \gamma)) = \sigma^{-|\gamma|}\,|X_\gamma^tX_\gamma + \sigma^{-2}I|^{-1/2}, \qquad f(y \mid \gamma, \sigma^2(h, \gamma)) = \left(y^ty - y^tX_\gamma(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty\right)^{-n/2}.$$

Thus, by (4.5), we have $p(y \mid \gamma, \sigma^2) \propto Z(\gamma, \sigma^2)f(y \mid \gamma, \sigma^2)$. The exchange algorithm estimates the ratio $Z(\gamma^*, \sigma^{2*})/Z(\gamma, \sigma^2)$, which is independent of $y$, by an unbiased importance sample $f(y' \mid \gamma, \sigma^2)/f(y' \mid \gamma^*, \sigma^{2*})$, where $y'$ is sampled from $p(\cdot \mid \gamma^*, \sigma^{2*})$. The Metropolis-Hastings ratio of the exchange algorithm is

$$\alpha((h, \gamma), (h^*, \gamma^*)) = \frac{f(y' \mid \gamma, \sigma^2)\,f(y \mid \gamma^*, \sigma^{2*})}{f(y' \mid \gamma^*, \sigma^{2*})\,f(y \mid \gamma, \sigma^2)}\, \frac{p(\gamma^*)\,q(\gamma \mid \gamma^*)}{p(\gamma)\,q(\gamma^* \mid \gamma)}, \qquad y' \sim p(\cdot \mid \gamma^*, \sigma^{2*}). \qquad (4.7)$$

(See Chap. 4.2.2 for the calculation of $p(\gamma)$ and $q(\gamma^* \mid \gamma)$.) It should be pointed out that when sampling $y'$, we don't need to sample $\tau$ because the scaling of $y'$ always cancels out during the calculation of the acceptance ratio.

By checking the detailed balance condition, this strategy can be proved to

115 leave the posterior distribution invariant. The proof was omitted in Murray et al.

[2012], so I give it here. Let θ = (γ, σ2). We only need to check

$$\frac{q(\theta^* \mid \theta)\int \min\{\alpha(\theta, \theta^*), 1\}\,f(y' \mid \theta^*)Z(\theta^*)\,dy'}{q(\theta \mid \theta^*)\int \min\{\alpha(\theta^*, \theta), 1\}\,f(y' \mid \theta)Z(\theta)\,dy'} = \frac{Z(\theta^*)f(y \mid \theta^*)p(\theta^*)}{Z(\theta)f(y \mid \theta)p(\theta)}, \qquad (4.8)$$
where
$$\alpha(\theta, \theta^*) = \frac{f(y \mid \theta^*)}{f(y \mid \theta)}\,\frac{f(y' \mid \theta)}{f(y' \mid \theta^*)}\,\frac{p(\theta^*)\,q(\theta \mid \theta^*)}{p(\theta)\,q(\theta^* \mid \theta)}.$$

Let $\mathcal{Y} = \{y' : \alpha(\theta, \theta^*) > 1\}$. We obtain that
$$\begin{aligned}
\int q(\theta^* \mid \theta)\min\{\alpha(\theta, \theta^*), 1\}\,f(y' \mid \theta^*)Z(\theta^*)\,dy'
&= Z(\theta^*)\int_{\mathcal{Y}} q(\theta^* \mid \theta)f(y' \mid \theta^*)\,dy' + Z(\theta^*)\frac{f(y \mid \theta^*)p(\theta^*)}{f(y \mid \theta)p(\theta)}\int_{\mathcal{Y}^c} q(\theta \mid \theta^*)f(y' \mid \theta)\,dy', \\
\int q(\theta \mid \theta^*)\min\{\alpha(\theta^*, \theta), 1\}\,f(y' \mid \theta)Z(\theta)\,dy'
&= Z(\theta)\int_{\mathcal{Y}^c} q(\theta \mid \theta^*)f(y' \mid \theta)\,dy' + Z(\theta)\frac{f(y \mid \theta)p(\theta)}{f(y \mid \theta^*)p(\theta^*)}\int_{\mathcal{Y}} q(\theta^* \mid \theta)f(y' \mid \theta^*)\,dy'.
\end{aligned}$$

Then (4.8) can be confirmed by straightforward calculations.

4.3.2 Updating of the Cholesky Decomposition

In our MCMC algorithm, a new value for σ is proposed in every iteration since it is determined by h and γ. Hence the matrix inversion in (4.5) should never be computed directly. Instead we just need to compute the ridge estimator,

$$(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty.$$

Our novel algorithm ICF, which was introduced in Chap. 3.4, was shown to be better than all the other common methods. To implement ICF, we need the Cholesky decomposition of $X_\gamma^tX_\gamma$, which at first glance is very undesirable since the Cholesky decomposition can take much more time than ICF itself. However, in our MCMC algorithm, most of the time we only propose to add or remove one column of $X_\gamma$. Even when a long-range proposal is made, the number of columns to be changed is often very small compared with $|\gamma|$. The Cholesky

decomposition then can be obtained by updating, which is very fast. The details

of the updating are laid out below. In principle it is very similar to the updating

of QR factorization [Golub and Van Loan, 2012, Chap. 12.5].

Let the current Cholesky decomposition be

$$X_\gamma^tX_\gamma = R_\gamma^tR_\gamma,$$

where Rγ is always upper-triangular. When a new SNP xj is added to the model, we attach it to the last column of Xγ and compute the corresponding entries of

Rγ by forward substitution according to

$$R_\gamma^t\, r_{|\gamma|+1} = X_\gamma^t x_j.$$

The forward substitution needs only $\sim |\gamma|^2$ flops. Calculating the r.h.s. in fact requires many more operations ($2n|\gamma|$ flops), but at least part of it could be precomputed and saved in memory.
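A minimal dense C++ sketch of the append update follows: the new off-diagonal column is obtained from the forward substitution above, and the new diagonal entry is $\sqrt{x_j^tx_j - r^tr}$ (this diagonal formula follows from equating blocks of the enlarged Gram matrix and is spelled out here as an assumption, since the text only states the forward-substitution step). Names and storage are illustrative, not the fastBVSR data structures.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;

// Append one column to the Cholesky factor R (m-by-m, upper triangular, dense).
// Xt_xj = X_gamma' * x_j (length m), xj_sq = x_j' * x_j.
// Returns the new last column (r_1, ..., r_m, r_{m+1}) of the enlarged factor.
vec cholesky_append(const mat& R, const vec& Xt_xj, double xj_sq) {
    const std::size_t m = Xt_xj.size();
    vec r(m + 1, 0.0);
    for (std::size_t i = 0; i < m; ++i) {            // forward substitution with R'
        double acc = Xt_xj[i];
        for (std::size_t j = 0; j < i; ++j) acc -= R[j][i] * r[j];
        r[i] = acc / R[i][i];
    }
    double d2 = xj_sq;
    for (std::size_t i = 0; i < m; ++i) d2 -= r[i] * r[i];
    r[m] = std::sqrt(d2);                            // new diagonal entry
    return r;
}
```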

When a SNP is deleted, we use Givens rotation [Golub and Van Loan, 2012,

Chap. 5.1] to introduce zeros. For example, suppose |γ| = 4 and we want to

remove the second column of Xγ . The new matrix is denoted by Xγ∗ . Then we

first remove the second column of $R_\gamma$, which results in
$$\tilde{R} = \begin{pmatrix} r_{11} & r_{13} & r_{14} \\ 0 & r_{23} & r_{24} \\ 0 & r_{33} & r_{34} \\ 0 & 0 & r_{44} \end{pmatrix}.$$

Though we have $\tilde{R}^t\tilde{R} = X_{\gamma^*}^tX_{\gamma^*}$, $\tilde{R}$ is not upper-triangular. We now construct a Givens rotation matrix
$$G_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & r_{23}/\rho & r_{33}/\rho & 0 \\ 0 & -r_{33}/\rho & r_{23}/\rho & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where $\rho = \sqrt{r_{23}^2 + r_{33}^2}$. As a rotation matrix, clearly $G_1^tG_1 = I$. Moreover, $G_1\tilde{R}$

sets $r_{33}$ to zero as we want. Similarly, we can then define a matrix $G_2$ to set $r_{44}$ to zero. Note that the order cannot be interchanged (we must eliminate $r_{33}$ first).

For a general upper-triangular matrix $R_\gamma$, using Givens rotations to remove the $(|\gamma| - k)$-th column requires only $3k(k + 1)$ flops.
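A minimal dense C++ sketch of the deletion update is shown below: the $k$-th column of the factor is removed and the subdiagonal entries that appear are zeroed with Givens rotations, after which the (now zero) last row is dropped. Names and dense storage are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using mat = std::vector<std::vector<double>>;

// R is p-by-p upper triangular (dense); remove column k (0-based) and
// re-triangularize with Givens rotations. Illustrative sketch.
void cholesky_delete(mat& R, std::size_t k) {
    const std::size_t p = R.size();
    for (auto& row : R) row.erase(row.begin() + k);  // drop the k-th column
    // Zero the subdiagonal entries R[j+1][j] for j = k, ..., p-2.
    for (std::size_t j = k; j + 1 < p; ++j) {
        double a = R[j][j], b = R[j + 1][j];
        double rho = std::sqrt(a * a + b * b);
        double c = a / rho, s = b / rho;
        for (std::size_t col = j; col + 1 < p; ++col) {  // rotate rows j and j+1
            double t1 = R[j][col], t2 = R[j + 1][col];
            R[j][col]     =  c * t1 + s * t2;
            R[j + 1][col] = -s * t1 + c * t2;
        }
    }
    R.pop_back();                                    // last row is now all zeros
}
```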

4.3.3 Summary of fastBVSR Algorithm

The new MCMC algorithm is called fastBVSR and it is summarized below.

Algorithm fastBVSR
Initialize $X_{\gamma^{(0)}}$ and calculate the corresponding Cholesky decomposition
for $i = 1$ to $N_{\mathrm{mcmc}}$ do
    Propose $h^*, \gamma^*$ using $h^{(i)}, \gamma^{(i)}$ and compute $\sigma^*$ by (4.3)
    Compute the Cholesky decomposition of $X_{\gamma^*}^tX_{\gamma^*}$ by updating
    Draw $y' \sim p(\cdot \mid \gamma^*, \sigma^*)$ using $\tau = 1$
    Calculate $\alpha((h, \gamma), (h^*, \gamma^*))$ by (4.7) using the ICF algorithm
    Set $h^{(i+1)} = h^*$ and $\gamma^{(i+1)} = \gamma^*$ with probability $\min\{1, \alpha\}$, and stay otherwise
    if $(i + 1) \bmod 1000 = 0$ then
        Sample $\pi^{(i+1)}$ and $\tau^{(i+1)}$ and do Rao-Blackwellization
    end if
end for

4.4 GWAS Simulation

The Height dataset, which contains 3, 925 subjects and near 300K SNPs, is used for simulation. It was the dataset used in Yang et al. [2010]. See Chap. 7.7.2 for more details. In reality, it is usually not recommended to run an MCMC algorithm with as many as 300K variants since the chain has to be run for a huge number of iterations to explore all the SNPs, let alone to achieve convergence. Besides, for most purposes, especially with our linear regression model, we can safely partition the data by chromosome. Therefore, to reduce the number of SNPs, the following three sub-datasets are constructed.

1. Height-ind: 10K SNPs evenly sampled from the whole genome.

2. Height-chr6: the first 10K SNPs located on chromosome 6.

3. Height-5C: the 97, 370 SNPs located on chromosome 1, 2, 3, 4 and 5.

The first two sub-datasets are used to study the effect of multicollinearity. Clearly,

Height-ind represents a dataset with no or very little multicollinearity and Height-

chr6 has severe multicollinearity due to linkage disequilibrium. Chromosome 6 is

picked because the MHC (major histocompatibility complex) region located on

it is well known for its highly complicated LD structure. The last sub-dataset

contains around 100K SNPs on 5 chromosomes and represents the most difficult

case. Our algorithm fastBVSR is implemented in C++ and it is compared to

GCTA (see Chap. 7.5).

4.4.1 Posterior Inference for the Heritability

Recall the definition of h given in (4.3). It is not exactly equal to the proportion of variation explained (PVE), which is defined by

$$\mathrm{PVE} \overset{\text{def.}}{=} \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{N}(x_{ij} - \bar{x}_j)\beta_j\right)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

(n is the number of subjects and N is the total number of SNPs.) Even the

expected value of PVE is slightly different from h since the expectation of the

ratio of the two random variables (actually two correlated random variables) is

not equal to the ratio of the corresponding expectations. However, the parameter

h has the same meaning as the ratio of the variance components in the linear

mixed model, which is the heritability definition used by GCTA. Hence we refer

to the posterior inference for h as the posterior inference for the heritability.

For both Height-ind and Height-chr6 datasets, the phenotypes with heritability

h = 0, 0.01,..., 0.99 are simulated. For each choice of heritability, 200 causal SNPs

are randomly sampled. Figure 4.1 shows that for the two small datasets with 200 causal SNPs, the heritability can be very accurately estimated from the posterior.

Mean absolute error   fastBVSR(a)   fastBVSR(b)   GCTA
Height-ind            0.0217        0.0399        0.0289
Height-chr6           0.0271        0.0756        0.0209

Table 4.1: Mean absolute error (MAE) of the heritability estimation. The MAE is calculated as the mean of the absolute difference between the true and the estimated heritabilities. See Fig. 4.1 for the simulation settings.

GCTA seems unbiased too but in the Height-ind dataset its estimates have a larger mean absolute error than the posterior mean estimates of fastBVSR(a) (Table 4.1).

By comparing the two panels in the middle column, it can be seen that the collinearity of the Height-chr6 dataset slows down the convergence of fastBVSR.

For the Height-5C dataset, the number of causal SNPs is set to 1000 and the phenotypes are simulated with heritability h = 0, 0.05,..., 0.95. The GCTA esti- mates for the heritabilities appear to be still unbiased, though with relatively large variance while fastBVSR shows a clear tendency to underestimate the heritability

(Fig. 4.2). To explain the behaviour of fastBVSR, first let’s recall the rationale for variable selection. Just like the multiple comparison correction for the p-value, if we have a larger number of candidate SNPs, by chance we could observe more and stronger spurious signals. Hence we would require a stronger signal threshold to include a SNP into the model. In this simulation, both the numbers of total candidate SNPs and causal SNPs are much larger than in the previous ones (a larger number of causal SNPs implies a smaller σ). Therefore, it becomes much harder for fastBVSR to detect all the signals. A much larger sample size would help fastBVSR overcome this problem.


Figure 4.1: Heritability estimation with 200 causal SNPs in the Height-ind and Height-chr6 datasets. For both datasets, we run fastBVSR with two settings: (a) 20K burn-in iterations, 100K sampling iterations and Rao-Blackwellization every 1000 iterations; (b) 2K burn-in iterations, 10K sampling iterations and Rao- Blackwellization every 200 iterations. We compare the results with GCTA. The grey bars represent 95% posterior intervals for fastBVSR and ±2SE for GCTA.


Figure 4.2: Heritability estimation with 1000 causal SNPs in the Height-5C dataset. We run fastBVSR with 200K burn-in iterations, 1M sampling itera- tions and Rao-Blackwellization every 10K iterations. The grey bars represent 95% posterior intervals for fastBVSR and ±2SE for GCTA.

4.4.2 Calibration of Posterior Inclusion Probabilities

The posterior estimation for the model size (the number of SNPs in the model,

|γ|) in the previous simulations is shown in Fig. 4.3. As explained in Chap. 4.2.1, when the heritability is small, the posterior distribution of the model size shows a larger variance.

It is not realistic to recover all the true signals and accurately estimate the model size in a real data analysis, since many SNPs with tiny effects cannot be identified confidently via any statistical method. However, the posterior inclusion probability (PIP) serves as a measure of how likely a SNP is to be truly associated with the trait. From a practical standpoint, the calibration of the posterior inclusion probabilities is of high importance. To study the calibration, the SNPs are divided into different bins by PIP. Suppose in the $k$-th bin there are $B_k$ SNPs with PIP


Figure 4.3: The posterior estimation of the model size, |γ|. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The truth is marked by the red lines. The grey bars represent 95% posterior intervals.

$f_1, \dots, f_{B_k}$. Let $M_k$ be the predicted number of true positives in this bin. Then

$$\mathrm{E}[M_k] = \sum_{i=1}^{B_k} f_i, \qquad \mathrm{Var}(M_k) = \sum_{i=1}^{B_k} f_i(1 - f_i),$$
and thus

$$\mathrm{E}[M_k/B_k] = \frac{1}{B_k}\sum_{i=1}^{B_k} f_i, \qquad \mathrm{Var}(M_k/B_k) = \frac{1}{B_k^2}\sum_{i=1}^{B_k} f_i(1 - f_i). \qquad (4.9)$$

$M_k/B_k$ is compared with the proportion of true positives in Fig. 4.4. The Rao-Blackwellized estimators for PIP, which were defined in (4.6), exhibit a significant improvement over the crude PIP estimates, especially when the number of MCMC iterations is too small to achieve convergence (Fig. 4.5). In particular, for the Height-ind dataset, the calibration of the Rao-Blackwellized estimates is impressively accurate. For the other two datasets with collinearity, the calibration is arguably acceptable.

4.4.3 Prediction Performance

Prediction is one of the ultimate goals of variable selection. Even if a procedure cannot produce good estimates for the heritability, the model size or the inclusion probabilities, it is still practically useful as long as it has good prediction performance. Consider a future observation $y'$,
$$y' = \sum_{i=1}^{N}\beta_i x_i' + \mu + e' = \mu + \sum_{\gamma_i = 1}\beta_i x_i' + e',$$

where $e' \sim N(0, \tau^{-1})$. In GWAS, $x_i'$ can be assumed to be drawn from a centered $\mathrm{Binom}(2, f_i)$, where $f_i$ is the minor allele frequency of the $i$-th SNP. The covariates $x_1, \dots, x_N$ may not be independent, as in the Height-chr6 and Height-5C datasets.


Figure 4.4: Calibration of the posterior inclusion probabilities. For Height-ind and Height-chr6 we report the results of setting (a) (see Fig. 4.1 for details). PIP stands for posterior inclusion probability and RB-PIP means Rao-Blackwellized estimation of the posterior inclusion probability. The SNPs are divided into 20 bins by PIP (RB-PIP) and the y-axis represents the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives (see (4.9)). The grey bars represent ±2SD.


Figure 4.5: Calibration of the posterior inclusion probabilities for Height-ind and Height-chr6 under setting (b) (see Fig. 4.1 for details).

Let $\hat{\beta}_1, \dots, \hat{\beta}_N$ be the estimates from some procedure. Then $y'$ would be estimated by

$$\hat{y}' = \hat{\mu} + \sum_{i=1}^{N}\hat{\beta}_i x_i'.$$

Therefore, the mean squared prediction error (MSPE) is calculated by

$$\begin{aligned}
\mathrm{MSPE}(\hat{\beta}, \hat{\mu}) &\overset{\text{def.}}{=} \mathrm{E}[(y' - \hat{y}')^2 \mid \hat{\beta}, \hat{\mu}] = \mathrm{E}\left[\left(\sum_{i=1}^{N}(\beta_i - \hat{\beta}_i)x_i' + (\mu - \hat{\mu}) + e'\right)^2 \,\middle|\, \hat{\beta}, \hat{\mu}\right] \\
&= \tau^{-1} + (\mu - \hat{\mu})^2 + \mathrm{Var}\left(\sum_{i=1}^{N}(\beta_i - \hat{\beta}_i)x_i'\right) \\
&= \tau^{-1} + (\mu - \hat{\mu})^2 + (\beta - \hat{\beta})^t\,\mathrm{Cov}(X')\,(\beta - \hat{\beta}).
\end{aligned}$$

The covariance matrix $\mathrm{Cov}(X')$ can simply be estimated by the observed sample covariance matrix. Since $\hat{\mu}$ is usually estimated by the sample mean, its mean squared error is $\tau^{-1}/n$. Thus, we define the MSPE for $\hat{\beta}$ by

$$\begin{aligned}
\mathrm{MSPE}(\hat{\beta}) &\overset{\text{def.}}{=} \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}(\beta - \hat{\beta})^tX^tX(\beta - \hat{\beta}) \\
&= \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}\|X\beta - X\hat{\beta}\|_2^2.
\end{aligned} \qquad (4.10)$$

Consider two estimators for β: the optimal estimator βˆ = β and the null estimator

βˆ = 0. Their MSPEs are given by

$$\mathrm{MSPE}(\beta) = \frac{n + 1}{n}\tau^{-1}, \qquad \mathrm{MSPE}(0) = \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}\beta^tX^tX\beta.$$

We define a metric, relative prediction gain (RPG), which will be used to measure the performance of an estimator $\hat{\beta}$, by

$$\mathrm{RPG}(\hat{\beta}) \overset{\text{def.}}{=} \frac{\mathrm{MSPE}(0) - \mathrm{MSPE}(\hat{\beta})}{\mathrm{MSPE}(0) - \mathrm{MSPE}(\beta)}. \qquad (4.11)$$

Though τ is assumed to be known for computing MSPE, it cancels out in the expression of RPG in (4.11).
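Indeed, substituting (4.10) into (4.11) shows that the $\tau$ terms cancel and $\mathrm{RPG}(\hat{\beta}) = 1 - \|X(\beta - \hat{\beta})\|_2^2/\|X\beta\|_2^2$, which a short function can evaluate directly. The C++ sketch below is illustrative (the name relative_prediction_gain is not part of any released code).

```cpp
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;

// RPG = 1 - ||X(beta - beta_hat)||^2 / ||X beta||^2, obtained by plugging
// (4.10) into (4.11); tau cancels, so it is not needed here.
double relative_prediction_gain(const mat& X, const vec& beta, const vec& beta_hat) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double fit_true = 0.0, fit_hat = 0.0;
        for (std::size_t j = 0; j < beta.size(); ++j) {
            fit_true += X[i][j] * beta[j];
            fit_hat  += X[i][j] * beta_hat[j];
        }
        num += (fit_true - fit_hat) * (fit_true - fit_hat);
        den += fit_true * fit_true;
    }
    return 1.0 - num / den;
}
```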

Figure 4.6 compares the RPGs of the BVSR regression estimates with the BLUPs (best linear unbiased predictors) used by GCTA. When the true heritability is very small, it makes little sense to talk about RPG since the denominator,

MSPE(0) − MSPE(β) (the variation that could be explained by the predictors), is very small and RPG will have a very large variance. For the Height-ind dataset, fastBVSR shows a substantial advantage over GCTA. Surprisingly, even in setting (b), where the previous plots show that the MCMC clearly has not attained convergence, the Rao-Blackwellized regression estimators still perform very well.

For the Height-chr6 dataset, fastBVSR is again much better than GCTA, especially when the heritability is in (0.2, 0.8), the region of most practical interest.

However, Rao-Blackwellization does not seem to provide much improvement (it is even worse when the heritability is close to 1), which should probably be attributed to the collinearity of the dataset. For the Height-5C dataset, fastBVSR is no longer advantageous, due to two main reasons: the slow convergence caused by the enormous sample space and the collinearity present in the dataset.

4.4.4 Wall-time Usage

Lastly, Fig. 4.7 reports the wall time used by fastBVSR for 10K MCMC iterations under different simulation settings. Since the calculation of the ridge estimators

[Figure 4.6 panels: Height-ind: fastBVSR(a), Height-ind: fastBVSR(b); Height-chr6: fastBVSR(a), Height-chr6: fastBVSR(b); Height-5C.]

Figure 4.6: Relative prediction gain of fastBVSR and GCTA for the simulated datasets with heritability ≥ 0.05. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The RPG is computed by (4.10) and (4.11). BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model.

is the most time-consuming step, the wall-time usage is mostly determined by the posterior distribution of the model size. The simulation setting is otherwise unimportant.

[Figure 4.7 panels: Height-ind, Height-chr6, Height-5C.]

Figure 4.7: Wall time used by fastBVSR for 10K MCMC iterations. The x-axis is the posterior mean of the model size, which mainly determines the wall-time usage of fastBVSR. For the Height-ind and Height-chr6 datasets, we use the results from setting (a). Note that for the two small datasets, we do Rao-Blackwellization every 1K iterations; for Height-5C, we do Rao-Blackwellization every 10K iterations.

Chapter 5

Scaled Bayes Factors

5.1 Motivations for Scaled Bayes Factors

The null-based Bayes factor, in general, is defined by

BF_null(M) := p(y|M) / p(y|M_0),

where M_0 denotes the null model and M denotes the model of interest. Immediately we find that its expectation under the null model is 1 since

E[BF_null | M_0] = ∫ (p(y|M) / p(y|M_0)) p(y|M_0) dy = ∫ p(y|M) dy = 1.   (5.1)

The expectation of log BFnull, however, is not zero. In fact, by Jensen’s inequality,

E[log BF_null | M_0] < 0, but the exact value depends on the model (and the design matrix in regression). Should we fix this expected value to some constant to replace property (5.1)? There are two practical reasons to do so. First, log BF_null, instead of BF_null, is often used as the measure of evidence (see Jeffreys [1961, app. B] and Kass and

Raftery [1995]). Hence aligning the null expectation of log BFnull would provide

more practical convenience. Second, the distribution of Bayes factors is usually

heavy tailed, for example in linear regression. Thus an observation y under the

null usually produces a Bayes factor much smaller than 1 and even if the Bayes

factor is averaged over many observations, the sample mean is hardly close to 1.

In contrast, the expectation of log BF_null would be much easier to approximate by sampling, owing to its smaller variance.

For the time being, let’s simply define the scaled Bayes factor (sBF) by

log sBF = log BFnull − E[log BFnull | M0]. (5.2)

It immediately follows that E[log sBF | M0] = 0. Recall our multi-linear regression model given in (1.1),

y | β, τ ∼ MVN(Xβ, τ^{-1} I),
β | τ, V ∼ MVN(0, τ^{-1} V),
τ | κ_1, κ_2 ∼ Gamma(κ_1/2, κ_2/2),   κ_1, κ_2 → 0.

By Theorem 2.7 and Corollary 2.8, omitting the op(1) error, the null expectation

of log BFnull is given by

E[2 log BF_null | β = 0] = Σ_{i=1}^p (λ_i + log(1 − λ_i)),

where λ_1, ..., λ_p are the eigenvalues of H = X(X^t X + V^{-1})^{-1} X^t. Furthermore, if

we define Q_i := (u_i^t z)^2 as in Theorem 2.7, where z = τ^{1/2} y and u_i is the eigenvector of H (please see Chap. 2 for more information), we can write the asymptotic

expression for sBF as

2 log sBF = Σ_{i=1}^p λ_i (Q_i − 1).

The new statistic is called the scaled Bayes factor since

sBF = BF_null · Π_{i=1}^p e^{−λ_i/2} / √(1 − λ_i).   (5.3)

Each scaling component, e^{−λ_i/2}/√(1 − λ_i), is monotone increasing in λ_i. When

λi ↓ 0, the scaling goes to 1; when λi ↑ 1, the scaling goes to infinity.
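As a minimal sketch of the rescaling (5.3), the snippet below computes λ_i from the singular values of a design matrix under the independent normal prior V = σ^2 I (as in Proposition 2.13) and applies the scaling to a given log_10 BF_null. The design matrix, prior variance, and the value of log_10 BF_null are hypothetical inputs for illustration.

```python
import numpy as np

def scaling_eigenvalues(X, sigma2):
    """Eigenvalues lambda_i of H = X (X^t X + V^{-1})^{-1} X^t for the
    independent normal prior V = sigma2 * I, via the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return d ** 2 / (d ** 2 + 1.0 / sigma2)

def log10_scaled_bf(log10_bf_null, lam):
    """Apply (5.3): sBF = BF_null * prod_i exp(-lam_i/2) / sqrt(1 - lam_i)."""
    correction = np.sum(-lam / 2.0 - 0.5 * np.log(1.0 - lam))
    return log10_bf_null + correction / np.log(10.0)

# toy usage: one standardized covariate, n = 1000, sigma = 0.2,
# and a hypothetical log10 BF_null of 3
rng = np.random.default_rng(2)
x = rng.standard_normal((1000, 1))
lam = scaling_eigenvalues(x, sigma2=0.2 ** 2)
print(lam, log10_scaled_bf(3.0, lam))
```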

[Figure 5.1 plot: y-axis log_10 sBF (BF), x-axis σ from 0.5 to 2.0; curves for n = 200, n = 500, n = 1000.]

Figure 5.1: The plot shows how BF_null and sBF change with σ in simple linear regression with P_BF = 10^{-6}. BF_null is in gray and sBF in black. BF_null and sBF are computed assuming the covariate has unit variance.

Let's focus on the independent normal prior, V = σ^2 I, where we have λ_i = d_i^2/(d_i^2 + σ^{-2}) (d_i is the i-th singular value of X). Figure 5.1 shows how the two Bayes factors change with the value of σ in simple linear regression. An arguable benefit of sBF is that the term log(1 − λ_i) has been removed. Therefore, as long as Q_i > 1, log sBF is monotone increasing in λ_i with the other λ's fixed. Recall our

result on the distribution of Q_i under the alternative, given in Proposition 2.14.

λ_i can in fact be viewed as a measure of power. To see this, consider the alternative β ∼ MVN(0, τ^{-1} σ^2 I), where

(1 − λ_i) Q_i ∼ i.i.d. χ_1^2.

If λ_i is larger, Q_i tends to be greater, which implies a smaller p-value. For the p-value of the likelihood ratio test, this is clear since P_LR is computed by comparing

Σ_i Q_i to the distribution function of χ_p^2. For the p-value of the Bayes factor, note that the ratios λ_1/λ_p, ..., λ_{p−1}/λ_p also have a small effect, but overall we can say that a greater Q_i often implies a smaller P_BF. For a fixed alternative model, it could be argued in a similar fashion that, for some given σ, λ_i measures the power in the direction of v_i, the i-th right-singular vector of X, by using Proposition 2.13. Hence, when faced with two tests with identical p-values that suggest the null should be rejected, the scaled Bayes factor tends to favor the one with the larger power, which itself might be a desirable property since the Bayes factor is often thought of as a link between power and significance [Sawcer, 2010, Stephens and

Balding, 2009] (but this property is missing in the unscaled Bayes factor). The

Bayes factor, together with its property E[BFnull | M0] = 1, of course has many advantages. However, just as a coin has two sides, we have to trade some nice

properties for some others.

5.2 An Application to Intraocular Pressure

GWAS Datasets

The scaled Bayes factor is now applied to analyze the IOP (intraocular pressure)

dataset. For details of this dataset, see Chap. 7.7.1. Age, sex, and 6 leading

principal components are regressed out from the raw phenotypes (the average

IOP of the two eyes). After quantile normalization, the residuals are used as the

phenotypes for single-SNP analysis. BF_null, sBF and P_BF are computed using prior σ = 0.2, which represents a prior belief of small but noticeable effect size [c.f.

Burton et al., 2007].

We first compared BFnull and sBF by minor allele frequency (MAF) bins.

Different MAF bins correspond to different bins of the informativeness (λ1) of

SNPs. Figure 5.2 shows that in each bin log10 sBF ∼ log10 BFnull is roughly parallel to the line y = x, and more importantly, the larger the MAF, the larger the

difference log10 sBF−log10 BFnull, as explained in (5.3). Another noticeable feature

is that the minimum value of sBF is larger than that of BFnull, because BFnull can

go to 0 while sBF is bounded below by e^{−λ_1/2}.

Figure 5.2: The distributions of log10 BF and log10 sBF by different bins of minor allele frequency (MAF). The bins are marked by color. In the left panel the diagonal line is y = x.

Next we examined the ranking of SNPs by different test statistics. Table 5.1

contains the top 20 SNPs in the ranking by BFnull. Rows are sorted according to SNP’s chromosome and position. Incidentally, the top 2 hits (rs7518099 and

SNP          Chr  Pos     MAF    bf(y)      bf(ỹ)   sbf(y)     sbf(ỹ)  p(y)       p(ỹ)
rs12120962   1    10.53   0.384  3.88 (5)   -0.90   4.56 (4)   -0.21   5.63 (5)   0.01
rs12127400   1    10.54   0.384  3.61 (9)   -0.90   4.29 (8)   -0.21   5.34 (9)   0.01
rs4656461    1    163.95  0.140  5.71 (2)   -0.57   6.26 (2)   -0.03   7.51 (2)   0.46
rs7411708    1    163.99  0.428  3.69 (8)   -0.68   4.38 (7)    0.01   5.43 (8)   0.52
rs10918276   1    163.99  0.427  3.59 (10)  -0.66   4.28 (9)    0.03   5.33 (10)  0.54
rs7518099    1    164.00  0.140  6.04 (1)   -0.61   6.58 (1)   -0.07   7.85 (1)   0.38
rs972237     2    125.89  0.119  3.05 (15)  -0.62   3.56 (17)  -0.11   4.65 (18)  0.31
rs2728034    3    2.72    0.090  3.80 (6)   -0.62   4.27 (10)  -0.15   5.45 (7)   0.22
rs7645716    3    46.31   0.254  3.34 (11)  -0.88   3.98 (11)  -0.16   5.03 (11)  0.21
rs7696626    4    8.73    0.023  2.96 (18)  -0.33   3.20 (42)  -0.01   4.70 (16)  0.31
rs301088     4    53.53   0.473  2.95 (20)  -0.81   3.64 (16)  -0.11   4.65 (17)  0.31
rs2025751    6    51.73   0.466  3.78 (7)   -0.75   4.47 (6)   -0.06   5.53 (6)   0.41
rs10757601   9    26.18   0.443  3.09 (13)  -0.79   3.78 (12)  -0.10   4.80 (13)  0.33
rs10506464   12   62.50   0.164  2.97 (17)  -0.75   3.54 (18)  -0.18   4.59 (19)  0.15
rs10778292   12   102.78  0.140  4.00 (4)   -0.75   4.54 (5)   -0.21   5.68 (4)   0.02
rs2576969    12   102.80  0.271  3.07 (14)  -0.85   3.71 (14)  -0.20   4.75 (14)  0.08
rs17034938   12   102.85  0.127  3.23 (12)  -0.71   3.75 (13)  -0.19   4.85 (12)  0.13
rs1288861    15   43.50   0.120  2.95 (19)  -0.45   3.46 (20)   0.06   4.54 (21)  0.59
rs4984577    15   93.76   0.367  3.02 (16)  -0.64   3.69 (15)   0.04   4.71 (15)  0.56
rs12150284   17   9.97    0.353  4.95 (3)   -0.75   5.63 (3)   -0.07   6.75 (3)   0.38

Table 5.1: Top 20 single-SNP associations by BF_null (σ = 0.2). Pos: genomic position in megabase pairs (reference HG18); bf(y): log_10 BF(y); bf(ỹ): log_10 BF(ỹ); sbf(y): log_10 sBF(y); sbf(ỹ): log_10 sBF(ỹ); p(y): −log_10 P_BF(y); p(ỹ): −log_10 P_BF(ỹ). The rankings by the three statistics are given in the parentheses. ỹ is obtained by permuting y once. SNP IDs are in bold if they are mentioned specifically in the main text.

rs4656461) are the same for all three test statistics. The rankings by the three statistics are largely similar to one another, particularly so for the rankings by BF_null and P_BF. There is, however, a noticeable exception, SNP rs7696626, whose ranking by sBF is much worse than its rankings by BF_null and P_BF. Not surprisingly, this SNP has the smallest MAF (0.023) among the 20 SNPs included in Table 5.1. We permuted the phenotypes once and recomputed the three test statistics. Let ỹ be the permuted phenotypes. log sBF(ỹ) is usually close to 0 whereas log BF_null(ỹ) is negative.

We also tried σ = 0.5 and found that, for most top signals in Table 5.1, BFnull

SNP          Chr  Pos     MAF    log10 BFnull   log10 sBF    −log10 PBF
rs12120962   1    10.53   0.384  3.549 (6)      4.624 (5)    5.628 (5)
rs12127400   1    10.54   0.384  3.271 (9)      4.346 (9)    5.339 (9)
rs4656461    1    163.95  0.140  5.494 (2)      6.424 (2)    7.507 (2)
rs7411708    1    163.99  0.428  3.360 (8)      4.438 (7)    5.434 (8)
rs10918276   1    163.99  0.427  3.258 (10)     4.336 (10)   5.328 (10)
rs7518099    1    164.00  0.140  5.829 (1)      6.758 (1)    7.852 (1)
rs972237     2    125.89  0.119  2.781 (14)     3.674 (17)   4.649 (18)
rs2728034    3    2.72    0.090  3.584 (5)      4.434 (8)    5.452 (7)
rs7645716    3    46.31   0.254  3.019 (12)     4.045 (11)   5.027 (11)
rs7696626    4    8.73    0.023  3.069 (11)     3.644 (18)   4.698 (16)
rs2025751    6    51.73   0.466  3.443 (7)      4.527 (6)    5.527 (6)
rs1081076    6    132.97  0.022  2.690 (19)     3.260 (44)   4.287 (36)
rs10757601   9    26.18   0.443  2.748 (15)     3.829 (13)   4.797 (13)
rs10778292   12   102.78  0.140  3.738 (4)      4.665 (4)    5.683 (4)
rs2576969    12   102.80  0.271  2.745 (16)     3.777 (14)   4.745 (14)
rs17034938   12   102.85  0.127  2.953 (13)     3.862 (12)   4.845 (12)
rs1955511    14   32.30   0.076  2.684 (20)     3.497 (25)   4.472 (25)
rs12150284   17   9.97    0.353  4.638 (3)      5.707 (3)    6.752 (3)
rs6017819    20   44.48   0.069  2.736 (17)     3.524 (24)   4.505 (23)
rs279728     20   44.51   0.087  2.704 (18)     3.541 (22)   4.515 (22)

Table 5.2: Top 20 single-SNP associations by BF_null (σ = 0.5). The "Pos" column gives the genomic position in megabase pairs (reference HG18). The rankings by the three statistics are given in the parentheses. SNP IDs are in bold if they are mentioned specifically in the main text.

becomes smaller and sBF grows larger, which is consistent with Fig. 5.1. Note that the p-value, P_BF, is not affected by the choice of σ. The rankings of the SNPs remained mostly unchanged. The result is provided in Table 5.2.

Lastly, although it was not our main objective, we examined the top hits in the association result. Our analysis reproduced three known genetic associations for IOP. Namely, the TMCO1 gene on chromosome 1 (163.9M-164.0M) which was reported in [van Koolwijk et al., 2012]; a single hit rs2025751 in the PKHD1 gene on chromosome 6 [Hysi et al., 2014]; and a single hit rs12150284 in the

GAS7 gene on chromosome 17 [Ozel et al., 2014]. A potentially novel finding is the gene PEX14 on chromosome 1. Two SNPs, rs12120962 and rs12127400,

have modest association signals. PEX14 encodes an essential component of the

peroxisomal import machinery. The protein interacts with the cytosolic receptor

for proteins containing a PTS1 peroxisomal targeting signal. Incidentally, PTS1

is known to elevate the intraocular pressure [Shepard et al., 2007]. In addition, a

mutation in PEX14 results in one form of Zellweger syndrome, and for children

who suffer from Zellweger syndrome, congenital glaucoma is a typical neonatal-

infantile presentation [Klouwer et al., 2015].

5.3 Scaled Bayes Factors in Variable Selection

5.3.1 Calibrating the Scaling Factors

Consider the Bayesian variable selection regression (BVSR) model described in

Chap. 4.2. Apparently it is not wise to directly plug in the scaled Bayes factor

defined in (5.2) since it is no longer consistent (see Friedman et al. [2001, Chap. 7.7]

for the meaning of consistency). For example, suppose the effect size σ (or the

heritability h) is given and y is generated from a linear regression model with causal SNPs Xγ with rank(Xγ ) = |γ|. We now need to choose between two

models X_γ and X_{γ'}, the latter of which is given by X_{γ'} = [X_γ, x*], where
X_γ^t x* = 0. Then by our asymptotic result in Chap. 2,

2 log BF_null(X_{γ'}) − 2 log BF_null(X_γ) = λ* Q* + log(1 − λ*),

where Q* ∼ χ_1^2. If we have an infinitely large sample size, then λ* goes to 1 and

thus we would choose model Xγ with probability one. However, for the scaled

Bayes factor, we have

2 log sBF(X_{γ'}) − 2 log sBF(X_γ) = λ* Q* − λ*.

Therefore, even if λ* = 1, the wrong model would still have a positive posterior probability. Note that in BVSR we also need to compute the prior probabilities of γ and γ'; however, the difference, log p(γ') − log p(γ), is bounded and thus does not affect our conclusion for this simple example. The key message is that the scaled

Bayes factor given in (5.2) does not produce a sufficiently large penalty on the model complexity.

To solve this problem, recall that our motivation for sBF was just to introduce a scaling factor to BFnull such that

log sBF = log BFnull − E[log BFnull | M0] + log C,

where C does not depend on X given the model size. For the BVSR model,

consider letting C = C(|γ|, σ^2). Since BF_null = p(y|M)/p(y|M_0), we have

sBF = C(|γ|, σ^2) exp(−E[log BF_null | M_0]) p(y|M) / p(y|M_0).

To make sBF suitable for variable selection, it suffices to require

E[ C(p, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]) | |γ| = p ] = 1,   (5.4)

where the inner expectation is with respect to y and the outer is with respect to

γ. This condition immediately leads to

E [ E[sBF | M0] | |γ| = p] = 1.

To see why this works for BVSR, recall that in the original BVSR model, for γ

the following prior is used,

p(γ | π) = π^{|γ|} (1 − π)^{N−|γ|},

which may be further decomposed to

p(γ | π) = p(γ, |γ| | π) = p(|γ| | π)p(γ | |γ|, π).

Implicitly we have assumed

p(γ | |γ|, π) = p(γ | |γ|) = |γ|! (N − |γ|)! / N!,   (5.5)
where N is the total number of candidate SNPs. Now, by using sBF, we change the prior to

p(γ | |γ|, π) = C(|γ|, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]) · |γ|! (N − |γ|)! / N!,

which by condition (5.4) is a valid probability mass function. As the information in

the data accumulates, the choice of prior no longer has an influence on the posterior

and thus the scaled Bayes factor is consistent just like the Bayes factor. This is

actually a result of the Bernstein–von Mises theorem [der Vaart, 2000, Chap. 10.2].

Note that C = C(|γ|, σ^2) is chosen instead of C = C(|γ|) since otherwise the

integration over σ2 would be difficult.

The expectation given in (5.4) can be computed as follows. Using our asymp-

totic result, we can write

" p # 1 Y e−λi/2 = E √ , C(p, σ2) 1 − λ i=1 i

where the expectation is with respect to the distribution of λ = (λ_1, ..., λ_p) induced from the uniform conditional distribution of γ given in (5.5). This integral

can be approximated by

" p # Y e−λi/2  e−λ/2  log E √ ≈ p log E √ , (5.6) 1 − λ i=1 i 1 − λ where the expectation on the r.h.s. is with respect to the mixture distribution of

λ1, . . . , λp. In the BVSR model, according to Proposition 2.13, we have

λ_i = d_i^2 / (d_i^2 + 1/σ^2),

where d_i is the i-th singular value of X_γ. Let η = d^2 denote an arbitrary eigenvalue

of X_γ^t X_γ. Its distribution is easy to characterize by sampling and in fact changes very little with the choice of |γ|. Let's also define φ = σ^2 (to simplify the following

expressions) and write

g(η) = e^{−ηφ/[2(ηφ+1)]} √(1 + ηφ) = e^{−λ/2} / √(1 − λ).

By Taylor expansion, we have

E[g(η)] = Σ_k (1/k!) E[(η − E[η])^k] g^{(k)}(E[η]).

In our implementation, a fourth-order approximation is used. The derivatives of

g are given by

g''(η) = e^{−ηφ/[2(ηφ+1)]} / [4(ηφ + 1)^{7/2}] · (2φ^2 − η^2 φ^4),
g^{(3)}(η) = e^{−ηφ/[2(ηφ+1)]} / [8(ηφ + 1)^{11/2}] · (−16φ^3 − 18ηφ^4 + 3η^3 φ^6),
g^{(4)}(η) = e^{−ηφ/[2(ηφ+1)]} / [16(ηφ + 1)^{15/2}] · (156φ^4 + 320ηφ^5 + 180η^2 φ^6 − 15η^4 φ^8).
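The following is a minimal sketch of the fourth-order Taylor approximation to the scaling factor (r.h.s. of (5.6)). For brevity it evaluates the derivatives of g by central finite differences rather than the closed-form expressions above, and the eigenvalue sample is a hypothetical gamma-distributed spectrum rather than eigenvalues from the real datasets.

```python
import numpy as np

def g(eta, phi):
    """g(eta) = exp(-eta*phi / (2*(eta*phi + 1))) * sqrt(1 + eta*phi)."""
    return np.exp(-eta * phi / (2.0 * (eta * phi + 1.0))) * np.sqrt(1.0 + eta * phi)

def taylor_log_scaling(eta_samples, p, sigma2, h=1e-3):
    """Fourth-order Taylor approximation to p * log E[g(eta)] (r.h.s. of (5.6)).
    Derivatives of g are obtained by central finite differences as a shortcut."""
    phi = sigma2
    m = eta_samples.mean()
    cm = [np.mean((eta_samples - m) ** k) for k in (2, 3, 4)]     # central moments
    step = h * max(abs(m), 1.0)
    grid = g(m + step * np.arange(-2, 3), phi)                    # g at m-2h, ..., m+2h
    d2 = (grid[3] - 2 * grid[2] + grid[1]) / step ** 2
    d3 = (grid[4] - 2 * grid[3] + 2 * grid[1] - grid[0]) / (2 * step ** 3)
    d4 = (grid[4] - 4 * grid[3] + 6 * grid[2] - 4 * grid[1] + grid[0]) / step ** 4
    approx = g(m, phi) + cm[0] * d2 / 2 + cm[1] * d3 / 6 + cm[2] * d4 / 24
    return p * np.log(approx)

# toy usage: hypothetical eigenvalue spectrum of X_gamma^t X_gamma, sigma = 0.2
rng = np.random.default_rng(3)
eta = rng.gamma(shape=5.0, scale=200.0, size=100_000)
print(taylor_log_scaling(eta, p=10, sigma2=0.2 ** 2))
```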

The performance of our method is checked by simulation. The Height-10K

and Height-chr6 datasets described in Chap. 4.4 are used. For each given |γ|, γ

is sampled until we have collected 100K singular values of Xγ . Then the ideal scaling factor (l.h.s. of (5.6)) and the Taylor series approximation to the r.h.s.

of (5.6) using σ = 0.2 are computed. The results given in Table 5.3 demonstrate

that our method is accurate enough.

         Height-10K               Height-chr6
|γ|      A_{|γ|}   Â_{|γ|}        A_{|γ|}   Â_{|γ|}
1        1.505     1.505          1.504     1.505
10       15.06     15.06          15.02     15.03
50       75.08     75.20          74.72     75.00
100      150.3     150.4          148.6     149.6
200      297.5     299.8          293.3     297.2

Table 5.3: Taylor series approximations for the ideal scaling factors with σ = 0.2. A_{|γ|} = log E[Π e^{−λ_i/2}/√(1 − λ_i)] denotes the ideal scaling factor; Â_{|γ|} = p log E[e^{−λ/2}/√(1 − λ)], where the expectation is evaluated using the Taylor expansion.

5.3.2 Prediction Properties

A natural question is how the Bayes factor and the scaled Bayes factor differ in prediction. The answer would be very complicated and depends on the specific problem. To gain some insight, let's consider a toy example. Suppose we have two SNPs, x_1 and x_2 (centered). We need to select between two models (σ^2 is given):

(1) M_1: y = β_1 x_1 + ε,   β_1 ∼ N(0, σ^2),   ε ∼ MVN(0, τ^{-1} I);

(2) M_2: y = β_2 x_2 + ε,   β_2 ∼ N(0, σ^2),   ε ∼ MVN(0, τ^{-1} I).

Let s_i = x_i^t x_i (i = 1, 2) and s_12 = x_1^t x_2. For the i-th model, the posterior for β_i and the Bayes factor are given by

β_i | M_i, τ ∼ N( x_i^t y / (s_i + σ^{-2}), τ^{-1} / (s_i + σ^{-2}) );
BF_null(x_i) = (σ^2 s_i + 1)^{-1/2} [ 1 − (x_i^t y)^2 / (y^t y (s_i + σ^{-2})) ]^{-n/2}.

˜ ˜ ˜ To avoid confusion let β = (β1, β2) be the true (realized) value of β. Then by (4.4.3), the mean squared prediction error (MSPE) of some estimate βˆ is

MSPE(β̂) = (1/n){ (n + 1)τ^{-1} + s_1(β̃_1 − β̂_1)^2 + s_2(β̃_2 − β̂_2)^2 + 2 s_12 (β̃_1 − β̂_1)(β̃_2 − β̂_2) }.   (5.7)

Using y ∼ MVN(Xβ̃, τ^{-1} I), we can define

Z_i := τ^{1/2} s_i^{-1/2} x_i^t y ∼ N( τ^{1/2} { s_i^{1/2} β̃_i + s_i^{-1/2} s_12 β̃_{3−i} }, 1 ),   (5.8)

and we can show

x_i^t y / (s_i + σ^{-2}) = τ^{-1/2} s_i^{1/2} Z_i / (s_i + σ^{-2}) ∼ N( (s_i β̃_i + s_12 β̃_{3−i}) / (s_i + σ^{-2}), τ^{-1} s_i / (s_i + σ^{-2})^2 ).

Assume n is sufficiently large. Then we have

x_i^t y / (s_i + σ^{-2}) ≈ τ^{-1/2} s_i^{-1/2} Z_i ∼ N( β̃_i + s_12 s_i^{-1} β̃_{3−i}, τ^{-1} s_i^{-1} ),

and by Theorem 2.7,

2 log BF_i = [s_i σ^2 / (s_i σ^2 + 1)] Z_i^2 − log(s_i σ^2 + 1),
2 log sBF_i = [s_i σ^2 / (s_i σ^2 + 1)] (Z_i^2 − 1),

where we have used the abbreviation BFi = BFnull(xi). Since the two models have the same dimension, we don’t need to compute the ideal scaling factor for sBF

and the posterior probability of each model (denoted by wi) is given by

w_i^B = BF_i / (BF_1 + BF_2),   w_i^S = sBF_i / (sBF_1 + sBF_2).

By the model averaging principle, the Bayesian variable selection estimate, de-

noted by β̂, is

β̂ = ( w_1 x_1^t y / (s_1 + σ^{-2}),  w_2 x_2^t y / (s_2 + σ^{-2}) ).

Plugging the expressions for β̂ and w_i into (5.7) and using (5.8), we can calculate, albeit only numerically, the expected MSPE of a Bayesian procedure. To obtain some

analytic conclusions, let's set Z_1, Z_2 to their expected values and let

ρ := s_12 / √(s_1 s_2)

be the correlation between x1 and x2. Then for the three terms on the r.h.s. of (5.7) we have

s_1(β̃_1 − β̂_1)^2 = s_1 {β̃_1 − w_1(β̃_1 + ρ √(s_2/s_1) β̃_2)}^2,
s_2(β̃_2 − β̂_2)^2 = s_2 {β̃_2 − w_2(β̃_2 + ρ √(s_1/s_2) β̃_1)}^2,
2 s_12 (β̃_1 − β̂_1)(β̃_2 − β̂_2) = 2ρ √(s_1 s_2) {β̃_1 − w_1(β̃_1 + ρ √(s_2/s_1) β̃_2)} {β̃_2 − w_2(β̃_2 + ρ √(s_1/s_2) β̃_1)}.

Using w1 + w2 = 1, we obtain

MSPE(β̂) = (1/n)[ (n + 1)τ^{-1} + (1 − ρ^2){ s_1 β̃_1^2 (1 − w_1)^2 + s_2 β̃_2^2 w_1^2 − 2ρ √(s_1 s_2) β̃_1 β̃_2 w_1 (1 − w_1) } ].

By differentiating, we find that the optimal value for w_1 is

w_1^* = ( s_1 β̃_1^2 + ρ √(s_1 s_2) β̃_1 β̃_2 ) / ( s_1 β̃_1^2 + s_2 β̃_2^2 + 2ρ √(s_1 s_2) β̃_1 β̃_2 ).   (5.9)

In some special circumstances, enlightening conclusions can be drawn.

(1) β̃_2 = 0. Then w_1^* = 1, which means that we should simply choose M_1, consistent with our intuition. Suppose the sample size n grows to infinity.

Then both 2 log(BF_1/BF_2) and 2 log(sBF_1/sBF_2) grow at rate O((1 − ρ^2) n).

Hence both statistics would yield w1 = 1 when n is sufficiently large.

(2) s_1 β̃_1^2 = s_2 β̃_2^2 and s_1 > s_2. Then w_1^* = 1/2. In this case Z_1^2 = Z_2^2 and thus the p-values of the two models would be equal. According to the previous discussion,

we have sBF_1 > sBF_2. But since 2 log sBF takes the form λ(Z^2 − 1), where

λ → 1 as n → ∞, for large sample sizes we have sBF_1 ≈ sBF_2 and thus w_1 ≈ 1/2, which is optimal. However, BF_1/BF_2 → √(s_2/s_1). Hence in this case the scaled Bayes factor is more advantageous.

Another useful observation is that if s_1 β̃_1^2 > s_2 β̃_2^2, we have w_1^* > 1/2. Therefore, if the true effect sizes β̃_1 and β̃_2 are equal in absolute value, we should favor the SNP with the larger variance; if the two SNPs have the same variance, we should favor the SNP with the larger effect size (in absolute value).
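A minimal numerical sketch of the toy example is given below. It evaluates the model weights implied by BF and sBF (with Z_1, Z_2 set to their expected values and the large-n approximations above) against the optimal weight (5.9). The values of s_1, s_2, ρ, β̃_1, β̃_2, and σ are hypothetical and chosen to illustrate case (2), where s_1 β̃_1^2 = s_2 β̃_2^2 with s_1 > s_2.

```python
import numpy as np

def toy_weights(s1, s2, rho, b1, b2, sigma2, tau=1.0):
    """Two-SNP toy example: w_1 under BF and sBF versus the optimal w_1^* in (5.9)."""
    s12 = rho * np.sqrt(s1 * s2)
    # expected values of Z_i from (5.8)
    z1 = np.sqrt(tau) * (np.sqrt(s1) * b1 + s12 * b2 / np.sqrt(s1))
    z2 = np.sqrt(tau) * (np.sqrt(s2) * b2 + s12 * b1 / np.sqrt(s2))
    def log_bf(s, z):
        lam = s * sigma2 / (s * sigma2 + 1.0)
        return 0.5 * (lam * z ** 2 - np.log(s * sigma2 + 1.0))
    def log_sbf(s, z):
        lam = s * sigma2 / (s * sigma2 + 1.0)
        return 0.5 * lam * (z ** 2 - 1.0)
    w1_bf = 1.0 / (1.0 + np.exp(log_bf(s2, z2) - log_bf(s1, z1)))
    w1_sbf = 1.0 / (1.0 + np.exp(log_sbf(s2, z2) - log_sbf(s1, z1)))
    w1_opt = (s1 * b1 ** 2 + s12 * b1 * b2) / (s1 * b1 ** 2 + s2 * b2 ** 2 + 2 * s12 * b1 * b2)
    return w1_bf, w1_sbf, w1_opt

# case (2): s1*b1^2 = s2*b2^2 and s1 > s2, so the optimal weight is 1/2;
# the sBF weight is close to 1/2 while the BF weight is pulled below 1/2
print(toy_weights(s1=4000.0, s2=1000.0, rho=0.0, b1=0.05, b2=0.1, sigma2=0.04))
```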

5.4 Simulation Studies for Variable Selection

The experiments done in Chap. 4.4 can also be performed using the scaled Bayes factor. The goal of this section is to show that our method is valid and that sBF works at least as well as BF. Permutation is used to compute E[log BF_null | M_0]. The scaling factor is then calibrated by computing C(|γ|, σ^2), using the method described in Chap. 5.3.1. Only the Height-ind and Height-chr6 datasets are used, and we run MCMC with 20K burn-in iterations, 100K sampling iterations and

Rao-Blackwellization every 1000 iterations (this was the setting (a) in Chap. 4.4).

Figure 5.3 shows the results for heritability estimation with different numbers of permutations. As a larger number of permutations reduces the estimation variance for E[log BF_null | M_0], the heritability is most accurately estimated when we permute 20 times. Moreover, using the mean absolute error metric, when we permute 20 times the scaled Bayes factor produces more accurate results than the Bayes factor (Table 5.4). It may be a little surprising that the result for sBF with only 1 permutation also looks good enough. This can be explained using our asymptotic results. Recall from Theorem 2.7 that asymptotically 2 log BF_null =

Mean absolute error    BF       sBF: number of permutations
                                1        2        5        10       20
Height-ind             0.0217   0.0434   0.0343   0.0260   0.0221   0.0203
Height-chr6            0.0271   0.0628   0.0504   0.0338   0.0266   0.0242

Table 5.4: Mean absolute error (MAE) of the heritability estimation using the scaled Bayes factor. The MAE is calculated as the mean of the absolute differences between the true and the estimated heritabilities. See Fig. 5.3 for more related results.

Σ_i λ_i Q_i + log(1 − λ_i), where under the null the Q_i are i.i.d. χ_1^2. Hence, by direct calculation,

E[ BF_null^{-1} | M_0 ] = Π_i 1 / √(1 − λ_i^2).

By the reasoning of pseudo-marginal MCMC methods [Andrieu and Roberts,

2009], the MCMC with 1 permutation converges to a stationary distribution in-

duced by a (scaled) Bayes factor (denoted by sBF^{(1)}) with the form

log sBF^{(1)} = (1/2) Σ_{i=1}^p (λ_i Q_i − log(1 + λ_i)) + log C(p, σ^2),

where Qi was defined in Theorem 2.7. The term − log(1 + λi) is close to the true

penalty term in sBF, which is −λi, compared with the original penalty log(1−λi).

The posterior inclusion probabilities can still be improved by Rao-Blackwellization, via the method proposed in (4.9). Again, the results for the scaled Bayes factor are very similar to those for the Bayes factor (Fig. 5.4). For prediction, the scaled Bayes factor is also as good as the Bayes factor (Fig. 5.5). In fact, for the Height-chr6 dataset, the scaled Bayes factor is better when the true heritability is large.

[Figure 5.3 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.3: Heritability estimation with sBF in the Height-ind (first row) and Height-chr6 (second row) datasets. The first three columns correspond to different numbers of permutations used in computing the null expectation of BF. The BVSR results for BF are given in the last column for comparison. The grey bars represent 95% posterior intervals.

[Figure 5.4 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.4: Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The SNPs are divided into 20 bins by RB-PIP and the y-axis represents the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives. The grey bars represent ±2SD.

[Figure 5.5 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.5: Relative prediction gain of the scaled Bayes factor for the simulated datasets with heritability ≥ 0.05. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The RPG is computed by (4.10) and (4.11). In the legends, BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model output by GCTA.

Chapter 6

Summary and Future Directions

6.1 Summary of This Work

Bayesian linear regression has a wide application in genetics. Two important examples are association testing and variable (causal SNP) selection. This work studied both the theoretical and computational aspects of Bayesian linear regression, and provided examples of applications to genome-wide studies. We started from the characterization of the null distribution of the Bayes factor given in (1.3).

Under the null,

2 log BF_null = Σ_{i=1}^p (λ_i Q_i + log(1 − λ_i)) + o_p(1),   (6.1)

where Q_1, ..., Q_p are i.i.d. χ_1^2 random variables, λ_1, ..., λ_p are weights between 0 and 1, and o_p(1) is an error term that vanishes in probability. Under the alternative, assuming some conditions that guarantee the error vanishes, we still have the same asymptotic form for 2 log BF_null, but Q_1, ..., Q_p become noncentral chi-squared random variables. The proof was given in Chap. 2.1. An immediate impact is on the calculation of p-values for Bayesian methods in GWAS. Due to the burden of

multiple testing, the significance threshold in GWAS is very small, typically 5 × 10^{-8}. Consequently, the permutation approach to computing p-values associated with Bayes factors is not feasible. Using the asymptotic result (6.1), such p-values can be analytically computed. In Chap. 2.2 the behaviour of the Bayes factor and its associated p-value was discussed. The computation of p-values requires

the evaluation of the distribution function of a weighted sum of independent χ_1^2 random variables. To overcome this, we implemented in C++ a recent polynomial method of Bausch [2013], which appears to be the most efficient solution so far.

A striking feature of our implementation is that even extremely small p-values can be accurately computed. Besides, arbitrary precision is attainable and strict error bounds are provided. More details were given in Chap. 2.3.2 and 2.3.3.

Simulation studies (see Chap. 2.3.4) showed that the p-values computed using our asymptotic result have very good calibration, even at the tail, i.e., when the associated Bayes factor is very large.
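The sketch below is not the Bausch [2013] polynomial method implemented in BACH; it is only a simple Monte Carlo check of the tail probability implied by (6.1), usable for moderate p-values. The weights λ_i and the observed value of 2 log BF_null are hypothetical inputs.

```python
import numpy as np

def p_value_monte_carlo(two_log_bf, lam, n_sim=1_000_000, seed=0):
    """Monte Carlo estimate of P(sum_i lam_i*Q_i + log(1-lam_i) >= observed),
    with Q_i ~ chi^2_1, i.e. the null distribution in (6.1). Only a sanity
    check; it cannot reach the extreme tail that BACH handles analytically."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    q = rng.chisquare(df=1, size=(n_sim, lam.size))
    null_draws = q @ lam + np.sum(np.log1p(-lam))
    return np.mean(null_draws >= two_log_bf)

# toy usage with hypothetical weights and an observed 2*log(BF_null) of 12
print(p_value_monte_carlo(two_log_bf=12.0, lam=[0.9, 0.8, 0.5]))
```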

The expression of the Bayes factor (Eq. (1.3)) contains the term (X^t X + V^{-1})^{-1} X^t y, the posterior mean of the regression coefficient (also the maximum a posteriori estimator). It is often computed via the Cholesky decomposition of (X^t X + V^{-1}), which has cubic complexity in p (the number of columns of X) and is thus extremely slow for large p. A novel iterative method, called ICF (iterative solutions using complex factorization), another major contribution of this work, was proposed in

Chap. 3. Simulation (Chap. 3.5) shows that, when ICF is applicable, it is much better than the Cholesky decomposition and other iterative methods such as the Gauss-Seidel algorithm. The only limitation of ICF is that it relies on the availability of the Cholesky decomposition of X^t X. Fortunately, in the MCMC sampling of a Bayesian variable selection procedure, such decompositions are often easy to obtain by efficient updating algorithms. We studied the BVSR (Bayesian variable selection regression) model proposed by Guan and Stephens [2011] (see Chap. 4.2),

which turned out to fit well with the ICF algorithm. Our new MCMC algorithm for the inference of the BVSR model was described in great detail in Chap. 4.3. Apart from the ICF algorithm, the exchange algorithm proposed by Murray et al. [2012] was employed to bypass the calculation of matrix determinants. Simulation studies

(Chap. 4.4) showed that the new algorithm can efficiently estimate the heritability of a quantitative trait and report well-calibrated posterior inclusion probabilities. Furthermore, compared with another popular software package, GCTA (see

Chap. 7.5), it has much better performance in prediction (Chap. 4.4.3).

The last, but by no means the least, novelty of this work is a new statistic called the scaled Bayes factor (Chap. 5). It was motivated by the null distribution of the

Bayes factor given in (6.1). See Chap. 5.1 for its practical and theoretical benefits.

Chap. 5.2 gave an application of sBF to whole-genome single-SNP analysis of a real

GWAS dataset on intraocular pressure. Some known associations were replicated and a potentially novel finding, PEX14, was described. The scaled Bayes factor could also be used for variable selection. The method is a little more complicated since the scaling factor of sBF requires further calibration with the data (see

Chap. 5.3.1). Simulation studies in Chap. 5.4 demonstrated that sBF performs at least as well as the unscaled Bayes factor.

6.2 Specific Aims for Future Studies

6.2.1 Bayesian Association Tests Based on Haplotype or Local Ancestry

Background The asymptotic distribution of the Bayes factor in Bayesian linear regression was studied in Chap. 2 and a software package was provided for

computing the p-values associated with the Bayes factors. A GWAS application

of this result was given in Chap. 5.2. Nevertheless, in that GWAS data analysis

we only considered the single-SNP analysis. By Proposition 2.12, in simple linear

regression, P_BF (the p-value associated with the Bayes factor) is asymptotically equal to P_LR (the p-value of the likelihood ratio test). Hence the use and the importance of P_BF were not emphasized. For multi-locus association testing, P_BF may behave very differently from the p-values of the traditional tests (P_LR and

P_F). The calculation of P_BF would also be much harder, but could be done efficiently by our program BACH.

Typical examples of multi-locus testing include the sequence kernel association test of Ionita-Laza et al. [2013], the semiparametric regression test with the least-squares kernel machine of Kwee et al. [2008] and some other tests based on pooling rare variants. For these methods, the test statistic is distributed as a weighted sum of independent chi-squared random variables and thus our program

BACH could be applied.

Here we consider two new ideas. First, we may perform multi-locus association testing using haplotypes. Based on linkage disequilibrium, chromosomes can be divided into smaller haplotype blocks such that in each block the distributions of the SNPs are highly dependent and thus the combinations of these

SNPs form specific patterns. The inference of haplotypes from genotypes is called phasing (see Chap. 1.1.1 for more information). To test the association between the phenotype and a haplotype block, we build a multi-linear regression model using both the SNPs in that block and the haplotypes (represented by dummy variables). From a genetic perspective, such tests are appealing. In single-SNP analysis, scientists are much more interested in the region surrounding an association signal than in the signal itself. A haplotype block usually lies in a

specific region, e.g. a gene, and thus represents direct biological interest. From a statistical perspective, haplotype blocks contain more information than single

SNPs and the total number of tests becomes much smaller, resulting in a much milder multiple-testing correction. Therefore such tests tend to be more powerful than single-SNP analysis. Another example is the association test based on local ancestry. For an admixed population, it is often beneficial to infer the ancestral proportions at each locus. For example, Mexicans are an admixed population with three ancestral populations: European, Native American, and African. The local ancestry analysis of Mexican samples revealed a strong selection signal at the MHC region [Zhou et al., 2016]. Programs for local ancestry inference include ELAI [Xu and Guan, 2014], RFMix [Maples et al., 2013], LAMP-LD [Baran et al., 2012], etc. ELAI uses a two-layer model where the lower layer (the more recent layer) typically contains 10 to 20 clusters of ancestral haplotypes. Hence, just like the previous example, for each SNP we may use the genotype together with the local ancestry (represented by dummy variables) as the regressors and fit a Bayesian multi-linear regression model. The Bayes factor and its associated p-value are then used to detect significant associations.

Methods and Materials For the haplotype association testing, we may simply use the IOP dataset which was analyzed in Chap. 5.2. There are two ways to define the haplotype blocks. First, we can simply use a fixed block size. This is also the strategy used by most phasing and local ancestry inference programs, for example IMPUTE2 and LAMP-LD. Second, we may define the blocks according to biological function or the degree of linkage disequilibrium. For example, the SNPs located within the same gene (including the upstream and downstream regions that may contain promoters) should be grouped into blocks. We can also compute the linkage disequilibrium and group the SNPs such that in each block

LD is greater than some threshold. Then for each block, we perform a Bayesian

multi-linear regression and compute the Bayes factor and its associated p-value.

We write the model as

y = Xβ + Hu + ε,

ε_i | τ ∼ i.i.d. N(0, 1/τ),
β | τ, V_β ∼ MVN(0, τ^{-1} V_β),   (6.2)
u | τ, V_u ∼ MVN(0, τ^{-1} V_u),
p(τ) ∝ 1/τ,
where X represents the genotypes of the SNPs in the block and H represents the haplotypes. If there are k different haplotypes in the block, then H has k − 1 columns (dummy variables). H_ij = 1 means the i-th subject has the j-th haplotype. Note that unlike traditional dummy variables, the entries of

H_ij do not have to be integers. For example, if haplotypes are inferred from a Bayesian procedure, we may obtain the posterior probability for each possible haplotype. (Of course, we can also compute the Bayes factor for every possible inference of H and compute a weighted average over these Bayes factors.)

For the association testing using local ancestry, we need a dataset from an admixed population. One candidate is a dbGaP dataset, the Mexican hypertriglyceridemia study (accession number: phs000618.v1.p1), which contains 4,350 case and control samples [Weissglas-Volkov et al., 2013]. The dataset contains HDL (high-density lipoprotein) measurements which can be used as the phenotype for our analysis. At each SNP locus, we first infer the local ancestry and then use it to build a multi-linear regression model for association testing. The model has the same form as (6.2). However, X has only one column and H includes the dummy variables that represent the local ancestry. The local ancestry inference program, ELAI [Xu and Guan, 2014], can output the probabilistic estimates for

every ancestral population.

An important advantage of Bayesian analysis is the model flexibility. We may

combine the two design matrices in (6.2), X and H, and use V = diag(V_β, V_u) to represent the prior covariance matrix. Although in Chap. 2 only two special choices for V were discussed (the independent normal prior and the g-prior), V can actually be any positive definite matrix. For both association testing methods, how to choose an appropriate V would be a critical problem. When V_u = 0, the test reduces to the ordinary multi-locus test. Clearly, we would like to specify different effect sizes for V_β and V_u. To average over prior uncertainties, we may try different priors and then average over the Bayes factors.

Simulation studies can be performed to compare the performance of our methods with other methods, including the non-Bayesian multi-linear regression tests and other haplotype-based methods, for example the haplotype-sharing method of Nolte et al. [2007]. Besides, such studies can also be used to compare different phasing and local ancestry inference programs. For example, LAMP-LD and

RFMix use a one-layer model to infer the local ancestry and thus, for Mexican samples, we only have two dummy variables. For ELAI, in contrast, we can have more than ten regressors due to the two-layer modelling (the upper layer only contains three ancestral populations but the lower layer can contain 10 to 20 haplotype groups). It would be interesting to know whether the two modelling methods produce different association testing results. An existing software package for simulating admixed populations is cosi2 [Shlyakhter et al., 2014].

6.2.2 Application of ICF to Variational Methods for Variable Selection

Background The BVSR method described in Chap. 4 is quite powerful when the dataset has only a moderate number of SNPs. But for large datasets that contain more than 100K SNPs, 1000 of which are causal, BVSR has a clear tendency to underestimate the heritability (see the simulation results for the Height-5C dataset in Chap. 4.4). A real example is the inference of the heritability of height using the Height dataset. The GCTA estimate for the heritability is 0.44 [Yang et al., 2010]. In contrast, the BVSR estimate for PVE (proportion of variance explained) is only 0.15, as reported in Zhou et al. [2013]. Our new implementation, fastBVSR, has also been tried but the heritability estimate is still around

0.2 (result not shown in this work). One Bayesian approach to solve this problem is to use the BSLMM (Bayesian Sparse Linear Mixed Model) model of Zhou et al.

[2013]. BSLMM assumes the following prior for β:

β_i ∼ i.i.d. π N(0, (σ_a^2 + σ_b^2)/τ) + (1 − π) N(0, σ_b^2/τ).

For comparison, the prior for β in BVSR could be written as

β_i ∼ i.i.d. π N(0, σ^2/τ) + (1 − π) δ_0.

Hence, BSLMM essentially assumes that every SNP has a contribution to the phenotype but, for most of them, the effect is tiny. The variable selection of

BSLMM aims to identify the SNPs with relatively large effects. The rationale of

BSLMM can be seen as a mixture of BVSR and GCTA. Using this method, the

PVE estimate for the Height dataset is 0.41 [Zhou et al., 2013]. Our algorithm for computing ridge estimators, ICF, which was described in Chap. 3, could be applied

to BSLMM, and a substantial improvement in running speed is expected. Due to the similarity between BVSR and BSLMM, the implementation of BSLMM using ICF is easy.

For the Height dataset, the failure of BVSR to produce a heritability estimate comparable to that of GCTA is not necessarily caused by the model specification.

It might simply be due to computational limitations. The posterior inference for BVSR is made via MCMC, but the Height dataset contains about 300K SNPs, which implies that it is almost impossible for the MCMC to converge within a few million iterations. To make things worse, the number of causal SNPs is very large, probably much greater than 1000. Hence it is entirely likely that there exist models with large posterior probabilities and model size greater than 1000, but BVSR cannot find them. In fact, for any problem with so many potential predictors (and "true" predictors), MCMC becomes much less reliable. Note that

BSLMM effectively makes the model size much smaller and, for traits like height, the heritability estimation largely depends on the estimation of the parameter σb. In Zhou et al. [2013], it is reported that the proportion of variance explained by the sparse effects is only 0.12 for height.

Another Bayesian strategy to solve this problem that does not use MCMC is the use of variational methods, which shall be the focus of our second aim. See Jordan et al. [1999], Bishop [2006], and Grimmer [2011], among others, for an introduction.

The idea of variational inference was briefly explained in Chap. 4.1. To be more specific, consider the following approximating form for the posterior distribution of (β, γ),

q(β, γ) = Π_{j=1}^N {φ_j f_j(β_j)}^{γ_j} {(1 − φ_j) δ_0(β_j)}^{1−γ_j},   (6.3)
where N is the total number of SNPs and γ_j is the variable indicating whether

the SNP is included in the model. By integrating out β_j, it can be seen that

φj = P(γj = 1) is actually the posterior inclusion probability (PIP). fj is some distribution to be estimated and we restrict it to be normal. Note that (6.3) cannot be the true posterior because we have assumed the posterior independence between the SNPs! The variational inference aims to find an approximation with form (6.3) that minimizes the Kullback-Leibler divergence between q and the true

posterior,

KL(q(β, γ) ‖ p(β, γ | y)) = ∫ q(β, γ) log [ q(β, γ) / p(β, γ | y) ] dβ dγ.   (6.4)

The search for such an optimal approximating distribution is done by some deterministic algorithm and thus requires much less computation than the MCMC

approach. Such algorithms are usually iterative and conceptually resemble the

EM (expectation-maximization) algorithm.

Methods A potential application of our ICF algorithm is a very recent variational method proposed by Huang et al. [2016]. It is based on an earlier variational

algorithm of Carbonetto and Stephens [2012], which was shown in the paper to be

able to produce accurate estimates for the hyperparameters under a wide range of

settings, although the individual PIP estimate was often off. The method of Huang

et al. [2016] could produce more accurate estimates for PIPs and has better convergence properties. The consistency was proved for an exponentially growing (w.r.t.

the sample size) number of covariates.

Recall the BVSR model specified by (4.2).

y | γ, β, X, τ ∼ MVN(X_γ β_γ, τ^{-1} I),
γ_j ∼ Bernoulli(π),   (6.5)
β_j | γ_j = 1, τ ∼ N(0, σ^2/τ),
β_j | γ_j = 0 ∼ δ_0.

For the time being, we treat the hyperparameters τ, π, σ^2 as fixed. Let μ_j and v_j be the mean and the variance of the normal distribution f_j in (6.3). Carbonetto and Stephens [2012] proposed to update (μ_j, v_j, φ_j) sequentially for each j in each iteration. Huang et al. [2016] showed that a better approach is to do a batch-wise update. In each iteration, they proposed to first update {v_j : j = 1, ..., p}, then {μ_j : j = 1, ..., p} and lastly {φ_j : j = 1, ..., p}. It turns out that the computational cost mainly comes from the updating of {μ_j : j = 1, ..., p}. Let

µ = (µ1, . . . , µN ) and Φ = diag(φ1, . . . , φN ). The updating equation for µ, at the k-th iteration, can be written as

μ = ( Φ^{(k)} X^t X Φ^{(k)} + n Φ^{(k)}(I − Φ^{(k)}) + σ^{-2} Φ^{(k)} )^{-1} Φ^{(k)} X^t y.   (6.6)

The complexity of the matrix inversion is O((n ∧ N)^3) (n is the sample size).

Huang et al. [2016] used the Woodbury identity to convert the problem into the inversion of a much smaller matrix. However, when the number of causal SNPs is very large, such inversions could still be very time-consuming. Let A_γ denote the submatrix (or subvector) of A that corresponds to the SNPs with φ_j > 0. We may rewrite (6.6) as

μ_γ = (Φ_γ^{(k)})^{-1} ( X_γ^t X_γ + n((Φ_γ^{(k)})^{-1} − I) + σ^{-2} (Φ_γ^{(k)})^{-1} )^{-1} X_γ^t y.   (6.7)

Since n((Φ_γ^{(k)})^{-1} − I) + σ^{-2} (Φ_γ^{(k)})^{-1} is a diagonal matrix, ICF can be applied. The

Cholesky decomposition of X_γ^t X_γ can still be obtained by updating. If the initial values {φ_j^{(0)} : j = 1, ..., p} are chosen appropriately so that they are not too far from the truth, by using ICF we also avoid computing the entire Gram matrix X^t X. In the BVSR model, we put a hyperprior on the hyperparameters

τ, π, σ2. To average over the hyperprior distributions and obtain the posterior inference for the hyperparameters, we can use the importance sampling approach proposed by Carbonetto and Stephens [2012].
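To make the batch update concrete, here is a minimal reference sketch that evaluates (6.6) with a plain dense solve; it is not the proposed ICF-based scheme, only a naive baseline against which that scheme could be checked. The dimensions, inclusion probabilities, and σ^2 are hypothetical toy values.

```python
import numpy as np

def update_mu_direct(X, y, phi, sigma2):
    """Naive evaluation of the batch mean update (6.6):
    mu = (Phi X'X Phi + n Phi(I - Phi) + sigma^{-2} Phi)^{-1} Phi X'y.
    The text proposes replacing this dense solve by ICF on the reduced system (6.7)."""
    n = X.shape[0]
    Phi = np.diag(phi)
    A = Phi @ X.T @ X @ Phi + n * Phi @ (np.eye(len(phi)) - Phi) + Phi / sigma2
    return np.linalg.solve(A, Phi @ X.T @ y)

# toy usage with hypothetical data; phi is kept away from 0 so A is invertible
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(200)
phi = np.clip(rng.uniform(size=50), 0.05, 0.95)
print(update_mu_direct(X, y, phi, sigma2=0.04)[:5])
```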

This novel algorithm could be extremely useful for problems like the heritability estimation of the Height dataset, where we have a very large number of causal

SNPs with only tiny effects. For the original algorithm, the PIPs of many causal

SNPs could be set to zero since they are too small. However, if we sum up the effects of these SNPs, the total effect is not negligible at all. By using ICF, we can accurately estimate these PIPs within an acceptable computational time. As shown in Chap. 3.5, when there are more than 1000 SNPs in the model, the speed advantage of ICF over all the other methods is extremely significant.

6.2.3 Extension of This Work to Categorical Phenotypes

Background In genetic studies, very often the phenotype is the case-control status and then the Bayesian linear regression model is not directly applicable. Hence the extension of our results to categorical phenotypes would be of high practical importance. The standard approach to analyzing categorical phenotypes by regression is to introduce a logit or a probit link function. For binary phenotypes,

this means

logit P(y_i = 1 | β) = log [ P(y_i = 1 | β) / P(y_i = 0 | β) ] = x_{(i)}^t β,   or   P(y_i = 1 | β) = Φ(x_{(i)}^t β),   (6.8)

where x_{(i)} = (1, x_{i1}, ..., x_{ip}), β = (β_0, β_1, ..., β_p), and Φ denotes the cumulative distribution function of the standard normal distribution. (Note that we cannot assume

β_0 = 0.) Unfortunately, the inference for either model is not easy due to the lack of a conjugate prior. In particular, β cannot be integrated out in the expression for the marginal likelihood. Take the logistic regression model as an example. Its marginal likelihood is given by

p(y) = ∫ Π_{i=1}^n { ( e^{x_{(i)}^t β} / (1 + e^{x_{(i)}^t β}) )^{y_i} ( 1 / (1 + e^{x_{(i)}^t β}) )^{1−y_i} } p(β) dβ.

The model (6.8) has another, more convenient formulation. Introduce latent variables z_1, ..., z_n such that y_i = I_{z_i > 0}. Then the logistic regression model is equivalent to stating that

z_i − x_{(i)}^t β ∼ Logistic,

and the probit model is equivalent to

z_i − x_{(i)}^t β ∼ N(0, 1).

Methods To extend our results on the null distribution of Bayes factors to the

binary phenotypes, we first need to work out a closed-form expression for the Bayes

factor. One solution is to use the Laplace approximation [Kass and Raftery, 1995],

which uses a Taylor expansion to approximate the marginal likelihood by

∫ p(y|β) p(β) dβ ≈ p(y|β̂) p(β̂) (2π)^{(p+1)/2} |Σ̂|^{-1/2},
where β̂ is the MAP (maximum a posteriori) estimator, Σ̂ = −D^2 l(β̂) is the negative Hessian, and l(β) = log p(y|β)p(β). However, the distribution of the corresponding Bayes factor is difficult to characterize. Another asymptotic approach, taken by Wakefield

[2009], makes use of the asymptotic normality of the maximum likelihood estimator and computes the Bayes factor as a function of the prior and the Wald test statistic. Nevertheless, this approach defeats the purpose of computing the p-value associated with the Bayes factor since it is always equal to the p-value of the Wald test. For the probit model, the marginal probability P(y_i = 1) can be computed exactly in a closed form. Let the prior for β be MVN(μ_β, V_β). Then,

since z_i | β ∼ N(x_{(i)}^t β, 1), we have

z_i ∼ N( x_{(i)}^t μ_β, 1 + x_{(i)}^t V_β x_{(i)} ),
P(y_i = 1) = P(z_i > 0) = Φ( x_{(i)}^t μ_β / √(1 + x_{(i)}^t V_β x_{(i)}) ).
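As a minimal check of this closed form, the snippet below compares it against a Monte Carlo estimate for a hypothetical covariate vector and prior (the specific μ_β and V_β are illustrative assumptions only).

```python
import numpy as np
from scipy.stats import norm

def probit_marginal_prob(x, mu_beta, V_beta):
    """Closed-form marginal P(y_i = 1) in the probit model with beta ~ MVN(mu_beta, V_beta)."""
    m = x @ mu_beta
    v = 1.0 + x @ V_beta @ x
    return norm.cdf(m / np.sqrt(v))

# Monte Carlo verification with a hypothetical prior and covariate vector
rng = np.random.default_rng(6)
p = 3
x = rng.standard_normal(p)
mu = np.array([0.2, -0.1, 0.3])
V = 0.25 * np.eye(p)
beta = rng.multivariate_normal(mu, V, size=200_000)
z = beta @ x + rng.standard_normal(200_000)
print(probit_marginal_prob(x, mu, V), np.mean(z > 0))   # should agree closely
```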

The computation of the Bayes factor requires a numerical integral of the Bayes factor for the linear regression model (integrating out z). There is no closed-form

expression unless all the observations are exactly independent, i.e., x_{(i)}^t x_{(j)} = 0 for all i ≠ j. Overall, the null distribution of the Bayes factor for binary phenotypes is a very challenging problem and some novel method would be necessary.

Extending BVSR to binary phenotypes is an easier task. For the probit model one method has already been given in Guan and Stephens [2011]. Using the latent variable model, compared with the BVSR model for quantitative phenotypes, we only need an additional sampling of z in each MCMC iteration. This

method appears to be a variant of the Gibbs sampler of Albert and Chib [1993], which may be viewed as the default choice for a Bayesian analysis with binary outcomes. For the logistic regression model, similar algorithms could be developed, using the t-distribution approximation to the logistic distribution proposed by Albert and Chib [1993]. Nonetheless, the additional update of z in MCMC implies that the mixing of the Markov chain is more difficult for binary phenotypes than for quantitative ones. Hence, to achieve convergence or accurate posterior inferences, MCMC needs to be run for more iterations for binary phenotypes. It remains a challenge to develop better MCMC algorithms for variable selection with binary phenotypes.

Chapter 7

Appendices

7.1 Linear Algebra Results

The readers are assumed to have an elementary knowledge of linear algebra. Notations that may be confusing are explained when first used and could also be found at the beginning of this work. The goal of this section is to introduce some known linear algebra results that will be used in the development of our theory.

All vectors and matrices are assumed to be real unless otherwise stated.

7.1.1 Some Matrix Identities

Lemma 7.1. (Block matrix inversion formula) Let A = [A_11, A_12; A_21, A_22] be an invertible partitioned matrix such that both A_11 and A_22 are square. If both A_11

and S = A_22 − A_21 A_11^{-1} A_12 are non-singular, then

A^{-1} = [ A_11^{-1} + A_11^{-1} A_12 S^{-1} A_21 A_11^{-1},  −A_11^{-1} A_12 S^{-1};  −S^{-1} A_21 A_11^{-1},  S^{-1} ].

S is called the Schur complement of A_11 [Hogben, 2006, Part I, Chap. 10].

The formula can be proved by directly checking that AA^{-1} = I. By symmetry,

A_11^{-1} + A_11^{-1} A_12 S^{-1} A_21 A_11^{-1} must be equal to the inverse of the Schur complement

of A22 provided that it exists. This is known as the Woodbury matrix identity.

Lemma 7.2. (Woodbury matrix identity) If both A and S are square matrices,

then

(A + USV)^{-1} = A^{-1} − A^{-1} U (S^{-1} + V A^{-1} U)^{-1} V A^{-1},

provided that U and V have conformable sizes and the inverses involved ex-

ist [Harville, 1997, Chap. 18].

Another way to prove this is to check that the product of the l.h.s and the r.h.s

is just the identity matrix. See also Press [2007, Chap. 2] for more information.
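A minimal numerical check of the Woodbury identity is sketched below; the random test matrices are arbitrary illustrative choices, with A diagonal only to make its inverse trivial.

```python
import numpy as np

def woodbury_inverse(A_inv, U, S, V):
    """Right-hand side of the Woodbury identity:
    (A + U S V)^{-1} = A^{-1} - A^{-1} U (S^{-1} + V A^{-1} U)^{-1} V A^{-1}."""
    inner = np.linalg.inv(np.linalg.inv(S) + V @ A_inv @ U)
    return A_inv - A_inv @ U @ inner @ V @ A_inv

# numerical check on random matrices of conformable sizes
rng = np.random.default_rng(5)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))
U, V = rng.standard_normal((n, k)), rng.standard_normal((k, n))
S = np.diag(rng.uniform(1.0, 2.0, size=k))
lhs = np.linalg.inv(A + U @ S @ V)
rhs = woodbury_inverse(np.linalg.inv(A), U, S, V)
print(np.allclose(lhs, rhs))   # expected: True
```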

Lemma 7.3. Let A be the partitioned matrix as given in Lemma 7.1. If both A11

and A22 are non-singular, then [Hogben, 2006, Part I, Chap. 10]

|A| = |A_11| · |A_22 − A_21 A_11^{-1} A_12| = |A_22| · |A_11 − A_12 A_22^{-1} A_21|.

Proof. The first equality can be proved by using the following decomposition

[ I, 0; −A_21 A_11^{-1}, I ] [ A_11, A_12; A_21, A_22 ] [ I, −A_11^{-1} A_12; 0, I ] = [ A_11, 0; 0, A_22 − A_21 A_11^{-1} A_12 ].

The determinant of the l.h.s is simply |A| and the determinant of the r.h.s is

|A_11| · |A_22 − A_21 A_11^{-1} A_12|. The second equality can be checked similarly.

Lemma 7.4. (Sylvester’s determinant formula) If A is an n × m matrix and B

is an m × n matrix, then

|In + AB| = |Im + BA|

where |·| denotes the determinant and In denotes an n×n identity matrix [Sylvester, 1851].

Proof. Consider the partitioned matrix M = [ I_n, A; −B, I_m ]. By Lemma 7.3,

|M| = |In + AB| = |Im + BA|.

Lemma 7.5. Let A + iB be a complex square matrix where both A and B are real. If A and (A + BA^{-1}B) are invertible, then

(A + iB)^{-1} = (A + BA^{-1}B)^{-1} − i A^{-1} B (A + BA^{-1}B)^{-1}.

This can be proved by calculating the real and the imaginary parts of (A + iB)(A + iB)^{-1}.
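A short numerical check of Lemma 7.5 (the complex-factorization identity underlying ICF) is sketched below; the test matrices are arbitrary illustrative choices satisfying the invertibility conditions with high probability.

```python
import numpy as np

# Numerical check of Lemma 7.5 on random real matrices A and B
rng = np.random.default_rng(7)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # invertible with high probability
B = rng.standard_normal((n, n))
A_inv = np.linalg.inv(A)
M = np.linalg.inv(A + B @ A_inv @ B)                # (A + B A^{-1} B)^{-1}
lhs = np.linalg.inv(A + 1j * B)
rhs = M - 1j * (A_inv @ B @ M)
print(np.allclose(lhs, rhs))                        # expected: True
```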

7.1.2 Singular Value Decomposition and Pseudoinverse

Any matrix, real or complex, admits a factorization called the singular value decomposition (SVD). Before we state the form of the SVD, we first review some terminologies

for complex matrices. We use M ∗ to denote the conjugate transpose of matrix

M.

Definition 7.1. (a) A complex square matrix M is said to be Hermitian if M =

M ∗.

(b) A complex square matrix M is said to be skew-Hermitian if M = −M ∗.

(c) A complex square matrix M is said to be unitary if MM ∗ = I.

Theorem 7.6. (Singular value decomposition) Let M be an arbitrary n × p complex matrix. Then there exist two unitary matrices U, V and a “rectangular

diagonal” matrix D of size n × p such that

  diag(d , . . . , d ) 0 ∗  1 r  M = UDV , D =   0 0

where diag(d1, . . . , dr) denotes a diagonal matrix with real diagonal elements d1, . . . , dr >

0. The singular values d1, . . . , dr are determined uniquely up to permutation and r is equal to the rank of M.

See Allaire et al. [2008, Chap. 2.7], Serre [2002, Chap. 7.7], Harville [1997,

Chap. 21.12], etc. for proofs and more information. SVD can be used to define

the pseudoinverse for any matrix.

Definition 7.2. (Moore-Penrose pseudoinverse) The Moore-Penrose pseu-

doinverse of any complex matrix M with SVD M = UDV ∗ is denoted by M +

and defined as [Allaire et al., 2008, Chap. 2.7]

  diag(d−1, . . . , d−1) 0 + def. + ∗ +  1 r  M = VD U , D =   . 0 0

Proposition 7.7. The Moore-Penrose pseudoinverse M + has following proper-

ties.

(a) MM +M = M;

(b) M +MM + = M +;

(c) MM + = (MM +)∗ ;

170 (d) M +M = (M +M)∗ ;

(e) (M +)+ = M;

(f) M + = (M ∗M)+M ∗;

(g) M + = M ∗(MM ∗)+;

(h) if M is invertible, M + = M −1.

Proof. We only prove part (a) and (f). The rest can be checked easily in similar ways.

(a) MM +M = UDV ∗VD+U ∗UDV ∗ = UDD+DV ∗ = UDV ∗ = M.

+ + ∗ (h) If M is invertible, then D = diag(d1, . . . , dr) and thus MM = UDD U = UU ∗ = I. Since the matrix inversion is unique, we must have M + = M −1.

These properties explain why M + is called pseudoinverse. In fact, it can be shown that the matrix M + satisfying properties (a) to (d) is unique. See Serre

[2002, Chap. 8.4] and Harville [1997, Chap. 20], for proofs.

7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition

We first review the definitions of eigenvalue and eigenvector.

Definition 7.3. Let M be a p × p complex matrix. λ is called an eigenvalue of

M if there exists a nonzero vector u such that Mu = λu. u is then called the corresponding eigenvector.

Immediately we have the following lemma.

Lemma 7.8. Let $A$ be an $n \times p$ matrix and $B$ be a $p \times n$ matrix. If $\lambda \neq 0$ is an eigenvalue of $AB$, then it is also an eigenvalue of $BA$.

Proof. By definition, there exists a nonzero vector $u$ such that $ABu = \lambda u$. Multiplying both sides by $B$ gives
$$BA(Bu) = \lambda(Bu).$$
We claim that $Bu \neq 0$, i.e., $Bu$ is an eigenvector of $BA$ with corresponding eigenvalue $\lambda$. We prove this by contradiction. If $Bu = 0$, we have $ABu = 0 = \lambda u$. However, since $\lambda \neq 0$, this would imply $u = 0$, a contradiction.

Clearly we can always assume the eigenvector is normalized so that $\|u\|_2 = 1$. Using eigenvalues and eigenvectors, some matrices admit a factorization which is usually referred to as the spectral decomposition or eigendecomposition. For the purposes of this thesis, we only focus on a special class of matrices called normal matrices.

Definition 7.4. A complex square matrix $M$ is said to be normal if $MM^* = M^*M$.

Clearly, unitary matrices, Hermitian matrices and skew-Hermitian matrices

are normal. For real matrices, they correspond to orthogonal matrices, symmetric

matrices and skew-symmetric matrices respectively.

Theorem 7.9. (Spectral decomposition for normal matrices) If a square matrix $M$ is normal, it admits the factorization
$$M = U\Lambda U^*,$$
where $U$ is unitary and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$. Each $(\lambda_i, u_i)$ is an eigenvalue-eigenvector pair of the matrix $M$ ($u_i$ is the $i$-th column of $U$), but $\lambda_1, \dots, \lambda_p$ are not necessarily distinct.

The set (in fact, multiset) $\{\lambda_1, \dots, \lambda_p\}$ is called the spectrum of $M$. It is unique up to permutation. The number of times that an eigenvalue appears in the spectrum is called its multiplicity. To see that $(\lambda_i, u_i)$ is an eigenvalue-eigenvector pair, notice that the decomposition is equivalent to $MU = U\Lambda$. For a formal proof, see Trefethen and Bau III [1997, Chap. 24] or Serre [2002, Chap. 3]. When all the eigenvalues are nonnegative, by convention we assume they are ordered so that $\lambda_1 \geq \cdots \geq \lambda_p$.

Consider a normal matrix $M = XX^*$. Let the SVD of $X$ be $UDV^*$. Then the SVD for $M$ is
$$M = UDD^*U^*. \qquad (7.1)$$
Since $U$ is unitary and $DD^*$ is diagonal with nonnegative entries, this is also the spectral decomposition for $M$. Hence the nonzero singular values of $M$ coincide with its nonzero eigenvalues. Similarly one can show that if $M = -XX^*$, the nonzero singular values of $M$ are equal to the absolute values of its nonzero eigenvalues. However, for a general square matrix, the singular values are not equal, in absolute value, to the eigenvalues.

Proposition 7.10. For a $p \times p$ normal matrix $M$ with eigenvalues $\lambda_1, \dots, \lambda_p$, counted with multiplicity [Trefethen and Bau III, 1997, Chap. 24],
$$|M| = \prod_{i=1}^{p}\lambda_i, \qquad \mathrm{tr}(M) = \sum_{i=1}^{p}\lambda_i, \qquad \mathrm{rank}(M) = \sum_{i=1}^{p}\mathbf{1}_{(0,\infty)}(|\lambda_i|),$$
where $\mathrm{tr}(M) = \sum_{i=1}^{p}M_{ii}$ denotes the trace of $M$, and $\mathbf{1}_{(0,\infty)}(|\lambda_i|)$ is the indicator function that equals 1 if $\lambda_i \neq 0$ and 0 otherwise.

Proof. Let $U\Lambda U^*$ be the spectral decomposition of $M$. Then $|M| = |U\Lambda U^*| = |\Lambda| = \prod_{i=1}^{p}\lambda_i$. Similarly, $\mathrm{tr}(M) = \mathrm{tr}(U\Lambda U^*) = \mathrm{tr}(U^*U\Lambda) = \mathrm{tr}(\Lambda)$ (the trace is invariant under cyclic permutations), and $\mathrm{rank}(M) = \mathrm{rank}(U\Lambda U^*) = \mathrm{rank}(\Lambda)$.

The eigenvalues of a real matrix are not necessarily real. However, when the

matrix is Hermitian, the eigenvalues are always real. Furthermore, if the matrix

is positive definite, the eigenvalues are positive. The properties of the eigenvalues

of some special matrices are summarized in the following proposition.

Proposition 7.11. Let $M$ be a $p \times p$ matrix and $\lambda$ be an arbitrary eigenvalue of it.

(a) If $M$ is unitary, $|\lambda| = 1$.
(b) If $M$ is idempotent, i.e., $MM = M$, then $\lambda$ is either 0 or 1.
(c) If $M$ is Hermitian, $\lambda$ is real.
(d) If $M$ is skew-Hermitian, $\lambda$ is either 0 or purely imaginary; if in addition $M$ is real, $\bar{\lambda}$ is also an eigenvalue of $M$, so $M$ has at least one zero eigenvalue when $p$ is odd.
(e) If $M$ is positive definite, $\lambda > 0$.
(f) If $M$ is positive semi-definite, $\lambda \geq 0$.

Proof. Let $u$ be the corresponding eigenvector for $\lambda$, so that $Mu = \lambda u$.

(a) $\|\lambda u\|_2^2 = |\lambda|^2u^*u = u^*M^*Mu = u^*u$. Thus $|\lambda| = 1$.

(b) $\lambda u = Mu = M^2u = \lambda Mu = \lambda^2u$. Since $u$ is nonzero, $\lambda = 0$ or 1.

(c) On one hand, $(Mu)^*u = u^*M^*u = u^*Mu = \lambda u^*u$. On the other, $(Mu)^*u = (\lambda u)^*u = \bar{\lambda}u^*u$. Therefore $(\lambda - \bar{\lambda})u^*u = 0$. Because $u$ is nonzero, we must have $\lambda = \bar{\lambda}$.

(d) Using the same argument, we obtain $(\lambda + \bar{\lambda})u^*u = 0$, which implies the real part of $\lambda$ is 0. When $M$ is real, writing $u = \Re(u) + i\Im(u)$, it is easy to show that $M\bar{u} = \bar{\lambda}\bar{u}$.

(e) Since $M$ is positive definite, $u^*Mu = \lambda u^*u > 0$. Thus $\lambda > 0$.

(f) The same argument shows $\lambda \geq 0$.

At last we point out that spectral decomposition provides a simple approach to calculating the $n$th root of a square matrix.

Lemma 7.12. Let $M$ be a positive semi-definite matrix with spectral decomposition $U\Lambda U^*$. Then its square root is given by $M^{1/2} = U\Lambda^{1/2}U^*$.

It is easy to check that $M = M^{1/2}M^{1/2}$. The proof of uniqueness can be found in Harville [1997, Chap. 21.9] (for real matrices) and Serre [2002, Chap. 7.1] (for complex matrices).
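The following is a minimal numerical sketch (Python; the toy matrix is an arbitrary positive semi-definite example) of Lemma 7.12, computing the square root through the eigendecomposition:

import numpy as np

# Square root of a positive semi-definite matrix via its spectral decomposition.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
M = X @ X.T                                # positive semi-definite, rank 3

lam, U = np.linalg.eigh(M)                 # eigendecomposition of a symmetric matrix
lam = np.clip(lam, 0.0, None)              # guard against tiny negative round-off
M_half = U @ np.diag(np.sqrt(lam)) @ U.T

assert np.allclose(M_half @ M_half, M)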

7.1.4 Orthogonal Projection Matrices

Let $X$ be an $n \times p$ real matrix. Define
$$H_X \overset{\text{def.}}{=} X(X^tX)^+X^t. \qquad (7.2)$$
By the pseudoinverse introduced previously, immediately we have

Lemma 7.13. Let $X = UDV^t$ be the singular value decomposition of $X$. Then
$$H_X = UDD^+U^t,$$
where $DD^+ = \mathrm{diag}(1, \dots, 1, 0, \dots, 0)$. The number of 1's is equal to the rank of $X$.

$H_X$ is called an orthogonal projection matrix. In traditional linear regression, it is also called the hat matrix since it maps the response vector to its fitted values under least squares.

Proposition 7.14. Assume $X$ is an $n \times p$ matrix with $n \geq p$ and $\mathrm{rank}(X) = p$. For any $n$-vector $y$, we have
$$(X^tX)^{-1}X^ty = \arg\min_{\beta \in \mathbb{R}^p}\|y - X\beta\|_2 \overset{\text{def.}}{=} \hat{\beta},$$
where $\|\cdot\|_2$ denotes the $\ell^2$-norm.

Proof. Since $\|y - X\beta\|_2^2 = (y - X\beta)^t(y - X\beta)$, we have
$$\frac{\partial\|y - X\beta\|_2^2}{\partial\beta} = 2X^tX\beta - 2X^ty.$$
Setting the derivative equal to 0, we obtain the expression for $\hat{\beta}$. Since the second derivative matrix $2X^tX$ is positive definite, $\hat{\beta}$ indeed minimizes $\|y - X\beta\|_2$. (For matrix differentiation, see for example Mardia et al. [1980, Appx. A].)

Proposition 7.14 implies that $H_Xy$ is the projection of the vector $y$ onto the column space of $X$. In fact, when $n < p$ or $X$ is rank deficient, the claim still holds since $\beta = (X^tX)^+X^ty$ satisfies $X^tX\beta = X^ty$, though the solution is no longer unique. The next proposition gives some important properties of $H_X$. For more information, see for example Harville [1997, Chap. 12] and Hogben [2006, Part I, Chap. 5].

Proposition 7.15. Let $H_X$ be a matrix as defined in (7.2). Then,

(a) $H_X$ is symmetric;
(b) $H_X$ is idempotent, i.e., $H_X^2 = H_X$;
(c) $H_XX = X$;
(d) $\mathrm{rank}(H_X) = \mathrm{rank}(X)$;
(e) $\mathrm{tr}(H_X) = \mathrm{rank}(H_X)$;
(f) $I - H_X$ is symmetric and idempotent;
(g) $\mathrm{rank}(I - H_X) = \mathrm{tr}(I - H_X) = n - \mathrm{rank}(H_X)$;
(h) $I - H_X$ is an orthogonal projection matrix.

Proof. (a) The symmetry follows from the definition of $H_X$ and Proposition 7.7 (d).

(b) By Proposition 7.7 (b), $H_X^2 = [X(X^tX)^+X^t][X(X^tX)^+X^t] = H_X$.

(c) By Proposition 7.7 (a) and (f), $H_XX = X(X^tX)^+X^tX = XX^+X = X$.

(d) By the definition of rank, it is equivalent to prove that the column spaces of $X$ and $H_X$ are identical. First, let $Xv$ ($v \in \mathbb{R}^p$) be a vector in the column space of $X$. By part (c), $Xv = H_X(Xv)$, which implies it is also in the column space of $H_X$. Second, let $H_Xv$ be a vector in the column space of $H_X$. By definition, $H_Xv = X[(X^tX)^+X^tv]$. Hence it is also in the column space of $X$. Combining the two arguments, we arrive at the conclusion $\mathrm{rank}(H_X) = \mathrm{rank}(X)$.

(e) By Proposition 7.10 and Proposition 7.11 (b), $H_X$ has $\mathrm{rank}(H_X)$ eigenvalues equal to 1 (counted with multiplicity) and $n - \mathrm{rank}(H_X)$ zero eigenvalues. Thus for $H_X$, the trace is equal to the rank.

(f) The symmetry of $I - H_X$ is self-evident. By part (b), $(I - H_X)(I - H_X) = I - 2H_X + H_X^2 = I - H_X$.

(g) Both $H_X$ and $I - H_X$ admit a spectral decomposition. Let the spectral decomposition of $H_X$ be $U\Lambda U^t$. Since $UU^t = I$, $I - H_X = U(I - \Lambda)U^t$. The result then follows.

(h) By part (f) and Proposition 7.7 (a), $I - H_X$ can be written in the following form that defines an orthogonal projection matrix,
$$I - H_X = (I - H_X)[(I - H_X)^t(I - H_X)]^+(I - H_X)^t.$$
In fact, any symmetric and idempotent matrix is an orthogonal projection matrix.

Clearly, any $n$-vector $y$ can be decomposed as $y = H_Xy + (I - H_X)y$, i.e., into the projection of $y$ onto the column space of $X$ and the projection of $y$ onto the orthogonal complement of that space. In linear regression, $(I - H_X)y$ is the vector of residuals from the least squares fit.
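A small numerical sketch (Python, arbitrary toy data) of this decomposition and of the connection to least squares in Proposition 7.14:

import numpy as np

# The hat matrix H_X projects y onto the column space of X; (I - H_X)y gives the residuals.
rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.pinv(X.T @ X) @ X.T        # H_X = X (X^t X)^+ X^t, cf. (7.2)
fitted = H @ y
resid = (np.eye(n) - H) @ y

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(fitted, X @ beta_hat)     # H_X y equals the least squares fit
assert np.allclose(fitted + resid, y)        # decomposition of y into two pieces
assert np.isclose(fitted @ resid, 0.0)       # the two pieces are orthogonal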

At last, we prove an equality concerning the projection matrices that will be very useful in the restricted maximum likelihood inference.

Lemma 7.16. Let $X$ be a full-rank $n \times p$ matrix and $L$ be a full-rank $n \times (n-p)$ matrix such that $L^tX = 0$. For any positive definite $n \times n$ matrix $V$, we have
$$L(L^tVL)^{-1}L^t = V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}.$$
If $X$ does not have full rank, the equality still holds with $(X^tV^{-1}X)^{-1}$ replaced by $(X^tV^{-1}X)^+$.

Proof. To prove this, just notice that
$$H_{V^{1/2}L} = V^{1/2}L(L^tVL)^{-1}L^tV^{1/2}, \qquad H_{V^{-1/2}X} = V^{-1/2}X(X^tV^{-1}X)^{-1}X^tV^{-1/2}.$$
Thus the lemma could be rewritten as $H_{V^{1/2}L} = I - H_{V^{-1/2}X}$. Clearly, the two matrices $V^{1/2}L$ and $V^{-1/2}X$ are orthogonal in the sense that $(V^{1/2}L)^tV^{-1/2}X = 0$. Since they have rank equal to $n-p$ and $p$ respectively, their column spaces must be orthogonal complements of each other in the vector space $\mathbb{R}^n$.

7.2 Bayesian Linear Regression

Consider the linear regression model
$$y = X\beta + \varepsilon, \qquad (7.3)$$
where $y = (y_1, \dots, y_n)$ is the response vector, $X$ is an $n \times p$ design matrix and $\beta$ is a $p$-vector called the regression coefficients. The errors $\varepsilon_1, \dots, \varepsilon_n$ are assumed to be i.i.d. normal random variables with mean 0 and variance $\tau^{-1}$, i.e.,
$$\varepsilon \mid \tau \sim \mathrm{MVN}(0, \tau^{-1}I).$$
Due to the normal error assumption, (7.3) is also referred to as the normal linear model, and can be equivalently written as
$$y \mid \beta, \tau \sim \mathrm{MVN}(X\beta, \tau^{-1}I). \qquad (7.4)$$

For a full exposition in book form of the Bayesian treatment of the normal linear model, see, for example, O'Hagan and Forster [2004, Chap. 9], Hoff [2009, Chap. 9], Koch [2007, Chap. 4], and Gelman et al. [2014, Chap. 14]. For readers who are not familiar with Bayesian methodology, more introductory material can also be found in these books.

7.2.1 Posterior Distributions for the Conjugate Priors

Throughout this thesis, only the family of conjugate priors is considered. We have two parameters in (7.4), $\beta$ and $\tau$, and $\beta$ is usually of direct interest. The conjugate prior for $\beta$ is a multivariate normal distribution. The error precision, $\tau$, can be treated as either known or unknown.

Known error variance If $\tau$ is known, the prior for model (7.4) is simply
$$\beta \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}V), \qquad (7.5)$$
where $V$ is a positive definite matrix. More generally, we could specify a nonzero prior mean for $\beta$, but this is rarely done in practice (see Jeffreys [1961, Chap. 5] for more reasons). The posterior distribution of $\beta$ is still normal, since
$$f(\beta \mid y, \tau, V) \propto f(y \mid \beta, \tau)f(\beta \mid \tau, V) = \frac{\tau^{(n+p)/2}}{(2\pi)^{(n+p)/2}}|V|^{-1/2}\exp\Big\{-\frac{\tau}{2}\big[(y - X\beta)^t(y - X\beta) + \beta^tV^{-1}\beta\big]\Big\} \propto \exp\Big\{-\frac{\tau}{2}\big[\beta^t(X^tX + V^{-1})\beta - 2\beta^tX^ty\big]\Big\}.$$

This is the normal density kernel corresponding to the posterior distribution

$$\beta \mid y, \tau, V \sim \mathrm{MVN}\big((X^tX + V^{-1})^{-1}X^ty,\ \tau^{-1}(X^tX + V^{-1})^{-1}\big).$$
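A minimal sketch (Python) of this known-variance conjugate update on simulated toy data; the prior $V = 10I$ and the true coefficients are arbitrary illustrative choices:

import numpy as np

# Posterior of beta: MVN((X^tX + V^{-1})^{-1} X^t y, tau^{-1}(X^tX + V^{-1})^{-1}).
rng = np.random.default_rng(4)
n, p, tau = 100, 3, 2.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0])
y = X @ beta_true + rng.normal(scale=1.0 / np.sqrt(tau), size=n)

V = 10.0 * np.eye(p)                          # prior covariance (up to tau^{-1})
A = X.T @ X + np.linalg.inv(V)
post_mean = np.linalg.solve(A, X.T @ y)
post_cov = np.linalg.inv(A) / tau

print("posterior mean:", post_mean)
print("posterior sd:  ", np.sqrt(np.diag(post_cov)))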

Unknown error variance If $\tau$ is unknown, we consider the following normal-inverse-gamma conjugate prior
$$\beta \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}V), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2). \qquad (7.6)$$
The gamma distribution is in the shape-rate parameterization. It is called the normal-inverse-gamma prior since the prior for the error variance ($\tau^{-1}$) is an inverse-gamma distribution. Thus the prior density is given by
$$f(\beta, \tau) = \frac{(\kappa_2/2)^{\kappa_1/2}}{(2\pi)^{p/2}\Gamma(\kappa_1/2)}|V|^{-1/2}\tau^{(p+\kappa_1-2)/2}\exp\big\{-(\kappa_2 + \beta^tV^{-1}\beta)\tau/2\big\}.$$

Under the prior (7.6), we have
$$y \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}(XVX^t + I)),$$
which leads to the marginal likelihood (after integrating out $\beta$),
$$f(y \mid \tau, V) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\exp\Big\{-\frac{\tau}{2}y^t(XVX^t + I)^{-1}y\Big\}. \qquad (7.7)$$
Hence,
$$f(\tau \mid y, V) \propto f(y \mid \tau, V)f(\tau \mid \kappa_1, \kappa_2) \propto \tau^{(n+\kappa_1-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\},$$
which shows that
$$\tau \mid y, \kappa_1, \kappa_2 \sim \mathrm{Gamma}\big((n+\kappa_1)/2,\ [y^t(XVX^t + I)^{-1}y + \kappa_2]/2\big).$$

Non-informative prior In practice, usually there is no information guiding the choice of $\kappa_1$ and $\kappa_2$, and thus the non-informative prior is preferred. The most widely used such prior is the Jeffreys prior,
$$f(\tau) \propto 1/\tau. \qquad (7.8)$$
It is improper since the integral of the density function is not finite. However, it can be viewed as the limit of a sequence of proper gamma priors for $\tau$ and thus written as
$$\tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \quad \kappa_1 \downarrow 0,\ \kappa_2 \downarrow 0. \qquad (7.9)$$
The posterior for $\tau$ is still proper and given by
$$\tau \mid y \sim \mathrm{Gamma}\big(n/2,\ y^t(XVX^t + I)^{-1}y/2\big).$$

7.2.2 Bayes Factors for Bayesian Linear Regression

The Bayes factor is defined as the ratio of the marginal likelihoods of two models.

In practice, it suffices to calculate only the null-based Bayes factor, which is defined as
$$\mathrm{BF}_{\mathrm{null}}(M) \overset{\text{def.}}{=} \frac{f(y \mid M)}{f(y \mid M_0)}, \qquad (7.10)$$
where $M$ denotes the model of interest specified by (7.4), (7.5) and (7.6), and $M_0$ denotes the null model where we assume $\beta = 0$. Explicitly, if $\tau$ is known, the null model is simply
$$y \sim \mathrm{MVN}(0, \tau^{-1}I);$$
if $\tau$ is unknown, the null model is
$$y \sim \mathrm{MVN}(0, \tau^{-1}I), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2).$$
Clearly, if we want to compare two non-null models $M_1$ and $M_2$, the corresponding Bayes factor is the ratio of the two null-based Bayes factors,
$$\mathrm{BF}(M_1 : M_2) \overset{\text{def.}}{=} \frac{f(y \mid M_1)}{f(y \mid M_2)} = \frac{f(y \mid M_1)/f(y \mid M_0)}{f(y \mid M_2)/f(y \mid M_0)} = \frac{\mathrm{BF}_{\mathrm{null}}(M_1)}{\mathrm{BF}_{\mathrm{null}}(M_2)}.$$
We will use the model parameters, including the design matrix $X$, to denote a model. For example, if $\tau$ is unknown, we write $M = (X, V, \kappa_1, \kappa_2)$.

Known error variance If $\tau$ is known, the null-based Bayes factor can be computed from (7.7):
$$\mathrm{BF}_{\mathrm{null}}(X, \tau, V) = \frac{f(y \mid \tau, V)}{f(y \mid \tau)} = |I + XVX^t|^{-1/2}\exp\Big\{\frac{\tau}{2}\big[y^ty - y^t(XVX^t + I)^{-1}y\big]\Big\}.$$
By Lemma 7.2,
$$y^t(XVX^t + I)^{-1}y = y^ty - y^tX(X^tX + V^{-1})^{-1}X^ty.$$
Hence,
$$\mathrm{BF}_{\mathrm{null}}(X, \tau, V) = |I + X^tXV|^{-1/2}\exp\Big\{\frac{\tau}{2}y^tX(X^tX + V^{-1})^{-1}X^ty\Big\},$$
where we have also applied Lemma 7.4.

Unknown error variance If $\tau$ is unknown, all we need is to integrate out $\tau$ from (7.7):
$$f(y \mid V, \kappa_1, \kappa_2) = \int f(y \mid \tau, V)f(\tau \mid \kappa_1, \kappa_2)\,d\tau = \frac{(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\int \tau^{(n+\kappa_1-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\}d\tau$$
$$= \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\Big\{\frac{1}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\}^{-(n+\kappa_1)/2}.$$
Similarly, for the null model,
$$f(y \mid \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}\Big[\frac{1}{2}(y^ty + \kappa_2)\Big]^{-(n+\kappa_1)/2}.$$
Hence,
$$\mathrm{BF}_{\mathrm{null}}(X, V, \kappa_1, \kappa_2) = |I + X^tXV|^{-1/2}\left(\frac{y^t(XVX^t + I)^{-1}y + \kappa_2}{y^ty + \kappa_2}\right)^{-(n+\kappa_1)/2}. \qquad (7.11)$$

Non-informative prior Under the non-informative prior (7.8), the Bayes factor is still proper since the "improper" normalizing constants cancel out, and is given by
$$\mathrm{BF}_{\mathrm{null}}(X, V) = |I + X^tXV|^{-1/2}\left(\frac{y^t(XVX^t + I)^{-1}y}{y^ty}\right)^{-n/2}.$$
By comparing with (7.11), we have
$$\lim_{\kappa_1, \kappa_2 \downarrow 0}\mathrm{BF}_{\mathrm{null}}(X, V, \kappa_1, \kappa_2) = \mathrm{BF}_{\mathrm{null}}(X, V).$$
Therefore the Bayes factor given above can be viewed as the limit of a sequence of Bayes factors for proper priors. By Lemma 7.2, we can rewrite it as
$$\mathrm{BF}_{\mathrm{null}}(X, V) = |I + X^tXV|^{-1/2}\left(\frac{y^ty - y^tX(X^tX + V^{-1})^{-1}X^ty}{y^ty}\right)^{-n/2}. \qquad (7.12)$$
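The closed form (7.12) is straightforward to evaluate numerically. The sketch below (Python, toy data; the prior $V = I$ and the simulated effect sizes are arbitrary choices) computes the log of $\mathrm{BF}_{\mathrm{null}}(X, V)$ for numerical stability:

import numpy as np

# log of the null-based Bayes factor (7.12) under the non-informative prior on tau.
def log_bf_null(y, X, V):
    n = len(y)
    XtX = X.T @ X
    quad = y @ X @ np.linalg.solve(XtX + np.linalg.inv(V), X.T @ y)  # y^t X (X^tX + V^{-1})^{-1} X^t y
    _, logdet = np.linalg.slogdet(np.eye(X.shape[1]) + XtX @ V)
    return -0.5 * logdet - 0.5 * n * np.log1p(-quad / (y @ y))

rng = np.random.default_rng(5)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = X @ np.array([0.3, 0.0]) + rng.normal(size=n)
print("log BF_null:", log_bf_null(y, X, np.eye(p)))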

7.2.3 Controlling for Confounding Covariates

Consider the linear regression model
$$y = Wa + Lb + \varepsilon, \qquad \varepsilon \mid \tau \sim \mathrm{MVN}(0, \tau^{-1}I),$$
where $L$ is an $n \times p$ matrix that represents the covariates of interest and $W$ is an $n \times q$ matrix representing the covariates to be controlled for, including the intercept term. Equivalently this model can be written as
$$y \mid a, b, \tau \sim \mathrm{MVN}(Wa + Lb, \tau^{-1}I). \qquad (7.13)$$
When calculating the null-based Bayes factor, the null model becomes
$$y \mid a, \tau \sim \mathrm{MVN}(Wa, \tau^{-1}I). \qquad (7.14)$$

The Bayes factor with proper conjugate prior Suppose the error variance is unknown and use the following conjugate prior for model (7.13) and model (7.14),
$$a \mid \tau, V_a \sim \mathrm{MVN}(0, \tau^{-1}V_a), \qquad b \mid \tau, V_b \sim \mathrm{MVN}(0, \tau^{-1}V_b), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \qquad (7.15)$$
where both $V_a$ and $V_b$ are positive definite. To simplify the notation, define
$$\Sigma_0 \overset{\text{def.}}{=} (WV_aW^t + I)^{-1}, \qquad \Sigma_1 \overset{\text{def.}}{=} (WV_aW^t + LV_bL^t + I)^{-1}.$$

According to our previous calculations, we can obtain the marginal likelihoods,
$$f(y \mid V_a, \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|\Sigma_0|^{1/2}\Big[\frac{1}{2}(y^t\Sigma_0y + \kappa_2)\Big]^{-(n+\kappa_1)/2},$$
$$f(y \mid V_a, V_b, \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|\Sigma_1|^{1/2}\Big[\frac{1}{2}(y^t\Sigma_1y + \kappa_2)\Big]^{-(n+\kappa_1)/2}.$$
Hence, the Bayes factor is
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_a, V_b, \kappa_1, \kappa_2) = \frac{|\Sigma_1|^{1/2}}{|\Sigma_0|^{1/2}}\left(\frac{y^t\Sigma_1y + \kappa_2}{y^t\Sigma_0y + \kappa_2}\right)^{-(n+\kappa_1)/2}.$$

The Bayes factor with non-informative prior By letting $V_a^{-1} \to 0$ and $\kappa_1, \kappa_2 \to 0$, we obtain the non-informative prior
$$b \mid \tau, V_b \sim \mathrm{MVN}(0, \tau^{-1}V_b), \qquad f(a, \tau) \propto \tau^{(q-2)/2}, \qquad (7.16)$$
which is the Jeffreys prior for $(a, \tau)$ [Ibrahim and Laud, 1991, O'Hagan and Forster, 2004]. Some authors may favor a simpler form $f(a, \tau) \propto 1/\tau$, which is also conventionally referred to as the Jeffreys prior [Berger et al., 2001, Liang et al., 2008]. The two forms produce essentially the same proper posterior inferences when $n$ is sufficiently large.

Under the null model $b = 0$ with prior (7.16),
$$f(y) = \int f(y \mid \tau, a)f(\tau, a)\,d\tau\,da = \int \frac{\tau^{(n+q-2)/2}}{(2\pi)^{n/2}}\exp\Big\{-\frac{\tau}{2}(y - Wa)^t(y - Wa)\Big\}d\tau\,da$$
$$= (2\pi)^{-(n-q)/2}|W^tW|^{-1/2}\int \tau^{(n-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^ty - y^tW(W^tW)^{-1}W^ty\big]\Big\}d\tau = \frac{\Gamma(n/2)}{(2\pi)^{(n-q)/2}}|W^tW|^{-1/2}\Big\{\frac{1}{2}\big[y^ty - y^tW(W^tW)^{-1}W^ty\big]\Big\}^{-n/2}.$$

Under the alternative model, since $y \mid \tau, a, V_b \sim \mathrm{MVN}(Wa, \tau^{-1}(I + LV_bL^t))$, we have
$$f(y \mid \tau, a, V_b) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}|I + LV_bL^t|^{-1/2}\exp\Big\{-\frac{\tau}{2}(y - Wa)^t(LV_bL^t + I)^{-1}(y - Wa)\Big\}.$$
Letting $\Sigma_2 \overset{\text{def.}}{=} (I + LV_bL^t)^{-1}$,
$$f(y \mid V_b) = \int f(y \mid \tau, a, V_b)f(\tau, a)\,d\tau\,da = \frac{\Gamma(n/2)}{(2\pi)^{(n-q)/2}}|W^t\Sigma_2W|^{-1/2}|\Sigma_2|^{1/2}\Big\{\frac{1}{2}\big[y^t\Sigma_2y - y^t\Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2y\big]\Big\}^{-n/2}.$$

To simplify the notation, first define
$$P \overset{\text{def.}}{=} I - W(W^tW)^{-1}W^t.$$
Next we claim
$$\Sigma_2 - \Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2 = P - PL(L^tPL + V_b^{-1})^{-1}L^tP. \qquad (7.17)$$

To prove this, consider $(I + \phi^{-1}WW^t + LV_bL^t)^{-1}$ with $\phi > 0$. By the Woodbury identity,
$$(I + \phi^{-1}WW^t + LV_bL^t)^{-1} = \Sigma_2 - \Sigma_2W(W^t\Sigma_2W + \phi I)^{-1}W^t\Sigma_2,$$
$$(I + \phi^{-1}WW^t + LV_bL^t)^{-1} = P_\phi - P_\phi L(L^tP_\phi L + V_b^{-1})^{-1}L^tP_\phi,$$
where
$$P_\phi \overset{\text{def.}}{=} (I + \phi^{-1}WW^t)^{-1} = I - W(W^tW + \phi I)^{-1}W^t.$$
Notice that both $(I + \phi^{-1}WW^t)$ and $(I + LV_bL^t)$ are invertible by checking the eigenvalues. Now let $\phi \downarrow 0$. The limit clearly exists since $(W^t\Sigma_2W)$ is invertible by the positive definiteness of $V_b$, and $\lim_{\phi \downarrow 0}P_\phi = P$. Thus by the uniqueness of the limit we have obtained (7.17).

Letting $X$ be the residuals of $L$ after regressing out $W$, i.e., $X = PL$, by the idempotence of $P$ we can rewrite (7.17) as
$$\Sigma_2 - \Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2 = P - X(X^tX + V_b^{-1})^{-1}X^t.$$

The ratio of the two determinant terms in the marginal likelihoods is
$$\frac{|W^t\Sigma_2W|^{-1/2}|\Sigma_2|^{1/2}}{|W^tW|^{-1/2}} = |W^t\Sigma_2W(W^tW)^{-1}|^{-1/2}|\Sigma_2|^{1/2} = |I - W^tL(L^tL + V_b^{-1})^{-1}L^tW(W^tW)^{-1}|^{-1/2}|\Sigma_2|^{1/2}$$
$$= |I - (I - P)L(L^tL + V_b^{-1})^{-1}L^t|^{-1/2}|\Sigma_2|^{1/2} = |I + PL(L^tL + V_b^{-1})^{-1}L^t(I + LV_bL^t)|^{-1/2} = |I + PLV_bL^t|^{-1/2} = |I + V_bX^tX|^{-1/2}.$$
In the third and last equalities we have used Sylvester's determinant formula.

Finally we obtain the expression for the Bayes factor,
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = |I + X^tXV_b|^{-1/2}\left(\frac{y^tPy - y^tX(X^tX + V_b^{-1})^{-1}X^ty}{y^tPy}\right)^{-n/2}. \qquad (7.18)$$
One can also check that
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = \lim_{V_a^{-1} \to 0,\ \kappa_1, \kappa_2 \downarrow 0}\mathrm{BF}_{\mathrm{null}}(W, L, V_a, V_b, \kappa_1, \kappa_2).$$

If we define the residuals of $y$ after regressing out $W$ by $y_w \overset{\text{def.}}{=} Py$, we can rewrite (7.18) as
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = |I + X^tXV_b|^{-1/2}\left(\frac{y_w^ty_w - y_w^tX(X^tX + V_b^{-1})^{-1}X^ty_w}{y_w^ty_w}\right)^{-n/2},$$
which has exactly the same form as (7.12). Hence it suffices to discuss the model (7.4) with no loss of generality. It also reveals that $\mathrm{BF}_{\mathrm{null}}$ defined in (7.18) is invariant to the following transformation of $y$,
$$T(y) = c(y + W\alpha), \quad \forall c \neq 0,\ \alpha \in \mathbb{R}^q,$$
which is a very convenient feature in simulation studies.
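A minimal sketch (Python, arbitrary toy data and an arbitrary prior $V_b$) of the residualized form of (7.18), together with a numerical check of the invariance under $T(y) = c(y + W\alpha)$:

import numpy as np

# log BF_null(W, L, V_b) via the residualized form, and the invariance check.
def log_bf_null_cov(y, W, L, Vb):
    n = len(y)
    P = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)   # projection removing W
    X, yw = P @ L, P @ y
    XtX = X.T @ X
    quad = yw @ X @ np.linalg.solve(XtX + np.linalg.inv(Vb), X.T @ yw)
    _, logdet = np.linalg.slogdet(np.eye(L.shape[1]) + XtX @ Vb)
    return -0.5 * logdet - 0.5 * n * np.log1p(-quad / (yw @ yw))

rng = np.random.default_rng(6)
n, q, p = 150, 3, 2
W = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
L = rng.normal(size=(n, p))
y = W @ rng.normal(size=q) + L @ np.array([0.2, -0.1]) + rng.normal(size=n)
Vb = 0.5 * np.eye(p)

bf1 = log_bf_null_cov(y, W, L, Vb)
bf2 = log_bf_null_cov(3.7 * (y + W @ rng.normal(size=q)), W, L, Vb)
assert np.isclose(bf1, bf2)                             # invariance holds numerically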

7.3 Big-O and Little-O Notations

The Big-O and Little-O notations are very useful for studying the limiting behaviour of a sequence. Depending on whether the sequence is deterministic or stochastic, these notations have different meanings. We use $O(\cdot)$ and $o(\cdot)$ for deterministic sequences and $O_p(\cdot)$ and $o_p(\cdot)$ for stochastic sequences. The subscript "p" means "probabilistic". All proofs are omitted in this section. See Cox [2004, Chap. 3], Shao [2003, Chap. 1.5], and der Vaart [2000, Chap. 2.2] among others for proofs and more details.

Definition 7.5. Given two sequences of real numbers $a_n$ and $b_n$ and a sequence of random variables $X_n$, we write

(a) $a_n = O(b_n)$ if and only if there exist $N < \infty$ and $C \in (0, \infty)$ such that
$$|a_n| \leq C|b_n|, \quad \forall n > N;$$
(b) $a_n = o(b_n)$ if and only if for any $\epsilon > 0$, there exists an $N(\epsilon) < \infty$ such that
$$|a_n| \leq \epsilon|b_n|, \quad \forall n > N(\epsilon);$$
(c) $X_n = O_p(b_n)$ if and only if for any $\delta > 0$, there exist $N(\delta) < \infty$ and $C(\delta) < \infty$ such that
$$\mathbb{P}(|X_n| > C(\delta)|b_n|) < \delta, \quad \forall n > N(\delta);$$
(d) $X_n = o_p(b_n)$ if and only if for any $\epsilon > 0$ and $\delta > 0$, there exists $N(\epsilon, \delta) < \infty$ such that
$$\mathbb{P}(|X_n| > \epsilon|b_n|) < \delta, \quad \forall n > N(\epsilon, \delta).$$

In particular, if $X_n = O_p(1)$, we say the sequence $X_n$ is stochastically bounded; if $X_n = o_p(1)$, we say $X_n$ converges to 0 in probability. There are many rules for operating with Big-O and Little-O symbols. The following proposition lists some important ones that will be needed in the derivation of the asymptotic distribution for $\log\mathrm{BF}_{\mathrm{null}}$ in Chap. 2.1.2.

Proposition 7.17.

(a) If $\mathbb{P}(X_n = a_n) = 1$, then $X_n = O_p(b_n)$ if and only if $a_n = O(b_n)$.
(b) If $\mathbb{P}(X_n = a_n) = 1$, then $X_n = o_p(b_n)$ if and only if $a_n = o(b_n)$.
(c) $o_p(O(a_n)) = o_p(a_n)$.
(d) $o_p(a_n) + o_p(a_n) = o_p(a_n)$.
(e) $O_p(a_n) + o_p(a_n) = O_p(a_n)$.
(f) $O_p(a_n)o_p(b_n) = o_p(a_nb_n)$.

7.4 Distribution of a Weighted Sum of $\chi_1^2$ Random Variables

7.4.1 Davies' Method for Computing the Distribution Function

The characteristic function of a random variable $X$ is defined as
$$\phi_X(t) = \mathbb{E}[e^{itX}],$$
where $i$ is the imaginary unit. Unlike the moment-generating function, the characteristic function always exists (the integral is finite). Moreover, the characteristic function uniquely determines the distribution. See Durrett [2010, Chap. 2.3], Feller [1968, Chap. XV], and Resnick [2013, Chap. 9] for more information.

Consider a random variable $X \sim \chi_1^2$. Its characteristic function can be calculated as
$$\phi_X(t) = \mathbb{E}[e^{itX}] = \frac{1}{\sqrt{2\pi}}\int x^{-1/2}\exp\Big\{-\Big(\frac{1}{2} - it\Big)x\Big\}dx = (1 - 2it)^{-1/2}. \qquad (7.19)$$
Note that the existence of this integral can be verified by Euler's formula,
$$e^{itx} = \cos tx + i\sin tx. \qquad (7.20)$$
Next consider a linear combination of $\chi_1^2$ random variables:
$$\bar{Q} = \sum_{i=1}^{p}\lambda_iX_i, \qquad X_i \overset{\text{i.i.d.}}{\sim} \chi_1^2. \qquad (7.21)$$

We are interested in the computation of the distribution function of $\bar{Q}$. A key observation is that its characteristic function is readily available.

Lemma 7.18. The characteristic function of $\bar{Q} = \sum_{i=1}^{p}\lambda_iX_i$, where $X_i \overset{\text{i.i.d.}}{\sim} \chi_1^2$, is given by
$$\phi_{\bar{Q}}(t) = \prod_{i=1}^{p}(1 - 2i\lambda_it)^{-1/2}.$$

Proof. The result in fact follows directly from the properties of the characteristic function:
$$\phi_{\bar{Q}}(t) = \mathbb{E}\Big[\exp\Big(it\sum_{i=1}^{p}\lambda_iX_i\Big)\Big] = \prod_{i=1}^{p}\mathbb{E}[\exp(it\lambda_iX_i)] \quad \text{(by the independence of } X_1, \dots, X_p\text{)} = \prod_{i=1}^{p}\phi_{X_i}(\lambda_it) = \prod_{i=1}^{p}(1 - 2i\lambda_it)^{-1/2}.$$

Given the characteristic function, we can calculate the distribution function by the so-called inversion formula. There are different versions of the formula, among which the most general is Levy's inversion formula.

Theorem 7.19. (Levy's inversion formula) Let $\phi_Y(t)$ be the characteristic function of a random variable $Y$. For $a < b$,
$$\mathbb{P}(a < Y < b) + \frac{1}{2}\big[\mathbb{P}(Y = a) + \mathbb{P}(Y = b)\big] = \lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-ita} - e^{-itb}}{it}\phi_Y(t)\,dt.$$

See Durrett [2010, Chap. 2.3] for a proof. For the random variable $\bar{Q}$ defined in (7.21), if $\lambda_1 \geq \cdots \geq \lambda_p > 0$, then we have
$$\mathbb{P}(\bar{Q} < c) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{1 - e^{-itc}}{it}\phi_{\bar{Q}}(t)\,dt.$$
This provides a way to numerically compute the tail probability of a linear combination of $\chi_1^2$ random variables. A more convenient method is to use Gil-Pelaez's inversion formula, which can be directly derived from Levy's inversion formula. See the original paper, Gil-Pelaez [1951], for a proof.

Theorem 7.20. (Gil-Pelaez's inversion formula) Let $\phi_Y(t)$ be the characteristic function of a continuous random variable $Y$. We have
$$F_Y(y) = \frac{1}{2} - \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{\Im[e^{-ity}\phi_Y(t)]}{t}\,dt,$$
where $\Im$ means to extract the imaginary part.

Davies [1973] showed that, using Gil-Pelaez's inversion formula,
$$\frac{1}{2} - \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{\Im[\phi_Y((k + 1/2)\Delta)e^{-i(k+1/2)\Delta y}]}{k + 1/2} = \mathbb{P}(Y < y) + \sum_{n=1}^{\infty}(-1)^n\big\{\mathbb{P}(Y < y - 2\pi n/\Delta) - \mathbb{P}(Y > y + 2\pi n/\Delta)\big\}.$$
Hence, we may numerically compute the distribution function of $Y$ by
$$\mathbb{P}(Y < y) \approx \frac{1}{2} - \frac{1}{\pi}\sum_{k=0}^{K}\frac{\Im[\phi_Y((k + 1/2)\Delta)e^{-i(k+1/2)\Delta y}]}{k + 1/2}. \qquad (7.22)$$

There are two sources of error. First, we have omitted the term
$$\sum_{n=1}^{\infty}(-1)^n\big\{\mathbb{P}(Y < y - 2\pi n/\Delta) - \mathbb{P}(Y > y + 2\pi n/\Delta)\big\}.$$
But this term could be made arbitrarily small by choosing an appropriate $\Delta$. Second, in the summation in (7.22), there is a truncation error since we sum up to $k = K$ instead of $k = \infty$. In a later paper [Davies, 1980], Davies showed how to control this truncation error when $Y$ is a linear combination of independent $\chi_1^2$ random variables.
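The truncated sum (7.22) is simple to implement. Below is a minimal sketch (Python) using Lemma 7.18 for the characteristic function; the choices of $\Delta$ and $K$ are ad hoc here rather than the principled rules of Davies [1980], and the weights are arbitrary toy values:

import numpy as np

# Approximate P(Q < c) for Q = sum_i lam_i * chi^2_1 by truncating the Davies series (7.22).
def weighted_chisq_cdf(c, lam, delta=0.05, K=20000):
    k = np.arange(K + 1)
    t = (k + 0.5) * delta
    phi = np.prod((1 - 2j * np.outer(t, lam)) ** -0.5, axis=1)   # characteristic function (Lemma 7.18)
    terms = np.imag(phi * np.exp(-1j * t * c)) / (k + 0.5)
    return 0.5 - terms.sum() / np.pi

lam = np.array([1.0, 0.5, 0.2, 0.1])
c = 3.0
print(weighted_chisq_cdf(c, lam))

# Monte Carlo check of the same probability
rng = np.random.default_rng(7)
Q = (rng.chisquare(1, size=(200000, len(lam))) * lam).sum(axis=1)
print((Q < c).mean())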

7.4.2 Methods for Computing the Bounds for the P-values

Consider the random variable $\bar{Q}$ defined in (7.21). We now discuss methods for computing lower and upper bounds for $\mathbb{P}(\bar{Q} > c)$. Assume $\lambda_1 \geq \cdots \geq \lambda_p > 0$. These bounds are used to estimate the true p-value $\mathbb{P}(\bar{Q} > c)$ and hence should be very easy to evaluate.

First, clearly the upper bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \leq \mathbb{P}(\lambda_1\chi_p^2 > c). \qquad (7.23)$$
Similarly, we have $\mathbb{P}(\bar{Q} > c) \geq \mathbb{P}(\lambda_k\chi_k^2 > c)$ for $k = 1, \dots, p$. Thus the lower bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \geq \max_{1 \leq k \leq p}\mathbb{P}(\lambda_k\chi_k^2 > c). \qquad (7.24)$$

This method for computing the bounds is extremely fast to evaluate; however, the accuracy may be poor if $p$ is large and the weights cover a very wide range. Assuming $p$ is even, we now describe a better method for computing the bounds.

Since $\lambda_1 \geq \lambda_2$,
$$\mathbb{P}(\lambda_2\chi_2^2 > c) \leq \mathbb{P}(\lambda_1X_1 + \lambda_2X_2 > c) \leq \mathbb{P}(\lambda_1\chi_2^2 > c).$$
But $\chi_2^2$ is just an exponential random variable with rate parameter 1/2, of which the distribution function is very easy to compute and convolve! Let $Y_k \overset{\text{i.i.d.}}{\sim} \chi_2^2$. The upper bound and the lower bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \leq \mathbb{P}\Big(\sum_{k=1}^{p/2}\lambda_{2k-1}Y_k > c\Big), \qquad \mathbb{P}(\bar{Q} > c) \geq \mathbb{P}\Big(\sum_{k=1}^{p/2}\lambda_{2k}Y_k > c\Big). \qquad (7.25)$$

To see why the convolution is fast, let's start from the simplest case, $p = 2$:
$$\mathbb{P}(\lambda_1Y_1 > c) = e^{-c/2\lambda_1}.$$
Next consider $p = 4$:
$$\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 > c) = 1 - \int_0^{c/\lambda_3}f_{\chi_2^2}(y)\,\mathbb{P}(\lambda_1Y_1 \leq c - \lambda_3y)\,dy = \frac{\lambda_1}{\lambda_1 - \lambda_3}e^{-c/2\lambda_1} + \frac{\lambda_3}{\lambda_3 - \lambda_1}e^{-c/2\lambda_3}.$$
Proceeding to $p = 6$:
$$\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 + \lambda_5Y_3 > c) = 1 - \int_0^{c/\lambda_5}f_{\chi_2^2}(y)\,\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 \leq c - \lambda_5y)\,dy = \frac{\lambda_1^2e^{-c/2\lambda_1}}{(\lambda_1 - \lambda_3)(\lambda_1 - \lambda_5)} + \frac{\lambda_3^2e^{-c/2\lambda_3}}{(\lambda_3 - \lambda_1)(\lambda_3 - \lambda_5)} + \frac{\lambda_5^2e^{-c/2\lambda_5}}{(\lambda_5 - \lambda_1)(\lambda_5 - \lambda_3)}.$$

It is easy to generalize to any even $p > 0$. Letting $r = p/2$, we have
$$\mathbb{P}\Big(\sum_{k=1}^{r}\lambda_{2k-1}Y_k > c\Big) = \sum_{k=1}^{r}\frac{\lambda_{2k-1}^{r-1}e^{-c/2\lambda_{2k-1}}}{\prod_{j \neq k}(\lambda_{2k-1} - \lambda_{2j-1})}.$$
Hence, the bounds given in (7.25) are also very fast to evaluate.
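A minimal sketch (Python) of the bounds in (7.25), using the closed-form tail above; the weights are arbitrary toy values and the formula assumes the selected weights are distinct:

import numpy as np
from scipy.stats import chi2

# Closed-form tail of sum_k lam_k * Y_k with Y_k iid chi^2_2 (distinct weights assumed).
def hypoexp_tail(c, lams):
    lams = np.asarray(lams, dtype=float)
    r = len(lams)
    total = 0.0
    for k in range(r):
        denom = np.prod(lams[k] - np.delete(lams, k))
        total += lams[k] ** (r - 1) * np.exp(-c / (2 * lams[k])) / denom
    return total

lam = np.array([1.0, 0.8, 0.5, 0.3, 0.2, 0.1])      # lam_1 >= ... >= lam_p > 0
c = 6.0
upper = hypoexp_tail(c, lam[0::2])                   # uses lam_1, lam_3, lam_5
lower = hypoexp_tail(c, lam[1::2])                   # uses lam_2, lam_4, lam_6
crude_upper = chi2.sf(c / lam[0], df=len(lam))       # the simple bound (7.23)
print(lower, upper, crude_upper)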

7.5 GCTA and Linear Mixed Model

The linear mixed model is used by GCTA [Yang et al., 2010, Lee et al., 2011, Yang et al., 2011] to infer the heritability of phenotypes. The model may be written as
$$y = X\beta + Wu + \varepsilon, \qquad \varepsilon \sim \mathrm{MVN}(0, \sigma_\varepsilon^2I), \qquad u \sim \mathrm{MVN}(0, \sigma_u^2I), \qquad (7.26)$$
where $X$ is an $n \times q$ matrix and $W$ is an $n \times N$ matrix, by a slight abuse of notation. $\beta$ is called the fixed effects and $u$ is called the random effects. Equivalently, we can write
$$y = X\beta + g + \varepsilon \sim \mathrm{MVN}(X\beta, V), \qquad (7.27)$$
where
$$V = \sigma_\varepsilon^2H, \qquad H = \kappa A + I, \qquad A = \frac{1}{N}WW^t.$$
Hence, implicitly we have used $\kappa = N\sigma_u^2/\sigma_\varepsilon^2$. The model can be easily generalized to
$$y = X\beta + \sum_{i=1}^{r}g_i + \varepsilon, \qquad g_i \sim \mathrm{MVN}(0, \kappa_i\sigma_\varepsilon^2A_i).$$
For ease of reading, this introduction is restricted to the one-random-effect model (7.27).

GCTA estimates the variance components $\sigma_\varepsilon^2$ and $\sigma_u^2$ by REML (restricted/residual maximum likelihood), which is the classical approach for statistical inference in linear mixed models. For a thorough treatment in book form, see Jiang [2007] and Searle et al. [2009].

7.5.1 Restricted Maximum Likelihood Estimation

To gain intuition for REML estimation, recall that for $n$ observations $y_1, \dots, y_n$, the sample variance is given by $s^2 = \mathrm{SST}/(n-1)$, where $\mathrm{SST} = \sum(y_i - \bar{y})^2$ denotes the total sum of squares. It can be shown that $s^2$ is indeed unbiased. However, if we assume a normal distribution for the observations and compute the maximum likelihood estimator, we obtain $\hat{\sigma}_{\mathrm{ML}}^2 = \mathrm{SST}/n$, which is biased. The reason is that we have one "fixed effect", the expectation of $y_i$ (denoted by $\mu_y$), which we estimate by $\bar{y}$. This estimation, intuitively speaking, costs one degree of freedom and thus we prefer the estimator $s^2$. REML, in contrast, aims to construct a likelihood function that does not contain $\mu_y$; maximizing this restricted likelihood then yields an unbiased estimator for the variance.

For the linear mixed model (7.26), the idea of REML is to find an $n \times q$ full-rank matrix $L_1$ and an $n \times (n-q)$ full-rank matrix $L_2$ such that
$$L_1^tX = I, \qquad L_2^tX = 0,$$
and make inferences using $y_1 = L_1^ty$ and $y_2 = L_2^ty$. By (7.27),
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim \mathrm{MVN}\left(\begin{pmatrix} \beta \\ 0 \end{pmatrix}, \begin{pmatrix} L_1^tVL_1 & L_1^tVL_2 \\ L_2^tVL_1 & L_2^tVL_2 \end{pmatrix}\right).$$

The matrix $L_2$ is used to control for the loss of degrees of freedom caused by $X$. Now define the restricted likelihood as the likelihood for $\kappa$ and $\sigma_\varepsilon^2$ given the observation $y_2$. We write the log-restricted-likelihood as
$$l_r \overset{\text{def.}}{=} \log L(\sigma_\varepsilon^2, \kappa; y_2) = -\frac{n-q}{2}\log(2\pi) - \frac{1}{2}\log|L_2^tVL_2| - \frac{1}{2}y^tL_2(L_2^tVL_2)^{-1}L_2^ty.$$
Assuming $X$ has full rank, we have by Lemma 7.16
$$P_H \overset{\text{def.}}{=} H^{-1} - H^{-1}X(X^tH^{-1}X)^{-1}X^tH^{-1} = L_2(L_2^tHL_2)^{-1}L_2^t. \qquad (7.28)$$

The determinant term can also be transformed so that $L_2$ does not enter the calculation of the derivatives. By Lemma 7.3,
$$|L^tHL| = |L_2^tHL_2|\,|L_1^tHL_1 - L_1^tHL_2(L_2^tHL_2)^{-1}L_2^tHL_1|,$$
where $L = (L_1, L_2)$. By (7.28), the second term on the r.h.s. is equal to $|(X^tH^{-1}X)^{-1}|$. Thus,
$$\log|L^tHL| = \log|L_2^tHL_2| - \log|X^tH^{-1}X|.$$
Since both $L$ and $H$ are square matrices of full rank,
$$\log|L^tHL| = \log|H| + \log|L^tL|,$$
which yields
$$\log|L_2^tHL_2| = \log|H| + \log|L^tL| + \log|X^tH^{-1}X|.$$

Now omitting the constant term, we can rewrite the log-restricted-likelihood as
$$l_r = -\frac{1}{2}\big((n-q)\log\sigma_\varepsilon^2 + \log|H| + \log|X^tH^{-1}X| + y^tP_Hy/\sigma_\varepsilon^2\big). \qquad (7.29)$$

To compute the REML estimates for $\kappa$ and $\sigma_\varepsilon^2$, we need to differentiate $l_r$. For $\sigma_\varepsilon^2$, we have
$$\frac{\partial l_r}{\partial\sigma_\varepsilon^2} = -\frac{1}{2}\Big(\frac{n-q}{\sigma_\varepsilon^2} - \frac{y^tP_Hy}{\sigma_\varepsilon^4}\Big).$$
For $\kappa$, using the matrix differentiation rule $\frac{\partial\log|M|}{\partial x} = \mathrm{tr}\big(M^{-1}\frac{\partial M}{\partial x}\big)$, we obtain
$$\frac{\partial l_r}{\partial\kappa} = -\frac{1}{2}\Big[\mathrm{tr}(P_HA) - \frac{1}{\sigma_\varepsilon^2}y^tP_HAP_Hy\Big]$$

after heavy calculations. The REML estimates for $\kappa$ and $\sigma_\varepsilon^2$, which are unbiased, are then obtained by solving
$$\left.\frac{\partial l_r}{\partial\sigma_\varepsilon^2}\right|_{\hat{\sigma}_\varepsilon^2} = 0, \qquad \left.\frac{\partial l_r}{\partial\kappa}\right|_{\hat{\kappa}} = 0. \qquad (7.30)$$
However, this cannot be solved analytically (note that $P_H$ depends on the parameter $\kappa$).

7.5.2 Newton-Raphson's Method for Computing REML Estimates

To solve (7.30), the standard approach is Newton-Raphson's optimization method [Chong and Zak, 2013, Chap. 9]. Let $\theta = (\kappa, \sigma_\varepsilon^2)$. We start from an initial guess $\theta^{(0)}$ and then update it by
$$\theta^{(k+1)} = \theta^{(k)} - \left(\frac{\partial^2l_r}{\partial\theta\partial\theta^t}\right)^{-1}\frac{\partial l_r}{\partial\theta},$$

where $\frac{\partial^2l_r}{\partial\theta\partial\theta^t}$ is called the Hessian matrix. In most statistical applications, the Hessian matrix is computed or estimated by either the observed Fisher information matrix, $J$, or the expected Fisher information matrix, $I$. The corresponding iteration formulae are given by
$$\theta^{(k+1)} = \theta^{(k)} + J(\theta)^{-1}\frac{\partial l_r}{\partial\theta}; \qquad \theta^{(k+1)} = \theta^{(k)} + I(\theta)^{-1}\frac{\partial l_r}{\partial\theta}.$$

For our problem, we can calculate
$$J_{\sigma_\varepsilon^2\sigma_\varepsilon^2} = -\frac{\partial^2l_r}{\partial(\sigma_\varepsilon^2)^2} = -\frac{1}{2}\Big(\frac{n-q}{\sigma_\varepsilon^4} - \frac{2}{\sigma_\varepsilon^6}y^tP_Hy\Big), \qquad J_{\kappa\kappa} = -\frac{\partial^2l_r}{\partial\kappa^2} = -\frac{1}{2}\Big(\mathrm{tr}(P_HAP_HA) - \frac{2}{\sigma_\varepsilon^2}y^tP_HAP_HAP_Hy\Big),$$
$$J_{\kappa\sigma_\varepsilon^2} = -\frac{\partial^2l_r}{\partial\kappa\partial\sigma_\varepsilon^2} = \frac{1}{2}\cdot\frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy, \qquad I_{\sigma_\varepsilon^2\sigma_\varepsilon^2} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial(\sigma_\varepsilon^2)^2}\Big] = \frac{1}{2}\cdot\frac{n-q}{\sigma_\varepsilon^4},$$
$$I_{\kappa\kappa} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial\kappa^2}\Big] = \frac{1}{2}\mathrm{tr}(P_HAP_HA), \qquad I_{\kappa\sigma_\varepsilon^2} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial\kappa\partial\sigma_\varepsilon^2}\Big] = \frac{1}{2}\cdot\frac{1}{\sigma_\varepsilon^2}\mathrm{tr}(P_HA).$$

Noticing that $J_{\sigma_\varepsilon^2\sigma_\varepsilon^2}$ and $I_{\sigma_\varepsilon^2\sigma_\varepsilon^2}$ (likewise $J_{\kappa\kappa}$ and $I_{\kappa\kappa}$) contain the same term with opposite signs, in practice we use the average information matrix [Gilmour et al., 1995], which is more convenient to compute,
$$\mathrm{AI}(\sigma_\varepsilon^2, \kappa) = \frac{1}{2}\begin{pmatrix} \frac{1}{\sigma_\varepsilon^6}y^tP_Hy & \frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy \\ \frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy & \frac{1}{\sigma_\varepsilon^2}y^tP_HAP_HAP_Hy \end{pmatrix}.$$
The diagonal elements are the averages of those of $J$ and $I$, but the off-diagonal elements are simply chosen equal to those of $J$ so that $\mathrm{AI}$ is positive definite.

7.5.3 Details of GCTA’s Implementation of REML

Estimations

2 2 Parametrization GCTA uses the parametrization (σε , σg ) where

2 2 2 σg = κσε = Nσu.

By defining
$$P \overset{\text{def.}}{=} \sigma_\varepsilon^{-2}P_H = V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1},$$
we have
$$\frac{\partial l_r}{\partial\sigma_g^2} = -\frac{1}{2}\big[\mathrm{tr}(PA) - y^tPAPy\big], \qquad \mathrm{AI}(\sigma_\varepsilon^2, \sigma_g^2) = \frac{1}{2}\begin{pmatrix} y^tPPPy & y^tPAPPy \\ y^tPPAPy & y^tPAPAPy \end{pmatrix}.$$
The REML estimates for $\sigma_\varepsilon^2$ and $\sigma_g^2$ are then computed by Newton-Raphson's method.
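The sketch below (Python, toy data) illustrates one average-information update in this parametrization, following the score and AI formulas above with $\partial V/\partial\sigma_\varepsilon^2 = I$ and $\partial V/\partial\sigma_g^2 = A$. It is not GCTA's actual implementation: it uses dense linear algebra, includes no positivity safeguards, and the starting values, sample sizes and true variance components are arbitrary choices.

import numpy as np

# One AI-REML update in the (sigma_e^2, sigma_g^2) parametrization.
def ai_reml_step(y, X, A, s2e, s2g):
    n = len(y)
    V = s2e * np.eye(n) + s2g * A
    Vi = np.linalg.inv(V)
    P = Vi - Vi @ X @ np.linalg.solve(X.T @ Vi @ X, X.T @ Vi)
    Py = P @ y
    APy = A @ Py
    # score: dl_r/d theta_i = -0.5 * [tr(P dV/dtheta_i) - y'P dV/dtheta_i P y]
    score = np.array([-0.5 * (np.trace(P) - Py @ Py),
                      -0.5 * (np.trace(P @ A) - Py @ APy)])
    # average information: 0.5 * y'P dV/dtheta_i P dV/dtheta_j P y
    AI = 0.5 * np.array([[Py @ P @ Py,  Py @ P @ APy],
                         [APy @ P @ Py, APy @ P @ APy]])
    return np.array([s2e, s2g]) + np.linalg.solve(AI, score)

rng = np.random.default_rng(8)
n, N = 300, 500
W = rng.normal(size=(n, N))
A = W @ W.T / N
X = np.ones((n, 1))                      # intercept only
g = rng.multivariate_normal(np.zeros(n), 0.5 * A)
y = X @ np.array([1.0]) + g + rng.normal(scale=np.sqrt(0.5), size=n)

theta = np.array([1.0, 1.0])             # starting values (sigma_e^2, sigma_g^2)
for _ in range(20):
    theta = ai_reml_step(y, X, A, *theta)
print("sigma_e^2, sigma_g^2 estimates:", theta)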

Standard Errors of the Estimates Since maximum likelihood estimates are asymptotically normal with covariance matrix $I^{-1}$ and $\mathrm{AI}$ is a consistent estimator for $I$, we may compute the standard errors for the estimates $\hat{\sigma}_\varepsilon^2$ and $\hat{\sigma}_g^2$ by calculating
$$\mathrm{AI}^{-1} = \begin{pmatrix} \mathrm{AI}^{11} & \mathrm{AI}^{12} \\ \mathrm{AI}^{21} & \mathrm{AI}^{22} \end{pmatrix}.$$
Then the standard errors can be computed by
$$\mathrm{SE}(\hat{\sigma}_\varepsilon^2) = \sqrt{\mathrm{AI}^{11}}, \qquad \mathrm{SE}(\hat{\sigma}_g^2) = \sqrt{\mathrm{AI}^{22}}.$$

Heritability Estimation In genome-wide association studies, the matrix $W$ is composed of the dosages of the SNPs. If each column of $W$ is normalized to unit variance, then the heritability of the phenotype $y$ can be estimated by
$$\hat{h}^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_\varepsilon^2 + \hat{\sigma}_g^2} = \frac{\hat{\kappa}}{1 + \hat{\kappa}}.$$
GCTA calls this the "variance explained by the genome-wide SNPs". Define $\sigma_p^2 = \sigma_\varepsilon^2 + \sigma_g^2$. Clearly $\hat{\sigma}_p^2 = \hat{\sigma}_\varepsilon^2 + \hat{\sigma}_g^2$ is also unbiased. To compute the standard error for $\hat{h}^2$, we use the first-order Taylor expansion,
$$\mathrm{Var}\Big(\frac{\hat{\sigma}_g^2}{\hat{\sigma}_p^2}\Big) \approx \frac{\sigma_g^4}{\sigma_p^4}\Big[\frac{\mathrm{Var}(\hat{\sigma}_g^2)}{\sigma_g^4} - \frac{2\,\mathrm{Cov}(\hat{\sigma}_g^2, \hat{\sigma}_p^2)}{\sigma_g^2\sigma_p^2} + \frac{\mathrm{Var}(\hat{\sigma}_p^2)}{\sigma_p^4}\Big],$$

where
$$\mathrm{Var}(\hat{\sigma}_g^2) = \mathrm{AI}^{22}, \qquad \mathrm{Var}(\hat{\sigma}_p^2) = \mathrm{AI}^{11} + \mathrm{AI}^{22} + 2\mathrm{AI}^{12}, \qquad \mathrm{Cov}(\hat{\sigma}_g^2, \hat{\sigma}_p^2) = \mathrm{AI}^{22} + \mathrm{AI}^{12}.$$
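The following short sketch (Python) puts the delta-method formula above into code; the REML estimates and the inverse average information matrix are purely illustrative numbers, not output from a real analysis:

import numpy as np

# Heritability point estimate and its delta-method standard error.
s2e_hat, s2g_hat = 0.55, 0.45
AI_inv = np.array([[0.004, -0.002],       # ordering (sigma_e^2, sigma_g^2); illustrative values
                   [-0.002, 0.006]])

s2p_hat = s2e_hat + s2g_hat
h2_hat = s2g_hat / s2p_hat

var_g = AI_inv[1, 1]
var_p = AI_inv[0, 0] + AI_inv[1, 1] + 2 * AI_inv[0, 1]
cov_gp = AI_inv[1, 1] + AI_inv[0, 1]
var_h2 = (s2g_hat / s2p_hat) ** 2 * (var_g / s2g_hat**2
                                     - 2 * cov_gp / (s2g_hat * s2p_hat)
                                     + var_p / s2p_hat**2)
print("h2 =", h2_hat, "SE =", np.sqrt(var_h2))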

Calculation of the P-value The p-value in the GCTA output is calculated by the (restricted) likelihood ratio test. The null hypothesis is $\sigma_g^2 = 0$ and the alternative is $\sigma_g^2 > 0$. To calculate the maximum log-restricted-likelihood under the alternative hypothesis, denoted by $l_{r1}$, we simply plug the REML estimates $\hat{\sigma}_\varepsilon^2$ and $\hat{\sigma}_g^2$ into (7.29). Similarly, the maximum log-restricted-likelihood under the null hypothesis, denoted by $l_{r0}$, can be computed by plugging in
$$\hat{\sigma}_{g0}^2 = 0, \qquad \hat{\sigma}_{\varepsilon0}^2 = \sum(y_i - \bar{y})^2/(n - q).$$

Note that this is not a standard setting for the likelihood ratio test, because the null hypothesis lies on the boundary of the parameter space and thus the standard asymptotic result for the likelihood ratio test does not apply. Stram and Lee [1994] showed that asymptotically $-2(l_{r0} - l_{r1})$ follows $0.5\delta_0 + 0.5\chi_1^2$ ($\delta_0$ denotes a degenerate distribution with unit probability at 0), which was in fact a result from Self and Liang [1987]. Hence GCTA computes the p-value by
$$P = \frac{1}{2}\Pr\big(\chi_1^2 > -2(l_{r0} - l_{r1})\big). \qquad (7.31)$$

However, this asymptotic result is likely to perform poorly for a finite sample size.

Methods proposed by Crainiceanu and Ruppert [2004], Greven et al. [2012] may

be considered to produce more reliable p-values.

BLUP Estimation In most applications, we do not estimate $u$ but only estimate the variance of the random effects. If one does want to estimate $u$ or $g = Wu$, the standard choice is to use the BLUPs (Best Linear Unbiased Predictors). The BLUPs have very similar statistical properties to the BLUEs (Best Linear Unbiased Estimators) in linear regression, and why they are called "predictors" is not very clear. The idea of BLUP estimation is to compute the conditional expectation given $y_2$. For example, if we want to estimate a quantity $a$ which follows a normal distribution, we use
$$\hat{a} = \mathbb{E}[a \mid L_2^ty] = \mathrm{Cov}(a, L_2^ty)^t\,\mathrm{Var}(L_2^ty)^{-1}L_2^ty.$$

Thus for $u$, $g$ and $\varepsilon$, we have
$$\hat{g} = \sigma_g^2APy = \kappa AP_Hy, \qquad \hat{\varepsilon} = \sigma_\varepsilon^2Py = P_Hy, \qquad \hat{u} = \sigma_g^2W^tPy/N. \qquad (7.32)$$

But notice that in the GCTA output, $\hat{u}_i$ is rescaled by $\sqrt{2f_i(1 - f_i)}$, where $f_i$ is the minor allele frequency of the $i$-th SNP, so that it can be directly applied to the unscaled genotype data.

7.6 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm [Hastings, 1970] is probably the most important example of Markov chain Monte Carlo (MCMC) methods. As suggested by the name of MCMC, it is a sampling method based on Markov chains. The reader is referred to Ross [1996], Levin et al. [2009] and Meyn and Tweedie [2012] for an introduction to Markov chains. The following result from Markov chain theory is key to understanding the Metropolis-Hastings algorithm.

Proposition 7.21. (Detailed balance condition) Let $P$ be the transition matrix of a Markov chain with a countable state space $\Omega$. If a distribution $\pi$ satisfies the detailed balance condition,
$$\pi(x)P(x, y) = \pi(y)P(y, x), \quad \forall x, y \in \Omega,$$
then $\pi$ is a stationary distribution for $P$.

Proof. By the detailed balance condition, $\forall y \in \Omega$,
$$\sum_{x \in \Omega}\pi(x)P(x, y) = \sum_{x \in \Omega}\pi(y)P(y, x) = \pi(y).$$
Treating $\pi$ as a row vector, we may write $\pi P = \pi$, which is the definition of a stationary distribution.

For a general state space $S$, the detailed balance condition is given by [Green and Mira, 2001]
$$\int_{(x,y) \in A \times B}\pi(dx)P(x, dy) = \int_{(x,y) \in A \times B}\pi(dy)P(y, dx), \quad \forall \text{ Borel sets } A, B \subset S. \qquad (7.33)$$

We are now ready to formalize the Metropolis-Hastings algorithm.

Proposition 7.22. (Metropolis-Hastings algorithm) Consider a countable state space $\Omega$ and a transition matrix $Q$ (but we will call it the proposal matrix) such that

• $Q$ is irreducible and aperiodic;
• if $Q(x, y) > 0$, then $Q(y, x) > 0$.

Let $\pi$ be a probability distribution on $\Omega$. The Metropolis-Hastings algorithm defines a Markov chain that starts from $x^{(0)}$ such that $\pi(x^{(0)}) > 0$ and moves according to the following rule:

• Given the current state $x^{(k)}$, propose a new state $y$ according to the distribution $Q(x, \cdot)$.
• Compute the acceptance ratio
$$\alpha(x, y) = \min\Big\{1, \frac{\pi(y)Q(y, x)}{\pi(x)Q(x, y)}\Big\}. \qquad (7.34)$$
• Set $x^{(k+1)} = y$ with probability $\alpha(x, y)$, and set $x^{(k+1)} = x^{(k)}$ with probability $1 - \alpha(x, y)$.

Then $\pi$ is the unique stationary and limiting distribution for this Markov chain.

Proof. We start the proof by checking the detailed balance condition. Let $P$ be the actual transition matrix of the Metropolis-Hastings Markov chain. For any $x, y \in \Omega$, clearly at least one of $\alpha(x, y)$ and $\alpha(y, x)$ must equal 1. Assume $\alpha(x, y) = \pi(y)Q(y, x)/\pi(x)Q(x, y)$ and $\alpha(y, x) = 1$. Then,
$$\pi(x)P(x, y) = \pi(x)Q(x, y)\alpha(x, y) = \pi(x)Q(x, y)\frac{\pi(y)Q(y, x)}{\pi(x)Q(x, y)} = \pi(y)Q(y, x) = \pi(y)P(y, x).$$
By Proposition 7.21, $\pi$ must be a stationary distribution for $P$. Let $\Omega^+ = \{x \in \Omega : \pi(x) > 0\}$. Since $\alpha(x, y)$ is always greater than 0 and $Q$ is irreducible on $\Omega$, $P$ is irreducible on $\Omega^+$. Since $Q$ is aperiodic, $P$ is also aperiodic. By standard Markov chain theory and, in particular, the ergodic theorem [Durrett, 2010, Chap. 7], $\pi$ is both the stationary and the limiting distribution for $P$.

We make two comments. First, the aperiodicity of $Q$ is not necessary at all. As long as for some $x, y \in \Omega^+$ we have $Q(x, y) > 0$ and $\alpha(x, y) \in (0, 1)$, $P$ must be aperiodic, since the chain may stay at $x$ for any positive number of steps with positive probability. Second, there are many other choices of the acceptance ratio such that the detailed balance condition holds. However, Peskun [1973] proved that, for a discrete state space, the acceptance ratio given in (7.34) is optimal in terms of statistical efficiency. This ratio is also called the Metropolis-Hastings ratio or simply the Hastings ratio.

For variations of the Metropolis-Hastings algorithm, see Liu [2008] and Brooks et al. [2011] among others.
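A minimal sketch (Python) of Proposition 7.22 on a small discrete state space; the target distribution and the symmetric random-walk proposal (under which $Q$ cancels in the Hastings ratio) are arbitrary illustrative choices:

import numpy as np

# Metropolis-Hastings sampler on {0,...,4} with a symmetric neighbour proposal.
rng = np.random.default_rng(9)
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # target distribution
n_states, n_iter = len(pi), 200000

x = 2                                              # start where pi(x) > 0
counts = np.zeros(n_states)
for _ in range(n_iter):
    y = (x + rng.choice([-1, 1])) % n_states       # propose a neighbour (wraps around)
    alpha = min(1.0, pi[y] / pi[x])                # Hastings ratio; Q is symmetric here
    if rng.random() < alpha:
        x = y
    counts[x] += 1

print(counts / n_iter)                             # close to the target pi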

7.7 Used Real Datasets

7.7.1 Merged Intraocular Pressure Dataset

We applied for access and downloaded two GWAS datasets from the database of

Genotypes and Phenotypes (dbGaP). Both studies were funded by the National

Eye Institute. One is the Ocular Hypertension Treatment Study [Kass et al., 2002]

(henceforth OHTS, dbGaP accession number: phs000240.v1.p1), and the other is National Eye Institute Human Genetics Collaboration Consortium Glaucoma

Genome-Wide Association Study [Ulmer et al., 2012] (henceforth NEIGHBOR, dbGaP accession number: phs000238.v1.p1). The phenotype of interest is the intraocular pressure (IOP). The OHTS dataset only contains individuals with high IOP (≥ 21). The NEIGHBOR dataset is a case-control design for glaucoma [Ulmer et al., 2012, Weinreb et al., 2014], in which many samples have IOP measurements, because a high IOP is considered a major risk factor for glaucoma. The NEIGHBOR dataset, however, contains case samples with small IOP and control samples with large IOP. To reduce the effect of any potentially confounding factors, we removed those samples. We also removed samples whose IOP measurements differ by more than 10 between the two eyes, since such a large difference is likely to be caused by physical accidents. We noticed that there is an additional column defined as I(max IOP > 21) in the original phenotype file. However, this column conflicts with the IOP measurements of the two eyes for some samples. Such samples were removed as well. The average IOP of the two eyes was used as the raw phenotype.

We then performed the routine quality control for the genotypes using the procedures described in Xu and Guan [2014]. OHTS and NEIGHBOR were genotyped on different SNP arrays and finally 301,143 SNPs genotyped in both studies passed the quality control. We then performed principal component analysis to remove outliers and extracted 3,226 subjects (740 from OHTS and 2,486 from

We then performed the routine quality control for the genotypes using the procedures described in Xu and Guan [2014]. OHTS and NEIGHBOR were geno- typed on different SNP arrays and finally 301, 143 SNPs genotyped in both studies passed the quality control. We then performed principal component analysis to remove outliers and extracted 3, 226 subjects (740 from OHTS and 2486 from

NEIGHBOR) that were clustered around European samples in HapMap3 [The

International HapMap Consortium, 2010].

We refer to this dataset as the IOP dataset in this manuscript. In the simu- lation study of the p-value calibration (Chap. 2.3.4), we only used the genotype and the phenotype was simulated.

7.7.2 Height Dataset

The Height dataset refers to the dataset used in Yang et al. [2010], from which GCTA estimated a 44.6% heritability for height. It contains 3,925 subjects and 294,831 SNPs. All the individuals are of European descent and are unrelated with each other; hence there is no need to control for population stratification. Strict quality control procedures had already been performed (see Yang et al. [2010]). In our simulation studies, we simply removed the SNPs with missing rate > 0.01 or MAF < 0.01, and 274,719 SNPs remained.

Bibliography

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions:

with formulas, graphs, and mathematical tables, volume 55. Courier Corpora-

tion, 1964.

Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.

James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychoto-

mous response data. Journal of the American statistical Association, 88(422):

669–679, 1993.

Grégoire Allaire, Sidi Mahmoud Kaber, and Karim Trabelsi. Numerical linear

algebra, volume 55. Springer, 2008.

Hana Lango Allen, Karol Estrada, Guillaume Lettre, Sonja I Berndt, Michael N

Weedon, Fernando Rivadeneira, Cristen J Willer, Anne U Jackson, Sailaja

Vedantam, Soumya Raychaudhuri, et al. Hundreds of variants clustered in

genomic loci and biological pathways affect human height. Nature, 467(7317):

832–838, 2010.

Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for

efficient monte carlo computations. The Annals of Statistics, pages 697–725,

2009.

211 Jennifer Asimit and Eleftheria Zeggini. Rare variant association analysis methods

for complex traits. Annual review of genetics, 44:293–308, 2010.

David J Balding. A tutorial on statistical methods for population association

studies. Nature Reviews Genetics, 7(10):781–791, 2006.

Roderick D Ball. Bayesian methods for quantitative trait loci mapping based on

model selection: approximate analysis using the bayesian information criterion.

Genetics, 159(3):1351–1364, 2001.

Yael Baran, Bogdan Pasaniuc, Sriram Sankararaman, Dara G Torgerson, Christo-

pher Gignoux, Celeste Eng, William Rodriguez-Cintron, Rocio Chapela, Jean G

Ford, Pedro C Avila, et al. Fast and accurate inference of local ancestry in latino

populations. Bioinformatics, 28(10):1359–1367, 2012.

Gregory S Barsh, Gregory P Copenhaver, Greg Gibson, and Scott M Williams.

Guidelines for genome-wide association studies. PLoS genetics, 8(7):e1002812,

2012.

Maurice S Bartlett. Properties of sufficiency and statistical tests. Proceedings

of the Royal Society of London. Series A, Mathematical and Physical Sciences,

pages 268–282, 1937.

Maurice S Bartlett. A comment on D. V. Lindley’s statistical paradox. Biometrika,

44(1-2):533–534, 1957.

Johannes Bausch. On the efficient calculation of a linear combination of chi-

square random variables with an application in counting string vacua. Journal

of Physics A: Mathematical and Theoretical, 46(50):505202, 2013.

212 Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a prac-

tical and powerful approach to multiple testing. Journal of the Royal Statistical

Society. Series B (Methodological), pages 289–300, 1995.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for linear models.

Bayesian statistics, 5:25–44, 1996a.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for model selection

and prediction. Journal of the American Statistical Association, 91(433):109–

122, 1996b.

James O Berger and Thomas Sellke. Testing a point null hypothesis: the irreconcil-

ability of p values and evidence. Journal of the American statistical Association,

82(397):112–122, 1987.

James O Berger, Luis R Pericchi, JK Ghosh, Tapas Samanta, Fulvio De Santis,

JO Berger, and LR Pericchi. Objective bayesian methods for model selection:

introduction and comparison. Lecture Notes-Monograph Series, pages 135–207,

2001.

Peter J Bickel and JK Ghosh. A decomposition for the likelihood ratio statistic

and the bartlett correction–a bayesian argument. The Annals of Statistics, 18

(3):1070–1090, 1990.

Joanna M Biernacka, Rui Tang, Jia Li, Shannon K McDonnell, Kari G Rabe,

Jason P Sinnwell, David N Rider, Mariza De Andrade, Ellen L Goode, and

Brooke L Fridley. Assessment of genotype imputation methods. In BMC pro-

ceedings, volume 3, page 1. BioMed Central, 2009.

Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.

213 David Blackwell. Conditional expectation and unbiased sequential estimation. The

Annals of Mathematical Statistics, pages 105–110, 1947.

George EP Box. A general distribution theory for a class of likelihood criteria.

Biometrika, 36(3/4):317–346, 1949.

Karl W Broman and Terence P Speed. A model selection approach for the identi-

fication of quantitative trait loci in experimental crosses. Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 64(4):641–656, 2002.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of

Markov Chain Monte Carlo. CRC press, 2011.

Brian L Browning and Sharon R Browning. A unified approach to genotype impu-

tation and haplotype-phase inference for large data sets of trios and unrelated

individuals. The American Journal of Human Genetics, 84(2):210–223, 2009.

Paul R Burton, David G Clayton, Lon R Cardon, Nick Craddock, Panos Deloukas,

Audrey Duncanson, Dominic P Kwiatkowski, Mark I McCarthy, Willem H

Ouwehand, Nilesh J Samani, et al. Genome-wide association study of 14,000

cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):

661–678, 2007.

William S Bush and Jason H Moore. Genome-wide association studies. PLoS

Comput Biol, 8(12):e1002822, 2012.

Peter Carbonetto and Matthew Stephens. Scalable variational inference for

bayesian variable selection in regression, and its accuracy in genetic associa-

tion studies. Bayesian analysis, 7(1):73–108, 2012.

Bradley P Carlin and Siddhartha Chib. Bayesian model choice via markov chain

214 monte carlo methods. Journal of the Royal Statistical Society. Series B (Method-

ological), pages 473–484, 1995.

George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury

Pacific Grove, CA, 2002.

George Casella and Christian P Robert. Rao-blackwellisation of sampling schemes.

Biometrika, 83(1):81–94, 1996.

Zhen Chen and David B Dunson. Random effects selection in linear mixed models.

Biometrics, 59(4):762–769, 2003.

Edwin KP Chong and Stanislaw H Zak. An introduction to optimization, vol-

ume 76. John Wiley & Sons, 2013.

Francis S Collins, Mark S Guyer, and Aravinda Chakravarti. Variations on a

theme: cataloging human dna sequence variation. Science, 278(5343):1580–

1581, 1997.

Psychiatric GWAS Consortium Coordinating Committee. Genomewide associa-

tion studies: history, rationale, and prospects for psychiatric disorders. Ameri-

can Journal of Psychiatry, 2009.

Karen N Conneely and Michael Boehnke. So many correlated tests, so little time!

rapid adjustment of p values for multiple correlated tests. The American Journal

of Human Genetics, 81(6):1158–1168, 2007.

EH Corder, AM Saunders, WJ Strittmatter, DE Schmechel, PC Gaskell, GW

Small, AD Roses, JL Haines, and Margaret A Pericak-Vance. Gene dose of

apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset

families. Science, 261(5123):921–923, 1993.

215 Dennis D. Cox. The Theory of Statistics and Its Applications. 2004. unpublished

book.

Ciprian M Crainiceanu and David Ruppert. Likelihood ratio tests in linear mixed

models with one variance component. Journal of the Royal Statistical Society:

Series B (Statistical Methodology), 66(1):165–185, 2004.

Paul Damien and Stephen G Walker. Sampling truncated normal, beta, and

gamma densities. Journal of Computational and Graphical Statistics, 10(2):

206–215, 2001.

Robert B Davies. Numerical inversion of a characteristic function. Biometrika, 60

(2):415–417, 1973.

Robert B Davies. Algorithm AS 155: The distribution of a linear combination of

χ2 random variables. Applied Statistics, pages 323–333, 1980.

Paul IW de Bakker, Roman Yelensky, Itsik Pe’er, Stacey B Gabriel, Mark J Daly,

and David Altshuler. Efficiency and power in genetic association studies. Nature

genetics, 37(11):1217–1223, 2005.

Olivier Delaneau, Cédric Coulonges, and Jean-François Zagury. Shape-it: new

rapid and accurate algorithm for haplotype inference. BMC bioinformatics, 9

(1):1, 2008.

Petros Dellaportas, Jonathan J Forster, and Ioannis Ntzoufras. On bayesian model

and variable selection using mcmc. Statistics and Computing, 12(1):27–36, 2002.

Aad W der Vaart. Asymptotic statistics, volume 3. Cambridge university press,

2000.

B Devlin and Neil Risch. A comparison of linkage disequilibrium measures for

fine-scale mapping. Genomics, 29(2):311–322, 1995.

216 B Devlin, Kathryn Roeder, and Larry Wasserman. Genomic control, a new ap-

proach to genetic-based association studies. Theoretical population biology, 60

(3):155–166, 2001.

Bernie Devlin and Kathryn Roeder. Genomic control for association studies. Bio-

metrics, 55(4):997–1004, 1999.

Randal Douc and Christian P Robert. A vanilla rao–blackwellization of

metropolis–hastings algorithms. The Annals of Statistics, 39(1):261–277, 2011.

Norman R Draper and R Craig Van Nostrand. Ridge regression and james-stein

estimation: review and comments. Technometrics, 21(4):451–466, 1979.

Frank Dudbridge and Arief Gusnanto. Estimation of significance thresholds for

genomewide association scans. Genetic epidemiology, 32(3):227–234, 2008.

Richard H Duerr, Kent D Taylor, Steven R Brant, John D Rioux, Mark S Sil-

verberg, Mark J Daly, A Hillary Steinhart, Clara Abraham, Miguel Regueiro,

Anne Griffiths, et al. A genome-wide association study identifies il23r as an

inflammatory bowel disease gene. science, 314(5804):1461–1463, 2006.

Rick Durrett. Probability: theory and examples. Cambridge university press, 2010.

Douglas F Easton, Karen A Pooley, Alison M Dunning, Paul DP Pharoah, Deb-

orah Thompson, Dennis G Ballinger, Jeffery P Struewing, Jonathan Morrison,

Helen Field, Robert Luben, et al. Genome-wide association study identifies

novel breast cancer susceptibility loci. Nature, 447(7148):1087–1093, 2007.

Albert O Edwards, Robert Ritter, Kenneth J Abel, Alisa Manning, Carolien Pan-

huysen, and Lindsay A Farrer. Complement factor h polymorphism and age-

related macular degeneration. Science, 308(5720):421–424, 2005.

Lars Eldén. Algorithms for the regularization of ill-conditioned least squares prob-

lems. BIT Numerical Mathematics, 17(2):134–145, 1977.

William Feller. An introduction to probability theory and its applications: volume

II, volume 3. John Wiley & Sons London-New York-Sydney-Toronto, 1968.

Ronald A Fisher. XV.—The correlation between relatives on the supposition of

Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02):

399–433, 1919.

Charles W Fox and Stephen J Roberts. A tutorial on variational bayesian infer-

ence. Artificial intelligence review, 38(2):85–95, 2012.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statis-

tical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.

Francis Galton. Natural inheritance. Macmillan, 1894.

Eric R Gamazon, Heather E Wheeler, Kaanan P Shah, Sahar V Mozaffari, Keston

Aquino-Michaels, Robert J Carroll, Anne E Eyler, Joshua C Denny, Dan L

Nicolae, Nancy J Cox, et al. A gene-based association method for mapping

traits using reference transcriptome data. Nature genetics, 47(9):1091–1098,

2015.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian

data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA, 2014.

Edward I George and Robert E McCulloch. Variable selection via gibbs sampling.

Journal of the American Statistical Association, 88(423):881–889, 1993.

Edward I George and Robert E McCulloch. Approaches for bayesian variable

selection. Statistica sinica, pages 339–373, 1997.

M Gielen, PJ Lindsey, Cathérine Derom, HJM Smeets, NY Souren, ADC

Paulussen, R Derom, and JG Nijhuis. Modeling genetic and environmental

factors to increase heritability and ease the identification of candidate genes for

birth weight: a twin study. Behavior genetics, 38(1):44–54, 2008.

J Gil-Pelaez. Note on the inversion theorem. Biometrika, 38(3-4):481–482, 1951.

Arthur R Gilmour, Robin Thompson, and Brian R Cullis. Average information

reml: an efficient algorithm for variance parameter estimation in linear mixed

models. Biometrics, pages 1440–1450, 1995.

Simon J Godsill. On the relationship between MCMC model uncertainty methods.

Cambridge University, Engineering, Department, 1998.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU

Press, 2012.

IJ Good. The bayes/non-bayes compromise: A brief review. Journal of the Amer-

ican Statistical Association, 87(419):597–606, 1992.

Steven N Goodman. Toward evidence-based medical statistics. 2: The bayes

factor. Annals of internal medicine, 130(12):1005–1013, 1999.

Brian Gough. GNU scientific library reference manual. Network Theory Ltd.,

2009.

Peter J Green. Reversible jump markov chain monte carlo computation and

bayesian model determination. Biometrika, 82(4):711–732, 1995.

Peter J Green and Antonietta Mira. Delayed rejection in reversible jump

metropolis–hastings. Biometrika, 88(4):1035–1053, 2001.

219 Sonja Greven, Ciprian M Crainiceanu, Helmut K¨uchenhoff, and Annette Peters.

Restricted likelihood ratio testing for zero variance components in linear mixed

models. Journal of Computational and Graphical Statistics, 2012.

Justin Grimmer. An introduction to bayesian inference via variational approxi-

mations. Political Analysis, 19(1):32–47, 2011.

Yongtao Guan and Stephen M Krone. Small-world mcmc and convergence to

multi-modal distributions: From slow mixing to fast mixing. The Annals of

Applied Probability, 17(1):284–304, 2007.

Yongtao Guan and Matthew Stephens. Practical issues in imputation-based asso-

ciation mapping. PLoS Genetics, 4(12):e1000279, 2008.

Yongtao Guan and Matthew Stephens. Bayesian variable selection regression for

genome-wide association studies and other large-scale problems. The Annals of

Applied Statistics, pages 1780–1815, 2011.

Jonathan L Haines, Michael A Hauser, Silke Schmidt, William K Scott, Lana M

Olson, Paul Gallins, Kylee L Spencer, Shu Ying Kwan, Maher Noureddine,

John R Gilbert, et al. Complement factor h variant increases the risk of age-

related macular degeneration. Science, 308(5720):419–421, 2005.

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities.

Cambridge university press, 1952.

DA Harville. Matrix algebra from a statistician's perspective. Springer-Verlag

New York, Inc., 1997.

W Keith Hastings. Monte carlo sampling methods using markov chains and their

applications. Biometrika, 57(1):97–109, 1970.

220 Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common

diseases and complex traits. Nature Reviews Genetics, 6(2):95–108, 2005.

James P Hobert and George Casella. The effect of improper priors on gibbs

sampling in hierarchical linear mixed models. Journal of the American Statistical

Association, 91(436):1461–1473, 1996.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for

nonorthogonal problems. Technometrics, 12(1):55–67, 1970a.

Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to

nonorthogonal problems. Technometrics, 12(1):69–82, 1970b.

Peter D Hoff. A first course in Bayesian statistical methods. Springer Science &

Business Media, 2009.

Leslie Hogben. Handbook of linear algebra. CRC Press, 2006.

Clive J Hoggart, John C Whittaker, Maria De Iorio, and David J Balding. Si-

multaneous analysis of all snps in genome-wide and re-sequencing association

studies. PLoS Genet, 4(7):e1000130, 2008.

F Hoti and MJ Sillanpää. Bayesian mapping of genotype × expression interactions

in quantitative and qualitative traits. Heredity, 97(1):4–18, 2006.

Bryan N Howie, Peter Donnelly, and Jonathan Marchini. A flexible and accurate

genotype imputation method for the next generation of genome-wide association

studies. PLoS Genet, 5(6):e1000529, 2009.

Xichen Huang, Jin Wang, and Feng Liang. A variational algorithm for bayesian

variable selection. arXiv preprint arXiv:1602.07640, 2016.

Pirro G Hysi, Ching-Yu Cheng, Henriët Springelkamp, Stuart Macgregor, Jes-

sica N Cooke Bailey, Robert Wojciechowski, Veronique Vitart, Abhishek Nag,

Alex W Hewitt, René Höhn, et al. Genome-wide analysis of multi-ancestry co-

horts identifies new loci influencing intraocular pressure and susceptibility to

glaucoma. Nature genetics, 46(10):1126–1130, 2014.

Joseph G Ibrahim and Purushottam W Laud. On bayesian analysis of general-

ized linear models using jeffreys’s prior. Journal of the American Statistical

Association, 86(416):981–986, 1991.

Iuliana Ionita-Laza, Seunggeun Lee, Vlad Makarov, Joseph D Buxbaum, and Xi-

hong Lin. Sequence kernel association tests for the combined effect of rare and

common variants. The American Journal of Human Genetics, 92(6):841–853,

2013.

H Ishwaran and JS Rao. Bayesian nonparametric mcmc for large variable selection

problems. Unpublished manuscript, 2000.

Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist

and bayesian strategies. Annals of Statistics, pages 730–773, 2005.

Hemant Ishwaran and J Sunil Rao. Detecting differentially expressed genes in

microarrays using bayesian model selection. Journal of the American Statistical

Association, 2011.

Anne-Sophie Jannot, Georg Ehret, and Thomas Perneger. P < 5 × 10−8 has emerged

as a standard of statistical significance for genome-wide association studies.

Journal of clinical epidemiology, 68(4):460–465, 2015.

Harold Jeffreys. The theory of probability. OUP Oxford, 1961.

Jiming Jiang. Linear and generalized linear mixed models and their applications.

Springer Science & Business Media, 2007.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul.

An introduction to variational methods for graphical models. Machine learning,

37(2):183–233, 1999.

MA Kass, DK Heuer, EJ Higginbotham, et al. The ocular hypertension treat-

ment study: A randomized trial determines that topical ocular hypotensive med-

ication delays or prevents the onset of primary open-angle glaucoma. Archives

of Ophthalmology, 120(6):701–713, 2002. doi: 10.1001/archopht.120.6.701. URL

http://dx.doi.org/10.1001/archopht.120.6.701.

Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american

statistical association, 90(430):773–795, 1995.

Sekar Kathiresan, Olle Melander, Candace Guiducci, Aarti Surti, Noël P Burtt,

Mark J Rieder, Gregory M Cooper, Charlotta Roos, Benjamin F Voight, Aki S

Havulinna, et al. Six new loci associated with blood low-density lipoprotein

cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Na-

ture genetics, 40(2):189–197, 2008.

Hormuzd A Katki. Invited commentary: evidence-based evaluation of p values

and bayes factors. American Journal of Epidemiology, 168(4):384–388, 2008.

Riika Kilpikari and Mikko J Sillanpää. Bayesian analysis of multilocus association

in quantitative and qualitative traits. Genetic epidemiology, 25(2):122–135,

2003.

Robert J Klein, Caroline Zeiss, Emily Y Chew, Jen-Yue Tsai, Richard S Sackler,

Chad Haynes, Alice K Henning, John Paul SanGiovanni, Shrikant M Mane, Su-

san T Mayne, et al. Complement factor h polymorphism in age-related macular

degeneration. Science, 308(5720):385–389, 2005.

Femke CC Klouwer, Kevin Berendse, Sacha Ferdinandusse, Ronald JA Wanders,

Marc Engelen, et al. Zellweger spectrum disorders: clinical overview and man-

agement approach. Orphanet journal of rare diseases, 10(1):1, 2015.

Karl-Rudolf Koch. Introduction to Bayesian statistics. Springer Science & Business

Media, 2007.

Lynn Kuo and Bani Mallick. Variable selection for regression models. Sankhy¯a:

The Indian Journal of Statistics, Series B, pages 65–81, 1998.

Lydia Coulter Kwee, Dawei Liu, Xihong Lin, Debashis Ghosh, and Michael P

Epstein. A powerful and flexible multilocus association test for quantitative

traits. The American Journal of Human Genetics, 82(2):386–397, 2008.

Eric S Lander. The new genomics: global views of biology. Science, 274(5287):

536, 1996.

Michael Lavine and Mark J Schervish. Bayes factors: what they are and what

they are not. The American Statistician, 53(2):119–122, 1999.

DN Lawley. A general method for approximating to the distribution of likelihood

ratio criteria. Biometrika, 43(3/4):295–303, 1956.

Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Es-

timating missing heritability for disease from genome-wide association studies.

The American Journal of Human Genetics, 88(3):294–305, 2011.

Seunggeun Lee, Gonçalo R Abecasis, Michael Boehnke, and Xihong Lin. Rare-

variant association analysis: study designs and statistical tests. The American

Journal of Human Genetics, 95(1):5–23, 2014.

David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and

mixing times. American Mathematical Soc., 2009.

Bingshan Li and Suzanne M Leal. Methods for detecting associations with rare

variants for common diseases: application to analysis of sequence data. The

American Journal of Human Genetics, 83(3):311–321, 2008.

Jiahan Li, Kiranmoy Das, Guifang Fu, Runze Li, and Rongling Wu. The bayesian

lasso for genome-wide association studies. Bioinformatics, 27(4):516–523, 2011.

Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger.

Mixtures of g priors for Bayesian variable selection. Journal of the American

Statistical Association, 103(481), 2008. ISSN 0162-1459.

Dennis V Lindley. A statistical paradox. Biometrika, pages 187–192, 1957.

Jun S Liu. Monte Carlo strategies in scientific computing. Springer Science &

Business Media, 2008.

Eugene Lukacs and Edgar P King. A property of the normal distribution. The

Annals of Mathematical Statistics, 25(2):389–394, 1954.

David J Lunn, John C Whittaker, and Nicky Best. A bayesian toolkit for genetic

association studies. Genetic epidemiology, 30(3):231–247, 2006.

Stuart Macgregor, Belinda K Cornes, Nicholas G Martin, and Peter M Visscher.

Bias, precision and heritability of self-reported and clinically measured height

in australian twins. Human genetics, 120(4):571–580, 2006.

Brian K Maples, Simon Gravel, Eimear E Kenny, and Carlos D Bustamante.

Rfmix: a discriminative modeling approach for rapid and robust local-ancestry

inference. The American Journal of Human Genetics, 93(2):278–288, 2013.

Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly.

A new multipoint method for genome-wide association studies by imputation

of genotypes. Nature genetics, 39(7):906–913, 2007.

Kantilal Varichand Mardia, John T Kent, and John M Bibby. Multivariate anal-

ysis. 1980.

Eden R Martin, Eric H Lai, John R Gilbert, Allison R Rogala, AJ Afshari, John

Riley, KL Finch, JF Stevens, KJ Livak, Brandon D Slotterbeck, et al. Snping

away at complex diseases: analysis of single-nucleotide polymorphisms around

apoe in alzheimer disease. The American Journal of Human Genetics, 67(2):

383–394, 2000.

Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian

Little, John PA Ioannidis, and Joel N Hirschhorn. Genome-wide association

studies for complex traits: consensus, uncertainty and challenges. Nature re-

views genetics, 9(5):356–369, 2008.

THE Meuwissen and ME Goddard. Mapping multiple qtl using linkage disequi-

librium and linkage analysis information and multitrait data. Genet. Sel. Evol,

36:261–279, 2004.

THE Meuwissen, BJ Hayes, and ME Goddard. Prediction of total genetic value

using genome-wide dense marker maps. Genetics, 157(4):1819–1829, 2001.

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability.

Springer Science & Business Media, 2012.

Alan Miller. Subset selection in regression. CRC Press, 2002.

Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear

regression. Journal of the American Statistical Association, 83(404):1023–1032,

1988.

Jesper Møller, Anthony N Pettitt, R Reeves, and Kasper K Berthelsen. An efficient

markov chain monte carlo method for distributions with intractable normalising

constants. Biometrika, 93(2):451–458, 2006.

Richard W Morris and Norman L Kaplan. On the advantage of haplotype analysis

in the presence of multiple disease susceptibility alleles. Genetic epidemiology,

23(3):221–233, 2002.

Alison A Motsinger-Reif, Eric Jorgenson, Mary V Relling, Deanna L Kroetz,

Richard Weinshilboum, Nancy J Cox, and Dan M Roden. Genome-wide asso-

ciation studies in pharmacogenomics: successes and lessons. Pharmacogenetics

and genomics, 23(8):383, 2013.

Iain Murray, Zoubin Ghahramani, and David MacKay. Mcmc for doubly-

intractable distributions. arXiv preprint arXiv:1206.6848, 2012.

Michael Naaman. Almost sure hypothesis testing and a resolution of the jeffreys-

lindley paradox. Electronic Journal of Statistics, 10(1):1526–1550, 2016.

Ilja M Nolte, André R de Vries, Geert T Spijker, Ritsert C Jansen, Dumitru

Brinza, Alexander Zelikovsky, and Gerard J te Meerman. Association testing

by haplotype-sharing methods applicable to whole-genome analysis. In BMC

proceedings, volume 1, page S129. BioMed Central, 2007.

Dale R Nyholt. A simple correction for multiple testing for single-nucleotide poly-

morphisms in linkage disequilibrium with each other. The American Journal of

Human Genetics, 74(4):765–769, 2004.

Anthony O’Hagan. Fractional bayes factors for model comparison. Journal of the

Royal Statistical Society. Series B (Methodological), pages 99–138, 1995.

Anthony O’Hagan and Jonathan J Forster. Kendall’s advanced theory of statistics,

volume 2B: Bayesian inference, volume 2. Arnold, 2004.

Robert B O’Hara and Mikko J Sillanpää. A review of bayesian variable selection

methods: what, how and which. Bayesian analysis, 4(1):85–117, 2009.

Jun Ohashi and Katsushi Tokunaga. The power of genome-wide association studies

of complex disease genes: statistical limitations of indirect approaches using snp

markers. Journal of human genetics, 46(8):478–482, 2001.

A Bilge Ozel, Sayoko E Moroi, David M Reed, Melisa Nika, Caroline M Schmidt,

Sara Akbari, Kathleen Scott, Frank Rozsa, Hemant Pawar, David C Musch,

et al. Genome-wide association study and meta-analysis of intraocular pressure.

Human genetics, 133(1):41–57, 2014.

Roman Pahl and Helmut Schäfer. Permory: an ld-exploiting permutation test

algorithm for powerful genome-wide association testing. Bioinformatics, 26(17):

2093–2100, 2010.

Orestis A Panagiotou and John PA Ioannidis. What should the genome-wide sig-

nificance threshold be? empirical replication of borderline genetic associations.

International journal of epidemiology, 41(1):273–286, 2012.

Trevor Park and George Casella. The bayesian lasso. Journal of the American

Statistical Association, 103(482):681–686, 2008.

Peter H Peskun. Optimum monte-carlo sampling using markov chains. Biometrika,

60(3):607–612, 1973.

Ronald L Plackett. Some theorems in least squares. Biometrika, 37(1/2):149–157,

1950.

William H Press. Numerical recipes 3rd edition: The art of scientific computing.

Cambridge university press, 2007.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A

Shadick, and David Reich. Principal components analysis corrects for strat-

ification in genome-wide association studies. Nature genetics, 38(8):904–909,

2006.

Jonathan K Pritchard and Nancy J Cox. The allelic architecture of human disease

genes: common disease–common variant or not? Human molecular genetics, 11

(20):2417–2423, 2002.

Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of pop-

ulation structure using multilocus genotype data. Genetics, 155(2):945–959,

2000.

Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR

Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker,

Mark J Daly, et al. Plink: a tool set for whole-genome association and

population-based linkage analyses. The American Journal of Human Genet-

ics, 81(3):559–575, 2007.

Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model av-

eraging for linear regression models. Journal of the American Statistical Asso-

ciation, 92(437):179–191, 1997.

David E Reich and Eric S Lander. On the allelic spectrum of human disease.

TRENDS in Genetics, 17(9):502–510, 2001.

Sidney I Resnick. A probability path. Springer Science & Business Media, 2013.

Sheldon M Ross. Stochastic processes, volume 2. John Wiley & Sons New York,

1996.

Stephen Sawcer. Bayes factors in complex genetics. European Journal of Human

Genetics, 18(7):746–750, 2010.

Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-

scale population genotype data: applications to inferring missing genotypes and

haplotypic phase. The American Journal of Human Genetics, 78(4):629–644,

2006.

Angelo Scuteri, Serena Sanna, Wei-Min Chen, Manuela Uda, Giuseppe Albai,

James Strait, Samer Najjar, Ramaiah Nagaraja, Marco Orrù, Gianluca Usala,

et al. Genome-wide association scan shows genetic variants in the fto gene are

associated with obesity-related traits. PLoS Genet, 3(7):e115, 2007.

Shaun R Seaman and Sylvia Richardson. Equivalence of prospective and retro-

spective models in the bayesian analysis of case-control studies. Biometrika, 91

(1):15–25, 2004.

Shayle R Searle, George Casella, and Charles E McCulloch. Variance components,

volume 391. John Wiley & Sons, 2009.

Vincent Segura, Bjarni J Vilhjálmsson, Alexander Platt, Arthur Korte, Ümit

Seren, Quan Long, and Magnus Nordborg. An efficient multi-locus mixed-model

approach for genome-wide association studies in structured populations. Nature

genetics, 44(7):825–830, 2012.

Steven G Self and Kung-Yee Liang. Asymptotic properties of maximum likelihood

estimators and likelihood ratio tests under nonstandard conditions. Journal of

the American Statistical Association, 82(398):605–610, 1987.

Thomas Sellke, M. J. Bayarri, and James O Berger. Calibration of p values

for testing precise null hypotheses. The American Statistician, 55(1):62–71,

2001. doi: 10.1198/000313001300339950. URL http://dx.doi.org/10.1198/

000313001300339950.

Śaunak Sen and Gary A Churchill. A statistical framework for quantitative trait

mapping. Genetics, 159(1):371–387, 2001.

D Serre. Matrices: Theory and Applications. Springer, New York, 2002.

Bertrand Servin and Matthew Stephens. Imputation-based analysis of association

studies: candidate regions and quantitative traits. PLoS Genetics, 3(7):e114,

2007.

Jun Shao. Mathematical Statistics. Springer Texts in Statistics. Springer, second

edition, 2003. ISBN 9780387953823.

Allan R. Shepard, Nasreen Jacobson, J. Cameron Millar, Iok-Hou Pang,

H. Thomas Steely, Charles C. Searby, Val C. Sheffield, Edwin M. Stone, and

Abbot F. Clark. Glaucoma-causing myocilin mutants require the peroxiso-

mal targeting signal-1 receptor (pts1r) to elevate intraocular pressure. Human

Molecular Genetics, 16(6):609–617, 2007. doi: 10.1093/hmg/ddm001. URL

http://hmg.oxfordjournals.org/content/16/6/609.abstract.

Ilya Shlyakhter, Pardis C Sabeti, and Stephen F Schaffner. Cosi2: an efficient

simulator of exact and approximate coalescent with selection. Bioinformatics,

30(23):3427–3429, 2014.

Zbyněk Šidák. On multivariate normal probabilities of rectangles: their depen-

dence on correlations. The Annals of Mathematical Statistics, pages 1425–1434,

1968.

Zbyněk Šidák. On probabilities of rectangles in multivariate student distributions:

their dependence on correlations. The Annals of Mathematical Statistics, pages

169–175, 1971.

Mikko J Sillanpää and Elja Arjas. Bayesian mapping of multiple quantitative trait

loci from incomplete inbred line cross data. Genetics, 148(3):1373–1388, 1998.

Václav Šmídl and Anthony Quinn. The variational Bayes method in signal pro-

cessing. Springer Science & Business Media, 2006.

David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van

Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.

Eli A Stahl, Daniel Wegmann, Gosia Trynka, Javier Gutierrez-Achury, Ron Do,

Benjamin F Voight, Peter Kraft, Robert Chen, Henrik J Kallberg, Fina AS

Kurreeman, et al. Bayesian inference analyses of the polygenic architecture of

rheumatoid arthritis. Nature genetics, 44(5):483–489, 2012.

Matthew Stephens and David J Balding. Bayesian statistical methods for genetic

association studies. Nature Reviews Genetics, 10(10):681–690, 2009.

John D Storey and Robert Tibshirani. Statistical significance for genomewide

studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445,

2003.

Daniel O Stram and Jae Won Lee. Variance components testing in the longitudinal

mixed effects model. Biometrics, pages 1171–1177, 1994.

Wenguang Sun and Tony T Cai. Large-scale multiple testing under dependence.

Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71

(2):393–424, 2009.

Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander,

Matthieu Foll, and Christophe Dessimoz. Approximate bayesian computation.

PLoS Comput Biol, 9(1):e1002803, 2013.

James Joseph Sylvester. XXXVII. On the relation between the minor determinants

of linearly equivalent quadratic functions. The London, Edinburgh, and Dublin

Philosophical Magazine and Journal of Science, 1(4):295–305, 1851.

Cajo JF Ter Braak, Martin P Boer, and Marco CAM Bink. Extending xu’s

bayesian model for estimating polygenic effects using markers of the entire

genome. Genetics, 170(3):1435–1438, 2005.

The International HapMap Consortium. Integrating common and rare genetic

variation in diverse human populations. Nature, 467(7311):52–58, 2010.

Gilles Thomas, Kevin B Jacobs, Meredith Yeager, Peter Kraft, Sholom Wacholder,

Nick Orr, Kai Yu, Nilanjan Chatterjee, Robert Welch, Amy Hutchinson, et al.

Multiple loci identified in a genome-wide association study of prostate cancer.

Nature genetics, 40(3):310–315, 2008.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of

the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Andrej Nikolaevich Tikhonov and Vasiliy Yakovlevich Arsenin. Solutions of ill-

posed problems. 1977.

Michael E Tipping. Bayesian inference: An introduction to principles and practice

in machine learning. In Advanced lectures on machine Learning, pages 41–62.

Springer, 2004.

John A Todd, Neil M Walker, Jason D Cooper, Deborah J Smyth, Kate Downes,

Vincent Plagnol, Rebecca Bailey, Sergey Nejentsev, Sarah F Field, Felicity

Payne, et al. Robust associations of four new chromosome regions from genome-

wide analyses of type 1 diabetes. Nature genetics, 39(7):857–864, 2007.

Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam,

1997.

Pekka Uimari and Ina Hoeschele. Mapping-linked quantitative trait loci using

bayesian analysis and markov chain monte carlo algorithms. Genetics, 146(2):

735–743, 1997.

Megan Ulmer, Jun Li, Brian L. Yaspan, Ayse Bilge Ozel, Julia E. Richards,

Sayoko E. Moroi, Felicia Hawthorne, Donald L. Budenz, David S. Friedman,

Douglas Gaasterland, Jonathan Haines, Jae H. Kang, Richard Lee, Paul Lichter,

Yutao Liu, Louis R. Pasquale, Margaret Pericak-Vance, Anthony Realini, Joel S.

Schuman, Kuldev Singh, Douglas Vollrath, Robert Weinreb, Gadi Wollstein,

Donald J. Zack, Kang Zhang, Terri Young, R. Rand Allingham, Janey L. Wiggs,

Allison Ashley-Koch, and Michael A. Hauser. Genome-wide analysis of central

corneal thickness in primary open-angle glaucoma cases in the neighbor and

glaugen consortia: the effects of cct-associated variants on poag risk. Investigative

Ophthalmology & Visual Science, 53(8):4468, 2012. doi: 10.1167/iovs.12-9784.

URL http://dx.doi.org/10.1167/iovs.12-9784.

Leonieke ME van Koolwijk, Wishal D Ramdas, M Kamran Ikram, Nomdo M

Jansonius, Francesca Pasutto, Pirro G Hysi, Stuart Macgregor, Sarah F Janssen,

Alex W Hewitt, Ananth C Viswanathan, et al. Common genetic determinants

of intraocular pressure and primary open-angle glaucoma. PLoS Genet, 8(5):

e1002611, 2012.

Jon Wakefield. Bayes factors for genome-wide association studies: comparison

with p-values. Genetic Epidemiology, 33(1):79–86, 2009.

Gertraud Malsiner Walli. Bayesian variable selection in normal regression models.

PhD thesis, Institut für Angewandte Statistik, 2010.

Hui Wang, Yuan-Ming Zhang, Xinmin Li, Godfred L Masinde, Subburaman Mo-

han, David J Baylink, and Shizhong Xu. Bayesian shrinkage estimation of

quantitative trait loci parameters. Genetics, 170(1):465–480, 2005.

Michael N Weedon, Hana Lango, Cecilia M Lindgren, Chris Wallace, David M

Evans, Massimo Mangino, Rachel M Freathy, John RB Perry, Suzanne Stevens,

Alistair S Hall, et al. Genome-wide association analysis identifies 20 loci that

influence adult height. Nature genetics, 40(5):575–583, 2008.

RN Weinreb, T Aung, and FA Medeiros. The pathophysiology and treatment of

glaucoma: A review. JAMA, 311(18):1901–1911, 2014. doi: 10.1001/jama.2014.

3192. URL http://dx.doi.org/10.1001/jama.2014.3192.

Daphna Weissglas-Volkov, Carlos A Aguilar-Salinas, Elina Nikkola, Kerry A

Deere, Ivette Cruz-Bautista, Olimpia Arellano-Campos, Linda Liliana Muñoz-

Hernandez, Lizeth Gomez-Munguia, Maria Luisa Ordoñez-Sánchez, Prasad

MV Linga Reddy, et al. Genomic study in mexicans identifies a new locus

for triglycerides and refines european lipid loci. Journal of medical genetics, 50

(5):298–308, 2013.

Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy

Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff,

et al. The nhgri gwas catalog, a curated resource of snp-trait associations.

Nucleic acids research, 42(D1):D1001–D1006, 2014.

Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing

composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.

Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong

Lin. Rare-variant association testing for sequencing data with the sequence

kernel association test. The American Journal of Human Genetics, 89(1):82–

93, 2011.

Hanli Xu and Yongtao Guan. Detecting local haplotype sharing and haplotype

association. Genetics, 197(3):823–838, 2014.

Shizhong Xu. Estimating polygenic effects using markers of the entire genome.

Genetics, 163(2):789–801, 2003.

Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Hen-

ders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin,

Grant W Montgomery, et al. Common snps explain a large proportion of the

heritability for human height. Nature genetics, 42(7):565–569, 2010.

Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. Gcta: a

tool for genome-wide complex trait analysis. The American Journal of Human

Genetics, 88(1):76–82, 2011.

Shiming Yang and Matthias K Gobbert. The optimal relaxation parameter for the

sor method applied to a classical model problem. Technical Report TR2007-6,

Department of Mathematics and Statistics, University of Maryland, Baltimore

County, 2007.

Nengjun Yi. A unified markov chain monte carlo framework for mapping multiple

quantitative trait loci. Genetics, 167(2):967–975, 2004.

Nengjun Yi and Shizhong Xu. Bayesian lasso for quantitative trait loci mapping.

Genetics, 179(2):1045–1055, 2008.

Nengjun Yi, Varghese George, and David B Allison. Stochastic search variable

selection for identifying multiple quantitative trait loci. Genetics, 164(3):1129–

1138, 2003.

Nengjun Yi, Brian S Yandell, Gary A Churchill, David B Allison, Eugene J Eisen,

and Daniel Pomp. Bayesian model selection for genome-wide epistatic quanti-

tative trait loci analysis. Genetics, 170(3):1333–1344, 2005.

David Young. Iterative methods for solving partial difference equations of elliptic

type. Transactions of the American Mathematical Society, 76(1):92–111, 1954.

Eleftheria Zeggini, Laura J Scott, Richa Saxena, Benjamin F Voight, Jonathan L

Marchini, Tianle Hu, Paul IW de Bakker, Gonçalo R Abecasis, Peter Alm-

gren, Gitte Andersen, et al. Meta-analysis of genome-wide association data and

large-scale replication identifies additional susceptibility loci for type 2 diabetes.

Nature genetics, 40(5):638–645, 2008.

Arnold Zellner. On assessing prior distributions and bayesian regression analysis

with g-prior distributions. Bayesian Inference and Decision Techniques: Essays

in Honor of Bruno De Finetti, 6:233–243, 1986.

Arnold Zellner and Aloysius Siow. Posterior odds ratios for selected regression

hypotheses. Trabajos de estadística y de investigación operativa, 31(1):585–603,

1980.

Quan Zhou, Liang Zhao, and Yongtao Guan. Strong selection at mhc in mexicans

since admixture. PLoS Genet, 12(2):e1005847, 2016.

Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with

bayesian sparse linear mixed models. PLoS Genet, 9(2):e1003264, 2013.
