Thesis
MSc in Medical Statistics
Department of Health Sciences
University of Leicester, United Kingdom

Application of Bayesian hierarchical generalized linear models using weakly informative prior distributions to identify rare genetic variant effects on blood pressure

Wilmar Igl
March 2015

Summary

Background: Rare genetic variants are currently discussed as a source of the "missing heritability" of complex traits. Bayesian hierarchical models have been proposed as an efficient method for estimating and aggregating the conditional effects of rare variants. Here, such models are applied to identify rare variant effects on blood pressure.

Methods: Empirical data provided by the Genetic Analysis Workshop 19 (2014) included 1,851 Mexican-American individuals with diastolic blood pressure (DBP), systolic blood pressure (SBP), hypertension (HTN), and 461,868 variants from whole-exome sequencing of odd-numbered chromosomes. Bayesian hierarchical generalized linear models using weakly informative prior distributions were applied.

Results: Associations were identified for the rare variants chr1:204214013 (estimate = 39.6, 95% credible interval (CrI) = [25.3, 53.9], Bayesian p = 6.8 × 10^-8) in the PLEKHA6 gene and chr11:118518698 (estimate = 32.2, 95% CrI = [20.6, 43.9], Bayesian p = 7.0 × 10^-8) in the PHLEDB1 gene. Joint effects of grouped rare variants were found on DBP in 23 genes (Bayesian p = [8.8 × 10^-14, 9.3 × 10^-8]) and on SBP in 21 genes (Bayesian p = [8.6 × 10^-12, 7.8 × 10^-8]) in pathways related to hemostasis, sodium/calcium transport, ciliary activity, and others. No association with hypertension was detected.

Conclusions: Bayesian hierarchical generalized linear models with weakly informative priors can successfully be applied in exome-wide genetic association analyses of rare variants.
Acknowledgments

I would like to thank my supervisors, Professor John Thompson, PhD, and Louise Wain, PhD, for developing this challenging project with me and for supporting it with their encouragement and advice.

Additionally, I am grateful to the Department of Health Sciences and all lecturers and tutors for providing such a stimulating and friendly environment for the Master of Medical Statistics program. In particular, Stephanie Hubbard, as course director, contributed greatly to the success of this program.

Furthermore, I would like to express my gratitude to the Department of Health Sciences for supporting, with a travel grant, the presentation of the first results of my thesis at the Genetic Analysis Workshop 19, Vienna, from August 24 to 27, 2014.

Moreover, I want to thank my former employer, Professor Iris Heid, University of Regensburg, for enabling me to attend the courses of the Master of Medical Statistics program and for granting me special leave to complete my thesis.

Finally, to adapt a proverb freely, I want to emphasize that "It takes a village to write a thesis." Therefore, I would like to give my special thanks to my partner Bettina, our parents, especially Lieselotte, our friend Petra, and all others who made sure that I had the time to work and study and that my son Moritz always found himself in a safe, understanding, and loving environment.

Fürth, March 2015
Wilmar Igl

Table of contents

Summary
Acknowledgments
Table of contents
List of tables
List of figures
1 Introduction
  1.1 Missing heritability, rare variants, and Bayesian statistics
  1.2 Previous publications
  1.3 Overview
2 Biological and medical background
  2.1 Definitions
  2.2 General epidemiology
  2.3 Genetic epidemiology
    2.3.1 Rare and common genetic variation
    2.3.2 Genetics of blood pressure
3 Statistical background
  3.1 Bayesian and frequentist statistics
  3.2 Bayes theorem
  3.3 Bayes factors
  3.4 Bayesian p values
    3.4.1 Frequentist p values
    3.4.2 The one-sided testing problem
    3.4.3 Classes of prior distributions
    3.4.4 Non-informative and weakly informative priors
    3.4.5 Uniformity of p values
    3.4.6 Summary
4 Bayesian hierarchical generalized linear models
  4.1 Overview
  4.2 Linear predictor, link function, and error distribution
  4.3 Prior distributions
    4.3.1 Variant effects
    4.3.2 Group effects
    4.3.3 Covariate effects, intercept, and dispersion
  4.4 Estimation algorithm (IWLS-EM)
  4.5 Bayesian p values
    4.5.1 Individual variants
    4.5.2 Grouped variants
5 Methods
  5.1 Data
    5.1.1 Simulated data
    5.1.2 GAW19 Data
  5.2 Statistical models
    5.2.1 Frequentist analysis of individual variants
    5.2.2 Bayesian analysis of multiple variants
    5.2.3 Multiple testing
    5.2.4 Missing data
    5.2.5 Correction of ethnicity and relatedness
    5.2.6 Criteria for model evaluation
  5.3 Software
    5.3.1 EPACTS
    5.3.2 R/BhGLM
    5.3.3 Other
  5.4 Hardware
6 Results
  6.1 Descriptives
    6.1.1 Simulated data I
    6.1.2 Simulated data II (GAW 19)
    6.1.3 Empirical data (GAW 19)
  6.2 Simulated data I
    6.2.1 Frequentist single-variant analysis
    6.2.2 Bayesian hierarchical multiple-variant analysis
    6.2.3 Summary
  6.3 Simulated data II (GAW 19)
    6.3.1 Frequentist single-variant analysis
    6.3.2 Bayesian hierarchical multiple-variant analysis
    6.3.3 Summary
  6.4 Empirical data (GAW 19)
    6.4.1 Frequentist single-variant analysis
    6.4.2 Bayesian hierarchical multiple-variant analysis
    6.4.3 Comparison of single and multiple variant analysis
7 Discussion
  7.1 Summary
  7.2 Statistical models
    7.2.1 Frequentist models
    7.2.2 Bayesian hierarchical models
    7.2.3 Comparison of applied models
  7.3 Genetics of blood pressure
    7.3.1 Individual variant effects
    7.3.2 Grouped variant effects
  7.4 Modelling of prior biological information
  7.5 Strengths and weaknesses
    7.5.1 Topic
    7.5.2 Method
    7.5.3 Model
  7.6 Conclusion
References
A Appendix
  A.1 Generalized linear model of the normal distribution
  A.2 Generalized linear model of the binomial distribution
  A.3 Examples of prior distributions
  A.4 Supplementary tables
  A.5 Supplementary figures

List of tables

2.1 Top ten genes or gene regions associated with blood pressure, hypertension, or hypotension according to the GWAS catalog (Welter et al. (2014), retrieved December 8, 2014). Genes are sorted by p value of the index variant.
3.1 Scale of interpretation of the Bayes factor according to Kass and Raftery (1995) (adapted from Wagenmakers et al. (2008))
6.1 Descriptives of simulated data II (1,943 individuals, GAW19)
6.2 Descriptives of the empirical data (1,943 individuals, GAW19)
6.3 Genetic variants in the GAW19 data (1,943 individuals)
6.4 Table of average coefficients of a frequentist normal linear model of simulated systolic blood pressure (SBP). SBP was predicted from individual variants (model 1), unweighted genetic scores (model 2), or weighted genetic scores (model 3). Analysis was performed in 1,000 datasets with 10,000 individuals without genetic effects (simulated data Ia).
6.5 Table of average coefficients of a reduced Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy(0, 1) distributions. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia). Associations between systolic blood pressure and individual variant effects (model 1), average variant effects (model 2), or joint variant effects (model 3) of grouped variants were tested.
6.6 Table of average coefficients of a full Bayesian hierarchical model predicting systolic blood pressure using weakly informative prior Cauchy(0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).
6.7 Table of average coefficients of a full Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy(0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals with genetic effects (simulated data Ib).
6.8 Overview of genome-wide significant associations in 128 candidate genes with non-null effects and nominally significant associations in 1,000 genes with null effects in simulated data II (GAW19)
6.9 Associations of top ten variants with observed diastolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)
6.10 Associations of top ten variants with observed systolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)
6.11 Associations of top ten variants with observed hypertension using the Firth logistic model in empirical GAW19 data (sorted by p value)
6.12 Associations of top ten variants with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)
6.13 Associations of top ten variants with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)