Thesis

MSc in Medical Statistics, Department of Health Sciences, University of Leicester, United Kingdom

Application of Bayesian hierarchical generalized linear models using weakly informative prior distributions to identify rare genetic variant effects on blood pressure

Wilmar Igl

March 2015

Summary

Background Rare genetic variants are currently discussed as a source of the “missing heritability” of complex traits. Bayesian hierarchical models have been proposed as an efficient method for the estimation and aggregation of conditional effects of rare variants. Here, such models are applied to identify rare variant effects on blood pressure.

Methods Empirical data provided by the Genetic Analysis Workshop 19 (2014) included 1,851 Mexican-American individuals with diastolic blood pressure (DBP), systolic blood pressure (SBP), hypertension (HTN), and 461,868 variants from whole-exome sequencing of odd-numbered chromosomes. Bayesian hierarchical generalized linear models using weakly informative prior distributions were applied.

Results Associations of the rare variants chr1:204214013 (estimate = 39.6, 95% credible interval (CrI95%) = [25.3, 53.9], Bayesian p = 6.8 × 10⁻⁸) in the PLEKHA6 gene and chr11:118518698 (estimate = 32.2, CrI95% = [20.6, 43.9], Bayesian p = 7.0 × 10⁻⁸) in the PHLEDB1 gene were identified. Joint effects of grouped rare variants on DBP in 23 genes (Bayesian p = [8.8 × 10⁻¹⁴, 9.3 × 10⁻⁸]) and on SBP in 21 genes (Bayesian p = [8.6 × 10⁻¹², 7.8 × 10⁻⁸]) in pathways related to hemostasis, sodium/calcium transport, ciliary activity, and others were found. No association with hypertension was detected.

Conclusions Bayesian hierarchical generalized linear models with weakly informative priors can successfully be applied in exome-wide genetic association analyses of rare variants.

Acknowledgments

I would like to thank my supervisors Professor John Thompson, PhD, and Louise Wain, PhD, for developing this challenging project with me and for supporting it with their encouragement and advice. Additionally, I am grateful to the Department of Health Sciences and all lecturers and tutors for providing such a stimulating and friendly environment for the Master of Medical Statistics program. In particular, Stephanie Hubbard, as the course director, contributed greatly to the success of this program. Furthermore, I would like to express my gratitude to the Department of Health Sciences for supporting the presentation of first results of my thesis at the Genetic Analysis Workshop 19, Vienna, from August 24 to 27, 2014, with a travel grant. Moreover, I want to thank my former employer, Professor Iris Heid, University of Regensburg, for enabling me to attend the courses of the Master of Medical Statistics program and for granting me special leave to complete my thesis. Finally, freely adapting a proverb, I want to emphasize that “It takes a village to write a thesis.” Therefore, I would like to give my special thanks to my partner Bettina, our parents, especially Lieselotte, our friend Petra, and all others who made sure that I had the time to work and study and that my son Moritz always found himself in a safe, understanding, and loving environment.

Fürth, March 2015
Wilmar Igl

Table of contents

Summary
Acknowledgments
Table of contents
List of tables
List of figures

1 Introduction
1.1 Missing heritability, rare variants, and Bayesian statistics
1.2 Previous publications
1.3 Overview

2 Biological and medical background
2.1 Definitions
2.2 General epidemiology
2.3 Genetic epidemiology
2.3.1 Rare and common genetic variation
2.3.2 Genetics of blood pressure

3 Statistical background
3.1 Bayesian and frequentist statistics
3.2 Bayes theorem
3.3 Bayes factors
3.4 Bayesian p values
3.4.1 Frequentist p values
3.4.2 The one-sided testing problem
3.4.3 Classes of prior distributions
3.4.4 Non-informative and weakly informative priors
3.4.5 Uniformity of p values
3.4.6 Summary

4 Bayesian hierarchical generalized linear models
4.1 Overview
4.2 Linear predictor, link function, and error distribution
4.3 Prior distributions
4.3.1 Variant effects
4.3.2 Group effects
4.3.3 Covariate effects, intercept, and dispersion
4.4 Estimation algorithm (IWLS-EM)
4.5 Bayesian p values
4.5.1 Individual variants
4.5.2 Grouped variants

5 Methods
5.1 Data
5.1.1 Simulated data
5.1.2 GAW19 Data
5.2 Statistical models
5.2.1 Frequentist analysis of individual variants
5.2.2 Bayesian analysis of multiple variants
5.2.3 Multiple testing
5.2.4 Missing data
5.2.5 Correction of ethnicity and relatedness
5.2.6 Criteria for model evaluation
5.3 Software
5.3.1 EPACTS
5.3.2 R/BhGLM
5.3.3 Other
5.4 Hardware

6 Results
6.1 Descriptives
6.1.1 Simulated data I
6.1.2 Simulated data II (GAW 19)
6.1.3 Empirical data (GAW 19)
6.2 Simulated data I
6.2.1 Frequentist single-variant analysis
6.2.2 Bayesian hierarchical multiple-variant analysis
6.2.3 Summary
6.3 Simulated data II (GAW 19)
6.3.1 Frequentist single-variant analysis
6.3.2 Bayesian hierarchical multiple-variant analysis
6.3.3 Summary
6.4 Empirical data (GAW 19)
6.4.1 Frequentist single-variant analysis
6.4.2 Bayesian hierarchical multiple-variant analysis
6.4.3 Comparison of single and multiple variant analysis

7 Discussion
7.1 Summary
7.2 Statistical models
7.2.1 Frequentist models
7.2.2 Bayesian hierarchical models
7.2.3 Comparison of applied models
7.3 Genetics of blood pressure
7.3.1 Individual variant effects
7.3.2 Grouped variant effects
7.4 Modelling of prior biological information
7.5 Strengths and weaknesses
7.5.1 Topic
7.5.2 Method
7.5.3 Model
7.6 Conclusion

References

A Appendix
A.1 Generalized linear model of the normal distribution
A.2 Generalized linear model of the binomial distribution
A.3 Examples of prior distributions
A.4 Supplementary tables
A.5 Supplementary figures

List of tables

2.1 Top ten genes or gene regions associated with blood pressure, hypertension, or hypotension according to the GWAS catalog (Welter et al. (2014), retrieved December 8, 2014). Genes are sorted by p value of the index variant.

3.1 Scale of interpretation of the Bayes factor according to Kass and Raftery (1995) (adapted from Wagenmakers et al. (2008))

6.1 Descriptives of simulated data II (1,943 individuals, GAW19)

6.2 Descriptives of the empirical data (1,943 individuals, GAW19)

6.3 Genetic variants in the GAW19 data (1,943 individuals)

6.4 Table of average coefficients of a frequentist normal linear model of simulated systolic blood pressure (SBP). SBP was predicted from individual variants (model 1), unweighted genetic scores (model 2), or weighted genetic scores (model 3). Analysis was performed in 1,000 datasets with 10,000 individuals without genetic effects (simulated data Ia).

6.5 Table of average coefficients of a reduced Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia). Associations between systolic blood pressure and individual variant effects (model 1), average variant effects (model 2), or joint variant effects (model 3) of grouped variants were tested.


6.6 Table of average coefficients of a full Bayesian hierarchical model predicting systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

6.7 Table of average coefficients of a full Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals with genetic effects (simulated data Ib).

6.8 Overview on genome-wide significant associations in 128 candidate genes with non-null effects and nominally significant associations in 1,000 genes with null effects in simulated data II (GAW19)

6.9 Associations of top ten variants with observed diastolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)

6.10 Associations of top ten variants with observed systolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)

6.11 Associations of top ten variants with observed hypertension using the Firth logistic model in empirical GAW19 data (sorted by p value)

6.12 Associations of top ten variants associated with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)

6.13 Associations of top ten variants with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)


6.14 Associations of top ten variants with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) in empirical GAW19 data (sorted by Bayesian p value)

6.15 Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

6.16 Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

6.17 Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

6.18 Genome-wide significant associations of genes with observed diastolic blood pressure using a Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

6.19 Genome-wide significant associations of genes with observed systolic blood pressure using a Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.1 Descriptives of a single simulated dataset (10,000 individuals) selected from 1,000 datasets with genetic effects (simulated data Ib)

A.2 Genetic variants in the primary analysis set (461,868 variants)


A.3 Table of averaged coefficients of a normal linear model of systolic blood pressure (SBP). SBP was predicted from individual variants (model 1), unweighted genetic scores (model 2), or weighted genetic scores (model 3). Genetic variants had simulated non-null (rv1, rv2, cv1, cv2) or null effects (rv3, rv4, rv5, cv3, cv4, cv5). Analysis was performed in 1,000 datasets with 10,000 individuals with genetic effects (simulated data Ib).

A.4 Table of average coefficients of a reduced Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions. Analysis was performed in 1,000 samples with 10,000 individuals with genetic effects (simulated data Ib). Individual variant effects (model 1), average variant effects (model 2), or joint variant effects (model 3) of grouped variants on systolic blood pressure were tested.

A.5 Table of average coefficients of a full Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

A.6 128 genes with ranges of simulated genetic effects in simulated data II (GAW19) (sorted by genomic position)

A.7 Functional variants with genome-wide significant associations with simulated blood pressure using frequentist normal linear models including a single variant and reduced Bayesian hierarchical generalized linear models (normal) including multiple variants using weakly informative prior Cauchy (0, 1) distributions in simulated data II (GAW19). No genome-wide significant associations with hypertension were observed.


A.8 Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.9 Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.10 Associations of top ten genes with hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.11 Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.12 Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.13 Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the joint effects of groups of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

A.14 Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

List of figures

6.1 Distribution of estimates of group effects from full Bayesian hierarchical models predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

6.2 Distribution of posterior probabilities of group effects from full Bayesian hierarchical models predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

6.3 Comparison of simulated and estimated individual variant effects on diastolic blood pressure (DBP) and systolic blood pressure (SBP) using frequentist normal linear models in simulated data II (GAW19) (black=significant, grey=not significant). Three outliers were removed from the DBP plot.

6.4 Comparison of simulated and estimated (conditional) individual variant effects on diastolic blood pressure (DBP) and systolic blood pressure (SBP) using reduced Bayesian hierarchical generalized linear models (normal) with weakly informative prior Cauchy (0, 1) distributions in simulated data II (GAW19) (black=significant, grey=not significant).


6.5 Comparison of regression coefficients of standard frequentist normal linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (normal) including multiple variants (MVT, BHGLM) using weakly informative Cauchy (0, 1) prior distributions to predict diastolic blood pressure in empirical data (GAW19)

6.6 Comparison of regression coefficients of standard frequentist normal linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (normal) including multiple variants (MVT, BHGLM) using weakly informative Cauchy (0, 1) prior distributions to predict systolic blood pressure in empirical data (GAW19)

6.7 Comparison of regression coefficients of standard frequentist linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (binomial) including multiple variants (MVT, BHGLM) using weakly informative prior distributions to predict hypertension in empirical data (GAW19)

A.1 The normal distribution with varying values for location and scale

A.2 The t distribution with varying parameter values for location (vertical) and scale (horizontal)

A.3 The Cauchy distribution with varying parameter values for location and scale

A.4 The χ² distribution with varying degrees of freedom

A.5 The inverse χ² distribution with varying degrees of freedom

A.6 The gamma (Γ) distribution with varying parameter values for shape (vertical) and rate (horizontal)

A.7 Diagnostic plots for diastolic and systolic blood pressure in empirical GAW19 data


A.8 Individuals (N=1,943) plotted according to the first two ancestry principal components (APCs) based on 465,887 single-nucleotide variants

A.9 Associations between 262,929 genetic variants (MAC ≥ 2, valid p value) and diastolic blood pressure using a frequentist normal linear model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷).

A.10 Associations between 259,086 genetic variants (MAC ≥ 2, valid p value) and systolic blood pressure using a frequentist normal linear model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷).

A.11 Associations between 262,931 genetic variants (MAC ≥ 2, valid p value) and hypertension using a frequentist Firth logistic model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷).

A.12 Comparison of genetic associations with diastolic blood pressure based on p values of a frequentist normal linear model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical normal linear model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data

A.13 Comparison of genetic associations with systolic blood pressure based on p values of a frequentist normal linear model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical normal linear model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data


A.14 Comparison of genetic associations with hypertension based on p values of a frequentist Firth logistic model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical logistic model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data

1 Introduction

This chapter describes developments in human genetic research and statistics leading to the present work and outlines its objectives. Additionally, previous publications are mentioned and a chapter overview is provided.

1.1 Missing heritability, rare variants, and Bayesian statistics

Human genetic research has made great progress during the last decade. One area of progress has been the technological development of high-throughput methods, leading to the establishment of the field of genomics, which aims to comprehensively describe genetic variation in the genome [Joyce and Palsson 2006]. Another area has been the functional annotation of genetic variants including associated traits or diseases, resulting in about 1,800 genetic variant-disease associations at present [Welter et al. 2014]. So-called genome-wide association studies (GWAS) using microarrays and imputation of common bi-allelic variants (single-nucleotide variants, SNVs) contributed largely to these achievements. However, compared to the expectations at the beginning of this development, results were disappointing, eventually leading to questions about the “missing heritability” [Manolio et al. 2009].

Potential sources of this unexplained genetic variation have been suggested, such as additional common variants (with small effects), interactions between genetic variants, interactions between genetic variants and the environment, and finally rare variants (with large effects). The relative contributions of rare and common variants have elicited special interest because of the competing “common disease – common variant” and “common disease – rare variant” hypotheses [Gibson 2011].

Therefore, the present study addresses the particular problem of identifying rare variant effects on complex traits using Bayesian methods. These seem to be well suited to address some deficiencies of frequentist statistical approaches.

First, rare variants by definition provide little information to estimate their effects from empirical data. Although a solution might be to increase the sample size, so that rare variants are no longer rare in terms of their observed allele counts, the availability of bioinformatics databases suggests using biologically informed variant selection strategies.

Second, genetic effects are strongly confounded by highly correlated variants, which makes it difficult to identify independent, causal variants or genes. Therefore, the computation of variant effects which are adjusted for (conditional on) other correlated variants is recommended, not only to identify the causal variant but also to estimate its true effect size and direction (see also Simpson’s paradox¹, confounding, and non-collapsibility of odds ratios [Stringer et al. 2011]). Additionally, adjusting for known genetic effects can improve statistical power for the discovery of novel variants [Ma et al. 2010]. The independent effect of a variant is also of particular importance in the missing heritability discussion, as only independent effects of rare variants, which are not explained as “shadow effects” of common variants, support the “common disease – rare variant” model.

Third, genetic variants do not only have effects on the level of individual variants, but may also have cumulative effects on the level of grouped variants. Hierarchical models, which can easily be implemented in the Bayesian framework, can jointly estimate effects on both levels, which can help to overcome the problem of sparse data on rare variants and give important information on the genetic architecture of a region [Neale and Sham 2004].

Fourth, the estimation of complex models, which include the conditional effects of thousands of correlated variants in a single model, poses severe difficulties, e.g. singularity, collinearity, or separation, to standard frequentist regression models, which makes such models practically intractable in a frequentist framework [Yi, Liu, Zhi & Li 2011].

Therefore, the present study has the following objectives: In general, the Bayesian hierarchical generalized linear model (BHGLM) as published by Yi and Zhi (2011) will be evaluated for its potential to identify individual and grouped effects of rare variants on blood pressure (including diastolic blood pressure, systolic blood pressure, and hypertension) as complex traits. For this purpose, empirical data released as part of the Genetic Analysis Workshop 19 (GAW19)² will be used. The GAW19 data comprise 1,943 unrelated individuals of Mexican-American ancestry with exome-sequenced genotypes and diastolic and systolic blood pressure and hypertension as phenotypes. In particular, the following specific objectives will be pursued:

1. Evaluate proportions of false positive and true positive associations of Bayesian hierarchical models in fully simulated, simple data over a large number of samples (simulated data I).
2. Evaluate proportions of false positive and true positive associations of Bayesian hierarchical models in partially simulated, complex data over a large number of real-world exome-sequenced genetic variants in a single sample (simulated data II, GAW19).
3. Identify individual and grouped effects of rare variants on blood pressure using Bayesian hierarchical models in fully empirical data including all available exome-sequenced real genetic variants and three real phenotypes, namely diastolic blood pressure, systolic blood pressure, and hypertension (empirical data, GAW19).
4. Compare individual variant effects from Bayesian hierarchical models of multiple variants with individual variant effects from frequentist models of individual variants in empirical data (GAW19).

Considering the variety of statistical approaches to model rare genetic variation [Panoutsopoulou et al. 2013], this study will help to evaluate the value of Bayesian hierarchical models for genetic association analysis of rare variants. Additionally, the results might give novel insights into the genetic architecture of blood pressure in Mexican-Americans, an ethnic group which has rarely been studied so far.

¹ Simpson’s paradox describes the seemingly paradoxical finding that disease risk in one group (e.g. men) can be consistently higher than in another group (e.g. women) in a number of samples, while in the overall (pooled) sample disease risk in the first group (e.g. men) might be lower than in the second group (e.g. women) [Julious and Mullee 1994]. Examples of Simpson’s paradox in the field of genetic epidemiology were published by Trynka et al. (2011) on celiac disease and by UK Parkinson’s Disease Consortium et al. (2011) on Parkinson’s disease.

² http://www.gaworkshop.org
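The reversal described in the footnote on Simpson’s paradox can be made concrete with a small numeric sketch. The counts below are hypothetical and were invented purely for illustration; they are not taken from any of the cited studies. Within each of two samples the risk is higher in men than in women, yet naive pooling reverses the ordering because the groups are unevenly distributed over samples with different baseline risks.

```python
from fractions import Fraction

# Hypothetical (cases, total) counts per sample and group; illustration only.
samples = {
    "sample A": {"men": (30, 100), "women": (250, 1000)},
    "sample B": {"men": (100, 1000), "women": (5, 100)},
}

def risk(cases, total):
    """Disease risk as an exact fraction."""
    return Fraction(cases, total)

def pooled_risk(group):
    """Risk of one group after naively pooling all samples."""
    cases = sum(s[group][0] for s in samples.values())
    total = sum(s[group][1] for s in samples.values())
    return Fraction(cases, total)

# Within every sample, men have the higher risk ...
higher_in_men = {
    name: risk(*s["men"]) > risk(*s["women"]) for name, s in samples.items()
}
# ... but the ordering reverses once the samples are pooled.
reversed_when_pooled = pooled_risk("men") < pooled_risk("women")

for name, s in samples.items():
    print(name, float(risk(*s["men"])), float(risk(*s["women"])))
print("pooled", float(pooled_risk("men")), float(pooled_risk("women")))
```

An analysis that conditions on the sample (the stratifying variable) recovers the within-sample ordering, which mirrors the argument for conditional variant effects made above.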


1.2 Previous publications

First results of this work were presented as an abstract, poster, and manuscript with the title “A joint model for rare and common variant effects on blood pressure using a Bayesian hierarchical generalized linear model” at the Genetic Analysis Workshop 19 (GAW19) from August 24 to 27, 2014, in Vienna [Igl and Thompson 2014]. In that work, a candidate gene association study was performed in 128 candidate genes with 6,578 variants with simulated effects on systolic blood pressure (time point 1) in 1,943 individuals. The BHGLMs identified a total of four independent functional variants in the TNN, LPR, and MAP4 genes and were more sensitive than a standard frequentist linear model including multiple variants, which identified no variants.

1.3 Overview

Chapter 1 gives an introduction to the present study. Chapter 2 explains the biological and medical background related to health aspects of blood pressure, elaborating definitions and epidemiology of blood pressure and hypertension. Additionally, the competing rare variant and common variant models in genetic epidemiology are addressed. Chapter 3 outlines general concepts and specific aspects of Bayesian statistical theory which are relevant within the scope of this work, including references to frequentist statistical theory. Chapter 4 focuses on Bayesian hierarchical generalized linear models (BHGLMs) as proposed by Yi, Liu, Zhi & Li (2011). Chapter 5 describes the data, especially the GAW19 data, and the methods, in particular the Bayesian hierarchical model, applied in this study. Chapter 6 presents the results. After an overview of the descriptives of the datasets used (Chapter 6.1), results are reported separately for (simple) simulated data I (Chapter 6.2), (complex) simulated data II (GAW19 data, Chapter 6.3), and empirical data (GAW19 data, Chapter 6.4). Chapter 7 summarizes and discusses the results related to the objectives of this study and provides a biological interpretation of the findings. Finally, conclusions are drawn regarding the application of BHGLMs in genetic association analysis of rare genetic variation and complex traits.

2 Biological and medical background

This chapter describes relevant biological and medical background regarding blood pressure and human health. The epidemiology of hypertension and associated risk factors, especially in Mexican-Americans, is described. Additionally, genetic models based on rare or common variants are contrasted and current knowledge of genetic factors affecting blood pressure is summarized.

2.1 Definitions

Blood pressure is the physical pressure of the blood in the (human) vascular system. It is measured by two values which refer to specific phases of heart action. Systolic blood pressure (SBP) is the pressure of the blood when the heart pumps the blood from the left ventricle (heart chamber) into the aorta (a large blood vessel). Diastolic blood pressure (DBP) is the pressure of the blood when both ventricles of the heart relax between contractions. Systolic blood pressure is higher than diastolic blood pressure. Blood pressure is measured in millimetres of mercury (mmHg). For example, a healthy young adult typically has a systolic blood pressure of around 120 mmHg and a diastolic blood pressure of around 80 mmHg [Marcovitch 2010].

Hypertension is the medical term for high blood pressure relative to normal blood pressure in the population. Hypertension is not itself a disease, because a person with hypertension might not experience any symptoms. However, hypertension is a risk factor for more severe medical conditions such as stroke or myocardial infarction and is, therefore, a condition for which preventive treatment is recommended [Marcovitch 2010]. Zhao et al. (2013) emphasized the role of elevated blood pressure as the most important modifiable risk factor for cardiovascular disease and all-cause mortality, responsible for 13.5% of all deaths globally in 2001.

According to the World Health Organization (WHO), hypertension is defined as a systolic blood pressure consistently greater than 160 mmHg and a diastolic blood pressure consistently greater than 95 mmHg. However, blood pressure also increases with age. Therefore, there is no general consensus on the diagnostic criteria of hypertension and the indications for treatment [Marcovitch 2010]. For example, in the United Kingdom, hypertension is diagnosed in individuals from the general population with blood pressure greater than 140 mmHg (systolic) and 90 mmHg (diastolic) [National Institute for Health and Clinical Excellence 2006].

Depending on its causes, the condition is called primary hypertension, if no specific causes can be identified, or secondary hypertension, if the symptoms have identifiable causes such as lifestyle factors (e.g. tobacco smoking, obesity), diseases of the kidney, or others [Marcovitch 2010]. Although optimal health is associated with a balanced diastolic and systolic blood pressure, health problems are usually associated with increased blood pressure, i.e. hypertension. Low blood pressure, i.e. hypotension, is of little clinical relevance [Marcovitch 2010].
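The practical consequence of the two threshold definitions quoted in this section can be illustrated with a minimal sketch. It assumes a single pair of readings per person (the text speaks of consistently elevated values, which would require repeated measurements) and combines the systolic and diastolic criteria with a logical "and", following the wording above; the function and variable names are hypothetical.

```python
from typing import NamedTuple

class Thresholds(NamedTuple):
    systolic: float   # mmHg
    diastolic: float  # mmHg

# Threshold values as quoted in the text above.
WHO = Thresholds(systolic=160, diastolic=95)
UK = Thresholds(systolic=140, diastolic=90)

def is_hypertensive(sbp, dbp, thresholds):
    """Return True if both readings exceed the given thresholds.

    Assumption: the two criteria are joined by 'and', as worded in the text;
    a single reading stands in for 'consistently' elevated pressure.
    """
    return sbp > thresholds.systolic and dbp > thresholds.diastolic

# Hypothetical readings (SBP, DBP) in mmHg.
readings = [(120, 80), (150, 92), (170, 100)]
for sbp, dbp in readings:
    print(sbp, dbp, is_hypertensive(sbp, dbp, WHO), is_hypertensive(sbp, dbp, UK))
```

A reading of 150/92 mmHg is classified as hypertensive under the UK thresholds but not under the WHO definition quoted here, which shows how strongly prevalence estimates depend on the chosen criteria.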

2.2 General epidemiology

In the USA, cardiovascular diseases (CVDs) are leading causes of mortality among Hispanic and Latino individuals. Hispanics and Latinos of mixed origins had a prevalence of hypertension of 25.4% (95% confidence interval (CI95%) = [24.1%, 26.7%]) in men (N=5,979) and 23.5% (CI95% = [22.4%, 24.5%]) in women (N=9,100). 15% of men (CI95% = [13.8%, 15.9%]) and women (CI95% = [14.1%, 15.9%]) used anti-hypertensive treatment. Coronary heart disease was reported by 4.2% of men and 2.4% of women, and stroke by 2.2% of men and 1.2% of women. Hypertension was associated with coronary heart disease (odds ratios between 1.5 and 2.2) and stroke (odds ratios between 1.7 and 2.6) in both sexes [Daviglus et al. 2012]. Mitchell et al. (1996a) showed that a substantial proportion of the variance of systolic (17.8%) and diastolic (28.3%) blood pressure can be explained by genetic factors in a Mexican-American sample.


2.3 Genetic epidemiology

2.3.1 Rare and common genetic variation

Genetic variants are categorised into rare and common variants according to the minor allele frequency (MAF). Variants with MAF < 1% are typically considered rare, and variants with MAF ≥ 1% common [Gibson 2011]. These categories are arbitrary in the sense that the frequency of variants is a continuum and also depends on the examined population. The distinction between rare and common variants has gained importance since genome-wide association studies (GWAS) based on common variants could only explain a small fraction of the heritability of complex diseases, which has been discussed as the “missing heritability problem” in the literature [Manolio et al. 2009]. For example, the currently largest GWAS on complex traits, such as body height, included up to about 263,000 individuals and resulted in 184 loci based on common variants, which explained 7.35% of the phenotypic variance [Berndt et al. 2013]. These findings fuelled discussions on the competing “common disease - common variant” (CDCV, “common allele model”) and “common disease - rare variant” (CDRV, “rare allele model”) hypotheses. While the CDCV hypothesis says that common, complex diseases are largely explained by a moderate number of common genetic variants with small effects, the CDRV hypothesis states that a small number (per individual) of rare genetic variants with large effects are responsible. However, the finding of missing heritability does not exclude either the CDRV model or the CDCV model a priori [Gibson 2011; Iyengar and Elston 2007; Schork et al. 2009].

2.3.2 Genetics of blood pressure

Although blood pressure is affected by many strong environmental factors, e.g. physical or psychological stress, more than 40 genetic loci with genome-wide significance (p < 5 × 10^-8) have been found in genome-wide association studies so far. Still, the total variability of blood pressure explained by all known loci remains very modest at around 3% [Wain 2014; Zhao et al. 2013], while the total heritability of blood pressure is estimated between 30% and 50% [Munroe et al. 2013]. A recent review on the genomics of blood pressure was published by Munroe et al. (2013). The GWAS catalog¹ currently includes 20 studies on blood pressure in European, Asian or African populations which were published between June 2007 and September 2013. In total, 100 index single-nucleotide variants (p < 1 × 10^-5) are reported, which map to 83 genes or gene regions. The top ten genes, sorted by p value of the index variant, are given in Table 2.1:

Table 2.1: Top ten genes or gene regions associated with blood pressure, hypertension, or hypotension according to the GWAS catalog (Welter et al. (2014), retrieved December 8, 2014). Genes are sorted by p value of the index variant.

Gene           Index variant   Chr   Position     P value       First author (year)
HECTD4         rs11066280      12    112379979    8 × 10^-31    Kato (2011)
ATXN2          rs653178        12    111569952    7 × 10^-20    Wain (2011)
C10orf107      rs4590817       10    61707795     2 × 10^-18    Wain (2011)
MTHFR          rs17367504      1     11802721     2 × 10^-16    Wain (2011)
CACNB2         rs12258967      10    18439030     2 × 10^-16    Wain (2011)
CSK            rs1378942       15    74785026     2 × 10^-15    Wain (2011)
GOSR2          rs17608766      17    46935905     6 × 10^-15    Wain (2011)
ATP2B1         rs17249754      12    89666809     7 × 10^-15    Kelly (2013)
PRDM8 - FGF5   rs1458038       4     80243569     3 × 10^-14    Wain (2011)
ZNF831         rs6015450       20    59176062     4 × 10^-14    Ehret (2011)

Chr = chromosome. If the index variant is located in a gene, the gene is mapped, otherwise the upstream and downstream genes are listed. Sources: Welter et al. (2014), Kato et al. (2011), Wain et al. (2011), Kelly et al. (2013), Ehret and the International Consortium for Blood Pressure Genome-Wide Association Studies (2011)

Genome-wide association studies have been based on common variation so far, although effects of rare variants on monogenic diseases with a phenotype of high blood pressure have been found in family studies (see Munroe et al. (2013) for an overview of monogenic disease loci). For example, Gordon syndrome is associated with rare variants in the WNK1, WNK4, KLHL3, and CUL3 genes. Gitelman syndrome is caused by rare mutations in the SLC12A3 (NCCT) gene, and Bartter syndrome by rare mutations in the SLC12A1 (NKCC2) and KCNJ1 (ROMK) genes. Genes associated with monogenic disease often lie in pathways managing sodium homeostasis in the kidney [Munroe et al. 2013]. Although GWAS of rare variation have not contributed substantially to existing knowledge [Wain 2014], Ji et al. (2008) performed a large-scale study on effects of rare variants in the SLC12A3, SLC12A1, and KCNJ1 genes on blood pressure, which pointed to rare mutations affecting pathways of renal salt handling. Additionally, first results on blood pressure using an exome-chip design have been presented at conferences. Liu et al. (2014) report novel rare variant associations in the NPR1 and DBH genes, which are related to pathways involving natriuretic peptides which regulate blood volume and blood pressure.

¹ The GWAS catalog was queried on December 8, 2014, using the search terms “blood pressure”, “hypertension”, and “hypotension”, which included the phenotypes Pulse Pressure (PP), Mean Arterial Pressure (MAP), Diastolic (DBP, DBPLTA (LTA = long-term average), Delta DBP), Systolic (SBP, SBPLTA (LTA = long-term average), Delta SBP), and abbreviations.

3 Statistical background

This chapter explains relevant concepts of Bayesian statistics and relates them to frequentist statistics. The concept of Bayesian p values will be presented in more depth because of its application in Bayesian hierarchical generalized linear models by Yi, Liu, Zhi & Li (2011).

3.1 Bayesian and frequentist statistics

Bayesian and frequentist theory¹ are considered two distinct theories, philosophies, or schools of thought, which cannot easily be described by concepts of the other [Wagenmakers et al. 2008]. Discussions on the superiority of one of these approaches are based on philosophical, theoretical, and practical arguments [Gill 2008, p. 62]. However, frequentist and Bayesian theory both aim to provide methods to describe empirical data and to reduce the complexity of the data to simplified models which contain a systematic, deterministic component and an unsystematic, random component (error) using probability statements. Moreover, a frequentist (likelihood) model is equivalent to a Bayesian model with an appropriately bounded uniform prior distribution. Asymptotically, frequentist models are equivalent to a Bayesian model for any given prior. Therefore, the stated differences between Bayesian and frequentist theory are considered an “artificial divide” [Gill 2008, pp. 62].

¹ Here, strict frequentist theory proposed by Neyman and Pearson and likelihood-based theory by Fisher are both categorised as frequentist (see Gill (2008, pp. 62)).


3.2 Bayes theorem

Bayes’ theorem is the foundation of Bayesian theory and its applications. It states that the probability of an unobserved event A given an observed event B (posterior probability) can be calculated from the conditional probability of event B given event A (likelihood), and the probabilities of event A (prior) and event B (evidence) [Peacock and Peacock 2011, p. 478]:

\[
P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \qquad (3.1)
\]
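As a small numerical illustration of equation 3.1, the following Python sketch computes a posterior probability from a prior, a likelihood, and the evidence. All numbers are hypothetical and chosen only for illustration (they are not taken from the data of this thesis); the evidence P(B) is obtained from the law of total probability.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Bayes' theorem (equation 3.1): P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical illustration values (assumptions, not thesis data):
p_a = 0.25              # prior P(A), e.g. an individual has hypertension
p_b_given_a = 0.90      # likelihood P(B|A), e.g. a positive screening result given A
p_b_given_not_a = 0.20  # P(B|not A), a positive screening result without A

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(B) = {p_b:.3f}, P(A|B) = {p_a_given_b:.3f}")
```

In this hypothetical setting, observing B raises the probability of A from the prior value of 0.25 to a posterior value of 0.60.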

The same idea can also be formalised for multiple events, which requires a more complicated mathematical language that describes not only the probability of a single event, but the probability of multiple discrete values or a probability density of a range of continuous values of a parameter. The following equation 3.2 and its transformations describe the posterior probability P(θ|y) for a parameter θ ∈ (−∞, +∞) after observing data y [Gelman 2014, p. 7]:

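\[
\begin{aligned}
P(\theta \mid y) &= \frac{p(\theta, y)}{p(y)} && (3.2) \\
&= \frac{p(y \mid \theta) \times p(\theta)}{p(y)} && (3.3) \\
&= \frac{p(y \mid \theta) \times p(\theta)}{\int p(y \mid \theta) \times p(\theta)\, d\theta} && (3.4) \\
&\propto p(y \mid \theta) \times p(\theta) && (3.5)
\end{aligned}
\]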

The primary objective of the application of Bayesian theory is to derive a posterior predictive distribution, which is then used to make statistical inferences [Gelman 2014, pp. 6]. The resulting probability distribution is often called a predictive distribution, because it is used to make probabilistic inferences (“predictions”) about an unknown, but observable parameter θ and about unobserved values ỹ based on observed values y. p(θ) is the prior distribution of the parameter θ before the observation of data y. p(y|θ) is the sampling distribution (or data distribution) given parameter θ. p(y) is called the prior predictive distribution or marginal distribution of y, because the probability of y is integrated over the range of values of parameter θ and no longer depends on it. p(y) can also be interpreted as a weighted average likelihood with weights provided by the prior distribution p(θ). The prior predictive distribution is used as a scaling constant to derive a proper probability distribution, but can also be omitted, which leads to an unnormalised posterior probability distribution, which does not necessarily integrate to unity. The prior predictive distribution can be omitted if the objective of an analysis is parameter estimation, but is of critical importance if the objective is Bayesian hypothesis testing (see Bayesian p values below) [Gelman (2014, p. 7), Wagenmakers et al. (2008, p. 190)].
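The role of p(y) as a scaling constant can be made concrete with a grid approximation. The sketch below is a minimal illustration under stated assumptions: binomial data (7 successes in 10 trials, hypothetical), a uniform prior on [0, 1], and numerical integration by the trapezoidal rule. It computes the unnormalised posterior of equation 3.5, the prior predictive probability p(y) of equation 3.4, and the normalised posterior.

```python
import math


def binomial_likelihood(theta, successes, trials):
    """Sampling distribution p(y | theta) for binomial data."""
    return (math.comb(trials, successes)
            * theta**successes * (1.0 - theta) ** (trials - successes))


def trapezoid(values, step):
    """Integrate equally spaced function values with the trapezoidal rule."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical data (assumption for illustration only).
successes, trials = 7, 10

# Grid over the parameter range [0, 1] and a uniform prior density p(theta) = 1.
n_points = 1001
step = 1.0 / (n_points - 1)
grid = [i * step for i in range(n_points)]
prior = [1.0] * n_points

# Unnormalised posterior (equation 3.5): p(y | theta) * p(theta).
unnormalised = [binomial_likelihood(t, successes, trials) * p
                for t, p in zip(grid, prior)]

# Prior predictive probability p(y), the denominator in equation 3.4.
evidence = trapezoid(unnormalised, step)

# Normalised posterior density, which integrates to one.
posterior = [u / evidence for u in unnormalised]
posterior_area = trapezoid(posterior, step)
posterior_mean = trapezoid([t * d for t, d in zip(grid, posterior)], step)

print(f"p(y) = {evidence:.4f}, area = {posterior_area:.4f}, "
      f"posterior mean = {posterior_mean:.4f}")
```

For parameter estimation (e.g. the posterior mean), only the shape of the unnormalised posterior matters, whereas p(y) itself is the quantity that enters the Bayes factor discussed in the next section.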

3.3 Bayes factors

The Bayes Factor (BF) is a measure which compares the posterior distributions of a parameter θ between two competing models, M1 and M2, given the observed data y (see equations 3.6, 3.7, and 3.8).

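\[
\underbrace{\frac{P(M_2 \mid y)}{P(M_1 \mid y)}}_{\text{Posterior Odds}}
= \underbrace{\frac{P(y \mid M_2)}{P(y \mid M_1)}}_{\text{Bayes Factor}}
\times \underbrace{\frac{P(M_2)}{P(M_1)}}_{\text{Prior Odds}} \qquad (3.6)
\]

\[
\begin{aligned}
BF(M_2, M_1) &= \frac{P(y \mid M_2)}{P(y \mid M_1)} && (3.7) \\
&= \frac{\int P(\theta_2 \mid M_2) \times P(y \mid \theta_2, M_2)\, d\theta_2}{\int P(\theta_1 \mid M_1) \times P(y \mid \theta_1, M_1)\, d\theta_1} && (3.8)
\end{aligned}
\]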

The Bayes factor can also be used to quantify the changes between the prior and the posterior distribution as a result of the observed data (“weight of evidence”). The model with the higher probability (BF1 > BF2) is to be preferred [Wagenmakers et al. 2008, p. 190]. Therefore, the Bayes factor can be seen as a measure for model selection similar to a frequentist p value.


The models under comparison can use different prior distributions of parameters, different sets of parameters, and different structural relationships between the parameters. Therefore, the probability distributions in the equations above are conditional on the respective model to indicate such potential differences [Kruschke 2011, pp. 56]. If the compared models each represent only a single value of a parameter, θ1 and θ2, respectively, the Bayes factor simplifies to a classical likelihood ratio, assuming that both models have equal prior probability [Gelman 2014, pp. 182-184]. As a guideline for the interpretation of Bayes factors, Jeffreys (1961, p. 432) and Kass and Raftery (1995) have proposed scales, of which only the latter is cited in Table 3.1:

Table 3.1: Scale of interpretation of the Bayes factor according to Kass and Raftery (1995) (adapted from Wagenmakers et al. (2008))

2 × ln(BF)   Bayes factor   p(H1|y)      Evidence against H0
0 to 2       1 to 3         0.50-0.75    weak
2 to 6       3 to 20        0.75-0.95    positive
6 to 10      20 to 150      0.95-0.99    strong
> 10         > 150          > 0.99       very strong
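The use of Table 3.1 can be illustrated with a simple sketch. It assumes hypothetical binomial data (16 successes in 20 trials) and compares a point model M1 (θ = 0.5) with a model M2 that places a uniform prior on θ. For this special case, the marginal likelihood of M2 in equation 3.8 has the closed form 1/(n + 1). The function names are illustrative and not part of any published software.

```python
import math


def bayes_factor_uniform_vs_point(successes, trials, theta_null=0.5):
    """BF(M2, M1) for binomial data: M2 uses a Uniform(0, 1) prior on theta,
    M1 fixes theta at theta_null."""
    # Marginal likelihood under M1: binomial probability at the fixed value.
    marginal_m1 = (math.comb(trials, successes)
                   * theta_null**successes
                   * (1.0 - theta_null) ** (trials - successes))
    # Marginal likelihood under M2: binomial likelihood integrated over the
    # uniform prior, which equals 1 / (trials + 1).
    marginal_m2 = 1.0 / (trials + 1)
    return marginal_m2 / marginal_m1


def kass_raftery_category(bayes_factor):
    """Map a Bayes factor to the evidence categories of Table 3.1."""
    two_ln_bf = 2.0 * math.log(bayes_factor)
    if two_ln_bf < 0:
        return "favours the other model"
    if two_ln_bf <= 2:
        return "weak"
    if two_ln_bf <= 6:
        return "positive"
    if two_ln_bf <= 10:
        return "strong"
    return "very strong"


# Hypothetical data (assumption for illustration only).
bf = bayes_factor_uniform_vs_point(successes=16, trials=20)
two_ln_bf = 2.0 * math.log(bf)
# Posterior probability of M2, assuming equal prior model probabilities.
posterior_m2 = bf / (1.0 + bf)
category = kass_raftery_category(bf)

print(f"BF = {bf:.2f}, 2 ln(BF) = {two_ln_bf:.2f}, "
      f"P(M2|y) = {posterior_m2:.2f}, evidence: {category}")
```

With equal prior model probabilities, the posterior probability of M2 is BF/(1 + BF), which is how the third column of Table 3.1 relates to the second.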

The transformation 2 × ln(BF) has the advantage of being on the same scale as the deviance and likelihood ratio test statistics [Kass and Raftery 1995]. A similar scale for interpreting likelihood ratios was published by Evett et al. (2000). Wagenmakers et al. (2008) point out that the scale for the interpretation of Bayes factors by Kass and Raftery (1995) should only be applied to a comparison of two models. For example, if Bayes factors are used to find the best set of 10 predictors, over 1,000 (exactly 2^10 = 1,024) models of equal prior probability are compared. In this case, a change from a prior probability of a single model of 0.001 to a posterior probability of 0.50 would intuitively be regarded as “very strong” (instead of “weak”) evidence.

In genetics, (approximate) Bayes factors are also used to extract a minimal set of genetic variants which explain a genetic signal in a defined genetic region (typically ± 500 kb), similar to the selection of the model with the best predictors above. Stephens and Balding (2009) give a short summary of how p values from frequentist analysis can be used to calculate Bayes factors and posterior probabilities and to derive credible sets of genetic variants:

First, an assumption on the probability π of a non-null association of a genetic variant in the genome with a complex trait must be made, for which Stephens and Balding (2009) suggest values for π between 10^-4 and 10^-6.

Second, a Bayes factor must be calculated from the frequentist p value indicating the amount of evidence from the empirical data. Under general assumptions, for p values < 1/e = 1/2.72 = 0.37 the following equation 3.9 holds [Sellke et al. 2001]:

\[
BF = \frac{-1}{e \times p \times \ln p} \qquad (3.9)
\]

Third, having the prior probability π and the Bayes factor BF , equations 3.10 and 3.11 can be used to calculate the posterior odds PO and posterior probability PP , respectively:

\[
PO = BF \times \frac{\pi}{1 - \pi} \qquad (3.10)
\]

\[
PP = \frac{PO}{1 + PO} \qquad (3.11)
\]
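The first three steps can be sketched in a few lines of Python. The p value and the prior probability π used below are hypothetical illustration values (π is taken from the range suggested by Stephens and Balding (2009)). The Bayes factor is written with a minus sign in the numerator so that it is positive, because ln p is negative for p < 1/e; it should be read as the upper bound on the evidence against the null model given by Sellke et al. (2001).

```python
import math


def bayes_factor_from_p(p_value):
    """Bayes factor from a frequentist p value (equation 3.9), for 0 < p < 1/e."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the approximation requires 0 < p < 1/e")
    # ln(p) is negative in this range, so the minus sign gives a positive value.
    return -1.0 / (math.e * p_value * math.log(p_value))


def posterior_probability_from_bf(bayes_factor, prior_probability):
    """Posterior probability of a non-null association (equations 3.10 and 3.11)."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds      # equation 3.10
    return posterior_odds / (1.0 + posterior_odds)  # equation 3.11


# Hypothetical inputs (assumptions for illustration only).
p_value = 5e-8   # frequentist p value of a single variant
prior_pi = 1e-4  # prior probability of a non-null association

bf = bayes_factor_from_p(p_value)
pp = posterior_probability_from_bf(bf, prior_pi)
print(f"BF = {bf:.0f}, PP = {pp:.3f}")
```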

The posterior probability (PP) can be interpreted as the probability of a variant being causal (among all examined variants), which no longer depends on sample size, statistical power, and number of variants. Assuming that fine-mapping a genetic region corresponds to model selection of a set of predictors in a regression model, PP is the sum of the posterior probabilities over all models which contain this variant.

Fourth, posterior probabilities are sorted in decreasing order and cumulated until a certain posterior probability threshold, e.g. 95%, is reached. The resulting (minimal) 95% credible set contains the causal variant with 95% probability (assuming the variant was genotyped) [Stephens and Balding 2009; Wellcome Trust Case Control Consortium et al. 2012].

If variants are highly correlated (i.e., in strong linkage disequilibrium) and strongly associated with the phenotype, then each variant might have only low posterior probability in a regression model including all variants, but the cluster of variants (haplotype region) might have high posterior probability. Identifying the causal variant in this situation of strong confounding is also challenging for Bayesian models and would require the exploration of the full posterior distribution. However, Gelman (2014, p. 182) does not recommend the application of Bayes factors for model selection because “in practice, the marginal likelihood is highly sensitive to aspects of the model that are typically assigned arbitrarily and are untestable from data.”
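The construction of a credible set in the fourth step above can be sketched as follows. The variant names and posterior probabilities are hypothetical, and the sketch assumes that the posterior probabilities of the variants in the region have already been computed and sum to one.

```python
def credible_set(posterior_probabilities, threshold=0.95):
    """Return the minimal set of variants whose cumulated posterior
    probability reaches the threshold, together with the cumulated value."""
    ranked = sorted(posterior_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for variant, probability in ranked:
        selected.append(variant)
        cumulative += probability
        if cumulative >= threshold:
            break
    return selected, cumulative


# Hypothetical posterior probabilities for five variants in one region.
region = {
    "variant_a": 0.60,
    "variant_b": 0.25,
    "variant_c": 0.08,
    "variant_d": 0.04,
    "variant_e": 0.03,
}

variants, cumulated = credible_set(region, threshold=0.95)
print(variants, round(cumulated, 2))
```

In this hypothetical region, four of the five variants are needed to reach the 95% threshold, which illustrates how correlated variants with individually low posterior probabilities enlarge the credible set.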

3.4 Bayesian p values

To obtain consistent conclusions for statistical problems using frequentist and Bayesian theory and to develop a measure for the goodness-of-fit for single value models, the concept of Bayesian p values¹ has been developed. A review of the discussion about Bayesian p values can be found in Ghosh and Delampady (2014). Bayarri and Berger (2004) discussed the meaning of additional frequentist concepts in Bayesian theory.

3.4.1 Frequentist p values

P values were made popular by Fisher (1925) as a measure of statistical significance of data against a (null) model. A p value is the probability of observing data, which is summarised in a statistic, as or more extreme than in the observed sample in a large number of repeated samples from the same population under the null hypothesis [Peacock & Peacock 2011, pp. 248]. The related concept of a confidence interval describes a range of values which contains the true value of a parameter with specified probability (usually 100% - 5% = 95%) in a large number of repeated samples from the same population [Peacock & Peacock 2011, p. 243]. Currently, p values and confidence intervals are key concepts in frequentist theory and widely applied in biomedical research.

¹ Also: posterior predictive p values


Although the concept of p values is seen critically in the community of Bayesian statisticians, theoretical and empirical arguments suggest that this concept can be translated to Bayesian statistics. As an example, Firth’s bias-corrected logistic regression [Firth 1993] may be mentioned. This frequentist statistical model has been proposed as a solution to overcome the problem of complete separation when applying logistic regression to sparse data, which results in extremely inflated estimates and low power. For this purpose, Firth (1993) introduced a penalisation of the likelihood function, which is equivalent to using a Jeffreys prior distribution [Jeffreys 1946] in a Bayesian model.

3.4.2 The one-sided testing problem

A basic difference between frequentist and Bayesian theory lies in the way models are described. In frequentist theory, a parameter is assumed to have a single fixed true value. However, this assumption is not formalised in frequentist models. In Bayesian theory, a parameter is described as a distribution of multiple, random values, and, therefore, prior assumptions have to be formalised as prior distributions. This leads to the first problem of specifying a single-value (point) model, e.g. a null model, from a frequentist two-sided test problem (H0: θ = 0; H1: θ ≠ 0) in Bayesian theory.¹ However, to avoid such problems, the comparison of Bayesian and frequentist p values can be reduced to a one-sided test problem (e.g., H0: θ ≤ 0; H1: θ > 0). In this case, both models can be described as a distribution of parameters in Bayesian theory. Additionally, a statistic can be estimated in a Bayesian model which gives the posterior probability of the parameter being less than or equal to a null value, which can be regarded as an equivalent to the one-sided frequentist probability (p value) of observing values of a statistic equal to or more extreme than in the current sample in a large number of repeated samples under the null hypothesis (assuming a single fixed true parameter, e.g. θ0 = 0).

While both approaches use the same empirical data, the results of a Bayesian model are additionally influenced by the selection of a prior distribution. Ignoring any differences between the results caused by methodological differences, such as test statistics, estimation algorithms, or others, the most important influences seem to come from the choice of the prior distribution. Therefore, classes of prior distributions were examined in Bayesian theory regarding consistency between the Bayesian posterior probability (Bayesian p value) and the frequentist p value.

¹ The problem of modelling a two-sided test problem in Bayesian theory can also be seen as caused by collapsing two models (H1a: θ > 0; H1b: θ < 0) into one model with a bi-modal probability density distribution with a probability density of zero at the null.

3.4.3 Classes of prior distributions

Berger and Sellke (1987) studied the two-sided testing problem and evaluated classes of prior distributions which assigned equal probability mass (50%) to a point null model (H0: θ = 0) and the alternative model (H1: θ ≠ 0), which they considered “objective”. The classes comprised a) all distributions, b) symmetric distributions, c) unimodal symmetric distributions, and d) normal distributions. They found that the derived posterior probabilities are considerably (up to one order of magnitude) larger than the frequentist p values. Therefore, they concluded that p values can be highly misleading measures of the evidence provided by the data against the null model. However, Casella and Berger (1987) studied the one-sided testing problem and criticised the definition of objective prior distributions by Berger and Sellke (1987) as biasing the analysis towards the (point) null model. Similar to Berger and Sellke (1987), they compared frequentist p values with the posterior probability from four classes of prior distributions, which, however, gave equal probability mass of 50% to ranges of parameter values such as ]−∞, 0] (H0: θ ≤ 0) and ]0, +∞[ (H1: θ > 0):

1. all distributions

2. all symmetric distributions with mean M = 0

3. all symmetric, unimodal distributions with mean M = 0

4. all normal distributions N(0, σ²) with σ ∈ ]0, ∞[

The authors show that, under the additional assumption of a monotone likelihood ratio, the infimum (lower bound) of the posterior probability of the null model is equal to the frequentist p value for classes of distributions 3) and 4).¹

\[
\inf P(H_0 \mid x) = p(x) \qquad (3.12)
\]

Although the p value as an infimum of a Bayesian p value can be interpreted in the sense that frequentist models overstate the evidence against the null model (in comparison to Bayesian models), Casella and Berger (1987) also mention situations in which the infimum of the Bayesian posterior probability is lower or higher than the p value. For example, Casella and Berger (1987) show that the infimum for Bayesian p values based on Cauchy priors is strictly less than the p value, so that for certain Cauchy priors the posterior probability is equal to the frequentist p value (although Cauchy distributions do not have a monotone likelihood ratio). These results are of great relevance because distribution classes 3) and 4) seem to have the highest practical relevance as prior distributions in Bayesian analysis. Additionally, Cauchy distributions are often applied as weakly informative priors, because their “heavy tails” give substantial probability to extreme ranges of values. Therefore, Casella and Berger (1987) conclude that it is possible to interpret results from a Bayesian approach in the context of a frequentist approach (and vice versa).
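The result for the class of normal priors can be illustrated numerically in a simple special case. The sketch below assumes a single observation x from N(θ, 1), the one-sided problem H0: θ ≤ 0 against H1: θ > 0, and priors N(0, σ²) with increasing prior standard deviation σ. It uses the standard conjugate result that the posterior is normal with mean xσ²/(1 + σ²) and variance σ²/(1 + σ²). The observed value is hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_null(x, prior_sd):
    """P(theta <= 0 | x) for x ~ N(theta, 1) and prior theta ~ N(0, prior_sd^2)."""
    shrinkage = prior_sd**2 / (1.0 + prior_sd**2)  # weight of the data
    posterior_mean = shrinkage * x
    posterior_sd = math.sqrt(shrinkage)
    return normal_cdf((0.0 - posterior_mean) / posterior_sd)


x_observed = 2.0                            # hypothetical observation
p_one_sided = 1.0 - normal_cdf(x_observed)  # frequentist p value for H0: theta <= 0

prior_sds = [0.5, 1.0, 2.0, 10.0, 1000.0]
posterior_probs = [posterior_prob_null(x_observed, s) for s in prior_sds]

print(f"one-sided p value: {p_one_sided:.5f}")
for sd, prob in zip(prior_sds, posterior_probs):
    print(f"prior sd = {sd:7.1f}: P(H0 | x) = {prob:.5f}")
```

In this setting the posterior probability of the null model decreases towards the p value as the prior becomes more diffuse, but never falls below it, which is the behaviour described by equation 3.12.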

3.4.4 Non-informative and weakly informative priors

Uniform distributions² with equal density across the whole range of parameter values (which can be seen as a special case of class 3 above) will be further discussed here because of their high relevance for comparisons with frequentist p values. Advantages of this class of prior distributions are that they are easy to specify without prior knowledge and objective in the sense of giving maximum influence on the posterior distribution to the observed data [Gelman 2014, p. 52]. Additionally, posterior predictive probabilities (Bayesian p values) approximate frequentist p values and posterior predictive intervals (Bayesian credible intervals) are similar to frequentist confidence intervals (and vice versa) [Greenland and Poole 2013].

¹ The assumption of a monotone likelihood ratio seems violated, for example, in case of a two-sided test with the alternative model having a bi-modal probability density distribution with zero density at the null.
² Also: non-informative, objective, reference, flat, vague, diffuse, impartial priors

According to Greenland and Poole (2013), the frequentist estimate $\hat{\theta}_{H_0}$ and the two-sided p value $P_{H_0}$, which tests the null hypothesis H0: θ = 0, can be interpreted in Bayesian terms assuming a true value $\theta_t$, a uniform prior distribution with a prior median $\theta_m = 0$, and a posterior distribution of $\hat{\theta}$:

1. $\hat{\theta}_{H_0}$ is the median of the posterior distribution, i.e. $P(\theta_t < \hat{\theta}) = P(\theta_t > \hat{\theta}) = 50\%$.

2. $P_{H_0}$ is the posterior probability that $\hat{\theta}$ is closer to 0 than to $\theta_t$, i.e. $P(|\hat{\theta} - \theta_t| > |\hat{\theta} - 0|)$.

3. $P_{H_0}/2$ is the posterior probability $P(\hat{\theta} < 0)$ if $\theta_t > 0$ or $P(\hat{\theta} > 0)$ if $\theta_t < 0$, respectively.

4. $P_{H_0}/2$ approximates the smallest possible posterior probability $P(\hat{\theta} < 0)$ if $\theta_t > 0$ or $P(\hat{\theta} > 0)$ if $\theta_t < 0$, respectively.

5. The 95% confidence interval $[\underline{\theta}, \overline{\theta}]$ of θ is the 95% credible (posterior probability) interval of θ, i.e. $P(\underline{\theta} \le \theta_{true} \le \overline{\theta}) = 95\%$.
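Two of these correspondences can be checked numerically in the simplest case. The sketch below assumes a normally distributed estimate with known standard error and a flat prior, so that the posterior distribution of the parameter is normal with mean equal to the estimate and standard deviation equal to the standard error. The estimate and standard error are hypothetical values.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical frequentist results (assumptions for illustration only).
theta_hat = 1.2       # point estimate
standard_error = 0.5  # standard error of the estimate

# Frequentist two-sided p value for H0: theta = 0.
z_statistic = theta_hat / standard_error
p_two_sided = 2.0 * (1.0 - normal_cdf(abs(z_statistic)))

# Flat prior and normal likelihood: posterior is N(theta_hat, standard_error^2).
posterior_below_zero = normal_cdf((0.0 - theta_hat) / standard_error)

# Frequentist 95% confidence interval and its posterior probability content.
z_975 = 1.959963984540054  # 97.5% quantile of the standard normal distribution
ci_lower = theta_hat - z_975 * standard_error
ci_upper = theta_hat + z_975 * standard_error
posterior_mass_in_ci = (normal_cdf((ci_upper - theta_hat) / standard_error)
                        - normal_cdf((ci_lower - theta_hat) / standard_error))

print(f"p/2 = {p_two_sided / 2:.5f}, "
      f"P(theta < 0 | data) = {posterior_below_zero:.5f}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}], "
      f"posterior mass = {posterior_mass_in_ci:.4f}")
```

Under these assumptions, half of the two-sided p value coincides with the posterior probability of the parameter lying on the other side of zero, and the 95% confidence interval contains 95% of the posterior probability.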

However, uniform distributions also have several disadvantages. For example, uniform priors are improper¹ and can also lead to improper, unnormalised posterior distributions [Gelman 2014, p. 52]. Additionally, extreme ranges of parameter values might be evaluated by simulation-based algorithms, which can considerably increase computation time [Thompson 2014, p. 7]. Moreover, they might ignore potentially existing prior knowledge and may give sub-optimal results. Therefore, weakly informative priors were recommended as objective Bayesian priors, which are proper, but at the same time weak enough so that the above statements remain approximately correct. In case prior knowledge exists, weakly informative priors are specified so that they are informative but represent less than the available scientific knowledge to allow for unexpected findings [Gelman 2014, p. 52].

¹ A prior density p(θ) is proper if it does not depend on the data and integrates to 100% [Gelman 2014, p. 52].

3.4.5 Uniformity of p values

Finally, an argument which was used against the application of Bayesian p values is the property of frequentist p values of having a “well-behaved” uniform (0, 1) distribution under the null model. A uniform distribution of p values also implies that p values have a common interpretation across statistical problems. Therefore, uniformity of p values has also been recommended as a discriminatory tool to evaluate the adequacy of Bayesian p values [Bayarri and Berger 2004]. Unfortunately, posterior predictive p values do not in general have uniform distributions under the null hypothesis, but are more concentrated near 0.5 [Gelman 2013]. Therefore, posterior predictive p values are more conservative, which is, however, also the (desired) effect of the modelled prior on the posterior distribution. Hence, Gelman (2013) argues that a uniform distribution of p values is no precondition for correct statistical inferences and that posterior predictive p values are valid probabilities, which also follows from Bayes’ theorem (for normalised posterior distributions).

3.4.6 Summary

Bayesian posterior probabilities are regarded as conservative approximations of frequentist p values over a wide spectrum of classes of distributions (with locations at zero). Weakly informative priors, for example certain Cauchy distributions, seem to be adequate to derive posterior probabilities (Bayesian p values) and credible intervals of parameter values comparable to frequentist p values and confidence intervals. Disadvantages of uniform prior distributions can be dealt with using weakly informative priors. Deviations from uniformity of the distribution of posterior probabilities do not impair the interpretation of Bayesian p values as valid probabilities and the derived statistical inferences.

4 Bayesian hierarchical generalized linear models

This chapter describes the Bayesian hierarchical generalized linear model (BHGLM) as published by Yi, Liu, Zhi & Li (2011). The BHGLM approach will be used for the analysis of empirical data (GAW19) in the present study.

4.1 Overview

The availability of large amounts of data on rare variants and a lack of adequate statistical methods for these weak signals have stimulated the development of a great variety of statistical models. However, methods of this large family of rare variant tests¹ are often specific to certain data structures, e.g. binary phenotypes, a small number of variants, certain definitions of regions, or specific variant characteristics, which limits their general usefulness [Moutsianas and Morris 2014; Panoutsopoulou et al. 2013]. At the same time, Bayesian hierarchical models have received increasing interest in genetic epidemiology to model complex genetic structures. For example, Bayesian hierarchical models were applied to model copy number variations (CNV) [Zhang et al. 2014], epigenetic histone modifications (ChIP-seq) [Mitra and Müller 2014], or rare variants [He et al. 2015]. In particular, He et al. (2015) combined misclassification probability with shrinkage-based Bayesian variable selection in association studies of sequenced variants.

¹ Also known as burden tests or gene-based tests

Therefore, Yi & Zhi (2011), Yi, Liu, Zhi & Li (2011), and Yi et al. (2014) extended Generalized Linear Models (GLMs) [McCullagh & Nelder 1989; Nelder & Wedderburn 1972] by Bayesian hierarchical features to the Bayesian Hierarchical Generalized Linear Model framework, which can be applied to various types of outcomes (e.g. continuous, categorical), and can include non-genetic and genetic covariates to model effects of individual variants and of grouped variants. Additionally, prior information, e.g. from biological databases, can be integrated to support the identification and interpretation of genetic effects. Moreover, the BHGLM includes many models as special cases, e.g., classical GLMs, ridge regression, Bayesian lasso, and adaptive lasso [Yi, Liu, Zhi & Li 2011]. Here, the key ideas of the Bayesian Hierarchical Generalized Linear Model (BHGLM) framework [Yi, Liu, Zhi & Li 2011] will be summarised:

First, the application of Bayesian statistics allows the modelling of prior distributions. A prior distribution can strengthen or weaken (shrink) genetic signals depending on their parameter values for location and scale, which can also represent prior biological information in a formalised way. In the absence of prior knowledge, weakly informative prior distributions can be applied.

Second, the use of a hierarchical model in the linear predictor allows the joint estimation of effects of groups of variants and of individual variants. Additionally, the hierarchical decomposition of prior distributions into multiple other distributions allows variants in a group to share hyper-parameters, supporting the estimation of effects.

Third, the adaptation of the Generalized Linear Model framework [McCullagh and Nelder 1989; Nelder and Wedderburn 1972] makes its full flexibility of statistical modelling available for genetic analysis, i.e. including all error distributions of the exponential family, various link functions, and variable selection methods for predictors.

Fourth, the extension of the standard Iterative Weighted Least Squares (IWLS) algorithm by an Expectation-Maximisation (EM) algorithm allows the very efficient estimation of parameters and hyper-parameters, enabling the application of Bayesian methods to large-scale genome-wide analysis. Standard Markov Chain Monte Carlo (MCMC)-based simulation is not expected to be feasible for large-scale genetic association analysis because of the increased computing time and problems in evaluating convergence. Additionally, models including hundreds or thousands of conditional effects of genetic variants in a region of interest can be estimated simultaneously, which would be intractable in standard frequentist models.

Important concepts of the BHGLM including the linear predictor, link function, error distributions, and prior distributions will be explained in more detail in the following sections.

4.2 Linear predictor, link function, and error distribution

The Bayesian Hierarchical Generalized Linear Model by Yi, Liu, Zhi & Li (2011) is based on the generalized linear model (GLM) by McCullagh and Nelder (1989) and Nelder and Wedderburn (1972). A generalized linear model is defined by three components: the linear predictor, the link function, and the error distribution. The GLM framework includes various error distributions of the exponential family, for example, the normal distribution and the binomial distribution. A formal description of the standard GLM for the normal distribution (Appendix A.1) and the binomial model (Appendix A.2) is available in the appendix. A formal description of the BHGLM using the terminology by Yi, Liu, Zhi & Li (2011) is given in the following:

In the BHGLM, the multiplicative form of the linear predictor ηi is (see equations 4.1 and 4.2):

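\[ \eta_i = \sum_{j=0}^{J_0} x_{ij}\beta_j + \sum_{k=1}^{K} g_k \left( \sum_{j \in G_k}^{J_k} \alpha_j z_{ij} \right) \tag{4.1} \]

\[ \eta_i = \sum_{j=0}^{J_0} x_{ij}\beta_j + \sum_{k=1}^{K} \left( \sum_{j \in G_k}^{J_k} (g_k \alpha_j)\, z_{ij} \right) \tag{4.2} \]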

β0 represents the intercept, xj the covariate j, and βj the regression coefficient of covariate j. zj is the main effect predictor of genetic variant j in group Gk.


The indices i refer to values of individuals as observational units. The common coefficient gk represents the group-level effect for Jk variants in the k-th group, and the individual coefficients αj can be interpreted as the weights (relative effects) of the individual variants. The common coefficient gk estimates the association between the phenotype and the linear combination $\sum_{j \in G_k}^{J_k} \alpha_j z_{ij}$ of Jk individual variants. This linear combination can be considered as a genetic (risk) score. The common coefficient gk is an estimate of the cumulative effect (group-level effect) of the Jk individual variants in the k-th group. The term (gkαj) in equation 4.2 represents the total genetic effect of a single variant.

In contrast to the full BHGLM described above, a reduced BHGLM results if the group coefficients are omitted. In this case, the coefficients αj represent the main effect of genetic variant j, but still share hyper-parameters of the common prior distributions, supporting the estimation of individual variant effects.

The following (inverse) link function h⁻¹ (see equation 4.3) is used to associate the linear predictor with the raw phenotype:

\[ E(y_i \mid \eta_i) = h^{-1}(\eta_i) \tag{4.3} \]

The canonical link function for the normal error distribution is the identity function, and for the binomial error distribution the logit function, whose inverse is the logistic function [Nelder and Wedderburn 1972]. The probability density function (see equation 4.4) results in the likelihood for the observed data with η as the linear predictor and φ as the dispersion (or variance) parameter.

\[ p(y \mid \eta, \phi) = \prod_{i=1}^{n} p(y_i \mid \eta_i, \phi) \tag{4.4} \]
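The three GLM components of this section can be illustrated with a minimal Python sketch. It is not taken from the R/BhGLM package; all function names, the example individual, and the coefficient values are hypothetical and chosen only to show that equations 4.1 and 4.2 give the same linear predictor.

```python
import math

def linear_predictor(x, beta, z, alpha, g, groups):
    """Equation 4.1: covariate part plus g_k times the genetic score of group k."""
    eta = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    for k, members in enumerate(groups):
        score = sum(alpha[j] * z[j] for j in members)  # genetic (risk) score
        eta += g[k] * score
    return eta

def linear_predictor_total(x, beta, z, alpha, g, groups):
    """Equation 4.2: the same predictor written with total effects g_k * alpha_j."""
    eta = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    for k, members in enumerate(groups):
        eta += sum((g[k] * alpha[j]) * z[j] for j in members)
    return eta

def inverse_link(eta, family):
    """Equation 4.3 for the two canonical links named in the text."""
    if family == "normal":
        return eta                           # identity link
    if family == "binomial":
        return 1.0 / (1.0 + math.exp(-eta))  # inverse of the logit link
    raise ValueError("unsupported family: " + family)

def log_likelihood(y, eta, family, phi=1.0):
    """Logarithm of equation 4.4: sum of the individual log densities."""
    total = 0.0
    for y_i, eta_i in zip(y, eta):
        mean = inverse_link(eta_i, family)
        if family == "normal":
            # phi is the variance of the normal error distribution
            total += -0.5 * math.log(2.0 * math.pi * phi) - (y_i - mean) ** 2 / (2.0 * phi)
        else:
            total += y_i * math.log(mean) + (1 - y_i) * math.log(1.0 - mean)
    return total

# One hypothetical individual: intercept, sex, centred age; four variants in two groups.
x, beta = [1.0, 1.0, 5.0], [80.0, 10.0, 1.0]
z, alpha = [1, 0, 2, 0], [1.0, 0.5, 0.2, 0.1]
g, groups = [2.0, 3.0], [[0, 1], [2, 3]]
eta_grouped = linear_predictor(x, beta, z, alpha, g, groups)
eta_total = linear_predictor_total(x, beta, z, alpha, g, groups)
```

Both parameterisations return the same value of the linear predictor, which is why gk and αj are not separately identifiable without prior distributions.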


4.3 Prior distributions

The BHGLM makes use of prior distributions to separate individual variant and group variant effects (weights), which otherwise would not be identifiable. The t distribution was chosen as the default prior distribution in the BHGLM, because it showed desirable properties across a range of analytical problems [Gelman 2006; Gelman et al. 2008]. In the following, prior distributions for various components of the BHGLM are described in more detail. However, Yi, Liu, Zhi & Li (2011) use a hierarchical decomposition of prior distributions, which has the advantage that variants in a group share hyper-parameters, supporting the estimation of their effects, but the disadvantage of being less easy to calculate and interpret. Therefore, some information on the relationships and transformations between probability distributions is given:

According to Gelman (2014, pp. 435, 581), a t distribution is equivalent to a mixture of normals with common mean and unknown variances that follow an inverse-Γ distribution. The t distribution with a single degree of freedom is equivalent to a standard Cauchy distribution (with a location of zero and a scale of one). Moreover, the inverse-Γ distribution includes the inverse-χ²_df distribution as a special case with Γ(a, b) = Γ(df/2, 1/2). The scaled inverse-χ²(df, s²) distribution results from random draws of X ∼ χ²_df, which are transformed to a random variable df·s²/X. To give the reader some intuition of these prior distributions, examples of density distributions with selected parameter values of relevant families of distributions are plotted in the appendix (Appendix A.3). Depending on the component in the BHGLM, i.e. individual variants, groups of variants, covariates, and others, the authors use different decompositions of prior distributions.
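The mixture representation described above can be checked by simulation. The following sketch is only an illustration with hypothetical function names: it draws a variance from the scaled inverse-χ²(1, 1) distribution as df·s²/X and then a normal deviate with that variance, which should behave like a standard Cauchy variable. For a standard Cauchy distribution the median of the absolute value is 1, so the sample median is used as a rough check.

```python
import random
import statistics

def draw_scaled_inv_chi2(df, s2, rng):
    """Draw from the scaled inverse-chi^2(df, s2) distribution as df * s2 / X."""
    # X ~ chi^2_df, generated as a sum of df squared standard normal deviates
    x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return df * s2 / x

def draw_t_via_mixture(df, scale, rng):
    """Draw from a t distribution (location 0) as a normal with random variance."""
    tau2 = draw_scaled_inv_chi2(df, scale ** 2, rng)
    return rng.gauss(0.0, tau2 ** 0.5)

rng = random.Random(2015)  # fixed seed so that the sketch is reproducible
draws = [draw_t_via_mixture(1, 1.0, rng) for _ in range(20000)]
median_abs = statistics.median(abs(d) for d in draws)
```

With one degree of freedom and a scale of one, `median_abs` should lie close to 1, in line with the equivalence to the standard Cauchy distribution.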

4.3.1 Variant effects

Yi, Liu, Zhi & Li (2011) use the following hierarchical formulation of a mixture of normal distributions for the mean and a (half-)Cauchy distribution for the standard deviation (see equations 4.5 to 4.8):


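\[ \alpha_j \mid \tau_{\alpha_j} \sim N(\mu_j, \tau^2_{\alpha_j}) \tag{4.5} \]

\[ \tau^2_{\alpha_j} \mid s^2_{\alpha_j} \sim \text{Inv-}\chi^2(1, s^2_{\alpha_j}) \tag{4.6} \]

\[ s_{\alpha_j} \mid b_{k[j]} \sim \Gamma(0.5, b_{k[j]}) \tag{4.7} \]

\[ p(\ln(b_k)) \propto 1 \tag{4.8} \]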

The above equations can be interpreted as follows: 1) The individual variant coefficient αj follows a normal distribution with given mean µj and an unknown variant-specific variance τ²αj, resulting in a mixture of normal distributions with varying scale (standard error) ταj. 2) The prior distribution of scale ταj is a hierarchical formulation of a half-Cauchy distribution. The half-Cauchy distribution has desirable properties, such as being positive, having an infinite mode at the prior mean, high densities for extreme values (“heavy tails”), and computational efficiency. Index k[j] refers to group k, which variant j is a member of [Yi, Liu, Zhi & Li 2011].

The variant-level coefficient µj is a prior estimate of the importance of the individual variant, whose value can be derived from various sources [Yi, Liu, Zhi & Li 2011]:

• µj = 1 (ταj = 0): The genetic score is the simple sum of allele counts $\sum_{j \in G_k}^{J_k} z_{ij}$.

• µj = 0 (ταj >> 0): The genetic score is a weighted sum of allele counts, where an individual variant effect αj is shrunken towards 0. This means that variants with small effects will be removed from the model.1

• µj = m (ταj >> 0): The genetic score is a weighted sum of allele counts, where the prior mean µj represents the relative importance of an individual variant according to some biological score m.

1This prior mean is not recommended by Yi, Liu, Zhi & Li (2011), because, first, the combined small effects of variants (e.g. rare variants) might be important. Second, the group effect can be based on only a subset of variants and will be difficult to interpret. Third, the weights can be estimated from the data, and can allow for variants without effects.

In addition to the prior mean, however, scale parameter ταj has an important influence on the estimation of effects, since it controls the amount of shrinkage towards the prior mean. For example, if τ²αj = 0, the coefficient αj will be identical to the prior mean µj. If τ²αj = ∞, then parameters gk and αj cannot be separated. If τ²αj is finite, the coefficient αj is shrunken towards µj, but does not need to be identical to the prior mean µj. The group-level coefficient bk can also be used to estimate a group-specific level of shrinkage from the data. This configuration of prior distributions does not require the selection and tuning of parameters by an analyst [Yi, Liu, Zhi & Li 2011].

Finally, to allow for the influence of the prior distribution and the effect of the group parameter, the main effect of an individual variant j in group k can be adjusted in two ways: 1) The regression weight αj of an individual variant j has to be multiplied by the common group effect gk, resulting in the total effect gkαj. 2) The bias from the choice of the mean µj of the prior distribution of individual variant effect αj has to be removed. Therefore, the adjusted total main effect of individual variant j in group k can be calculated as gk(αj − µj) [Yi, Liu, Zhi & Li 2011].
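The role of the prior variance as a shrinkage parameter can be illustrated with a deliberately simplified example. The sketch below is not the BHGLM itself: it assumes a single coefficient with a normally distributed estimate and a fixed normal prior N(µ, τ²), for which the posterior mean is a precision-weighted average. The function names are hypothetical. The second function simply evaluates the adjusted total main effect gk(αj − µj) given in the text.

```python
import math

def shrunken_estimate(alpha_hat, std_error, mu, tau2):
    """Posterior mean for alpha_hat ~ N(alpha, std_error^2) and alpha ~ N(mu, tau2)."""
    if tau2 == 0.0:
        return mu            # prior variance of zero: estimate equals the prior mean
    if math.isinf(tau2):
        return alpha_hat     # flat prior: no shrinkage in this simplified setting
    w_data = 1.0 / std_error ** 2
    w_prior = 1.0 / tau2
    return (w_data * alpha_hat + w_prior * mu) / (w_data + w_prior)

def adjusted_total_effect(g_k, alpha_j, mu_j):
    """Adjusted total main effect g_k * (alpha_j - mu_j) of variant j in group k."""
    return g_k * (alpha_j - mu_j)

# The same raw estimate under increasingly wide priors centred at mu = 0
estimates = [shrunken_estimate(2.0, 1.0, 0.0, tau2) for tau2 in (0.0, 0.1, 1.0, 100.0)]
```

The list `estimates` moves from the prior mean towards the raw estimate as τ² grows, which mirrors the three cases (τ² = 0, finite τ², very large τ²) discussed above.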

4.3.2 Group effects

Group effects can be estimated with uniform prior distributions if the number of groups is small. However, low allele frequencies can result in very small variances for the genetic scores $\sum_{j \in G_k}^{J_k} \alpha_j z_{ij}$ and, hence, in unreliable estimates. A solution to this problem is the use of a weakly informative prior distribution on group effect gk. The prior distribution for group effect gk is again formalised as a hierarchical prior distribution (see equations 4.9 to 4.11) [Yi, Liu, Zhi & Li 2011]:

\[ g_k \mid \tau^2_{g_k} \sim N(0, \tau^2_{g_k}) \tag{4.9} \]

\[ \tau^2_{g_k} \mid s^2_{g_k} \sim \text{Inv-}\chi^2(1, s^2_{g_k}) \tag{4.10} \]

\[ s^2_{g_k} \sim \Gamma(0.5, 0.5) \tag{4.11} \]

Similar to the hierarchical distribution of variant effects, the group effect gk follows a normal distribution with a given mean of 0 and an unknown group-specific variance τ²gk, resulting in a mixture of normal distributions with varying scale (standard error) τgk. The group-specific parameter sgk follows a weakly informative prior distribution Γ(0.5, 0.5). The group effect gk can be interpreted as a shared common regression coefficient of Jk variants in group k. The coefficient can be interpreted as the (weighted) average effect per allele. However, the direction of the effect cannot be interpreted without considering individual variant effects, which can have positive or negative signs [Yi, Liu, Zhi & Li 2011].

4.3.3 Covariate effects, intercept, and dispersion

The prior distribution for a covariate effect βj (see equations 4.12 to 4.14) has the same structure as the prior distribution for a group effect (see above) [Yi, Liu, Zhi & Li 2011]:

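\[ \beta_j \mid \tau^2_{\beta_j} \sim N(0, \tau^2_{\beta_j}) \tag{4.12} \]

\[ \tau^2_{\beta_j} \mid s^2_{\beta_j} \sim \text{Inv-}\chi^2(1, s^2_{\beta_j}) \tag{4.13} \]

\[ s^2_{\beta_j} \sim \Gamma(0.5, 0.5) \tag{4.14} \]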

For the intercept β0 (see equation 4.15) and the dispersion parameter φ (see equation 4.16) non-informative prior distributions are used [Yi, Liu, Zhi & Li 2011]:

\[ p(\beta_0) = N(0, \tau_0^2), \quad \text{with } \tau_0^2 \gg 0 \tag{4.15} \]

\[ p(\log(\phi)) \propto 1 \tag{4.16} \]

4.4 Estimation algorithm (IWLS-EM)

Yi, Liu, Zhi & Li (2011) reduced the computational burden of fitting Bayesian models, compared with estimation based on a Markov Chain Monte Carlo (MCMC) algorithm, by using a modified Iterative Weighted Least Squares (IWLS) algorithm that includes a flexible Expectation-Maximisation (EM) algorithm [Yi, Liu, Zhi & Li 2011, Appendix S1] (see also Yi and Ma (2012) and Gelman (2014, pp. 444)). Their approach is based on the classic IWLS algorithm, which was extended in two ways:

1. Two updating steps were introduced to estimate individual variant effects and group variant effects in a hierarchical model. These updating steps are conditional on each other.

2. An Expectation-Maximisation (EM) algorithm was added which treats unknown variances τ²j and hyper-parameters s²j and bk[j] as missing data values in the augmented data (design) matrix.

Important steps of the IWLS-EM algorithm are summarised in the following [Yi, Liu, Zhi & Li 2011, Appendix S1]:

Initial values The model is initialised by setting the parameter vector (β, α, g, φ, τ², s², b) to plausible starting values, for example:

• individual variant effects: αj = µj, ταj = sαj = 0.5, bk = 0.5

• group effects: gk = 0, τgk = sgk = 1

• covariate effects: βj = 0, τβj = sβj = 1

• total model: φ = 1, β0 = 0

Variant step αj values are updated conditionally on the prior distribution N(µj, τ²j), the current estimate of ĝk, and other covariates βj with ẑij = zij ĝk (see linear predictor in equation 4.17):

\[ \eta_i = \sum_{j=0}^{J_0} x_{ij}\beta_j + \sum_{k=1}^{K} \sum_{j \in G_k}^{J_k} \hat{z}_{ij} \alpha_j \tag{4.17} \]

Group step The estimates of group effects gk are updated conditionally on the prior distribution N(0, τ²gk), the current estimates of α̂j, and other covariates with $\hat{T}_{ik} = \sum_{j \in G_k}^{J_k} \hat{\alpha}_j z_{ij}$ (see linear predictor in equation 4.18):

\[ \eta_i = \sum_{j=0}^{J_0} x_{ij}\beta_j + \sum_{k=1}^{K} \hat{T}_{ik} g_k \tag{4.18} \]


Update At each step of the iteration, terms involving missing values are replaced by their conditional expectations. The coefficients β, φ are updated by maximising the expected value of the joint log-posterior density log[p(β, φ, τ², s², b|y)].

Expectation step In the E-step, the expectations of the conditional posterior distributions $p(\tau_j^{-2} \mid \beta_j, s_j)$, $p(s_j^2 \mid \tau_j, b_{k[j]})$, and $p(b_{k[j]} \mid \{s_j; j \in G_k\})$ are derived from the joint log-posterior density.

Maximisation step In the M-step, the parameter vector (β, φ) is updated by maximising

\[ \sum_{i=1}^{n} \ln p(y_i \mid X_i\beta, \phi) - \frac{1}{2} \sum_{j=0}^{J} \frac{(\beta_j - \mu_j)^2}{\hat{\tau}_j^2}. \]

Therefore, the classical IWLS algorithm can be applied to approximate the likelihood of the GLM p(yi|Xiβ, φ) by the likelihood of a weighted normal linear model with the pseudo (working) response zi and the pseudo-weight wi (see equation 4.19):

\[ p(y_i \mid X_i\beta, \phi) \approx N(z_i \mid X_i\beta, w_i^{-1}\phi) \tag{4.19} \]

An additional advantage of this augmented data matrix X∗ is that this normal linear regression model is well-defined and has finite variances of regression estimates, even if the original design matrix X features high-dimensionality, collinearity, or separation, which would make a classical generalized linear model non-identifiable. More details of the IWLS-EM algorithm are described in Yi, Liu, Zhi & Li (2011, Appendix S1).

Convergence The convergence of the algorithm is evaluated by using the standard criterion of the classical GLM with d(t) as the deviance at iteration t (4.20):

\[ \frac{|d^{(t)} - d^{(t-1)}|}{0.1 + |d^{(t)}|} < \epsilon, \quad \text{e.g. } < 10^{-5} \tag{4.20} \]
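The M-step and the convergence criterion can be sketched for the simplest possible case. The following Python code is an illustration under strong assumptions and not the BhGLM implementation: it fits a logistic model with a single coefficient and no intercept, keeps the prior N(µ, τ²) fixed instead of updating τ² in an E-step, and uses hypothetical function names. The prior enters the weighted least squares update as one additional pseudo-observation, which is the idea behind the augmented data matrix.

```python
import math

def converged(dev_new, dev_old, eps=1e-5):
    """Equation 4.20: relative change of the deviance between two iterations."""
    return abs(dev_new - dev_old) / (0.1 + abs(dev_new)) < eps

def fit_logistic_map(x, y, mu=0.0, tau2=1.0, max_iter=50):
    """IWLS for a one-coefficient logistic model with a fixed N(mu, tau2) prior."""
    beta, dev_old = 0.0, float("inf")
    for _ in range(max_iter):
        eta = [x_i * beta for x_i in x]
        p = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [p_i * (1.0 - p_i) for p_i in p]                             # pseudo-weights
        z = [e + (y_i - p_i) / w_i for e, y_i, p_i, w_i in zip(eta, y, p, w)]  # pseudo-response
        # Weighted least squares; the prior acts as one extra observation mu with weight 1/tau2.
        numerator = sum(w_i * x_i * z_i for w_i, x_i, z_i in zip(w, x, z)) + mu / tau2
        denominator = sum(w_i * x_i * x_i for w_i, x_i in zip(w, x)) + 1.0 / tau2
        beta = numerator / denominator
        p_new = [1.0 / (1.0 + math.exp(-x_i * beta)) for x_i in x]
        dev = -2.0 * sum(y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
                         for y_i, p_i in zip(y, p_new))
        if converged(dev, dev_old):
            break
        dev_old = dev
    return beta

# Perfectly separated toy data: the maximum likelihood estimate would diverge,
# whereas the estimate under the prior stays finite.
beta_hat = fit_logistic_map([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

For these toy data the estimate settles at approximately 1.0, which illustrates why the penalised model remains well-defined under separation.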

Estimates Estimates of parameters, standard errors, and p values are derived from the last iteration, which meets the convergence criterion. Hypothesis tests are performed for group effects gk = 0 and adjusted total main effects of individual variants ĝk(α̂j − µj) = 0. To derive the standard errors for the adjusted total main effects as a function of the standard errors of group effects and unadjusted individual variant effects, the delta method is applied (see equation 4.21):

\[ \mathrm{Var}(\hat{g}_k(\hat{\alpha}_j - \mu_j)) = \hat{g}_k^2 \cdot \mathrm{Var}(\hat{\alpha}_j) + (\hat{\alpha}_j - \mu_j)^2 \cdot \mathrm{Var}(\hat{g}_k) \tag{4.21} \]
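Equation 4.21 can be evaluated directly. The short sketch below uses a hypothetical function name and made-up input values; like the equation, it contains no covariance term between the group effect and the individual variant effect.

```python
def var_adjusted_total_effect(g_hat, var_g, alpha_hat, var_alpha, mu):
    """Equation 4.21: approximate variance of g_hat * (alpha_hat - mu)."""
    return g_hat ** 2 * var_alpha + (alpha_hat - mu) ** 2 * var_g

# Hypothetical estimates: g = 2.0 (variance 0.25), alpha = 1.5 (variance 0.04), mu = 1.0
variance = var_adjusted_total_effect(2.0, 0.25, 1.5, 0.04, 1.0)
std_error = variance ** 0.5
```

The resulting standard error can then be used in the Wald-type tests of the adjusted total main effects.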

4.5 Bayesian p values

As explained in previous chapters, posterior probabilities (Bayesian p values) from Bayesian models using weakly informative priors are equivalent to one-sided frequentist p values. Since the BHGLM is based on the frequentist GLM framework and frequentist methods are usually applied in genetic association studies to identify genetic signals, Yi, Liu, Zhi & Li (2011) make use of this analogy by providing Bayesian p values and 95% credible intervals for individual variants and grouped variant effects. For this purpose, the BHGLM applies various Wald-test-like procedures.

4.5.1 Individual variants

As explained in the previous section, estimates and standard errors can be derived from the IWLS-EM algorithm on individual variant level which can directly be used to perform a Wald test (see equations 4.22 to 4.24):

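\[ P(\alpha_j = \alpha_{j0} = 0) = P(T > \chi^2_1(1 - \text{type I error})) \tag{4.22} \]

\[ T = \frac{(\hat{\alpha}_j - \alpha_{j0})^2}{\mathrm{var}(\hat{\alpha}_j)} \tag{4.23} \]

\[ T \sim \chi^2_1 \tag{4.24} \]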
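The Wald test of equations 4.22 to 4.24 can be computed without statistical libraries, because the upper tail of the χ² distribution with one degree of freedom can be written with the complementary error function. The sketch below is an illustration with a hypothetical function name and is not the BhGLM code.

```python
import math

def wald_test(estimate, std_error, null_value=0.0):
    """Return the Wald statistic T (equation 4.23) and P(chi^2_1 > T)."""
    t = (estimate - null_value) ** 2 / std_error ** 2
    # For Z ~ N(0, 1): P(Z^2 > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t / 2))
    p_value = math.erfc(math.sqrt(t / 2.0))
    return t, p_value

t_stat, p_value = wald_test(1.96, 1.0)
```

An estimate of 1.96 standard errors gives a p value of approximately 0.05, as expected for a two-sided test on the normal scale.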


4.5.2 Grouped variants

On group level, there are several methods to calculate the Bayesian p value of group effects depending on the calculated model: First, if a full Bayesian hierarchical model is estimated, which includes group coefficients gk (see equation 4.1), a group effect can be evaluated based on the estimated group coefficient, its standard error, and the derived Bayesian p value similar to individual variants.

Second, if a reduced Bayesian hierarchical model without a hierarchical group effect is estimated, an average effect test of grouped variants can be applied (equations 4.25 to 4.27). Here, the test statistic is the ratio of the difference of the weighted average of the coefficients of the variants in the group minus the null value and the standard deviation of the weighted average of the coefficients of the variants in the group, evaluated by a Wald test [Yi, Liu, Zhi & Li 2011]:

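\[ P(\bar{\alpha}_k = \bar{\alpha}_{k0} = 0) = P(T > \mathrm{probit}(1 - \text{type I error}/2)) \tag{4.25} \]

\[ T = \mathrm{abs}\left( \frac{\hat{\bar{\alpha}}_k - \bar{\alpha}_{k0}}{\hat{\sigma}_{\bar{\alpha}_k}} \right) \tag{4.26} \]

\[ T \sim N(0, 1) \tag{4.27} \]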
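The average effect test can be sketched as follows. The weights of the weighted average are not specified in the text, so equal weights are used here as an assumption; the function name and the example values are hypothetical. The variance of the weighted average is obtained from the covariance matrix of the coefficients.

```python
import math

def average_effect_test(alpha_hat, cov, weights=None, null_value=0.0):
    """Wald-type test of a weighted average of coefficients (equations 4.25 to 4.27)."""
    n = len(alpha_hat)
    if weights is None:
        weights = [1.0 / n] * n          # assumption: equal weights
    mean = sum(w * a for w, a in zip(weights, alpha_hat))
    variance = sum(weights[i] * weights[j] * cov[i][j]
                   for i in range(n) for j in range(n))
    t = abs(mean - null_value) / math.sqrt(variance)
    p_value = math.erfc(t / math.sqrt(2.0))  # two-sided tail probability of N(0, 1)
    return t, p_value

# Two hypothetical, uncorrelated coefficients with unit variances
t_avg, p_avg = average_effect_test([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
```

Because coefficients with opposite signs cancel in the average, this test is mainly sensitive to groups in which most variants act in the same direction.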

Third, if a reduced Bayesian hierarchical model is estimated, a joint effect test of grouped variants can be calculated. The joint effect test (see equations 4.28 to 4.30) uses a quadratic statistic Q to perform a test of the null hypothesis H0: αj = 0, j = 1, .., jk. W is a matrix of weights, for example, the inverse of the covariance matrix of the tested coefficients αj, j = 1, .., jk of group k.

\[ P(\alpha_j = 0,\; j = 1, .., j_k) = P(Q > \chi^2_{j_k}(1 - \text{type I error})) \tag{4.28} \]

\[ Q = \hat{\alpha}^T \times \hat{W} \times \hat{\alpha} \tag{4.29} \]

\[ Q \sim \chi^2_{j_k} \tag{4.30} \]


Here, the test assumes that the coefficients αj follow a multivariate normal distribution. If a parameter z follows a univariate standard normal distribution, then z² follows a χ² distribution with one degree of freedom. Therefore, the weighted sum of independent χ² distributions with one degree of freedom can be approximated by a χ² distribution with a number of degrees of freedom equivalent to the number of tested coefficients jk [Yi, Liu, Zhi & Li 2011; Zhang 2005].
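The joint effect test can be illustrated with a small, self-contained sketch. The function names are hypothetical, and the weight matrix W has to be supplied by the caller, for example as the inverse of the covariance matrix of the coefficients. The upper tail of the χ² distribution for integer degrees of freedom is computed from the closed forms for one and two degrees of freedom and the recursion Q(a + 1, x) = Q(a, x) + xᵃe⁻ˣ/Γ(a + 1) of the regularised upper incomplete gamma function.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) for X ~ chi^2 with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        sf, a = math.erfc(math.sqrt(half)), 0.5   # closed form for df = 1
    else:
        sf, a = math.exp(-half), 1.0              # closed form for df = 2
    while a < df / 2.0:                           # step up by two degrees of freedom
        sf += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return sf

def joint_effect_test(alpha_hat, weight_matrix):
    """Equations 4.28 to 4.30: quadratic form Q and its chi^2 tail probability."""
    n = len(alpha_hat)
    q = sum(alpha_hat[i] * weight_matrix[i][j] * alpha_hat[j]
            for i in range(n) for j in range(n))
    return q, chi2_sf(q, n)

# Two hypothetical, uncorrelated coefficients with unit variances (W = identity)
q_stat, p_joint = joint_effect_test([2.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

In contrast to the average effect test, the quadratic form does not cancel effects with opposite signs.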

5 Methods

This chapter describes the data and analysis methods applied in the present study. The data included simulated data and empirical data from the Genetic Analysis Workshop 19 (GAW19). A Bayesian hierarchical generalized linear model was used in the primary analysis.

5.1 Data

The data comprises fully simulated data, partially simulated data (GAW19) and fully empirical data (GAW19).

5.1.1 Simulated data

Simulated data I 1 was designed to be simple, but with sufficient medical plausibility, to allow systematic simulations for a large number of samples. A sample size of 10,000 individuals was chosen to avoid zero allele counts for the estimation of effects of rare variants. For example, the minor allele of the rare variant with the lowest MAF = 0.001 had an expected allele count of 20 alleles in the dataset (10,000 × 2 × 0.001 = 20 alleles). The data contained diastolic blood pressure (DBP), systolic blood pressure (SBP), hypertension (HTN), sex, age, and five rare and five common genetic variants. Simulated data Ia consisted of 1,000 samples without simulated genetic effects, and simulated data Ib contained 1,000 samples with simulated genetic effects. Five genetic variants in each of two categories according to minor allele frequencies were generated using a binomial model:

1This dataset was not distributed by the GAW19 organisers.


• 5 rare variants with MAF > 0.000 and MAF < 0.010, i.e. rv1=0.001, rv2=0.002, rv3=0.003, rv4=0.004, rv5=0.005

• 5 common variants with MAF ≥ 0.05 and MAF ≤ 0.50, i.e. cv1=0.100, cv2=0.200, cv3=0.300, cv4=0.400, cv5=0.500

The following linear predictor (equations 5.1 and 5.2) was used to simulate diastolic blood pressure (DBP) without genetic effects in simulated data Ia:

µ = 80.0 + 10.0 × sex + 1.0 × (age − 50.0) (5.1)

DBP ∼ N(µ, 10) (5.2)

Additionally, a similar linear predictor (equations 5.3 to 5.4) was applied to simulate diastolic blood pressure (DBP) with genetic effects in simulated data Ib:

µ = 80.0 + 10.0 × sex + 1.0 × (age − 50.0) (5.3)

+ 10.0 × rv1 + 20.0 × rv2

+ 10.0 × cv1 + 20.0 × cv2

DBP ∼ N(µ, 10) (5.4)

The linear predictor for systolic blood pressure (SBP) was identical except for using an intercept of 120 (instead of 80) and a normal error distribution with a standard deviation of 15 (instead of 10). Individuals were categorised as hypertensive if SBP > 140 mmHg and DBP > 90 mmHg.
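The simulation model for diastolic blood pressure can be sketched in Python as follows. The code is an illustration and not the original simulation script: the coding of sex as 0/1 with equal frequencies and the uniform age distribution between 40 and 60 years are assumptions, since these details are not given in the text, and all names are hypothetical. Genotypes are drawn as the number of minor alleles among two independent alleles, i.e. from a binomial model.

```python
import random

RARE_MAF = {"rv1": 0.001, "rv2": 0.002, "rv3": 0.003, "rv4": 0.004, "rv5": 0.005}
COMMON_MAF = {"cv1": 0.100, "cv2": 0.200, "cv3": 0.300, "cv4": 0.400, "cv5": 0.500}
EFFECTS = {"rv1": 10.0, "rv2": 20.0, "cv1": 10.0, "cv2": 20.0}  # equation 5.3

def expected_minor_allele_count(n_individuals, maf):
    """Expected number of minor alleles in a sample of diploid individuals."""
    return n_individuals * 2 * maf

def dbp_mean(sex, age, genotypes=None):
    """Equation 5.1 (no genotypes) or equation 5.3 (with genotypes)."""
    mu = 80.0 + 10.0 * sex + 1.0 * (age - 50.0)
    if genotypes is not None:
        mu += sum(EFFECTS.get(name, 0.0) * count for name, count in genotypes.items())
    return mu

def simulate_individual(rng, with_genetic_effects):
    """Simulate one individual of simulated data Ia (False) or Ib (True)."""
    sex = rng.randint(0, 1)        # assumption: 0/1 coding with equal frequencies
    age = rng.uniform(40.0, 60.0)  # assumption: age distribution is not given in the text
    genotypes = {}
    for name, maf in {**RARE_MAF, **COMMON_MAF}.items():
        genotypes[name] = sum(rng.random() < maf for _ in range(2))  # Binomial(2, MAF)
    mu = dbp_mean(sex, age, genotypes if with_genetic_effects else None)
    dbp = rng.gauss(mu, 10.0)      # equations 5.2 and 5.4: standard deviation of 10
    return {"sex": sex, "age": age, "dbp": dbp, **genotypes}

rng = random.Random(19)
sample = [simulate_individual(rng, with_genetic_effects=True) for _ in range(1000)]
```

The helper `expected_minor_allele_count(10000, 0.001)` reproduces the expected count of 20 minor alleles mentioned above for the rarest variant.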

5.1.2 GAW19 Data

General

The data consisted of the GAW19 unrelated dataset (GAW19), which was provided by the T2D-GENES 1 consortium and distributed by the organisers of the Genetic Analysis Workshop 19 1 on April 30, 2014. It included 1,943 unrelated individuals with 1,021 type 2 diabetes cases and 922 controls with Mexican-American (Hispanic) ancestry [Almasy et al. (2014), Genetic Analysis Workshop (2014a), Genetic Analysis Workshop (2014b)]. The genetic data comprised whole-exome data from odd-numbered chromosomes (1, 3, . . . , 21) including 1,765,688 genetic variants in various formats (i.e. allele count, allele dosage, and others). Only allele counts (coded as 0, 1, or 2 alleles) of bi-allelic single-nucleotide variants were included in the current analysis. All samples (individuals) and all variants passed quality control, which was performed by the T2D-GENES consortium [Almasy et al. (2014), Genetic Analysis Workshop (2014a), Genetic Analysis Workshop (2014b)]. The non-genetic data comprised simulated (200 time points) and observed systolic blood pressure (SBP), diastolic blood pressure (DBP), and hypertension as outcomes and sex, age, and medication as covariates. Ancestry principal components were calculated from genetic data. Data on type 2 diabetes (T2D) was not available.

1Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples

Study collection I The first part of the data consists of unrelated individuals from a pedigree-based, cross-sectional sample (272 T2D cases, 218 controls, 490 total), which was collected between January 1992 and June 1995 including:

1. San Antonio Family Heart Study [Mitchell et al. 1996b]

2. San Antonio Family Diabetes/Gallbladder Study [Hunt et al. 2005]

3. Veterans Administration Genetic Epidemiology Study [Coletta et al. 2009]

4. Family Investigation of Nephropathy and Diabetes Study [Knowler et al. 2005]

Individuals from San Antonio were included, if the proband was between 40 and 60 years old. The study was approved by the institutional review board and all subjects gave their informed consent.

1http://www.gaworkshop.org

Study collection II The second part of the data consists of the population-based Starr County Texas Study [Below et al. 2011; Hanis et al. 1983] (749 T2D cases and 704 controls), which was collected between 2002 and 2006. Individuals from Starr County, Texas, USA, were included if the individual was 20 years or older. Informed consent was obtained from all participants.

Non-genetic data

Observed phenotypes The GAW19 data comprised empirical (“real world”) data including systolic blood pressure (SBP, in mmHg), diastolic blood pressure (DBP, in mmHg), and hypertension (0 = no hypertension, 1 = hypertension) as phenotypes (responses). Hypertension was diagnosed in individuals with SBP > 140 mmHg and DBP > 90 mmHg. Additionally, empirical covariates such as sex (1 = male, 2 = female), age (years), anti-hypertensive medication (0 = no, 1 = yes), and the examination date (year) were given. In the study collection I studies, systolic and diastolic blood pressure were measured to the nearest even digit using a random-zero sphygmomanometer (Hawksley-Gelman) on the right arm of the seated subject after resting for five minutes. The blood pressure measurement was repeated three times for each individual. The participant’s blood pressure was defined as the arithmetic mean of the second and third reading. In the study collection II studies, systolic and diastolic (4th Korotkoff sound) blood pressure were measured with one reading.

Simulated phenotypes The GAW19 data also contained simulated phenotypes (SBP, DBP, HTN) for 200 replicates (“time points”). The simulated phenotypes used a model which included 1,730 coding variants in 128 genes as causal variants. Genetic effects were based on associations between observed gene expression in lymphocytes with DBP and SBP, known effects of coding variant alleles on gene expression, and the PolyPhen-2 score [Adzhubei et al. 2013] to obtain biologically plausible associations. The full simulation model is published elsewhere1 [Bickeböller et al. 2014].

1There are differences regarding the number of genetic variants and genes with simulated effects between the GAW19 family data (as published) and the GAW19 unrelated data (unpublished).

Covariates The GAW19 data included sex, age, exam year, and medication in the real phenotype dataset. Additionally, 10 ancestry principal components (APCs) were calculated from a pruned data set including 465,887 polymorphic variants in linkage equilibrium (R² ≤ 0.5 with a window size of 50 bp and step size of 5 bp). The inclusion of APCs as covariates in GWAS is recommended to adjust for confounding by ethnicity (i.e., population stratification) [Price et al. 2006]. APCs were computed using multidimensional scaling as implemented in PLINK [Purcell et al. 2007].

Genetic data

Overview The genetic data comprises 1,765,688 exome-sequenced genetic variants of all odd-numbered (1, 3, ..., 21) autosomal chromosomes. All genetic variants were identified by chromosome, position, reference allele and alternate allele according to assembly 19 (hg19, GRCh37). Allele dosages (and allele counts) represent the estimated number of minor alleles per variant with values between 0 and 2 alleles. Tables 6.3 (Results) and A.2 (Appendix) give an overview on the number of genetic variants.

Quality control The provided genetic data was quality-controlled by the T2D-GENES consortium as documented [Genetic Analysis Workshop (2014a), Genetic Analysis Workshop (2014b)]. Variants were excluded on the basis of call rate (< 90% in any study), deviation from Hardy-Weinberg equilibrium (p < 1 × 10⁻⁶, females only for the X chromosome) or differential call rate between type 2 diabetes cases and controls (p < 1 × 10⁻⁴). Autosomal variants with a minor allele frequency (MAF) > 1%, which passed quality control, were selected for trans-ethnic kinship analyses. Identity-by-state (IBS) between each pair of samples on the basis of independent variants (trans-ethnic r² < 0.05) was computed and principal components analysis performed with the software EIGENSTRAT [Price et al. 2006] to identify ethnic outliers. IBS values were also used to identify duplicates. The sample from each pair with lowest call rate and/or mismatch with external information was removed. A subset of individuals, which were considered as unrelated based on their kinship coefficients, was finally included in the GAW19 unrelated dataset.

Additionally, in frequentist single-variant analysis, variants with a minor allele count ≥ 2 were selected for interpretation [de Winter 2013]. In Bayesian hierarchical models, individual variants with a minor allele count ≥ 1 were analysed, considering that estimates were aggregated, e.g. to a genetic score, for which estimates even of singleton variants were potentially relevant. On the level of grouped variants, only groups with ≥ 3 variants per group were considered, to obtain reliable estimates of group effects.

Annotation Genetic variants were identified using the Human Genome Reference Assembly 37 (GRCh37, UCSC hg19). All 461,868 variants of the primary analysis set of the GAW19 data were functionally annotated using the RefSeq (release 66) database1. 241,377 (52.26%) of 461,868 variants were identified in the dbSNP (build 138) database2 (Table A.2 in the appendix).

Simulation The simulated data of the GAW19 unrelated dataset contained 1,730 variants in 128 genes with simulated effects on either diastolic or systolic blood pressure.3 1,517 variants in 111 genes had non-null effects on diastolic blood pressure. The simulated effects, which were the difference in trait mean for each copy of the minor allele of an individual, ranged from -6.19 to 4.16 (mean = 0.02, standard deviation (SD) = 0.94). 122 (8.04%) variants were rare (MAF < 0.01) and 1,395 (91.96%) were common. 1,221 variants in 93 genes had non-null effects on systolic blood pressure. Effects ranged from -7.83 to 6.05 (mean = 0.09, SD = 1.28). 102 (8.35%) variants were rare (MAF < 0.01) and 1,119 (91.65%) were common. 1,008 variants in 76 genes had non-null effects on diastolic and systolic blood pressure, and, therefore, were considered as risk variants for hypertension, i.e. SBP > 140 and DBP > 90. 88 (8.73%) variants were rare (MAF < 0.01) and 920 (91.27%) were common (MAF ≥ 0.01). An overview of simulated variants in the 128 genes is given in Table A.6 in the appendix.

Additionally, 73,888 variants in 1,000 genes without known simulated non-null effects were selected for sensitivity (specificity) analysis. The selected genes were located on odd-numbered chromosomes, were at least 1 Mb from the start or stop positions outside of the transcription region of non-null genes and contained at least 5 rare variants and 5 common variants.

1http://www.ncbi.nlm.nih.gov/refseq/
2http://www.ncbi.nlm.nih.gov/SNP/
3The file with the correct simulated genetic effects was sent by Laura Almasy on October 10, 2014, per email. These variants (genes) will be referred to as functional, causal, simulated, or non-null variants (genes).

5.2 Statistical models

This section describes the applied statistical models to identify individual variant and grouped variant effects on blood pressure. Bayesian hierarchical generalized linear models of multiple variants were used in the primary analysis. Frequentist generalized linear models of individual variants were used in the secondary (sensitivity) analysis.

5.2.1 Frequentist analysis of individual variants

The statistical models included diastolic blood pressure (DBP), systolic blood pressure (SBP) or hypertension (HTN) as phenotypes (outcomes). DBP (in mmHg) and SBP (in mmHg) were modelled based on a normal error distribution with an identity link function, HTN (0 = no hypertension, 1 = hypertension) based on a binomial error distribution with a logit link function. Covariates comprised sex (0 = male, 1 = female), age (in years), and two principal components. Genetic predictors included one polymorphic, bi-allelic single-nucleotide variant, which contained the count of the minor allele, i.e. 0, 1, or 2 alleles.

However, separation is a common problem in logistic regression models, especially with sparse data, i.e. rare variants. Separation occurs if predictors are highly (perfectly) correlated with the outcome, leading to extreme estimates and standard errors and loss of statistical power. To overcome the problem of separation, Firth bias-corrected logistic regression [Firth 1993] was applied. It uses a penalised likelihood which shrinks extreme estimates towards zero. Firth logistic regression showed superior behaviour in case-control genetic association studies which included low count variants [Ma et al. 2013].


5.2.2 Bayesian analysis of multiple variants

Bayesian hierarchical generalized linear models as proposed by Yi, Liu, Zhi & Li (2011) were applied to model effects of individual variants and groups of variants on blood pressure. Here, the individual variants of each gene are assigned to a group of rare (MAF < 1%) or common (MAF ≥ 1%) variants depending on their minor allele frequency. This statistical model has been described in general terms in chapter 4.

Estimation and testing

A full Bayesian hierarchical model, which includes coefficients of group effects (see equation 4.1) and a reduced Bayesian hierarchical model, which excludes coefficients of group effects were calculated. The reason is that the implementation of the full BHGLM approach1 in the R/BhGLM package (V1.0.0, release 2013-12-15) calculates effects of individual and groups of variants jointly, but only gives variable weights for individual variants without testing them (i.e. no standard errors, credible intervals, or Bayesian p values). The implementation of the reduced BHGLM approach2 does not estimate a coefficient of the group effect, but estimates and tests effects of individual variants. Individual variants can be assigned here to groups, which share hyper-parameters of prior distributions, supporting the estimation of effects. Joint (primary analysis) and average (secondary analysis) effects of groups of variants were tested with a Wald test3 [Yi, Liu, Zhi & Li 2011; Zhang 2005]. The estimation of a group coefficient required at least two variants per group to be identifiable.

Prior distributions

Weakly informative prior distributions were based on t-distributions with one degree of freedom, which are equivalent to Cauchy distributions with the same location⁴ and a scale of one. In the full BHGLM, the prior location was set to 0 and the scale to 1 for group effects, and the prior location to 1 and the scale to 1 for individual effects, following the default values. In the reduced BHGLM, the prior location was set to 0 and the scale to 1, which are the default values. In both models, hyper-parameters were not updated in the estimation process, in order to keep control of the prior distributions. If not mentioned otherwise, default values were used.

¹ See the BhGLM::bglm.ex() function (version 1.0.0).
² See the BhGLM::bglm() function (version 1.0.0).
³ See the BhGLM::waldtest.bglm() function (version 1.1.0).
⁴ The mean and standard deviation are not defined for Cauchy distributions; therefore, the parameters location and scale are used.
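The equivalence of a t-distribution with one degree of freedom and a Cauchy distribution can be checked numerically. The sketch below evaluates both densities with the Python standard library; it is an illustration only and does not reproduce how R/BhGLM handles priors internally.

```python
import math


def cauchy_pdf(x, location=0.0, scale=1.0):
    """Density of a Cauchy(location, scale) distribution."""
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def student_t_pdf(x, df, location=0.0, scale=1.0):
    """Density of a location-scale Student t distribution with df degrees of freedom."""
    z = (x - location) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df)) / scale


# With df = 1 the two densities coincide at every point.
for x in (-5.0, -1.0, 0.0, 0.5, 3.0):
    print(x, cauchy_pdf(x), student_t_pdf(x, df=1))
```

The heavy tails of this density are what make the prior weakly informative: effects near the location are shrunk, while large effects remain plausible.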

5.2.3 Multiple testing

In genome-wide association analysis, a global type I error of 0.05 is typically applied, which results in a local Bonferroni-adjusted α-threshold of 5 × 10⁻⁸, assuming one million independent tests (or variants) across the genome. In our study, 461,868 variants in exomes on odd-numbered chromosomes were studied, which corresponds to a local α = 0.05/461,868 = 1.08 × 10⁻⁷. In a hierarchical model with effects of individual variants and of two groups of variants per gene, 2 × 13,788 genes = 27,576 further tests are added, resulting in a local Bonferroni-adjusted α = 0.05/(461,868 + 2 × 13,788) = 1.02 × 10⁻⁷. Therefore, a local type I error threshold of 1 × 10⁻⁷ was adopted both for the primary analysis (a Bayesian analysis of multiple variants reporting Bayesian p values) and for the secondary analysis (a frequentist analysis of individual variants reporting frequentist p values).
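The threshold arithmetic above can be reproduced in a few lines. The counts are those stated in the text; the helper name `bonferroni_threshold` is introduced here for illustration only.

```python
def bonferroni_threshold(global_alpha, n_tests):
    """Local significance threshold that keeps the global type I error at global_alpha."""
    return global_alpha / n_tests


n_variants = 461_868   # analysed variants (from the text)
n_genes = 13_788       # genes (from the text)
groups_per_gene = 2    # rare and common variant group per gene

alpha_genome_wide = bonferroni_threshold(0.05, 1_000_000)
alpha_variants = bonferroni_threshold(0.05, n_variants)
alpha_hierarchical = bonferroni_threshold(0.05, n_variants + groups_per_gene * n_genes)

print(f"{alpha_genome_wide:.2e}")   # 5.00e-08
print(f"{alpha_variants:.2e}")      # 1.08e-07
print(f"{alpha_hierarchical:.2e}")  # 1.02e-07
```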

5.2.4 Missing data

Missing values in genotypes were imputed by the expected (mean) allele counts of the respective variant. Although superior imputation methods exist [Graham 2009], mean imputation is still widely applied in genetic association studies [Yi, Liu, Zhi & Li 2011]. Mean imputation was also applied to impute missing age values in 81 individuals in simulated data, to make use of the full dataset (1,943 individuals).
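A minimal sketch of mean imputation for one variant is given below. Missing values are represented by `None`, which is an assumption of this example and not the encoding used in the analysis files.

```python
def impute_mean(values):
    """Replace missing entries (None) by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Allele counts of one variant with two missing genotypes.
print(impute_mean([0, 1, None, 0, 2, None]))  # [0, 1, 0.75, 0, 2, 0.75]
```

Note that imputed genotypes are expected allele counts and are therefore generally not integers.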

5.2.5 Correction of ethnicity and relatedness

In genome-wide association analysis, test statistics of genetic associations may be inflated by confounding of genetic variation with ethnicity (population stratification) or by cryptic relatedness, both of which can lead to false positive findings. To adjust for ethnicity, the inclusion of ancestry principal components (APCs) was recommended [Price et al. 2006]. Additionally, remaining test inflation, for example by cryptic relatedness, can be corrected by genomic control (including QQ plots for visualisation) [Devlin and Roeder 1999]. Therefore, two APCs were included in the analysis models and test inflation will be evaluated by calculating the λ parameter and Quantile-Quantile (QQ) plots. Genomic control will be applied if λ values are substantially larger than one.

However, methods which were developed for a frequentist single-variant analysis model of common variants will be only of limited use for a Bayesian hierarchical model of rare and common variants. First, shrinkage of weak effects towards zero will distort the overall distribution of test statistics (or p values). Second, hierarchical grouping of a varying number of variants, e.g. per gene, corresponds to tests with varying degrees of freedom, leading to a mixture of distributions. Third, inflation of test statistics also depends on statistical power, which in turn depends on the frequency of observations (minor allele frequency). This means that inflation of test statistics will be stronger for common variants than for rare variants. Hence, the applied penalisation by genomic control (including rare and common variants) will probably cause conservative results for rare variants and liberal results for common variants, compared to an analysis exclusively studying one of these variant categories. Therefore, genomic control of association results was not applied to Bayesian hierarchical models.
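For the frequentist single-variant results, the λ parameter is commonly estimated as the ratio of the observed median test statistic to the median of a chi-square distribution with one degree of freedom. The sketch below follows this common definition and assumes 1-df tests; it is not taken from the software used in this thesis, and the function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation_factor(p_values):
    """Estimate lambda from p values of 1-df tests (all p values must be > 0)."""
    std_normal = NormalDist()
    # A 1-df chi-square statistic is a squared standard normal quantile.
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = std_normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2_stats) / expected_median


# Uniformly spread p values give lambda close to one (no inflation).
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_inflation_factor(uniform_p))

# Systematically smaller p values give lambda above one (inflation).
print(genomic_inflation_factor([p * p for p in uniform_p]))
```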

5.2.6 Criteria for model evaluation

Criteria for the evaluation of the models in simulated data I were: a) unbiasedness of the estimates, b) coverage of the true parameter value by 95% confidence or credible intervals, and c) the p value distribution. Bias is a measure of deviation of the estimated parameter value from the simulated (true) value. Coverage is the proportion of 95% confidence or credible intervals covering the simulated (true) value and is expected to approximate 95%. Depending on the simulated parameter values, the distribution of (frequentist or Bayesian) p values is informative about the validity of the test. If the simulated parameter value is null, the proportion of p values less than 0.05 is expected to be about 5%, corresponding to a uniform distribution of p values (cf. type I error, 1 − specificity).¹ If the simulated parameter value is non-null, the proportion of p values less than 0.05 is a measure of the statistical power of the test (sensitivity). In simulated data II, the absolute and relative frequency (%) of variants (or genes) with significant null or non-null effects, respectively, was examined.
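The three criteria can be computed from the results of repeated simulations as sketched below. The function name and the toy numbers are illustrative assumptions; they are not results of this thesis.

```python
def evaluate_replicates(estimates, lower_bounds, upper_bounds, p_values,
                        true_value, alpha=0.05):
    """Summarise simulation replicates for one parameter."""
    n = len(estimates)
    # Bias: average estimate minus the simulated (true) value.
    bias = sum(estimates) / n - true_value
    # Coverage: share of intervals that contain the true value.
    covered = sum(lo <= true_value <= hi
                  for lo, hi in zip(lower_bounds, upper_bounds))
    # Rejection rate: type I error if true_value is null, power otherwise.
    rejected = sum(p < alpha for p in p_values)
    return {"bias": bias, "coverage": covered / n, "rejection_rate": rejected / n}


# Toy example with four replicates of a null effect.
summary = evaluate_replicates(
    estimates=[0.1, -0.2, 0.0, 0.3],
    lower_bounds=[-0.5, -0.9, -0.6, 0.05],
    upper_bounds=[0.7, 0.5, 0.6, 0.55],
    p_values=[0.6, 0.4, 0.9, 0.01],
    true_value=0.0,
)
print(summary)
```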

5.3 Software

Data management and analysis of large-scale genetic data usually require a variety of software tools. In the following, the main software for statistical analysis is described. EPACTS was applied to perform single-variant association analysis. R/BhGLM was used to fit Bayesian hierarchical generalized linear models for a normal and a binomial model.² Other programs were used for data management, annotation, and specific analyses.

5.3.1 EPACTS

EPACTS (Efficient and Parallelizable Association Container Toolbox, version 3.2.5)³ is an open source software program which integrates a variety of statistical models and databases for variant annotation for genetic association analysis. It contains a command-line interface, which is written in PERL (wrapper), to call various statistical functions implemented in R. The PERL wrapper allows the efficient analysis of large genomic data by extracting the relevant data from files in variant call format (VCF), and by processing it in parallel threads on multi-core systems. It also integrates functionality for collection and presentation of genome-wide association results. Here, EPACTS was used for single-variant analysis of exome-sequence data in VCF format using a generalized linear model with a normal error distribution (with identity link) and a binomial error distribution (with logit link). A modified binomial-logistic regression model was used as suggested by Firth (1993) (Firth’s bias-corrected logistic regression) to avoid problems of separation in sparse data, which lead to inflated estimates, standard errors, and p values.

¹ Although the distribution of posterior probabilities is not necessarily uniform in Bayesian models, this assumption is considered reasonable using weakly informative prior distributions and simulated null effects.
² These programs were based on an R (version 3.0.2) installation, since R versions < 3.0 or > 3.1 did not work with all mentioned programs.
³ http://genome.sph.umich.edu/wiki/EPACTS

5.3.2 R/BhGLM

The BhGLM (Bayesian hierarchical Generalized Linear Models, version 1.0.0, release 2013-12-15)¹ package for R was developed by Yi, Liu, Zhi & Li (2011). It implements data management and statistical analysis functions for the application of Bayesian hierarchical generalized linear models to genetics, e.g. genetic association studies. R/BhGLM is still under development at present. It is available at the author’s website² and has not yet been officially released on the Comprehensive R Archive Network (CRAN)³.

5.3.3 Other

Data management was performed with VCFTOOLS (version 0.1.12)⁴ [Danecek et al. 2011]. Annotation of variants with RefSeq data was performed with ANNOVAR (version 2014-07-14)⁵ [Wang et al. 2010]. Principal components were calculated by PLINK (version 1.07)⁶ [Purcell et al. 2007] using classical multidimensional scaling. Data management and analysis were automated using BASH and R scripts, and the TORQUE Resource Manager (version 4.2.9)⁷ for job submission on a high-performance computing system.

¹ The waldtest.bglm() function was upgraded to R/BhGLM (version 1.1.0), which implemented tests of average and joint effects. A full upgrade to BhGLM (version 1.1.0) was not performed, since it would have required major changes of analysis scripts.
² http://www.ssg.uab.edu/bhglm/
³ http://cran.r-project.org
⁴ http://vcftools.sourceforge.net
⁵ http://www.openbioinformatics.org/annovar/
⁶ http://pngu.mgh.harvard.edu/~purcell/plink/
⁷ http://www.adaptivecomputing.com/products/open-source/torque/


5.4 Hardware

All data management and analysis was performed on the high-performance computing facility ALICE¹ at the University of Leicester. ALICE features 208 standard compute nodes available for job execution. Each compute node has a pair of eight-core Intel Xeon Sandy Bridge E5-2670 CPUs (16 cores in total) running at 2.6 GHz and 64 GB of RAM. All computing time results refer to a single core, if not stated otherwise.

¹ http://www2.le.ac.uk/offices/ithelp/services/hpc/alice

6 Results

This chapter presents results from genetic association analyses of diastolic blood pressure (DBP), systolic blood pressure (SBP), and hypertension (HTN), if not stated otherwise. Frequentist single-variant analysis and Bayesian hierarchical multiple-variant analysis were performed in three types of data: a) fully simulated, simple data without genetic effects (simulated data Ia) and with genetic effects (simulated data Ib), b) partially simulated, complex data with genetic effects based on real genetic variants and simulated blood pressure provided by GAW19 (simulated data II), and c) a fully empirical, complex dataset provided by GAW19 (empirical dataset).

6.1 Descriptives

Descriptive statistics of genetic and non-genetic data for simulated data I, simulated data II (GAW19), and empirical data (GAW19) are shown.

6.1.1 Simulated data I

Simulated data I comprised 2 × 1,000 samples, each of which included 10,000 individuals with diastolic and systolic blood pressure and hypertension status, sex, age and ten genetic variants in two categories according to their minor allele frequencies. Parameters and values of the simulated data are described in the methods section in detail. A single dataset was designed to include on average 5,000 (50%) women and individuals with an average age of 50 years (SD = 10). Allele frequencies of the five rare variants ranged from 0.1% to 0.5% and those of the five common variants from 10% to 50%.


Datasets 1 to 1,000 (simulated data Ia) contained blood pressure associated with sex and age, but not with genetic variants. Datasets 1,001 to 2,000 (simulated data Ib) included blood pressure associated with sex and age, but also with two rare variants (i.e., rv1, rv2) and two common variants (i.e., cv1, cv2). Diastolic blood pressure was normally distributed with a baseline mean of 80 mmHg and a standard deviation of 10 mmHg. Systolic blood pressure was also normally distributed, but with a baseline mean of 120 mmHg and a standard deviation of 15 mmHg. Hypertension was coded if SBP was greater than 140 mmHg and DBP greater than 90 mmHg. The baseline referred to women aged 50 years without a minor allele in the 10 simulated genotypes.

Although on average the descriptives correspond to the simulated parameter values, the descriptives and values of a single simulated dataset can diverge randomly, but considerably, from their expected values. As an example, descriptives of a single, fully simulated dataset are given in the appendix (Table A.1).
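Two elements of this design can be sketched in code: drawing allele counts for a variant with a given minor allele frequency, and coding hypertension from blood pressure. The sketch assumes Hardy-Weinberg equilibrium (two independent alleles per individual), which is an assumption of this example; the actual simulation procedure is described in the methods section, and the function names are hypothetical.

```python
import random


def simulate_allele_counts(maf, n_individuals, rng):
    """Draw minor-allele counts (0, 1, 2), assuming two independent alleles."""
    return [(rng.random() < maf) + (rng.random() < maf)
            for _ in range(n_individuals)]


def is_hypertensive(sbp, dbp):
    """Hypertension coding used in the text: SBP > 140 mmHg and DBP > 90 mmHg."""
    return sbp > 140 and dbp > 90


rng = random.Random(2015)  # fixed seed for reproducibility
common_variant = simulate_allele_counts(0.30, 10_000, rng)
rare_variant = simulate_allele_counts(0.001, 10_000, rng)

# Observed allele frequencies scatter randomly around the simulated values.
print(sum(common_variant) / (2 * len(common_variant)))
print(sum(rare_variant) / (2 * len(rare_variant)))
print(is_hypertensive(sbp=150, dbp=95), is_hypertensive(sbp=150, dbp=85))
```

For the rare variant only about 20 minor alleles are expected among 10,000 individuals, which illustrates why the descriptives of a single dataset can deviate noticeably from their expected values.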

6.1.2 Simulated data II (GAW 19)

Simulated data II was provided by the Genetic Analysis Workshop 19 and comprised 1,943 individuals with simulated diastolic and systolic blood pressure and hypertension status, and observed sex, age, medication and genetic variants at time point 1 of overall 200 available time points (Table 6.1). Simulated blood pressure values were in a medically plausible range with a mean of 74 mmHg (SD = 9 mmHg) for DBP and a mean of 123 mmHg (SD = 15 mmHg) for SBP. Hypertension was diagnosed in 28 (1.44%) individuals with SBP greater than 140 mmHg and DBP greater than 90 mmHg. Age was not recorded in 81 individuals without hypertension. To utilise the full dataset, missing age values were mean imputed. The non-genetic and genetic empirical data are described in the following section.

Table 6.1: Descriptives of simulated data II (1,943 individuals, GAW19)

                                  All              No hypertension   Hypertension
                                  1943 (100%)      1915 (98.56%)     28 (1.44%)
Values
  Sex, female, N (%)              1241 (63.87%)    1232 (64.33%)     9 (32.14%)
  Age, years, mean (SD)           49.49 (14.24)    49.36 (14.22)     57.42 (14.05)
  Medication, treated, N (%)      182 (9.37%)      171 (8.93%)       11 (39.29%)
  DBP, mmHg, mean (SD)            73.93 (9.24)     73.63 (8.96)      94.46 (3.25)
  SBP, mmHg, mean (SD)            123.36 (15.04)   122.99 (14.81)    148.47 (7.78)
Missings
  Sex, missing, N (%)             0 (0.00%)        0 (0.00%)         0 (0.00%)
  Age, missing, N (%)             81 (4.17%)       81 (4.23%)        0 (0.00%)
  Medication, missing, N (%)      0 (0.00%)        0 (0.00%)         0 (0.00%)
  DBP, missing, N (%)             0 (0.00%)        0 (0.00%)         0 (0.00%)
  SBP, missing, N (%)             0 (0.00%)        0 (0.00%)         0 (0.00%)
  Hypertension, missing, N (%)    0 (0.00%)        0 (0.00%)         0 (0.00%)
DBP = Diastolic Blood Pressure, SBP = Systolic Blood Pressure, GAW19 = Genetic Analysis Workshop 19

6.1.3 Empirical data (GAW 19)

Empirical data was provided by the Genetic Analysis Workshop 19 and contained 1,943 individuals with diastolic and systolic blood pressure and hypertension status, sex, age, medication, and genetic variants. The 1,943 participants included 1,241 (64%) women. The sample had an average age of 49 years (SD = 14 years). 147 (36%) of 407 individuals with valid medication status were treated with blood-pressure lowering medication. Mean blood pressures were 73 mmHg (SD = 11 mmHg, diastolic) and 125 mmHg (SD = 21 mmHg, systolic). Diastolic and systolic blood pressure were significantly correlated (Pearson r = 0.55, CI95% = [0.52, 0.58], p = 2.56 × 10⁻¹⁴⁶). 92 (5%) individuals had missing blood pressure (i.e., diastolic, systolic, and hypertension), of whom 81 (4%) also had no age records. Blood pressure medication status was missing in 1,536 (79%) individuals overall (Table 6.2).

The exome-sequenced genetic data contained overall 1,765,688 variants, of which 461,868 quality-controlled, polymorphic, bi-allelic variants in 11,329 genes were used in the analysis (Table 6.3). The number of variants per gene varied between 1 and 1,274 with a mean of 41 (SD = 49) variants. 412,949 (89.41%) variants were rare and 48,919 (10.59%) were common. 241,829 (52.36%) variants of the analysis set were annotated in dbSNP138 (Table A.2 in the appendix). Principal components calculated by classical multidimensional scaling showed no outliers in this quality-controlled sample (Figure A.8), confirming a homogeneous population with Mexican-American (Hispanic) ancestry.
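The reported confidence interval of the correlation is consistent with the usual Fisher z-transformation. The sketch below assumes that this method was used and that the correlation is based on the 1,943 − 92 = 1,851 individuals with blood pressure values; neither is stated explicitly in the text.

```python
import math
from statistics import NormalDist


def pearson_confidence_interval(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z-transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


lower, upper = pearson_confidence_interval(r=0.55, n=1851)
print(round(lower, 2), round(upper, 2))  # 0.52 0.58
```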


Table 6.2: Descriptives of the empirical data (1,943 individuals, GAW19)

                                  All              No hypertension   Hypertension
                                  1943 (100%)      1772 (95.73%)     79 (4.26%)
Values
  Sex, female, N (%)              1241 (63.87%)    1159 (65.41%)     31 (39.24%)
  Age, years, mean (SD)           49.49 (14.24)    49.36 (14.39)     50.93 (10.17)
  Medication, treated, N (%)      147 (36.12%)     140 (35.90%)      7 (41.18%)
  DBP, mmHg, mean (SD)            73.47 (11.02)    72.25 (9.45)      100.73 (8.17)
  SBP, mmHg, mean (SD)            125.04 (21.10)   123.23 (19.35)    165.73 (17.54)
Missings
  Sex, missing, N (%)             0 (0.00%)        0 (0.00%)         0 (0.00%)
  Age, missing, N (%)             81 (4.17%)       0 (0.00%)         0 (0.00%)
  Medication, missing, N (%)      1536 (79.05%)    1382 (77.99%)     62 (78.48%)
  DBP, missing, N (%)             92 (4.73%)       0 (0.00%)         0 (0.00%)
  SBP, missing, N (%)             92 (4.73%)       0 (0.00%)         0 (0.00%)
  Hypertension, missing, N (%)    92 (4.74%)       0 (0.00%)         0 (0.00%)
DBP = Diastolic Blood Pressure, SBP = Systolic Blood Pressure, GAW19 = Genetic Analysis Workshop 19

Table 6.3: Genetic variants in the GAW19 data (1,943 individuals)

Nr   Criterion                        Included     Excluded
1    All                              1,765,688    0
2    (1) AND quality-controlled       1,765,445    243
3    (2) AND polymorphic              505,420      1,260,025
4    (3) AND bi-allelic               461,868      43,552
     Primary Analysis Set             461,868      1,303,820

6.2 Simulated data I

Genetic association analysis in fully simulated, simple data (simulated data I) was performed to evaluate the validity of the applied methods. Genetic effects on systolic blood pressure¹ in simulated data I were analysed using a standard frequentist linear model and reduced and full Bayesian hierarchical generalized linear models. In general, sex and age (mean-centred at 50 years) were included as covariates. All five rare variants with a minor allele frequency (MAF) of less than 1%, and all five common variants with a MAF greater than or equal to 1% were included. Results without genetic associations (i.e., genetic null effects) were simulated using 1,000 datasets from simulated data Ia. Results with genetic associations were simulated using 1,000 datasets from simulated data Ib.

¹ For the evaluation of analysis models and software, systolic blood pressure as a single quantitative outcome was considered sufficient. Therefore, diastolic blood pressure and hypertension were omitted.


6.2.1 Frequentist single-variant analysis

A standard frequentist generalized linear regression model with a normal error distri- bution and an identity link function was applied to model genetic null and non-null effects on simulated systolic blood pressure.

Data without genetic effects (simulated data Ia)

Association results from the analysis of 1,000 samples with 10,000 individuals are summarised in Table 6.4. Model 1 (“Variants”) refers to a linear regression model, which includes individual genetic variants. Model 2 (“Unweighted genetic scores”) concerns a linear regression model, which includes a simple sum of rare genetic variants and a simple sum of common genetic variants (all weights = 1). Model 3 (“Weighted genetic scores”) represents a linear regression model, which includes a weighted sum of rare genetic variants and a weighted sum of common variants. Weights were set to the regression coefficients from model 1. The results of models 1 and 2 showed no substantial bias and good coverage of the true parameter values by their 95% confidence intervals. The proportion of p values < 0.05 of genetic null effects varied closely around the expected value of 5%. Model 3 showed no bias for the intercept, sex and age, but coverage for the intercept and weighted sum scores was low. Correspondingly, the proportion of p values < 0.05 was excessively high. Coefficients of the weighted sum scores were estimated to be one, which was a result of using regression coefficients from model 1 as weights.
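The construction of the two score types can be sketched as follows. The weights in the example are arbitrary illustrative numbers, not coefficients estimated in this thesis, and the function name is hypothetical.

```python
def genetic_score(genotype_rows, weights=None):
    """Sum of allele counts per individual, optionally weighted per variant.

    genotype_rows: one list of allele counts (0, 1, 2) per individual.
    weights: one weight per variant; None gives the unweighted score.
    """
    n_variants = len(genotype_rows[0])
    if weights is None:
        weights = [1.0] * n_variants  # unweighted score: all weights = 1
    return [sum(w * g for w, g in zip(weights, row)) for row in genotype_rows]


genotypes = [[0, 1, 2], [1, 0, 0]]  # two individuals, three variants
print(genetic_score(genotypes))                     # [3.0, 1.0]
print(genetic_score(genotypes, [0.5, -1.0, 2.0]))   # [3.0, 0.5]
```

If the weights are regression coefficients estimated in the same data, the score already reproduces the fitted genetic contribution, so its coefficient is one by construction and its test does not account for the estimation of the weights.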

Data with genetic effects (simulated data Ib)

Association results from an analysis of genetic null and non-null effects in 1,000 datasets with 10,000 individuals (simulated data Ib) were averaged and are presented in Table A.3 in the appendix. Models 1, 2, and 3 were described above. Model 1 (“Variants”) showed no substantial bias of null and non-null effects and good coverage of the true values by 95% confidence intervals (around 95%). Model 2 (“Unweighted genetic scores”) and model 3 (“Weighted genetic scores”) had similar properties for intercept, sex and age, except for the intercept in model 2, which seemed to be underestimated. The proportion of p values < 0.05 of non-null genetic effects showed very high statistical power approximating 100%, except for the rarest variant with a substantially lower power of 82.7%.

Table 6.4: Table of average coefficients of a frequentist normal linear model of simulated systolic blood pressure (SBP). SBP was predicted from individual variants (model 1), unweighted genetic scores (model 2), or weighted genetic scores (model 3). Analysis was performed in 1,000 datasets with 10,000 individuals without genetic effects (simulated data Ia).

Nr   Predictor    Estimate   CI95%                Coverage [%]   P < 0.05 [%]
Model 1: Variants
1    Intercept    120.00     [119.23, 120.78]     94.1           100.0
2    Sex          10.01      [9.42, 10.60]        93.4           100.0
3    Age - 50     1.00       [0.97, 1.03]         93.7           100.0
4    RV1          −0.12      [−6.82, 6.59]        95.2           4.8
5    RV2          −0.02      [−4.72, 4.68]        94.3           5.7
6    RV3          −0.06      [−3.89, 3.77]        96.0           4.0
7    RV4          0.00       [−3.31, 3.32]        95.8           4.2
8    RV5          0.06       [−2.90, 3.01]        94.9           5.1
9    CV1          −0.01      [−0.70, 0.69]        94.5           5.5
10   CV2          −0.00      [−0.52, 0.52]        95.1           4.9
11   CV3          −0.00      [−0.46, 0.45]        94.4           5.6
12   CV4          0.01       [−0.42, 0.43]        95.1           4.9
13   CV5          −0.01      [−0.42, 0.41]        96.3           3.7
Model 2: Unweighted genetic scores
14   Intercept    120.00     [119.24, 120.77]     93.5           100.0
15   Sex          10.01      [9.42, 10.60]        93.3           100.0
16   Age - 50     1.00       [0.97, 1.03]         93.8           100.0
17   RV           0.00       [−1.70, 1.71]        94.8           5.2
18   CV           −0.00      [−0.22, 0.21]        94.9           5.1
Model 3: Weighted genetic scores
19   Intercept    120.00     [119.50, 120.51]     80.8           100.0
20   Sex          10.01      [9.42, 10.60]        93.4           100.0
21   Age - 50     1.00       [0.97, 1.03]         93.7           100.0
22   RV           1.00       [−0.04, 2.04]        43.8           56.2
23   CV           1.00       [−0.03, 2.03]        41.7           58.3
RV = rare variant(s), CV = common variant(s), Estimate = estimate of regression coefficient, CI95% = 95% confidence interval

6.2.2 Bayesian hierarchical multiple-variant analysis

Bayesian hierarchical models with a normal error distribution and an identity link function were applied to estimate genetic effects on systolic blood pressure.


Reduced Bayesian hierarchical model

A simulation study with 1,000 datasets with 10,000 individuals was conducted to evaluate the performance of a reduced Bayesian hierarchical model with and without genetic effects (simulated data Ia and Ib). The reduced Bayesian hierarchical model does not include coefficients for effects of grouped variants and used a weakly informative prior Cauchy distribution (location=0, scale=1) for variant effects.

Data without genetic effects (simulated data Ia)

Association results from the analysis of 1,000 samples with 10,000 individuals (simulated data Ia) are summarised in Table 6.5. Results of model 1 (“Individual variant effect”) showed that estimates of genetic effects had no substantial bias, except for rare variant rv1 (estimate = −0.12, CrI95% = [−6.82, 6.58], averaged over 1,000 samples). Coverage of the true parameter value varied slightly around the expected value of 95% for all genetic variants. This is also true for the group effects based on the test of “Average variant effects” (model 2) as well as the test of “Joint variant effects” (model 3) of rare and common variant effects, respectively.

Data with genetic effects (simulated data Ib)

Association results from an analysis of 1,000 datasets with 10,000 individuals (simulated data Ib) are summarised in Table A.4 in the appendix. Model 1 (“Individual variant effect”) showed no substantial bias, and coverage varied slightly around the expected value of 95%. For model 2 (“Average variant effects”), the true value for unweighted genetic scores was not available; however, nominally significant p values were observed in 91.2% of rare variant group effects and 100.0% of common variant group effects, indicating high statistical power. For model 3 (“Joint variant effects”), nominally significant effects were observed in 100.0% of simulated datasets for rare and common group effects, respectively.


Table 6.5: Table of average coefficients of a reduced Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia). Associations between systolic blood pressure and individual variant effects (model 1), average variant effects (model 2), or joint variant effects (model 3) of grouped variants were tested.

Nr   Predictor    Estimate   CrI95%               Coverage [%]   P < 0.05 [%]
Model 1: Individual variant effects
1    Intercept    120.00     [119.23, 120.78]     94.1           100.0
2    sex          10.01      [9.42, 10.60]        93.4           100.0
3    age          1.00       [0.97, 1.03]         93.7           100.0
4    rv1          −0.12      [−6.82, 6.58]        95.2           4.8
5    rv2          −0.02      [−4.72, 4.68]        94.3           5.7
6    rv3          −0.06      [−3.88, 3.77]        96.0           4.0
7    rv4          0.00       [−3.31, 3.32]        95.8           4.2
8    rv5          0.06       [−2.90, 3.01]        94.9           5.1
9    cv1          −0.01      [−0.70, 0.69]        94.5           5.5
10   cv2          −0.00      [−0.52, 0.52]        95.1           4.9
11   cv3          −0.00      [−0.46, 0.45]        94.4           5.6
12   cv4          0.01       [−0.42, 0.43]        95.1           4.9
13   cv5          −0.01      [−0.42, 0.41]        96.3           3.7
Model 2: Average variant effects
14   RVs          0.00       [−1.70, 1.71]        94.8           5.2
15   CVs          −0.00      [−0.21, 0.21]        95.1           4.9
Model 3: Joint variant effects (a)
16   RVs          NA         NA                   NA             4.4
17   CVs          NA         NA                   NA             4.8
(a) No estimated effects and credible intervals are available. RV = rare variant, CV = common variant, Estimate = estimate of regression coefficient, CrI95% = 95% credible interval. A Cauchy (location=0, scale=1) prior distribution was used for individual variant effects.

Full Bayesian hierarchical model

A simulation study with 1,000 datasets with 10,000 individuals was conducted to evaluate the performance of a full Bayesian hierarchical model with and without genetic effects (simulated data Ia and Ib). The full Bayesian hierarchical model included a coefficient for effects of grouped variants and used a weakly informative prior Cauchy distribution for effects of groups of variants (location=0, scale=1) and Cauchy distributions with various selected parameters for individual variants.

Data without genetic effects (simulated data Ia)

Association results were averaged over 1,000 datasets with 10,000 individuals and are displayed in Table 6.6 for prior Cauchy distributions with selected parameters for individual variants. Assuming a true null effect, the group coefficient showed considerable positive bias. Coverage was substantially decreased for group effects of rare variants (down to 89.8%) and was largely lowered for group effects of common variants (down to 55%), with corresponding results for the proportion of posterior p values < 0.05. The distribution of estimated effects and posterior p values for effects of groups of rare variants or groups of common variants for a wider range of prior parameters are displayed in Figures 6.1 and 6.2 (see also Table A.5 in the appendix). The plots confirm that posterior probabilities of group effects of a full Bayesian hierarchical model substantially exceed the expected rate of false positive results, while the strength of this finding depends on prior parameters, especially shrinkage. The shrinkage parameters do not seem to appropriately control the proportion of false positive results since they have different consequences for the group of rare versus common variants. While the p value distribution of group effects of rare variants converges to a uniform distribution for large scale values (i.e., weak shrinkage), for group effects of common variants it converges to a uniform distribution for small scale values (i.e., strong shrinkage). However, ignoring the overall distribution of posterior p values and focusing on the proportion of p values < 0.05, a scale parameter of 0.1 seems to control the proportion of false positive results best, although at the cost of power loss for rare variant group effects and an increased type I error rate of common variant group effects.

Table 6.6: Table of average coefficients of a full Bayesian hierarchical model predicting systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

Nr   Predictor    Prior            Estimate   CrI95%               Coverage [%]   P < 0.05 [%]
21   Intercept    L=0.0, S=1.0     119.99     [119.50, 120.49]     85.2           100.0
22   Sex          L=0.0, S=1.0     9.99       [9.40, 10.58]        93.3           100.0
23   Age - 50     L=0.0, S=1.0     1.00       [0.97, 1.03]         93.5           100.0
24   RV           L=0.0, S=1.0     0.47       [−1.00, 1.93]        89.8           10.2
25   CV           L=0.0, S=1.0     0.73       [−0.05, 1.51]        55.0           45.0
46   Intercept    L=1.0, S=1.0     119.84     [119.28, 120.40]     84.5           100.0
47   Sex          L=1.0, S=1.0     9.99       [9.40, 10.58]        93.1           100.0
48   Age - 50     L=1.0, S=1.0     1.00       [0.97, 1.03]         93.6           100.0
49   RV           L=1.0, S=1.0     0.20       [−0.92, 1.32]        92.0           8.0
50   CV           L=1.0, S=1.0     0.46       [−0.03, 0.94]        56.7           43.3
L = location, S = scale, RV = rare variant, CV = common variant. A Cauchy (location=0, scale=1) prior distribution was used for grouped variant effects.

Figure 6.1: Distribution of estimates of group effects from full Bayesian hierarchical models predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia). [Figure not reproduced: density plots for the rare and common variant groups; panels for Location = 0 and 1 and Scale = 1.0, 0.8, 0.5, 0.3, 0.1; axes: Estimate and density.]

Figure 6.2: Distribution of posterior probabilities of group effects from full Bayesian hierarchical models predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia). [Figure not reproduced: density plots for the rare and common variant groups; panels for Location = 0 and 1 and Scale = 1.0, 0.8, 0.5, 0.3, 0.1; axes: P and density.]

Data with genetic effects (simulated data Ib)

Association results were averaged over 1,000 datasets with 10,000 individuals and are displayed in Table 6.7. Bias and coverage were not evaluated because the true parameter values were not available. The proportion of posterior p values < 0.05 indicates high statistical power for the identification of group effects of rare and common variants, respectively, approximating 100%.

Table 6.7: Table of average coefficients of a full Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals with genetic effects (simulated data Ib).

Nr   Predictor    Prior            Estimate   CrI95%               Coverage [%]   P < 0.05 [%]
1    Intercept    L=0.0, S=1.0     120.02     [119.53, 120.50]     77.2           100
2    Sex          L=0.0, S=1.0     9.99       [9.40, 10.58]        93.3           100
3    Age - 50     L=0.0, S=1.0     1.00       [0.97, 1.03]         93.7           100
4    RV (a)       L=0.0, S=1.0     2.67       [2.06, 3.27]         NA             100
5    CV (a)       L=0.0, S=1.0     1.00       [0.98, 1.03]         NA             100
6    Intercept    L=1.0, S=1.0     119.88     [119.40, 120.37]     75.6           100
7    Sex          L=1.0, S=1.0     9.99       [9.40, 10.58]        93.4           100
8    Age - 50     L=1.0, S=1.0     1.00       [0.97, 1.03]         93.6           100
9    RV (a)       L=1.0, S=1.0     2.55       [1.97, 3.13]         NA             100
10   CV (a)       L=1.0, S=1.0     1.00       [0.98, 1.03]         NA             100
L = location, S = scale, RV = rare variant, CV = common variant. (a) Coverage not available, because the true value of the parameter is not known. A Cauchy (location=0, scale=1) prior distribution was used for grouped variant effects.

6.2.3 Summary

Individual variant effects as well as average and joint effects of groups of variants were analysed with frequentist and Bayesian hierarchical models in simulated data comprising 1,000 samples with 10,000 individuals. Frequentist and Bayesian linear models gave very similar results for individual variants, although estimates for rare variants were more sensitive to the prior distribution and were shrunk more strongly towards the prior effect.


Estimation and testing of group effects in reduced Bayesian hierarchical models showed valid results and good performance regarding bias, coverage, true negative (specificity) and true positive (sensitivity) results. While the average and the joint test performed similarly for grouped common variant effects, the joint test showed superior power to the average test for grouped rare variant effects. However, the estimation and testing of group effects in full Bayesian hierarchical models was strongly influenced by the prior distribution of individual variants, especially its scale (shrinkage) parameter, and showed poor performance in terms of the type I error (specificity), in particular for weakly informative priors. Therefore, full Bayesian hierarchical models can give misleading results and have to be applied carefully.
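The difference in power between an average test and a joint test can be illustrated with a small sketch. It assumes that the average test evaluates the mean of the conditional variant coefficients and that the joint test is a Wald-type quadratic form of the same coefficients; both definitions are simplifying assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def average_effect_z(beta, cov):
    """z statistic for the mean of k coefficients with covariance matrix cov."""
    k = len(beta)
    mean = sum(beta) / k
    # Var(mean) = (1 / k^2) * sum of all entries of the covariance matrix.
    var = sum(sum(row) for row in cov) / k**2
    return mean / math.sqrt(var)


def joint_wald_statistic(beta, cov):
    """Quadratic form beta' cov^{-1} beta (chi-square with k df under the null)."""
    k = len(beta)
    # Solve cov * x = beta by Gauss-Jordan elimination with partial pivoting.
    a = [list(map(float, cov[i])) + [float(beta[i])] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    x = [a[i][k] / a[i][i] for i in range(k)]
    return sum(b * xi for b, xi in zip(beta, x))


# Two variants with opposite effects and unit, uncorrelated variances.
toy_beta = [2.0, -2.0]
toy_cov = [[1.0, 0.0], [0.0, 1.0]]
toy_average_z = average_effect_z(toy_beta, toy_cov)
toy_joint = joint_wald_statistic(toy_beta, toy_cov)
```

In this toy case the opposite effects cancel in the average (z = 0), whereas the joint statistic equals 8 and exceeds 5.99, the 5% critical value of a chi-square distribution with 2 degrees of freedom.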

6.3 Simulated data II (GAW 19)

Genetic associations between simulated phenotypes of diastolic and systolic blood pressure and hypertension at time point 1 and empirical genetic data from the GAW19 dataset are presented. Results from frequentist standard single-variant analysis are shown first, followed by results from Bayesian hierarchical multiple-variant analysis. The findings are reported separately for variants in 128 candidate genes with non-null effects and variants in 1,000 genes with null effects.

6.3.1 Frequentist single-variant analysis

Results are reported from single-variant analyses of simulated associations of diastolic blood pressure (DBP), systolic blood pressure (SBP) and hypertension (HTN) with variants with genetic effects in 128 candidate genes and with variants without genetic effects in 1,000 genes in the GAW19 data.

Candidate gene association analysis (CGAS)

Of 1,425 variants in 110 genes with simulated effects on DBP, 6 (0.42%) variants in 4 genes were genome-wide significant (p < 1 × 10−7). Of 1,154 variants in 93 genes with simulated effects on SBP, 5 (0.43%) variants in 3 genes were genome-wide significant. None of the 917 variants in 76 genes with simulated effects on HTN was genome-wide significant (Table A.7 in the appendix). A comparison of simulated and estimated single-variant effects on DBP and SBP showed low correlation (Figure 6.3). After removing three outlying singleton variants with extreme effects of 169.80 on DBP, the correlation between simulated and estimated DBP effects was Pearson r = 0.15 (CI95% = [0.10, 0.20], p = 1.04 × 10−8). The correlation for SBP was similar (Pearson r = 0.15, CI95% = [0.09, 0.20], p = 6.69 × 10−7).
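A confidence interval for a Pearson correlation is commonly obtained with the Fisher z transformation. The sketch below assumes that this approach applies here and that n = 1,422 variants remained for DBP after removing the three outliers; neither assumption is stated in the text, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for a Pearson correlation (Fisher z)."""
    z = math.atanh(r)             # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)   # standard error on the z scale
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


# DBP: r = 0.15, assuming 1,425 - 3 = 1,422 variants after outlier removal.
lower, upper = pearson_ci(0.15, 1422)
```

Under these assumptions the interval rounds to [0.10, 0.20], which agrees with the interval reported for DBP.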

Figure 6.3: Comparison of simulated and estimated individual variant effects on diastolic blood pressure (DBP) and systolic blood pressure (SBP) using frequentist normal linear models in simulated data II (GAW19) (black=significant, grey=not significant). Three outliers were removed from the DBP plot. Panels: DBP and SBP; x-axis: simulated effect (−10 to 10), y-axis: estimated effect (−20 to 20).

Null gene association analysis (NGAS)

Of 73,888 genetic variants without effects in 1,000 genes, 3,854 (5.22%) had nominally significant (p < 0.05) associations with DBP and no variant had genome-wide significant (p < 1 × 10−7) effects. Of 73,643 genetic variants in 1,000 genes, 3,388 (4.60%) had nominally significant effects on SBP and no variant was genome-wide significant. Of 72,219 null variants in 1,000 genes, 2,250 (3.12%) were nominally significantly associated with HTN and no variant was genome-wide significant.
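The observed proportions of nominally significant null variants can be compared with the nominal level of 5%. The sketch below computes each proportion with a Wilson score interval from the counts reported above. It treats the tests as independent, which is an assumption that is unlikely to hold exactly for variants in linkage disequilibrium, so the intervals are probably too narrow; the function name is illustrative.

```python
import math


def proportion_with_ci(successes, total, z=1.96):
    """Observed proportion and Wilson score interval (independent trials assumed)."""
    p_hat = successes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    return p_hat, centre - half, centre + half


# Counts of nominally significant null variants as reported in the text.
null_counts = {"DBP": (3854, 73888), "SBP": (3388, 73643), "HTN": (2250, 72219)}
null_rates = {trait: proportion_with_ci(k, n) for trait, (k, n) in null_counts.items()}

for trait, (p_hat, lo, hi) in null_rates.items():
    print(f"{trait}: {100 * p_hat:.2f}% [{100 * lo:.2f}%, {100 * hi:.2f}%]")
```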


6.3.2 Bayesian hierarchical multiple-variant analysis

Bayesian hierarchical generalized linear models with weakly informative Cauchy distributions for group effects (location=0, scale=1) and variant effects (reduced model: location=0, scale=1; full model: location=1, scale=1) were applied to simulated DBP, SBP, and HTN in 128 candidate genes and 1,000 null genes.
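To give an impression of why a Cauchy (0, 1) prior is considered weakly informative, the sketch below compares its density with that of a standard normal distribution at increasing effect sizes. The comparison with a normal prior is an illustration chosen here and is not part of the models applied in this thesis; the function names are hypothetical.

```python
import math


def cauchy_pdf(x, location=0.0, scale=1.0):
    """Density of a Cauchy(location, scale) distribution at x."""
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Ratio of prior densities at increasingly large effect sizes.
for effect in (0.0, 2.0, 5.0, 10.0):
    ratio = cauchy_pdf(effect) / normal_pdf(effect)
    print(f"effect={effect:5.1f}  Cauchy/normal density ratio={ratio:.3g}")
```

The heavy tails of the Cauchy distribution leave considerably more prior mass on large effects than a normal prior with the same scale, so strong signals in the data are shrunk less.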

Candidate gene association analysis (CGAS)

Variants. Of 1,258 variants in 98 genes with causal effects on DBP, one variant in the MAP4 gene (chr3:47956424, p = 1.73 × 10−15) had a significant effect. Of 1,004 variants in 80 genes with causal effects on SBP, the same variant in the MAP4 gene (chr3:47956424, p = 2.80 × 10−12) had a significant effect. None of the 826 variants in 67 genes with causal effects on HTN was significantly associated (p ≥ 1.42 × 10−4, Table A.7 in the appendix). A comparison of simulated and estimated variant effects on DBP and SBP showed low correlation (Figure 6.4). The correlation between simulated and estimated DBP effects was Pearson r = 0.074 (CI95% = [0.019, 0.129], p = 0.00840) and was similar for SBP (Pearson r = 0.0664, CI95% = [0.00456, 0.128], p = 0.0354).

Figure 6.4: Comparison of simulated and estimated (conditional) individual variant effects on diastolic blood pressure (DBP) and systolic blood pressure (SBP) using reduced Bayesian hierarchical generalized linear models (normal) with weakly informative prior Cauchy (0, 1) distributions in simulated data II (GAW19) (black=significant, grey=not significant). Panels: DBP and SBP; x-axis: simulated effect (−10 to 10), y-axis: estimated effect (−20 to 20).


Groups. Group effects of rare variants and of common variants on DBP were calculated in 111 of the 128 candidate genes. No results were obtained for 17 genes, which had fewer than 2 variants in any group of variants.

A reduced Bayesian hierarchical model testing the average effect of groups of variants on DBP yielded no significant rare variant group effects (p ≥ 0.72), while 2 (1.80%) genes had significant common variant effects (MAP4: p = 2.76 × 10−50; TNN: p = 7.75 × 10−8). For SBP, no rare variant group effects were found (p ≥ 0.75), but one gene (0.90%) had a common variant effect (MAP4: p = 4.15 × 10−62). For HTN, no rare variant group effects (p ≥ 0.16) and one gene (0.90%) with a common variant effect (MAP4: p = 1.27 × 10−17) were identified.

A reduced Bayesian hierarchical model testing the joint effect of groups of variants on DBP found rare variant group effects in none of the genes (p ≥ 5.16 × 10−5) and a common variant group effect in one (0.90%) gene (MAP4: p = 2.86 × 10−17). For SBP, no rare (p ≥ 1.54 × 10−4) or common variant group effects (p ≥ 9.50 × 10−6) were detected. For HTN, no significant effects of rare variants (p ≥ 0.38) or common variants (p ≥ 1.49 × 10−2) were identified.

A full Bayesian hierarchical model testing the hierarchical group coefficients on DBP showed that 49 of 111 (44.14%) genes had significant rare variant group effects (p = [6.73 × 10−50, 8.47 × 10−8]), while 7 (6.31%) genes had significant common variant group effects (p = [1.80 × 10−65, 1.81 × 10−8]), giving a total of 49 genes. For SBP, 25 of 111 (22.52%) genes had significant rare variant group effects (p = [6.27 × 10−45, 9.58 × 10−11]), while 6 (5.41%) genes had significant common variant group effects (p = [2.58 × 10−91, 3.38 × 10−10]), giving a total of 30 genes. For HTN, 32 of 111 (28.83%) genes had significant rare variant group effects (p = [2.76 × 10−18, 5.65 × 10−8]), while 6 (5.41%) genes had significant common variant group effects (p = [7.59 × 10−21, 5.09 × 10−8]), giving a total of 33 genes.

Null gene association analysis (NGAS)

A total of 70,701 unique variants in 1,000 genes with simulated null effects on DBP and SBP, and 70,542 variants in 998 genes with simulated null effects on HTN, were studied.


Variants. For DBP, 423 (0.598%) null variants in 317 (31.7%) genes had nominally significant effects (p = [2.89 × 10−5, 4.97 × 10−2]). For SBP, 545 (0.771%) null variants in 377 (37.7%) genes had nominally significant effects (p = [5.93 × 10−7, 4.86 × 10−2]). For HTN, 480 (0.680%) null variants in 359 (35.97%) genes were nominally significantly associated (p = [5.87 × 10−5, 5.00 × 10−2]).

Groups. A reduced Bayesian hierarchical model testing the average effect on DBP found that groups of rare variants had no nominally significant average effect in any gene (p ≥ 0.42), but groups of common variants had significant average effects in 29 (2.9%) genes (p = [5.11 × 10−3, 4.70 × 10−2]), giving a total of 29 genes. Regarding SBP, groups of rare variants had a nominally significant average effect in none of the genes (p ≥ 0.46), while 22 (2.20%) genes had common variant group effects (p = [1.08 × 10−3, 4.34 × 10−2]). Concerning HTN, no average effect of groups of rare variants was significant (p ≥ 5.57 × 10−2), but 35 (3.51%) genes had significant common variant group effects (p = [9.38 × 10−4, 4.89 × 10−2]).

A reduced Bayesian hierarchical model testing the joint effect on DBP showed that groups of rare variants had nominally significant joint effects in 200 (20%) genes (p = [1.32 × 10−5, 4.87 × 10−2]), while groups of common variants were significant in 8 (0.80%) genes (p = [5.86 × 10−4, 3.17 × 10−2]), giving a total of 204 significant genes. Regarding SBP, rare variant group effects were found in 316 (31.6%) genes (p = [1.12 × 10−6, 4.87 × 10−2]), while common variant effects were detected in 9 (0.90%) genes (p = [5.89 × 10−3, 2.19 × 10−2]), giving a total of 320 significant genes. Concerning HTN, no rare variant group effect was significant (p ≥ 0.14), while 13 (1.30%) genes showed significant common variant effects (p = [6.64 × 10−3, 4.89 × 10−2]).

A full Bayesian hierarchical model testing the hierarchical group coefficients on DBP found significant rare variant group effects in 512 (51.2%) genes (p = [5.73 × 10−147, 1.27 × 10−2]) and significant common variant effects in 559 (55.9%) genes (p = [3.55 × 10−71, 4.96 × 10−2]) of 1,000 genes with simulated null effects on DBP, giving a total of 722 significant genes. Regarding SBP, significant rare variant group effects were detected in 291 (29.1%) genes (p = [9.94 × 10−151, 2.66 × 10−2]) and significant common variant group effects in 477 (47.7%) genes (p = [9.71 × 10−52, 4.96 × 10−2]), giving a total of 571 significant genes. Concerning HTN, significant rare variant group effects were found in 969 (97.09%) of 998 null genes (p < 3.20 × 10−2) and significant common variant group effects in 692 (69.34%) genes (p < 4.99 × 10−2), giving a total of 977 significant genes.
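The totals reported above are smaller than the sum of the rare and common variant counts because some genes are significant in both groups. Assuming that each total is the number of genes significant in at least one of the two groups, the overlap follows from inclusion-exclusion. The helper below is an illustrative sketch with a hypothetical function name, applied to counts taken from the text.

```python
def overlap_from_totals(n_rare, n_common, n_total):
    """Number of genes significant in both groups, by inclusion-exclusion."""
    overlap = n_rare + n_common - n_total
    if overlap < 0 or overlap > min(n_rare, n_common):
        raise ValueError("counts are not consistent with a union of two sets")
    return overlap


# Joint test in null genes (reduced model), counts as reported in the text.
dbp_overlap = overlap_from_totals(200, 8, 204)
sbp_overlap = overlap_from_totals(316, 9, 320)

# Full model in null genes, DBP.
dbp_full_overlap = overlap_from_totals(512, 559, 722)
```

For the joint test this gives 4 genes (DBP) and 5 genes (SBP) with both rare and common variant group effects, and 349 such genes for the full model on DBP.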

6.3.3 Summary

Frequentist single-variant analysis provided little statistical power to detect simulated genetic effects. Although some of the identified variants had a medium allele frequency (0.01 ≤ MAF < 0.05), none of the rare variants was detected. The proportion of false positive results varied closely around the global type I error of 5%, which did not indicate test inflation, e.g. through confounding by ethnicity. Results for hypertension as a binary outcome tended to be slightly conservative.

Bayesian hierarchical multiple-variant analysis seemed to have less statistical power than single-variant analysis to identify individual variants. Although the Bayesian models provided estimates for significant variants that were similar to those from frequentist single-variant analysis, many non-significant variants with small effects were shrunk to zero. Therefore, extreme estimates, which were observed in frequentist models, were not found in Bayesian models.

Tests based on the average group effect from a reduced Bayesian hierarchical model were not sensitive to effects of groups of rare variants and only slightly sensitive to groups of common variants, but stayed below the expected proportion of false positive results. The joint test, however, while being insensitive to group effects in genes with simulated genetic effects, showed a high rate of false positive results in genes without simulated genetic effects. An explanation might be provided by the simulation model of GAW19, which is based on a polygenic model with a total heritability of blood pressure of about 30%. Heritability that was not explained by the functional variants in the 128 candidate genes was distributed over random variants outside the candidate genes on odd-numbered chromosomes, to which the joint test might have been sensitive (see Discussion).

Table 6.8: Overview of genome-wide significant associations in 128 candidate genes with non-null effects and nominally significant associations in 1,000 genes with null effects in simulated data II (GAW19)

Non-null genetic effects, p < 1 × 10−7
                                                                       DBP                           SBP                           HTN
Theory        Model             Level                Prior (variant)  Variants  RV group  CV group  Variants  RV group  CV group  Variants  RV group  CV group
Frequentist   Single variant    Variant              NA               0.42%     NA        NA        0.43%     NA        NA        0.00%     NA        NA
Bayesian (a)  Multiple variant  Variant              Cauchy(0, 1)     0.08%     NA        NA        0.10%     NA        NA        0.00%     NA        NA
Bayesian (a)  Multiple variant  Group (average)      Cauchy(0, 1)     NA        0.00%     1.80%     NA        0.00%     0.90%     NA        0.00%     0.90%
Bayesian (a)  Multiple variant  Group (joint test)   Cauchy(0, 1)     NA        0.00%     0.90%     NA        0.00%     0.00%     NA        0.00%     0.00%
Bayesian (b)  Multiple variant  Group (coefficient)  Cauchy(1, 1)     NA        44.14%    6.31%     NA        22.52%    5.41%     NA        28.83%    5.41%

Null genetic effects, p < 0.05
                                                                       DBP                           SBP                           HTN
Theory        Model             Level                Prior (variant)  Variants  RV group  CV group  Variants  RV group  CV group  Variants  RV group  CV group
Frequentist   Single variant    Variant              NA               5.22%     NA        NA        4.60%     NA        NA        3.12%     NA        NA
Bayesian (a)  Multiple variant  Variant              Cauchy(0, 1)     0.60%     NA        NA        0.77%     NA        NA        0.68%     NA        NA
Bayesian (a)  Multiple variant  Group (average)      Cauchy(0, 1)     NA        0.00%     2.90%     NA        0.00%     2.20%     NA        0.00%     3.51%
Bayesian (a)  Multiple variant  Group (joint test)   Cauchy(0, 1)     NA        20.00%    0.80%     NA        31.60%    0.90%     NA        0.00%     1.30%
Bayesian (b)  Multiple variant  Group (coefficient)  Cauchy(1, 1)     NA        51.20%    55.90%    NA        29.10%    47.70%    NA        97.09%    69.34%

DBP=diastolic blood pressure, SBP=systolic blood pressure, HTN=hypertension, RV=rare variant, CV=common variant. (a) reduced Bayesian hierarchical model, (b) full Bayesian hierarchical model. A Cauchy (location=0, scale=1) prior distribution was used for group effects.


Tests based on the hierarchical group effect from full Bayesian hierarchical models again showed a strongly increased rate of false positive results, while being sensitive to simulated non-null genetic effects. This confirms the risk of misleading results that was already seen in simulated data I. Therefore, full Bayesian hierarchical models with a group coefficient will not be applied to empirical data. Table 6.8 summarises the results of frequentist and Bayesian analyses in simulated GAW19 data.

6.4 Empirical data (GAW 19)

Genetic associations between empirical phenotypes of diastolic and systolic blood pressure and hypertension and empirical genetic data from the GAW19 dataset are presented. Bayesian hierarchical multiple-variant models are used in the primary analysis and frequentist single-variant models in the secondary analysis. Following the structure of the previous sections, results from frequentist standard single-variant analysis are shown first, followed by results from Bayesian hierarchical multiple-variant analysis.

6.4.1 Frequentist single-variant analysis

A genome-wide association analysis using single-variant tests was performed in 1,851 individuals with complete data and 461,868 quality-controlled, polymorphic single nucleotide variants. A non-genetic linear model of diastolic and systolic blood pressure was fitted, which showed no substantial deviations from normality of residuals and no outliers (Figure A.7 in the appendix).

Diastolic blood pressure

Genome-wide single-variant association analysis of diastolic blood pressure using standard linear regression in 1,851 individuals was performed. 198,939 variants with a minor allele count of ≤ 1 or missing p value were excluded, resulting in 262,929 variants in 11,040 genes. No genomic control was applied (λ =0.985, Figure A.9 in the appendix).
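The genomic inflation factor λ is conventionally defined as the median of the observed 1-df chi-square statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The sketch below follows this conventional definition; whether λ was computed in exactly this way for the present analysis is an assumption, and the function name is illustrative.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Genomic inflation factor lambda from two-sided p values of 1-df tests."""
    nd = NormalDist()
    # Convert each p value (0 < p <= 1) back to a 1-df chi-square statistic.
    # The lower tail is used to stay numerically stable for very small p.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spread p values, as expected under the null hypothesis.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(uniform_p)
```

Values close to 1 indicate no inflation of test statistics, while values below 1, such as those reported here, point to slightly conservative tests.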


Top ten associations sorted by significance level are reported in Table 6.9. In total, one association with DBP was genome-wide significant, namely that of variant chr11:118518698 (G/A) (estimate = 34.1, CI95% = [22.1, 46.2], p = 2.90 × 10−8). This variant is a rare coding non-synonymous variant (p.R1187H) of the PHLDB1 (Pleckstrin Homology-Like Domain, Family B, Member 1) gene.

Table 6.9: Associations of top ten variants with observed diastolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)

Nr  Variant ID        Gene    EAF     EAC  Est.   CI95%           P
1   11:118518698:G/A  PHLDB1  0.0008  3    34.14  [22.08, 46.19]  2.86 × 10−8
2   19:4446358:T/C    UBXN6   0.0030  11   15.60  [9.79, 21.40]   1.42 × 10−7
3   17:27287557:G/A   SEZ6    0.0011  4    27.72  [17.28, 38.16]  1.95 × 10−7
4   5:41853663:G/A    OXCT1   0.0011  4    27.41  [16.98, 37.85]  2.65 × 10−7
5   17:42152789:C/T   G6PC3   0.0011  4    27.19  [16.74, 37.64]  3.39 × 10−7
6   19:42595744:C/A   POU2F2  0.0014  5    24.17  [14.83, 33.51]  4.00 × 10−7
7   7:5415855:G/A     TNRC18  0.0005  2    38.11  [23.34, 52.89]  4.30 × 10−7
8   5:53606220:C/A    ARL15   0.0005  2    37.90  [23.13, 52.67]  4.96 × 10−7
9   5:43700242:A/G    NNT     0.0022  8    18.78  [11.38, 26.18]  6.48 × 10−7
10  19:51568294:G/T   KLK13   0.0005  2    36.91  [22.14, 51.69]  9.82 × 10−7

Variant ID=chromosome:position:reference/effect allele (hg19), EAF=effect allele frequency, EAC=effect allele count, Est.=estimate of the regression coefficient, CI95%=95% confidence interval; genome-wide significance level α = 1 × 10−7
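The columns of Table 6.9 are linked by the usual Wald relationship between an estimate, its standard error, and its p value. As a plausibility check, the sketch below recovers the standard error and a two-sided p value from an estimate and its 95% confidence interval using a normal approximation. The analysis itself may have used a t distribution, so small deviations from the tabulated p values are to be expected; the function name is illustrative.

```python
import math


def wald_from_ci(estimate, lower, upper, z_crit=1.959964):
    """Recover SE, z and a two-sided normal p value from a 95% interval."""
    se = (upper - lower) / (2.0 * z_crit)
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail area of N(0, 1)
    return se, z, p


# Top DBP variant in Table 6.9 (11:118518698:G/A in PHLDB1).
se, z, p = wald_from_ci(34.14, 22.08, 46.19)
print(f"SE = {se:.2f}, z = {z:.2f}, p = {p:.2e}")
```

For this variant the recovered p value lies below the genome-wide significance level of 1 × 10−7 and close to the tabulated value.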

The other nine variants were rare but not genome-wide significant: chr19:4446358 (p = 1.42 × 10−7) in UBXN6 (UBX Domain Protein 6), chr17:27287557 (p = 1.95 × 10−7) in SEZ6 (Seizure Related 6 Homolog (Mouse)), chr5:41853663 (p = 2.65 × 10−7) in OXCT1 (3-Oxoacid CoA Transferase 1), chr17:42152789 (p = 3.39 × 10−7) in G6PC3 (Glucose 6 Phosphatase, Catalytic, 3), chr19:42595744 (p = 4.00 × 10−7) in POU2F2 (POU Class 2 Homeobox 2), chr7:5415855 (p = 4.30 × 10−7) in TNRC18 (Trinucleotide Repeat Containing 18), chr5:53606220 (p = 4.96 × 10−7) in ARL15 (ADP-Ribosylation Factor-Like 15), chr5:43700242 (p = 6.48 × 10−7) in NNT (Nicotinamide Nucleotide Transhydrogenase), and chr19:51568294 (p = 9.82 × 10−7) in KLK13 (Kallikrein-Related Peptidase 13).

Systolic blood pressure

Genome-wide single-variant association analysis of systolic blood pressure using standard linear regression in 1,851 individuals was performed. 202,782 variants with a minor allele count of ≤ 1 or missing p value were excluded, resulting in 259,086 variants in 10,887 genes. No genomic control was applied (λ = 0.942, Figure A.10 in the appendix). Top ten associations sorted by significance level are reported in Table 6.10. In total, one association with SBP was genome-wide significant, namely that of variant chr11:70331805 (T/C) (estimate = 70.66, CI95% = [45.23, 96.10], p = 5.20 × 10−8). This variant is a rare coding synonymous variant (p.T943T) in the SHANK2 gene (SH3 And Multiple Ankyrin Repeat Domains 2).

Table 6.10: Associations of top ten variants with observed systolic blood pressure using a normal linear model in empirical GAW19 data (sorted by p value)

Nr  Variant ID       Gene    EAF     EAC  Est.   CI95%           P
1   11:70331805:T/C  SHANK2  0.0005  2    70.66  [45.23, 96.10]  5.17 × 10−8
2   1:34401428:G/A   CSMD2   0.0005  2    69.02  [43.55, 94.48]  1.08 × 10−7
3   21:47419027:G/A  COL6A1  0.0011  4    48.29  [30.26, 66.33]  1.54 × 10−7
4   13:41507895:G/A  ELF1    0.0005  2    67.58  [42.11, 93.05]  1.99 × 10−7
5   19:45315391:G/A  BCAM    0.0005  2    66.26  [40.78, 91.74]  3.45 × 10−7
6   3:44684218:C/T   ZNF197  0.0011  4    46.37  [28.34, 64.39]  4.61 × 10−7
7   1:32084872:T/G   HCRTR1  0.0019  7    35.08  [21.40, 48.75]  4.94 × 10−7
8   1:117714983:T/C  VTCN1   0.0008  3    53.14  [32.30, 73.97]  5.77 × 10−7
9   11:73796878:T/C  C2CD3   0.0022  8    31.85  [19.09, 44.61]  9.99 × 10−7
10  17:36871888:C/T  MLLT6   0.0005  2    62.46  [36.96, 87.95]  1.58 × 10−6

Variant ID=chromosome:position:reference/effect allele (hg19); EAF=effect allele frequency, EAC=effect allele count, Est.=estimate of the regression coefficient, CI95%=95% confidence interval; genome-wide significance level α = 1 × 10−7

The other nine variants were also rare but not genome-wide significant: chr1:34401428 (p = 1.08 × 10−7) in CSMD2 (CUB And Sushi Multiple Domains 2), chr21:47419027 (p = 1.54 × 10−7) in COL6A1 (Collagen, Type VI, Alpha 1), chr13:41507895 (p = 1.99 × 10−7) in ELF1 (E74-Like Factor 1, Ets Domain Transcription Factor), chr19:45315391 (p = 3.45 × 10−7) in BCAM (Basal Cell Adhesion Molecule), chr3:44684218 (p = 4.61 × 10−7) in ZNF197 (Zinc Finger Protein 197), chr1:32084872 (p = 4.94 × 10−7) in HCRTR1 (Hypocretin (Orexin) Receptor 1), chr1:117714983 (p = 5.77 × 10−7) in VTCN1 (V-Set Domain Containing T Cell Activation Inhibitor 1), chr11:73796878 (p = 9.99 × 10−7) in C2CD3 (C2 Calcium-Dependent Domain Containing 3), and chr17:36871888 (p = 1.58 × 10−6) in MLLT6 (Myeloid/Lymphoid Or Mixed-Lineage Leukemia; Translocated To, 6).


Hypertension

Genome-wide single-variant association analysis of hypertension using Firth logistic regression in 1,851 individuals was performed. 198,937 variants with a minor allele count ≤ 1 were excluded, resulting in 262,931 variants in 11,040 genes. No genomic control was applied (λ =0.825, Figure A.11 in the appendix).

Table 6.11: Associations of top ten variants with observed hypertension using the Firth logistic model in empirical GAW19 data (sorted by p value)

Nr  Variant ID       Gene      EAF     EAC   Est.   CI95%           P
1   17:78320283:C/T  RNF213    0.0022  8     4.20   [1.08, 7.33]    7.13 × 10−6
2   9:15657180:C/G   CCDC171   0.0273  101   −1.35  [−2.04, −0.66]  1.17 × 10−5
3   11:67837771:A/T  CHKA      0.2358  873   0.44   [0.24, 0.63]    1.59 × 10−5
4   11:67815086:G/A  TCIRG1    0.1056  391   0.58   [0.32, 0.84]    1.69 × 10−5
5   7:48450157:T/C   ABCA13    0.4968  1839  0.33   [0.16, 0.50]    9.95 × 10−5
6   1:111857239:A/C  CHIA      0.0140  52    1.24   [0.62, 1.85]    1.01 × 10−4
7   7:99688144:G/A   COPS6     0.0024  9     3.15   [1.12, 5.18]    1.05 × 10−4
8   11:490036:G/A    PTDSS2    0.1518  562   −0.51  [−0.78, −0.24]  1.06 × 10−4
9   3:38317194:C/G   SLC22A13  0.0273  101   0.95   [0.49, 1.42]    1.22 × 10−4
10  3:112282323:A/G  SLC35A5   0.0030  11    2.52   [1.15, 3.88]    1.57 × 10−4

Variant ID=chromosome:position:reference/effect allele (hg19); EAF=effect allele frequency, EAC=effect allele count, Est.=estimate of the regression coefficient, CI95%=95% confidence interval; genome-wide significance level α = 1 × 10−7

Top ten associations sorted by significance level are reported in Table 6.11. No variant reached genome-wide significance (all p > 1 × 10−7). The top ten variants included: chr17:78320283 (p = 7.13 × 10−6) in RNF213 (Ring Finger Protein 213), chr9:15657180 (p = 1.17 × 10−5) in CCDC171 (Coiled-Coil Domain Containing 171), chr11:67837771 (p = 1.59 × 10−5) in CHKA (Choline Kinase Alpha), chr11:67815086 (p = 1.69 × 10−5) in TCIRG1 (T-cell, immune regulator 1, ATPase, H+ transporting, lysosomal V0 subunit A3), chr7:48450157 (p = 9.95 × 10−5) in ABCA13 (ATP-Binding Cassette, Sub-Family A (ABC1), Member 13), chr1:111857239 (p = 1.01 × 10−4) in CHIA (Chitinase, Acidic), chr7:99688144 (p = 1.05 × 10−4) in COPS6 (COP9 Signalosome Subunit 6), chr11:490036 (p = 1.06 × 10−4) in PTDSS2 (Phosphatidylserine Synthase 2), chr3:38317194 (p = 1.22 × 10−4) in SLC22A13 (solute carrier family 22 (organic anion/urate transporter), member 13), and chr3:112282323 (p = 1.57 × 10−4) in SLC35A5 (Solute Carrier Family 35, Member A5).


6.4.2 Bayesian hierarchical multiple-variant analysis

A genome-wide association analysis using (reduced) Bayesian hierarchical generalized linear models with weakly informative prior Cauchy (location=0, scale=1) distributions was performed in 1,851 individuals and 461,868 single nucleotide variants. Results are reported for effects of individual variants, which are conditioned on all other variants in the gene (primary), and for average effects (secondary) and joint effects (primary) of groups of rare variants and groups of common variants per gene.

Variants

Associations of individual variants from a Bayesian hierarchical generalized linear model are presented.

Diastolic blood pressure. Associations of 420,046 variants in 9,436 genes with diastolic blood pressure were studied. In total, two rare variants were significantly associated with DBP (Table 6.12). Variant chr1:204214013 (G/T, p.P717P, estimate = 39.6, CrI95% = [25.3, 53.9], p = 6.80 × 10−8) is a coding synonymous rare variant located in the PLEKHA6 (Pleckstrin Homology Domain Containing, Family A Member 6) gene. Variant chr11:118518698 (G/A, p.R118H, estimate = 32.23, CrI95% = [20.6, 43.9], p = 7.00 × 10−8) is a coding non-synonymous variant in the PHLDB1 (Pleckstrin Homology-Like Domain, Family B, Member 1) gene. The other eight variants among the top ten were rare but not significant: chr17:27287557 (p = 4.32 × 10−7) in SEZ6 (Seizure Related 6 Homolog (Mouse)), chr19:4446358 (p = 5.07 × 10−7) in UBXN6 (UBX Domain Protein 6), chr5:41853663 (p = 7.08 × 10−7) in OXCT1 (3-Oxoacid CoA Transferase 1), chr17:42152789 (p = 1.23 × 10−6) in G6PC3 (Glucose 6 Phosphatase, Catalytic, 3), chr5:43700242 (p = 1.47 × 10−6) in NNT (Nicotinamide Nucleotide Transhydrogenase), chr5:53606220 (p = 1.53 × 10−6) in ARL15 (ADP-Ribosylation Factor-Like 15), chr19:9038077 (p = 2.12 × 10−6) in MUC16 (Mucin 16, Cell Surface Associated), and chr7:5415855 (p = 2.48 × 10−6) in TNRC18 (Trinucleotide Repeat Containing 18).


Table 6.12: Associations of top ten variants with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)

Nr  Variant ID        Gene     EAF     EAC  Est.   CrI95%          P
1   1:204214013:G/T   PLEKHA6  0.0010  4    39.59  [25.27, 53.90]  6.76 × 10−8
2   11:118518698:G/A  PHLDB1   0.0008  3    32.23  [20.56, 43.90]  7.04 × 10−8
3   17:27287557:G/A   SEZ6     0.0010  4    25.82  [15.85, 35.79]  4.32 × 10−7
4   19:4446358:T/C    UBXN6    0.0028  11   14.36  [8.78, 19.94]   5.07 × 10−7
5   5:41853663:G/A    OXCT1    0.0010  4    25.45  [15.43, 35.47]  7.08 × 10−7
6   17:42152789:C/T   G6PC3    0.0010  4    25.14  [15.02, 35.26]  1.23 × 10−6
7   5:43700242:A/G    NNT      0.0026  10   17.66  [10.49, 24.82]  1.47 × 10−6
8   5:53606220:C/A    ARL15    0.0005  2    34.87  [20.70, 49.04]  1.53 × 10−6
9   19:9038077:C/A    MUC16    0.0008  3    34.76  [20.44, 49.07]  2.12 × 10−6
10  7:5415855:G/A     TNRC18   0.0005  2    34.07  [19.96, 48.18]  2.48 × 10−6

Variant ID=chromosome:position:reference/effect allele (hg19); EAF=effect allele frequency, EAC=effect allele count, Est.=estimate of regression coefficient, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7

Systolic blood pressure. Associations of 420,046 variants in 9,436 genes with systolic blood pressure were studied, of which genome-wide significant variants and the top ten variants are reported (Table 6.13). No variant was significantly associated with SBP (p ≥ 1 × 10−7).

Table 6.13: Associations of top ten variants with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) in empirical GAW19 data (sorted by Bayesian p value)

Nr  Variant ID       Gene    MAF     MAC  Est.   CrI95%          P
1   11:70331805:T/C  SHANK2  0.0005  2    65.09  [40.55, 89.63]  2.25 × 10−7
2   9:71836451:A/G   TJP2    0.0280  109  10.89  [6.78, 15.00]   2.32 × 10−7
3   21:47419027:G/A  COL6A1  0.0010  4    44.92  [27.49, 62.34]  4.90 × 10−7
4   1:34401428:G/A   CSMD2   0.0005  2    62.39  [38.09, 86.69]  5.36 × 10−7
5   13:41507895:G/A  ELF1    0.0005  2    62.13  [37.77, 86.49]  6.31 × 10−7
6   19:45315391:G/A  BCAM    0.0005  2    61.17  [36.66, 85.69]  1.09 × 10−6
7   1:32084872:T/G   HCRTR1  0.0018  7    31.97  [18.90, 45.04]  1.77 × 10−6
8   3:44684218:C/T   ZNF197  0.0010  4    42.05  [24.80, 59.30]  1.91 × 10−6
9   1:117714983:T/C  VTCN1   0.0008  3    48.00  [28.09, 67.92]  2.49 × 10−6
10  11:73796878:T/C  C2CD3   0.0021  8    29.14  [16.92, 41.36]  3.20 × 10−6

Variant ID=chromosome:position:reference/effect allele (hg19); MAF=minor allele frequency, MAC=minor allele count, Est.=estimate of regression coefficient, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7

The top ten variants included: chr11:70331805 (p = 2.25 × 10−7) in SHANK2 (SH3 And Multiple Ankyrin Repeat Domains 2), chr9:71836451 (p = 2.32 × 10−7) in TJP2 (Tight Junction Protein 2), chr21:47419027 (p = 4.90 × 10−7) in COL6A1 (Collagen, Type VI, Alpha 1), chr1:34401428 (p = 5.36 × 10−7) in CSMD2 (CUB And Sushi Multiple Domains 2), chr13:41507895 (p = 6.31 × 10−7) in ELF1 (E74-Like Factor 1, Ets Domain Transcription Factor), chr19:45315391 (p = 1.09 × 10−6) in BCAM (Basal Cell Adhesion Molecule), chr1:32084872 (p = 1.77 × 10−6) in HCRTR1 (Hypocretin (Orexin) Receptor 1), chr3:44684218 (p = 1.91 × 10−6) in ZNF197 (Zinc Finger Protein 197), chr1:117714983 (p = 2.49 × 10−6) in VTCN1 (V-Set Domain Containing T Cell Activation Inhibitor 1), and chr11:73796878 (p = 3.20 × 10−6) in C2CD3 (C2 Calcium-Dependent Domain Containing 3).

Hypertension

Associations of 419,358 variants in 9,423 genes with hypertension were studied, of which variants with genome-wide significance and top ten variants are reported (Table 6.14). No variants were significantly associated with hypertension (p ≥ 1 × 10−7).

Table 6.14: Associations of top ten variants with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) in empirical GAW19 data (sorted by Bayesian p value)

Nr  Variant ID        Gene          MAF     MAC   Est.   CrI95%          P
1   9:71836451:A/G    TJP2          0.0280  109   1.29   [0.72, 1.85]    7.25 × 10−6
2   11:67837771:A/T   CHKA          0.2342  910   0.42   [0.22, 0.61]    2.05 × 10−5
3   9:124521260:G/A   DAB2IP        0.0661  257   0.65   [0.34, 0.96]    3.15 × 10−5
4   11:67815086:G/A   TCIRG1        0.1042  405   0.58   [0.30, 0.86]    4.93 × 10−5
5   11:495103:A/G     RNH1          0.4786  1858  −0.47  [−0.70, −0.23]  9.86 × 10−5
6   19:8321182:A/T    CERS4         0.3055  1187  0.42   [0.21, 0.63]    1.00 × 10−4
7   1:229730452:G/A   TAF5L         0.3327  1293  −0.39  [−0.60, −0.19]  1.36 × 10−4
8   19:36530913:C/T   LOC101927572  0.0713  277   0.75   [0.36, 1.13]    1.50 × 10−4
9   15:51914662:G/A   DMXL2         0.4677  1812  0.78   [0.37, 1.19]    1.75 × 10−4
10  21:46842443:C/T   COL18A1       0.1241  482   1.32   [0.63, 2.02]    2.00 × 10−4

Variant ID=chromosome:position:reference/effect allele (hg19); MAF=minor allele frequency, MAC=minor allele count, Est.=estimate of regression coefficient, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7

The top ten variants contained: chr9:71836451:A/G (p = 7.25 × 10−6) in TJP2 (Tight Junction Protein 2), chr11:67837771:A/T (p = 2.05 × 10−5) in CHKA (Choline Kinase Alpha), chr9:124521260:G/A (p = 3.15 × 10−5) in DAB2IP (DAB2 Interacting Protein), chr11:67815086:G/A (p = 4.93 × 10−5) in TCIRG1 (T-cell, immune regulator 1, ATPase, H+ transporting, lysosomal V0 subunit A3), chr11:495103:A/G (p = 9.86 × 10−5) in RNH1 (Ribonuclease/Angiogenin Inhibitor 1), chr19:8321182:A/T (p = 1.00 × 10−4) in CERS4 (Ceramide Synthase 4), chr1:229730452:G/A (p = 1.36 × 10−4) in TAF5L (TAF5-like RNA polymerase II, p300/CBP-associated factor (PCAF)-associated factor), chr19:36530913:C/T (p = 1.50 × 10−4) in LOC101927572 (uncharacterized), chr15:51914662:G/A (p = 1.75 × 10−4) in DMXL2 (Dmx-Like 2), and chr21:46842443:C/T (p = 2.00 × 10−4) in COL18A1 (Collagen, Type XVIII, Alpha 1).

Average effects of groups of variants

Results of average effects of groups of rare or common variants in 6,102 genes, with a minimal number of 3 variants per group are reported.

Diastolic blood pressure None of the average group effects of rare variants (p > 0.33) or common variants (p ≥ 5.54 × 10−4) on DBP reached genome-wide significance for groups. Genes with top ten Bayesian p values of rare variant group effects, which are not nominally significant, are shown in Table A.8 in the appendix. Top ten common variant group effects are listed below in Table 6.15. Genes with top ten common variant group effects included UNC13D (p = 5.54 × 10−4, unc-13 homolog D, C. elegans), PAXIP1 (p = 2.25 × 10−3, PAX interacting (with transcription-activation domain) protein 1), PLSCR3 (p = 3.51 × 10−3, Phospholipid Scramblase 3), C1orf192 (p = 4.62 × 10−3, chromosome 1 open reading frame 192), TRIM65 (p = 4.68 × 10−3, tripartite motif containing 65), TIAM1 (p = 5.95 × 10−3, T-cell lymphoma invasion and metastasis 1), TRIM47 (p = 6.30 × 10−3, tripartite motif containing 47), WDR41 (p = 6.33 × 10−3, WD repeat domain 41), CHMP2B (p = 6.51 × 10−3, charged multivesicular body protein 2B), and ZNF540 (p = 6.75 × 10−3, zinc finger protein 540).

Table 6.15: Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Estimate  CrI95%            P
 1  UNC13D    163   22   −0.19    [−0.30, −0.08]    5.54 × 10−4
 2  PAXIP1     74    4    0.60    [0.22, 0.99]      2.25 × 10−3
 3  PLSCR3     16    3   −0.48    [−0.80, −0.16]    3.51 × 10−3
 4  C1orf192   12    4   −0.43    [−0.72, −0.13]    4.62 × 10−3
 5  TRIM65     43    9   −0.29    [−0.49, −0.09]    4.68 × 10−3
 6  TIAM1      92   14    0.20    [0.06, 0.34]      5.95 × 10−3
 7  TRIM47     41    3   −0.78    [−1.34, −0.22]    6.30 × 10−3
 8  WDR41      37   12    0.30    [0.09, 0.52]      6.33 × 10−3
 9  CHMP2B     11    3   −0.61    [−1.04, −0.17]    6.51 × 10−3
10  ZNF540     60    7   −0.38    [−0.65, −0.10]    6.75 × 10−3

RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7

Systolic blood pressure None of the average group effects of rare variants (p > 0.63) or common variants (p ≥ 2.48 × 10−3) on SBP reached genome-wide significance for groups. Genes with top ten Bayesian p values of rare variant group effects, which are not nominally significant, are listed in Table A.9 in the appendix. Genes with top ten Bayesian p values of common variant group effects are shown below in Table 6.16. Top ten group effects of common variants were found in the genes ZNF224 (p = 2.48 × 10−3, zinc finger protein 224), LOC100379224 (p = 3.85 × 10−3, uncharacterized), DBR1 (p = 5.97 × 10−3, debranching RNA lariats 1), PDS5B (p = 7.78 × 10−3, PDS5, regulator of cohesion maintenance, homolog B, S. cerevisiae), ZNF284 (p = 9.69 × 10−3, zinc finger protein 284), DZIP1L (p = 1.09 × 10−2, DAZ interacting zinc finger protein 1-like), ZNF112 (p = 1.28 × 10−2, zinc finger protein 112), OR7G3 (p = 1.35 × 10−2, olfactory receptor, family 7, subfamily G, member 3), A4GNT (p = 1.48 × 10−2, alpha-1,4-N-acetylglucosaminyltransferase), and DCLRE1B (p = 1.89 × 10−2, DNA cross-link repair 1B).

Table 6.16: Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene          #RV  #CV  Estimate  CrI95%            Bayesian P
 1  ZNF224         25   13   −0.24    [−0.40, −0.08]    2.48 × 10−3
 2  LOC100379224   18   12   −0.28    [−0.48, −0.09]    3.85 × 10−3
 3  DBR1           28    5   −0.33    [−0.57, −0.10]    5.97 × 10−3
 4  PDS5B          50    7    0.39    [0.10, 0.68]      7.78 × 10−3
 5  ZNF284         27    3   −0.62    [−1.09, −0.15]    9.69 × 10−3
 6  DZIP1L         59    8   −0.30    [−0.52, −0.07]    1.09 × 10−2
 7  ZNF112         33    7    0.31    [0.07, 0.56]      1.28 × 10−2
 8  OR7G3          14    3    0.51    [0.11, 0.92]      1.35 × 10−2
 9  A4GNT          14    6   −0.41    [−0.75, −0.08]    1.48 × 10−2
10  DCLRE1B        13    3    0.66    [0.11, 1.22]      1.89 × 10−2

RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7

Hypertension None of the average group effects of rare variants (p > 0.63) or common variants (p ≥ 3.19 × 10−4) on hypertension reached genome-wide significance for groups. Genes with top ten Bayesian p values of rare variant group effects are listed in Table A.10 in the appendix. Genes with top ten common variant group effects are listed below in Table 6.17. Top ten group effects of common variants included the genes DRAXIN (p = 3.19 × 10−4, dorsal inhibitory axon guidance protein), A4GNT (p = 4.20 × 10−4, alpha-1,4-N-acetylglucosaminyltransferase), C11orf24 (p = 5.34 × 10−4, chromosome 11 open reading frame 24), SCYL1 (p = 5.45 × 10−4, SCY1-like 1 (S. cerevisiae)), IGDCC4 (p = 7.49 × 10−4, immunoglobulin superfamily, DCC subclass, member 4), DZIP1L (p = 7.68 × 10−4, DAZ interacting zinc finger protein 1-like), KLHL3 (p = 1.06 × 10−3, Kelch-like family member 3), EHBP1L1 (p = 1.06 × 10−3, EH domain binding protein 1-like 1), PPP1R26 (p = 1.13 × 10−3, protein phosphatase 1, regulatory subunit 26), and ZNF222 (p = 1.33 × 10−3, zinc finger protein 222).

Table 6.17: Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the average effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Estimate  CrI95%            Bayesian P
 1  DRAXIN     44    7    0.21    [0.10, 0.33]      3.19 × 10−4
 2  A4GNT      14    6   −0.10    [−0.16, −0.04]    4.20 × 10−4
 3  C11orf24   29    3   −0.35    [−0.55, −0.15]    5.34 × 10−4
 4  SCYL1      64    4    0.30    [0.13, 0.46]      5.45 × 10−4
 5  IGDCC4    116   12    0.06    [0.03, 0.10]      7.49 × 10−4
 6  DZIP1L     59    8   −0.06    [−0.10, −0.03]    7.68 × 10−4
 7  KLHL3      41    9    0.08    [0.03, 0.13]      1.06 × 10−3
 8  EHBP1L1   128    8   −0.15    [−0.24, −0.06]    1.06 × 10−3
 9  PPP1R26    75   11    0.06    [0.02, 0.10]      1.13 × 10−3
10  ZNF222     15    6   −0.24    [−0.39, −0.09]    1.33 × 10−3

RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10−7


Joint effects of groups of variants

Results of joint effects of groups of rare or common variants in 6,102 genes with a minimal number of 3 variants per group are reported.

Diastolic blood pressure Overall, 23 genome-wide significant joint group effects of rare variants per gene were identified (Table 6.18). None of the joint effects of common variants was genome-wide significant (all p ≥ 2.69 × 10−5). Joint group effects of rare variants included the following genes: PNPLA7 (p = 8.77 × 10−14, patatin-like phospholipase domain containing 7), DNAH9 (p = 4.10 × 10−12, dynein, axonemal, heavy chain 9), SSPO (p = 2.70 × 10−11, SCO-spondin), PLEKHG2 (p = 3.93 × 10−11, pleckstrin homology domain containing, family G (with RhoGef domain) member 2), ABCA13 (p = 8.35 × 10−10, ATP-binding cassette, sub-family A (ABC1), member 13), DNAH11 (p = 2.33 × 10−9, dynein, axonemal, heavy chain 11), MYO1G (p = 2.97 × 10−9, myosin IG), AGRN (p = 5.34 × 10−9, agrin), AP5Z1 (p = 5.54 × 10−9, adaptor-related protein complex 5, zeta 1 subunit), DYNC2H1 (p = 6.19 × 10−9, Dynein, Cytoplasmic 2, Heavy Chain 1), TRPM4 (p = 1.08 × 10−8, transient receptor potential cation channel, subfamily M, member 4), WDR62 (p = 1.26 × 10−8, WD repeat domain 62), SIPA1L3 (p = 1.43 × 10−8, signal-induced proliferation-associated 1 like 3), SLC5A10 (p = 1.92 × 10−8, solute carrier family 5 (sodium/sugar cotransporter), member 10), CAMTA2 (p = 2.72 × 10−8, calmodulin binding transcription activator 2), WNK2 (p = 2.87 × 10−8, WNK lysine deficient protein kinase 2), SVEP1 (p = 3.02 × 10−8, sushi, von Willebrand factor type A, EGF and pentraxin domain containing 1), DVL1 (p = 3.55 × 10−8, dishevelled segment polarity protein 1), FASN (p = 3.84 × 10−8, fatty acid synthase), COL6A1 (p = 3.89 × 10−8, collagen, type VI, alpha 1), TRIO (p = 5.13 × 10−8, Rho guanine nucleotide exchange factor), TRPM2 (p = 6.72 × 10−8, transient receptor potential cation channel, subfamily M, member 2), and PHLDB1 (p = 9.32 × 10−8, pleckstrin homology-like domain, family B, member 1).

Top ten joint group effects of common variants in a gene, none of which reached genome-wide significance, are listed in Table A.11 of the appendix. The genes included: TJP2 (p = 2.69 × 10−5, tight junction protein 2), LGALS4 (p = 1.32 × 10−4, lectin, galactoside-binding, soluble, 4), COX10 (p = 4.07 × 10−4, cytochrome c oxidase assembly homolog 10, yeast), C5orf34 (p = 6.45 × 10−4, chromosome 5 open reading frame 34), HCN4 (p = 7.76 × 10−4, hyperpolarization activated cyclic nucleotide-gated potassium channel 4), KIAA1432 (p = 8.11 × 10−4, alternative symbol RIC1, RAB6A GEF complex partner 1), GUCY1A2 (p = 1.00 × 10−3, guanylate cyclase 1, soluble, alpha 2), GPR146 (p = 1.02 × 10−3, G protein-coupled receptor 146), TP53 (p = 1.11 × 10−3, tumor protein p53), and VPS37C (p = 1.37 × 10−3, vacuolar protein sorting 37 homolog C, S. cerevisiae).

Table 6.18: Genome-wide significant associations of genes with observed diastolic blood pressure using a Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Q         χ2     df    Bayesian P
 1  PNPLA7    239   20  6542.37  74.12  6.31  8.77 × 10−14
 2  DNAH9     289   40  5799.68  65.81  6.26  4.10 × 10−12
 3  SSPO      612   78  5795.92  59.78  5.52  2.70 × 10−11
 4  PLEKHG2   138   14  5064.63  56.78  4.74  3.93 × 10−11
 5  ABCA13    264   51  4774.02  50.46  4.78  8.35 × 10−10
 6  DNAH11    298   68  4510.84  48.50  4.85  2.33 × 10−9
 7  MYO1G      73   11  3886.13  46.93  4.47  2.97 × 10−9
 8  AGRN      330   22  4110.51  46.64  4.82  5.34 × 10−9
 9  AP5Z1     183   16  3958.30  44.43  4.04  5.54 × 10−9
10  DYNC2H1   225   35  4028.42  44.44  4.13  6.19 × 10−9
11  TRPM4     149    9  4150.50  45.14  4.82  1.08 × 10−8
12  WDR62     129   22  3708.54  40.82  3.39  1.26 × 10−8
13  SIPA1L3   164   14  3820.10  42.56  4.08  1.43 × 10−8
14  SLC5A10   107   16  3730.21  38.74  2.98  1.92 × 10−8
15  CAMTA2     75   11  3764.04  40.93  3.98  2.72 × 10−8
16  WNK2      281   16  3679.66  44.47  5.36  2.87 × 10−8
17  SVEP1     211   45  3510.79  44.71  5.50  3.02 × 10−8
18  DVL1       99    9  3385.29  35.64  2.40  3.55 × 10−8
19  FASN      320   13  3698.65  44.47  5.61  3.84 × 10−8
20  COL6A1    202   29  3381.20  42.25  4.75  3.89 × 10−8
21  TRIO      198   23  3187.94  38.23  3.50  5.13 × 10−8
22  TRPM2     190   19  3466.18  41.80  5.03  6.72 × 10−8
23  PHLDB1     75    8  1039.34  28.71  1.04  9.32 × 10−8

Q=quadratic statistic, RV=rare variants, CV=common variants; genome-wide significance level α = 1 × 10−7
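The χ2 and df columns suggest that the joint test refers a scaled statistic to a chi-square distribution with non-integer degrees of freedom. The following minimal sketch evaluates the upper tail of such a distribution; the reading of the columns is an assumption inferred from the table, and the function `chi2_sf` is a hypothetical helper, not part of the analysis software. It uses the standard continued-fraction expansion of the regularised upper incomplete gamma function and only covers the far upper tail.

```python
import math

def chi2_sf(x, df, max_iter=500, eps=1e-15):
    """Upper tail P(X > x) of a chi-square distribution with real-valued df.

    Evaluates Q(a, z) with a = df / 2 and z = x / 2 by a continued fraction
    (modified Lentz method), which converges quickly when z > a + 1.
    """
    a, z = df / 2.0, x / 2.0
    if z <= a + 1.0:
        raise ValueError("sketch only covers the upper tail (x/2 > df/2 + 1)")
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)          # i-th partial numerator
        b += 2.0                   # i-th partial denominator
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    # prefactor exp(-z) * z**a / Gamma(a), computed on the log scale
    return math.exp(-z + a * math.log(z) - math.lgamma(a)) * h

# PNPLA7 row of Table 6.18: chi-square statistic 74.12 on 6.31 degrees of freedom
print(f"{chi2_sf(74.12, 6.31):.2e}")
```

The printed value can be compared with the tabulated Bayesian p value of the PNPLA7 gene. For even integer df the result can be checked against the closed form, for example exp(−x/2) for df = 2.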


Systolic blood pressure Overall, 21 genome-wide significant joint group effects of rare variants per gene were identified (Table 6.19). None of the joint effects of common variants was genome-wide significant (all p ≥ 1.16 × 10−7). Joint group effects of rare variants included the following genes: DNAH9 (p = 8.60 × 10−12, dynein, axonemal, heavy chain 9), MACF1 (p = 2.98 × 10−10, microtubule-actin crosslinking factor 1), DNAH11 (p = 3.74 × 10−10, dynein, axonemal, heavy chain 11), TRPM2 (p = 2.24 × 10−9, transient receptor potential cation channel, subfamily M, member 2), SDK2 (p = 5.02 × 10−9, sidekick cell adhesion molecule 2), DNAH1 (p = 5.68 × 10−9, dynein, axonemal, heavy chain 1), PCDHA1 (p = 7.27 × 10−9, protocadherin alpha 1), OBSCN (p = 1.73 × 10−8, obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF), CEP131 (p = 2.03 × 10−8, centrosomal protein 131kDa), USH2A (p = 2.12 × 10−8, Usher syndrome 2A (autosomal recessive, mild)), C19orf44 (p = 2.50 × 10−8, chromosome 19 open reading frame 44), DOCK6 (p = 2.81 × 10−8, dedicator of cytokinesis 6), PTPRS (p = 2.92 × 10−8, protein tyrosine phosphatase, receptor type, S), EHMT1 (p = 4.64 × 10−8, euchromatic histone-lysine N-methyltransferase 1), RNF126 (p = 5.25 × 10−8, ring finger protein 126), TECPR1 (p = 5.49 × 10−8, tectonin beta-propeller repeat containing 1), PCDHA2 (p = 6.45 × 10−8, protocadherin alpha 2), MYO10 (p = 6.71 × 10−8, myosin X), RNF213 (p = 7.21 × 10−8, ring finger protein 213), LTN1 (p = 7.28 × 10−8, listerin E3 ubiquitin protein ligase 1), and FASN (p = 7.81 × 10−8, fatty acid synthase).

Top ten joint group effects of common variants in a gene, none of which reached genome-wide significance, are listed in Table A.12 of the appendix. The genes included TJP2 (p = 1.16 × 10−7, tight junction protein 2), CNTNAP2 (p = 3.86 × 10−5, contactin associated protein-like 2), MIR548I4 (p = 4.89 × 10−5, microRNA 548i-4), URB2 (p = 3.49 × 10−4, ribosome biogenesis 2 homolog (S. cerevisiae)), PAK6 (p = 5.29 × 10−4, p21 protein (Cdc42/Rac)-activated kinase 6), ULK4 (p = 1.05 × 10−3, unc-51 like kinase 4), SERPINI1 (p = 1.50 × 10−3, serpin peptidase inhibitor, clade I (neuroserpin), member 1), RNH1 (p = 1.64 × 10−3, ribonuclease/angiogenin inhibitor 1), ZSCAN29 (p = 1.69 × 10−3, zinc finger and SCAN domain containing 29), and LPPR2 (p = 2.28 × 10−3, lipid phosphate phosphatase-related protein type 2).

Table 6.19: Genome-wide significant associations of genes with observed systolic blood pressure using a Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Q          χ2     df    P
 1  DNAH9     289   40  17052.60  68.48  7.88  8.60 × 10−12
 2  MACF1     348   40  13760.95  50.68  4.10  2.98 × 10−10
 3  DNAH11    298   68  14322.21  52.94  5.06  3.74 × 10−10
 4  TRPM2     190   19  12103.82  49.14  5.06  2.24 × 10−9
 5  SDK2      261   23  11982.12  48.44  5.45  5.02 × 10−9
 6  DNAH1     383   22  12250.68  46.83  4.94  5.68 × 10−9
 7  PCDHA1    472   60  12382.88  43.67  3.98  7.27 × 10−9
 8  OBSCN     776  101  11552.40  46.15  5.59  1.73 × 10−8
 9  CEP131    226   26   8447.68  38.43  2.92  2.03 × 10−8
10  USH2A     298   44  11698.91  46.39  5.87  2.12 × 10−8
11  C19orf44   59    9  10917.83  42.81  4.61  2.50 × 10−8
12  DOCK6     302   22   8772.33  42.08  4.43  2.81 × 10−8
13  PTPRS     222   27  10924.84  38.12  3.06  2.92 × 10−8
14  EHMT1      86   13  11095.57  39.49  3.87  4.64 × 10−8
15  RNF126    113    8   8945.04  35.38  2.56  5.25 × 10−8
16  TECPR1    184   11  10019.13  37.01  3.12  5.49 × 10−8
17  PCDHA2    432   58  10821.02  39.08  3.97  6.45 × 10−8
18  MYO10     194   21  10388.20  36.23  3.00  6.71 × 10−8
19  RNF213    313   39  10103.58  38.48  3.84  7.21 × 10−8
20  LTN1       62   15  10059.75  38.39  3.81  7.28 × 10−8
21  FASN      320   13  10641.41  39.00  4.09  7.81 × 10−8

Q=quadratic statistic, RV=rare variants, CV=common variants; genome-wide significance level α = 1 × 10−7

Hypertension No genome-wide significant joint group effects of rare (all p ≥ 0.025) or common (all p ≥ 1.20 × 10−7) variants per gene were found (Tables A.13 and A.14 in the appendix). Top ten joint group effects of rare variants in a gene, none of which reached genome-wide significance, are listed in Table A.13 of the appendix. The corresponding genes are: GM2A (p = 2.48 × 10−2, GM2 ganglioside activator), OR51Q1 (p = 6.40 × 10−2, olfactory receptor, family 51, subfamily Q, member 1 (gene/pseudogene)), HPCAL4 (p = 6.75 × 10−2, hippocalcin like 4), TRIM33 (p = 1.21 × 10−1, tripartite motif containing 33), PRSS58 (p = 1.27 × 10−1, protease, serine, 58), VTCN1 (p = 1.31 × 10−1, V-set domain containing T cell activation inhibitor 1), PON3 (p = 1.36 × 10−1, paraoxonase 3), ZNF677 (p = 1.43 × 10−1, zinc finger protein 677), NFE2L1 (p = 2.15 × 10−1, nuclear factor, erythroid 2-like 1), and MYOZ3 (p = 2.20 × 10−1, myozenin 3).

Top ten joint group effects of common variants in a gene, none of which reached genome-wide significance, are listed in Table A.14 of the appendix. The corresponding genes are: COX10 (p = 6.16 × 10−4, cytochrome c oxidase assembly homolog 10 (yeast)), DRAXIN (p = 1.53 × 10−3, dorsal inhibitory axon guidance protein), FAM63A (p = 2.64 × 10−3, Family With Sequence Similarity 63, Member A), HSD3B1 (p = 3.16 × 10−3, hydroxy-delta-5-steroid dehydrogenase, 3 beta- and steroid delta-isomerase 1), NPPA-AS1 (p = 3.30 × 10−3, Natriuretic Peptide A (NPPA) antisense RNA 1), NPPA (p = 3.30 × 10−3, Natriuretic Peptide A), KIAA1432 (p = 4.14 × 10−3, alternative symbol RIC1, RAB6A GEF complex partner 1), C9orf72 (p = 4.60 × 10−3, chromosome 9 open reading frame 72), TNFSF13 (p = 6.65 × 10−3, tumor necrosis factor (ligand) superfamily, member 13), and LGALS4 (p = 6.82 × 10−3, lectin, galactoside-binding, soluble, 4).

6.4.3 Comparison of single and multiple variant analysis

Estimates and p values

A visual comparison of association results for individual variants from frequentist linear models of single variants and Bayesian hierarchical linear models of multiple variants (conditional effects) with weakly informative priors was performed for DBP, SBP, and hypertension. Variants were marked in the plots according to minor allele count category (Figures 6.5, 6.6, and 6.7) or according to function (Figures A.12, A.13, and A.14 in the appendix).

Plots of estimates (regression coefficients) show three groups of variants. First, variants with consistent regression estimates in both approaches. These variants seem to be present in all categories of allele frequency and function. Second, variants whose effect sizes were shrunk to null (or to very small effect sizes) in the Bayesian model compared to the frequentist linear model. Such variants seem to be overrepresented in the categories with very low minor allele counts (MAC ≤ 20). Third, variants which show (much) larger effect sizes in the Bayesian linear models than in the frequentist linear models, also predominantly among variants with very low minor allele counts (MAC ≤ 20). These three groups can be found in the analyses of DBP, SBP, and hypertension, although the pattern is much weaker in the analysis of hypertension, which might be related to lower statistical power.

These three groups are also found in the corresponding plots of frequentist and Bayesian p values. The first group of variants shows a strong correlation of p values, which are, however, less significant (higher p values) in the Bayesian models than in the frequentist models. Again, these observations are similar for DBP and SBP, but groups 2 and 3 are hardly identifiable in the hypertension analysis. These differences in regression estimates and p values of single variants do not affect the number of identified (genome-wide significant) variants, because the affected variants remain far from the significance threshold of 1 × 10−7. Considering that the effect allele is the minor allele in the analysis, the majority of variants with rare alleles seem to increase blood pressure and tend to show stronger blood pressure increasing than decreasing effects.

Figure 6.5: Comparison of regression coefficients of standard frequentist normal linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (normal) including multiple variants (MVT, BHGLM) using weakly informative Cauchy (0, 1) prior distributions to predict diastolic blood pressure in empirical data (GAW19). Panels: (a) Estimate, (b) P value.

Figure 6.6: Comparison of regression coefficients of standard frequentist normal linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (normal) including multiple variants (MVT, BHGLM) using weakly informative Cauchy (0, 1) prior distributions to predict systolic blood pressure in empirical data (GAW19). Panels: (a) Estimate, (b) P value.

Figure 6.7: Comparison of regression coefficients of standard frequentist linear models including a single variant (SVT, Standard) versus (reduced) Bayesian hierarchical generalized linear models (binomial) including multiple variants (MVT, BHGLM) using weakly informative prior distributions to predict hypertension in empirical data (GAW19). Panels: (a) Estimate, (b) P value.

Computing resources

The applied statistical models were also evaluated regarding the system resources required for an exome-wide association analysis in empirical GAW19 data including 461,868 variants in 11,329 genes. All analyses were performed on the ALICE¹ high-performance computing cluster of the University of Leicester, UK. In total, the project used about 443 GB of disk storage.

¹ http://www2.le.ac.uk/offices/ithelp/services/hpc/alice

Frequentist single-variant analysis required about 2 hours of computing time and 760 MB of working memory using normal linear regression, and about 37 hours and 800 MB using Firth logistic regression. Bayesian hierarchical models needed between 10.5 and 12.5 hours of computing time and about 16 GB of working memory using a normal linear model, and about 21 hours of computing time and 15 GB of working memory using a binomial model. Extrapolating this to the whole exome including even-numbered chromosomes should approximately double these computing resources.

These large differences are only partly due to the statistical models; the software implementation also contributes. For example, frequentist single-variant analysis was performed using the EPACTS software, which extracts the genetic data directly from the compressed files in variant call format by efficient indexing of variants, and, therefore, the statistical analysis required only little working memory. In contrast, Bayesian hierarchical multiple-variant analysis was performed by reading all genetic data into working memory.
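The memory contrast described above can be illustrated with a minimal sketch that processes a compressed VCF file one record at a time instead of loading it completely. This is only an illustration of streaming access under simplifying assumptions; it does not reproduce the indexed access used by EPACTS, and the function names are hypothetical.

```python
import gzip

def stream_variants(path):
    """Yield (chromosome, position, reference, alternative) per VCF record.

    Only one line is held in memory at a time, so memory use does not grow
    with the number of variants in the file.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):      # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), fields[3], fields[4]

def count_variants(path):
    """Count records without materialising the whole file as a list."""
    return sum(1 for _ in stream_variants(path))
```

A model that conditions on all variants of a gene still needs the genotypes of that gene in memory at once, so a streaming reader would have to be combined with gene-wise batching to reduce the memory footprint of the multiple-variant analysis.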

7 Discussion

The following chapter summarises and discusses the presented results in the context of other publications and information from biological databases.

7.1 Summary

Bayesian generalized linear models with weakly informative prior distributions [Yi, Liu, Zhi & Li 2011] were applied to identify genetic associations between blood pressure and individual genetic variants and groups of rare and common genetic variants per gene. The overall data included 1,943 individuals of Mexican-American ancestry and 1,765,688 variants, which were sequenced in exomes of odd-numbered chromosomes. The data was provided by the organisers of the Genetic Analysis Workshop 19 (August 24, 2014, to August 27, 2014, in Vienna) [Genetic Analysis Workshop 2014a; Genetic Analysis Workshop 2014b]. The applied frequentist and Bayesian methods were evaluated using simulated data. Overall, the Bayesian approach identified joint effects of rare genetic variants on diastolic blood pressure in 23 genes and on systolic blood pressure in 21 genes in empirical data. However, few effects of individual genetic variants were detected. Additionally, standard frequentist single-variant analysis was performed as sensitivity analysis. A comparison of effect sizes of individual variants from the Bayesian approach including multiple variants (conditional effects) and the standard frequentist model including single variants (unconditional effects) highlighted a subgroup of rare genetic variants with substantially increased effect sizes in the Bayesian approach.

7.2 Statistical models

Statistical models were evaluated regarding their feasibility for genome-wide association studies on rare (and common) variants and groups of variants.

7.2.1 Frequentist models

Frequentist models showed good performance regarding bias and coverage in simulated data I. The proportion of false positive results (type I error) was controlled well in simulated data without genetic effects (simulated data Ia) and the proportion of true positive results (statistical power, 1 - type II error) in simulated data with genetic effects (simulated data Ib) was acceptable.

Frequentist models showed similar behaviour in partially simulated data (GAW19) with simulated genetic effects on blood pressure. The type I error threshold was not substantially violated and a small number of variants with simulated genetic effects was detected, which was larger than in the (reduced) Bayesian hierarchical models for the corresponding trait (except for hypertension). Six variants in four genes including MAP4 (Microtubule-Associated Protein 4), TNN (Tenascin N), LEPR (Leptin receptor), and CGN (Cingulin) with associations with diastolic or systolic blood pressure were identified. No variants or genes with effects on hypertension were found in any model. Biological functions are not discussed here because they are based on simulated data and, therefore, are probably medically not relevant in the real world.

In empirical data (GAW19), frequentist models were hardly able to identify genome-wide significant genetic associations with blood pressure (performed as sensitivity analysis). Only variant chr11:70331805 (C) in SHANK2 (SH3 And Multiple Ankyrin Repeat Domains 2) and chr11:118518698 (A) in the PHLDB1 (Pleckstrin Homology-Like Domain, Family B, Member 1) gene had effects on diastolic or systolic blood pressure.

7.2.2 Bayesian hierarchical models

Estimates and posterior probabilities of individual variants were computed, and three tests for group effects (i.e. coefficient, average effect, and joint effect) were performed.

Reduced Bayesian hierarchical model

The reduced Bayesian hierarchical model showed good performance in terms of bias, coverage and the rates of false positive (1 - specificity) and true positive associations (sensitivity) for tests of individual variant effects, average, and joint effects¹ in simulated data I. Tests of groups of variants were more sensitive regarding joint effects than average effects of groups of variants. Although individual variant tests showed similar power compared to standard frequentist analysis, a simple sum of allele counts of groups of variants was less sensitive than the group tests (average effect, joint effects) from a reduced Bayesian hierarchical model.

¹ The joint test did not give an estimate, but only a posterior probability (Bayesian p value).

The model behaviour was largely confirmed in simulated data II (GAW19). Only one genetic association each was found for DBP and SBP, in both cases with the same individual variant rs2230169 in the MAP4 gene. Additionally, tests of average and joint effects detected two genes (i.e. MAP4 and TNN) with common variant group effects, but no genes with rare variant group effects. An unexpected finding was an inflated proportion of false positive associations of the joint test in genes without simulated genetic effects in simulated data II (GAW19). An explanation for this finding might be provided by the simulation model of GAW19 [Genetic Analysis Workshop 2014a; Genetic Analysis Workshop 2014b], which is based on a polygenic model with a total heritability of blood pressure of 31.7% for DBP and 27.9% for SBP. Heritability which was not explained by the functional variants in the 128 candidate genes was distributed over random variants outside the candidate genes on odd-numbered chromosomes. The joint test might be highly sensitive to such minor genetic effects or other artifacts produced by the simulation model.

However, no inflation of false positive genetic associations based on the test of joint effects seemed present in empirical data. Instead, the test seemed sensitive enough to identify a large number of genes with genome-wide significant group effects of rare variants in relevant pathways.

Full Bayesian hierarchical model

Although the reduced Bayesian hierarchical model performed well in simulated data I, the test based on the estimated hierarchical group effect in a full Bayesian hierarchical model showed an inflated proportion of false positive results depending on the shrinkage parameter. Similar test characteristics were also found in simulated data II (GAW19). These test properties were in principle confirmed by the author of the package. He recommended controlling the rate of false positive results by the (scale) shrinkage parameter or using the test for joint effects (Yi, personal written communication, 2015-01-08). However, since the shrinkage parameter seemed to have different effects on groups of variants and distributions with strong shrinkage cannot be considered weakly informative, the joint test from a reduced Bayesian hierarchical model was preferred. Therefore, a full Bayesian hierarchical model including hierarchical group coefficients was only evaluated in simulated data and not used for empirical analysis since the results were not considered to be valid. Further simulations to select appropriate prior parameters for rare and common variants are recommended before applying this model.

7.2.3 Comparison of applied models

An evaluation of the reduced Bayesian hierarchical models and the applied tests of individual variant effects and grouped variant effects leads to slightly different conclusions depending on the data used.

In simulated data I, similar associations were estimated for individual variants in Bayesian hierarchical models including multiple variants compared to frequentist models including a single variant. Tests of joint effects were more sensitive than tests of average effects. Tests of average effects were more sensitive than frequentist tests of simple sums of allele counts. However, no confounding (correlations) between genetic effects was present here.

In simulated data II (GAW19), frequentist models including a single variant seemed more sensitive to discover simulated variant effects (and associated genes) in comparison to reduced Bayesian hierarchical models including multiple variants. However, the simulation model was based on an effect size spectrum from genetic association analysis on common variants, which featured relatively small effects.

In empirical data (GAW19), reduced Bayesian hierarchical models identified a large number of joint genetic effects of rare variants, which was not expected from the simulation results. In comparison to standard frequentist analysis of (unconditional) single variant effects, a subgroup of variants showed substantially increased (conditional) effects in Bayesian hierarchical analysis of multiple variants, which were adjusted for all other rare or common variants in the gene. These effects were far from reaching the genome-wide significance threshold of single-variant tests, and only became visible by analysing the joint effects of rare variants. Interestingly, no genome-wide significant effects were found in groups of common variants. Therefore, the large number of identified genes in a cohort of relatively modest sample size might be explained by a combination of exome sequencing (i.e. design) and conditioning on all other variants in the gene and aggregating these weighted variant effects per gene (i.e. statistical model).
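How conditioning on other variants in a gene can increase an individual effect estimate can be shown with a toy example. The sketch below uses ordinary least squares on made-up genotypes of two positively correlated variants with opposite effects; it is not the Bayesian hierarchical model applied in this thesis, and all data and function names are hypothetical.

```python
def centered_sum(x, y):
    """Sum of cross-products of x and y after centring both at their means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def marginal_slope(g, y):
    """Unconditional effect: slope of y regressed on a single variant."""
    return centered_sum(g, y) / centered_sum(g, g)

def conditional_slopes(g1, g2, y):
    """Conditional effects: slopes of y regressed on both variants jointly."""
    s11, s22, s12 = centered_sum(g1, g1), centered_sum(g2, g2), centered_sum(g1, g2)
    s1y, s2y = centered_sum(g1, y), centered_sum(g2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Toy genotypes (allele counts) of two correlated variants in eight individuals
g1 = [0, 0, 0, 1, 1, 1, 0, 2]
g2 = [0, 0, 1, 1, 1, 0, 0, 2]
# Noise-free trait: variant 1 raises and variant 2 lowers the trait by 2 units
y = [2 * a - 2 * b for a, b in zip(g1, g2)]

print("marginal effect of variant 1:", marginal_slope(g1, y))
print("conditional effects:", conditional_slopes(g1, g2, y))
```

In this constructed example the marginal estimate of variant 1 is attenuated because the opposing effect of the correlated variant 2 is not accounted for, whereas the joint model recovers both effects.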

7.3 Genetics of blood pressure

Genetic associations of variants and genes with blood pressure are summarised and compared with known genes and loci from the GWAS catalog associated with blood pressure, hypertension, or hypotension [Welter et al. 2014] (Version 2014-12-08). Function of the identified genes was annotated using the GeneCards database (Version 2014-12-08)¹ [Stelzer et al. 2011] and linked databases such as the Gene Ontology (GO, Version 2014-12-08)² [Ashburner et al. 2000], UniProt (Version 2014-12-08)³ [UniProt Consortium 2014] and Online Mendelian Inheritance in Man (OMIM, Version 2014-12-08)⁴ [Hamosh et al. 2005] databases, if no other references are given. Gene symbols interchangeably refer to the gene and the encoded protein here for readability. Blood pressure is discussed here as a single construct, which includes diastolic and systolic blood pressure and hypertension. The analysis will integrate the results from Bayesian generalized linear models on individual and grouped variant level (i.e. joint test) using weakly informative priors as the primary analysis. These findings are grouped according to their biological function and sorted by their biological relevance for blood pressure in the following text.

¹ http://www.genecards.org
² http://geneontology.org
³ http://www.uniprot.org
⁴ http://www.omim.org

7.3.1 Individual variant effects

Two associations of individual variants and diastolic blood pressure were identified, which were chr1:204214013 (T) in the PLEKHA6 (pleckstrin homology domain containing, family A member 6) gene and chr11:118518698 (A) in PHLDB1 (pleckstrin homology-like domain, family B, member 1), which were both related to the pleckstrin protein. The pleckstrin protein is found in blood platelets (thrombocytes), which are involved in hemostasis, a process to stop bleeding [Marcovitch 2010]. Effect sizes of these individual rare variant effects were considerably larger than effects which were reported for common variants. While common variant effects are usually much smaller than 10 mmHg, rare variants reached multiples of these effect sizes. Although this might raise doubts about the plausibility of such results, similar effect sizes were found in gene knock-out mouse models [Takahashi and Smithies 2004].

Additionally, one marginally significant finding in the Bayesian hierarchical generalized linear model approach on individual variant level, which was genome-wide significant in the frequentist single-variant analysis, should be mentioned. A rare, coding, synonymous, blood pressure increasing variant chr11:70331805 (T) in the SHANK2 (SH3 and multiple ankyrin repeat domains 2) gene was found. SHANK2 has not been associated with blood pressure in the past, but with autism spectrum disorders in humans [Berkel et al. 2010]. Diagnostic criteria of autism include deficits in social communication and social interaction and hypersensitivity to sensory input [American Psychiatric Association 2013]. Although this might seem biologically unrelated to blood pressure, genetic effects might be mediated by perceived psychological stress of individuals with autism-like behaviour in an intimate social situation with physical stimulation, such as a medical examination with a blood pressure cuff. Therefore, situational psychological factors should be considered here as an alternative interpretation to stable biological traits.

A comparison of effect sizes which were conditioned on other variants in the gene by a Bayesian hierarchical generalized linear model and unconditioned variant effects from single-variant analysis showed a subgroup of mostly rare variants with substantially larger conditional effect sizes. However, these effect size differences were not apparent on an individual variant level. As shown in the comparison plots, the Bayesian p values of these conditional effects do not reach the adapted significance threshold, but the larger conditional effects are expected to influence the number of identified individual variant effects in larger samples. Considering that the tests of joint variant effects resulted in dozens of signals of group effects of rare variants, one might argue that these effects were detectable in this cohort of moderate sample size only by conditioning and aggregating effects of rare variants.

7.3.2 Grouped variant effects

Cardiac growth and hemostasis

CAMTA2 (calmodulin binding transcription activator 2) is involved in histone deacetylase binding and chromatin binding. It may be a transcriptional coactivator of other genes involved in cardiac growth [Song et al. 2006]. PLEKHG2 (pleckstrin homology domain containing, family G (with RhoGef domain) member 2) and PHLDB1 (pleckstrin homology-like domain, family B, member 1) are associated with the pleckstrin protein. Pleckstrin is found in blood platelets (thrombocytes), which are involved in hemostasis, the process that stops bleeding [Marcovitch 2010]. SVEP1 (sushi, von Willebrand factor type A, EGF and pentraxin domain containing 1) has been related to chromatin binding and calcium ion binding, but not directly to blood pressure. However, von Willebrand factor type A, which is found in blood plasma, has been related to the aetiology of bleeding disorders [Ruggeri and Ware 1993].

So far, none of these genes has been reported in the GWAS catalog [Welter et al. 2014] to be associated with blood pressure, although genes with similar functions have been published, such as PLEKHA7 (pleckstrin homology domain containing, family A member 7) and PLEKHG1 (pleckstrin homology domain containing, family G (with RhoGef domain) member 1).

Sodium transport and membrane potential

SLC5A10 (solute carrier family 5, sodium/sugar cotransporter, member 10) may function as a kidney-specific, sodium-dependent mannose and fructose cotransporter. WNK2 (WNK lysine deficient protein kinase 2) acts as an inhibitor of sodium-coupled and potassium-coupled chloride cotransporters, regulating electrolyte homeostasis. TRPM2 (transient receptor potential cation channel, subfamily M, member 2) and TRPM4 (transient receptor potential cation channel, subfamily M, member 4) are related to calcium channel activity, which mediates the transport of ions across membranes, thereby depolarising the membrane. TRPM2 mediates ion flow in response to oxidative stress and confers susceptibility to cell death. TRPM4 is associated with progressive heart block. TRPM2 is an important paralog of TRPM4.

At present, the GWAS catalog [Welter et al. 2014] does not contain any of these genes; however, genes with similar functions in the renal sodium pathway are known, for example, SLC4A7 (solute carrier family 4, sodium bicarbonate cotransporter, member 7). Additional examples of genes associated with the renal sodium pathway were discussed in the background section, e.g. WNK1 and WNK4. Similar to TRPM2 and TRPM4, TRPA1 (transient receptor potential cation channel, subfamily A, member 1) is listed in the GWAS catalog [Welter et al. 2014].

Ciliary activity and ATP/GTPase activity

DNAH1 (dynein, axonemal, heavy chain 1), DNAH9 (dynein, axonemal, heavy chain 9), DNAH11 (dynein, axonemal, heavy chain 11), and DYNC2H1 (dynein, cytoplasmic 2, heavy chain 1) are associated with the dynein protein, a motor molecule which converts chemical energy into mechanical energy and is responsible for the transport of cellular “cargo” along microtubules and the movement of cilia. Mutations of the DNAH11 gene have been implicated in autosomal recessive Primary Ciliary Dyskinesia, which is characterised by a defect of cilia in the respiratory tract, resulting in reduced mucus clearance and respiratory infections among other symptoms. Progression of the disease can lead to hypoxemia (abnormally low level of oxygen in the blood), pulmonary hypertension, and chronic heart failure. USH2A (Usher syndrome 2A, autosomal recessive, mild) is related to collagen binding and myosin binding. Mutations of this gene can cause Usher syndrome, which is characterised by hearing loss as a result of impaired development and maintenance of hair cells (stereocilia) in the ear. Similarly, MYO1G (myosin IG) and MYO10 (myosin X) are actin-based motor molecules with ATPase activity, involved in cargo transport along actin filaments. MACF1 (microtubule-actin crosslinking factor 1) is involved in microtubule binding and calcium ion binding. It plays an important role in wound healing, epidermal cell migration, and actin-microtubule interactions.

TRIO (trio Rho guanine nucleotide exchange factor) and the above-mentioned PLEKHG2 both affect guanine nucleotide exchange factor activity, which in turn stimulates the activity of GTPase enzymes. GTPase enzymes can metabolise energy-rich guanosine triphosphate (GTP) molecules. GTP molecules are an important product of metabolising carbohydrates and fat. OBSCN (obscurin, cytoskeletal calmodulin and titin-interacting RhoGEF) also contains a guanine nucleotide exchange factor domain and may have a role in the organisation of myofibrils. DOCK6 (dedicator of cytokinesis 6) is an atypical guanine nucleotide exchange factor, associated with autosomal recessive Adams-Oliver Syndrome 2, a multiple congenital anomaly syndrome, e.g. absence of skin, with variable involvement of brain, eyes and the cardiovascular system. SIPA1L3 (signal-induced proliferation-associated 1 like 3) was also related to GTPase activator activity. ABCA13 (ATP-binding cassette, sub-family A (ABC1), member 13) is related to ATPase activity. RNF213 (ring finger protein 213) is also associated with ATPase activity and may play a role in angiogenesis. Mutations in the gene are associated with Moyamoya disease, which is a rare, progressive blood vessel disease caused by blocked arteries at the base of the brain, resulting in an increased risk for blood clots, strokes, and transient ischemic attacks.

At present, none of these genes has been listed in the GWAS catalog [Welter et al. 2014], but it contains MYO16 (myosin XVI), an actin-based motor molecule which is related to motor activity and actin filament binding. Additionally, the known DAW1 (alias: WDR69, dynein assembly factor with WDR repeat domains 1) gene supports this identified pathway.

Lipids

FASN (fatty acid synthase) catalyses the synthesis of long-chain saturated fatty acids, and PNPLA7 (patatin-like phospholipase domain containing 7) is implicated in the regulation of adipocyte differentiation. PHLDB1 is involved in phospholipid binding. At present, the GWAS catalog [Welter et al. 2014] features none of these genes with effects on blood pressure. However, Ríos-González et al. (2014) report a clinical association between hypertension and dyslipidemia, and significant associations involving lipid metabolism genes such as FABP2, LIPC, MTTP, LPL, APOA5, APOB, and LDLR.

Neuronal development

AGRN (Agrin) and DVL1 (dishevelled segment polarity protein 1) are important for the development of the neuromuscular junction. Mutations in AGRN are associated with a congenital myasthenic syndrome. SSPO (SCO-Spondin) contains a Thrombospondin Domain and may be implicated in the modulation of neuronal aggregation. Thrombospondins are proteins with anti-angiogenic abilities which were found in thrombocytes [Baenziger et al. 1971]. WDR62 (WD repeat domain 62) is involved in cerebral cortical development. Mutations can cause microencephaly and mental retardation. SDK2 (sidekick cell adhesion molecule 2) guides axon terminals of developing neurons to specific synapses. PTPRS (protein tyrosine phosphatase, receptor type, S) is involved in primary axonogenesis and axon guidance in embryos and in the molecular control of repair of neurons in adults. At present, none of these genes is present in the GWAS catalog [Welter et al. 2014].

Other functions

Of the remaining genes, which do not fall into the above-mentioned groups, only EHMT1 (euchromatic histone-lysine N-methyltransferase 1) should be mentioned. It is involved in methyltransferase activity and associated with Kleefstra syndrome, which results in symptoms such as heart defects, hypotonia, respiratory infections, and renal defects [Kleefstra et al. 2010].

7.4 Modelling of prior biological information

In the present study, no biological information was used to inform the prior distributions, because of limitations of the (full) Bayesian hierarchical model proposed by Yi, Liu, Zhi & Li (2011) and of the considered biological score, i.e. the Combined Annotation Dependent Depletion (CADD) score [Kircher et al. 2014]. Yi, Liu, Zhi & Li (2011) suggested using biological scores as means of the prior distributions of individual variants in a (full) Bayesian hierarchical model including a hierarchical group effect. The prior means and the estimated regression coefficients of the individual variants are interpreted as relative weights of the importance of the specific variant in the linear predictor (genetic score). Under these assumptions, values of the biological scores do not have to be prior estimates of the regression coefficients or even be on the same scale, because the hierarchical group effect will take such differences into account. However, tests of the hierarchical group effect showed an inflated rate of false positive effects, which made this model inapplicable. Even assuming this model performed correctly, some problems in modelling prior biological information remain:

First, a biological score has to be selected which is informative for the genetic variants under study. At present, the CADD score is the only score covering all genetic variants to predict their likely biological function in terms of pathogenicity. Other biological scores with a similar purpose are, for example, specific to coding, non-synonymous variants [Adzhubei et al. 2013] or non-coding variants [Ritchie et al. 2014].

Second, the scale of the selected CADD raw score ranges across negative and positive numbers indicating the relative differences in pathogenicity. However, naively applying this score to this Bayesian hierarchical model would bias the direction of effects of individual variants in the direction of their prior mean as a methodological artifact. Therefore, the raw CADD score has to be transformed to a positive scale which excludes the null, since Yi, Liu, Zhi & Li (2011) recommended not to remove any variants from the model because the additive effect of several such variants might become important. Since there is only one hierarchical group effect (with a single sign) and the prior means of all variants are set to positive values, the prior assumption is implicitly made that all variants share the same direction of effect, which might not be correct.

Third, after transforming the CADD score to a positive scale, the relative differences between the CADD values will probably not represent the relative differences of the regression coefficients.

Fourth, the scale (shrinkage) parameter is actually most important regarding the weight that prior biological information receives as compared to empirical data. While the prior mean is determined by some sort of biological knowledge, the degree of shrinkage is arbitrary and cannot be determined from the empirical or prior biological data. Although optimisation and validation of prior parameters might be based on known variant effects, such variants might not be available and the obtained settings might not be transferable to other biological models or domains.

In general, the formalisation of prior biological or other subjective knowledge in a Bayesian model raises a number of questions, and further discussions and simulations are needed before clear recommendations can be given.
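The second and fourth points can be made concrete with a minimal Python sketch. It is not the model of Yi, Liu, Zhi & Li (2011): it assumes a simple conjugate normal-normal setting with a known standard error, the shift transformation `to_positive_scale` is a hypothetical choice made only for illustration, and all numbers are invented rather than real CADD scores or estimates.

```python
from typing import List, Sequence


def to_positive_scale(raw_scores: Sequence[float], offset: float = 1.0) -> List[float]:
    """Shift raw scores so that the smallest value maps to `offset` (> 0).

    Hypothetical transformation for illustration only: it preserves the
    differences between scores, but not their ratios.
    """
    if offset <= 0:
        raise ValueError("offset must be positive so that zero is excluded")
    lowest = min(raw_scores)
    return [score - lowest + offset for score in raw_scores]


def posterior_mean(prior_mean: float, prior_sd: float,
                   estimate: float, std_error: float) -> float:
    """Posterior mean of one coefficient in a conjugate normal-normal model.

    The result is a precision-weighted average of the prior mean and the
    data estimate, so the prior standard deviation controls the shrinkage.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    weight_prior = prior_precision / (prior_precision + data_precision)
    return weight_prior * prior_mean + (1.0 - weight_prior) * estimate


# Invented raw scores on a scale with negative and positive values
raw = [-2.0, 0.5, 3.0]
prior_means = to_positive_scale(raw)

# Identical prior mean and data, but two different (arbitrary) prior scales
tight = posterior_mean(prior_means[2], prior_sd=1.0, estimate=0.0, std_error=2.0)
loose = posterior_mean(prior_means[2], prior_sd=10.0, estimate=0.0, std_error=2.0)
print(prior_means, tight, loose)
```

With the narrow prior (standard deviation 1) the posterior mean stays close to the positive prior mean even though the data estimate is zero, whereas with the wide prior (standard deviation 10) it moves close to the data estimate. The prior mean and the data are the same in both cases, which illustrates why the choice of the scale parameter matters more than the score itself.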

7.5 Strengths and weaknesses

Strengths and weaknesses of the present study are discussed in the following sections.

7.5.1 Topic

This study addresses an important question in genetic epidemiology regarding the effects of rare variants on complex traits, such as blood pressure. Methods to examine such effects are expected to become more important in the future because of the decreasing costs and increasing availability of genotyping, imputing, and sequencing technologies [Panoutsopoulou et al. 2013].

The study of genetic effects has had only limited effect on clinical practice so far. In addition, associations of rare variants with clinical traits might only be relevant for the small number of individuals who carry these variants. However, rare variants can also make important contributions to the understanding of genetic effects and pathways, which might have clinical impact in the future.

7.5.2 Method

The applied Bayesian hierarchical generalized linear model as proposed by Yi, Liu, Zhi & Li (2011) offers a statistical framework to examine a broad range of statistical models including various distributions of endpoints (categorical and continuous), non-genetic covariates, and genetic predictors. The Bayesian approach features an interface to include prior biological information, which is widely available in public databases. The efficient estimation algorithm allows the estimation of conditional genetic effects of thousands of variants and genes in genome-wide analysis.

Bayesian hierarchical models and their implementation [Yi, Liu, Zhi & Li 2011] have not been widely applied in genetic epidemiology so far and are still considered under development. Limitations of the method regarding the estimation of hierarchical group effects and the integration of prior biological information were found. However, alternative solutions to test group effects of genetic variants could be applied, and conceptual problems in integrating prior biological information were discussed.

7.5.3 Model

The applied model considered group effects of all sequenced rare and common variants per gene on odd-numbered chromosomes. Therefore, a comprehensive evaluation of these effects was possible (within the scope of the available data). Group effects were based on individual effects, which were adjusted for sex, age, two principal components and all other variants in the gene. However, if a gene did not contain a sufficient number of rare or common variants, i.e. at least three variants per group, to estimate the group effects, the gene was excluded from the analysis. Additionally, the model conditioned only on other variants within the same gene. However, effects might also depend on correlated variants in other genetic regions. For example, Hoggart et al. (2008) applied a Bayesian approach which simultaneously estimated the conditional effects of all variants from a genome-wide study which included up to 500,000 variants.
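The exclusion rule described above (at least three variants per group) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the minor allele frequency cut-off of 0.01 separating rare from common variants, the reading that both groups must reach the minimum, the function name `genes_with_enough_variants`, and the example data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (gene symbol, variant identifier, minor allele frequency)
Variant = Tuple[str, str, float]


def genes_with_enough_variants(variants: Iterable[Variant],
                               maf_threshold: float = 0.01,
                               min_per_group: int = 3) -> List[str]:
    """Return genes with at least `min_per_group` rare AND common variants.

    `maf_threshold` is an assumed cut-off: variants with a minor allele
    frequency below it are counted as rare, all others as common.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"rare": 0, "common": 0})
    for gene, _variant_id, maf in variants:
        group = "rare" if maf < maf_threshold else "common"
        counts[gene][group] += 1
    return sorted(
        gene for gene, c in counts.items()
        if c["rare"] >= min_per_group and c["common"] >= min_per_group
    )


# Invented example: GENE_A has three rare and three common variants,
# GENE_B has only one of each and is therefore excluded.
example = [
    ("GENE_A", "v1", 0.001), ("GENE_A", "v2", 0.002), ("GENE_A", "v3", 0.004),
    ("GENE_A", "v4", 0.10), ("GENE_A", "v5", 0.20), ("GENE_A", "v6", 0.30),
    ("GENE_B", "v7", 0.001), ("GENE_B", "v8", 0.25),
]
kept = genes_with_enough_variants(example)
print(kept)
```

Lowering `min_per_group` keeps more genes in the analysis, but the group effects of genes with very few variants would then rest on very little information.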

7.6 Conclusion

Bayesian generalized linear models can successfully be applied in genome-wide association analyses using efficient estimation algorithms as proposed by Yi, Liu, Zhi & Li (2011). Conditional effects of hundreds of variants can be estimated in a single model, which can result in substantial changes of genetic effects, especially of rare variants. Aggregating these conditional effects in groups of rare variants per gene was found to be a sensitive and reliable method to identify novel genes and pathways. Future studies should further evaluate the effect of various weakly informative priors on model characteristics in systematic simulations. Current Bayesian hierarchical models should be extended to include various formats of biological information. Finally, Bayesian hierarchical generalized linear models with weakly informative prior distributions are recommended for application in genetic association studies including a larger number of individuals and variants which optimally cover the whole genome by sequencing of rare and common variants.

References

Adzhubei, I., Jordan, D. M., and Sunyaev, S. R. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics 7, Unit 7.20.

Almasy, L., Dyer, T. D., Peralta, J. M., Jun, G., Wood, A. R., Fuchsberger, C., Almeida, M. A., Kent, J. W., Fowler, S., Blackwell, T. W., Puppala, S., Kumar, S., Curran, J. E., Lehman, D., Abecasis, G., Duggirala, R., Blangero, J., and The T2D-GENES Consortium (2014). Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees. BMC Proceedings 8 (Suppl 1), S2.

American Psychiatric Association (2013). Diagnostic and statistical manual of mental disorders: DSM-5. Washington, DC, USA: American Psychiatric Association.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics 25 (1), 25–9.

Baenziger, N. L., Brodie, G. N., and Majerus, P. W. (1971). A thrombin-sensitive protein of human platelet membranes. Proceedings of the National Academy of Sciences (PNAS), USA 68 (1), 240–3.

Bayarri, M. J. and Berger, J. O. (2004). The Interplay of Bayesian and Frequentist Analysis. Statistical Science 19 (1), 58–80.

Below, J. E., Gamazon, E. R., Morrison, J. V., Konkashbaev, A., Pluzhnikov, A., McKeigue, P. M., Parra, E. J., Elbein, S. C., Hallman, D. M., Nicolae, D. L., Bell, G. I., Cruz, M., Cox, N. J., and Hanis, C. L. (2011). Genome-wide association and meta-analysis in populations from Starr County, Texas, and Mexico City identify type 2 diabetes susceptibility loci and enrichment for expression quantitative trait loci in top signals. Diabetologia 54 (8), 2047–55.

Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of p values and evidence. Journal of the American Statistical Society 68, 966–969.

Berkel, S., Marshall, C. R., Weiss, B., Howe, J., Roeth, R., Moog, U., Endris, V., Roberts, W., Szatmari, P., Pinto, D., Bonin, M., Riess, A., Engels, H., Sprengel, R., Scherer, S. W., and Rappold, G. A. (2010). Mutations in the SHANK2 synaptic scaffolding gene in autism spectrum disorder and mental retardation. Nature Genetics 42 (6), 489–91.

Berndt, S. I., Gustafsson, S., Mägi, R., Ganna, A., Wheeler, E., Feitosa, M. F., Justice, A. E., Monda, K. L., Croteau-Chonka, D. C., Day, F. R., Esko, T., Fall, T., Ferreira, T., Gentilini, D., Jackson, A. U., Luan, J., Randall, J. C., Vedantam, S., Willer, C. J., Winkler, T. W., Wood, A. R., Workalemahu, T., Hu, Y. J., Lee, S. H., Liang, L., Lin, D. Y., Min, J. L., Neale, B. M., Thorleifsson, G., Yang, J., Albrecht, E., Amin, N., Bragg-Gresham, J. L., Cadby, G., den Heijer, M., Eklund, N., Fischer, K., Goel, A., Hottenga, J. J., Huffman, J. E., Jarick, I., Johansson, A., Johnson, T., Kanoni, S., Kleber, M. E., König, I. R., Kristiansson, K., Kutalik, Z., Lamina, C., Lecoeur, C., Li, G., Mangino, M., McArdle, W. L., Medina-Gomez, C., Müller-Nurasyid, M., Ngwa, J. S., Nolte, I. M., Paternoster, L., Pechlivanis, S., Perola, M., Peters, M. J., Preuss, M., Rose, L. M., Shi, J., Shungin, D., Smith, A. V., Strawbridge, R. J., Surakka, I., Teumer, A., Trip, M. D., Tyrer, J., Van Vliet-Ostaptchouk, J. V., Vandenput, L., Waite, L. L., Zhao, J. H., Absher, D., Asselbergs, F. W., Atalay, M., Attwood, A. P., Balmforth, A. J., Basart, H., Beilby, J., Bonnycastle, L. L., Brambilla, P., Bruinenberg, M., Campbell, H., Chasman, D. I., Chines, P. S., Collins, F. S., Connell, J. M., Cookson, W. O., de Faire, U., de Vegt, F., Dei, M., Dimitriou, M., Edkins, S., Estrada, K., Evans, D. M., Farrall, M., Ferrario, M. M., Ferrières, J., Franke, L., Frau, F., Gejman, P. V., Grallert, H., Grönberg, H., Gudnason, V., Hall, A. S., Hall, P., Hartikainen, A. L., Hayward, C., Heard-Costa, N. L., Heath, A. C., Hebebrand, J., Homuth, G., Hu, F. B., Hunt, S. E., Hyppönen, E., Iribarren, C., Jacobs, K. B., Jansson, J. O., Jula, A., Kähönen, M., Kathiresan, S., Kee, F., Khaw, K. T., Kivimäki, M., Koenig, W., Kraja, A. T., Kumari, M., Kuulasmaa, K., Kuusisto, J., Laitinen, J. H., Lakka, T. A., Langenberg, C., Launer, L. J., Lind, L., Lindström, J., Liu, J., Liuzzi, A., Lokki, M. L., Lorentzon, M., Madden, P. A., Magnusson, P. K., Manunta, P., Marek, D., März, W., Mateo Leach, I., McKnight, B., Medland, S. E., Mihailov, E., Milani, L., Montgomery, G. W., Mooser, V., Mühleisen, T. W., Munroe, P. B., Musk, A. W., Narisu, N., Navis, G., Nicholson, G., Nohr, E. A., Ong, K. K., Oostra, B. A., Palmer, C. N., Palotie, A., Peden, J. F., Pedersen, N., Peters, A., Polasek, O., Pouta, A., Pramstaller, P. P., Prokopenko, I., Pütter, C., Radhakrishnan, A., Raitakari, O., Rendon, A., Rivadeneira, F., Rudan, I., Saaristo, T. E., Sambrook, J. G., Sanders, A. R., Sanna, S., Saramies, J., Schipf, S., Schreiber, S., Schunkert, H., Shin, S. Y., Signorini, S., Sinisalo, J., Skrobek, B., Soranzo, N., Stanáková, A., Stark, K., Stephens, J. C., Stirrups, K., Stolk, R. P., Stumvoll, M., Swift, A. J., Theodoraki, E. V., Thorand, B., Tregouet, D. A., Tremoli, E., Van der Klauw, M. M., van Meurs, J. B., Vermeulen, S. H., Viikari, J., Virtamo, J., Vitart, V., Waeber, G., Wang, Z., Widén, E., Wild, S. H., Willemsen, G., Winkelmann, B. R., Witteman, J. C., Wolffenbuttel, B. H., Wong, A., Wright, A. F., Zillikens, M. C., Amouyel, P., Boehm, B. O., Boerwinkle, E., Boomsma, D. I., Caulfield, M. J., Chanock, S. J., Cupples, L. A., Cusi, D., Dedoussis, G. V., Erdmann, J., Eriksson, J. G., Franks, P. W., Froguel, P., Gieger, C., Gyllensten, U., Hamsten, A., Harris, T. B., Hengstenberg, C., Hicks, A. A., Hingorani, A., Hinney, A., Hofman, A., Hovingh, K. G., Hveem, K., Illig, T., Jarvelin, M. R., Jöckel, K. H., Keinanen-Kiukaanniemi, S. M., Kiemeney, L. A., Kuh, D., Laakso, M., Lehtimäki, T., Levinson, D. F., Martin, N. G., Metspalu, A., Morris, A. D., Nieminen, M. S., Njølstad, I., Ohlsson, C., Oldehinkel, A. J., Ouwehand, W. H., Palmer, L. J., Penninx, B., Power, C., Province, M. A., Psaty, B. M., Qi, L., Rauramaa, R., Ridker, P. M., Ripatti, S., Salomaa, V., Samani, N. J., Snieder, H., Sørensen, T. I., Spector, T. D., Stefansson, K., Tönjes, A., Tuomilehto, J., Uitterlinden, A. G., Uusitupa, M., van der Harst, P., Vollenweider, P., Wallaschofski, H., Wareham, N. J., Watkins, H., Wichmann, H. E., Wilson, J. F., Abecasis, G. R., Assimes, T. L., Barroso, I., Boehnke, M., Borecki, I. B., Deloukas, P., Fox, C. S., Frayling, T., Groop, L. C., Haritunian, T., Heid, I. M., Hunter, D., Kaplan, R. C., Karpe, F., Moffatt, M. F., Mohlke, K. L., O’Connell, J. R., Pawitan, Y., Schadt, E. E., Schlessinger, D., Steinthorsdottir, V., Strachan, D. P., Thorsteinsdottir, U., van Duijn, C. M., Visscher, P. M., Di Blasio, A. M., Hirschhorn, J. N., Lindgren, C. M., Morris, A. P., Meyre, D., Scherag, A., McCarthy, M. I., Speliotes, E. K., North, K. E., Loos, R. J., and Ingelsson, E. (2013). Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nature Genetics 45 (5), 501–12.

Bickeböller, H., Bailey, J. N., Beyene, J., Cantor, R. M., Cordell, H. J., Culverhouse, R. C., Engelman, C. D., Fardo, D. W., Ghosh, S., König, I. R., Bermejo, J. L., Melton, P. E., Santorico, S. A., Satten, G. A., Sun, L., Tintle, N. L., Ziegler, A., MacCluer, J. W., and Almasy, L. (2014). Genetic Analysis Workshop 18: Methods and strategies for analyzing human sequence and phenotype data in members of extended pedigrees. BMC Proceedings 8 (Suppl 1), S1.

Casella, G. and Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association 82 (397), 106–111.

Coletta, D. K., Schneider, J., Hu, S. L., Dyer, T. D., Puppala, S., Farook, V. S., Arya, R., Lehman, D. M., Blangero, J., DeFronzo, R. A., Duggirala, R., and Jenkinson, C. P. (2009). Genome-wide linkage scan for genes influencing plasma triglyceride levels in the Veterans Administration Genetic Epidemiology Study. Diabetes 58 (1), 279–284.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., and 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics 27 (15), 2156–2158.

Daviglus, M. L., Talavera, G. A., Avilés-Santa, M. L., Allison, M., Cai, J., Criqui, M. H., Gellman, M., Giachello, A. L., Gouskova, N., Kaplan, R. C., LaVange, L., Penedo, F., Perreira, K., Pirzada, A., Schneiderman, N., Wassertheil-Smoller, S., Sorlie, P. D., and Stamler, J. (2012). Prevalence of major cardiovascular risk factors and cardiovascular diseases among Hispanic/Latino individuals of diverse backgrounds in the United States. Journal of the American Medical Association 308 (17), 1775–1784.

de Winter, J. C. F. (2013). Using the Student’s t-test with extremely small sample sizes. Practical Assessment, Research, and Evaluation 18 (10), 1–12.

Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics 55 (4), 997–1004.

Ehret, G. B. and the International Consortium for Blood Pressure Genome-Wide Association Studies (2011). Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature 478 (7367), 103–109.

Evett, I. W., Jackson, G., and Lambert, J. A. (2000). The impact of the principles of evidence interpretation on the structure and content of statements. Science & Justice 40 (4), 233–239.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika 80 (1), 27–38.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 515–533.

Gelman, A. (2013). Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electronic Journal of Statistics 7, 2595–2602.

Gelman, A. (2014). Bayesian data analysis (3rd ed.). London: Chapman and Hall/CRC.

Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2 (4), 1360–1383.

Genetic Analysis Workshop (2014a). GAW 19 - Data Description. Unpublished manuscript, provided by the Genetic Analysis Workshop 19 organizers on July 7, 2014.

Genetic Analysis Workshop (2014b). The T2D-GENES Project1 GAW19 Dataset. Unpublished manuscript, provided by the Genetic Analysis Workshop 19 organizers on May 11, 2014.

Ghosh, J. K. and Delampady, M. (2014). Bayesian P-Values. In M. Lovric (Ed.), International Encyclopedia of Statistical Science, pp. 101–104. Berlin: Springer.

Gibson, G. (2011). Rare and common variants: twenty arguments. Nature Reviews Genetics 13 (2), 135–45.

Gill, J. (2008). Bayesian Methods (2nd ed.). Boca Raton, FL, USA: Chapman & Hall/CRC.

Graham, J. W. (2009). Missing data analysis: making it work in the real world. Annual Review of Psychology 60, 549–76.

Greenland, S. and Poole, C. (2013). Living with p values: resurrecting a Bayesian perspective on frequentist statistics. Epidemiology 24 (1), 62–68.

Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33 (Database issue), D514–7.

Hanis, C. L., Ferrell, R. E., Barton, S. A., Aguilar, L., Garza-Ibarra, A., Tulloch, B. R., Garcia, C. A., and Schull, W. J. (1983). Diabetes among Mexican Americans in Starr County, Texas. American Journal of Epidemiology 118 (5), 659–672.

He, L., Pitkäniemi, J., Sarin, A. P., Salomaa, V., Sillanpää, M. J., and Ripatti, S. (2015). Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data. Genetic Epidemiology 39 (2), 89–100.

Hoggart, C. J., Whittaker, J. C., Iorio, M. D., and Balding, D. J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics 4 (7), e1000130.

Hunt, K. J., Lehman, D. M., Arya, R., Fowler, S., Leach, R. J., Göring, H. H. H., Almasy, L., Blangero, J., Dyer, T. D., Duggirala, R., and Stern, M. P. (2005). Genome-wide linkage analyses of type 2 diabetes in Mexican Americans: the San Antonio Family Diabetes/Gallbladder Study. Diabetes 54 (9), 2655–2662.

Igl, W. and Thompson, J. (2014). A joint model for rare and common variant effects on blood pressure using a Bayesian hierarchical generalized linear model. In Genetic Analysis Workshop (Ed.), GAW19: Analysis of human sequence, blood pressure, and gene expression data [Conference Proceedings]. Conference proceedings of the Genetic Analysis Workshop 19 (GAW19), Vienna, August 24 to 27, 2014.

Iyengar, S. K. and Elston, R. C. (2007). The genetic basis of complex traits: rare variants or “common gene, common disease”? Methods in Molecular Biology 376, 71–84.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London: Series A (Mathematical and Physical Sciences) 186 (1007), 453–461.

Jeffreys, H. (1961). The theory of probability (3rd ed.). Oxford University Press.

Ji, W., Foo, J. N., O’Roak, B. J., Zhao, H., Larson, M. G., Simon, D. B., Newton-

Cheh, C., State, M. W., Levy, D., and Lifton, R. P. (2008). Rare independent

mutations in renal salt handling genes contribute to blood pressure variation. Nature

Genetics 40 (5), 592–9.

Joyce, A. R. and Palsson, B. O. (2006). The model organism as a system: integrating

‘omics’ data sets. Nature Reviews Molecular Cell Biology 7 (3), 198–210.

Julious, S. A. and Mullee, M. A. (1994). Confounding and Simpson’s paradox. British

Medical Journal 309 (6967), 1480–1481.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical

Association 90 (430), 791.

Kato, N., Takeuchi, F., Tabara, Y., Kelly, T. N., Go, M. J., Sim, X., Tay, W. T.,

Chen, C. H., Zhang, Y., Yamamoto, K., Katsuya, T., Yokota, M., Kim, Y. J., Ong,

R. T., Nabika, T., Gu, D., Chang, L. C., Kokubo, Y., Huang, W., Ohnaka, K.,

Yamori, Y., Nakashima, E., Jaquish, C. E., Lee, J. Y., Seielstad, M., Isono, M.,

Hixson, J. E., Chen, Y. T., Miki, T., Zhou, X., Sugiyama, T., Jeon, J. P., Liu,

J. J., Takayanagi, R., Kim, S. S., Aung, T., Sung, Y. J., Zhang, X., Wong, T. Y.,

Han, B. G., Kobayashi, S., Ogihara, T., Zhu, D., Iwai, N., Wu, J. Y., Teo, Y. Y.,

Tai, E. S., Cho, Y. S., and He, J. (2011). Meta-analysis of genome-wide association

studies identifies common variants associated with blood pressure variation in east

Asians. Nature Genetics 43 (6), 531–8.

Kelly, T. N., Takeuchi, F., Tabara, Y., Edwards, T. L., Kim, Y. J., Chen, P., Li, H.,

Wu, Y., Yang, C. F., Zhang, Y., Gu, D., Katsuya, T., Ohkubo, T., Gao, Y. T., Go,

M. J., Teo, Y. Y., Lu, L., Lee, N. R., Chang, L. C., Peng, H., Zhao, Q., Nakashima,

E., Kita, Y., Shu, X. O., Kim, N. H., Tai, E. S., Wang, Y., Adair, L. S., Chen,

C. H., Zhang, S., Li, C., Nabika, T., Umemura, S., Cai, Q., Cho, Y. S., Wong,

T. Y., Zhu, J., Wu, J. Y., Gao, X., Hixson, J. E., Cai, H., Lee, J., Cheng, C. Y.,

Rao, D. C., Xiang, Y. B., Cho, M. C., Han, B. G., Wang, A., Tsai, F. J., Mohlke,

K., Lin, X., Ikram, M. K., Lee, J. Y., Zheng, W., Tetsuro, M., Kato, N., and He, J.

(2013). Genome-wide association study meta-analysis reveals transethnic replication

of mean arterial and pulse pressure loci. Hypertension 62 (5), 853–9.

Kircher, M., Witten, D. M., Jain, P., O’Roak, B. J., Cooper, G. M., and Shendure,

J. (2014). A general framework for estimating the relative pathogenicity of human

genetic variants. Nature Genetics 46 (3), 310–315.

Kleefstra, T., Nillesen, W. M., and Yntema, H. G. (2010). Kleefstra Syndrome. Seattle,

WA, USA: Gene Reviews.

Knowler, W. C., Coresh, J., Elston, R. C., Freedman, B. I., Iyengar, S. K., Kimmel,

P. L., Olson, J. M., Plaetke, R., Sedor, J. R., Seldin, M. F., and Family Investigation

of Nephropathy and Diabetes Research Group (2005). The Family Investigation of

Nephropathy and Diabetes (FIND): design and methods. Journal of Diabetes and

its Complications 19 (1), 1–9.

Kruschke, J. (2011). Doing Bayesian data analysis. London, UK: Academic Press.

Liu, C., Kraja, A. T., Smith, J. A., Brody, J. A., Franceschini, N., Morrison, A. C., Lu,

Y., Weiss, S., Auer, P. L., Guo, X., Chu, A. Y., Jakobsdóttir, J., Bis, J. C., Zhao,

W., Tsosie, K., Amin, N., Mei, H., Newton-Cheh, C., Palmas, W., Liu, Y., Loos,

R. J. F., Edwards, T. L., Völker, U., Fornage, M., van Duijn, C. M., Borecki, I.,

Ehret, G., Gudnason, V., Chasman, D., Levy, D., and CHARGE Plus Exome Chip

Blood Pressure Consortium (2014). Meta-analysis of Variants on the Exome Chip

in 120,700 Individuals of European Ancestry Identifies Multiple Rare and Common

Loci for Blood Pressure. In Poster abstracts of the American Society of Human

Genetics 64th annual meeting, pp. 2099S.

Ma, C., Blackwell, T., Boehnke, M., Scott, L. J., and GoT2D investigators (2013).

Recommended joint and meta-analysis strategies for case-control association testing

of single low-count variants. Genetic Epidemiology 37 (6), 539–50.

Ma, L., Han, S., Yang, J., and Da, Y. (2010). Multi-locus test conditional on confirmed

effects leads to increased power in genome-wide association studies. PloS One 5 (11),

e15006.

Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter,

D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., Cho, J. H.,

Guttmacher, A. E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C. N., Slatkin, M.,

Valle, D., Whittemore, A. S., Boehnke, M., Clark, A. G., Eichler, E. E., Gibson, G.,

Haines, J. L., Mackay, T. F., McCarroll, S. A., and Visscher, P. M. (2009). Finding

the missing heritability of complex diseases. Nature 461 (7265), 747–53.

Marcovitch, H. (Ed.) (2010). Black’s Medical Dictionary (42nd ed.). Black Publishers.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd, revised ed.).

Boca Raton: CRC Press.

Mitchell, B. D., Kammerer, C. M., Blangero, J., Mahaney, M. C., Rainwater, D. L.,

Dyke, B., Hixson, J. E., Henkel, R. D., Sharp, R. M., Comuzzie, A. G., VandeBerg,

J. L., Stern, M. P., and MacCluer, J. W. (1996a). Genetic and environmental contributions

to cardiovascular risk factors in Mexican Americans. The San Antonio

Family Heart Study. Circulation 94 (9), 2159–70.

Mitchell, B. D., Kammerer, C. M., Blangero, J., Mahaney, M. C., Rainwater, D. L.,

Dyke, B., Hixson, J. E., Henkel, R. D., Sharp, R. M., Comuzzie, A. G., VandeBerg,

J. L., Stern, M. P., and MacCluer, J. W. (1996b). Genetic and environmental contributions

to cardiovascular risk factors in Mexican Americans. The San Antonio

Family Heart Study. Circulation 94 (9), 2159–2170.

Mitra, R. and Müller, P. (2014). Hierarchical Bayesian models for ChIP-seq data. In

S. Datta & D. Nettleton (Eds.), Statistical analysis of next generation sequencing

data, Frontiers in probability and the statistical sciences, pp. 297–314. Berlin:

Springer.

Moutsianas, L. and Morris, A. P. (2014). Methodology for the analysis of rare genetic

variation in genome-wide association and re-sequencing studies of complex human

traits. Briefings in Functional Genomics 13 (5), 362–70.

Munroe, P. B., Barnes, M. R., and Caulfield, M. J. (2013). Advances in blood pressure

genomics. Circulation Research 112 (10), 1365–79.

National Institute for Health and Clinical Excellence (2006). Hypertension: Management

of hypertension in adults in primary care. London: NICE.

Neale, B. M. and Sham, P. C. (2004). The future of association studies: gene-based

analysis and replication. American Journal of Human Genetics 75 (3), 353–62.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of

the Royal Statistical Society: Series A (General) 135, 370–384.

Panoutsopoulou, K., Tachmazidou, I., and Zeggini, E. (2013). In search of low-frequency

and rare variants affecting complex traits. Human Molecular Genetics 22, R16–21.

Peacock, J. L. and Peacock, P. J. (2011). Oxford Handbook of Medical Statistics. Oxford,

UK: Oxford University Press.

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and

Reich, D. (2006). Principal components analysis corrects for stratification in genome-

wide association studies. Nature Genetics 38 (8), 904–9.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller,

J., Sklar, P., de Bakker, P. I., Daly, M. J., and Sham, P. C. (2007). PLINK: a tool

set for whole-genome association and population-based linkage analyses. American

Journal of Human Genetics 81 (3), 559–75.

Ritchie, G. R., Dunham, I., Zeggini, E., and Flicek, P. (2014). Functional annotation of

noncoding sequence variants. Nature Methods 11 (3), 294–6.

Ríos-González, B. E., Ibarra-Cortés, B., Ramírez-López, G., Sánchez-Corona, J., and

Magaña-Torres, M. T. (2014). Association of polymorphisms of genes involved in

lipid metabolism with blood pressure and lipid values in Mexican hypertensive

individuals. Disease Markers 2014, 150358.

Ruggeri, Z. M. and Ware, J. (1993). von Willebrand factor. FASEB Journal 7 (2), 308–

16.

Schork, N. J., Murray, S. S., Frazer, K. A., and Topol, E. J. (2009). Common vs.

rare allele hypotheses for complex diseases. Current Opinion in Genetics &

Development 19 (3), 212–9.

Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of p values for testing

precise null hypotheses. The American Statistician 55 (1), 62–71.

Song, K., Backs, J., McAnally, J., Qi, X., Gerard, R. D., Richardson, J. A., Hill, J. A.,

Bassel-Duby, R., and Olson, E. N. (2006). The transcriptional coactivator CAMTA2

stimulates cardiac growth by opposing class II histone deacetylases. Cell 125 (3),

453–66.

Stelzer, G., Dalah, I., Stein, T. I., Satanower, Y., Rosen, N., Nativ, N., Oz-Levi, D.,

Olender, T., Belinky, F., Bahir, I., Krug, H., Perco, P., Mayer, B., Kolker, E.,

Safran, M., and Lancet, D. (2011). In-silico human genomics with GeneCards.

Human Genomics 5 (6), 709–17.

Stephens, M. and Balding, D. J. (2009). Bayesian statistical methods for genetic

association studies. Nature Reviews Genetics 10 (10), 681–90.

Stringer, S., Wray, N. R., Kahn, R. S., and Derks, E. M. (2011). Underestimated effect

sizes in GWAS: fundamental limitations of single SNP analysis for dichotomous

phenotypes. PloS One 6 (11), e27964.

Takahashi, N. and Smithies, O. (2004). Human genetics, animal models and computer

simulations for studying hypertension. Trends in Genetics 20 (3), 136–45.

Thompson, J. (2014). Bayesian data analysis with Stata. College Station, Texas, USA:

Stata Press.

Trynka, G., Hunt, K. A., Bockett, N. A., Romanos, J., Mistry, V., Szperl, A., Bakker,

S. F., Bardella, M. T., Bhaw-Rosun, L., Castillejo, G., de la Concha, E. G.,

de Almeida, R. C., Dias, K.-R. M., van Diemen, C. C., Dubois, P. C. A., Duerr,

R. H., Edkins, S., Franke, L., Fransen, K., Gutierrez, J., Heap, G. A. R., Hrdlickova,

B., Hunt, S., Plaza Izurieta, L., Izzo, V., Joosten, L. A. B., Langford, C.,

Mazzilli, M. C., Mein, C. A., Midah, V., Mitrovic, M., Mora, B., Morelli, M., Nutland,

S., Nunez, C., Onengut-Gumuscu, S., Pearce, K., Platteel, M., Polanco, I., Potter,

S., Ribes-Koninckx, C., Ricano-Ponce, I., Rich, S. S., Rybak, A., Santiago, J. L.,

Senapati, S., Sood, A., Szajewska, H., Troncone, R., Varadé, J., Wallace, C.,

Wolters, V. M., Zhernakova, A., Spanish Consortium on the Genetics of Coeliac

Disease (CEGEC), PreventCD Study Group, Wellcome Trust Case Control

Consortium (WTCCC), Thelma, B. K., Cukrowska, B., Urcelay, E., Bilbao, J. R., Mearin,

M. L., Barisani, D., Barrett, J. C., Plagnol, V., Deloukas, P., Wijmenga, C., and van

Heel, D. A. (2011). Dense genotyping identifies and localizes multiple common and

rare variant association signals in celiac disease. Nature Genetics 43 (12), 1193–1201.

UK Parkinson’s Disease Consortium, Wellcome Trust Case Control Consortium 2,

Spencer, C. C. A., Plagnol, V., Strange, A., Gardner, M., Paisan-Ruiz, C., Band, G.,

Barker, R. A., Bellenguez, C., Bhatia, K., Blackburn, H., Blackwell, J. M., Bramon,

E., Brown, M. A., Brown, M. A., Burn, D., Casas, J.-P., Chinnery, P. F., Clarke,

C. E., Corvin, A., Craddock, N., Deloukas, P., Edkins, S., Evans, J., Freeman,

C., Gray, E., Hardy, J., Hudson, G., Hunt, S., Jankowski, J., Langford, C., Lees,

A. J., Markus, H. S., Mathew, C. G., McCarthy, M. I., Morrison, K. E., Palmer, C.

N. A., Pearson, J. P., Peltonen, L., Pirinen, M., Plomin, R., Potter, S., Rautanen,

A., Sawcer, S. J., Su, Z., Trembath, R. C., Viswanathan, A. C., Williams, N. W.,

Morris, H. R., Donnelly, P., and Wood, N. W. (2011). Dissection of the genetics

of Parkinson’s disease identifies an additional association 5’ of SNCA and multiple

associated haplotypes at 17q21. Human Molecular Genetics 20 (2), 345–353.

UniProt Consortium (2014). Activities at the Universal Protein Resource (UniProt).

Nucleic Acids Research 42 (Database issue), D191–198.

Wagenmakers, E.-J., Lee, M., Lodewyckx, T., and Iverson, G. (2008). Bayesian versus

frequentist inference. In H. Hoijtink, I. Klugkist & P. A. Boelen (Eds.), Bayesian

evaluation of informative hypotheses, Statistics for social and behavioral sciences,

pp. 181–207. New York: Springer.

Wain, L. V. (2014). Rare variants and cardiovascular disease. Briefings in Functional

Genomics 13 (5), 384–91.

Wain, L. V., Verwoert, G. C., O’Reilly, P. F., Shi, G., Johnson, T., Johnson, A. D.,

Bochud, M., Rice, K. M., Henneman, P., Smith, A. V., Ehret, G. B., Amin, N.,

Larson, M. G., Mooser, V., Hadley, D., Dörr, M., Bis, J. C., Aspelund, T., Esko,

T., Janssens, A. C. J. W., Zhao, J. H., Heath, S., Laan, M., Fu, J., Pistis, G.,

Luan, J., Arora, P., Lucas, G., Pirastu, N., Pichler, I., Jackson, A. U., Webster,

R. J., Zhang, F., Peden, J. F., Schmidt, H., Tanaka, T., Campbell, H., Igl, W.,

Milaneschi, Y., Hottenga, J.-J., Vitart, V., Chasman, D. I., Trompet, S., Bragg-

Gresham, J. L., Alizadeh, B. Z., Chambers, J. C., Guo, X., Lehtimäki, T., Kühnel,

B., Lopez, L. M., Polašek, O., Boban, M., Nelson, C. P., Morrison, A. C., Pihur,

V., Ganesh, S. K., Hofman, A., Kundu, S., Mattace-Raso, F. U. S., Rivadeneira,

F., Sijbrands, E. J. G., Uitterlinden, A. G., Hwang, S.-J., Vasan, R. S., Wang,

T. J., Bergmann, S., Vollenweider, P., Waeber, G., Laitinen, J., Pouta, A., Zitting,

P., McArdle, W. L., Kroemer, H. K., Völker, U., Völzke, H., Glazer, N. L., Taylor,

K. D., Harris, T. B., Alavere, H., Haller, T., Keis, A., Tammesoo, M.-L., Aulchenko,

Y., Barroso, I., Khaw, K.-T., Galan, P., Hercberg, S., Lathrop, M., Eyheramendy,

S., Org, E., Sõber, S., Lu, X., Nolte, I. M., Penninx, B. W., Corre, T., Masciullo,

C., Sala, C., Groop, L., Voight, B. F., Melander, O., O’Donnell, C. J., Salomaa, V.,

d’Adamo, A. P., Fabretto, A., Faletra, F., Ulivi, S., Del Greco, F. M., Facheris, M.,

Collins, F. S., Bergman, R. N., Beilby, J. P., Hung, J., Musk, A. W., Mangino, M.,

Shin, S.-Y., Soranzo, N., Watkins, H., Goel, A., Hamsten, A., Gider, P., Loitfelder,

M., Zeginigg, M., Hernandez, D., Najjar, S. S., Navarro, P., Wild, S. H., Corsi,

A. M., Singleton, A., de Geus, E. J. C., Willemsen, G., Parker, A. N., Rose, L. M.,

Buckley, B., Stott, D., Orru, M., Uda, M., LifeLines Cohort Study, van der Klauw,

M. M., Zhang, W., Li, X., Scott, J., Chen, Y.-D. I., Burke, G. L., Kähönen, M.,

Viikari, J., Döring, A., Meitinger, T., Davies, G., Starr, J. M., Emilsson, V., Plump,

A., Lindeman, J. H., Hoen, P. A. C. t., König, I. R., EchoGen consortium, Felix,

J. F., Clarke, R., Hopewell, J. C., Ongen, H., Breteler, M., Debette, S., Destefano,

A. L., Fornage, M., AortaGen Consortium, Mitchell, G. F., CHARGE Consortium

Heart Failure Working Group, Smith, N. L., KidneyGen consortium, Holm, H.,

Stefansson, K., Thorleifsson, G., Thorsteinsdottir, U., CKDGen consortium,

Cardiogenics consortium, CardioGram, Samani, N. J., Preuss, M., Rudan, I., Hayward,

C., Deary, I. J., Wichmann, H.-E., Raitakari, O. T., Palmas, W., Kooner, J. S.,

Stolk, R. P., Jukema, J. W., Wright, A. F., Boomsma, D. I., Bandinelli, S.,

Gyllensten, U. B., Wilson, J. F., Ferrucci, L., Schmidt, R., Farrall, M., Spector, T. D.,

Palmer, L. J., Tuomilehto, J., Pfeufer, A., Gasparini, P., Siscovick, D., Altshuler, D.,

Loos, R. J. F., Toniolo, D., Snieder, H., Gieger, C., Meneton, P., Wareham, N. J.,

Oostra, B. A., Metspalu, A., Launer, L., Rettig, R., Strachan, D. P., Beckmann,

J. S., Witteman, J. C. M., Erdmann, J., van Dijk, K. W., Boerwinkle, E., Boehnke,

M., Ridker, P. M., Jarvelin, M.-R., Chakravarti, A., Abecasis, G. R., Gudnason,

V., Newton-Cheh, C., Levy, D., Munroe, P. B., Psaty, B. M., Caulfield, M. J., Rao,

D. C., Tobin, M. D., Elliott, P., and van Duijn, C. M. (2011). Genome-wide

association study identifies six new loci influencing pulse pressure and mean arterial

pressure. Nature Genetics 43 (10), 1005–1011.

Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR: functional annotation

of genetic variants from high-throughput sequencing data. Nucleic Acids

Research 38 (16), e164.

Wellcome Trust Case Control Consortium, Maller, J. B., McVean, G., Byrnes, J.,

Vukcevic, D., Palin, K., Su, Z., Howson, J. M., Auton, A., Myers, S., Morris, A., Pirinen,

M., Brown, M. A., Burton, P. R., Caulfield, M. J., Compston, A., Farrall, M., Hall,

A. S., Hattersley, A. T., Hill, A. V., Mathew, C. G., Pembrey, M., Satsangi, J.,

Stratton, M. R., Worthington, J., Craddock, N., Hurles, M., Ouwehand, W., Parkes, M.,

Rahman, N., Duncanson, A., Todd, J. A., Kwiatkowski, D. P., Samani, N. J., Gough,

S. C., McCarthy, M. I., Deloukas, P., and Donnelly, P. (2012). Bayesian refinement

of association signals for 14 loci in 3 common diseases. Nature Genetics 44 (12),

1294–301.

Welter, D., MacArthur, J., Morales, J., Burdett, T., Hall, P., Junkins, H., Klemm,

A., Flicek, P., Manolio, T., Hindorff, L., and Parkinson, H. (2014). The NHGRI

GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids

Research 42 (Database issue), D1001–6.

Yi, N., Liu, N., Zhi, D., and Li, J. (2011). Hierarchical generalized linear models for

multiple groups of rare and common variants: jointly estimating group and individual-

variant effects. PLoS Genetics 7 (12), e1002382.

Yi, N. and Ma, S. (2012). Hierarchical shrinkage priors and model fitting for high-

dimensional generalized linear models. Statistical Applications in Genetics and

Molecular Biology 11 (6), 1–20.

Yi, N., Xu, S., Lou, X. Y., and Mallick, H. (2014). Multiple comparisons in genetic

association studies: a hierarchical modeling approach. Statistical Applications in

Genetics and Molecular Biology 13 (1), 35–48.

Yi, N. and Zhi, D. (2011). Bayesian analysis of rare variants in genetic association

studies. Genetic Epidemiology 35 (1), 57–69.

Zhang, J.-T. (2005). Approximate and asymptotic distributions of Chi-squared-type

mixtures with applications. Journal of the American Statistical Association 100,

273–285.

Zhang, L., Baladandayuthapani, V., Mallick, B. K., Manyam, G. C., Thompson, P. A.,

Bondy, M. L., and Do, K.-A. (2014). Bayesian hierarchical structured variable

selection methods with application to molecular inversion probe studies in breast cancer.

Journal of the Royal Statistical Society: Series C (Applied Statistics) 63 (4), 595–

620.

Zhao, Q., Kelly, T. N., Li, C., and He, J. (2013). Progress and future aspects in genetics

of human hypertension. Current Hypertension Reports 15 (6), 676–86.

A Appendix

A.1 Generalized linear model of the normal distribution

The following equations describe a generalized linear model with a normal error distribution, where y_i is an observed value, µ is the mean, and σ is the standard deviation of the normal distribution [McCullagh & Nelder 1989].

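Probability Density Function (PDF) of the normal distribution¹:

\begin{equation}
f(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \tag{A.1}
\end{equation}

Exponential Family Representation (EFR) of the Probability Density Function (PDF):

\begin{align}
\ln f(y \mid \mu, \sigma) &= -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2 \tag{A.2} \\
f(y \mid \mu, \sigma) &= \exp\left\{-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y-\mu)^2}{2\sigma^2}\right\} \tag{A.3} \\
&= \exp\left\{\frac{-y^2 + 2y\mu - \mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right\} \tag{A.4} \\
&= \exp\left\{\frac{2y\mu - \mu^2}{2\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right\} \tag{A.5} \\
&= \exp\left\{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{1}{2}\left[\frac{y^2}{\sigma^2} + \ln(2\pi\sigma^2)\right]\right\} \tag{A.6} \\
&= \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} \tag{A.7}
\end{align}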

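Parameters:

Canonical parameter: \( \theta = \mu \)

Dispersion parameter: \( \phi = \sigma^2 \)

\[ a(\phi) = \phi = \sigma^2 \]

\[ b(\theta) = \frac{\mu^2}{2} \]

\[ c(y, \phi) = -\frac{1}{2}\left[\frac{y^2}{\sigma^2} + \ln(2\pi\sigma^2)\right] \]

Mean and Variance:

\[ E(y) = b'(\theta) = \frac{\partial}{\partial\theta}\,\frac{\theta^2}{2} = \theta = \mu \]

\[ V(y) = b''(\theta) \cdot a(\phi) = 1 \cdot \phi = \sigma^2 \]

¹ Indices i for individuals are omitted to enhance readability.

Canonical link function h and its inverse function h^{-1} (identity link):

\[ h(\mu) = 1 \cdot \theta; \quad h^{-1}(\theta) = \mu \]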

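Likelihood L and log-likelihood \ell:

\begin{align}
L &= \prod_i f(y \mid \theta, \phi) \tag{A.8} \\
&= \prod_i f(y \mid \mu, \sigma) \tag{A.9} \\
&= \prod_i \exp\left\{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right\} \tag{A.10} \\
\ell &= \sum_i \left\{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right\} \tag{A.11} \\
&\propto \sum_i \left\{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{y^2}{2\sigma^2}\right\} \tag{A.12} \\
&\propto -\frac{1}{2\sigma^2} \sum_i \left(y^2 - 2y\mu + \mu^2\right) \tag{A.13} \\
&\propto -\frac{1}{2\sigma^2} \sum_i (y - \mu)^2 \tag{A.14}
\end{align}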

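First derivative of the log-likelihood:

\begin{align}
\frac{\partial \ell}{\partial \mu} &= -\frac{1}{2\sigma^2} \sum_i (-2y + 2\mu) \tag{A.15} \\
&= \frac{1}{\sigma^2} \sum_i (y - \mu) \tag{A.16}
\end{align}

Maximum likelihood estimator:

\begin{align}
0 &= \frac{1}{\sigma^2} \sum_i (y - \mu) \tag{A.17} \\
0 &= -n\mu + \sum_i y \tag{A.18} \\
\mu &= \frac{\sum_i y}{n} \tag{A.19}
\end{align}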
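The result in (A.19) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the data values, the known standard deviation, and the function names `normal_loglik` and `normal_mle_mean` are assumptions made for illustration. It evaluates the full log-likelihood (A.11) and confirms that it is larger at the sample mean than at nearby values of µ.

```python
import math


def normal_loglik(y, mu, sigma):
    """Log-likelihood of i.i.d. normal data, written as in equation (A.11)."""
    return sum(
        (yi * mu - mu**2 / 2) / sigma**2
        - yi**2 / (2 * sigma**2)
        - 0.5 * math.log(2 * math.pi * sigma**2)
        for yi in y
    )


def normal_mle_mean(y):
    """Closed-form maximum likelihood estimate of mu, equation (A.19)."""
    return sum(y) / len(y)


# Hypothetical blood pressure readings in mmHg (illustrative values only).
y = [118.0, 125.0, 131.0, 122.0, 140.0]
sigma = 10.0  # assumed known standard deviation
mu_hat = normal_mle_mean(y)

# Moving mu away from the sample mean in either direction lowers the log-likelihood.
for delta in (-1.0, -0.1, 0.1, 1.0):
    assert normal_loglik(y, mu_hat, sigma) > normal_loglik(y, mu_hat + delta, sigma)
```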

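Second derivative of the log-likelihood (with known variance σ²):

\begin{align}
\frac{\partial^2 \ell}{\partial \mu^2} &= \frac{\partial\, \frac{1}{\sigma^2} \sum_i (y - \mu)}{\partial \mu} \tag{A.20} \\
&= \frac{1}{\sigma^2} \sum_i (0 - 1) \tag{A.21} \\
&= -\frac{n}{\sigma^2} \tag{A.22}
\end{align}

Standard deviation of the maximum likelihood estimator (standard error, SE):

\begin{align}
SE &= \sqrt{\frac{1}{-\frac{\partial^2 \ell}{\partial \mu^2}}} \tag{A.23} \\
&= \sqrt{\frac{1}{-\left(-\frac{n}{\sigma^2}\right)}} \tag{A.24} \\
&= \sqrt{\frac{1}{\frac{n}{\sigma^2}}} \tag{A.25} \\
&= \frac{\sigma}{\sqrt{n}} \tag{A.26}
\end{align}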
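As an illustration of (A.20) to (A.26), the sketch below approximates the second derivative of the log-likelihood kernel (A.14) by a central finite difference and compares the resulting standard error with σ/√n. The data, the step size `h`, and the helper names are assumptions chosen for this example only.

```python
import math


def loglik_kernel(y, mu, sigma):
    """Log-likelihood kernel of the normal model, equation (A.14)."""
    return -sum((yi - mu) ** 2 for yi in y) / (2 * sigma**2)


def numeric_second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


y = [118.0, 125.0, 131.0, 122.0, 140.0]  # hypothetical data
sigma = 10.0  # assumed known standard deviation
mu_hat = sum(y) / len(y)

d2 = numeric_second_derivative(lambda m: loglik_kernel(y, m, sigma), mu_hat)
se_numeric = math.sqrt(1 / -d2)           # equation (A.23)
se_closed = sigma / math.sqrt(len(y))     # equation (A.26)
print(se_numeric, se_closed)
```

Because the kernel is quadratic in µ, the finite-difference value agrees with the closed form up to rounding error.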

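(Scaled) deviance D:

\begin{align}
D &= -2\,(\ell_{fit} - \ell_{saturated}) \tag{A.27} \\
&= +2\,(\ell_{saturated} - \ell_{fit}) \tag{A.28} \\
&= +2 \sum_i \left\{\frac{y \cdot y - \frac{y^2}{2}}{\sigma^2} - \frac{y \cdot \mu - \frac{\mu^2}{2}}{\sigma^2}\right\} \tag{A.29} \\
&= \frac{2}{\sigma^2} \sum_i \left(y^2 - \frac{y^2}{2} - y\mu + \frac{\mu^2}{2}\right) \tag{A.30} \\
&= \frac{1}{\sigma^2} \sum_i \left(2y^2 - y^2 - 2y\mu + \mu^2\right) \tag{A.31} \\
&= \frac{1}{\sigma^2} \sum_i \left(y^2 - 2y\mu + \mu^2\right) \tag{A.32} \\
&= \frac{1}{\sigma^2} \sum_i (y - \mu)^2 \tag{A.33}
\end{align}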
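The equivalence between the definition (A.27) and the closed form (A.33) can be illustrated with a short sketch. The data, the fitted means, and the function names are hypothetical; the saturated model is represented by setting each fitted mean equal to its observation.

```python
import math


def normal_loglik(y, mu, sigma):
    """Full normal log-likelihood with one mean per observation."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (yi - mi) ** 2 / (2 * sigma**2)
        for yi, mi in zip(y, mu)
    )


def normal_scaled_deviance(y, mu_fit, sigma):
    """Scaled deviance of the normal model, equation (A.33)."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu_fit)) / sigma**2


y = [118.0, 125.0, 131.0, 122.0, 140.0]  # hypothetical data
sigma = 10.0  # assumed known standard deviation
mu_fit = [sum(y) / len(y)] * len(y)      # intercept-only fit

d_closed = normal_scaled_deviance(y, mu_fit, sigma)
# Saturated model: every fitted value equals its observation (mu_i = y_i).
d_loglik = 2 * (normal_loglik(y, y, sigma) - normal_loglik(y, mu_fit, sigma))
print(d_closed, d_loglik)
```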

A.2 Generalized linear model of the binomial distribution

The following equations describe a generalized linear model with a binomial error distribution, where y_i is the observed number of successes, n is the number of trials, and p is the probability of success per trial of the binomial distribution [McCullagh & Nelder 1989].

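Probability Density Function (PDF) of the binomial distribution¹:

\begin{equation}
f(y \mid n, p) = \binom{n}{y} p^y (1 - p)^{n - y} \tag{A.34}
\end{equation}

Exponential Family Representation (EFR) of the Probability Density Function (PDF):

\begin{align}
\ln f(y \mid n, p) &= \ln\binom{n}{y} + y \ln(p) + (n - y) \ln(1 - p) \tag{A.35} \\
f(y \mid n, p) &= \exp\left\{y \ln(p) + n \ln(1 - p) - y \ln(1 - p) + \ln\binom{n}{y}\right\} \tag{A.36} \\
&= \exp\left\{\frac{y \ln\frac{p}{1-p} + n \ln(1 - p)}{1} + \ln\binom{n}{y}\right\} \tag{A.37} \\
&= \exp\left\{\frac{y \ln\frac{p}{1-p} - n \ln\left(\frac{1}{1-p}\right)}{1} + \ln\binom{n}{y}\right\} \tag{A.38} \\
&= \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} \tag{A.39}
\end{align}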

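Parameters:

Canonical parameter: \( \theta = \ln\frac{p}{1-p}; \quad p = \frac{e^\theta}{1 + e^\theta} \)

Dispersion parameter: \( \phi = 1 \)

\[ a(\phi) = \phi = 1 \]

\[ b(\theta) = n \ln\left(\frac{1}{1-p}\right) = n \ln(1 + e^\theta) \]

\[ c(y, \phi) = \ln\binom{n}{y} \]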

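Mean and Variance:

\begin{align}
E(y) &= b'(\theta) \tag{A.40} \\
&= \frac{\partial\, n \ln(1 + e^\theta)}{\partial \theta} \tag{A.41} \\
&= n \cdot \frac{e^\theta}{1 + e^\theta} \tag{A.42} \\
&= n \cdot p \tag{A.43}
\end{align}

\begin{align}
V(y) &= b''(\theta) \cdot a(\phi) \tag{A.44} \\
&= b''(\theta) \cdot 1 \tag{A.45} \\
&= \frac{\partial\, n \cdot p}{\partial \theta} \tag{A.46} \\
&= \frac{\partial\, n \frac{e^\theta}{1 + e^\theta}}{\partial \theta} \tag{A.47} \\
&= n \cdot \frac{e^\theta (1 + e^\theta) - e^\theta \cdot e^\theta}{(1 + e^\theta)^2} \tag{A.48} \\
&= n \cdot \frac{e^\theta}{1 + e^\theta} \cdot \frac{1}{1 + e^\theta} \tag{A.49} \\
&= n \cdot p \cdot \frac{1 + e^\theta - e^\theta}{1 + e^\theta} \tag{A.50} \\
&= n \cdot p \cdot \left(\frac{1 + e^\theta}{1 + e^\theta} - \frac{e^\theta}{1 + e^\theta}\right) \tag{A.51} \\
&= n \cdot p \cdot (1 - p) \tag{A.52}
\end{align}

¹ Indices i for individuals are omitted to enhance readability.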
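The relations E(y) = b'(θ) and V(y) = b''(θ) · a(φ) can be illustrated numerically. The sketch below is an assumption-laden example, not the original computation: it differentiates the cumulant function b(θ) = n ln(1 + e^θ) by finite differences for arbitrarily chosen n and p and compares the results with n·p and n·p·(1 − p).

```python
import math


def logit(p):
    """Canonical parameter theta = ln(p / (1 - p))."""
    return math.log(p / (1 - p))


def inv_logit(theta):
    """Inverse transformation p = e^theta / (1 + e^theta)."""
    return math.exp(theta) / (1 + math.exp(theta))


def b(theta, n):
    """Cumulant function b(theta) = n ln(1 + e^theta) of the binomial model."""
    return n * math.log1p(math.exp(theta))


n, p = 20, 0.3  # illustrative values
theta = logit(p)
h = 1e-4        # finite-difference step (assumption)

# First and second derivatives of b with respect to theta.
mean_numeric = (b(theta + h, n) - b(theta - h, n)) / (2 * h)
var_numeric = (b(theta + h, n) - 2 * b(theta, n) + b(theta - h, n)) / h**2

print(mean_numeric, n * p)            # compare with (A.43)
print(var_numeric, n * p * (1 - p))   # compare with (A.52)
```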

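Canonical link function h and its inverse function h^{-1} (logistic link):

\[ h(p) = \theta = \ln\frac{p}{1-p}; \quad h^{-1}(\theta) = \frac{e^\theta}{1 + e^\theta} \]

Likelihood L and log-likelihood \ell:

\begin{align}
L &= \prod_i f(y \mid \theta, \phi) \tag{A.53} \\
L &= \prod_i f(y \mid n, p) \tag{A.54} \\
L &= \prod_i \exp\left\{\frac{y \ln\left(\frac{p}{1-p}\right) - n \ln\left(\frac{1}{1-p}\right)}{1} + \ln\binom{n}{y}\right\} \tag{A.55} \\
\ell &= \sum_i \left\{y \ln\left(\frac{p}{1-p}\right) - n \ln\left(\frac{1}{1-p}\right) + \ln\binom{n}{y}\right\} \tag{A.56} \\
\ell &\propto \sum_i \left\{y \ln\left(\frac{p}{1-p}\right) - n \ln\left(\frac{1}{1-p}\right)\right\} \tag{A.57} \\
\ell &\propto \sum_i \left\{y \ln\left(\frac{p}{1-p}\right) + n \ln(1 - p)\right\} \tag{A.58} \\
\ell &\propto \sum_i \left\{y \ln(p) - y \ln(1 - p) + n \ln(1 - p)\right\} \tag{A.59} \\
\ell &\propto \sum_i \left\{y \ln(p) + (n - y) \ln(1 - p)\right\} \tag{A.60}
\end{align}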

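First derivative of the log-likelihood:

\begin{align}
\frac{\partial \ell}{\partial p} &= \sum_i \left\{y \frac{1}{p} + (n - y) \frac{1}{1 - p} (-1)\right\} \tag{A.61} \\
&= \sum_i \left\{\frac{y}{p} - \frac{n - y}{1 - p}\right\} \tag{A.62}
\end{align}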

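Maximum likelihood estimator for a single observation i:

\begin{align}
0 &= \frac{y}{p} - \frac{n - y}{1 - p} \tag{A.63} \\
\frac{n - y}{1 - p} &= \frac{y}{p} \tag{A.64} \\
\frac{n - y}{y} &= \frac{1 - p}{p} \tag{A.65} \\
\frac{n - y}{y} &= \frac{1}{p} - 1 \tag{A.66} \\
\frac{1}{p} &= \frac{n - y}{y} + \frac{y}{y} \tag{A.67} \\
\frac{1}{p} &= \frac{n}{y} \tag{A.68} \\
p &= \frac{y}{n} \tag{A.69}
\end{align}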
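The closed-form estimator in (A.69) can be compared with a brute-force maximisation of the log-likelihood kernel (A.60). The following sketch uses hypothetical counts (7 successes in 20 trials) and a grid with step 0.001; both choices, and the function name, are assumptions for illustration.

```python
import math


def binom_loglik_kernel(p, y, n):
    """Log-likelihood kernel of equation (A.60) for a single observation."""
    return y * math.log(p) + (n - y) * math.log(1 - p)


y, n = 7, 20   # hypothetical number of successes and trials
p_hat = y / n  # closed-form estimate, equation (A.69)

# Evaluate the kernel on a grid over the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: binom_loglik_kernel(p, y, n))

print(p_hat, p_grid)
```

The grid maximiser coincides with y/n here because 0.35 is itself a grid point; for other counts the two agree only to within the grid spacing.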

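Second derivative of the log-likelihood:

\begin{align}
\frac{\partial^2 \ell}{\partial p^2} &= \frac{\partial \sum_i \left\{\frac{y}{p} - \frac{n - y}{1 - p}\right\}}{\partial p} \tag{A.70} \\
&= \sum_i \left\{-\frac{y}{p^2} - \frac{0 \cdot (1 - p) - (n - y)(-1)}{(1 - p)^2}\right\} \tag{A.71} \\
&= \sum_i \left\{-\frac{y}{p^2} - \frac{n - y}{(1 - p)^2}\right\} \tag{A.72}
\end{align}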

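Standard deviation of the maximum likelihood estimator (standard error, SE) for a single observation i:

\begin{align}
SE &= \sqrt{\frac{1}{-\frac{\partial^2 \ell}{\partial p^2}}} \tag{A.73} \\
&= \sqrt{\frac{1}{-\left(-\frac{y}{p^2} - \frac{n - y}{(1 - p)^2}\right)}} \tag{A.74} \\
&= \sqrt{\frac{1}{\frac{y}{\left(\frac{y}{n}\right)^2} + \frac{n - y}{\left(1 - \frac{y}{n}\right)^2}}} \tag{A.75} \\
&= \sqrt{\frac{1}{\frac{n^2}{y} + \frac{n^2}{n - y}}} \tag{A.76} \\
&= \sqrt{\frac{1}{\frac{n^3}{y(n - y)}}} \tag{A.77} \\
&= \sqrt{\frac{y(n - y)}{n^3}} \tag{A.78} \\
&= \sqrt{\frac{1}{n} \cdot \frac{y}{n} \cdot \frac{n - y}{n}} \tag{A.79} \\
&= \sqrt{\frac{p(1 - p)}{n}} \tag{A.80}
\end{align}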
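The chain from (A.73) to (A.80) can be traced with a few lines of code. In this sketch the counts are hypothetical and the two function names are introduced only for illustration; it assumes 0 < y < n so that the second derivative is finite.

```python
import math


def binom_se_from_information(y, n):
    """SE from the second derivative (A.72) evaluated at p = y / n, as in (A.73)-(A.75)."""
    p = y / n
    second_derivative = -y / p**2 - (n - y) / (1 - p) ** 2
    return math.sqrt(1 / -second_derivative)


def binom_se_closed_form(y, n):
    """SE in the simplified form of equation (A.80)."""
    p = y / n
    return math.sqrt(p * (1 - p) / n)


y, n = 7, 20  # hypothetical counts with 0 < y < n
se_information = binom_se_from_information(y, n)
se_closed = binom_se_closed_form(y, n)
print(se_information, se_closed)
```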

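(Scaled) deviance D for a single observation i with y as the observed and \hat{\mu} as the estimated number of successes:

\begin{align}
D &= -2\,(\ell_{fit} - \ell_{saturated}) \tag{A.81} \\
&= +2\,\ell_{saturated} - 2\,\ell_{fit} \tag{A.82} \\
&= +2\left[y \cdot \ln\frac{\frac{y}{n}}{1 - \frac{y}{n}} - n \cdot \ln\frac{1}{1 - \frac{y}{n}} + \ln\binom{n}{y}\right] \tag{A.83} \\
&\quad -2\left[y \cdot \ln\frac{\frac{\hat{\mu}}{n}}{1 - \frac{\hat{\mu}}{n}} - n \cdot \ln\frac{1}{1 - \frac{\hat{\mu}}{n}} + \ln\binom{n}{y}\right] \tag{A.84} \\
&= +2\left[y \cdot \ln\frac{\frac{y}{n}}{1 - \frac{y}{n}} - n \cdot \ln\frac{1}{1 - \frac{y}{n}}\right] \tag{A.85} \\
&\quad -2\left[y \cdot \ln\frac{\frac{\hat{\mu}}{n}}{1 - \frac{\hat{\mu}}{n}} - n \cdot \ln\frac{1}{1 - \frac{\hat{\mu}}{n}}\right] \tag{A.86} \\
&= +2\left[y \cdot \ln\frac{y}{n - y} - n \cdot \ln\frac{n}{n - y}\right] \tag{A.87} \\
&\quad -2\left[y \cdot \ln\frac{\hat{\mu}}{n - \hat{\mu}} - n \cdot \ln\frac{n}{n - \hat{\mu}}\right] \tag{A.88} \\
&= +2\left[y \cdot \ln\frac{y}{n - y} - y \cdot \ln\frac{\hat{\mu}}{n - \hat{\mu}}\right] \tag{A.89} \\
&\quad -2\left[n \cdot \ln\frac{n}{n - y} - n \cdot \ln\frac{n}{n - \hat{\mu}}\right] \tag{A.90} \\
&= +2\bigl[y \ln y - y \ln(n - y) - y \ln\hat{\mu} + y \ln(n - \hat{\mu}) \tag{A.91} \\
&\qquad - n \ln n + n \ln(n - y) + n \ln n - n \ln(n - \hat{\mu})\bigr] \tag{A.92} \\
&= +2\left[y \cdot \ln\frac{y}{\hat{\mu}} + (n - y) \cdot \ln(n - y) - (n - y) \cdot \ln(n - \hat{\mu})\right] \tag{A.93} \\
&= +2\left[y \cdot \ln\frac{y}{\hat{\mu}} + (n - y) \cdot \ln\frac{n - y}{n - \hat{\mu}}\right] \tag{A.94}
\end{align}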
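The final form (A.94) translates directly into a small function. The sketch below is illustrative only: the counts and fitted value are hypothetical, and the convention that a term with a zero count contributes 0 (so that y = 0 and y = n can be handled) is an assumption added here, not something stated in the derivation above.

```python
import math


def binom_deviance(y, mu_hat, n):
    """Deviance contribution of one binomial observation, equation (A.94)."""

    def term(count, fitted):
        # Assumed convention: 0 * ln(0 / fitted) is taken as 0.
        return 0.0 if count == 0 else count * math.log(count / fitted)

    return 2 * (term(y, mu_hat) + term(n - y, n - mu_hat))


def binom_loglik_kernel(p, y, n):
    """Log-likelihood kernel from equation (A.60), used for cross-checking."""
    return y * math.log(p) + (n - y) * math.log(1 - p)


y, n = 7, 20    # hypothetical observed successes and trials
mu_hat = 10.0   # hypothetical fitted number of successes

d_closed = binom_deviance(y, mu_hat, n)
# Definition (A.81): saturated model uses p = y / n, fitted model uses p = mu_hat / n.
d_loglik = 2 * (binom_loglik_kernel(y / n, y, n) - binom_loglik_kernel(mu_hat / n, y, n))
print(d_closed, d_loglik)
```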

A.3 Examples of prior distributions

[Figure: 3 × 3 grid of normal density curves (x-axis: x, −10 to 10; y-axis: Density, 0 to 0.8). Rows: location = −5, 0, 5. Columns: scale = 0.5, 1, 1.5.]

Figure A.1: The normal distribution with varying values for location and scale

[Figure: 3 × 3 grid of t density curves (x-axis: x, −4 to 4; y-axis: Density, 0 to 0.5). Rows: location = −1, 0, 1. Columns: df = 1, 15, 30.]

Figure A.2: The t distribution with varying parameter values for location (vertical) and degrees of freedom (horizontal)

[Figure: 3 × 3 grid of Cauchy density curves (x-axis: x, −10 to 10; y-axis: Density, 0 to 0.8). Rows: location = −5, 0, 5. Columns: scale = 0.5, 2.5, 10.]

Figure A.3: The Cauchy distribution with varying parameter values for location and scale
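To make the difference between the normal and Cauchy shapes in Figures A.1 and A.3 concrete, the following sketch evaluates both densities from their standard formulas. The function names and the evaluation points are assumptions for illustration; the code does not reproduce the figures themselves.

```python
import math


def normal_pdf(x, location=0.0, scale=1.0):
    """Density of the normal distribution with the given location and scale."""
    z = (x - location) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2 * math.pi))


def cauchy_pdf(x, location=0.0, scale=1.0):
    """Density of the Cauchy distribution with the given location and scale."""
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


# Peak heights for scale = 0.5, the narrowest panels in both figures.
print(normal_pdf(0.0, 0.0, 0.5), cauchy_pdf(0.0, 0.0, 0.5))

# Tail behaviour five scale units from the location parameter.
print(normal_pdf(5.0), cauchy_pdf(5.0))
```

At five scale units from the centre the Cauchy density is several orders of magnitude larger than the normal density, which is the heavy-tail property that lets a Cauchy prior accommodate occasional large effects.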

[Figure: 3 × 2 grid of χ² density curves (x-axis: x, 0 to 30; y-axis: Density, 0 to 0.8). Panels: df = 1, 2, 3, 4, 5, 6.]

Figure A.4: The χ² distribution with varying degrees of freedom

[Figure: 3 × 2 grid of inverse χ² density curves (x-axis: x, 0 to 5; y-axis: Density, 0 to 5). Panels: df = 1, 2, 3, 4, 5, 6.]

Figure A.5: The inverse χ² distribution with varying degrees of freedom

[Figure: 3 × 3 grid of gamma density curves (x-axis: x, 0 to 10; y-axis: Density, 0 to 2.0). Rows: shape = 0.5, 1, 1.5. Columns: rate = 0.5, 1, 1.5.]

Figure A.6: The gamma (Γ) distribution with varying parameter values for shape (vertical) and rate (horizontal)

A.4 Supplementary tables

Table A.1: Descriptives of a single simulated dataset (10,000 individuals) selected from 1,000 datasets with genetic effects (simulated data Ib)

Characteristic            All                No Hypertension    Hypertension
N (%)                     10000 (100%)       6770 (67.70%)      3230 (32.30%)
Sex, female, N (%)        4923 (49.23%)      2884 (42.60%)      2039 (63.13%)
Age, years, Mean (SD)     49.96 (9.96)       47.51 (9.36)       55.09 (9.20)
DBP, mmHg, Mean (SD)      94.77 (19.1)       86.57 (15.34)      111.94 (14.16)
SBP, mmHg, Mean (SD)      123.49 (15.66)     123.49 (15.66)     158.06 (13.67)
Rare variants
  rv1, AC (AF%)           11 (0.055%)        5 (0.037%)         6 (0.093%)
  rv2, AC (AF%)           34 (0.170%)        10 (0.074%)        24 (0.372%)
  rv3, AC (AF%)           68 (0.340%)        43 (0.318%)        25 (0.387%)
  rv4, AC (AF%)           92 (0.460%)        52 (0.384%)        40 (0.619%)
  rv5, AC (AF%)           101 (0.505%)       74 (0.547%)        27 (0.418%)
Common variants
  cv1, AC (AF%)           1955 (9.775%)      981 (7.245%)       974 (15.077%)
  cv2, AC (AF%)           3901 (19.505%)     1405 (10.377%)     2496 (38.638%)
  cv3, AC (AF%)           5931 (29.655%)     4057 (29.963%)     1874 (29.009%)
  cv4, AC (AF%)           7899 (39.495%)     5359 (39.579%)     2540 (39.319%)
  cv5, AC (AF%)           10,038 (50.190%)   6799 (50.214%)     3239 (50.139%)

DBP = diastolic blood pressure, SBP = systolic blood pressure, AC = allele count, AF = allele frequency
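The allele frequencies in Table A.1 are consistent with dividing the allele count by twice the number of individuals, as expected for diploid genotypes. The sketch below illustrates that relationship; the function name is hypothetical and the formula is inferred from the tabulated values rather than stated in the table.

```python
def allele_frequency_percent(allele_count, n_individuals):
    """Allele frequency in percent, assuming two alleles per individual (diploid)."""
    return 100.0 * allele_count / (2 * n_individuals)


# Values taken from Table A.1: (allele count, number of individuals).
print(allele_frequency_percent(11, 10000))   # rv1, all individuals
print(allele_frequency_percent(981, 6770))   # cv1, no hypertension
print(allele_frequency_percent(2496, 3230))  # cv2, hypertension
```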

Table A.2: Genetic variants in the primary analysis set (461,868 variants)

Criterion               Count      Frequency
total                   461,868    100.00%
  dbSNP                 241,377     52.26%
  coding                267,510     57.92%
  protein-altering       75,024     16.24%
rare                    412,949     89.41%
  dbSNP                 192,825     41.75%
  coding                241,456     58.47%
  protein-altering       63,087     15.28%
common                   48,919     10.59%
  dbSNP                  48,552     99.25%
  coding                 26,054     53.26%
  protein-altering       11,937     24.40%

coding = variant is located in an exonic region; protein-altering = missense, nonsense, stop-gain, stop-loss, splice function. Reference: UCSC (hg19) database.

Table A.3: Table of averaged coefficients of a normal linear model of systolic blood pressure (SBP). SBP was predicted from individual variants (model 1), unweighted genetic scores (model 2), or weighted genetic scores (model 3). Genetic variants had simulated non-null (rv1, rv2, cv1, cv2) or null effects (rv3, rv4, rv5, cv3, cv4, cv5). Analysis was performed in 1,000 datasets with 10,000 individuals with genetic effects (simulated data Ib).

Nr   Predictor    Estimate   CI95% lower   CI95% upper   Coverage [%]   P < 0.05 [%]
Model 1: Variants
1    Intercept    120.00     119.23        120.78        94.1%          100.0%
2    Sex          10.01      9.42          10.60         93.4%          100.0%
3    Age - 50     1.00       0.97          1.03          93.7%          100.0%
4    RV1          9.88       3.18          16.59         95.2%          82.7%
5    RV2          19.98      15.28         24.68         94.3%          100.0%
6    RV3          −0.06      −3.89         3.77          96.0%          4.0%
7    RV4          0.00       −3.31         3.32          95.8%          4.2%
8    RV5          0.06       −2.90         3.01          94.9%          5.1%
9    CV1          9.99       9.30          10.69         94.5%          100.0%
10   CV2          20.00      19.48         20.52         95.1%          100.0%
11   CV3          −0.00      −0.46         0.45          94.4%          5.6%
12   CV4          0.01       −0.42         0.43          95.1%          4.9%
13   CV5          −0.01      −0.42         0.41          96.3%          3.7%
Model 2: Unweighted genetic scores
14   Intercept    117.05     116.12        117.99        0.0%           100.0%
15   Sex          10.01      9.29          10.73         95.2%          100.0%
16   Age - 50     1.00       0.96          1.03          94.5%          100.0%
17   RV (a)       3.32       1.24          5.41          NA             86.4%
18   CV (a)       4.32       4.05          4.58          NA             100.0%
Model 3: Weighted genetic scores
19   Intercept    120.00     119.52        120.48        75.6%          100.0%
20   Sex          10.01      9.42          10.60         93.4%          100.0%
21   Age - 50     1.00       0.97          1.03          93.7%          100.0%
22   RV (a)       1.00       0.78          1.22          NA             100.0%
23   CV (a)       1.00       0.98          1.02          NA             100.0%

a) Bias and coverage of the unweighted sum scores in model 2 and weighted sum score in model 3 were not evaluated because the true value was not available.
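The "Coverage" and "P < 0.05" columns summarise results across repeated simulated datasets. As a rough sketch of how such summaries can be computed, the example below uses a deliberately simplified setting, an intercept-only normal model with a normal-approximation 95% interval, rather than the regression models of Table A.3. The function name, sample size, seed, and the 1.96 multiplier are all assumptions for illustration.

```python
import math
import random
import statistics


def simulate_coverage(true_mean, sd, n, n_datasets, seed=1):
    """Return (coverage, rejection rate) of a 95% interval for a mean.

    Coverage: share of datasets whose interval contains the true mean.
    Rejection rate: share of datasets whose interval excludes zero.
    """
    rng = random.Random(seed)
    covered = 0
    rejected = 0
    for _ in range(n_datasets):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        estimate = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
        covered += lower <= true_mean <= upper
        rejected += not (lower <= 0.0 <= upper)
    return covered / n_datasets, rejected / n_datasets


# Null effect: coverage near 95% and rejection rate near 5% are expected.
print(simulate_coverage(true_mean=0.0, sd=1.0, n=200, n_datasets=1000))

# Non-null effect: the rejection rate then estimates power.
print(simulate_coverage(true_mean=1.0, sd=1.0, n=200, n_datasets=1000))
```

For a null effect the two summaries are complementary in this sketch, because the interval misses the true value exactly when it excludes zero; the same pattern appears in the null-effect rows of Table A.3.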

Table A.4: Table of average coefficients of a reduced Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions. Analysis was performed in 1,000 samples with 10,000 individuals with genetic effects (simulated data Ib). Individual variant effects (model 1), average variant effects (model 2), or joint variant effects (model 3) of grouped variants on systolic blood pressure were tested.

Nr   Level          Predictor    Estimate   CrI95% lower   CrI95% upper   Coverage [%]   P < 0.05 [%]
Model 1: Individual variant effects
1    Variants       Intercept    120.00     119.23         120.78         94.10%         100.0%
2    Variants       sex          10.00      9.42           10.59          93.30%         100.0%
3    Variants       age          1.00       0.97           1.03           93.70%         100.0%
4    Variants       rv1          9.69       3.06           16.32          95.30%         82.3%
5    Variants       rv2          19.94      15.24          24.63          94.50%         100.0%
6    Variants       rv3          −0.05      −3.71          3.60           96.30%         3.7%
7    Variants       rv4          0.00       −3.17          3.18           95.90%         4.1%
8    Variants       rv5          0.05       −2.79          2.90           95.10%         4.9%
9    Variants       cv1          9.99       9.30           10.69          94.50%         100.0%
10   Variants       cv2          19.10      19.48          20.52          95.10%         100.0%
11   Variants       cv3          −0.00      −0.46          0.45           94.50%         5.5%
12   Variants       cv4          0.01       −0.42          0.43           95.10%         4.9%
13   Variants       cv5          −0.01      −0.42          0.41           96.30%         3.7%
Model 2: Average variant effects
14   Average (a)    Rare         3.14       1.51           4.78           4.78           91.2%
15   Average (a)    Common       4.30       4.08           4.51           4.51           100.0%
Model 3: Joint variant effects
16   Joint (a,b)    Rare         NA         NA             NA             100.00%        100.0%
17   Joint (a,b)    Common       NA         NA             NA             100.00%        100.0%

a) True value not available, b) Credible intervals not available. RV = rare variant, CV = common variant. A Cauchy (location=0, scale=1) prior distribution was used for individual variant effects.


Table A.5: Table of average coefficients of a full Bayesian hierarchical model predicting simulated systolic blood pressure using weakly informative prior Cauchy (0, 1) distributions for group effects and Cauchy distributions with selected parameters for individual variants. Analysis was performed in 1,000 samples with 10,000 individuals without genetic effects (simulated data Ia).

Nr  Predictor  Prior       Estimate  CrI95% lower  CrI95% upper  Coverage [%]  P < 0.05
1   Intercept  L=0, S=0.1  120.00    119.57        120.43        93.0%         100.0%
2   Sex        L=0, S=0.1  9.99      9.40          10.58         93.1%         100.0%
3   Age - 50   L=0, S=0.1  1.00      0.97          1.03          93.6%         100.0%
4   RV         L=0, S=0.1  0.09      −0.21         0.38          97.4%         2.6%
5   CV         L=0, S=0.1  0.50      −0.94         1.93          88.1%         11.9%
6   Intercept  L=0, S=0.3  120.00    119.53        120.46        89.8%         100.0%
7   Sex        L=0, S=0.3  9.99      9.40          10.58         93.1%         100.0%
8   Age - 50   L=0, S=0.3  1.00      0.97          1.03          93.6%         100.0%
9   RV         L=0, S=0.3  0.10      −0.85         1.05          96.6%         3.4%
10  CV         L=0, S=0.3  1.00      −0.35         2.36          68.7%         31.3%
11  Intercept  L=0, S=0.5  120.00    119.52        120.48        87.4%         100.0%
12  Sex        L=0, S=0.5  9.99      9.40          10.58         93.3%         100.0%
13  Age - 50   L=0, S=0.5  1.00      0.97          1.03          93.6%         100.0%
14  RV         L=0, S=0.5  0.17      −1.17         1.51          95.5%         4.5%
15  CV         L=0, S=0.5  0.91      −0.17         1.99          62.2%         37.8%
16  Intercept  L=0, S=0.8  119.99    119.51        120.48        85.8%         100.0%
17  Sex        L=0, S=0.8  9.99      9.40          10.58         93.3%         100.0%
18  Age - 50   L=0, S=0.8  1.00      0.97          1.03          93.6%         100.0%
19  RV         L=0, S=0.8  0.34      −1.11         1.78          92.0%         8.0%
20  CV         L=0, S=0.8  0.78      −0.08         1.64          57.0%         43.0%
21  Intercept  L=0, S=1.0  119.99    119.50        120.49        85.2%         100.0%
22  Sex        L=0, S=1.0  9.99      9.40          10.58         93.3%         100.0%
23  Age - 50   L=0, S=1.0  1.00      0.97          1.03          93.5%         100.0%
24  RV         L=0, S=1.0  0.47      −1.00         1.93          89.8%         10.2%
25  CV         L=0, S=1.0  0.73      −0.05         1.51          55.0%         45.0%
26  Intercept  L=1, S=0.1  119.95    119.23        120.67        93.8%         100.0%
27  Sex        L=1, S=0.1  9.99      9.40          10.58         93.1%         100.0%
28  Age - 50   L=1, S=0.1  1.00      0.97          1.03          93.6%         100.0%
29  RV         L=1, S=0.1  0.08      −1.03         1.19          96.1%         3.9%
30  CV         L=1, S=0.1  0.09      −0.16         0.34          90.3%         9.7%
31  Intercept  L=1, S=0.3  119.93    119.22        120.64        92.8%         100.0%
32  Sex        L=1, S=0.3  9.99      9.40          10.58         93.1%         100.0%
33  Age - 50   L=1, S=0.3  1.00      0.97          1.03          93.6%         100.0%
34  RV         L=1, S=0.3  0.09      −1.02         1.21          96.2%         3.8%
35  CV         L=1, S=0.3  0.12      −0.14         0.39          86.1%         13.9%
36  Intercept  L=1, S=0.5  119.87    119.20        120.54        88.5%         100.0%
37  Sex        L=1, S=0.5  9.99      9.40          10.58         93.1%         100.0%
38  Age - 50   L=1, S=0.5  1.00      0.97          1.03          93.6%         100.0%
39  RV         L=1, S=0.5  0.10      −1.01         1.21          95.4%         4.6%
40  CV         L=1, S=0.5  0.22      −0.10         0.54          75.1%         24.9%
41  Intercept  L=1, S=0.8  119.83    119.23        120.42        85.4%         100.0%
42  Sex        L=1, S=0.8  9.99      9.40          10.58         93.1%         100.0%
43  Age - 50   L=1, S=0.8  1.00      0.97          1.03          93.6%         100.0%
44  RV         L=1, S=0.8  0.15      −0.97         1.26          94.3%         5.7%
45  CV         L=1, S=0.8  0.39      −0.04         0.82          61.1%         38.9%
46  Intercept  L=1, S=1.0  119.84    119.28        120.40        84.5%         100.0%
47  Sex        L=1, S=1.0  9.99      9.40          10.58         93.1%         100.0%
48  Age - 50   L=1, S=1.0  1.00      0.97          1.03          93.6%         100.0%
49  RV         L=1, S=1.0  0.20      −0.92         1.32          92.0%         8.0%
50  CV         L=1, S=1.0  0.46      −0.03         0.94          56.7%         43.3%
L=location, S=scale
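To make the Prior column concrete, the sketch below evaluates the Cauchy density and distribution function for the location (L) and scale (S) values listed in the table and reports how much prior mass lies within one unit of the location. It is an illustration only; the helper names are hypothetical and the one-unit window is an arbitrary choice made here, not a quantity used in the thesis.

```python
import math


def cauchy_pdf(x, location=0.0, scale=1.0):
    """Density of a Cauchy(location, scale) distribution at x."""
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


def cauchy_cdf(x, location=0.0, scale=1.0):
    """Cumulative distribution function of a Cauchy(location, scale)."""
    return 0.5 + math.atan((x - location) / scale) / math.pi


def mass_within(width, location=0.0, scale=1.0):
    """Prior probability that the effect lies within +/- width of the location."""
    return (cauchy_cdf(location + width, location, scale)
            - cauchy_cdf(location - width, location, scale))


if __name__ == "__main__":
    # Grid of prior settings as listed in the Prior column of the table.
    for location in (0.0, 1.0):
        for scale in (0.1, 0.3, 0.5, 0.8, 1.0):
            print(f"L={location:.0f}, S={scale:.1f}: "
                  f"P(|effect - L| < 1) = {mass_within(1.0, location, scale):.3f}")
```

Smaller scales concentrate the prior near its location and therefore shrink estimates more strongly, while the heavy Cauchy tails still leave room for large effects.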

Table A.6: 128 genes with ranges of simulated genetic effects in simulated data II (GAW19) (sorted by genomic position)

Columns: Nr, Gene, Chr, Start, End, #Var, DBP effects, SBP effects, #RV, #CV, BHGLM

[Only the Nr and Gene columns are legible in the extracted text; the per-gene values of Chr, Start, End, #Var, DBP effects, SBP effects, #RV, #CV, and BHGLM could not be recovered.]

  1 PANK4       2 EPHA2       3 EIF4G3      4 TCEB3
  5 MAP3K6      6 LRP8        7 LEPR        8 GBP4
  9 LRIG2      10 WARS2      11 TXNIP      12 ARNT
 13 SCNM1      14 CGN        15 S100A9     16 S100A6
 17 TMEM79     18 CD1C       19 FCER1A     20 TNN
 21 TOR3A      22 GLUL       23 TPR        24 PPP1R15B
 25 GALNT2     26 MTR        27 OPN3       28 SUMF1
 29 BTD        30 MLH1       31 MAP4       32 SEMA3F
 33 ARHGEF3    34 FLNB       35 DNASE1L3   36 MUC13
 37 ABTB1      38 PPP2R3A    39 ZBTB38     40 SENP5
 41 MTRR       42 LNPEP      43 KCNN2      44 STK10
 45 NSD1       46 HOXA10     47 STK17A     48 SBDS
 49 CCL24      50 LRCH4      51 AKR1B1     52 CNOT4
 53 KEL        54 REPIN1     55 ERMP1      56 SNAPC3
 57 ZNF189     58 AKAP2      59 ZFP37      60 PSMD5
 61 GSN        62 PPP2R4     63 SURF2      64 TPRN
 65 POLR2L     66 SLC22A18   67 PRDX5      68 RHOD
 69 TCIRG1     70 P2RY2      71 GAB2       72 PANX1
 73 CASP5      74 ZW10       75 TMPRSS4    76 TRIM29
 77 MTIF3      78 FLT3       79 LCP1       80 ING1
 81 BBS4       82 CYP1A2     83 PTPN9      84 MRPS11
 85 SERPINF1   86 P2RX5      87 ZZEF1      88 CHRNE
 89 RABEP1     90 C1QBP      91 ACADVL     92 SAT2
 93 RAI1       94 BLMH       95 THRA       96 KRT23
 97 KRT15      98 DHX8       99 USH1G     100 CAPS
101 ZNF562    102 COL5A3    103 ZNF443    104 LPHN1
105 EMR3      106 F2RL3     107 COPE      108 LRP3
109 CAPN12    110 FBL       111 SPTBN4    112 SNRPA
113 ZNF574    114 ZNF180    115 RELB      116 SIX5
117 HIF3A     118 RCN3      119 FPR1      120 ZNF350
121 EPS8L1    122 ZNF17     123 ZNF211    124 ZNF544
125 ZNF132    126 KRTAP11-1 127 RUNX1     128 C21orf33

In total, 7,317 variants in 128 genes were available, of which 6,578 variants in 111 genes were analysed. #RV = number of rare variants in simulated data II (GAW19), #CV = number of common variants in simulated data II (GAW19), BHGLM = Bayesian hierarchical generalized linear models in simulated data II (GAW19) were applied to test grouped variant effects.

Table A.7: Functional variants with genome-wide significant associations with simulated blood pressure using frequentist normal linear models including a single variant and reduced Bayesian hierarchical generalized linear models (normal) including multiple variants using weakly informative prior Cauchy (0, 1) distributions in simulated data II (GAW19). No genome-wide significant associations with hypertension were observed.

Columns: Nr, Gene, Variant ID, dbSNP ID, DBP effect (simulated), SBP effect (simulated), EAF, EAC, Estimate, CI95%, P

[Only the Nr, Gene, Variant ID, and dbSNP ID columns are legible in the extracted text; the numeric columns could not be recovered.]

Nr  Gene  Variant ID        dbSNP ID
Frequentist single-variant analysis, DBP (1,425 variants, 110 genes)
1   MAP4  3:47956424:C/T    rs1137524
2   MAP4  3:47957996:C/G    rs2230169
3   MAP4  3:48040283:C/T    rs11711953
4   TNN   1:175092674:C/T   rs2285215
5   LEPR  1:66075952:G/C    rs8179183
Frequentist single-variant analysis, SBP (1,154 variants, 93 genes)
6   CGN   1:151501841:C/G   rs12044926
7   MAP4  3:48040283:C/T    rs11711953
8   MAP4  3:47957996:C/G    rs2230169
9   MAP4  3:47956424:C/T    rs1137524
10  TNN   1:175092674:C/T   rs2285215
11  LEPR  1:66075952:G/C    rs8179183
Bayesian hierarchical multiple-variant analysis, DBP (1,258 variants, 98 genes)
12  MAP4  3:47957996:C/G    rs2230169
Bayesian hierarchical multiple-variant analysis, SBP (1,004 variants, 80 genes)
13  MAP4  3:47956424:C/T    rs1137524

Variant ID=chromosome:position:reference/effect allele; EAF=effect allele frequency, EAC=effect allele count, SE=standard error, CI95%=95% confidence interval; variants are sorted by p value; α = 1 × 10^−7; Bayesian hierarchical models used a prior Cauchy (0,1) distribution for individual variants.


Table A.8: Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Estimate  CrI95% lower  CrI95% upper  Bayesian P
1   CDC27     48   3    0.09      −0.10         0.28          0.33
2   TGM5      59   4    0.08      −0.10         0.26          0.38
3   OR7G3     14   3    0.17      −0.21         0.54          0.39
4   SMC2      81   3    −0.05     −0.19         0.09          0.48
5   VRK3      58   5    0.05      −0.10         0.21          0.48
6   GPR156    27   7    0.09      −0.17         0.36          0.50
7   OR2T6     23   4    −0.09     −0.37         0.18          0.51
8   KIAA1524  56   4    0.06      −0.13         0.24          0.55
9   RFXANK    20   3    0.09      −0.22         0.41          0.56
10  SPINK7    8    3    0.14      −0.34         0.63          0.56
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10^−7

Table A.9: Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene       #RV  #CV  Estimate  CrI95% lower  CrI95% upper  Bayesian P
1   SMC2       81   3    −0.04     −0.19         0.11          0.63
2   OR7G3      14   3    0.08      −0.29         0.45          0.68
3   KRTAP10-5  20   4    0.06      −0.25         0.38          0.70
4   C7orf33    9    4    0.08      −0.38         0.54          0.72
5   TRPM4      149  9    0.02      −0.09         0.14          0.73
6   LILRA2     63   5    0.03      −0.14         0.20          0.73
7   TGM5       59   4    0.03      −0.15         0.21          0.73
8   RFXANK     20   3    0.06      −0.26         0.37          0.73
9   TRPM2      190  19   0.02      −0.08         0.12          0.74
10  OR2T6      23   4    −0.05     −0.33         0.24          0.74
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10^−7


Table A.10: Associations of top ten genes with hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the average effect of the group of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene     #RV  #CV  Estimate  CrI95% lower  CrI95% upper  Bayesian P
1   TGM5     59   4    0.13      0.02          0.24          0.0162
2   TRPM4    149  9    0.10      0.02          0.19          0.0163
3   EIF2AK4  86   14   0.13      0.02          0.24          0.0258
4   C9orf50  29   6    −0.24     −0.47         −0.02         0.0347
5   OCA2     91   17   0.13      0.01          0.25          0.0354
6   TUBGCP4  39   3    0.19      0.01          0.37          0.0440
7   ULK4     109  31   0.12      0.00          0.24          0.0451
8   PPP5C    43   7    0.19      0.00          0.37          0.0457
9   OR1A2    11   4    0.28      0.00          0.56          0.0491
10  TRPM2    190  19   0.09      −0.00         0.18          0.0577
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants, CrI95%=95% credible interval; genome-wide significance level α = 1 × 10^−7
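As a rough plausibility check on the tabulated values, a two-sided p value can be approximated from an estimate and its 95% interval by treating the interval as symmetric and normal. This is an assumption made here for illustration and not necessarily how the Bayesian p values in the table were computed; the function name is hypothetical, and because the tabulated inputs are rounded to two decimals the result can only agree approximately.

```python
import math


def two_sided_p_from_interval(estimate, lower, upper, z_crit=1.959964):
    """Approximate two-sided p value from an estimate and a 95% interval.

    Assumption: the interval is estimate +/- z_crit * SE under normality.
    """
    se = (upper - lower) / (2.0 * z_crit)
    z = estimate / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # First row of the table above (TGM5): estimate 0.13, interval [0.02, 0.24].
    p = two_sided_p_from_interval(0.13, 0.02, 0.24)
    print(f"approximate p = {p:.4f}")
```

The approximation gives a value of the same order as the tabulated 0.0162; none of the rows comes close to the genome-wide significance level of 1 × 10^−7.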

Table A.11: Associations of top ten genes with observed diastolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Q      χ2     df    Bayesian P
1   TJP2      93   15   29.42  27.76  4.60  2.69 × 10^−5
2   LGALS4    29   3    4.80   18.04  2.06  1.32 × 10^−4
3   COX10     22   7    20.22  16.70  2.41  4.07 × 10^−4
4   C5orf34   28   4    65.99  12.09  1.13  6.45 × 10^−4
5   HCN4      111  6    18.46  13.73  1.78  7.76 × 10^−4
6   KIAA1432  79   12   25.31  16.55  2.93  8.11 × 10^−4
7   GUCY1A2   32   5    12.48  13.73  1.97  1.00 × 10^−3
8   GPR146    48   6    14.65  15.10  2.52  1.02 × 10^−3
9   TP53      25   6    15.66  13.62  2.00  1.11 × 10^−3
10  VPS37C    27   6    11.36  16.44  3.38  1.37 × 10^−3
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants, genome-wide significance level α = 1 × 10^−7
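The tabulated p values appear consistent with the upper tail of a χ2 distribution evaluated at the reported χ2 statistic with the reported, non-integer degrees of freedom. The sketch below computes that tail probability with the standard library only, using the usual series and continued-fraction expansions of the regularized incomplete gamma function; the function names are hypothetical, and the link to the thesis computation is an inference from the table rather than a documented fact.

```python
import math


def _gamma_p_series(a, x, max_iter=500, eps=1e-14):
    """Lower regularized incomplete gamma P(a, x) via its power series."""
    term = 1.0 / a
    total = term
    n = a
    for _ in range(max_iter):
        n += 1.0
        term *= x / n
        total += term
        if abs(term) < abs(total) * eps:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _gamma_q_contfrac(a, x, max_iter=500, eps=1e-14):
    """Upper regularized incomplete gamma Q(a, x) via Lentz's continued fraction."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_sf(statistic, df):
    """Upper-tail probability of a chi-square variable; df may be fractional."""
    if statistic <= 0.0:
        return 1.0
    a, x = df / 2.0, statistic / 2.0
    if x < a + 1.0:
        return 1.0 - _gamma_p_series(a, x)
    return _gamma_q_contfrac(a, x)


if __name__ == "__main__":
    # First row of the table above (TJP2): chi2 = 27.76, df = 4.60.
    print(f"p = {chi2_sf(27.76, 4.60):.3g}")
```

For the TJP2 row the result is of the same order as the tabulated 2.69 × 10^−5, well above the genome-wide significance level of 1 × 10^−7.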


Table A.12: Associations of top ten genes with observed systolic blood pressure using a reduced Bayesian hierarchical generalized linear model (normal) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Q        χ2     df    Bayesian P
1   TJP2      93   15   125.98   30.25  1.53  1.16 × 10^−7
2   CNTNAP2   89   12   27.08    21.93  2.56  3.86 × 10^−5
3   MIR548I4  9    4    23.20    17.58  1.29  4.89 × 10^−5
4   URB2      66   9    21.42    18.76  3.12  3.49 × 10^−4
5   PAK6      69   15   95.14    12.51  1.14  5.29 × 10^−4
6   ULK4      109  31   4688.25  10.73  1.00  1.05 × 10^−3
7   SERPINI1  24   4    4134.76  10.08  1.00  1.50 × 10^−3
8   RNH1      56   9    13.02    22.05  6.42  1.64 × 10^−3
9   ZSCAN29   30   6    4004.29  9.86   1.00  1.69 × 10^−3
10  LPPR2     16   4    26.09    10.22  1.29  2.28 × 10^−3
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants; genome-wide significance level α = 1 × 10^−7

Table A.13: Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the joint effects of groups of rare variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene    #RV  #CV  Q      χ2     df     Bayesian P
1   GM2A    12   6    15.13  11.22  4.04   0.0248
2   OR51Q1  18   11   15.98  14.67  7.94   0.0640
3   HPCAL4  13   4    12.64  10.43  5.09   0.0675
4   TRIM33  39   4    25.46  20.14  13.86  0.1207
5   PRSS58  12   5    10.08  8.33   4.83   0.1273
6   VTCN1   15   5    12.03  10.30  6.34   0.1315
7   PON3    22   5    14.17  10.71  6.74   0.1360
8   ZNF677  20   4    14.57  7.23   4.26   0.1432
9   NFE2L1  22   4    13.72  13.99  10.72  0.2154
10  MYOZ3   17   3    11.04  8.28   6.01   0.2196
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants; genome-wide significance level α = 1 × 10^−7


Table A.14: Associations of top ten genes with observed hypertension using a reduced Bayesian hierarchical generalized linear model (binomial) to test the joint effect of the group of common variants per gene in empirical GAW19 data (sorted by Bayesian p value)

Nr  Gene      #RV  #CV  Q     χ2     df    Bayesian P
1   COX10     22   7    1.91  16.34  2.61  6.16 × 10^−4
2   DRAXIN    44   7    1.28  14.55  2.65  1.53 × 10^−3
3   FAM63A    21   3    1.01  10.95  1.64  2.64 × 10^−3
4   HSD3B1    18   3    0.37  10.10  1.46  3.16 × 10^−3
5   NPPA-AS1  13   5    1.50  14.16  3.20  3.30 × 10^−3
6   NPPA      14   5    1.50  14.16  3.20  3.30 × 10^−3
7   KIAA1432  79   12   3.12  17.81  5.34  4.14 × 10^−3
8   C9orf72   25   6    2.41  11.46  2.29  4.60 × 10^−3
9   TNFSF13   20   3    0.32  8.66   1.45  6.65 × 10^−3
10  LGALS4    29   3    0.20  10.60  2.27  6.82 × 10^−3
The Bayesian hierarchical model used weakly informative prior Cauchy (0, 1) distributions for individual variant effects. RV=rare variants, CV=common variants; genome-wide significance level α = 1 × 10^−7


A.5 Supplementary figures

[Panel (a): four regression diagnostic plots: Residuals vs Fitted (residuals against fitted values), Normal Q−Q (standardized residuals against theoretical quantiles), Scale−Location (standardized residuals against fitted values), and Residuals vs Leverage (standardized residuals against leverage, with Cook's distance). Labelled observations: SAMPLE_1868, SAMPLE_1585, SAMPLE_1203, SAMPLE_1172, SAMPLE_0636.]

(a) Diastolic blood pressure

[Panel (b): the same four diagnostic plots: Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage. Labelled observations: SAMPLE_1692, SAMPLE_1581, SAMPLE_1925, SAMPLE_0348, SAMPLE_1613, SAMPLE_0158.]

(b) Systolic blood pressure

Figure A.7: Diagnostic plots for diastolic and systolic blood pressure in empirical GAW19 data
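The quantities shown in such diagnostic plots (fitted values, leverage, standardized residuals, and Cook's distance) can be computed directly from an ordinary least squares fit. The following is a minimal sketch in Python, not the code used to produce Figure A.7; the function name regression_diagnostics and the assumption that the design matrix already contains an intercept column are illustrative choices.

```python
import numpy as np


def regression_diagnostics(X, y):
    """Return fitted values, leverage, standardized residuals and Cook's distance.

    X is an (n, p) design matrix that already includes the intercept column
    (an assumption of this sketch); y is the outcome vector of length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Thin QR decomposition: the hat matrix is H = Q Q^T.
    Q, _ = np.linalg.qr(X)
    fitted = Q @ (Q.T @ y)
    residuals = y - fitted

    # Leverage is the diagonal of H, i.e. the row sums of Q squared.
    leverage = np.sum(Q ** 2, axis=1)

    # Residual variance estimate with n - p degrees of freedom.
    sigma2 = residuals @ residuals / (n - p)

    # Internally studentized (standardized) residuals.
    std_resid = residuals / np.sqrt(sigma2 * (1.0 - leverage))

    # Cook's distance expressed through standardized residuals and leverage.
    cooks = std_resid ** 2 * leverage / (p * (1.0 - leverage))
    return fitted, leverage, std_resid, cooks
```

Plotting std_resid against fitted, and std_resid against leverage with contours of constant Cook's distance, gives the kind of panels summarized in the figure.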


[Scatter plot of APC2 (vertical axis) against APC1 (horizontal axis).]

Figure A.8: Individuals (N=1,943) plotted according to the first two ancestry principal components (APCs) based on 465,887 single-nucleotide variants


(a) Manhattan Plot

(b) Quantile-Quantile Plot

Figure A.9: Associations between 262,929 genetic variants (MAC ≥ 2, valid p value) and diastolic blood pressure using a frequentist normal linear model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷)
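The coordinates of a quantile-quantile plot of association p values, and the variants passing the significance threshold stated in the caption, can be obtained as in the following sketch. It is illustrative only and is not the code used for the figures; the function names are hypothetical, and treating a "valid" p value as a finite value in (0, 1] is an assumption of this example.

```python
import numpy as np


def qq_points(p_values):
    """Return expected and observed -log10(p) values for a QQ plot."""
    p = np.asarray(p_values, dtype=float)
    # Assumption: a valid p value is finite and lies in (0, 1].
    p = p[np.isfinite(p) & (p > 0) & (p <= 1)]
    n = p.size
    # Observed values, sorted from most to least significant.
    observed = -np.log10(np.sort(p))
    # Expected quantiles of a uniform distribution under the null hypothesis.
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    return expected, observed


def significant_indices(p_values, threshold=1e-7):
    """Return the indices of p values below the significance threshold."""
    p = np.asarray(p_values, dtype=float)
    # Comparisons with NaN evaluate to False, so missing values are skipped.
    return np.flatnonzero(p < threshold)
```

Points lying above the diagonal expected = observed indicate more small p values than expected under the null hypothesis.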


(a) Manhattan Plot

(b) Quantile-Quantile Plot

Figure A.10: Associations between 259,086 genetic variants (MAC ≥ 2, valid p value) and systolic blood pressure using a frequentist normal linear model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷)


(a) Manhattan Plot

(b) Quantile-Quantile Plot

Figure A.11: Associations between 262,931 genetic variants (MAC ≥ 2, valid p value) and hypertension using a frequentist Firth logistic model estimating individual variant effects in 1,851 individuals from empirical GAW19 data (genome-wide significance = 1 × 10⁻⁷)


(a) Estimate

(b) P value

Figure A.12: Comparison of genetic associations with diastolic blood pressure based on p values of a frequentist normal linear model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical normal linear model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data


(a) Estimate

(b) P value

Figure A.13: Comparison of genetic associations with systolic blood pressure based on p values of a frequentist normal linear model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical normal linear model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data


(a) Estimate

(b) P value

Figure A.14: Comparison of genetic associations with hypertension based on p values of a frequentist Firth logistic model estimating single variant effects (SVT) versus a (reduced) Bayesian hierarchical logistic model estimating multiple variant effects (BHGLM) using weakly informative Cauchy (0, 1) prior distributions in 1,851 individuals from empirical GAW19 data
