DETECTING ASSOCIATION OF COMMON AND RARE VARIANTS WITH COMPLEX DISEASES

By

YALI LI

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Xiaofeng Zhu

Department of Epidemiology and Biostatistics

CASE WESTERN RESERVE UNIVERSITY

May 2010

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Yali Li, candidate for the Ph.D. degree*.

(signed) Xiaofeng Zhu (chair of the committee)

Jing Li

Courtney Gray-McGuire

Robert C. Elston

(date) 3/23/2010

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

List of Tables
List of Figures
Acknowledgements
Abstract

Chapter 1 Introduction
1.1 Genome-Wide Association studies for Common Diseases
1.1.1 Overview
1.1.2 Recent Successes of Genome-Wide Association Studies
1.1.3 Limitations and Challenges of Genome Wide Association Studies
1.2 Multiple Rare Variants Hypothesis and Recent Evidence
1.2.1 Common Disease-Multiple Rare Variants Hypothesis
1.2.2 Recent Evidence of Rare Variants Contributing to Common Diseases
1.3 Current statistical methods for detecting rare genetic variants in association studies
1.3.1 Multivariate and Collapsing Methods
1.3.2 Two-Stage Methods
1.4 Statistical methods for combining p-values

Chapter 2 Detecting Association with Rare Variants for Common Diseases Using Haplotype-Based Methods
2.1 Introduction
2.2 The Haplotype-based Truncated Product Method (HTPM)
2.3 The Combined Method

Chapter 3 Power Analysis of Haplotype-Based Methods to Detecting Rare Variants and an Application Example using the WTCCC Data
3.1 Simulation Studies
3.2 Type I and Power Evaluation of the HTPM and the combined method
3.3 Comparison of Haplotype-Based Methods with CMC method
3.4 Application of haplotype-based method to the WTCCC data
3.5 Discussion and Future Direction

Chapter 4 Genome Wide Association Study of Blood Pressure: The Candidate Association Resource (CARe) Study
4.1 Introduction
4.2 Methods
4.2.1 Study Sample
4.2.2 Genotyping and Quality Control
4.2.3 Genotype imputation
4.2.4 Phenotype modeling
4.2.5 Statistical analyses
4.3 Results
4.3.1 Study sample
4.3.2 Genome-wide association of African American cohorts for blood pressure
4.3.3 Association of SNPs in IBC chips and blood pressure in African and European Americans
4.4 Discussion
4.4.1 Top GWA SNPs for Affy 6.0 million SNP chip
4.4.2 Top SNPs from the meta analysis of the IBC array
4.4.3 Comparison with published SNPs from previous African Americans studies in the CARe African Americans
4.4.4 Comparison with published SNPs from previous European American studies in CARe African Americans
4.4.5 Independent replication of top CARe SNPs in African and European American cohorts
4.4.6 Future direction

List of Tables

Table 1-1 Comparison of characteristics of common and rare disease variants
Table 3-1 Type I error rate for simulation data analyzed as haplotype phase known and phase unknown
Table 3-2 ENCODE region SNP and haplotype information
Table 3-3 Type I error rate comparison of haplotype-based methods and CMC method using simulations based on ENCODE re-sequencing data
Table 3-4 List of showing association to hypertension (HT) or Coronary artery disease (CAD) in WTCCC data
Table 4-1 Study sample characteristics
Table 4-2 Top associated SNPs for blood pressure in African Americans from meta-analysis of Affymetrix 6.0 arrays (P < 1×10⁻⁶)
Table 4-3 Top associated SNPs for blood pressure in African Americans and European Americans from meta-analysis of IBC arrays (P < 1×10⁻⁵)
Table 4-4 Top associated SNPs for blood pressure in African Americans and European Americans from meta-analysis of IBC arrays (P < 1×10⁻⁴)
Table 4-5 Lookup result of published SNPs from previous African Americans studies
Table 4-6 Lookup result of published SNPs from previous European Americans studies
Table 4-7 Replication results from different studies
Table 4-8 Power calculation of replication studies

List of Figures

Figure 1-1 Genome-wide scans for seven diseases of WTCCC study
Figure 1-2 The effects of population structure at a SNP locus
Figure 1-3 Multiplicative change in p-values due to population structure
Figure 1-4 The effect of small effect size on the power of association studies
Figure 1-5 Effects of allele frequency on sample size requirements
Figure 1-6 Distribution of odds ratios for common and rare variants
Figure 2-1 Low-frequency variants and disease susceptibility
Figure 3-1 Flow chart of simulation studies
Figure 3-2 Power comparison of single-SNP analysis, two-stage method, HTPM and combined method, when haplotype phase is known
Figure 3-3 Power comparison of single-SNP analysis, two-stage method, HTPM and combined method, when haplotype phase is unknown
Figure 3-4 Power comparisons of CMC method, HTPM and combined method using simulations based on ENCODE re-sequencing data
Figure 4-1 Comparison of the 1st and 2nd principal components calculated from Broad and FamCC
Figure 4-2 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in African Americans from meta-analysis of Affymetrix 6.0 arrays
Figure 4-3 Regional plots of two blood pressure loci in African Americans from meta-analysis of Affymetrix 6.0 arrays
Figure 4-4 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in African Americans from meta-analysis of IBC arrays
Figure 4-5 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in European Americans from meta-analysis of IBC arrays

Acknowledgements

First of all, I would like to express my deepest gratitude to my advisor, Dr. Xiaofeng Zhu, for his guidance, encouragement and constant support throughout this work. I also want to acknowledge Dr. Robert C. Elston for his support, advice and comments on this work. I would like to thank Dr. Courtney Gray-McGuire and Dr. Jing Li for serving on the committee and providing valuable advice.

Dr. Tao Feng helped me tremendously with simulation programming and haplotype inference. His efforts are greatly appreciated. I would also like to thank Dr. Sun J. Kang for inspiring discussions.

Part of this dissertation work involved analyzing data from the National Heart, Lung and Blood Institute (NHLBI)’s Candidate-gene Association REsource (CARe) Study. I am very grateful to the CARe committee and all the cohort investigators for data contribution, coordination, quality control and preliminary analysis. Dr. Ervin Fox of the University of Mississippi is our primary collaborator on the CARe project. His efforts in organizing the working group and leading discussions are greatly appreciated. Thanks to Dr. Daniel Levy of Boston University School of Medicine and Dr. J. Hunter Young of Johns Hopkins Medical Institutions for their contributions to the CARe project in reviewing the results and joining the discussions.

Finally, I would like to dedicate this dissertation to my husband and my parents, for their constant love and enduring support during the past five years.


Detecting Association of Common and Rare Variants with Complex Diseases

Abstract

By

YALI LI

Current Genome-Wide Association Studies (GWAS) have successfully detected many common genetic variants contributing to common diseases, but not rare variants. Here two haplotype-based methods are proposed for detecting rare variants contributing to a common disease. One is the haplotype-based truncated product method (HTPM), which borrows a p-value combination method from multiple hypothesis testing and uses it to pool the information on rare risk haplotypes. The other is the combined method, in which a set of risk haplotypes is chosen by comparing haplotype frequencies between cases and controls and is then tested using the same sample. Our simulation study demonstrates that both methods have increased power for detecting association between rare variants and disease, compared with other available methods. Both methods were applied to the Wellcome Trust Case Control Consortium (WTCCC) coronary artery disease and hypertension data, and both replicated previous findings of genes associated with hypertension and coronary artery disease, respectively, at a genome-wide significance level of 5%. These results suggest that haplotype-based methods are powerful for finding rare genetic variants and are applicable to data from current genome-wide association studies.

Another part of this dissertation is a genetic association study of systolic and diastolic blood pressure (SBP and DBP): the Candidate Gene Association Resource (CARe) Study. The CARe Consortium consists of a total of 8,591 African American (AA) and 20,082 European American (EA) participants who were genotyped in seven United States cohorts. Participants who had SBP and DBP measures and genome-wide or candidate-gene genotyping, using the Affymetrix 6.0 array and the Illumina IBC array, respectively, were included in the association analysis. Primary association analyses were performed within each cohort using an additive genetic model; a meta-analysis combined data across all cohorts using inverse variance weighting. The strongest genome-wide signals identified in African Americans were rs10474346 (P = 3.56×10⁻⁸), located near GPR98 and ARRDC3 on 5q14, for DBP, and rs2258119 in C21orf91 on chromosome 21 (P = 4.69×10⁻⁸) for SBP. Suggestive evidence of association was detected for several genes.


Chapter 1 Introduction

1.1 Genome-Wide Association studies for Common Diseases

1.1.1 Overview

The development of many common diseases is complex, involving the interaction of many genes and environmental factors. Identifying the genetic risk factors for common diseases will help us better understand their etiology. The human genome contains roughly 11 million SNPs, some of which may directly change disease risk, while others may indirectly affect disease susceptibility because they are in linkage disequilibrium (LD) with causal SNPs. Genome-wide association (GWA) studies search for genetic variants across the genome that may play a role in disease by genotyping densely distributed SNPs.

With the completion of the human genome sequence [Lander, et al. 2001; Venter, et al. 2001], advances in SNP genotyping technology, the expansion of SNP databases and the progress of the International HapMap Project [2003; Frazer, et al. 2007], GWA studies have developed rapidly in recent years. Researchers have identified risk and protective factors for many common diseases, including asthma, diabetes, heart disease, Alzheimer disease and cancer, as well as for complex traits such as obesity and height, by genotyping 500,000 or more SNPs in large samples of subjects. The success of GWA studies can substantially influence health care. Their results may inform the development of personalized medicine and targeted therapies. In addition, identifying individuals with different risks of disease will enable preventive therapeutic management.

1.1.2 Recent Successes of Genome-Wide Association Studies

In mapping genes contributing to common diseases, a popular hypothesis is that causal variants are common in the population [Chakravarti 1999; Lander 1996]. Under this assumption, variants accounting for the genetic variance of common disease susceptibility can be detected in association studies by testing a large number of tagging SNPs across the genome and relying on linkage disequilibrium (LD) [Risch 2000]. The International HapMap Project [2003; Frazer, et al. 2007], which is based on the “common disease-common variants” (CDCV) assumption, facilitates the design and analysis of association studies searching for common variants. The HapMap project was designed to characterize the pattern of whole-genome variation and LD in four population samples. As a benefit of the HapMap project, tagging SNPs can be selected for genotyping in order to improve efficiency and reduce cost. In addition, advances in genotyping technology, such as the Affymetrix and Illumina dense genotyping chips, provide good coverage of the human genome by genotyping hundreds of thousands of SNPs at a time. As a result, the first wave of GWA studies has had substantial success in identifying susceptibility loci for many complex traits.

The Wellcome Trust Case Control Consortium (WTCCC) study [Consortium 2007] is currently the largest and most comprehensive genome-wide study. The study analyzed 2,000 samples from Great Britain for each of seven diseases: type 1 diabetes, type 2 diabetes, coronary artery disease, hypertension, bipolar disorder, rheumatoid arthritis and Crohn's disease. A common set of 3,000 controls from the 1958 British Birth Cohort and the three national UK Blood Services was used for comparison with the case samples. Genotyping was conducted using the Affymetrix GeneChip 500K Mapping Array Set. Case-control comparisons identified 24 independent association signals (p < 5 × 10⁻⁷): 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes (Wellcome Trust Case Control Consortium 2007, Figure 1-1).

Follow-up studies have validated and replicated the findings for several SNPs in type 1 diabetes [Todd, et al. 2007], type 2 diabetes [Zeggini, et al. 2007], coronary artery disease [Samani, et al. 2007] and rheumatoid arthritis [Thomson, et al. 2007]. The number of loci implicated in predisposition to other diseases, such as Crohn's disease [Barrett, et al. 2008; Parkes, et al. 2007], prostate cancer [Gudmundsson, et al. 2007; Thomas, et al. 2008] and breast cancer [Easton, et al. 2007; Gold, et al. 2008], has also expanded rapidly. Several common variants have been identified as risk loci for important continuous traits, such as body mass index [Frayling, et al. 2007; Herbert, et al. 2006], lipids [Kathiresan, et al. 2008; Willer, et al. 2008] and height [Gudbjartsson, et al. 2008; Lettre, et al. 2008; Weedon, et al. 2008]. The website of the National Human Genome Research Institute (NHGRI) (http://www.genome.gov/) maintains an updated catalog of published GWA studies.


Figure 1-1 Genome-wide scans for seven diseases of WTCCC study [Consortium 2007]. The −log10 of the trend test p-value for each SNP is plotted against position on each chromosome, with each panel representing one of the seven diseases. P-values smaller than 1 × 10⁻⁵ are highlighted in green.

1.1.3 Limitations and Challenges of Genome Wide Association Studies

Despite the dramatic advances in identifying gene variants that influence common diseases, great challenges remain. Most of the heritable risk for most common diseases is still unidentified, and many disease susceptibility variants are undiscovered. For some common diseases with a strong heritable component, such as bipolar disorder and hypertension, GWA studies have been almost completely unsuccessful. Moreover, the proportion of phenotypic variation explained by common variants is surprisingly low, leading to the question of ‘missing heritability’. For example, height is a highly heritable trait with an estimated heritability of around 0.8, which means that within a population about 80% of the variance in height among individuals is due to genetic factors. Although three GWA studies in 2008 [Gudbjartsson, et al. 2008; Lettre, et al. 2008; Weedon, et al. 2008] found 40 previously unknown variants, each locus explains only a small proportion of the phenotypic variance (0.3% - 0.5%).

The failure of GWA studies to identify susceptibility variants can have various causes, a few of which are listed and briefly discussed below:

1) Subject ascertainment

Most GWA studies have used case-control designs because study subjects are easy to ascertain; nevertheless, such designs raise the issue of how best to select cases and controls. For example, a common-control design such as that of the WTCCC study may lose power because the controls may include individuals with latent diagnoses of the disease of interest.

2) Population stratification

Different human populations may be extremely similar across the genome as a whole but carry quite different genetic variants relevant to disease. In addition, the linkage disequilibrium between markers can vary between populations. A potential problem for every population-based association study is the presence of population structure in the samples, which can lead to false positives or to missed real effects [Marchini, et al. 2004] (Figure 1-2). For example, if the cases and controls are sampled from several subpopulations that differ in the allele frequencies of a tested marker, and if disease prevalence also differs among the subpopulations, then the disease and the marker may appear associated even though the marker has no effect, which leads to false-positive results.
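The example in the paragraph above can be made concrete with a small numerical sketch. The subpopulation sizes and allele frequencies below are made-up numbers chosen for illustration (they are not taken from Marchini et al. 2004), and the function names are hypothetical. Within each subpopulation the allele frequency is the same in cases and controls, so the marker has no effect, yet the pooled comparison shows an apparent association because cases over-sample one subpopulation.

```python
# Hypothetical illustration of confounding by population stratification.
# All sizes and allele frequencies are invented for this sketch.

def pooled_allele_frequency(groups):
    """groups: list of (number_of_individuals, frequency_of_allele_A)."""
    total = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total

def odds_ratio(p_cases, p_controls):
    """Allelic odds ratio for allele A, cases versus controls."""
    return (p_cases / (1 - p_cases)) / (p_controls / (1 - p_controls))

FREQ_POP1 = 0.50  # frequency of allele A in population 1
FREQ_POP2 = 0.20  # frequency of allele A in population 2

# Cases over-sample population 2, e.g. because the disease is more
# prevalent there; controls over-sample population 1.
cases = [(300, FREQ_POP1), (700, FREQ_POP2)]
controls = [(700, FREQ_POP1), (300, FREQ_POP2)]

p_cases = pooled_allele_frequency(cases)
p_controls = pooled_allele_frequency(controls)
pooled_or = odds_ratio(p_cases, p_controls)

# Odds ratio inside each subpopulation: no association in either stratum.
stratum_ors = [odds_ratio(pc, pk) for (_, pc), (_, pk) in zip(cases, controls)]

print(f"pooled allele frequency: cases {p_cases:.2f}, controls {p_controls:.2f}")
print(f"pooled odds ratio: {pooled_or:.2f}")
print(f"within-population odds ratios: {stratum_ors}")
```

The pooled odds ratio departs from 1 although both within-population odds ratios equal 1, which is the spurious signal that methods accounting for population structure aim to remove.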

Figure 1-2 The effects of population structure at a SNP. The figure shows an example of population stratification with two populations. Cases contain an excess of individuals from population 2, and population 2 has a lower frequency of allele A than population 1. As a result, the structure mimics the signal of association [Marchini, et al. 2004].

To account for the large number of tests, GWA studies often apply an extremely stringent significance level (< 10⁻⁷) and therefore require large sample sizes to detect true signals. The confounding effect of population stratification grows as larger sample sizes and smaller significance levels are used to declare reliable evidence of association, potentially leading to more false positives [Marchini, et al. 2004] (Figure 1-3). Although methods such as genomic control and principal component analysis have been developed to control the effect of population structure, they may also reduce power.
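As a minimal sketch of the genomic control idea mentioned above, the inflation factor λ can be estimated as the median of the observed 1-df association statistics divided by the median of a χ² distribution with 1 degree of freedom (about 0.455), and every statistic is then divided by λ. The statistics below are hypothetical values and the function names are assumptions made for this illustration; it is not the procedure of any particular software package.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda; lambda below 1 is treated as 1."""
    lam = max(1.0, genomic_control_lambda(chi2_stats))
    return [stat / lam for stat in chi2_stats]

# Hypothetical 1-df association statistics from a handful of markers.
observed = [0.2, 0.5, 0.9, 1.4, 2.8, 6.5, 30.0]

lam = genomic_control_lambda(observed)
adjusted = genomic_control_adjust(observed)

print(f"lambda = {lam:.2f}")
print("adjusted statistics:", [round(stat, 2) for stat in adjusted])
```

Because every statistic is shrunk by the same factor, a true signal is deflated along with the inflated background, which is one way to see why such corrections can cost power.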

Figure 1-3 Multiplicative change in p-values due to population structure (shown on the log10 scale). Three populations are simulated with c1 = 0.234, c2 = 0.116, c3 = 0.152*, RR = 1:1:3. The key indicates the different P values considered. As sample size increases, the effects of structure become more severe [Marchini, et al. 2004]. *The variance parameter cj specifies how far the jth subpopulation’s allele frequencies tend to be from typical values, which is similar to FST values.

3) Variants with small effect size

Meta-analyses suggest that the typical effect sizes of detected individual genetic variants for complex diseases correspond to odds ratios of 1.2–1.6 [Ioannidis, et al. 2006]. It is not impossible to detect smaller effects, but it is extremely difficult to differentiate them from the large amount of noise. One weakness of GWA studies is that the signal from a true risk variant may be obscured by statistical noise from the large number of markers that are not associated with the disease. GWA studies set an extremely stringent threshold for a variant to be accepted as a disease susceptibility locus. Although this reduces false positives, true disease markers with small effects may be lost in the background noise.

A power analysis estimated the effect of small effect size on the power of association studies [Iles 2008]. The results showed that, assuming all typed variants have a multiplicative mode of inheritance, there is good power to detect a variant with a genotype relative risk (GRR) of 2 even at low frequency. When the GRR drops to 1.5, a variant with a minor allele frequency (MAF) as low as 0.05 is still detectable with 3,000 cases and 3,000 controls, but power is decent only for MAF > 0.2 with a sample size of 1,000. For GRR = 1.3, power is good at high allele frequency (MAF > 0.2) for a sample size of 3,000, but is generally poor (power < 0.2) for a sample size of 1,000. When the effect size is as small as GRR = 1.2, power is poor even for a sample size of 3,000 (Figure 1-4).
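The qualitative pattern described above can be explored with a rough power calculation. The sketch below is not the calculation of Iles (2008); it assumes an allelic (two-proportion) test with a normal approximation, a multiplicative model in which the case allele frequency is GRR·p / (GRR·p + 1 − p), and a control allele frequency equal to the population MAF. The function names are hypothetical, and the printed values are not expected to match Figure 1-4 exactly.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def case_allele_frequency(maf, grr):
    """Risk-allele frequency in cases under a multiplicative model."""
    return grr * maf / (grr * maf + (1.0 - maf))

def allelic_test_power(maf, grr, n_cases, n_controls, alpha=5e-7):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    p0 = maf                               # assumed control allele frequency
    p1 = case_allele_frequency(maf, grr)
    a1, a0 = 2 * n_cases, 2 * n_controls   # number of alleles in each group
    p_bar = (a1 * p1 + a0 * p0) / (a1 + a0)
    se_null = sqrt(p_bar * (1 - p_bar) * (1 / a1 + 1 / a0))
    se_alt = sqrt(p1 * (1 - p1) / a1 + p0 * (1 - p0) / a0)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong tail is ignored.
    return _STD_NORMAL.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)

for grr in (2.0, 1.5, 1.3, 1.2):
    for n in (1000, 3000):
        power = allelic_test_power(maf=0.2, grr=grr, n_cases=n, n_controls=n)
        print(f"GRR={grr:.1f}  n={n} cases/controls  MAF=0.2  power={power:.3f}")
```

Varying `maf` in the loop shows how quickly power falls for low-frequency variants once the GRR drops below about 1.5.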


Figure 1-4 The effect of small effect size on the power of association studies. Power calculated for 1000 cases and 1000 controls (A) and 3000 cases and 3000 controls (B) to reach a genome-wide significance set at α = 5 × 10⁻⁷. Lines are for GRR = 2 (solid), 1.5 (dashed), 1.3 (dotted), 1.2 (dashed and dotted). A multiplicative mode of inheritance is assumed [Iles 2008].

4) Rare variants

GWA studies are based on the “common disease-common variants” (CDCV) assumption, which holds that common diseases result from common variants. Under this model, disease susceptibility results from the joint action of several common variants, and unrelated affected individuals share a significant proportion of risk alleles. Although this assumption has theoretical justification, it may not apply to all common human diseases. The results of GWA studies for various diseases have not provided unequivocal evidence in support of the CDCV hypothesis [Iles 2008]. The extreme alternative to CDCV is the “common disease-multiple rare variants” (CDMRV) hypothesis, under which many rare variants contribute to common disease susceptibility [Iyengar and Elston 2007]. A recent simulation study showed that multiple rare variants may create a 'synthetic' association with a common variant in GWA studies [Dickson, et al.]. This study demonstrated that even the signals detected for common variants may come from the effects of rare variants.

Current GWA studies have essentially no power to detect rare disease susceptibility variants [Zeggini, et al. 2005]. As illustrated in Figure 1-4, power is very limited for variants with a frequency below 0.2 when the GRR is below 1.5. Figure 1-5 shows that if susceptibility alleles have minor allele frequencies (MAFs) of less than 0.1 and effect sizes smaller than an odds ratio of 1.3, then unrealistically large sample sizes of more than 10,000 cases and 10,000 controls (or 10,000 families) would be required to achieve convincing statistical support for a disease association.
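To see how the required sample size grows as allele frequency and odds ratio shrink, the standard two-proportion sample size formula can be applied to allele counts. This is a sketch under stated assumptions, not the calculation of Wang et al. (2005): it assumes equal numbers of cases and controls, an allelic test, a case allele frequency of OR·p / (1 − p + OR·p), and a normal approximation. The function name `required_cases` and its defaults are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_cases(maf, odds_ratio, alpha=1e-6, power=0.80):
    """Approximate number of cases (equal to the number of controls)
    needed by a two-sided allelic test to reach the requested power."""
    norm = NormalDist()
    p0 = maf                                               # control allele frequency
    p1 = odds_ratio * p0 / (1.0 - p0 + odds_ratio * p0)    # case allele frequency
    p_bar = (p0 + p1) / 2.0
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)
    z_beta = norm.inv_cdf(power)
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p1 * (1.0 - p1) + p0 * (1.0 - p0))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2.0)   # two alleles per individual

for odds in (1.2, 1.3, 1.5, 2.0):
    sizes = {maf: required_cases(maf, odds) for maf in (0.01, 0.05, 0.1, 0.3)}
    print(f"OR={odds}: {sizes}")
```

The output is a table of approximate case counts by MAF for each odds ratio; the steep rise toward low MAF and small odds ratios is the point of interest, rather than any individual number.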


Figure 1-5 Effects of allele frequency on sample size requirements. The sample sizes of cases and controls required in an association study to detect disease variants are shown for allelic odds ratios of 1.2 (red), 1.3 (blue), 1.5 (yellow) and 2 (black). The calculation of sample size is based on a statistical power of 80% at a significance level of P < 10⁻⁶, assuming a multiplicative model for the effects of alleles [Wang, et al. 2005].


1.2 Multiple Rare Variants Hypothesis and Recent Evidence

1.2.1 Common Disease-Multiple Rare Variants Hypothesis

As reviewed in the previous section, GWA studies have detected many common genetic variants that confer susceptibility to complex diseases. However, most of the common variants found in GWA studies have modest odds ratios, between 1.2 and 1.5 [Bodmer and Bonilla 2008], and can account for only a limited fraction of the observed familial aggregation. In many cases, signals identified and confirmed by family-based studies have failed to be captured in follow-up association studies, even though linkage analysis has been argued to be less powerful than association analysis for identifying complex-disease genes. One explanation for this failure is the multiple rare variants hypothesis.

The allelic architecture of most common diseases is likely to be broad under an evolutionary model with the collective effects of mutation, random genetic drift and selection [Pritchard 2001; Pritchard and Cox 2002]. The randomness of the evolutionary process can also mean that many disease susceptibility alleles are rare and population specific. Bodmer and Bonilla [Bodmer and Bonilla 2008] compared the characteristics of common and rare disease variants (Table 1-1) and summarized the odds ratio distributions for common and rare variants identified in recently published studies (Figure 1-6). The two distributions are distinctly different: most common variants have ORs between 1.2 and 1.5, while most rare variants have ORs above 2.


Table 1-1 Comparison of characteristics of common and rare disease variants [Bodmer and Bonilla 2008].

Figure 1-6 Distribution of odds ratios for common and rare variants [Bodmer and Bonilla 2008]. Common variants are defined as variants with MAF > 5% and rare variants are defined as variants with MAF < 5%.


1.2.2 Recent Evidence of Rare Variants Contributing to Common Diseases

Recent studies, mainly based on resequencing, have identified multiple rare variants for several common diseases. One remarkable finding concerns breast cancer. Ten genes account for inherited breast cancer, and all of them bear many rare mutations [Walsh and King 2007]. The accumulated evidence suggests that the high heterogeneity of inherited breast cancer can be at least partially explained by a “common disease-multiple rare variants” model. Stratton and Rahman [Stratton and Rahman 2008] suggested three well-defined classes of breast cancer susceptibility alleles with different levels of risk and prevalence in the population: rare high-penetrance alleles, rare moderate-penetrance alleles and common low-penetrance alleles. Other common diseases have been shown to have a similar pattern of inheritance. Rare variants in three genes (SLC12A3, SLC12A1 and KCNJ1) have been reported to contribute to reduced blood pressure and protection from hypertension [Ji, et al. 2008a]. Cohen and colleagues [Cohen, et al. 2004] sequenced three genes (ABCA1, APOA1 and LCAT) and found that multiple rare genetic variants in the coding regions contribute significantly to low plasma HDL cholesterol levels. In addition, multiple rare variants have been reported to be associated with metabolic phenotypes [Romeo, et al. 2007] and plasma angiotensinogen level [Zhu, et al. 2005a].

Rare variants are very difficult to detect with current LD-based association and GWA studies because only common SNPs are ascertained. The strategy for discovering rare variants is mostly gene-based. The procedure involves sequencing candidate genes in a chosen disease group and then examining the frequencies of the identified rare variants in an appropriately selected control group. It is followed by an assessment of the potential effects of those variants on the function of the relevant gene products. A rare variant can be regarded as a genetic risk factor for a complex trait if it not only shows a significant difference in frequency between case and control groups, but can also be shown to affect the function of the relevant gene product. While this strategy has led to some successes in detecting rare variants associated with common disease, it poses great challenges. The choice of candidate genes is critical to the success of detecting rare variants, as are appropriate choices of case and control groups. In addition, the strategy is mostly based on a separate analysis of each rare variant and thus lacks statistical power. Haplotype-based analysis may provide a better solution because it has been shown theoretically to be more powerful than single-SNP analysis when haplotype frequency estimates are accurate [Zaitlen, et al. 2007]. Some studies have successfully detected rare variants using haplotype analyses, including a finding of two rare haplotypes with significant effects on the osteoporosis phenotype [Liu, et al. 2005] and a report of rare variants contributing to variation in angiotensinogen levels [Zhu, et al. 2005a]. However, methods based on individual haplotype analysis still suffer from low power and thus require large sample sizes to detect rare variants. New statistical methods for the analysis of rare variants are therefore greatly needed in association studies.


1.3 Current statistical methods for detecting rare genetic variants in association studies

Recently, several statistical methods have been developed and shown to be powerful for detecting associations with rare variants. They share the basic idea of summarizing information from multiple rare variants.

1.3.1 Multivariate and Collapsing Methods

Li and Leal [Li and Leal 2008] discussed and compared several methods in a recent paper in the American Journal of Human Genetics. One approach is the single-marker test, which uses standard univariate statistical tests to test individual variants within a gene, followed by multiple-comparison correction. Another is the multiple-marker test, which tests all variants simultaneously with a multivariate test, for example Hotelling’s $T^2$ test. The test is constructed following Xiong’s previous work [Xiong, et al. 2002]:

An indicator variable is defined for the genotype of the $i$th variant for the $j$th individual in cases:

\[
X_{ji} =
\begin{cases}
1 & \text{genotype is } AA \\
0 & \text{genotype is } Aa \\
-1 & \text{genotype is } aa
\end{cases}
\]

Similarly, $Y_{ji}$ is defined for the control population. Then the covariance matrix for the pooled samples of the indicator variables for $M$ variants is given by:

\[
S = \frac{1}{N_A + N_{\bar{A}} - 2}
\left[ \sum_{j=1}^{N_A} (X_j - \bar{X})(X_j - \bar{X})^T
+ \sum_{j=1}^{N_{\bar{A}}} (Y_j - \bar{Y})(Y_j - \bar{Y})^T \right]
\]

where $N_A$ and $N_{\bar{A}}$ are the sample sizes of cases and controls, respectively; and $\bar{X}$ and $\bar{Y}$ are the sample means of the indicator variables across the cases and controls, respectively.

Hotelling’s $T^2$ statistic is defined as:

\[
T^2 = \frac{N_A N_{\bar{A}}}{N_A + N_{\bar{A}}}
(\bar{X} - \bar{Y})^T S^{-1} (\bar{X} - \bar{Y})
\]

Under the null hypothesis, $\frac{N_A + N_{\bar{A}} - M - 1}{M (N_A + N_{\bar{A}} - 2)} T^2$ is asymptotically distributed as an F distribution, with $M$ and $N_A + N_{\bar{A}} - M - 1$ degrees of freedom. Under the alternative hypothesis, $T^2$ is asymptotically distributed as a non-central $\chi^2_M$ distribution and the noncentrality parameter (NCP) is given by

\[
\nu_H = \mu^T \left( \frac{1}{N_A} \Sigma_A + \frac{1}{N_{\bar{A}}} \Sigma_{\bar{A}} \right)^{-1} \mu ,
\]

where $\mu$ is the vector of expected differences between cases and controls. The power of the multivariate test is thus given by

\[
\eta_H = \Pr\left( \chi^2_M(\nu_H) \ge \chi^2_{M,\,1-\alpha} \right).
\]

Different from multiple marker tests, which can have a large number of degrees of freedom, the collapsing method collapses genotypes across variants, resulting in enrichment of signals and reduction in the number of degrees of freedom at the same time. This method uses an indicator variable to code the presence of rare variants. For the $j$th case,

\[
X_j =
\begin{cases}
1 & \text{rare variants present} \\
0 & \text{otherwise}
\end{cases}
\]

$Y_j$ is defined in a similar way for the $j$th control. Then the multiple tests of association of rare variants are transformed into one test of the difference in the proportion of individuals who carry rare variants between cases and controls.

24

Li and Leal have developed the Combined Multivariate and Collapsing (CMC) method to take advantage of both the multiple-marker test and the collapsing method. They proposed dividing markers into subgroups according to certain criteria (for example, allele frequencies) and then collapsing the genotypes within each group. To analyze the groups of genotype data, a multivariate test such as Hotelling’s T² test is applied. The power of the CMC method is calculated in the same way as described for the multiple-marker test, with the number of dimensions equal to the number of subgroups.
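The collapsing step and the Hotelling's T² statistic described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not Li and Leal's implementation; the function names (`collapse`, `hotelling_t2`, `f_scaled`) and the input layout (one list of minor-allele counts per individual, plus a list of variant-index groups) are assumptions made for the example. It also assumes the pooled covariance matrix is non-singular.

```python
def collapse(genotypes, groups):
    """Return one 0/1 indicator per group: 1 if any variant in the group is carried."""
    return [[1.0 if any(g[v] > 0 for v in grp) else 0.0 for grp in groups]
            for g in genotypes]

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def pooled_cov(x, y):
    """Pooled covariance matrix S of cases x and controls y."""
    m = len(x[0])
    s = [[0.0] * m for _ in range(m)]
    for rows in (x, y):
        mu = mean_vec(rows)
        for r in rows:
            d = [r[k] - mu[k] for k in range(m)]
            for a in range(m):
                for b in range(m):
                    s[a][b] += d[a] * d[b]
    denom = len(x) + len(y) - 2
    return [[v / denom for v in row] for row in s]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))) / aug[r][r]
    return z

def hotelling_t2(x, y):
    """T^2 = NA*NB/(NA+NB) * (xbar - ybar)' S^-1 (xbar - ybar)."""
    na, nb = len(x), len(y)
    d = [a - b for a, b in zip(mean_vec(x), mean_vec(y))]
    z = solve(pooled_cov(x, y), d)  # z = S^-1 d
    return na * nb / (na + nb) * sum(di * zi for di, zi in zip(d, z))

def f_scaled(t2, na, nb, m):
    """Rescale T^2 so that it can be referred to F(m, na + nb - m - 1)."""
    return (na + nb - m - 1) / (m * (na + nb - 2)) * t2
```

With a single group the statistic reduces to the square of the pooled two-sample t statistic, which gives a quick sanity check on any implementation.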

The CMC method has been demonstrated to be both robust and powerful through simulation studies. However, there are some potential problems. First, depending on the frequency threshold used to define a rare variant, when a large enough number of markers are collapsed into one group, it becomes highly likely that every individual carries at least one rare variant. When this is the case, the power of the test is greatly reduced.

Another drawback is that collapsing will weaken the signal if some variants are protective while others increase disease risk. In this situation, prior information on high-risk and protective variants is crucial to obtain appropriate power.

1.3.2 Two-Stage Methods

Zhu and colleagues [Zhu, et al.] have developed a novel association method that can summarize multiple rare variants. The rationale is that rare risk haplotypes can be co-classified together and the power to test association can be improved by testing the co-classified haplotypes as a set. Two designs for co-classifying the rare variants in association studies have been proposed. Those designs will be described in detail in the following sections.


1. Co-classifying rare risk haplotypes.

Design 1: Co-classifying rare risk haplotype using affected sibpairs.

The main idea of this design is first to identify a set of susceptibility haplotypes by comparing the shared haplotype frequencies in affected sibpairs with those in a general population, and then to test the set of these haplotypes for association in an association study, rather than testing individual haplotypes or loci. The reasoning behind this design is that a haplotype will be enriched among the shared haplotypes in affected sibpairs if this haplotype carries a disease variant, whether the disease variant is rare or common. Let $D = \{D_1, D_2, \ldots, D_k\}$ be a set consisting of rare risk haplotypes with the corresponding haplotype frequencies $p_1, p_2, \ldots, p_k$ and $d$ be the rest of the haplotypes with frequency $1 - p = 1 - \sum_{i=1}^{k} p_i$. Let $s_1$ and $s_2$ be two sibs of a sibpair, where $s_i = 1$ refers to the i-th sib being affected and $s_i = 0$ refers to not being affected. Let $I$ be the number of haplotypes shared by a sibpair. Let $f_g$ be the penetrance of genotype $g$. Then the frequency of a rare risk haplotype $D_i$ among the shared haplotypes in affected sibpairs can be written as

$$P(D_i \mid s_1 = s_2 = 1, \text{and shared haplotypes}) = \frac{P(D_i \text{ is shared}, s_1 = s_2 = 1)}{P(s_1 = s_2 = 1, \text{and shared haplotypes})}$$

The denominator is the proportion of haplotypes IBD in affected sibpairs. Under the assumption of all the risk haplotypes having the same effect,


$$P(s_1 = s_2 = 1, \text{and shared haplotypes}) = \sum_{g_1, g_2}\sum_{I=1}^{2} f_{g_1} f_{g_2}\, P(\text{shared haplotypes} \mid I)\, P(I)\, P(g_1, g_2 \mid I)$$

$$= \frac{1}{4}\left[f_2^2 p^3 + 2 f_2 f_1 p^2(1-p) + f_1^2 p(1-p) + 2 f_1 f_0 p(1-p)^2 + f_0^2(1-p)^3 + f_2^2 p^2 + 2 f_1^2 p(1-p) + f_0^2(1-p)^2\right]$$

Similarly,

$$P(D_i \text{ is shared}, s_1 = s_2 = 1) = \frac{p_i}{4}\left[f_2^2 p + f_1^2(1-p) + f_2^2 p^2 + 2 f_2 f_1 p(1-p) + f_1^2(1-p)^2\right]$$

Consider different genetic models, where $\lambda$ is the genotypic relative risk:

1) Multiplicative: $f_2 = \lambda f_0$, $f_1 = \sqrt{\lambda} f_0$

$$P(D_i \mid s_1 = s_2 = 1, \text{and shared haplotypes}) = \frac{p_i \lambda\left[(\sqrt{\lambda}p + 1 - p)^2 + \lambda p + 1 - p\right]}{\lambda^2 p^2 + 2\lambda p(1-p) + (1-p)^2 + \lambda^2 p^3 + 2\lambda\sqrt{\lambda}p^2(1-p) + \lambda p(1-p) + 2\sqrt{\lambda}p(1-p)^2 + (1-p)^3}$$

2) Dominant: $f_2 = f_1 = \lambda f_0$

$$P(D_i \mid s_1 = s_2 = 1, \text{and shared haplotypes}) = \frac{2\lambda^2 p_i}{\lambda^2 p^2 + 2\lambda^2 p(1-p) + (1-p)^2 + \lambda^2 p^3 + 2\lambda^2 p^2(1-p) + \lambda^2 p(1-p) + 2\lambda p(1-p)^2 + (1-p)^3}$$

3) Recessive: $f_2 = \lambda f_0$, $f_1 = f_0$

$$P(D_i \mid s_1 = s_2 = 1, \text{and shared haplotypes}) = \frac{p_i\left[\lambda^2 p + 1 - p + (\lambda p + 1 - p)^2\right]}{\lambda^2 p^2 + 2p(1-p) + (1-p)^2 + \lambda^2 p^3 + 2\lambda p^2(1-p) + p(1-p) + 2p(1-p)^2 + (1-p)^3}$$

In each genetic model, it can be observed that the rare haplotypes are enriched among haplotypes shared by affected sibpairs. The enrichment of a rare risk haplotype is dependent only on the cumulative risk haplotype frequency ($p$) in the population and the genotypic relative risk ($\lambda$).

If there is no risk variant in the region, no enrichment among the haplotypes shared by affected sibpairs can be observed. The power of co-classification is dependent on the misclassification rate that can be tolerated, the genotypic relative risk and the enrichment of a rare risk haplotype among the haplotypes shared by affected sibpairs.
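The enrichment formulas for affected sibpairs can be checked numerically. The sketch below evaluates the general ratio of the two bracketed expressions for arbitrary penetrances and then plugs in the three genetic models; the function names and the baseline penetrance default are assumptions made for illustration (the baseline penetrance cancels in the ratio).

```python
from math import sqrt

def sibpair_enrichment(p_i, p, f2, f1, f0):
    """Frequency of risk haplotype D_i among haplotypes shared by affected sibpairs.

    p_i: population frequency of D_i; p: cumulative risk haplotype frequency;
    f2, f1, f0: penetrances for 2, 1 and 0 risk haplotypes.
    """
    q = 1.0 - p
    # The common factor 1/4 in numerator and denominator cancels.
    num = p_i * (f2**2 * p + f1**2 * q
                 + f2**2 * p**2 + 2 * f2 * f1 * p * q + f1**2 * q**2)
    den = (f2**2 * p**3 + 2 * f2 * f1 * p**2 * q + f1**2 * p * q
           + 2 * f1 * f0 * p * q**2 + f0**2 * q**3
           + f2**2 * p**2 + 2 * f1**2 * p * q + f0**2 * q**2)
    return num / den

def penetrances(model, lam, f0=0.01):
    """Return (f2, f1, f0) for a genotypic relative risk lam; f0 is an assumed baseline."""
    if model == "multiplicative":
        return lam * f0, sqrt(lam) * f0, f0
    if model == "dominant":
        return lam * f0, lam * f0, f0
    if model == "recessive":
        return lam * f0, f0, f0
    raise ValueError("unknown model: " + model)

def enrichment_for_model(p_i, p, lam, model):
    f2, f1, f0 = penetrances(model, lam)
    return sibpair_enrichment(p_i, p, f2, f1, f0)
```

Setting the relative risk to 1 returns the population frequency `p_i` for every model, which matches the statement that no enrichment is expected when the region carries no risk variant.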

Assume there are $N_1$ haplotypes shared among $N$ affected sibpairs, and there are $k$ different haplotypes $h_1, h_2, \ldots, h_k$ with observed haplotype frequencies $p_1, p_2, \ldots, p_k$ among the $N_1$ shared haplotypes. With a further assumption that the i-th haplotype has the corresponding haplotype frequency $p_i^0$ in the general population, the risk haplotype set is defined as

$$S = \left\{h_i \;\middle|\; p_i - p_i^0 > \gamma\sqrt{\frac{p_i^0(1 - p_i^0)}{N}}\right\}$$

where $\gamma$ is a predefined number that affects the misclassification rate and power. The power to detect a rare risk haplotype is the probability of observing more than $\left[N\left(p_i^0 + \gamma\sqrt{p_i^0(1 - p_i^0)/N}\right)\right]$ rare haplotypes in $N$ Bernoulli trials, with the sampling probability $p_i$, which is the rare risk haplotype frequency among shared haplotypes. Here $[x]$ is the largest integer less than $x$. $N$ is used instead of $N_1$ because sibpairs are expected to share $N$ haplotypes. When $p_i^0$, the haplotype frequency in the general population, is unknown, the haplotype frequency shared by sibpairs is compared instead with that of untransmitted haplotypes from parents to sibs.


Design 2: Co-classifying rare risk haplotypes using unrelated cases.

Similar to the affected sibpair design, the frequency of a rare risk haplotype Di is also enriched among cases and can be calculated as

$$P(D_i \mid \text{affected}) = \frac{P(\text{affected} \mid D_i D_i)P(D_i D_i) + 0.5\sum_{j \neq i} P(\text{affected} \mid D_i D_j)P(D_i D_j) + 0.5\,P(\text{affected} \mid D_i d)P(D_i d)}{P(\text{affected})}$$

Considering different genetic models:

1) Multiplicative: $f_2 = \lambda f_0$, $f_1 = \sqrt{\lambda} f_0$

$$P(D_i \mid \text{affected}) = p_i\frac{\sqrt{\lambda}}{\sqrt{\lambda}p + 1 - p}$$

2) Dominant: $f_2 = f_1 = \lambda f_0$

$$P(D_i \mid \text{affected}) = p_i\frac{\lambda}{\lambda - (1-p)^2(\lambda - 1)}$$

3) Recessive: $f_2 = \lambda f_0$, $f_1 = f_0$

$$P(D_i \mid \text{affected}) = p_i\frac{p(\lambda - 1) + 1}{p^2(\lambda - 1) + 1}$$

Assume there are $N$ cases and the observed haplotype frequencies of $h_1, h_2, \ldots, h_k$ are $p_1, p_2, \ldots, p_k$ among the $N$ cases, respectively. Then the rare risk haplotype set is defined as

$$S = \left\{h_i \;\middle|\; p_i - p_i^0 > \gamma\sqrt{\frac{p_i^0(1 - p_i^0)}{2N}}\right\}$$

The haplotypes in $S$ will be considered as risk haplotypes in a case-control association analysis. The power to detect a rare risk haplotype is the probability of observing more than $\left[2N\left(p_i^0 + \gamma\sqrt{p_i^0(1 - p_i^0)/2N}\right)\right]$ rare haplotypes in $2N$ Bernoulli trials with the sampling probability $p_i$, which is the rare risk haplotype frequency in cases. When $p_i^0$, the haplotype frequency in the general population, is unknown, the haplotype frequencies between cases and controls are compared instead. However, it must be noted that the controls used in the co-classification stage must be independent of the controls used in the follow-up association test to avoid inflating type I error.
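The power statement for the unrelated-case design is a binomial tail probability, so it can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the binomial tail is summed in log space to avoid overflow, and `floor` is used as a stand-in for the bracket $[x]$ (the two agree whenever $x$ is not an integer). It assumes the sampling probability lies strictly between 0 and 1.

```python
from math import exp, floor, lgamma, log, log1p, sqrt

def case_enrichment(p_i, p, lam, model):
    """Frequency of risk haplotype D_i among case haplotypes for the three models."""
    if model == "multiplicative":
        return p_i * sqrt(lam) / (sqrt(lam) * p + 1 - p)
    if model == "dominant":
        return p_i * lam / (lam - (1 - p) ** 2 * (lam - 1))
    if model == "recessive":
        return p_i * (p * (lam - 1) + 1) / (p ** 2 * (lam - 1) + 1)
    raise ValueError("unknown model: " + model)

def binom_sf(c, n, prob):
    """P(X > c) for X ~ Binomial(n, prob), with 0 < prob < 1."""
    total = 0.0
    for x in range(c + 1, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                   + x * log(prob) + (n - x) * log1p(-prob))
        total += exp(log_pmf)
    return total

def coclassification_power(p0, p_case, n_cases, gamma):
    """Probability that the haplotype count among 2N case haplotypes exceeds the cutoff."""
    n_hap = 2 * n_cases
    cutoff = floor(n_hap * (p0 + gamma * sqrt(p0 * (1 - p0) / n_hap)))
    return binom_sf(cutoff, n_hap, p_case)
```

For example, `coclassification_power(0.0125, case_enrichment(0.0125, 0.10, 2.0, "dominant"), 800, 1.0)` gives the chance that a haplotype of population frequency 1.25% is placed in the risk set when 800 cases are used for co-classification; the value of `gamma` here is arbitrary.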

2. Test for association for the risk haplotype set.

The haplotypes in S are treated equally and the set is denoted as a haplotype S.

Then the association between haplotype S and a disease is tested by comparing the

frequency of haplotype S between cases and controls. These cases and controls must be

independent of the cases and controls used for co-classifying the rare risk haplotypes.

Zhu et al. have demonstrated that the two-stage methods have power to co-

classify rare risk haplotypes and detect association of those rare risk haplotypes with

phenotypes, both by simulation studies and application to the WTCCC hypertension data.

However, the proposed methods are based on two stages – co-classifying rare risk haplotypes and testing association, which raises the question of optimal sample sizes for each stage given a fixed total sample size, and this needs further investigation. This question can be examined both theoretically and by simulation studies, in a way similar to determining the optimal sample sizes for two-stage designs in linkage and association studies.


1.4 Statistical methods for combining p-values

GWA studies face the problem of testing a genome-wide set of hypotheses, most of which are true null hypotheses and a small number of which are false. Traditional multiple-test adjustments treat each hypothesis individually and do not exploit the fact that the combined evidence from several false null hypotheses can be detected with more power. Several methods of combining p-values have been developed over the past decades to address this issue.

Consider conducting tests for the $L$ hypotheses $H_i$, $i = 1, 2, \ldots, L$, and let $p_i$ be the p-value for each test (the tests are assumed to be independent), which has a uniform [0, 1] distribution under the null hypothesis. The Fisher combination test [Fisher 1932] uses the statistic:

$$t = -2\sum_{i=1}^{L}\ln p_i = -2\ln\left(\prod_{i=1}^{L} p_i\right)$$

which is distributed as a chi-square distribution with $2L$ degrees of freedom under the null hypothesis (all $H_i$ hypotheses are true). Edgington [Edgington 1972] proposed to use the sum of the p-values instead. Wilkinson (1951) found that the single test significance level is $\tau = 1 - (1 - \alpha)^{1/L}$ for an overall $\alpha$-level test, which is the basis of Šidák’s correction. Stouffer [Stouffer 1949] applied a normal transformation to the p-values before combining them. The combined p-value is then

$$p = 1 - \Phi\left(\frac{1}{\sqrt{L}}\sum_{i=1}^{L}\Phi^{-1}(1 - p_i)\right)$$

where $\Phi(x)$ denotes the cumulative distribution function of the standard normal distribution. Simes (1986) has proposed another test based on the $L$ ordered p-values using the fact that $\Pr\left(\bigcup_i \left\{p_{(i)} \leq i\alpha/L\right\}\right) \leq \alpha$. Then the global p-value is given by $\min_i\left\{L p_{(i)}/i\right\}$.

More recently, Zaykin and colleagues [Zaykin, et al. 2002] developed the

truncated product method, which combines only those p-values that fall below a chosen threshold. They proposed using the product $W$ of all the p-values smaller than a certain value $\tau$:

$$W = \prod_{i=1}^{L} p_i^{I(p_i \leq \tau)}$$

where $I(\cdot)$ is the indicator function. Assuming all tests are independent, they have derived the distribution of $W$ under the null hypothesis as:

$$\Pr(W \leq w) = \sum_{k=1}^{L}\Pr(W \leq w \mid k)\Pr(k) = \sum_{k=1}^{L}\binom{L}{k}(1 - \tau)^{L-k}\left(w\sum_{s=0}^{k-1}\frac{(k\ln\tau - \ln w)^s}{s!}\,I(w \leq \tau^k) + \tau^k I(w > \tau^k)\right)$$
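The closed-form null distribution of the truncated product can be evaluated directly for independent tests. The following is a minimal sketch of that formula, not the C++ program distributed by Zaykin et al.; the function names are illustrative, and the inner series is accumulated term by term to avoid large factorials.

```python
from math import comb, log

def tpm_statistic(pvals, tau):
    """W: product of the p-values that are <= tau (1.0 if none qualify)."""
    w = 1.0
    for p in pvals:
        if p <= tau:
            w *= p
    return w

def tpm_pvalue(w, n_tests, tau):
    """Pr(W <= w) under the null hypothesis for n_tests independent tests."""
    if w <= 0.0:
        return 0.0
    if w >= 1.0:
        return 1.0
    total = 0.0
    for k in range(1, n_tests + 1):
        # Binomial weight without the tau**k factor, which is folded into `inner`.
        weight = comb(n_tests, k) * (1.0 - tau) ** (n_tests - k)
        if w <= tau ** k:
            a = k * log(tau) - log(w)
            term, series = 1.0, 1.0
            for s in range(1, k):
                term *= a / s          # a**s / s!
                series += term
            inner = w * series
        else:
            inner = tau ** k
        total += weight * inner
    return total
```

With `tau = 1.0` every p-value enters the product and the result coincides with Fisher's combination test, which offers a convenient check: for two tests, `tpm_pvalue(w, 2, 1.0)` equals `w * (1 - log(w))`.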

To consider the problem within the context of correlated tests, they explored several solutions. The first one used a transformation of the correlated p-values. Consider $\Sigma$ as the covariance matrix of a vector of p-values, $R$. There exists a matrix $C$ such that $\Sigma = CC^T$ if $\Sigma$ is positive definite. Let $Z = \Phi^{-1}(1 - R^*)$ where $R^*$ is a random vector from independent Uniform [0, 1] distributions, then $\mathrm{Var}(CZ) = C\,\mathrm{Var}(Z)\,C^T = CC^T = \Sigma$ and thus $CZ$ is distributed as a multivariate normal distribution with mean zero and covariance matrix $\Sigma$. Therefore, the correlated p-value vector $R$ can be transformed to $R^* = 1 - \Phi\{C^{-1}\Phi^{-1}(1 - R)\}$ and then the analysis will follow in the same manner as for uncorrelated tests.
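The decorrelating transformation can be written out step by step: map the p-values to normal scores, multiply by the inverse Cholesky factor, and map back. The sketch below uses only the standard library; `cholesky` and `decorrelate` are names chosen for this example, the matrix is assumed to be positive definite, and the p-values are assumed to lie strictly between 0 and 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def cholesky(sigma):
    """Lower-triangular C with sigma = C C^T (sigma must be positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = (sigma[i][i] - s) ** 0.5
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c

def decorrelate(pvals, sigma):
    """Return R* = 1 - Phi(C^-1 Phi^-1(1 - R)) for the p-value vector R."""
    c = cholesky(sigma)
    z = [_STD_NORMAL.inv_cdf(1.0 - p) for p in pvals]
    u = []
    for i in range(len(z)):
        # Forward substitution solves C u = z, i.e. u = C^-1 z.
        u.append((z[i] - sum(c[i][k] * u[k] for k in range(i))) / c[i][i])
    return [1.0 - _STD_NORMAL.cdf(v) for v in u]
```

Because the Cholesky factor is triangular, the result depends on the ordering of the tests: the first p-value is left unchanged and each later one is adjusted for those before it.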

The second solution is to use a Monte Carlo algorithm to calculate the combined p-values. Both methods are based on the assumption that the correlation structure of the p-values is known. In practice, the correlation structure is unknown in many cases and has to be estimated from data. The third solution can deal with this situation by estimating the empirical distribution of W by permutation. The p-values are calculated for each permutation and the truncated product method is applied. The combined p-value is then estimated by the proportion of permutations that yield a truncated product smaller than that observed in the original data.

Zaykin et al. have applied the truncated product method to both simulation studies and real data analysis and shown that the method has adequate power and is useful in increasing the overall power.

Complementary to the truncated product method, Dudbridge and Koeleman [Dudbridge and Koeleman 2003] proposed a rank truncated method, which combines ideas from Hoh [Hoh, et al. 2001] and Zaykin [Zaykin, et al. 2002]. They have proposed a rank truncated product as $W_R = \prod_{i=1}^{K} p_{(i)}$, where $p_{(i)}$ is the i-th order statistic of the $p_i$ and $K$ is a fixed integer ($1 \leq K < L$). Then this statistic is combined with the truncated product statistic $W_T = \prod_{i=1}^{L} p_i^{I(p_i \leq \tau)}$, forming a new statistic

$$W_{RT} = \prod_{i=1}^{K} p_{(i)}^{I(p_{(i)} \leq \tau)} = \max(W_R, W_T)$$

This is the product of at most $K$ p-values that are smaller than a given $\tau$. One advantage of $W_{RT}$ is that the truncation threshold can be chosen according to the hypothesized disease model and is less dependent on the sample size, which should provide some benefit in GWA studies.
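The three product statistics in this paragraph differ only in which p-values enter the product, which a short sketch makes concrete. The function names are illustrative; `math.prod` of an empty selection returns 1, matching the convention that a statistic with no qualifying p-values equals 1.

```python
from math import prod

def rank_truncated_product(pvals, k):
    """W_R: product of the k smallest p-values."""
    return prod(sorted(pvals)[:k])

def truncated_product(pvals, tau):
    """W_T: product of all p-values that are <= tau."""
    return prod(p for p in pvals if p <= tau)

def rank_and_threshold_product(pvals, k, tau):
    """W_RT: product of those among the k smallest p-values that are <= tau."""
    return prod(p for p in sorted(pvals)[:k] if p <= tau)
```

When at least `k` p-values fall below `tau`, the combined statistic equals the rank truncated product; otherwise it equals the truncated product. In both cases it is the larger of the two, which is the identity stated above.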


Chapter 2 Detecting Association with Rare Variants for Common

Diseases Using Haplotype-Based Methods

2.1 Introduction

Genome-wide association (GWA) studies have detected many common susceptibility genetic variants responsible for common diseases, under the common disease common variant (CDCV) hypothesis. However, these common variants have modest effects, with odds ratios between 1.2 and 1.5, and account for only a limited fraction of the observed familial aggregation. In many cases, follow-up association studies have failed to identify any associated SNPs in regions identified and confirmed by previous family-based linkage studies, although it has been argued that linkage analysis is less powerful than association analysis for identifying complex-disease genes. One emerging explanation for this deficit in follow-up association studies is the common disease multiple rare variants (CDMRV) hypothesis, which suggests that the missing heritability for common diseases can be attributed to rare genetic variants with intermediate penetrance effects (Figure 2-1). For finding Mendelian disease genes, linkage mapping has been shown to be a powerful tool to identify rare variants with large effects. Most genetic variants identified by GWA studies are common variants (minor allele frequency > 5%) with low to modest penetrance. Many rare variants with intermediate penetrance effects are hard to detect by traditional linkage approaches because of low penetrance of the risk homozygotes. At the same time, it is difficult to detect those variants by GWA studies because of the low allele frequency. However, new technologies like resequencing may provide a potential tool to identify those variants and the role they are playing in complex traits.


Figure 2-1 Low-frequency variants and disease susceptibility: variants with higher penetrance but lower frequency cannot be detected by current GWA approaches [McCarthy, et al. 2008].

Results from recent resequencing studies have identified many rare variants for several common diseases, including breast cancer, hypertension and low plasma HDL cholesterol level phenotypes. Because only common variants are ascertained and genotyped under current GWA study designs, it is difficult to detect rare variants by

current LD based association, which relies heavily on the CDCV hypothesis. Genotyping

arrays are designed with a tendency to include the high-frequency variants to capture a

large proportion of genetic variation. Thus developing powerful statistical methods for

testing association of rare variants and common diseases is very important for us to

explore the missing heritability of diseases.

Many current methods for detecting association of rare variants are built on the idea of grouping multiple variants, which is a natural choice when multiple rare variants are expected to contribute to disease risk. One method using a grouping approach is the Combined Multivariate and Collapsing (CMC) method [Li and Leal 2008]. In this method all rare variants in a gene are collapsed into one group and are analyzed together with other un-collapsed common variants using multivariate analysis (e.g. Hotelling’s T² test). Another method is the Weighted-Sum Method, which groups variants in one gene and assigns each individual a genetic score by a weighted sum of the variant counts. The excess of variants in affected individuals is then tested [Madsen and Browning 2009].

Under the hypothesis that multiple variants contribute to disease risk, those methods were shown to have statistical power to detect rare variants when applied to sequence data. However, under current GWA study designs, rare variants are mostly un-genotyped, and the methods described above lose much of their power when the genotype information of the rare variants is not available. To improve power, we construct statistical tests that cluster the information on multiple rare risk haplotypes identified from individual haplotype tests. Two haplotype-based methods have been developed: the Haplotype-based Truncated Product Method (HTPM) and the combined method.


2.2 The Haplotype-based Truncated Product Method (HTPM)

In a candidate gene or a genomic region, assume there are $m$ different haplotypes $h_1, h_2, \ldots, h_m$ with corresponding haplotype frequencies $\boldsymbol{p} = (p_1, p_2, \ldots, p_m)'$ in the disease population and $\boldsymbol{q} = (q_1, q_2, \ldots, q_m)'$ in the control population, with $\sum_{i=1}^{m} p_i = 1$ and $\sum_{i=1}^{m} q_i = 1$. Since we are detecting only rare variants that increase disease risk, we are interested in testing the one-sided hypothesis:

$$H_0: p_i - q_i = 0 \quad \text{against} \quad H_a: p_i - q_i > 0 \quad \text{(for a subset of haplotypes)}$$

Let a sample of $N_1$ cases and $N_2$ controls be taken. Suppose the observed counts of haplotypes in the case and control samples are $\boldsymbol{X} = (x_1, x_2, \ldots, x_m)'$ and $\boldsymbol{Y} = (y_1, y_2, \ldots, y_m)'$, respectively.

Then $\boldsymbol{X} \sim \text{multinomial}(2N_1, \boldsymbol{p})$ and $\boldsymbol{Y} \sim \text{multinomial}(2N_2, \boldsymbol{q})$. The maximum likelihood estimates (MLE) of $\boldsymbol{p}$ and $\boldsymbol{q}$ are simply the vectors of observed proportions, $\hat{\boldsymbol{p}} = (x_1/2N_1, x_2/2N_1, \ldots, x_m/2N_1)'$ and $\hat{\boldsymbol{q}} = (y_1/2N_2, y_2/2N_2, \ldots, y_m/2N_2)'$. It is well known that $E(\hat{\boldsymbol{p}}) = \boldsymbol{p}$ and $\mathrm{Cov}(\hat{\boldsymbol{p}}) = \Sigma_p/2N_1$, where $\Sigma_p = (\sigma_{ij}^2)$ is an $m$ by $m$ matrix with elements:

$$\sigma_{ij}^2 = \begin{cases} -p_i p_j & \text{if } i \neq j \\ p_i(1 - p_i) & \text{if } i = j \end{cases}$$


Similarly, $E(\hat{\boldsymbol{q}}) = \boldsymbol{q}$ and $\mathrm{Cov}(\hat{\boldsymbol{q}}) = \Sigma_q/2N_2$. If $N_1$ and $N_2$ are large, $\hat{\boldsymbol{p}} - \hat{\boldsymbol{q}}$ has an approximate multivariate normal distribution with mean $\boldsymbol{p} - \boldsymbol{q}$, and covariance matrix $\Sigma_p/2N_1 + \Sigma_q/2N_2$.

To test the above hypothesis, we compare the frequency of each haplotype between cases and controls by constructing the statistic:

$$T_i = \frac{\hat{p}_i - \hat{q}_i}{\sigma_{\hat{p}_i - \hat{q}_i}}$$

where $\sigma^2_{\hat{p}_i - \hat{q}_i} = \frac{p_i(1 - p_i)}{2N_1} + \frac{q_i(1 - q_i)}{2N_2}$ and can be estimated by $\frac{\hat{p}_i(1 - \hat{p}_i)}{2N_1} + \frac{\hat{q}_i(1 - \hat{q}_i)}{2N_2}$.

Under the null hypothesis, each $T_i$ has an approximate standard normal distribution. $\boldsymbol{T} = (T_1, T_2, \ldots, T_m)'$ jointly follows a multivariate normal distribution with mean $\boldsymbol{0}$ and covariance matrix $\boldsymbol{\Sigma}$, where $\boldsymbol{\Sigma} = (\sigma_{ij}^{*2})$ is an $m$ by $m$ matrix with elements:

$$\sigma_{ij}^{*2} = \begin{cases} -\dfrac{\frac{p_i p_j}{2N_1} + \frac{q_i q_j}{2N_2}}{\sigma_{\hat{p}_i - \hat{q}_i}\,\sigma_{\hat{p}_j - \hat{q}_j}} & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$

$\boldsymbol{\Sigma}$ can be estimated by substituting $\hat{p}_i$ and $\hat{q}_i$ for $p_i$ and $q_i$, respectively.

The truncated product method (TPM) has been applied to combine p-values from individual haplotype tests. The method was originally developed for combining independent p-values. However, as described in the previous section, the test statistics $\boldsymbol{T}$ are correlated, with a covariance matrix $\boldsymbol{\Sigma}$. To deal with non-independent tests, the correlated p-values need to be transformed to uncorrelated p-values as discussed by Zaykin et al. [Zaykin, et al. 2002].
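The per-haplotype statistics and their estimated covariance matrix follow directly from the haplotype counts. The sketch below is one way to compute them under the large-sample approximation described above; the function name and return layout are assumptions for illustration, and every haplotype is assumed to have a non-zero estimated variance (that is, it is observed in at least one group and does not make up the whole sample).

```python
from math import sqrt
from statistics import NormalDist

def haplotype_z_tests(x, y):
    """Per-haplotype statistics T_i, one-sided p-values, and estimated Sigma.

    x, y: haplotype counts in cases and controls; sum(x) = 2*N1, sum(y) = 2*N2.
    """
    n1, n2 = sum(x), sum(y)
    p = [c / n1 for c in x]
    q = [c / n2 for c in y]
    sd = [sqrt(pi * (1 - pi) / n1 + qi * (1 - qi) / n2) for pi, qi in zip(p, q)]
    t = [(pi - qi) / s for pi, qi, s in zip(p, q, sd)]
    # One-sided test: only an excess in cases counts as evidence.
    pvals = [1.0 - NormalDist().cdf(ti) for ti in t]
    m = len(x)
    sigma = [[1.0 if i == j
              else -(p[i] * p[j] / n1 + q[i] * q[j] / n2) / (sd[i] * sd[j])
              for j in range(m)] for i in range(m)]
    return t, pvals, sigma
```

With only two haplotypes the two statistics are exact negatives of each other and their estimated correlation is -1, which illustrates why the tests cannot be treated as independent.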

First, we need to recall the fact that the correlation is approximately invariant under monotone transformations:

$$\mathrm{Corr}\left(g(X_i), g(X_j)\right) \approx \frac{[g'(\mu)]^2\,\mathrm{Cov}(X_i, X_j)}{\sqrt{[g'(\mu)]^4\,\mathrm{Var}(X_i)\,\mathrm{Var}(X_j)}} = \mathrm{Corr}(X_i, X_j)$$

With this invariance property, the correlation matrix of the vector of p-values $\boldsymbol{R}$ is approximately the same as $\boldsymbol{\Sigma}$, because $\boldsymbol{R} = 1 - \Phi(\boldsymbol{T})$. Since $\boldsymbol{\Sigma}$ is positive definite, we can find the Cholesky factor matrix $\boldsymbol{C}$ of $\boldsymbol{\Sigma}$ ($\boldsymbol{\Sigma} = \boldsymbol{C}\boldsymbol{C}^T$). Then the correlated p-value vector $\boldsymbol{R}$ can be transformed to an uncorrelated vector of p-values, $\boldsymbol{R}^* = (p_1^*, p_2^*, \ldots, p_m^*)$, as follows:

$$\boldsymbol{R}^* = 1 - \Phi\{\boldsymbol{C}^{-1}\Phi^{-1}(1 - \boldsymbol{R})\}$$

We take the product of all the transformed p-values smaller than a fixed value $\tau$:

$$W = \prod_{i=1}^{m} (p_i^*)^{I(p_i^* \leq \tau)}$$

where $I(\cdot)$ is the indicator function. Under the null hypothesis, the distribution of $W$ for $w < 1$ can be evaluated by conditioning on $k$, the number of the p-values that are less than $\tau$:

$$\Pr(W \leq w) = \sum_{k=1}^{m}\Pr(W \leq w \mid k)\Pr(k) = \sum_{k=1}^{m}\binom{m}{k}(1 - \tau)^{m-k}\left(w\sum_{s=0}^{k-1}\frac{(k\ln\tau - \ln w)^s}{s!}\,I(w \leq \tau^k) + \tau^k I(w > \tau^k)\right)$$

C++ code for calculating the TPM p-value is provided to the public by Zaykin et al. and can be applied in our study.

Another approach to deal with non-independent tests is to estimate the empirical distribution of the HTPM statistic W by permutation tests. The disease status is permuted among the individuals while the genotypes are unchanged. The HTPM method is applied and repeated k times. For each permutation, the truncated product of p-values is calculated under the null hypothesis. From the approximated null permutation distribution of W, the empirical combined p-value can be calculated as the proportion of the k permutations yielding a statistic smaller than what was observed in the original sample.


2.3 The Combined Method

Similar to the two-stage methods, the combined method involves first co-classifying rare risk haplotypes and then detecting association by comparing the frequency of classified haplotypes between cases and controls. The difference is that the combined method uses the same sample for both co-classification and association testing.

The rare risk haplotypes are co-classified by defining the rare risk haplotype set as

$$S = \{h_i \mid \text{Fisher's exact test p-value of } h_i < \mu\}$$

where $\mu$ is a predefined value. The frequency of the co-classified risk haplotype set $S$ is then compared between cases and controls by Fisher’s exact test. Since the same sample is used in co-classifying haplotypes and testing, the empirical p-value is estimated by permutation tests. The disease status is permuted among the individuals while the genotypes are unchanged. The combined method is applied and repeated k times. For each permutation, Fisher’s exact test p-value is calculated under the null hypothesis. The empirical combined p-value can then be calculated as the proportion of the k permutations yielding a statistic smaller than what was observed in the original sample.


Chapter 3 Power Analysis of Haplotype-Based Methods to Detecting

Rare Variants and an Application Example using the WTCCC Data

3.1 Simulation Studies

The haplotype frequencies in the ACE gene were obtained from a previous hypertension study in African samples [Zhu, et al. 2001]. There were 13 polymorphisms genotyped in this gene, resulting in a total of 149 different haplotypes with frequencies ≥ 0.01%. A total of 8 rare haplotypes, with frequencies in the range 1.0%-1.5% and with a cumulative risk haplotype frequency of 10%, were assigned to be risk haplotypes with the assumption that their effect on the phenotype is the same, i.e. the penetrance depends only on how many risk haplotypes an individual carries. An individual’s genotype was simulated by randomly drawing two haplotypes according to the haplotype frequencies, and disease status was simulated based on three modes of inheritance: dominant, recessive and multiplicative. There were 1900 cases and 3000 controls simulated, and the total sample size was approximately equivalent to that of the WTCCC study.

To assess the type I error, a null model was simulated in which no risk haplotype

was assigned. To assess the power, we assigned 8 rare haplotypes as risk haplotypes, as

described above. The type I error and power were calculated as the proportion of 1000

simulations that resulted in rejection (Figure 3-1).
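The data-generating step of the simulation can be sketched as follows. This is a simplified illustration, not the simulation program used here: the function names are hypothetical, the baseline penetrance `f0` is an assumed input that the text does not specify, and individuals are generated until the requested numbers of cases and controls are reached instead of sampling from a fixed pool of subjects.

```python
import random

def penetrance(n_risk, f0, lam, model):
    """Disease probability given the number of risk haplotypes carried (0, 1 or 2)."""
    if model == "dominant":
        rr = lam if n_risk >= 1 else 1.0
    elif model == "recessive":
        rr = lam if n_risk == 2 else 1.0
    elif model == "multiplicative":
        rr = lam ** (n_risk / 2)  # sqrt(lam) for one copy, lam for two
    else:
        raise ValueError("unknown model: " + model)
    return f0 * rr

def simulate_sample(haps, freqs, risk, f0, lam, model, n_cases, n_controls, rng):
    """Draw two haplotypes per individual, assign disease status, fill both groups."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        h1, h2 = rng.choices(haps, weights=freqs, k=2)
        n_risk = (h1 in risk) + (h2 in risk)
        affected = rng.random() < penetrance(n_risk, f0, lam, model)
        if affected and len(cases) < n_cases:
            cases.append((h1, h2))
        elif not affected and len(controls) < n_controls:
            controls.append((h1, h2))
    return cases, controls
```

Under the null model the `risk` set is simply left empty, so every individual has the baseline penetrance and the same function generates data for the type I error evaluation.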


Figure 3-1 Flow chart of simulation studies: This flow chart outlines the simulation algorithm. n represents the nth replication. Step 1: An individual’s genotype is simulated by randomly drawing two haplotypes according to the haplotype frequencies. Step 2: Disease status is simulated based on the penetrance functions, depending on how many risk haplotypes an individual is carrying. A total of 30,000 subjects are generated and 1900 cases and 3000 controls are sampled. Step 3: Apply the test procedures and calculate the corresponding p-values for the test statistics. The null hypothesis is that none of the haplotypes differs in frequency between cases and controls. Step 4: Steps 1 to 3 are repeated 1000 times independently.


Since our goal is to detect rare genetic variants, we rely on haplotypes consisting of common SNPs to capture the rare variants, and we expect rare haplotypes to capture rare variants better than common haplotypes do. We thus apply Fisher’s exact test to compare the frequency of each individual haplotype between cases and controls. The individual haplotype tests are correlated, but the covariance matrix structure of the p-values resulting from the individual tests is unknown. One option is to estimate the covariance matrix from the data and then apply the truncated product method as described in chapter 1. However, the results obtained with the estimated correlation structure depend heavily on the asymptotic distribution, which can be inappropriate for testing rare haplotypes. When the covariance matrix of p-values is unknown, it is difficult to derive the distribution of the truncated product W. In this study, the empirical distribution of W is estimated by permutation tests.

Under the null hypothesis, there is no association between the disease and the rare variants, and the observed set of genotypes is just one of $(N_1 + N_2)!/(N_1!\,N_2!)$ possible ways of ordering the genotypes among the disease status, where $N_1$ and $N_2$ are the numbers of cases and controls, respectively. The permutation tests are thus implemented in the following way:

We keep the individuals’ genotypes and randomly permute the disease status among all individuals. For each permutation, we calculate W, the truncated product of p-values. We repeat this procedure many times to obtain the null distribution of W. In practice, the number of all possible permutations may be impractically large, so an approximate permutation test using a large number of randomly selected permutations is used (for example, 1000 permutations for an overall significance level of 0.05). From the approximated null permutation distribution of W, the empirical combined p-value is calculated as the proportion of the permutations yielding a statistic smaller than what was observed in the original sample.
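Two small calculations put the approximate permutation test in context: how many label arrangements exist, and how precisely a given number of random permutations estimates a p-value. The helper names below are illustrative, and the standard error uses the usual binomial approximation for a proportion estimated from independent random permutations.

```python
from math import comb, sqrt

def n_label_arrangements(n1, n2):
    """Number of ways to assign n1 case labels among n1 + n2 individuals."""
    return comb(n1 + n2, n1)

def permutation_se(p, n_perm):
    """Approximate standard error of an empirical p-value near p from n_perm permutations."""
    return sqrt(p * (1.0 - p) / n_perm)

# Sample size of the simulation study: 1900 cases and 3000 controls.
digits = len(str(n_label_arrangements(1900, 3000)))
se_at_005 = permutation_se(0.05, 1000)
```

For 1900 cases and 3000 controls the count of arrangements has well over a thousand digits, so full enumeration is out of the question, while 1000 random permutations estimate a p-value near 0.05 with a standard error of roughly 0.007.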

The above discussion is based on the situation where the haplotype phase is known, which is highly unlikely in practice. When haplotype phase is unknown, it has to be inferred. To assess the performance of haplotype-based methods in this situation, the haplotype phase is inferred for the simulated data as if only genotype data were available, using Beagle 3.1 [Browning and Browning 2007]. Beagle is an efficient software package for haplotype phase and missing data inference. It employs a haplotype-frequency model that can be applied to large-scale studies with millions of markers and thousands of samples. The speed and accuracy of Beagle have been compared against other haplotype-inference methods using simulated and real data sets. The methods that have been compared against Beagle include fastPHASE [Scheet and Stephens 2006], HAP [Halperin and Eskin 2004] and 2SNP [Brinza and Zelikovsky 2006]. In that comparison, Beagle performed best in phase inference of large data sets [Browning and Browning 2007].


3.2 Type I and Power Evaluation of the HTPM and the combined method

A null model where there was no risk haplotype was simulated to assess the type I error. We evaluated the type I error rates for the HTPM and the combined method and compared them to those of the two-stage methods proposed by Zhu et al. (2009) at significance levels of 0.05 and 0.01, respectively (Table 3-1). As discussed in chapter 1, there are two designs for two-stage methods, depending on whether the co-classification of rare haplotypes is determined in unrelated cases or affected sibpairs. The type I error is well controlled for HTPM. When the haplotype phase is known, the 95% confidence interval of type I error is (0.0347, 0.0608) at a significance level of 0.05 and (0.0047, 0.017) at a significance level of 0.01. The observed type I error rates for both the HTPM and the combined method fall within the 95% confidence region. When the haplotype phase is unknown, the type I error of HTPM still falls in the 95% confidence interval, and so does that of the combined method. Two-stage methods show reasonable type I errors as well, because the significance levels of 0.05 and 0.01 are within the 95% confidence intervals of (0.034, 0.072) and (0.0091, 0.0246), respectively.

Table 3-1 Type I error rate for simulation data analyzed as haplotype phase known and phase unknown

                Significance             Combined    Two-stage Method
                Level           HTPM     Method      Affected Sibpairs    Unrelated Cases
Phase Known     0.05            0.048    0.047       0.046                0.053
                0.01            0.009    0.011       0.009                0.012
Phase Unknown   0.05            0.056    0.046       0.048                0.046
                0.01            0.015    0.012       0.009                0.009


The power of the HTPM and the combined method is compared with the power of the two-stage methods, under different modes of inheritance and genotypic relative risks (Figure 3-2). Single SNP analysis is also performed, by comparing the allele frequency between cases and controls. The minimum p-value for testing the set of markers was corrected by the estimated effective number of tests or the number of independent tests.

For the two-stage methods, two designs were used in the co-classification stage: affected sibpairs and unrelated cases. According to the power analysis comparing the different sample sizes used in the first stage (co-classification stage) and second stage (testing stage), designs with 800 cases or 400 affected sibpairs in the first stage have the best power [Zhu, et al.]. Therefore, 800 cases and 400 affected sibpairs are used in the first stage of the two-stage method in this simulation study.

Generally, single-SNP analysis has virtually no power under any mode of inheritance. Under the dominant model, the power of all methods except single-SNP analysis increases sharply as the genotypic relative risk rises from 1.2 to 2, and approaches 1 when the genotypic relative risk rises above 2. The power of the HTPM and the combined method is greater than that of the two-stage methods. The multiplicative model shows a similar pattern, only with a slower increase in power as the genotypic relative risk rises. The recessive model generally shows little power except when the genotypic relative risk is as large as 3.

Figure 3-2 shows the results when the haplotype phase is known. However, phase information is difficult to acquire most of the time. Therefore, the situation of unknown haplotype phase is also considered, with phase inferred by Beagle. The power of the HTPM and the combined method is then compared against that of the two-stage method (Figure 3-3). The power is slightly compromised when the haplotype phase is inferred, compared to the situation where the haplotype phase is known. However, the HTPM and the combined method still show reasonable power when the genotypic relative risk is above 1.5 for the dominant model and above 2 for the multiplicative model. Overall, the power of the HTPM and the combined method is greater than that of the two-stage methods.

Under the multiplicative model, the two-stage method has significantly lower power than the HTPM and the combined method when the co-classification of rare haplotypes is performed with unrelated cases. Co-classifying rare haplotypes in affected sibpairs still has reasonable power under both the dominant and the multiplicative model.

When the haplotype phase is known, the power of the combined method is slightly greater than the power of the HTPM. When the haplotype phase is unknown and inferred, the combined method shows significantly better power than the HTPM, especially under the multiplicative model.

Figure 3-2 Power comparison of single-SNP analysis, two-stage method, HTPM and combined method, when haplotype phase is known (panels a, b and c). Power is plotted against genotypic relative risk at 1, 1.2, 1.5, 2, 2.5, and 3. Penetrance is simulated as 10%.

Figure 3-3 Power comparison of single-SNP analysis, two-stage method, HTPM and combined method, when haplotype phase is unknown (panels a, b and c). Power is plotted against genotypic relative risk at 1, 1.2, 1.5, 2, 2.5, and 3. Penetrance is simulated as 10%.


3.3 Comparison of Haplotype-Based Methods with CMC method

The CMC method was originally designed for the analysis of sequence data. To perform a fair comparison of the haplotype-based methods and the CMC method in terms of efficacy and power, the HapMap ENCODE resequencing data [2004] are used in the simulation studies.

The HapMap ENCODE resequencing project was dedicated to providing dense genotype data that will yield a comprehensive catalogue of human genomic components. Ten genomic regions of 500 kilobases each were resequenced in 48 unrelated individuals (16 Yoruba, 8 Japanese, 8 Han Chinese, and 16 CEPH). All newly identified SNPs and SNPs that previously existed in dbSNP were genotyped in the 269 HapMap DNA samples (90 Yoruba, 44 Japanese, 45 Han Chinese, and 90 CEPH). Table 3-2 summarizes the SNP and haplotype information of the ENCODE data.

Region ENm010 has the shortest haplotype length and is therefore used in the simulation study to compare the haplotype-based methods and the CMC method. Haplotypes of the individuals were inferred by Beagle 3.1. A total of 55 rare haplotypes, each with frequency < 1% and together having a cumulative risk haplotype frequency of 10%, were chosen from the 529 haplotypes as risk haplotypes, under the assumption that their effects on the phenotype are the same. Simulation strategies similar to those described in Section 3.1 were applied. There were 1900 cases and 3000 controls simulated, and 1000 simulations were used to compare type I error and power.


Table 3-2 ENCODE region SNP and haplotype information. Abbreviations: Chr.-Chromosome; hap.- haplotype; ind.-individual.

CEU, CEPH HapMap; CHB, Chinese HapMap; JPT, Japanese HapMap; YRI, Yoruba HapMap.

Region name      Chr. band   Genomic interval (NCBI B36)    No. genotyped SNPs            Length     No. SNPs homozygous   No. other   No.    No.    No. unique
                                                            CEU      JPT+CHB   YRI        of hap.    in entire sample      SNPs        ind.   hap.   hap.
ENr112           2p16.3      Chr2:51512208-52012208         2,601    2,573     2,608      1435       133                   1302        269    538    509
ENr131           2q37.1      Chr2:234156563-234656627       2,214    2,107     2,129      1370       158                   1212        269    538    536
ENr113           4q26        Chr4:118466103-118966103       2,538    2,401     2,405      2181       493                   1688        269    538    513
ENm010           7p15.2      Chr7:26924045-27424045         1,830    1,787     1,742      1121       313                   808         269    538    529
ENm013(500Kb)    7q21.13     Chr7:89621624-90121624         1,770    1,678     1,680      1785       494                   1291        269    538    453
ENm014(500Kb)    7q31.33     Chr7:126368183-126865324       3,343    3,239     3,232      1968       479                   1489        269    538    513
ENr321           8q24.11     Chr8:118882220-119382220       2,128    2,100     2,092      1690       389                   1301        269    538    512
ENr232           9q34.11     Chr9:130725122-131225122       1,909    1,828     1,808      1236       285                   951         269    538    537
ENr123           12q12       Chr12:38626477-39126476        2,189    2,181     2,035      1610       200                   1410        269    538    501
ENr213           18q12.1     Chr18:23719231-24219231        1,990    1,969     1,966      1505       287                   1218        269    538    519
Total                                                       22,512   21,863    21,697


For the CMC method, rare variants with minor allele frequencies (MAF) ≤ 1% in control samples were collapsed. An indicator variable is defined for the jth case as

$$X_j = \begin{cases} 1 & \text{rare variants present} \\ 0 & \text{otherwise} \end{cases}$$

and $Y_j$ is defined in a similar way for controls. Common variants with minor allele frequency (MAF) > 1% were not collapsed. Assume that in one gene or genomic region M variants are collapsed into K groups; a multivariate test such as Hotelling’s T² test is then applied to the K groups of genotype data. The type I error and power were calculated as the proportion of 1000 simulations that resulted in rejection.
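The collapsing step described above can be sketched in a few lines. This is a minimal illustration that assumes all rare variants form a single collapsed group and that each common variant is kept as its own variable; the function name and the data layout (one list of minor-allele counts per individual) are hypothetical, and the subsequent multivariate test is not shown.

```python
def collapse_rare_variants(genotypes, control_maf, maf_threshold=0.01):
    """Collapse rare variants into one 0/1 indicator per individual.

    genotypes:   list of per-individual lists of minor-allele counts (0/1/2)
    control_maf: minor allele frequency of each variant in controls
    Returns, per individual, [indicator] followed by the common-variant
    genotypes (assumption: a single collapsed group of rare variants).
    """
    rare = [i for i, maf in enumerate(control_maf) if maf <= maf_threshold]
    common = [i for i, maf in enumerate(control_maf) if maf > maf_threshold]
    collapsed = []
    for g in genotypes:
        # Indicator is 1 if the individual carries any rare minor allele.
        indicator = 1 if any(g[i] > 0 for i in rare) else 0
        collapsed.append([indicator] + [g[i] for i in common])
    return collapsed
```

With this layout, the number of variables passed to the multivariate test is the number of common variants plus one.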

The CMC method is based on multivariate tests of collapsed groups. However, only rare variants are collapsed, while each common variant is still included as its own group in the multivariate analysis. In this particular simulation study, the total number of variants is 808 within a 500 kb region, with the number of rare variants varying from 180 to 220 across simulation replicates. Therefore, the number of variables involved in the multivariate analysis can be as large as 600, with many of them highly correlated. A large number of degrees of freedom for the Hotelling’s T² test statistic can thus be expected. In an initial analysis including all the common variants, the type I error of the CMC method shows an unreasonable value of 0. The explanation for this abnormal type I error lies in the large number of variables included in the multivariate analysis, which may affect the validity of the assumed null distribution of the test statistic. A similar CMC test was then applied with only 1 common variant and 1 group of collapsed rare variants included. The results showed a reasonable type I error of 0.048 at a significance level of 0.05. This evidence suggests that the CMC method fails to function when a large number of non-causal high-frequency variants are included in the analysis.

To make a fair comparison with the haplotype-based methods, a CMC test including 30 randomly chosen common variants was performed. Table 3-3 compares the type I error of the haplotype-based methods against that of the CMC method. The two haplotype-based methods, the HTPM and the combined method, have similarly well-controlled type I error. The type I error of the CMC method is now reasonable at both significance levels of 0.05 and 0.01.

Table 3-3 Type I error rate comparison of haplotype-based methods and CMC method using simulations based on ENCODE re-sequencing data.

Significance Level   CMC     HTPM    Combined Method
0.05                 0.043   0.057   0.057
0.01                 0.012   0.014   0.016

The power of the haplotype-based methods has also been compared with that of the CMC method (Figure 3-4). Under each genetic model, the HTPM and the combined method have very similar power. Both show reasonable power under the additive model. Lower power is observed under the multiplicative model than under the additive model, and the recessive model has virtually no power, as predicted.

Under the additive model, the power of the haplotype-based methods in the ENCODE-based simulation study is lower than in the previous simulation (shown in Figure 3-3) when the genotypic relative risk equals 1.5, but reaches over 90% when the genotypic relative risk rises to 2. The ENCODE-based simulation considered a much larger number of SNPs (808 SNPs) than the previous simulation (13 SNPs). The accuracy of haplotype inference may decrease when the number of SNPs increases greatly, which may be a reason for the observed reduction in power.

The power of the CMC method is consistently lower than that of the haplotype-based methods. The difference between the CMC method and the haplotype-based methods increases as the genotypic relative risk increases, under both the additive and the multiplicative model.


Figure 3-4 Power comparisons of the CMC, HTPM and combined method, using simulations based on ENCODE re-sequencing data.

3.4 Application of haplotype-based method to the WTCCC data

A genome-wide association study was performed by the WTCCC to explore the association of common variants with seven major diseases [Consortium 2007]. The datasets included seven major diseases: bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes and type 2 diabetes. The samples were recruited from the British population and include about 2000 cases for each disease and 3000 common controls. Genotypes were obtained with the Affymetrix GeneChip 500K Mapping Array Set. While many genetic variants have been detected and later confirmed to be associated with some of the seven diseases, only 1 variant was detected for coronary artery disease (CAD) and nothing reached genome-wide significance for hypertension (HT). The missing heritability for these two diseases may be attributable to multiple rare variants. To test this hypothesis, the two-stage method was applied to the WTCCC CAD and HT data [de Bakker, et al. 2006].

To search for rare genetic variants across the genome, a gene-based strategy is applied here. Each SNP genotyped in a GWA study was mapped to a gene. SNPs that fall between genes were grouped based on genomic regions. Haplotype phase was inferred using Beagle 3.1. A gene containing a large number of SNPs poses computational difficulty; such a gene was instead broken into blocks of 50 SNPs. The total number of tests for association is the total number of genes/blocks, and the Bonferroni-corrected genome-wide significance level was calculated accordingly and applied.
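The gene/block construction and the resulting Bonferroni threshold can be sketched as follows. This is an illustration under stated assumptions: a gene is split only when it holds more than 50 SNPs, blocks are consecutive and non-overlapping, and the block naming scheme and the family-wise level of 0.05 are hypothetical choices not taken from the text.

```python
def split_into_blocks(snp_ids, block_size=50):
    """Cut an ordered SNP list into consecutive blocks of at most block_size."""
    return [snp_ids[i:i + block_size] for i in range(0, len(snp_ids), block_size)]

def build_test_units(gene_to_snps, block_size=50):
    """Map each gene (or block of a large gene) to the SNPs tested together."""
    units = {}
    for gene, snps in gene_to_snps.items():
        if len(snps) <= block_size:
            units[gene] = list(snps)
        else:
            blocks = split_into_blocks(list(snps), block_size)
            for k, block in enumerate(blocks, start=1):
                units[f"{gene}_block{k}"] = block  # hypothetical naming
    return units

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level for n_tests gene/block tests."""
    return alpha / n_tests
```

For example, a gene with 120 SNPs contributes three test units (50, 50 and 20 SNPs), each counted separately in the Bonferroni correction.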


Using the two-stage methods, Zhu et al. observed 3 genes associated with CAD (EIF4H, HFE2 and CDKN2B) and 1 gene (ZFAT1) associated with HT at a genome-wide significance level (p-value ≤ 10⁻⁶). In addition, PSRC1 was identified as associated with CAD with a moderate signal (p-value ≤ 10⁻⁴). Those results support the hypothesis that multiple rare variants may contribute to the variation of hypertension and CAD. Therefore, the haplotype-based methods are now applied to the WTCCC CAD and HT data to confirm the results of the two-stage methods.

The Wellcome Trust Case Control Consortium (WTCCC) coronary artery disease and hypertension data were downloaded from the WTCCC website. The individuals dropped in the WTCCC study were also excluded from our analysis, resulting in 1952 hypertension (HT) cases, 1926 coronary artery disease (CAD) cases and 2838 controls. The same SNP exclusion criteria as in the WTCCC study were applied, except that all SNPs with minor allele frequencies < 1% were included in the study. The HTPM and the combined method were applied to the WTCCC HT and CAD datasets for the subset of genes identified by the two-stage method. One million permutations were used to compute the empirical p-values.

Table 3-4 summarizes the HTPM and combined method test results for the genes identified previously by the two-stage method in the WTCCC HT and CAD data. For the CAD data, the two-stage method identified HFE2 on chromosome 1, EIF4H on chromosome 7 and CDKN2B on chromosome 9, with p-values ≤ 10⁻⁶. Both the HTPM and the combined method replicated the findings for those three genes with p-values smaller than 10⁻⁶, except that the HTPM has a p-value of 3×10⁻⁵ for the CDKN2B gene. The two-stage method also identified PSRC1 on chromosome 1 with a moderate effect (p-value 1.60×10⁻⁵), which was also replicated by the HTPM and the combined method with p-values of 1.30×10⁻⁵ and 1.40×10⁻⁵, respectively.

For the HT data, the two-stage method identified ZFAT1 on chromosome 8 with p-value ≤ 10⁻⁶. Both the HTPM and the combined method replicated this finding with p-values smaller than 10⁻⁶.

Table 3-4 List of genes showing association to hypertension (HT) or coronary artery disease (CAD) in WTCCC data

Disease   chr   Gene     Range (MB)      Two-stage p-value   HTPM p-value   Combined Method p-value
CAD       1     HFE2     144.11-144.15   <1E-06              <1E-06         <1E-06
CAD       1     PSRC1    109.62-109.62   1.60E-05            1.30E-05       1.40E-05
CAD       7     EIF4H    73.20-73.25     <1E-06              <1E-06         <1E-06
CAD       9     CDKN2B   21.99-22.12     1.00E-06            3.00E-05       <1E-06
HT        8     ZFAT1    135.57-135.67   <1E-06              <1E-06         <1E-06


3.5 Discussion and Future Direction

Two haplotype-based methods are developed to test the association of rare variants with common diseases. Both the simulation studies and the application to the WTCCC data demonstrated that the haplotype-based methods have reasonable power to detect rare variants, with well-controlled type I error. Single-SNP analysis generally shows no power to detect rare variants.

The methods are based on haplotypes rather than genotypes because haplotypes can capture un-genotyped rare alleles. In an ideal situation, the haplotype phase is known. In practice, however, haplotype phase has to be inferred most of the time. As shown in the results, power decreases when phase is inferred by Beagle compared to the situation where phase is known, but the loss in power is not substantial.

The power of the HTPM and the combined method is superior to that of the two-stage method, whether the haplotype phase is known or unknown. The two-stage methods split the sample into two independent parts, one for the co-classification stage and one for the association testing stage. The power to detect association may therefore be reduced by the smaller sample size used in the association test, and with a fixed total sample size the optimal sample size for each stage needs to be determined. In contrast, the HTPM and the combined method share the advantage of increased power from using the entire sample in the association test.

The combined method has greater power than the HTPM when the haplotype phase is unknown and inferred using Beagle. For the same dataset, the HTPM and the combined method identify the same set of risk haplotypes. The difference between the two methods is that the HTPM uses the product of p-values, while the combined method combines the frequencies of the haplotypes and then conducts the test by comparing the combined frequencies between cases and controls. As shown in the simulation study based on the ENCODE data, which involves many more SNPs, the difference in power between the HTPM and the combined method is negligible.
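The contrast between the two statistics can be illustrated with a generic sketch. The exact definitions of the HTPM and the combined method are given in Chapter 2; what follows is only the general form of a truncated product of p-values and of a pooled haplotype frequency. The truncation point tau, the function names and the data layout are assumptions made for illustration.

```python
import math

def truncated_product(pvalues, tau=0.05):
    """Product of the p-values that are <= tau (1.0 if none qualify).

    Computed on the log scale to avoid underflow with many small p-values.
    """
    log_w = sum(math.log(p) for p in pvalues if 0.0 < p <= tau)
    return math.exp(log_w)

def pooled_risk_frequency(haplotype_counts, risk_haplotypes):
    """Combined frequency of a set of haplotypes within one group.

    haplotype_counts: dict mapping haplotype label -> count in the group
    risk_haplotypes:  labels of the haplotypes to pool
    """
    total = sum(haplotype_counts.values())
    risk = sum(haplotype_counts.get(h, 0) for h in risk_haplotypes)
    return risk / total
```

In the first form each haplotype contributes its own test result; in the second the selected haplotypes are merged before a single comparison between cases and controls is made.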

Here the haplotype-based methods are developed for the current GWA study design. However, the methods can be applied to sequence data as well. Long-range haplotype information provided by next-generation sequencing data will offer a significant advantage over SNP data in detecting rare variants. In the simulation study based on ENCODE re-sequencing data, the HTPM and the combined method both show greater power than the CMC method. That simulation study is based on a region of 808 variants, of which around 200 are rare variants with an allele frequency < 1% that are collapsed. The large number of remaining variants are not collapsed and thus contribute a large number of degrees of freedom. The power of the CMC method is compromised when many variants are not collapsed and are included individually in the multi-marker test. However, the power of the HTPM and the combined method is still reasonable with long-range haplotypes. Another advantage of the haplotype-based methods over the CMC method is that the CMC method cannot be applied to the current GWAS design, where rare variants are not genotyped.

A drawback of the HTPM and the combined method is that both are computationally intensive, since permutation tests are required to determine empirical p-values. Limited by computation speed, it is currently not practical to apply these haplotype-based methods on a genome-wide scale. However, they can still be used where candidate genes have been identified by the two-stage method. A possible solution to the computation time problem is to allow the number of permutations to change dynamically when the methods are applied to GWA study data, so that the number of permutations varies depending on how many rejections have been observed.
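One way such a dynamic scheme could work is a sequential stopping rule: permutations stop as soon as a fixed number of permuted statistics reach or exceed the observed one, so clearly non-significant genes use few permutations. The sketch below assumes this reading of the proposal and uses a Besag-Clifford style estimate; the function name, the stopping count and the callable interface are hypothetical.

```python
import random

def adaptive_permutation_pvalue(observed_stat, permuted_stat,
                                max_perms=1_000_000, stop_after=10, rng=None):
    """Empirical p-value with early stopping.

    permuted_stat: callable taking a random.Random instance and returning
                   the statistic for one random permutation (larger values
                   are more extreme).
    Returns (p_value, permutations_used).
    """
    rng = rng or random.Random(0)
    exceed = 0
    for n in range(1, max_perms + 1):
        if permuted_stat(rng) >= observed_stat:
            exceed += 1
            if exceed >= stop_after:
                # Enough exceedances: the gene is clearly not significant.
                return exceed / n, n
    # Permutation budget used up: standard add-one estimate.
    return (exceed + 1) / (max_perms + 1), max_perms
```

Under this rule only genes with very small p-values consume the full permutation budget, which is where the computational saving on a genome-wide scale would come from.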

Currently the haplotype-based methods are developed for a case-control design. However, the methods can easily be adapted to continuous traits. One idea is to apply an ANOVA test to each haplotype and then combine the individual p-values in a way similar to that described in previous sections. The individual haplotypes can be grouped into 3 groups (risk, protective, non-risk/non-protective) so as to minimize the within-group difference and maximize the between-group difference of the trait.


Chapter 4 Genome Wide Association Study of Blood Pressure: The Candidate Gene Association Resource (CARe) Study

4.1 Introduction

High blood pressure, or hypertension, affects more than one-third of the population worldwide and its prevalence is still increasing. Hypertension is a major cause of strokes, heart attacks and heart failure. The prevalence of hypertension in African Americans is among the highest in the world. In the United States, hypertension occurs earlier, more often, and with greater severity among people of African compared to European descent. By the end of the 1990’s, the prevalence of hypertension in African Americans had reached 41%, compared to 28% in European Americans [Hertz, et al. 2005]. Among African Americans, hypertension develops earlier in life and average blood pressures are significantly higher. The risk of suffering end-organ effects, including end-stage renal disease, coronary heart disease and stroke, is also greater [Collins and Winkleby 2002]. In 2004, the death rate from hypertension was three times greater in African Americans than in European Americans [Cohen, et al. 2006; Rosamond, et al. 2007]. In contrast, geographic disparities in hypertension awareness, treatment, and control were minimal [Howard, et al. 2006].

A portion of this hypertension burden among African Americans may be due to genetic susceptibility. Blood pressure is a moderately heritable trait and results from the combined effect of a complex set of genetic and environmental influences, with genes cumulatively accounting for 30-40% of the population variance [Ward 1995]. Genome-wide linkage analysis has been widely applied in efforts to identify genomic regions harboring genes affecting the risk of hypertension. A review of 20 genome scans suggested that a large number of genes, each exerting a small effect, is the most likely molecular architecture underlying hypertension [Samani 2003]. The observed effects are highly inconsistent, however; it is well recognized that linkage analysis has limited power when applied to complex traits such as hypertension, and locus heterogeneity may further contribute to the observed inconsistency [Risch 2000]. Admixture mapping analysis, which uses the information generated by recent admixture of genetically distinct populations, has also been applied to hypertension [Zhu and Cooper 2007; Zhu, et al. 2005b], and multiple genomic regions have been identified that harbor genetic variants contributing to hypertension. However, the actual genetic variants remain to be discovered. Two recent GWA studies of blood pressure involving more than 130,000 participants of primarily European descent have identified variants in 16 loci that are associated with systolic blood pressure (SBP), diastolic blood pressure (DBP), or hypertension [Levy, et al. 2009; Newton-Cheh, et al. 2009]. These variants are common in European populations but have small effects on blood pressure. However, they may not be causative variants themselves and therefore may not be associated with blood pressure phenotypes in populations of African descent. Prior reports note considerable differences in genetic association patterns for blood pressure and other traits across populations. These association differences may be due to differences in linkage disequilibrium patterns, causal pathways, or environmental exposures. Therefore, the relationship of genetic variants with blood pressure must be examined in different populations.

Adeyemo et al. performed the first GWA study of blood pressure phenotypes in African Americans, a case-control study of hypertension among 1017 African Americans from Washington, D.C. [Adeyemo, et al. 2009]. No SNP reached genome-wide significance for association with hypertension or diastolic blood pressure in the case-control analysis. Six SNPs, however, reached genome-wide significance in their association with systolic blood pressure compared with 508 normotensive controls. Three loci, SLC24A, PMS1 and IPO7, contained SNPs with suggestive evidence of replication in a European-derived study population. A SNP in SLC24A also had evidence of replication in a West African cohort. This study is limited by its small sample size, and its findings need to be examined in other African American cohorts. Large, carefully phenotyped cohorts are needed to identify common variants with small to modest effects. Therefore, we performed the largest GWA study of blood pressure phenotypes to date. Furthermore, to increase statistical power, replication cohorts should include populations with ethnicity similar to the discovery sample, given concerns regarding population stratification and variation in linkage disequilibrium patterns. Therefore, we examined our discovery SNPs for replication in large cohorts of both European and African Americans.

It has been reported that the SNPs on the Affymetrix 6.0 chip have good coverage of genetic variation in European populations [Li, et al. 2008]. We hypothesize that a genome-wide association study using the Affymetrix 6.0 chip will identify SNPs that influence blood pressure in African Americans across multiple cohorts. We also hypothesize that, using the Illumina iSelect IBC chip, we will identify SNPs significant for SBP and DBP in both European and African Americans. Understanding the contribution of genetic factors to blood pressure will provide insight into the mechanisms underlying ethnic disparities in cardiovascular disease. Findings should assist in developing more target-focused treatments to prevent cardiac end-organ damage and its associated morbidity and mortality.


4.2 Methods

4.2.1 Study Sample

NHLBI’s Candidate-gene Association REsource (CARe) Study comprises nine cohort studies: Atherosclerosis Risk In Communities (ARIC), Coronary Artery Risk Development in Young Adults (CARDIA), Cleveland Family Study (CFS), Cardiovascular Health Study (CHS), the Cooperative Study of Sickle Cell Disease (CSSCD), Framingham Heart Study (FHS), Jackson Heart Study (JHS), Multi-Ethnic Study of Atherosclerosis (MESA) and the Sleep Heart Health Study (SHHS). Each study adopted collaboration guidelines and established a consensus on phenotype harmonization, covariate selection and an analytical plan for within-study genome-wide association and prospective meta-analysis of results across studies. Each study received institutional review board approval of its consent procedures, examination and surveillance components, data security measures, and DNA collection and its use for genetic research. All participants in each study gave written informed consent for participation in the study and the conduct of genetic research. Among the nine cohorts, African American samples from five cohorts (ARIC, CARDIA, CFS, JHS and MESA) had genome-wide genotyping using the Affymetrix 6.0 array and blood pressure data for association analysis. Seven cohorts (ARIC, CARDIA, CFS, CHS, FHS, JHS, MESA) had candidate gene genotyping using the Illumina iSelect IBC Chip. We excluded individuals younger than 18 years of age.

4.2.2 Genotyping and Quality Control


Genome-wide genotyping was performed by the Broad Institute using the Affymetrix Genome-Wide Human SNP Array 6.0, and candidate gene genotyping was performed using the Illumina iSelect IBC Chip.

Quality control of the genotyping data was conducted at the Broad Institute. Most of the QC steps were performed using PLINK (http://pngu.mgh.harvard.edu/purcell/plink/) [Purcell, et al. 2007]. Quality control was conducted at two levels: exclusion of individuals and exclusion of SNPs. Samples with a genotyping success rate < 95% were removed. An inbreeding coefficient was calculated and used as a measure of heterozygosity. Samples with too low or too high levels of heterozygosity (< -4 SD or > 4 SD) were removed because of possible DNA contamination or poor DNA quality. For population-based cohorts, a pair-wise identity-by-descent score was calculated and, for each pair of identical samples, the sample with the lower genotyping success rate was removed. Because of possible DNA sample contamination, samples that share 5% or more of their genome with many other samples were also removed. Multidimensional scaling (MDS) was used to identify outliers, which were removed to reduce population substructure.
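The first two sample-level filters can be sketched as below. This is an illustration only, not the pipeline run at the Broad Institute: it assumes the call-rate filter is applied first and that the mean and standard deviation of the inbreeding coefficient are computed over the samples that remain. The function name and input layout are hypothetical.

```python
import statistics

def sample_qc(samples, min_call_rate=0.95, max_abs_sd=4.0):
    """Return the IDs of samples passing call-rate and heterozygosity filters.

    samples: dict mapping sample ID -> (call_rate, inbreeding_coefficient)
    """
    # Step 1: drop samples with genotyping success rate below the threshold.
    passed = {sid: f for sid, (call_rate, f) in samples.items()
              if call_rate >= min_call_rate}
    values = list(passed.values())
    mean_f = statistics.mean(values)
    sd_f = statistics.stdev(values)
    # Step 2: drop samples whose inbreeding coefficient is an outlier.
    keep = []
    for sid, f in passed.items():
        z = (f - mean_f) / sd_f if sd_f > 0 else 0.0
        if abs(z) <= max_abs_sd:
            keep.append(sid)
    return keep
```

The remaining steps (identity-by-descent checks and MDS outlier removal) depend on pairwise genotype comparisons and are not covered by this sketch.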

There were 1176 SNPs that map to several loci in the human genome, and these were also removed. Individual SNPs were excluded if they had a call rate of less than 90% or were monomorphic. SNPs for which genotype missingness can be predicted by surrounding haplotypes were also removed. For family data, Mendelian inconsistency was checked using PLINK and the corresponding SNPs were removed. No SNPs were removed due to significant deviation from Hardy-Weinberg equilibrium (HWE) because the African-American population is an admixed population, which may result in departure from HWE.

4.2.3 Genotype imputation

Briefly, genotype imputation was performed with the MACH 1.0 haplotyping and imputation program (http://www.sph.umich.edu/csg/abecasis/MaCH/) [Li, et al. 2009], which uses a hidden Markov model to estimate an underlying set of unphased genotypes for each individual in a cohort. The phased haplotypes released along with HapMap phase 2 (build 36 release 22) were used as input for imputation by MACH. Since we were dealing with an African-American population, which is an admixed population with the proportion of European admixture estimated to be ~17-19% [Parra, et al. 1998; Zhu, et al. 2005b], an artificial reference panel consisting of equal proportions of the YRI and CEU HapMap phased haplotypes (using only SNPs found in both the YRI and CEU panels, i.e. ~2.2M SNPs) was constructed. MACH was run in two rounds to reduce the computational load introduced by the large haplotype panel. In the first round, error and recombination rates were estimated from the reference haplotypes and the input cohort using 30 iterations of the program. The second round used these rates to give a greedy estimate of the final genotypes in one iteration. Kang et al. suggested that the accuracy of using the mixed panel for African Americans is comparable to the accuracy reported when imputing a population of Nigerians using YRI as the reference panel [Hao, et al. 2009]. The genotype imputation analysis was done by the Broad Institute.


4.2.4 Phenotype modeling

The modeled phenotypes were systolic and diastolic blood pressure at the first examination attended for all cohorts except CARDIA, CFS and FHS, for which SBP and DBP measured at the last examination were modeled. For individuals taking antihypertensive therapies, blood pressure was imputed by adding 10 mm Hg to SBP and 5 mm Hg to DBP [Tobin, et al. 2005]. Continuous DBP and SBP were adjusted for age, age², body mass index (BMI) and sex in generalized linear models. Residuals were calculated and submitted to the Broad Institute, where univariate analysis of genotype-phenotype association was performed.
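The medication adjustment and the covariate set described above can be written down compactly. The sketch below only prepares the inputs to the adjustment model; the model fitting itself is not shown. The function names and the coding of sex as a 0/1 indicator are hypothetical choices.

```python
def adjust_for_treatment(sbp, dbp, on_medication,
                         sbp_offset=10.0, dbp_offset=5.0):
    """Add fixed offsets (mm Hg) to the blood pressure of treated individuals."""
    if on_medication:
        return sbp + sbp_offset, dbp + dbp_offset
    return sbp, dbp

def covariate_row(age, bmi, is_male):
    """Design-matrix row: intercept, age, age squared, BMI, sex indicator."""
    return [1.0, age, age ** 2, bmi, 1.0 if is_male else 0.0]
```

The residuals from regressing the adjusted SBP or DBP on these covariates are the phenotypes carried forward to the association analysis.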

4.2.5 Statistical analyses

Within each cohort, the ten main eigenvectors from principal component (PC) analysis were calculated and included in the model for testing genotype-phenotype association. The PCs were calculated based on 3000 selected ancestry informative markers (AIMs). For comparison, we also calculated the PCs using the method described in Zhu et al. [Zhu, et al. 2008], in which the eigenvectors are calculated based on unrelated individuals only and PCs are then calculated for all individuals, including family members. Additionally, in this method all SNPs were used to calculate the PCs. The results of the two methods were consistent, except for a few individuals (Figure 4-1), and we did not find that the discrepancy affected the final association results. For all datasets except CFS and FHS, which are family data, association was tested by linear regression on the SNP under an additive model using PLINK; for CFS and FHS, association was tested using the LME routine written in R by Qiong Yang and colleagues at BU, which takes family structure into account.

Meta-analysis of results was carried out using the inverse-variance weighting method in METAL (http://www.sph.umich.edu/csg/abecasis/METAL/index.html). No filtering on minor allele frequency was used. Genomic control was carried out on the cohort-specific test statistics and used to adjust the results in each study.
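The two calculations named above have standard textbook forms, sketched here for a single SNP. This is not METAL's code: it is a minimal fixed-effect illustration that assumes per-cohort effect estimates with standard errors, and it assumes genomic control is applied by inflating standard errors only when the inflation factor exceeds 1. The function names are hypothetical.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted estimate, SE and two-sided p."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value

def genomic_control_lambda(chi2_stats):
    """Inflation factor: median observed chi-square over its null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def gc_corrected_se(se, lam):
    """Inflate a standard error by sqrt(lambda); no deflation if lambda < 1."""
    return se * math.sqrt(max(lam, 1.0))
```

In this form, cohorts with smaller standard errors receive proportionally more weight, and correcting each cohort before pooling keeps residual stratification in one cohort from dominating the combined result.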

For comparison, a pooled analysis of the raw data from the five cohorts genotyped with the Affymetrix 6.0 array was carried out with FamCC [Zhu, et al. 2008]. Cohort-specific genotypes and standardized DBP or SBP residuals were pooled together. Principal components were calculated for all unrelated individuals and predicted for related individuals. Genotype-phenotype association was tested by a linear regression model adjusting for the ten main eigenvectors. Previously published SNP associations with blood pressure [Adeyemo, et al. 2009; Levy, et al. 2009; Newton-Cheh, et al. 2009] were examined. If the published SNPs were not available among either the genotyped or the imputed SNPs in the current study, we used SNPs in LD with them as proxies.


Figure 4-1 Comparison of the 1st and 2nd principal components calculated from Broad and FamCC


4.3 Results

4.3.1 Study sample

The complete study sample is composed of individuals from seven cohorts (ARIC, CARDIA, CFS, JHS, MESA, CHS and FHS), each of which includes African Americans, European Americans, or both. The cohort-specific sample characteristics are described in Table 4-1. Only individuals with available genotype and blood pressure phenotype data are used in the analysis.

SNPs from the Affymetrix 6.0, 1 million-SNP chip were genotyped only in African Americans for the genome-wide association analysis. African-American participants from five cohorts (ARIC, CARDIA, CFS, JHS and MESA) were used in the GWA studies, with a combined sample size of 7,473. Among the African American study sample for the GWA studies, participants from CARDIA were substantially younger than those in the other cohorts, included fewer individuals on antihypertensive medication, and had a lower average SBP and DBP. One exception to the latter is that MESA participants had a lower average DBP than those in CARDIA. Also, in general, women had a higher average BMI than men.

Both African Americans and European Americans were genotyped with the custom-designed IBC chip [Keating, et al. 2008]. All seven cohorts (the five listed for the African American GWA studies with the addition of the CHS and FHS cohorts) were included in the IBC chip candidate gene analysis. Participants from CHS, a cohort that includes both African Americans and European Americans, were substantially older than the participants in the other cohorts. A total of 8,591 African Americans and 20,082 European Americans were genotyped with the IBC chip and used in the candidate gene analysis. Characteristics of the study sample for European Americans are presented in Table 4-1. European American participants from CARDIA were on average significantly younger, and participants from CHS on average significantly older, than those in the other cohorts in the candidate gene analysis.


Table 4-1 Study sample characteristics

Study     N      Male (%)   Antihypertensive   Age (years)     BMI (kg/m²)     DBP (mm Hg)     SBP (mm Hg)
                            medication (%)     Mean    SD      Mean    SD      Mean    SD      Mean    SD

Cohorts with Affymetrix 6.0 genotyping
ARIC      2511   37.08      43.96              53.31   5.80    29.64   6.00    79.72   12.07   128.26  20.80
CARDIA    833    38.06      12.97              39.48   3.85    30.83   7.51    76.86   12.14   116.90  16.43
CFS       489    40.70      38.91              45.71   16.17   34.56   9.62    76.49   10.66   128.16  16.00
JHS       2017   38.67      46.26              49.91   11.92   32.27   7.81    79.96   10.62   124.88  18.03
MESA      1623   45.72      50.46              62.22   10.12   30.16   5.87    74.46   10.24   131.43  21.65

Cohorts with IBC genotyping
a. African Americans
ARIC      2692   36.89      44.05              53.24   5.82    29.72   6.13    79.43   11.99   128.02  20.74
CARDIA    1134   40.48      13.76              39.55   3.83    30.69   7.51    77.01   12.42   117.05  16.26
CFS       530    42.08      37.74              45.20   16.10   34.23   9.55    76.43   10.76   127.68  15.81
CHS       735    37.55      51.84              72.95   5.74    28.45   5.56    75.06   11.29   141.84  22.70
JHS       1916   39.25      45.95              49.93   12.02   32.17   7.84    79.90   10.63   124.90  18.04
MESA      1584   46.09      50.88              62.22   10.08   30.20   5.87    74.58   10.25   131.74  21.61

b. European Americans
ARIC      9123   46.36      25.15              54.23   5.70    26.93   4.82    71.52   9.97    118.26  16.95
CARDIA    1337   46.75      3.66               40.76   3.36    27.14   5.88    72.23   10.21   109.98  12.98
CFS       560    45.89      23.77              46.85   16.67   31.74   8.43    73.60   9.69    123.91  15.49
CHS       3892   43.94      39.70              72.73   5.57    26.37   4.47    69.99   11.24   135.35  21.42
FHS       2884   45.6       39.74              61.23   9.56    28.14   5.29    73.99   9.78    127.07  18.85
MESA      2286   47.77      33.30              62.73   10.23   27.76   5.06    70.13   10.03   123.56  20.69

Study characteristics are shown for cohort samples examined in meta-analysis. N, sample size - the number of individuals with genotype and phenotype data available.


4.3.2 Genome-wide association of African American cohorts for blood pressure

We examined individuals aged ≥ 18 years from three population-based studies (ARIC, CARDIA, MESA) and two family-based studies (CFS, JHS). For individuals treated with antihypertensive medication, 10 mm Hg was added to the observed SBP and 5 mm Hg to the observed DBP. In total, around 800K SNPs were tested for association. The quantile-quantile (QQ) and Manhattan plots for genome-wide SNPs are presented in Figure 4-2. The associations of the top SNPs were consistent in size and direction across the five individual cohorts.
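The medication adjustment described above amounts to adding a fixed offset to the observed readings of treated individuals. A minimal Python sketch follows; the function name and argument layout are hypothetical, and only the offsets of 10 and 5 mm Hg come from the text.

```python
def adjust_for_medication(sbp, dbp, on_medication,
                          sbp_offset=10.0, dbp_offset=5.0):
    """Return (SBP, DBP) in mm Hg after adding the fixed offsets for
    individuals on antihypertensive medication; untreated individuals
    keep their observed values."""
    if on_medication:
        return sbp + sbp_offset, dbp + dbp_offset
    return sbp, dbp


# A treated individual observed at 130/80 mm Hg is analysed as 140/85.
print(adjust_for_medication(130.0, 80.0, on_medication=True))
```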

For meta-analysis of the Affymetrix 6.0 data, we observed one SNP for DBP and one SNP for SBP reaching P < 5×10⁻⁸ (Table 4-2). The strongest signal for DBP was rs10474346 (P=3.56×10⁻⁸) in the intergenic region between GPR98 and ARRDC3 on chromosome 5q14. Of note, our top SNP, rs10474346, is in tight LD with the missense SNP rs4377733 (pairwise r²=0.932) in the hypothetical gene LOC729040.

For SBP (Table 4-2) there were 11 SNPs with P < 1.0×10⁻⁶. The strongest signal was rs2258119 in C21orf91 on chromosome 21 (P=4.69×10⁻⁸). This top SBP SNP is in tight LD with the nearby rs2824495 (r²=1.0), which is a missense SNP in C21orf91. Suggestive evidence of association was detected for IPO13 (chromosome 1p, rs1990151, P=7.39×10⁻⁷), FMNL2 (chromosome 2, rs13413144, P=5.55×10⁻⁷) and GPD2 (chromosome 2q, rs592582, P=4.46×10⁻⁷). The associations of the top SNPs were consistent in size and direction across the five individual cohorts. The regional plots of association evidence for each of the genome-wide significant loci are presented in Figure 4-3.


Pooled genotype data analysis was conducted for the five cohorts with Affymetrix 6.0 genotyping data using FamCC, on genotyped SNPs only. In general, the results were consistent with those from the meta-analysis.

Figure 4-2 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in African Americans from meta-analysis of Affymetrix 6.0 arrays. A. DBP; B. SBP.


Table 4-2 Top associated SNPs for blood pressure in African Americans from meta-analysis of the Affymetrix 6.0 arrays (P < 1×10⁻⁶)

                                                                          CARe meta-analysis, DBP          CARe meta-analysis, SBP
Trait SNP ID      Chr  Position   Type       Nearest gene  EA  EAF    OA  Beta    s.e.   P         Het. P  Beta    s.e.   P         Het. P
SBP   rs1990151   1    44186879   Genotyped  IPO13         A   0.053  G   1.207   0.412  3.39E-03  0.563   3.483   0.704  7.39E-07  0.406
SBP   rs13413144  2    153183499  Imputed    FMNL2         T   0.052  A   1.746   0.500  4.77E-04  0.666   4.282   0.855  5.55E-07  0.084
SBP   rs592582    2    157481632  Imputed    GPD2          G   0.407  T   0.749   0.195  1.20E-04  0.942   1.660   0.329  4.46E-07  0.913
DBP   rs1858309   5    90592314   Imputed    GPR98         C   0.292  T   1.090   0.204  8.76E-08  0.038   1.345   0.346  1.03E-04  0.254
DBP   rs7709572   5    90592789   Genotyped  GPR98         G   0.295  C   1.095   0.203  7.41E-08  0.036   1.342   0.346  1.05E-04  0.292
DBP   rs7724489   5    90594122   Imputed    GPR98         A   0.293  T   1.083   0.205  1.17E-07  0.038   1.346   0.347  1.07E-04  0.246
DBP   rs10474346  5    90599895   Imputed    GPR98/ARRDC3  C   0.330  T   1.104   0.200  3.56E-08  0.106   1.403   0.340  3.73E-05  0.417
SBP   rs243601    21   18081637   Imputed    C21orf91      G   0.485  A   0.629   0.206  2.26E-03  0.278   1.795   0.349  2.61E-07  0.356
SBP   rs243603    21   18082171   Imputed    C21orf91      C   0.449  T   -0.586  0.204  4.06E-03  0.291   -1.751  0.345  3.91E-07  0.521
SBP   rs243605    21   18082991   Imputed    C21orf91      C   0.458  G   -0.595  0.204  3.62E-03  0.291   -1.756  0.346  3.83E-07  0.399
SBP   rs243607    21   18083386   Imputed    C21orf91      G   0.443  A   -0.589  0.204  3.84E-03  0.235   -1.794  0.345  1.99E-07  0.461
SBP   rs243609    21   18083560   Imputed    C21orf91      T   0.398  C   -0.561  0.218  1.01E-02  0.179   -1.865  0.369  4.41E-07  0.484
SBP   rs2220511   21   18086782   Imputed    C21orf91      C   0.498  T   -0.516  0.205  1.17E-02  0.575   -1.746  0.346  4.68E-07  0.469
SBP   rs2258119   21   18089350   Genotyped  C21orf91      C   0.323  T   0.791   0.199  6.90E-05  0.751   1.841   0.337  4.69E-08  0.703

EA: effect allele; EAF: effect allele frequency; OA: other allele; Het. P: heterogeneity P. Beta: the effect size on blood pressure in mm Hg per effect allele, based on the additive genetic model. Results of the two SNPs with genome-wide significance (P < 5×10⁻⁸) are shown in bold.

Figure 4-3 Regional plots of two blood pressure loci in African Americans from meta-analysis of Affymetrix 6.0 arrays (panels a, b and c). For each locus, we show the region extending to within 500 kb of the associated SNP on either side. The statistical significance of SNPs around each locus is plotted as -log10(P) against chromosomal position. The recombination rate of the region is also shown as a light blue curve. For each locus, the most significant SNP is shown in blue. If the most significant SNP of a locus is an imputed SNP, then the most significant genotyped SNP is also shown in blue. Among genotyped SNPs, SNPs in red have r² ≥ 0.8 with the most significant genotyped SNP; SNPs in orange have r² between 0.5 and 0.8; SNPs in yellow have r² between 0.2 and 0.5; SNPs in white have r² < 0.2. Imputed SNPs are shown in grey. Superimposed on the plot are gene locations (green) and recombination rate (blue). Chromosome positions are based on HapMap release 22, build 36.


4.3.3 Association of SNPs in IBC chips and blood pressure in African and European Americans

There are ~50,000 SNPs distributed across ~2,500 cardiovascular genes on the IBC chip. We estimated that the equivalent number of independent tests among the SNPs on the IBC chip is ~25,000 in African Americans [Li and Ji 2005]. Meta-analysis of cohorts with IBC chip data did not identify any SNPs that attained the prespecified significance level (P < 2×10⁻⁶) in African Americans. There was evidence suggestive of association for SBP with two genes: NUCB2 (chromosome 11, rs214070, P=8.65×10⁻⁶; in LD with missense SNP rs757081, pairwise r²=0.68) and SLC25A42 (chromosome 19, rs2012318, P=6.42×10⁻⁶; in LD with missense SNP rs4808907 in SFRS14, pairwise r²=0.61) (Table 4-3). The top genes and SNPs with P < 10⁻⁴ are presented in Table 4-4. The QQ and Manhattan plots for the IBC chip results for DBP and SBP are presented in Figures 4-4 and 4-5 for African Americans and European Americans, respectively.
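The relationship between the estimated number of independent tests and the significance threshold can be illustrated with a short Python sketch. The first function follows the eigenvalue-based formula of Li and Ji, Meff = Σ f(|λ_i|) with f(x) = I(x ≥ 1) + (x - floor(x)), applied to the eigenvalues of the SNP correlation matrix; how those eigenvalues were obtained for the IBC chip is not described here, so they are taken as an input. Dividing 0.05 by ~25,000 reproduces the threshold of 2×10⁻⁶ quoted above, but the use of a Bonferroni-style division is an assumption of this example.

```python
import math


def li_ji_effective_tests(eigenvalues):
    """Effective number of independent tests from the eigenvalues of the
    SNP correlation matrix: sum of I(x >= 1) + (x - floor(x)), x = |lambda|."""
    total = 0.0
    for lam in eigenvalues:
        x = abs(lam)
        total += (1.0 if x >= 1.0 else 0.0) + (x - math.floor(x))
    return total


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level for a family-wise error rate alpha."""
    return alpha / n_tests


# Three uncorrelated SNPs count as three tests; three perfectly
# correlated SNPs (eigenvalues 3, 0, 0) count as one.
print(li_ji_effective_tests([1.0, 1.0, 1.0]))
print(li_ji_effective_tests([3.0, 0.0, 0.0]))
print(bonferroni_threshold(0.05, 25000))
```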

We observed one genome-wide significant SNP in European Americans for the IBC chip (Table 4-3), the missense SNP rs3184504 in SH2B3, which showed association with both SBP and DBP (DBP: P=9.14×10⁻⁷; SBP: P=7.99×10⁻⁸). Genes with suggestive evidence of association include SCD (DBP, rs11190478, P=3.84×10⁻⁶), C12orf30 (DBP, rs17696736, P=5.74×10⁻⁶), CCL5 (DBP, rs2280788, P=8.11×10⁻⁶) and MEIS1 (SBP, rs7563565, P=5.84×10⁻⁶). Of note, all genome-wide significant SNPs identified in our analysis were associated with both DBP and SBP, with the same direction of effect (Table 4-3). Many of the top SNPs identified for one blood pressure phenotype are also associated with the other blood pressure phenotype.


Table 4-3 Top associated SNPs for blood pressure in African Americans and European Americans from meta-analysis of IBC arrays (P < 1×10⁻⁵)

                                                                         CARe meta-analysis, DBP          CARe meta-analysis, SBP
BP trait SNP ID      Chr  Position   Type       Gene      EA  EAF    OA  Beta    s.e.   P         Het. P  Beta    s.e.   P         Het. P

African Americans
SBP      rs12408339  1    154622134  Imputed    RHBG      A   0.076  G   -1.499  0.440  6.55E-04  0.266   -3.317  0.746  8.64E-06  0.534
SBP      rs214070    11   17261893   Genotyped  NUCB2     A   0.057  T   0.830   0.392  3.42E-02  0.158   2.973   0.668  8.65E-06  0.238
SBP      rs6511018   19   19047705   Imputed    SLC25A42  G   0.352  A   0.586   0.187  1.69E-03  0.082   1.429   0.315  5.83E-06  0.027
SBP      rs12985799  19   19048575   Imputed    SLC25A42  C   0.333  T   0.606   0.188  1.29E-03  0.146   1.480   0.318  3.24E-06  0.086
SBP      rs2012318   19   19069240   Genotyped  SLC25A42  C   0.350  T   0.579   0.186  1.90E-03  0.087   1.417   0.314  6.42E-06  0.032
SBP      rs11666627  19   19072238   Imputed    SLC25A42  C   0.350  T   0.592   0.187  1.55E-03  0.086   1.474   0.316  3.00E-06  0.034
SBP      rs10417974  19   19083070   Imputed    SLC25A42  C   0.352  T   0.591   0.187  1.57E-03  0.059   1.459   0.315  3.71E-06  0.034

European Americans
SBP      rs7563565   2    66570096   Genotyped  MEIS1     T   0.030  C   0.438   0.303  1.49E-01  0.766   2.319   0.512  5.84E-06  0.363
DBP      rs10177762  2    238299876  Imputed    LRRFIP1   C   0.212  A   0.626   0.141  9.20E-06  0.632   0.426   0.239  7.39E-02  0.768
DBP      rs869434    2    238303270  Imputed    LRRFIP1   G   0.232  A   0.606   0.132  4.34E-06  0.674   0.398   0.223  7.52E-02  0.799
DBP      rs12624154  2    238312207  Imputed    LRRFIP1   C   0.207  G   0.632   0.133  1.86E-06  0.657   0.356   0.225  1.12E-01  0.734
DBP      rs11689226  2    238386035  Imputed    FLJ40411  C   0.179  T   0.963   0.210  4.63E-06  0.497   0.578   0.355  1.03E-01  0.577
DBP      rs4663791   2    238386054  Imputed    FLJ40411  G   0.170  A   0.951   0.209  5.23E-06  0.514   0.561   0.352  1.11E-01  0.576
DBP      rs4663794   2    238396670  Imputed    FLJ40411  T   0.164  C   0.961   0.214  6.94E-06  0.543   0.564   0.361  1.19E-01  0.598
DBP      rs1434473   2    238399359  Imputed    FLJ40411  A   0.164  C   0.970   0.216  6.94E-06  0.548   0.570   0.365  1.18E-01  0.601
DBP      rs17669878  10   102089563  Imputed    SCD       C   0.389  G   0.508   0.109  3.39E-06  0.773   0.634   0.185  5.90E-04  0.308
DBP      rs11190478  10   102092122  Genotyped  SCD       C   0.382  G   0.490   0.106  3.84E-06  0.738   0.607   0.178  6.29E-04  0.288
DBP      rs735877    10   102094511  Genotyped  SCD       T   0.382  C   0.488   0.106  4.18E-06  0.783   0.610   0.178  6.05E-04  0.299
DBP      rs1502593   10   102099192  Genotyped  SCD       A   0.397  G   0.477   0.106  6.33E-06  0.633   0.575   0.177  1.16E-03  0.320
SBP      rs2239196   12   110357550  Imputed    SH2B3     A   0.070  G   -0.944  0.284  8.90E-04  0.200   -2.228  0.477  3.04E-06  0.404
DBP/SBP  rs3184504   12   110368991  Genotyped  SH2B3     T   0.497  C   0.503   0.103  9.14E-07  0.003   0.923   0.172  7.99E-08  0.480
DBP/SBP  rs4766578   12   110388754  Imputed    ATXN2     A   0.498  T   -0.518  0.104  6.40E-07  0.004   -0.936  0.176  1.03E-07  0.451
DBP/SBP  rs10774625  12   110394602  Imputed    ATXN2     G   0.498  A   -0.518  0.104  6.37E-07  0.004   -0.936  0.176  1.05E-07  0.449
SBP      rs16941541  12   110438087  Imputed    ATXN2     A   0.075  G   -0.933  0.273  6.17E-04  0.215   -2.168  0.459  2.32E-06  0.368
DBP/SBP  rs653178    12   110492139  Imputed    ATXN2     C   0.498  T   0.521   0.105  6.39E-07  0.003   0.939   0.177  1.15E-07  0.469
DBP/SBP  rs11065987  12   110556807  Imputed    BRAP      G   0.455  A   0.545   0.111  8.42E-07  0.007   0.922   0.187  8.66E-07  0.440
DBP      rs17696736  12   110971201  Genotyped  C12orf30  G   0.448  A   0.469   0.104  5.74E-06  0.016   0.747   0.174  1.68E-05  0.263
DBP      rs17630235  12   111076069  Imputed    TRAFD1    A   0.437  G   0.488   0.107  5.52E-06  0.012   0.788   0.182  1.43E-05  0.311
DBP      rs11066188  12   111095097  Imputed    TRAFD1    A   0.436  G   0.489   0.108  5.69E-06  0.012   0.790   0.182  1.46E-05  0.313
DBP      rs11066301  12   111355755  Imputed    PTPN11    G   0.446  A   0.521   0.113  4.12E-06  0.015   0.822   0.191  1.73E-05  0.330
DBP      rs11066320  12   111390798  Imputed    PTPN11    A   0.446  G   0.523   0.114  4.24E-06  0.015   0.825   0.192  1.78E-05  0.332
DBP      rs2280788   17   31231518   Genotyped  CCL5      C   0.025  G   -1.468  0.329  8.11E-06  0.441   -1.583  0.548  3.87E-03  0.662

Gene: nearest gene; EA: effect allele; EAF: effect allele frequency; OA: other allele; Het. P: heterogeneity P. MAF: minor allele frequency. Beta: the effect size on blood pressure in mm Hg per allele, based on the additive genetic model. Results of the SNP with genome-wide significance (P < 2×10⁻⁶) are shown in bold.


Table 4-4 Top associated SNPs for blood pressure in African Americans and European Americans from meta-analysis of IBC arrays (P < 1×10⁻⁴)

                                                    CARe meta-analysis, DBP          CARe meta-analysis, SBP
BP trait SNP ID      Chr  Position   EA  EAF    OA  Beta    s.e.   P         Het. P  Beta    s.e.   P         Het. P

a. African Americans
DBP      rs35397     5    33986873   T   0.209  G   0.895   0.227  8.13E-05  0.133   0.617   0.383  1.07E-01  0.003
DBP      rs10068980  5    161220425  A   0.496  G   0.690   0.176  8.83E-05  0.138   0.463   0.297  1.19E-01  0.210
DBP      rs2523586   6    31435414   G   0.104  T   1.253   0.287  1.25E-05  0.067   1.129   0.486  2.03E-02  0.009
DBP      rs10215932  7    150913729  T   0.073  C   1.304   0.330  7.74E-05  0.295   1.582   0.554  4.32E-03  0.230
DBP      rs4930130   11   2576206    G   0.413  A   -0.741  0.178  3.16E-05  0.294   -0.468  0.300  1.19E-01  0.177
DBP      rs1791926   11   72621769   C   0.435  T   0.725   0.180  5.53E-05  0.064   0.687   0.302  2.30E-02  0.112
DBP      rs285246    19   16281758   T   0.226  C   0.825   0.209  7.83E-05  0.271   0.871   0.352  1.34E-02  0.470
SBP      rs2270476   2    211234009  A   0.108  G   0.972   0.284  6.12E-04  0.879   1.986   0.479  3.39E-05  0.284
SBP      rs2232340   3    53866007   T   0.040  C   -1.026  0.432  1.75E-02  0.690   -2.897  0.719  5.61E-05  0.604
SBP      rs2232351   3    53874239   G   0.040  A   -1.073  0.431  1.28E-02  0.624   -2.904  0.717  5.13E-05  0.595
SBP      rs214070    11   17261893   A   0.057  T   0.830   0.392  3.42E-02  0.158   2.973   0.668  8.65E-06  0.238
SBP      rs757081    11   17308259   G   0.066  C   0.813   0.370  2.78E-02  0.159   2.703   0.630  1.76E-05  0.074
SBP      rs1150226   11   113350751  A   0.318  G   0.615   0.191  1.30E-03  0.236   1.312   0.322  4.51E-05  0.184
SBP      rs1150219   11   113365953  G   0.302  C   0.597   0.193  2.02E-03  0.647   1.264   0.325  9.95E-05  0.073
SBP      rs2764626   13   32676469   C   0.240  T   -0.751  0.203  2.15E-04  0.544   -1.405  0.343  4.24E-05  0.746
SBP      rs2012318   19   19069240   C   0.350  T   0.579   0.186  1.90E-03  0.087   1.417   0.314  6.42E-06  0.032
SBP      rs10423742  19   19089136   G   0.315  A   0.567   0.192  3.13E-03  0.062   1.374   0.323  2.05E-05  0.162

b. European Americans
DBP      rs3769072   2    238312883  G   0.232  C   0.491   0.123  6.86E-05  0.814   0.275   0.207  1.85E-01  0.777
DBP      rs3739038   2    238337442  G   0.233  C   0.493   0.123  6.39E-05  0.748   0.283   0.207  1.72E-01  0.824
DBP      rs17561000  5    74045083   G   0.097  A   -0.729  0.174  2.89E-05  0.506   -0.868  0.293  3.04E-03  0.243
DBP      rs11190478  10   102092122  C   0.382  G   0.490   0.106  3.84E-06  0.738   0.607   0.178  6.29E-04  0.288
DBP      rs735877    10   102094511  T   0.382  C   0.488   0.106  4.18E-06  0.783   0.610   0.178  6.05E-04  0.299
DBP      rs1502593   10   102099192  A   0.397  G   0.477   0.106  6.33E-06  0.633   0.575   0.177  1.16E-03  0.320
DBP      rs11190483  10   102103639  T   0.293  C   0.489   0.114  1.83E-05  0.202   0.417   0.191  2.91E-02  0.034
DBP      rs3071      10   102104453  C   0.300  A   0.453   0.113  6.43E-05  0.344   0.381   0.190  4.50E-02  0.110
DBP      rs1401982   12   88513730   G   0.402  A   -0.410  0.105  9.49E-05  0.038   -0.727  0.176  3.59E-05  0.525
DBP      rs11105354  12   88550654   G   0.169  A   -0.551  0.137  5.77E-05  0.159   -0.984  0.230  1.93E-05  0.156
DBP/SBP  rs3184504   12   110368991  T   0.497  C   0.503   0.103  9.14E-07  0.003   0.923   0.172  7.99E-08  0.480
DBP/SBP  rs17696736  12   110971201  G   0.448  A   0.469   0.104  5.74E-06  0.016   0.747   0.174  1.68E-05  0.263
DBP      rs2472304   15   72831291   G   0.357  A   0.426   0.108  7.87E-05  0.535   0.481   0.181  7.94E-03  0.034
DBP      rs8061228   16   52439872   C   0.184  T   0.526   0.134  8.57E-05  0.998   0.868   0.225  1.16E-04  0.786
DBP      rs2280788   17   31231518   C   0.025  G   -1.468  0.329  8.11E-06  0.441   -1.583  0.548  3.87E-03  0.662
SBP      rs7563565   2    66570096   T   0.030  C   0.438   0.303  1.49E-01  0.766   2.319   0.512  5.84E-06  0.363
SBP      rs661348    11   1861868    C   0.428  T   0.306   0.104  3.29E-03  0.775   0.682   0.175  9.94E-05  0.346
SBP      rs12861167  13   35003745   G   0.129  A   0.346   0.154  2.45E-02  0.543   1.012   0.258  9.05E-05  0.337
SBP      rs1371386   15   87824056   A   0.380  G   -0.199  0.107  6.24E-02  0.919   -0.699  0.179  9.40E-05  0.697
SBP      rs2071410   15   89221944   G   0.325  C   0.350   0.111  1.57E-03  0.413   0.754   0.185  4.77E-05  0.122
SBP      rs4796105   17   31110080   C   0.175  A   -0.405  0.137  3.14E-03  0.239   -0.917  0.230  6.79E-05  0.302
SBP      rs1994182   17   31111817   G   0.179  C   -0.383  0.136  4.75E-03  0.324   -0.894  0.228  8.64E-05  0.261
SBP      rs4795090   17   31147710   G   0.171  A   -0.403  0.138  3.55E-03  0.296   -0.903  0.231  9.54E-05  0.205
SBP      rs870355    17   73677791   G   0.382  A   0.266   0.107  1.26E-02  0.211   0.714   0.178  6.32E-05  0.689

EA: effect allele; EAF: effect allele frequency; OA: other allele; Het. P: heterogeneity P.


Figure 4-4 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in African Americans from meta-analysis of IBC arrays. A. DBP; B. SBP.


Figure 4-5 Manhattan plots and QQ plots of genome-wide SNPs for three phenotypes of blood pressure in European Americans from meta-analysis of IBC arrays. A. DBP; B. SBP.


4.4 Discussion

This study represents the largest genome-wide association study of blood pressure in African Americans to date. In a meta-analysis across five large U.S. community-based cohorts using the Affymetrix 6.0, 1 million-SNP chip, we found three novel SNPs, located on chromosomes 1, 5 and 21, respectively, that reached genome-wide significance. We also identified several top SNPs using the IBC array.

4.4.1 Top GWA SNPs for Affy 6.0 million SNP chip

We identified a SNP on chromosome 5 (rs10474346) that reached genome-wide significance for DBP. There is no current data on the prediction of this SNP effect on any . Genes in the region include ARRDC3 (arrestin C), PPAR-gamma ligand and PPAR-gamma activator. PPAR-gamma activators appear to play a role not only in hypertension but also in cardiovascular disease, including atherosclerosis. PPARs are a family of nuclear receptors that are activated by nutrient molecules and their derivatives [Duan, et al. 2009]. One unifying theme that has been suggested as linking PPARs to hypertension is the modification of inflammation and the innate immune system. Also found in this region is the gene GPR98 (G protein-coupled receptor), which is known to be expressed in the central nervous system. Mutations in this gene are typically associated with deafness, blindness and febrile seizures [Hilgert 2009].

A second SNP of interest is rs1990151 on chromosome 1. This is an intronic SNP in the gene encoding an importin beta protein (IPO13). Importin beta is a nuclear transport protein that modifies nuclear availability of glucocorticoids through nucleocytoplasmic shuttling [Tao, et al. 2006]. A potential link has been proposed between early-onset glucocorticoid exposure and hypertension through changes in gene expression and function in the kidney [Dodic, et al. 2002]. Of note, another importin beta protein (IPO7) was identified by Adeyemo et al. in a genome-wide association analysis of a normotensive subset of 1,017 African Americans [Adeyemo, et al. 2009].

Finally, the SNP on chromosome 21 that reached genome-wide significance is in tight LD with rs2824495. The SNP is an intronic SNP in the gene C21orf91. Other genes in that area include CXADR (coxsackie virus and adenovirus receptor). These are type 1 membrane receptors that are expressed in endothelial cells. Mutations within these genes are associated with dilated cardiomyopathies.

4.4.2 Top SNPs from the meta-analysis of the IBC array

We identified three top SNPs for blood pressure on chromosome 19. One gene of particular interest in the region is SLC25A42. This gene codes for a mitochondrial carrier protein and was identified in ICBP African Americans.

Two SNPs, rs4930130 and rs1791926, both on chromosome 11, were associated with DBP with P values on the order of 10⁻⁵. The rs4930130 SNP is in proximity to the gene KCNQ1. This gene encodes a voltage-gated potassium channel required for the repolarization phase of the cardiac action potential. The gene product is associated with hereditary long QT syndrome, Romano-Ward syndrome, Jervell and Lange-Nielsen syndrome and familial atrial fibrillation [Henrion, et al. 2009]. Polymorphisms in this gene are in close proximity to genes which have been associated with cancer.


SNP rs1791926 is in proximity to the gene P2RY2 (purinergic receptor P2Y, G-protein coupled, 2), which mediates vasoactive and proliferative stimuli. It is located on chromosome 11q13.5-q14.1 and contains three exons. There is evidence that the purinergic system may affect the activity of the epithelial sodium channel (ENaC) in renal collecting ducts, which is responsible for reabsorption of sodium [Hummler and Vallon 2005; Lehrmann, et al. 2002]. Genetic defects in this channel in humans have been associated with hypertension in Liddle’s syndrome. A recent case-control association study by Wang et al. showed an association of the P2RY2 SNP rs4944831 with hypertension in Japanese men [Wang, et al. 2009b]. Two other case-control studies by Wang et al. demonstrated an association of rs10898909 with myocardial infarction in Japanese men and an association of a T-A-G haplotype of P2RY2 (rs1783596-rs4382936-rs10898909) with cerebral infarction in Japanese subjects [Wang, et al. 2009a; Wang, et al. 2009b].

Three SNPs were associated with SBP with P values of 10⁻⁵ to 10⁻⁶. Two of the three SNPs, rs214070 and rs757081, are in proximity to the NUCB2 (nucleobindin 2) gene. This protein is a downstream substrate for caspases in the apoptotic pathway [Birney, et al. 2007]. Despite the importance of these proteins in various signaling pathways, their role in the regulation of biological functions is not known. The third SNP, rs1150226, is in proximity to HTR3A, a serotonin receptor gene which may play a role in major psychiatric illness. The product of this gene belongs to the ligand-gated ion channel receptor superfamily. This gene encodes subunit A of the type 3 receptor for 5-hydroxytryptamine (serotonin), a biogenic hormone that functions as a neurotransmitter, a hormone, and a mitogen. Alternatively spliced transcript variants encoding different isoforms have been identified. A recent study showed that the T/T genotype of the HTR3A polymorphism rs1062613 is associated with treatment-resistant schizophrenia [Ji, et al. 2008b].

4.4.3 Comparison with published SNPs from previous African American studies in the CARe African Americans

A GWAS with a modest sample size reported the top 10 loci that may be associated with SBP or DBP in an African-American population [Adeyemo, et al. 2009]. We examined whether these published SNPs can be replicated in our sample (Table 4-5).

Briefly, Adeyemo et al. performed the first GWAS for blood pressure phenotypes among African Americans, a case-control study of hypertension among 1,017 African Americans from Washington, D.C. No SNP reached genome-wide significance for association with hypertension or DBP. However, among the 508 normotensive controls, six SNPs reached genome-wide significance for SBP. The SNPs were in or near the following genes: PMS1, AL365265.23 (a pseudogene), SLC24A4, YWHAZ, IPO7, and CACNA1H. Subsequently, the investigators performed two replication studies. In a hypertension case-control study consisting of 980 West Africans, three of the six SNPs were monomorphic. The other three SNPs were not associated with blood pressure phenotypes. However, in analyses combining the two studies, the SNP located in SLC24A4, rs11160059, had a combined P = 0.0003 with the same direction of association in the two studies. In the Diabetes Genetics Initiative, a European ancestry sample, the investigators performed in silico replication; of the 5 genes that contained genome-wide significant SNPs for SBP, three contained SNPs associated with SBP with low P values. These genes were SLC24A4, IPO7, and PMS1.

In summary, this African American GWAS identified six SNPs with genome-wide significant association with SBP. SLC24A4 may replicate in both West African and European-derived study populations, and PMS1 and IPO7 may replicate in the European sample.

SNP rs12279202 was reported by Adeyemo et al. [Adeyemo, et al. 2009] to be associated with SBP (P=4.80×10⁻⁸), and this SNP has a one-sided P=0.033 in our study. For DBP, the loci reported by Adeyemo et al. [Adeyemo, et al. 2009] were not replicated in our study.


Table 4-5 Lookup results for published SNPs from a previous African American study [Adeyemo, et al. 2009]. Results for SNP names labeled with * are from imputed SNPs.

                                                                                         Published [Adeyemo, et al. 2009]   CARe meta-analysis, DBP   CARe meta-analysis, SBP
Rank  SNP          Chr  Position   Type                   Closest gene   Dist. (kb)  Allele  MAF      P         EA  OA  Effect  P      EA  OA  Effect  P

Systolic BP
1     rs5743185    2    190446083  INTRONIC               PMS1           0           T       0.1418   2.09E-11  A   G   1.475   0.539  A   G   2.300   0.498
2     rs16877320   6    16031005   INTERGENIC             AL365265.23    12          G       0.1316   3.42E-09  C   T   -1.120  0.071  C   T   -1.264  0.227
3     rs11160059   14   91877083   INTRONIC               SLC24A4        0           A       0.1782   1.54E-08  T   C   0.205   0.633  T   C   0.686   0.342
4     rs17365948*  8    102026053  INTRONIC               YWHAZ          0           A       0.1125   1.59E-08  T   C   0.533   0.528  T   C   0.682   0.638
5     rs12279202   11   9388666    INTRONIC               IPO7           0           A       0.1231   4.80E-08  T   C   1.171   0.162  T   C   3.032   0.033
6     rs3751664    16   1194370    NON_SYNONYMOUS_CODING  CACNA1H        0           T       0.1093   6.71E-08  NA  NA  NA      NA     NA  NA  NA      NA
7     rs11659639*  18   56318592   INTERGENIC             MC4R           127         C       0.09771  2.13E-07  G   T   -1.414  0.413  G   T   -3.044  0.277
8     rs4613079    16   79201458   INTRONIC               CDYL2          0           T       0.1766   5.06E-07  A   G   0.044   0.935  A   G   1.241   0.208
9     rs13201744   6    6071844    INTERGENIC             F13A1          17          A       0.16     1.12E-06  A   C   -1.195  0.029  A   C   -0.294  0.760
10    rs2183737    9    70431453   INTERGENIC             RP11-274B18.3  15          T       0.4592   1.21E-06  A   G   0.291   0.126  A   G   0.296   0.360

Diastolic BP
1     rs1867226    15   89324717   INTRONIC               PRC1           0           C       0.4636   5.80E-07  C   G   -0.148  0.430  C   G   -0.108  0.736
2     rs9590141    13   94401623   INTERGENIC             ABCC4          68          A       0.1224   8.76E-07  A   G   -0.024  0.935  A   G   0.230   0.646
3     rs10135446   14   79479231   INTERGENIC             NRXN3          79          A       0.1298   4.47E-06  T   G   0.286   0.276  T   G   0.858   0.054
4     rs11120313   1    212647829  INTRONIC               PTPN14         0           A       0.1608   4.53E-06  A   C   0.093   0.710  A   C   0.253   0.550
5     rs16848861   1    235211537  INTERGENIC             RP11-182B22.4  0           G       0.2008   4.73E-06  G   A   0.192   0.415  G   A   0.868   0.030
6     rs11846013   14   46002041   INTERGENIC             RPL10L         188         A       0.136    4.99E-06  T   G   0.513   0.063  T   G   0.205   0.662
7     rs16853574   3    170562457  INTERGENIC             MDS1           19          C       0.03982  5.10E-06  C   T   0.644   0.183  C   T   1.592   0.050
8     rs2823756    21   16664201   INTRONIC               AP000473.2     0           T       0.4389   5.73E-06  A   G   0.096   0.610  A   G   -0.129  0.685
9     rs8039294    15   89544863   INTRONIC               SV2B           0           G       0.4828   6.29E-06  G   T   -0.139  0.467  G   T   -0.177  0.583
10    rs9301196    13   106645433  INTRONIC               FAM155A        0           T       0.1159   6.66E-06  T   C   0.098   0.747  T   C   0.046   0.929

Dist. (kb): distance to gene in kb; EA: effect allele; OA: other allele.


4.4.4 Comparison with published SNPs from previous European American studies in CARe African Americans

Two large-scale GWAS in European populations have been published, and 24 loci have shown evidence of association with blood pressure [Levy, et al. 2009; Newton-Cheh, et al. 2009].

The CHARGE Consortium consists of 29,136 participants of European descent from six population-based cohort studies. Forty-three SNPs representing 7 loci on 4 chromosomes were associated with a blood pressure phenotype with P < 4×10⁻⁷. Six SNPs from 4 of the loci were tested for replication in 34,433 individuals representing the Global BPgen Consortium. Five of the six SNPs, representing three loci, met criteria consistent with external replication. All three loci are located on chromosome 12 and include ATP2B1, TBX3-TBX5, and a large block of linkage disequilibrium on 12q24 including SH2B3, ATXN2, TRAFD1 and other genes. Finally, 30 SNPs were tested for association with blood pressure phenotypes in a joint meta-analysis of CHARGE and Global BPgen. Eleven SNPs representing eight loci attained a genome-wide significance level of P < 5×10⁻⁸. Genes at these loci included CYP17A1, PLEKHA7, ATP2B1, SH2B3, CACNB2, CSK-ULK3, TBX3-TBX5, and ULK4.

The Global BPgen Consortium consisted of 34,433 individuals of European ancestry from 17 cohorts ascertained through population-based sampling or from case-control studies. The investigators identified 26 SNPs that were associated with a blood pressure phenotype at the P < 1×10⁻⁸ level. Twelve SNPs were then carried forward in a meta-analysis on up to 71,225 individuals of European descent. Five SNPs were significant at P < 5×10⁻⁸. These SNPs are located in the genes MTHFR/NPPA, CYP17A1, FGF5, ATXN2/SH2B3, and CYP1A2. Twenty SNPs were then tested for replication in the 29,136 individuals of European descent representing the CHARGE Consortium. Ten SNPs were significant at the 5×10⁻⁸ level. These SNPs are located in CYP17A1, CYP1A2, FGF5, SH2B3/ATXN2, MTHFR/NPPA, c10orf107, HAA0, MDS1, ZNF652, and PLCD3.

For the SNPs reported in European populations, we replicated rs653178 (ATXN2, one-sided P=0.008), rs381815 (PLEKHA7, one-sided P=0.051), rs3184504 (SH2B3, P=4.6×10⁻³), rs2384550 (TBX3-TBX5, P=0.037) and rs6495122 (CSK-ULK3, P=0.011) (Table 4-6).
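The one-sided P values quoted above can be related to the two-sided P values in Table 4-6 by a simple rule: when the effect in our sample is in the same direction as the published effect, the one-sided P value is half the two-sided value. The Python sketch below illustrates this; the function names are hypothetical, and the allele alignment assumes a biallelic SNP reported on the same strand in both studies.

```python
def align_effect(effect, effect_allele, published_coded_allele):
    """Express an effect relative to the published coded allele,
    flipping the sign when the two studies code opposite alleles."""
    return effect if effect_allele == published_coded_allele else -effect


def one_sided_p(two_sided_p, effect, expected_sign):
    """One-sided P value for a replication test in the published direction.

    effect        : effect in the replication sample, already aligned
    expected_sign : +1 or -1, the sign of the published effect
    """
    same_direction = (effect > 0) == (expected_sign > 0)
    return two_sided_p / 2.0 if same_direction else 1.0 - two_sided_p / 2.0


# rs381815 (PLEKHA7), SBP: published beta 0.65 for allele T; in CARe the
# effect allele is T with effect 0.620 and two-sided P = 0.102.
print(one_sided_p(0.102, align_effect(0.620, "T", "T"), +1))

# rs653178 (ATXN2): published effect -0.46 for allele T; in CARe the
# effect allele is C with SBP effect 1.748 and two-sided P = 0.016.
print(one_sided_p(0.016, align_effect(1.748, "C", "T"), -1))
```

Both examples use values taken from Table 4-6 and give one-sided P values of about 0.051 and 0.008, in agreement with the text.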


Table 4-6 Lookup results for published SNPs from previous European American studies [Levy, et al. 2009; Newton-Cheh, et al. 2009]. Results for SNP names labeled with * are from imputed SNPs.

a. Global BPgen [Newton-Cheh, et al. 2009]

                                                              Global BPgen                    CARe meta-analysis, DBP   CARe meta-analysis, SBP
BP trait  SNP          Chr  Position   Type        Closest gene  Allele  Effect  se    P         EA  OA  Effect  P      EA  OA  Effect  P
SBP       rs17367504   1    11797044   INTRONIC    MTHFR         G       -0.85   0.11  2.00E-13  G   A   0.165   0.582  G   A   -0.379  0.509
SBP       rs11191548   10   104836168  INTERGENIC  CNNM2/NT5C2   T       1.16    0.12  7.00E-24  C   T   -1.056  0.016  C   T   -0.775  0.296
SBP       rs12946454*  17   40563647   INTRONIC    PLCD3         T       0.57    0.1   1.00E-08  T   A   -0.172  0.350  T   A   -0.571  0.339
DBP       rs16998073*  4    81541520   UPSTREAM    FGF5          T       0.5     0.05  1.00E-21  T   A   0.573   0.141  T   A   0.630   0.342
DBP       rs1530440    10   63194597   INTRONIC    c10orf107     T       -0.39   0.06  1.00E-09  T   C   -0.376  0.415  T   C   0.248   0.751
DBP       rs653178     12   110470476  INTRONIC    ATXN2         T       -0.46   0.05  3.00E-18  C   T   0.341   0.422  C   T   1.748   0.016
DBP       rs1378942    15   72864420   INTRONIC    CSK           C       0.43    0.04  1.00E-23  A   C   -0.032  0.916  A   C   -0.803  0.119
DBP       rs16948048   17   44795465   UPSTREAM    ZNF652        G       0.31    0.05  5.00E-09  G   A   0.092   0.626  G   A   -0.223  0.487

b. The CHARGE Consortium [Levy, et al. 2009]

                                                        CHARGE + Global BPgen meta-analysis  CARe meta-analysis, DBP   CARe meta-analysis, SBP
SNP identifier  Chr  Position     Nearest gene    Alleles        Beta   s.e.  P value   EA  OA  Effect  P      EA  OA  Effect  P
                                                  (coded/other)

Systolic blood pressure
rs12046278*     1    10,722,164   CASZ1           T/C            -0.53  0.12  4.77E-06  C   T   0.401   0.234  C   T   0.059   0.918
rs7571613*      2    190,513,907  PMS1            A/G            -0.54  0.13  1.90E-05  G   A   -0.157  0.464  G   A   -0.277  0.445
rs448378        3    170,583,593  MDS1            A/G            -0.51  0.10  1.18E-07  A   G   0.151   0.417  A   G   0.130   0.680
rs2736376*      8    11,155,175   MTMR9           C/G            -0.48  0.15  9.15E-04  C   G   -0.314  0.127  C   G   -0.469  0.180
rs1910252*      8    49,569,915   EFCAB1          T/C            -0.43  0.13  6.13E-04  T   C   0.025   0.902  T   C   -0.227  0.505
rs11014166*     10   18,748,804   CACNB2          A/T            0.50   0.10  7.03E-07  T   A   -0.199  0.444  T   A   -0.297  0.500
rs1004467*      10   104,584,497  CYP17A1         A/G            1.05   0.16  1.28E-10  G   A   0.010   0.968  G   A   -0.307  0.479
rs381815*       11   16,858,844   PLEKHA7         T/C            0.65   0.11  1.89E-09  T   C   0.407   0.069  T   C   0.620   0.102
rs2681492*      12   88,537,220   ATP2B1          T/C            0.85   0.13  3.76E-11  C   T   -0.177  0.536  C   T   -0.222  0.646
rs3184504*      12   110,368,991  SH2B3           T/C            0.58   0.10  4.52E-09  T   C   0.532   0.142  T   C   1.612   9.13E-03

Diastolic blood pressure
rs13423988*     2    68,764,770   GPR73-ARHGAP25  T/C            0.33   0.08  5.00E-05  T   C   -0.097  0.624  T   C   0.082   0.807
rs13401889*     2    190,618,804  MSTN            T/C            -0.31  0.08  4.82E-05  T   C   -0.014  0.944  T   C   0.163   0.626
rs9815354*      3    41,887,655   ULK4            A/G            0.49   0.08  2.54E-09  A   G   0.163   0.490  A   G   0.264   0.399
rs7016759       8    49,574,969   EFCAB1          T/C            0.30   0.08  2.29E-04  C   T   -0.268  0.495  C   T   -0.636  0.435
rs11014166*     10   18,748,804   CACNB2          A/T            0.37   0.06  1.24E-08  T   A   -0.199  0.444  T   A   -0.297  0.500
rs11024074      11   16,873,795   PLEKHA7         T/C            -0.33  0.07  1.20E-06  C   T   0.453   0.202  C   T   0.515   0.133
rs2681472*      12   88,533,090   ATP2B1          A/G            0.50   0.08  1.47E-09  G   A   -0.397  0.204  G   A   -0.612  0.246
rs3184504*      12   110,368,991  SH2B3           T/C            0.48   0.06  2.58E-14  T   C   0.532   0.142  T   C   1.612   9.13E-03
rs2384550       12   113,837,114  TBX3-TBX5       A/G            -0.35  0.06  3.75E-08  A   G   -0.354  0.074  A   G   -0.747  0.026
rs6495122*      15   72,912,698   CSK-ULK3        A/C            0.40   0.06  1.84E-10  C   A   -0.506  0.021  C   A   -1.321  3.74E-04

EA: effect allele; OA: other allele. SNPs in boldface attained P < 5×10⁻⁸ in the meta-analysis of CHARGE and Global BPgen.

4.4.5 Independent replication of top CARe SNPs in African and European American cohorts

Follow-up replication studies are essential for confirming initial GWAS results. We chose 7 top SNPs from the Affymetrix chip results, 6 of which came from the meta-analysis results (3 for DBP and 3 for SBP). The seventh, rs17610514 on chromosome 11, is the top SNP for DBP in the CARDIA cohort GWAS; it was included because many signals were identified on chromosome 11 for DBP in CARDIA but in no other cohort. The list of 7 SNPs was sent to five replication studies (Maywood, Nigeria, Rotimi’s group [Adeyemo, et al. 2009], ICBP and GENOA). Four of these studies have African American samples; the exception is ICBP, which has European American samples. Table 4-7 shows the replication results in all 5 studies. Unfortunately, none of the replication results was significant after multiple comparison correction. Furthermore, many replication results show effect sizes of the opposite sign.

We investigated possible reasons for the replication failure by calculating the power of each study from its sample size and the proportion of phenotypic variance explained by each identified SNP. The power calculation was conducted for the top SNPs for DBP and SBP at a significance level of 0.05. SNP rs10474346 on chromosome 5 is the top SNP for DBP, with a meta-analysis p-value of 3.56×10⁻⁸, and explains 0.43% of the phenotypic variation in DBP. As shown in Table 4-8, the power of the 4 African American replication studies to detect this signal is around or below 62%. A similar power calculation was conducted for rs2258119 on chromosome 21, the top SNP for SBP, with a meta-analysis p-value of 4.69×10⁻⁸. This SNP explains 0.39% of the phenotypic variation in SBP. As shown in Table 4-8, the power of the 4 African American replication studies to detect this signal is around or below 58%. The median power to replicate the remaining 4 top SNPs is around 58%. The power may even be overestimated, because the winner's curse inflates the effect sizes estimated in the discovery sample. The calculation therefore shows that, even for the most significant SNPs identified in the CARe GWAS, these replication studies have insufficient power to confirm the signal because of their small sample sizes.
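
The power calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not necessarily the procedure used for Table 4-8: it assumes an additive SNP effect on a quantitative trait, a two-sided 1-df test, and a noncentrality parameter of n·h²/(1 − h²), where h² is the proportion of phenotypic variance explained. The function name replication_power is hypothetical. Under these assumptions the results should be close to, though not identical with, the values in Table 4-8.

```python
from math import sqrt
from statistics import NormalDist


def replication_power(n, var_explained, alpha=0.05):
    """Approximate power of a two-sided 1-df association test.

    Assumption (not from the dissertation): the test statistic is roughly
    chi-square with 1 df and noncentrality n * h2 / (1 - h2), where h2 is
    the proportion of phenotypic variance explained by the SNP.
    """
    std_normal = NormalDist()
    ncp = n * var_explained / (1.0 - var_explained)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(ncp)
    # P(|Z + shift| > z_crit) for Z ~ N(0, 1)
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Sample sizes from Table 4-8; 0.43% is the DBP variance explained by rs10474346.
sample_sizes = {"Maywood": 743, "Nigeria": 1188, "Rotimi": 1016,
                "ICBP": 69092, "GENOA": 845}
powers = {study: replication_power(n, 0.0043)
          for study, n in sample_sizes.items()}

for study, power in powers.items():
    print(f"{study:8s} n={sample_sizes[study]:6d} power={power:.1%}")
```

Because power depends on n and h² only through their product, a study needs roughly the sample size of the discovery cohort to have a good chance of confirming an effect of this size.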

The ICBP study has a sample size of 69,092 individuals and thus has sufficient power to replicate our findings, yet it also failed to do so. There are several possible explanations. First, the ICBP participants are European Americans, whereas the CARe study comprises African Americans. Association studies rely on linkage disequilibrium (LD) between the genotyped SNPs and the ungenotyped disease variant. LD patterns can differ greatly between African Americans and European Americans, which could produce different results in the ICBP and CARe studies. In addition, disease risk may be influenced by different genes, or by different alleles of the same gene, in different populations. Such locus or allelic heterogeneity could also account for the failure of ICBP to replicate the CARe results.
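
The role of LD in this argument can be illustrated with a short Python sketch. It computes the standard LD measure r² from haplotype and allele frequencies and applies the common rule of thumb that a sample of size n genotyped at a tag SNP behaves roughly like a sample of size n·r² at the causal variant. The function name ld_r2 and the haplotype frequencies are hypothetical and chosen only for illustration; they are not estimates from the CARe or ICBP data.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: marginal frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b  # LD coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical haplotype frequencies, for illustration only.
r2_strong = ld_r2(p_ab=0.18, p_a=0.20, p_b=0.20)  # tag tracks the causal allele closely
r2_weak = ld_r2(p_ab=0.08, p_a=0.20, p_b=0.30)    # same tag, weak LD in another population

# Rule of thumb: n individuals typed at the tag are worth about n * r2
# individuals typed at the causal variant itself.
n = 69092
print(f"strong LD: r2={r2_strong:.3f}, effective n={n * r2_strong:.0f}")
print(f"weak LD:   r2={r2_weak:.3f}, effective n={n * r2_weak:.0f}")
```

Under these made-up frequencies, even a very large sample loses most of its effective size when the tag SNP is only weakly correlated with the causal variant in the replication population.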

Table 4-7 Replication results from different studies. The SNP marked with * was chosen from the CARDIA cohort results; the other 6 SNPs were chosen from the top SNPs of the meta-analysis results.

                                    CARe               Maywood            Nigeria            Rotimi's           ICBP               GENOA
Trait  SNP          CHR  BP         Beta     P         Beta     P         Beta     P         Beta     P         Beta     P         Beta     P
SBP    rs1990151    1    44186879   3.4831   7.39E-07  1.858    3.70E-01  0.878    6.36E-01  1.222    5.26E-01  -0.147   3.60E-01  -0.032   9.88E-01
SBP    rs13413144   2    153183499  4.282    5.55E-07  -0.847   7.40E-01  -5.881   2.70E-01  NA       NA        0.025    8.44E-01  0.254    9.24E-01
SBP    rs592582     2    157481632  1.660    4.46E-07  1.119    2.86E-01  -0.926   3.70E-01  -1.096   2.79E-01  -0.198   1.67E-01  0.387    7.30E-01
DBP    rs7709572    5    90592789   1.095    7.41E-08  -0.301   6.82E-01  -0.414   4.65E-01  -0.935   1.58E-01  -0.016   8.64E-01  1.711    1.68E-02
DBP    rs10474346   5    90599895   1.104    3.56E-08  -0.496   4.95E-01  -0.344   5.40E-01  NA       NA        -0.016   8.69E-01  1.528    2.81E-02
DBP    rs17610514*  11   55652374   10.490   3.95E-09  1.380    5.93E-01  1.869    4.33E-01  0.716    7.41E-01  0.086    4.40E-01  -0.613   7.84E-01
SBP    rs2258119    21   18089350   1.8406   4.69E-08  -1.12    3.25E-01  -0.355   7.30E-01  -0.185   8.61E-01  -0.008   9.45E-01  1.544    1.72E-01
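
The two observations drawn from Table 4-7 (no significant replication after multiple comparison correction, and frequent reversals of effect direction) can be checked with a short Python sketch. The correction method is not named in the text, so a Bonferroni correction over all non-missing replication tests is assumed here. The sign comparison further assumes that every study reports effects for the same coded allele. The names TABLE_4_7 and summarize are hypothetical.

```python
# CARe beta, then replication (beta, p) pairs transcribed from Table 4-7 in the
# order Maywood, Nigeria, Rotimi's, ICBP, GENOA; None marks an NA entry.
TABLE_4_7 = {
    "rs1990151": (3.4831, [(1.858, 0.370), (0.878, 0.636), (1.222, 0.526),
                           (-0.147, 0.360), (-0.032, 0.988)]),
    "rs13413144": (4.282, [(-0.847, 0.740), (-5.881, 0.270), None,
                           (0.025, 0.844), (0.254, 0.924)]),
    "rs592582": (1.660, [(1.119, 0.286), (-0.926, 0.370), (-1.096, 0.279),
                         (-0.198, 0.167), (0.387, 0.730)]),
    "rs7709572": (1.095, [(-0.301, 0.682), (-0.414, 0.465), (-0.935, 0.158),
                          (-0.016, 0.864), (1.711, 0.0168)]),
    "rs10474346": (1.104, [(-0.496, 0.495), (-0.344, 0.540), None,
                           (-0.016, 0.869), (1.528, 0.0281)]),
    "rs17610514": (10.490, [(1.380, 0.593), (1.869, 0.433), (0.716, 0.741),
                            (0.086, 0.440), (-0.613, 0.784)]),
    "rs2258119": (1.8406, [(-1.12, 0.325), (-0.355, 0.730), (-0.185, 0.861),
                           (-0.008, 0.945), (1.544, 0.172)]),
}


def summarize(table, alpha=0.05):
    """Count Bonferroni-significant and opposite-sign replication results."""
    results = [(care_beta, rep)
               for care_beta, reps in table.values()
               for rep in reps if rep is not None]
    n_tests = len(results)
    threshold = alpha / n_tests  # Bonferroni threshold (an assumption)
    n_significant = sum(p < threshold for _, (_, p) in results)
    n_opposite = sum(beta * care_beta < 0 for care_beta, (beta, _) in results)
    return n_tests, threshold, n_significant, n_opposite


n_tests, threshold, n_significant, n_opposite = summarize(TABLE_4_7)
print(f"{n_tests} tests, Bonferroni threshold {threshold:.5f}")
print(f"significant: {n_significant}, opposite sign: {n_opposite}")
```

The smallest replication p-value in the table, 1.68E-02, also exceeds the less stringent threshold of 0.05/7 obtained by correcting only for the number of SNPs, so the conclusion does not depend on which of the two counts is used.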

Table 4-8 Power calculation of replication studies. Size is the sample size of each replication study.

                                                      Maywood         Nigeria         Rotimi          ICBP            GENOA
Trait  SNP          CHR  BP         Beta    P         Size   Power    Size   Power    Size   Power    Size   Power    Size   Power
DBP    rs10474346   5    90599895   1.104   3.56E-08  743    43.02%   1188   61.63%   1016   55.02%   69092  100%     845    47.71%
SBP    rs2258119    21   18089350   1.8406  4.69E-08  743    40.07%   1188   57.93%   1016   51.50%   69092  100%     845    44.51%
DBP    rs7709572    5    90592789   1.095   7.41E-08  743    40.32%   1188   58.26%   1016   51.82%   69092  100%     845    44.79%
SBP    rs592582     2    157481632  1.660   4.46E-07  743    36.61%   1188   53.42%   1016   47.28%   69092  100%     845    40.72%
SBP    rs13413144   2    153183499  4.282   5.55E-07  743    47.43%   1188   66.89%   1016   60.13%   69092  100%     845    52.45%
SBP    rs1990151    1    44186879   3.4831  7.39E-07  743    34.06%   1188   49.95%   1016   44.10%   69092  100%     845    37.91%

4.4.6 Future direction

In summary, we reported a meta-analysis of GWAS for blood pressure in 5 African American cohorts and identified SNPs reaching genome-wide significance. The findings provide a set of candidate genes to be evaluated in depth in future studies. Because our replication studies did not confirm these findings, further replication is needed, especially in independent African American samples of thousands to tens of thousands of individuals. Fine mapping and haplotype analysis of the candidate genes will also be essential in future analyses.

Our analysis is based on more than 7000 individuals and identified only two SNPs at the genome-wide significance level. Given the limited success of previous GWA studies of hypertension, it is reasonable to think that the underlying genetic architecture of blood pressure is different from, and much more complicated than, that of other common diseases for which GWAS have achieved substantial success. Therefore, other strategies, such as the methods for detecting rare variants proposed in this dissertation, may help in the search for risk loci for blood pressure.

BIBLIOGRAPHY

2003. The International HapMap Project. Nature 426(6968):789-96.

2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science

306(5696):636-40.

Adeyemo A, Gerry N, Chen G, Herbert A, Doumatey A, Huang H, Zhou J, Lashley K,

Chen Y, Christman M, et al. 2009. A genome-wide association study of

hypertension and blood pressure in African Americans. PLoS Genet

5(7):e1000564.

Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg

MS, Taylor KD, Barmada MM, et al. 2008. Genome-wide association defines more

than 30 distinct susceptibility loci for Crohn's disease. Nat Genet 40(8):955-62.

Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH,

Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. 2007. Identification and

analysis of functional elements in 1% of the human genome by the ENCODE

pilot project. Nature 447(7146):799-816.

Bodmer W, Bonilla C. 2008. Common and rare variants in multifactorial susceptibility to

common diseases. Nat Genet 40(6):695-701.

Brinza D, Zelikovsky A. 2006. 2SNP: scalable phasing based on 2-SNP haplotypes.

Bioinformatics 22(3):371-3.

Browning SR, Browning BL. 2007. Rapid and accurate haplotype phasing and missing-

data inference for whole-genome association studies by use of localized haplotype

clustering. Am J Hum Genet 81(5):1084-97.

Chakravarti A. 1999. Population genetics--making sense out of sequence. Nat Genet 21(1

Suppl):56-60.

Cohen JC, Boerwinkle E, Mosley TH, Jr., Hobbs HH. 2006. Sequence variations in

PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med

354(12):1264-72.

Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. 2004.

Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science

305(5685):869-72.

Collins R, Winkleby MA. 2002. African American women and men at high and low risk

for hypertension: a signal detection analysis of NHANES III, 1988-1994. Prev

Med 35(4):303-12.

Consortium WTCC. 2007. Genome-wide association study of 14,000 cases of seven

common diseases and 3,000 shared controls. Nature 447(7145):661-78.

de Bakker PI, Burtt NP, Graham RR, Guiducci C, Yelensky R, Drake JA, Bersaglieri T,

Penney KL, Butler J, Young S, et al. 2006. Transferability of tag SNPs in genetic

association studies in multiple populations. Nat Genet 38(11):1298-303.

Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. 2010. Rare variants create

synthetic genome-wide associations. PLoS Biol 8(1):e1000294.

Dodic M, Moritz K, Koukoulas I, Wintour EM. 2002. Programmed hypertension: kidney,

brain or both? Trends Endocrinol Metab 13(9):403-8.

Duan SZ, Usher MG, Mortensen RM. 2009. PPARs: the vasculature, inflammation and

hypertension. Curr Opin Nephrol Hypertens 18(2):128-33.

Dudbridge F, Koeleman BP. 2003. Rank truncated product of P-values, with application

to genomewide association scans. Genet Epidemiol 25(4):360-6.

Easton DF, Pooley KA, Dunning AM, Pharoah PD, Thompson D, Ballinger DG,

Struewing JP, Morrison J, Field H, Luben R, et al. 2007. Genome-wide association

study identifies novel breast cancer susceptibility loci. Nature 447(7148):1087-93.

Edgington ES. 1972. An additive model for combining probability values from

independent experiments. J Psychol 80:351-63.

Fisher RA. 1932. Statistical methods for research workers. New York: John Wiley &

Sons.

Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry

JR, Elliott KS, Lango H, Rayner NW, et al. 2007. A common variant in the FTO

gene is associated with body mass index and predisposes to childhood and adult

obesity. Science 316(5826):889-94.

Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW,

Boudreau A, Hardenbol P, Leal SM, et al. 2007. A second generation human

haplotype map of over 3.1 million SNPs. Nature 449(7164):851-61.

Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod

S, Olshen AB, Gregersen P, et al. 2008. Genome-wide association study provides

evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A

105(11):4340-5.

Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV,

Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S, et al. 2008.

Many sequence variants affecting diversity of adult human height. Nat Genet

40(5):609-15.

Gudmundsson J, Sulem P, Manolescu A, Amundadottir LT, Gudbjartsson D, Helgason A,

Rafnar T, Bergthorsson JT, Agnarsson BA, Baker A, et al. 2007. Genome-wide

association study identifies a second prostate cancer susceptibility variant at 8q24.

Nat Genet 39(5):631-7.

Halperin E, Eskin E. 2004. Haplotype reconstruction from genotype data using Imperfect

Phylogeny. Bioinformatics 20(12):1842-9.

Hao K, Chudin E, McElwee J, Schadt EE. 2009. Accuracy of genome-wide imputation of

untyped markers and impacts on statistical power for association studies. BMC

Genet 10:27.

Henrion U, Strutz-Seebohm N, Duszenko M, Lang F, Seebohm G. 2009. Long QT

syndrome-associated mutations in the voltage sensor of I(Ks) channels. Cell

Physiol Biochem 24(1-2):11-6.

Herbert A, Gerry NP, McQueen MB, Heid IM, Pfeufer A, Illig T, Wichmann HE,

Meitinger T, Hunter D, Hu FB, et al. 2006. A common genetic variant is associated

with adult and childhood obesity. Science 312(5771):279-83.

Hertz RP, Unger AN, Cornell JA, Saunders E. 2005. Racial disparities in hypertension

prevalence, awareness, and management. Arch Intern Med 165(18):2098-104.

Hilgert N. 2009. Novel human pathological mutations. Gene symbol: GPR98. Disease:

Usher syndrome 2C. Hum Genet 125(3):342.

Hoh J, Wille A, Ott J. 2001. Trimming, weighting, and grouping SNPs in human case-

control association studies. Genome Res 11(12):2115-9.

Howard G, Prineas R, Moy C, Cushman M, Kellum M, Temple E, Graham A, Howard V.

2006. Racial and geographic differences in awareness, treatment, and control of

hypertension: the REasons for Geographic And Racial Differences in Stroke study.

Stroke 37(5):1171-8.

Hummler E, Vallon V. 2005. Lessons from mouse mutants of epithelial sodium channel

and its regulatory . J Am Soc Nephrol 16(11):3160-6.

Iles MM. 2008. What can genome-wide association studies tell us about the genetics of

common disease? PLoS Genet 4(2):e33.

Ioannidis JP, Trikalinos TA, Khoury MJ. 2006. Implications of small effect sizes of

individual genetic variants on the design and interpretation of genetic association

studies of complex diseases. Am J Epidemiol 164(7):609-14.

Iyengar SK, Elston RC. 2007. The genetic basis of complex traits: rare variants or

"common gene, common disease"? Methods Mol Biol 376:71-84.

Ji W, Foo JN, O'Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW,

Levy D, Lifton RP. 2008a. Rare independent mutations in renal salt handling

genes contribute to blood pressure variation. Nat Genet 40(5):592-9.

Ji X, Takahashi N, Saito S, Ishihara R, Maeno N, Inada T, Ozaki N. 2008b. Relationship

between three serotonin receptor subtypes (HTR3A, HTR2A and HTR4) and

treatment-resistant schizophrenia in the Japanese population. Neurosci Lett

435(2):95-8.

Kathiresan S, Melander O, Guiducci C, Surti A, Burtt NP, Rieder MJ, Cooper GM, Roos

C, Voight BF, Havulinna AS, et al. 2008. Six new loci associated with blood low-

density lipoprotein cholesterol, high-density lipoprotein cholesterol or

triglycerides in humans. Nat Genet 40(2):189-97.

Keating BJ, Tischfield S, Murray SS, Bhangale T, Price TS, Glessner JT, Galver L,

Barrett JC, Grant SF, Farlow DN, et al. 2008. Concept, design and implementation

of a cardiovascular gene-centric 50 k SNP array for large-scale genomic

association studies. PLoS One 3(10):e3583.

Lander ES. 1996. The new genomics: global views of biology. Science 274(5287):536-9.

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K,

Doyle M, FitzHugh W, et al. 2001. Initial sequencing and analysis of the human

genome. Nature 409(6822):860-921.

Lehrmann H, Thomas J, Kim SJ, Jacobi C, Leipziger J. 2002. Luminal P2Y2 receptor-

mediated inhibition of Na+ absorption in isolated perfused mouse CCD. J Am Soc

Nephrol 13(1):10-8.

Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S,

Voight BF, Butler JL, Guiducci C, et al. 2008. Identification of ten loci associated

with height highlights new biological pathways in human growth. Nat Genet

40(5):584-91.

Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison

AC, Johnson AD, Aspelund T, et al. 2009. Genome-wide association study of

blood pressure and hypertension. Nat Genet.

Li B, Leal SM. 2008. Methods for detecting associations with rare variants for common

diseases: application to analysis of sequence data. Am J Hum Genet 83(3):311-21.

Li J, Ji L. 2005. Adjusting multiple testing in multilocus analyses using the eigenvalues

of a correlation matrix. Heredity 95(3):221-7.

Li M, Li C, Guan W. 2008. Evaluation of coverage variation of SNP chips for genome-

wide association studies. Eur J Hum Genet 16(5):635-43.

Li Y, Willer C, Sanna S, Abecasis G. 2009. Genotype imputation. Annu Rev Genomics

Hum Genet 10:387-406.

Liu PY, Zhang YY, Lu Y, Long JR, Shen H, Zhao LJ, Xu FH, Xiao P, Xiong DH, Liu

YJ, et al. 2005. A survey of haplotype variants at several disease candidate genes:

the importance of rare variants for complex diseases. J Med Genet 42(3):221-7.

Madsen BE, Browning SR. 2009. A groupwise association test for rare mutations using a

weighted sum statistic. PLoS Genet 5(2):e1000384.

Marchini J, Cardon LR, Phillips MS, Donnelly P. 2004. The effects of human population

structure on large genetic association studies. Nat Genet 36(5):512-7.

McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn

JN. 2008. Genome-wide association studies for complex traits: consensus,

uncertainty and challenges. Nat Rev Genet 9(5):356-69.

Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, Najjar SS, Zhao

JH, Heath SC, Eyheramendy S, et al. 2009. Genome-wide association study

identifies eight loci associated with blood pressure. Nat Genet.

Parkes M, Barrett JC, Prescott NJ, Tremelling M, Anderson CA, Fisher SA, Roberts RG,

Nimmo ER, Cummings FR, Soars D, et al. 2007. Sequence variants in the

autophagy gene IRGM and multiple other replicating loci contribute to Crohn's

disease susceptibility. Nat Genet 39(7):830-2.

Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB,

Deka R, Ferrell RE, et al. 1998. Estimating African American admixture

proportions by use of population-specific alleles. Am J Hum Genet 63(6):1839-51.

Pritchard JK. 2001. Are rare variants responsible for susceptibility to complex diseases?

Am J Hum Genet 69(1):124-37.

Pritchard JK, Cox NJ. 2002. The allelic architecture of human disease genes: common

disease-common variant...or not? Hum Mol Genet 11(20):2417-23.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P,

de Bakker PI, Daly MJ, et al. 2007. PLINK: a tool set for whole-genome

association and population-based linkage analyses. Am J Hum Genet 81(3):559-

75.

Risch NJ. 2000. Searching for genetic determinants in the new millennium. Nature

405(6788):847-56.

Romeo S, Pennacchio LA, Fu Y, Boerwinkle E, Tybjaerg-Hansen A, Hobbs HH, Cohen

JC. 2007. Population-based resequencing of ANGPTL4 uncovers variations that

reduce triglycerides and increase HDL. Nat Genet 39(4):513-6.

Rosamond W, Flegal K, Friday G, Furie K, Go A, Greenlund K, Haase N, Ho M, Howard

V, Kissela B, et al. 2007. Heart disease and stroke statistics--2007 update: a report

from the American Heart Association Statistics Committee and Stroke Statistics

Subcommittee. Circulation 115(5):e69-171.

Samani NJ. 2003. Genome scans for hypertension and blood pressure regulation. Am J

Hypertens 16(2):167-71.

Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, Mayer B, Dixon RJ,

Meitinger T, Braund P, Wichmann HE, et al. 2007. Genomewide association

analysis of coronary artery disease. N Engl J Med 357(5):443-53.

Scheet P, Stephens M. 2006. A fast and flexible statistical model for large-scale

population genotype data: applications to inferring missing genotypes and

haplotypic phase. Am J Hum Genet 78(4):629-44.

Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. 1949. The

American soldier, vol 1. Adjustment during army life. Princeton: Princeton

University Press.

Stratton MR, Rahman N. 2008. The emerging landscape of breast cancer susceptibility.

Nat Genet 40(1):17-22.

Tao T, Lan J, Lukacs GL, Hache RJ, Kaplan F. 2006. Importin 13 regulates nuclear

import of the glucocorticoid receptor in airway epithelial cells. Am J Respir Cell

Mol Biol 35(6):668-80.

Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N,

Welch R, Hutchinson A, et al. 2008. Multiple loci identified in a genome-wide

association study of prostate cancer. Nat Genet 40(3):310-5.

Thomson W, Barton A, Ke X, Eyre S, Hinks A, Bowes J, Donn R, Symmons D, Hider S,

Bruce IN, et al. 2007. Rheumatoid arthritis association at 6q23. Nat Genet

39(12):1431-3.

Tobin MD, Sheehan NA, Scurrah KJ, Burton PR. 2005. Adjusting for treatment effects in

studies of quantitative traits: antihypertensive therapy and systolic blood pressure.

Stat Med 24(19):2911-35.

Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes K, Plagnol V, Bailey R, Nejentsev

S, Field SF, Payne F, et al. 2007. Robust associations of four new chromosome

regions from genome-wide analyses of type 1 diabetes. Nat Genet 39(7):857-64.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M,

Evans CA, Holt RA, et al. 2001. The sequence of the human genome. Science

291(5507):1304-51.

Walsh T, King MC. 2007. Ten genes for inherited breast cancer. Cancer Cell 11(2):103-5.

Wang WY, Barratt BJ, Clayton DG, Todd JA. 2005. Genome-wide association studies:

theoretical and practical concerns. Nat Rev Genet 6(2):109-18.

Wang Z, Nakayama T, Sato N, Yamaguchi M, Izumi Y, Kasamaki Y, Ohta M, Soma M,

Aoi N, Ozawa Y, et al. 2009a. Purinergic receptor P2Y, G-protein coupled, 2

(P2RY2) gene is associated with cerebral infarction in Japanese subjects.

Hypertens Res 32(11):989-96.

Wang ZX, Nakayama T, Sato N, Izumi Y, Kasamaki Y, Ohta M, Soma M, Aoi N,

Matsumoto K, Ozawa Y, et al. 2009b. Association of the purinergic receptor P2Y,

G-protein coupled, 2 (P2RY2) gene with myocardial infarction in Japanese men.

Circ J 73(12):2322-9.

Ward R. 1995. Familial aggregation and genetic epidemiology of blood pressure. Laragh

JH BB, eds., editor. New York: Raven Press.

Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM,

Perry JR, Stevens S, Hall AS, et al. 2008. Genome-wide association analysis

identifies 20 loci that influence adult height. Nat Genet 40(5):575-83.

Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC,

Timpson NJ, Najjar SS, Stringham HM, et al. 2008. Newly identified loci that

influence lipid concentrations and risk of coronary artery disease. Nat Genet

40(2):161-9.

Xiong M, Zhao J, Boerwinkle E. 2002. Generalized T2 test for genome association

studies. Am J Hum Genet 70(5):1257-68.

Zaitlen N, Kang HM, Eskin E, Halperin E. 2007. Leveraging the HapMap correlation

structure in association studies. Am J Hum Genet 80(4):683-91.

Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. 2002. Truncated product method

for combining P-values. Genet Epidemiol 22(2):170-85.

Zeggini E, Rayner W, Morris AP, Hattersley AT, Walker M, Hitman GA, Deloukas P,

Cardon LR, McCarthy MI. 2005. An evaluation of HapMap sample size and

tagging SNP performance in large-scale empirical and simulated data sets. Nat

Genet 37(12):1320-2.

Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, Lango H, Timpson NJ,

Perry JR, Rayner NW, Freathy RM, et al. 2007. Replication of genome-wide

association signals in UK samples reveals risk loci for type 2 diabetes. Science

316(5829):1336-41.

Zhu X, Bouzekri N, Southam L, Cooper RS, Adeyemo A, McKenzie CA, Luke A, Chen

G, Elston RC, Ward R. 2001. Linkage and association analysis of angiotensin I-

converting enzyme (ACE)-gene polymorphisms with ACE concentration and

blood pressure. Am J Hum Genet 68(5):1139-48.

Zhu X, Cooper RS. 2007. Admixture mapping provides evidence of association of the

VNN1 gene with hypertension. PLoS One 2(11):e1244.

Zhu X, Fejerman L, Luke A, Adeyemo A, Cooper RS. 2005a. Haplotypes produced from

rare variants in the promoter and coding regions of angiotensinogen contribute to

variation in angiotensinogen levels. Hum Mol Genet 14(5):639-43.

Zhu X, Feng T, Li Y, Lu Q, Elston RC. 2010. Detecting rare variants for complex traits using

family and unrelated data. Genet Epidemiol 34(2):171-87.

Zhu X, Li S, Cooper RS, Elston RC. 2008. A unified association analysis approach for

family and unrelated samples correcting for stratification. Am J Hum Genet

82(2):352-65.

Zhu X, Luke A, Cooper RS, Quertermous T, Hanis C, Mosley T, Gu CC, Tang H, Rao

DC, Risch N, et al. 2005b. Admixture mapping for hypertension loci with genome-

scan markers. Nat Genet 37(2):177-81.
