The Role of Haplotypes in Candidate Gene Studies

Genetic Epidemiology 27: 321–333 (2004) The Role of Haplotypes in Candidate Gene Studies Andrew G. Clarkn Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York Human geneticists working on systems for which it is possible to make a strong case for a set of candidate genes face the problem of whether it is necessary to consider the variation in those genes as phased haplotypes, or whether the one-SNP- at-a-time approach might perform as well. There are three reasons why the phased haplotype route should be an improvement. First, the protein products of the candidate genes occur in polypeptide chains whose folding and other properties may depend on particular combinations of amino acids. Second, population genetic principles show us that variation in populations is inherently structured into haplotypes. Third, the statistical power of association tests with phased data is likely to be improved because of the reduction in dimension. However, in reality it takes a great deal of extra work to obtain valid haplotype phase information, and inferred phase information may simply compound the errors. In addition, if the causal connection between SNPs and a phenotype is truly driven by just a single SNP, then the haplotype- based approach may perform worse than the one-SNP-at-a-time approach. Here we examine some of the factors that affect haplotype patterns in genes, how haplotypes may be inferred, and how haplotypes have been useful in the context of testing association between candidate genes and complex traits. Genet. Epidemiol. & 2004 Wiley-Liss, Inc. Key words: haplotype inference; haplotype association testing; candidate genes; linkage equilibrium Grant sponsor: NIH; Grant numbers: GM65509, HL072904. n Correspondence to: Andrew G. Clark, Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853. E-mail: [email protected] Received 23 June 2004; Accepted 29 June 2004 Published online 14 September 2004 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/gepi.20025 WHY STUDY HAPLOTYPES? is found in a population is the result of the past transmission of that variation through the popula- The primary focus of this review is on the tion, and this historical past produces a structure problem of identifying DNA sequence variation to the SNP variation that can be of considerable that is segregating in a population and has a value in trying to solve the primary goal of causal connection to variation in risk of complex finding variants associated with disease risk. The disease. One could perform tests of this associa- three primary reasons for considering the haplo- tion without consideration of the fact that the type organization of variation discussed here are: SNPs are not independent of one another, but the 1) that the unit of biological function, the protein- determination of critical values must be made in coding gene, produces proteins whose sequences the context of linkage disequilibrium among correspond to maternal and paternal haplotypes, SNPs. Fortunately, permutation tests accomplish 2) that variation in a population is in fact this reasonably well [Churchill and Doerge, 1994; structured into haplotypes that are likely to be Doerge and Churchill, 1996]. For each SNP, the transmitted as a unit, and 3) that regardless of the distribution of phenotypes would be compared population genetic reasons, haplotypes serve to among the 2 or 3 genotypes observed in the reduce the dimensionality of the problem of sample. The statistical inference would need to testing association, and so they may increase the take into account the multiplicity of tests being power of those tests. Let’s look at these three ideas performed, and the biological interpretation in more detail. would need to account for the fact that some SNPs are in linkage disequilibrium with one HAPLOTYPES DEFINE FUNCTIONAL UNITS another. But the basic idea of testing association OF GENES with individual SNPs is often where the analysis For each protein-coding gene, regardless of the begins. In reality, the DNA sequence variation that number of heterozygous amino-acid sites an & 2004 Wiley-Liss, Inc. 322 Clark individual has, he or she will only produce at HAPLOTYPES REDUCE THE DIMENSION OF most two polypeptide chains, one corresponding ASSOCIATION TESTS, AND MAY GAIN to the maternal and one to the paternal haplo- STATISTICAL POWER type (ignoring for now the complexities of This is easiest to see by example. Suppose a alternative splicing, RNA editing, posttransla- gene has 8 SNPs, and you want to test for tional modification, etc.). The point is that the associations in a way that allows any and all of folding kinetics, stability, or any other physical these SNPs to interact in their effects on disease properties of the protein may depend on inter- causation. You may start by testing each of the 8 actions between pairs or higher-order combina- SNPs, asking whether the incidence of cases and tions of amino-acid sites. If these interactions are controls (or measures of a physiological risk important, then haplotypes are of direct biologi- factor) differs among the 3 genotypes for each cal relevance. Relatively few cases of directly SNP. You may then test all pairs of SNPs, and for showing functional interaction exist, but there are each pair you construct a 3 Â 3 table of genotypes convincing ways to demonstrate such interac- {(AABB, AABb, Aabb), (AaBB, AaBb, Aabb), (aaBB, tions from population-level data. ApoE is one aaBb, aabb)}, and within each cell of this table you example of a protein whose function is influ- assemble the observed phenotypic distribution (or enced by a pair of polymorphic amino acids. The counts of cases and controls). Any of a number of major alleles E2, E3, and E4 differ at two amino- standard statistical tests would let you ask acid residues, but the fourth haplotype is missing whether the two SNPs impact the phenotype from the population [Fullerton et al., 2000]. The and whether they do so independent of one two-site haplotypes that do exist have functional another or in a synergistic manner. Note that the differences that are best described by the E2, E3, true genotypic complexity is greater than this, and E4 allelic classes rather than by the indivi- because the test just described pools the cis- and dual SNPs in the gene. Similarly, many trans- trans-phase doubly heterozygous genotypes into membrane xenobiotic transporters exhibit the AaBb class. In other words, it ignores the structural interactions between two or more linkage phase. One can systematically and ex- residues [Leabman et al., 2003]. haustively test the null hypothesis of equal phenotypic means for all partitions of the genotype classes, and this approach, known as the GENETIC VARIATION IN POPULATIONS IS combinatorial partitioning method [Nelson et al., INTRINSICALLY ORGANIZED INTO 2001], has an appeal for its exhaustiveness. This HAPLOTYPES procedure can be continued: with three SNPs Each new mutation arises on a particular there are now 27 genotypic classes, and so on. By haplotype background. The haplotype bearing the time you consider 8 SNPs, there are 38 possible the novel mutation may rise to high frequency genotype classes, and the test has a very large by random genetic drift, and it may subse- number of degrees of freedom. quently be cleaved into segments by recombina- By the time one considers 4 or more SNPs at a tion. The combination of mutation, drift, time, most of the genotypic classes will have an selection, migration/population mixing, and observed count of zero, and so testing the recombination results in genetic variation that significance of interactions is difficult. Rather has a strong segmentwise haplotype structure to than thinking of this problem as a contingency it. Population geneticists have a good handle on table, it makes more sense to realize that the SNPs the determinants of linkage disequilibrium, but do not arise independently, but rather there is an predictions about haplotypes go beyond infer- intrinsic dependency of SNPs one with another ences of pairwise LD. Of course, any factor that due to the population history of their entry into erodes pairwise LD will also break up haplotype the population. By explicitly considering the structuring, but the two features are not haplotype structure of the SNPs, and how they inseparable. Strong haplotype structure arises arose in the population through a genealogical from multiple sites having shared ancestry, and process (a gene tree), one no longer needs to this is best imagined by considering the topol- consider this astronomical number of potential ogy of the gene genealogy. We will return to genotypes. In this way, using haplotype informa- this, but the point here is that haplotypes arise tion may collapse the dimensionality of the as an intrinsic attribute of population genetic statistical test, and thereby gain statistical power variation. over tests that do not reduce dimensions first Role of Haplotypes in Candidate Gene Studies 323 [Templeton et al., 1987]. In addition, Morris and organized in blocks where information about Kaplan [2002] showed that haplotype-based tests any SNP in the block applied, in a statistical can have greater power than unphased tests of correlation sense, to the whole block. The haplo- association in the case when the disease locus has type block idea has also stimulated theoretical multiple disease-causing alleles. work on the inference of haplotype phase, identification of haplotype blocks, identifying

The Role of Haplotypes in Candidate Gene Studies

Association Between Haplotypes and Specific Mutations in Swiss Cystic Fibrosis Families

Genome-Wide Association Identifies Candidate Genes That Influence The

Linkage Disequilibrium

Expression Quantitative Trait Loci and the Phenogen

Haplotype Tagging Reveals Parallel Formation of Hybrid Races in Two Butterfly Species

Haplotype Inference from Pedigree Data and Population Data

Candidate-Gene Criteria for Clinical Reporting: Diagnostic Exome Sequencing Identifies Altered Candidate Genes Among 8% of Patients with Undiagnosed Diseases

Statistical Genetics – Part I

Cancer Genomics Terminology

Transferability of Tag Snps to Capture Common Genetic Variation in DNA Repair Genes Across Multiple Populations Paul I.W. De

Glossary/Index

Efficient Design and Analysis of Genome-Wide