Journal of Human Hypertension (2000) 14, 361–367  2000 Macmillan Publishers Ltd All rights reserved 0950-9240/00 $15.00 www.nature.com/jhh REVIEW ARTICLE Genetic association studies in complex diseases

B Keavney Department of Cardiovascular Medicine, University of Oxford, John Radcliffe Hospital, Oxford OX3 7BN, UK

Genetic association studies are the most frequent type possible absence of between of study performed in the investigation of the genetic marker and disease locus, and testing of multiple basis of complex cardiovascular conditions. While rela- hypotheses are discussed and the likely place for tively easy to perform, and having previously correctly association studies in a strategy involving studies of a identified genetic effects subsequently proven to be due variety of designs geared to finding genetic determi- to , interpretation of the results of these nants of disease susceptibility is addressed. studies is not always straightforward. Issues such as Journal of Human Hypertension (2000) 14, 361–367 population stratification, data-driven subgroup analysis,

Keywords: genetics; complex diseases; association studies

Introduction Mendelian diseases, which arise from the inter- action of a number of genes and environmental The search for genes predisposing to the common influences, has been considerably slower. As the diseases which are responsible for the bulk of popu- mode of inheritance is unknown and a number of lation morbidity and mortality is perhaps the most genes with segregating alleles in each generation is intensively studied and heavily funded activity in likely to be involved, studies of small numbers of late 20th century biomedicine. The human genome, extended pedigrees containing large numbers of comprising some 3000000000 base pairs, has been affected individuals would not be expected to yield mapped to a resolution of less than 1000000 base the results that have been obtained in simpler dis- pairs using polymorphic microsatellite markers, and ease.4 It is possible that the genes which cause cer- a collaborative international venture to extend the tain rare Mendelian forms of cardiovascular diseases number of markers available to over 300000 by the when seriously disrupted (examples in the case of incorporation of single nucleotide polymorphisms, hypertension being Liddle’s syndrome, or glucocort- which occur on average every 1000 base pairs, is icoid remediable hyperaldosteronism) may when well advanced. The entire human genomic sequence influenced by less extreme variation, possibly in is expected to be available by 2003, although a com- regulatory sequences or sequences influencing mess- parably complete state of knowledge about genetic age stability and thus levels of protein expression, variation at the anticipated 100000 human genes or contribute to the population burden of disease, but insights into gene function related to this variation 1 so far support for this attractive hypothesis is scanty. is likely to lag far behind this schedule. Thus, other study designs must be used to attempt Molecular genetic methods have had spectacular to find genes causing complex diseases in human success in contributing to the understanding of a populations. A common design is one in which number of diseases that are determined by the action affected sibling pairs are enrolled and the degree of of a single gene; occasionally, the insights gained genetic similarity between these individuals at a from these investigations have informed the investi- number of genetic markers, either focused at candi- gation and therapy of the ‘common’ variety of dis- date genes or randomly distributed throughout the ease: a good example would be the path that led genome, is assessed: should the degree of similarity from the identification of the LDL receptor as the deviate significantly from that expected under Men- culprit for familial hypercholesterolaemia to the sta- delian segregation, greater than expected similarity tins, developed largely as a result of the better being observed at a particular marker or markers, understanding of the biology of lipid metabolism this would constitute evidence in favour of genetic afforded by these genetic studies, and of proven linkage of the disease to the marker involved. Such therapeutic efficacy in any patient with coronary 2,3 affected sibling pair studies are in general difficult to heart disease. However, the progress in non- carry out, requiring: considerable access to potential probands (usually via extensive collaboration), as Correspondence: Bernard Keavney, Department of Cardiovascular suitable families are in general few and the number Medicine, University of Oxford, John Radcliffe Hospital, Oxford of families required to have power to detect the OX3 7BN, UK probably modest effects expected is high; access to Genetic association studies in complex diseases B Keavney 362 high-throughput genotyping technology and expens- ive reagents; and statistical and computer expertise to handle and interpret the large amount of data (in excess of 500000 data points in a typical genome screening experiment) generated. The commonest study design is the association study, in large measure because it is a simple design which enables the relatively rapid collection of a study population that can be tested without any particularly specialised knowledge of statistical gen- etics. In its most usual form, this closely resembles a case-control study of the type seen in the clinical literature; genotypes in cases and controls at a gen- etic marker of interest are compared and difference in the allele frequencies between cases and controls interpreted as evidence in favour of association of one or other allele at the marker of interest with dis- ease. This type of study can also be carried out in family groups, as outlined below. Such studies are now commonplace in many clinical journals of high quality. However, there are important additional issues to be borne in mind when interpreting a genetic associ- ation study as distinct from a standard clinical Figure 1 Recombination between two loci a/A and d/D. Geno- investigation. The aim of this brief review is to state types have been determined at two loci, and it is required to esti- mate their genetic distance apart. Either intact paternal chromo- these issues in non-technical language for the bene- somal segments between the two typed markers (a-d) and (A-D) fit of clinical investigators, and to suggest a role for have been inherited by the non-shaded individuals or there have association studies, complementary to other study been double recombination events in both of those individuals. designs, in the future of complex disease genetics. The shaded individual has inherited a recombinant chromosome; Central to the argument will be a clear understand- somewhere between the two loci a/A and d/D an odd number of crossovers (most likely to be 1) has occurred. ing of the difference between genetic linkage and association or linkage disequilibrium, which is explored below. of the parents (which may, of course, include the same alleles). These problems can be overcome with Genetic linkage more statistically sophisticated approaches, but will not be further dealt with here. Humans have 22 autosomes and two sex chromo- In Figure 2, an autosomal dominant disease is somes. At meiosis, random ‘shuffling’ of the parental present in the family shown. Again we assume that autosomes and sex chromosomes occurs, which is a inheritance of the alleles derived from the father can good way of preserving biological diversity. be determined unequivocally. There is no recombi- Additional diversity is generated by the occurrence nation in this family between allele a and disease, of crossovers between homologous chromosomes. therefore, there is some support for the genetic The probability of a crossover between two loci model that the disease locus is linked to the marker depends on the distance along a chromosome that locus a/A. In this family, the disease allele d is on the two loci are apart. For loci on different chromo- the same chromosome as marker allele a. somes, clearly the probability of co-inheritance of a particular allele at locus 2 given the allele inherited Association at locus 1 is 0.5. In practice, the chance of a cross- over (recombination) between two loci on the same In genetics terminology, association means simply chromosome is about 1% per 1000000 base pairs that the occurrence of alleles at one genetic locus (centimorgan, cM) the two loci are apart. We test in a particular population can be predicted to some whether two loci are linked, and how close together degree from knowledge of alleles at another locus; they are likely to be, by counting the frequency of that is, the probability of a given joint genotype is recombination between them. Consider the family in different from that which would be predicted from Figure 1; here, there has been one recombination the allele frequencies at the individual loci. Con- event between the loci a/A and d/D in three meioses. sider the population (consisting of only four chro- If this frequency of recombination were observed in mosomes!) in Figure 3. Here, at the a/A locus, the a number of families, then it would indicate that loci allele frequency of a is 0.75 and of A is 0.25. Simi- a/A and d/D were not close together (recombination larly, at the d/D locus, the allele frequency of d is frequency ෂ33% in this one family). In practice, the 0.75 and of D is 0.25. So, given independence situation is considerably more complicated than in between alleles at a/A and d/D, the expected fre- Figure 1; we have assumed that it is possible to quency of a and d together on a chromosome (the unequivocally assign the phase of the loci (ie, which a-d) in the population is 0.75 × 0.75 = of a/A is with which of d/D in the parental 0.56. But the frequency of this haplotype is observed generation), and to disregard the genotypes of one at 0.75. In such a small population, the difference

Journal of Human Hypertension Genetic association studies in complex diseases B Keavney 363

Figure 2 Genetic linkage of an autosomal dominant disease. Genotypes have been determined at locus a/A and the location of the disease locus d/D is unknown. The disease is known to be autosomal dominant with complete penetrance. There is no recombination between marker a/A and disease in this family. Thus there is evidence for genetic linkage of marker a/A and dis- ease locus d/D. between observed and expected haplotype fre- quencies is not statistically significant, but clearly in a large population it would be. This is an example of association between alleles, which is also (rather confusingly for the non-geneticist) known as link- age disequilibrium. The first important distinction between linkage and association (or linkage disequilibrium) is that linkage is a property of loci, which are separated by the same genetic distances in all populations, but association is a property of alleles, which is there- fore dependent on the population under study. This may be appreciated by considering the two families shown in Figure 4. Here, the assumptions made pre- viously about mode of inheritance and penetrance are made again. It can be seen that there is no recom- bination in these families between locus a/A and disease, which supports linkage between the marker locus and the unknown disease locus d/D. In family 1, allele a is on the same chromosome as the disease allele d, whereas in family 2, allele A is on the same chromosome as the disease allele d. Therefore, there Figure 4 Linkage to disease without association. Assumptions as is no association between alleles at the a/A locus previously. There is no recombination between the loci a/A and and disease in these two families. If these families disease. Thus, there is support for linkage between the locus a/A are representative of the population studied, then in and the disease locus d/D. But, there is no association between alleles at a/A and d/D in the ‘population’ of two families.

50% of cases the disease allele will be found on chromosomes bearing A and in 50% it will be found on chromosomes bearing a. So, although locus a/A is linked to disease, there will be no association between the presence of either allele A or a and dis- ease. Figure 3 Association between two loci. Associations arise as a result of events in a popu-

Journal of Human Hypertension Genetic association studies in complex diseases B Keavney 364 lation’s history; such events are simply summarised in Figure 5. A new mutation will arise on a chromo- some bearing a particular haplotype; in the next gen- eration, there will be complete association between that new mutation and the background haplotype (although not, of course, between that haplotype and the mutation). For the example in Figure 5, the pres- ence of allele D enables the prediction of allele A with 100% certainty in generation G1. Over the course of time, assuming no further mutation events take place, the association (linkage disequilibrium) will be broken down by recombination. The opport- unity for recombination to break down associations, and thus whether there is still any association between allele D and allele A in generation G N (today) will depend on the genetic distance between the two loci considered in Figure 5, and the number of opportunities over the population’s history for recombination. Thus, in relatively young isolated populations with few founders and a slow rate of growth following foundation, linkage disequilib- Figure 6 Design for a case-control study among unrelated indi- rium around variants introduced after the founding viduals. of the population would be expected to extend over a greater distance, whereas in older populations without population bottlenecks in their history, Association due to linkage linkage disequilibrium would be expected to be less In attempting to map genes for complex disease strong. This consideration explains the interest in using association analysis, we are only interested in carrying out association studies in certain popu- associations which are present because of physical lations such as that of Finland which to some extent proximity of the marker to the causative gene. falls into the former group. Association may arise for other reasons, one of It is important to appreciate that the direction as which is population stratification. In case-control well as the magnitude of associations may be differ- studies using unrelated individuals (Figure 6), it is ent in different populations. If in Figure 5 the not possible to control for this, as we do not have mutation of allele d to allele D had taken place on transmission data from parents. An example of a chromosome bearing allele a, then the resulting stratification is shown in Figure 7. Imagine two association would be between alleles D and a rather island populations which have contrasting inci- than D and A. So, if D were a disease-causing allele, dences of a disease for purely culturally transmitted the associated alleles in each of these two hypotheti- reasons. It also happens that the allele frequencies cal populations would be different. In an admixed at a random genetic polymorphism A/a are quite population made up of approximately equal num- markedly different between the populations. No bers from each hypothetical population, no associ- gene near A/a has anything to do with disease. The ation between disease and any allele at a/A would frequency of allele A in the high risk population is be distinguishable, even in G1. high, and the frequency of A in the low risk popu- lation is low. These two island populations merge to form an admixed population. A genetic association study using unrelated cases and controls is then car-

Figure 5 Generation and decay of association. At generation G 0, mutation D at the d/D locus arises for the first time in that popu- lation on a chromosome which bears allele A. In the next gener- ation, all chromosomes which have allele D also have allele A (complete association). As time passes, (G N representing the situ- Figure 7 Population stratification. If a case-control study using ation at the present time) recombination breaks down the associ- polymorphism a/A as a marker is carried out in the admixed ation to some extent and allele D is found on chromosomes bear- population, a positive result is likely. The A allele is associated ing both allele a and A. with disease, but this association is not due to linkage.

Journal of Human Hypertension Genetic association studies in complex diseases B Keavney 365 ried out, and the a/A locus is selected as a genetic no need to collect affected sibling-pairs; however, marker. As the majority of the cases come from the for late-onset diseases such as hypertension and cor- first population with a high frequency of allele A, onary artery disease the problems of collecting such strong support for association between A and dis- ‘trios’ where both parents are alive are not trivial. ease would be expected. However, this association In this study design, the transmission of alleles at is not due to linkage, and therefore does not tell us a marker locus by heterozygous parents to affected anything about the genetic mechanisms of disease. children is examined. In Figure 8, allele A at the If there are other phenotypic characteristics to dis- marker locus a/A and the disease allele d at the dis- tinguish the two populations, then the data could be ease locus d/D are associated. Both father and analysed in a stratified manner; this would enable mother had a 50% probability of transmitting allele the recognition of the initial stratification that had A to the affected offspring, but in fact allele A was produced the positive result. Of course, this hypo- transmitted in both cases. Were this pattern of thetical situation is an extreme, selected for clarity, excess transmission of allele A observed in a large but there are well-known examples in the genetics number of trio pedigrees of this sort, there would be literature of association studies confounded by evidence of association of allele A with disease. stratification: for example, the association claimed Since this association is based on transmission, it between Gm and NIDDM in Pima Indi- must be due to linkage, thus circumventing the ans,5 and the association claimed between alleles at problem of population stratification. However, as the dopamine D2 receptor and alcoholism.6 Some was shown in Figure 4, loci can be linked without investigators have stated that they believe the prob- association between alleles. If, for example, 50% of lem of stratification to be so endemic in association the chromosomes bearing the disease allele in a studies that they would not believe any association particular population also bore the A allele, and study that did not include familial ‘controls’ as 50% bore the a allele, then despite close linkage of described below.7 While this viewpoint may be the two loci, the family-based association study rather extreme, it is certainly true to say that greater would not detect any effect at the a/A locus. Risch awareness of this issue, and incorporation of data- and Merikangas8 in a highly influential paper in sets including families into association studies, is 1996, showed that given certain assumptions about necessary. the extent of linkage disequilibrium, greater power to map common disease genes would be obtained by Family-based association studies the study, by genome-wide approaches, of such trios It is of course possible to use family samples to carry using association analysis than could be obtained by out association studies. Such a design is shown in the study of even very large numbers of affected sib- Figure 8. This design is attractive because there is ling-pairs using linkage analysis. However, this was at a cost of a very considerably increased intensity of genotyping, since the genetic distance over which useful linkage disequilibrium is thought to be present is much smaller than that over which link- age can be detected. The need for large numbers of markers is not in itself a problem, as construction of very dense maps of single nucleotide polymor- phisms (SNPs) which can be typed by novel high- throughput methods is under way. SNPs useful for association analysis are plentiful, occurring about every 1 Kb throughout the genome (thus, as many as 3000000 may be available at some time in the future).

The $64000 question: how much disequilibrium? Since Risch and Merikangas’8 observations, this question has been the focus of intense effort and speculation. Overall, the answer seems to be that disequilibrium is patchy and its distribution throughout the genome is unknown. It has been shown that the disequilibrium assumptions made by Risch and Merikangas are probably rather more favourable than the real situation, and that even quite small differences in the magnitude of disequi- librium result in a rapid escalation of the number of genotypes (to the order of hundreds of thousands) Figure 8 Family-based association study: The hypothetical dis- that would have to be scored genome-wide to obtain ease allele d is associated with allele A at the a/A polymorphism. 9 10 Alleles transmitted at a/A by a heterozygous parent to an affected adequate coverage. Recently, Kruglyak carried out child are examined: in this case, a preponderance of A alleles simulation studies which suggested that useful lev- were transmitted. els of disequilibrium are unlikely to extend beyond

Journal of Human Hypertension Genetic association studies in complex diseases B Keavney 366 an average distance of about 3 Kb in the general present, for example in the case of the ACE I/D poly- population, implying the need for approximately morphism and susceptibility to MI.15 A small study 500000 SNPs in genome-screening projects.10 in which a novel polymorphism is typed remains a Direct evidence regarding the extent of disequilib- small study. In epidemiological investigations, post rium has come from the sequencing studies of Nick- hoc analyses driven by the data are very rarely per- erson and colleagues.11 At the LPL gene, these inves- formed, whereas in case-control genetic association tigators sequenced some 10 Kb of genomic DNA in studies they are almost the norm. However, the stat- a number of individuals and found that the relative istical considerations which warn of the likely gen- frequency of mutation and recombination over time eration of false positive results when multiple analy- were comparable.11 This meant that, at this locus, ses are done in a data-dependent fashion apply the degree of useful linkage disequilibrium was equally to genetic as environmental investigations. small, and that disequilibrium was not uniformly Such exploratory analyses should ideally be the present between variants within the same gene. result of a priori hypotheses, carefully justified, and These observations suggest that, in subject to appropriate statistical corrections to their analysis using association methods, at least in some P-values and confidence intervals. genomic regions, it may be necessary to detect and genotype every polymorphism in a particular gene in order to eliminate the possibility of a false nega- Role of association studies tive result due to lack of disequilibrium (thanks to Association studies have, despite their drawbacks, the play of chance) between the variant chosen for an important role to play in the understanding of typing and the causative variant at that locus. In genetic mechanisms of disease. If genetic screening contrast, at the ACE gene, these investigators for complex disease is ever to be a reality, robust sequenced some 25 Kb of DNA, and confirmed the 12 population associations will have to be present. The findings of Keavney and colleagues that, in Cauca- same is true of pharmacogenetic approaches to indi- sians at least, there is a considerable extent of strong vidually tailor therapy based on genotype. The con- Ј 13 disequilibrium across the 3 end of this gene. The siderations above about sample size and the typing discordant nature of the results at these two loci of credible polymorphisms should always be con- emphasise the likely patchy nature of disequilib- sidered by the general reader in evaluating a new rium and underscore the difficulty of interpreting study. Ideally, associations should be readily con- any association study result that has typed only one firmed in a second population (possibly prior to polymorphism of a candidate gene without knowl- publication), and should also be tested in family edge of the disequilibrium pattern in the region. panels, possibly of the trio variety. Quantitative genetic investigations using linkage Causative polymorphisms designs in extended pedigrees have the potential to advance our knowledge about a number of traits One way to remove the disequilibrium factor is to which may be intermediate phenotypes for cardio- type causative variants in candidate genes, or if it is vascular disease (for example, plasma and urinary not possible to be certain about this, to type variants steroid levels, plasma angiotensinogen and fibrino- which have been shown to be themselves associated gen, plasma leptin). Polymorphisms that exhibit with some intermediate phenotype such as the level some regulatory effect on these intermediate pheno- of mRNA produced, the level of a serum protein, types would be ideal candidates for testing with or a clinical outcome. A good example of such an relation to endpoints in case-control and family- intermediate phenotype would be plasma ACE based association panels. activity, which has been shown in numerous associ- The role of chance in the detection of association ation studies in both unrelated individuals and fam- due to linkage will be minimised by the interface of ilies to be associated with genotype at the ACE I/D 14 linkage and association designs in this fashion. The polymorphism. Thus, the I/D polymorphism has availability of high-throughput genotyping method- credibility as a candidate polymorphism to be inves- ologies capable of determining many thousands if tigated with reference to clinical endpoints. not hundreds of thousands of genotypes on an indi- vidual sample together with the international efforts Lessons from clinical trials to catalogue human DNA variation via SNP mapping will undoubtedly ensure that adequately-powered The great success of large randomised controlled association studies will contribute significantly to trials in cardiovascular medicine and in other areas the understanding of this uniquely important field. has the potential to inform the conduct of genetic association studies. A priori, the effects we expect to be studying in these complex cardiovascular References phenotypes are small, based on what is known of the genetic and environmental epidemiology of disease. 1 Collins FS et al. The human genome project. Science Thus, sample sizes of the order of several thousands 1996; 282: 682–689. 2 Anonymous. Randomised trial of cholesterol lowering would be expected to be necessary to reach statisti- in 4444 patients with coronary heart disease: the Scan- cally robust conclusions. Investigations claiming dinavian Simvastatin Survival Study (4S). Lancet extreme relative risks with very small sample sizes 1994; 344: 1383–1389. are very likely to represent the play of chance, and 3 Goldstein JL, Brown MS. Progress in understanding publication bias in favour of such results is clearly the LDL receptor and HMG CoA reductase, two mem-

Journal of Human Hypertension Genetic association studies in complex diseases B Keavney 367 brane proteins that regulate the plasma cholesterol. equilibrium mapping of common disease genes. Nat J Lipid Res 1984; 25: 1450. Genet 1999; 22: 139–144. 4 Lander E, Schork N. Genetic dissection of complex 11 Nickerson DA et al. DNA sequence diversity in a traits. Science 1994; 265: 2037–2048. 9.7 kb region of the human lipoprotein lipase gene. 5 Knowler WC et al.Gm3;5,13,14 and type 2 diabetes mel- Nat Genet 1988; 19: 233–240. litus: an association in American indians with genetic 12 Keavney B et al. Measured haplotype analysis of the admixture. Am J Hum Genet 1988; 43: 520–526. angiotensin-1 converting enzyme gene. Hum Mol 6 Gelertner J, Goldman D, Risch N. The A1 allele at the Genet 1998; 7: 1745–1751. D2 dopamine receptor gene and alcoholism: a re- 13 Reider MJ et al. Sequence variation in the human appraisal. JAMA 1993; 269: 1673–1677. angiotensin-converting enzyme. Nat Genet 1999; 22: 7 Altshuler D, Kruglyak L, Lander E. Genetic polymor- 59–62. phisms and disease [letter; comment]. N Engl J Med 14 Tiret L et al. Evidence, from combined segregation and 1998; 338: 1626. linkage analysis, that a variant of the angiotensin I- 8 Risch N, Merikangas K. The future of genetic studies converting enzyme (ACE) gene controls plasma ACE of complex human diseases. Science 1996; 273: levels. Am J Hum Genet 1992; 51: 197–205. 1516–1517. 15 Keavney B et al. Large-scale test of hypothesised 9 Camp NJ. Genomewide transmission/disequilibrium associations between the angiotensin-converting testing – consideration of the genotype relative risks at enzyme insertion/deletion polymorphism and myocar- disease loci. Am J Hum Genet 1998; 61: 1424–1430. dial infarction in about 5000 cases and 6000 controls. 10 Kruglyak L. Prospects for whole-genome linkage dis- Lancet 2000; 355: 434–442.

Journal of Human Hypertension