arXiv:1010.5095v1 [stat.ME] 25 Oct 2010 ee rf,Eetei egn n onP .Ioannidis A. P. John and Zeggini Eleftheria Kraft, Peter Association Genome-Wide Studies in Replication 09 o.2,N.4 561–573 4, DOI: No. 24, Vol. 2009, Statistical otn ascuet,UAe-mail: USA Medicine, Massachusetts, of School Boston, University Tufts Medical and Tufts Modeling, Center and Center Genetic and for Institute Science Translational and Clinical Tufts Director, and Greece Ioannina, Technology-Hellas, and Research for Foundation Institute, Research Biomedical Greece, Ioannina, of School Medicine, Ioannina of and University Epidemiology, Hygiene of Department Molecular Unit, is and Epidemiology Ioannidis Clinical A. Chairman, P. and John Professor UK. University Oxford, Oxford, Genetics, of Human for and Centre UK Trust Hinxton, Wellcome Institute, Sanger Genetics, Trust Statistical Wellcome Applied Leader, Group Eleftheria is USA. Zeggini Massachusetts, Boston, Health, of Public School Harvard , and Epidemiology

ee rf sAscaePoesr eatet of Departments Professor, Associate is Kraft Peter c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint nttt fMteaia Mathematical of Institute 10.1214/09-STS290 nttt fMteaia Statistics Mathematical of Institute , meta-analysis. phrases: acros and words modifiers effect Key or disequilibrium linkage in occasionall differences although usua association, proposed studies initially follow-up the well-powered studies. in many replication by no of shared may lack biases it to due association, associations greatly genotype-phenotype spurious from a can of arise replication credibility that consistent challenges While frequentist and collaborations. studies, replication, multiple for across artifa inferences synthesis an of exact or d of ods for finding disadvantages and prerequisites chance advantages heterogeneity, discuss a of We issues not biases. is rep uncontrolled and study to (GWA) association association credible genome-wide a a in observed ation Abstract. 09 o.2,N.4 561–573 4, No. 24, Vol. 2009, 2009 , elcto ep nueta eoyepeoyeassoci genotype-phenotype a that ensure helps Replication [email protected] eoewd soito td,replication, study, association Genome-wide This . in . 1 seiec htteascaini o nartifact an not is association the [ taken that typically is evidence methods as and designs us- different populations, ing different differ- in by teams, covariates investigative ent between obser- associations repeated of the rule, vation the are where epidemiology, conditions In variable method. scientific the of part h o ro rbblt fascainit account, into association of probability take prior not low re- did the that conflicting thresholds significance these standard reliance on inappropriate for following: reasons the including of sults, number a were There replicate. to failed associations geno- type-phenotype reported most studies, (GWA) genome- association of advent wide the Before way. hard the replication uncertainty. their risk and the studied of measures being effect study.factor the more single of provide estimates a accurate can affecting evidence accumulated bias Moreover, uncontrolled not pro- to is association populations due the that and replica- evidence second, qualitative designs vides alone; different chance across to tion due asso- not the is that ciation evidence quantitative adds servation 22 erdcblt a ogbe osdrdakey a considered been long has Reproducibility eei pdmooylandteiprac of importance the learned epidemiology Genetic ,frtopicplraos is,rpae ob- repeated First, reasons. principal two for ], .INTRODUCTION 1. tmypitto point may it y l invalidates lly ffrn meth- ifferent s Bayesian vs. mrv the improve Conversely, replication, multi-team eliminate t studies. s resents tdue ct - 2 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS small sizes, and failure to measure the same 2.1 Convincing Statistical Evidence for variant(s) across different studies [23, 27, 64]. In re- Association sponse, the field moved toward more stringent re- To date most individual GWA studies do not have quirements for reporting associations, explicitly em- enough power to detect true associations at the con- phasizing replication [7]. Many high-profile journals servative significance levels necessary to distinguish now will not publish genotype-phenotype associa- tions without concrete evidence of replication [1]. false positives from false negatives. This point has In this article we review the requirements for repli- typically been made by referencing the large num- cating associations discovered via GWA studies in ber of tests conducted in a GWA study and the con- light of recent developments: in particular, the in- sequent severe adjustment of the p-value threshold creasing role of the consortia of multiple GWA stud- in order to control -wide Type I error ies. Prospective meta-analysis of multiple genome- rate. Empirical estimates of the threshold needed wide studies (conducted by different investigative to preserve the genome-wide Type I error rate in teams, in different populations, using different tech- studies of European-ancestry subjects using current genotyping arrays from 5 10−7 to 1 10−8 nologies and different designs) can satisfy the re- × × quirement for replication in the context of gene dis- [18, 24, 51, 59]. These thresholds are different in covery, without the need to genotype yet more sam- other populations; for example, they are even lower ples in yet further studies, as long as the combined in African or African–American samples, due to the evidence for association is strong and consistent [71]. greater genetic diversity in these populations. Even This is an important point, since very large sample these stringent thresholds take into account only the sizes are required to reliably identify common vari- complexity of the genetic architecture, and they do ants with modest effects, and formal replication of not adjust for the potential complexity of the pheno- an association—that is, genotyping the initially dis- typic architecture, that is, when targeting multiple covered genetic variant in a new, completely inde- phenotypes. pendent sample of sufficient size—may be too expen- In the framework of the Bayes theorem, the prob- sive in terms of time, money and available samples. ability that an observed association truly exists in Indeed, for some rare diseases (e.g., Creutzfeldt– the sampled population depends not only on the ob- Jakob disease) or relatively uncommon diseases (e.g., served p-value for association, but also on the power pancreatic cancer), most if not all samples with to detect the association (a function of minor al- readily-available DNA may be genotyped as part of lele , effect size and sample size), the prior initial GWA studies used at the discovery stage. probability that the tested variant is associated with We describe the goals of replication and statistical the trait under study, and the anticipated effect size rules of thumb for distinguishing chance from true [27, 64, 65]. We illustrate this in Figure 1, where we associations in the first section of this article. We plot the for association (versus no as- then discuss the importance of exact replication— sociation) as a function of p-value, sample size and seeing a consistent association with the same risk al- minor allele frequency [29]. The Bayes Factor is the lele using the same analytic methods across multiple ratio of the probability of the data under the alterna- studies—and describe analytic methods for combin- tive hypothesis (association with the tested variant) ing evidence across multiple studies, along with their to the probability of the data under the null hypoth- relative advantages and disadvantages. We close by esis (no association). (Others define the Bayes Fac- discussing why an association may fail to replicate tor as the inverse of this ratio [29].) The posterior and place replication efforts in the wider picture of odds of true association given the data are equal contemporary genetic epidemiology, with its focus to the Bayes Factor times the prior odds of asso- on large-scale collaborations and data sharing. ciation. In Figure 1 the dashed line represents the Bayes Factor needed to achieve posterior odds for an 2. GOALS OF REPLICATION association of 3 : 1, assuming prior odds of associa- There are two primary reasons replication is es- tion of 1 : 99,999 (i.e., roughly 100 out of 10,000,000 sential to confirm associations discovered via GWA variants are truly associated with the studied trait). studies: to provide convincing statistical evidence Note that all p-values are not created equal: for a for association, and to rule out associations due to given p-value, the evidence for association increases biases. Another possible aim of replication is to im- with increasing sample size and depends on risk al- prove effect estimation. lele frequency. Increasing overall sample size not only REPLICATION IN GWAS 3

Fig. 1. The relationship between the Bayes Factor and p-value for different sample sizes and minor allele frequencies (left panel: minor allele frequency of 40%; right panel: 5%). The dashed line represents the Bayes Factor necessary to achieve pos- terior odds in favor of association of 3 : 1 or greater, assuming the prior odds of association are 1 : 99,999. Bayes Factors were calculated for a case-control study with equal numbers of cases and controls, assuming the expected value of the absolute value of the log is log(1.15), and assuming a “spike and smear” prior. Calculations use equations (4) and (5) from [29], 2 −1 − −1 −1 −1 −1 with σ = Iββ Iαβ [Iαα ] Iβα , where I is the Fisher information from simple log(odds) = α + βGadditive calculated under the null (β = 0). increases power to detect common risk variants with Conversely, if we assume that the average effect modest effects, but it can also increase the cred- is an odds ratio of ORav = 1.02 (consistent with the ibility of the observed associations. Differences in theory of infinitesimal effects, each having an almost credibility by sample size for similar p-values are a imperceptible contribution [14]), then the prior odds consequence of the fact that these calculations take of association will be much higher, because a much assumptions about the expected magnitude of the larger set of (infinitesimally) associated variants are true allelic odds ratio into account. In particular, anticipated and a smaller Bayes Factor would be they assume that the true allelic odds ratio is un- needed to reach a 3 : 1 posterior. Moreover, large ef- likely (probability < 2.1%) to be smaller than 0.5 fects emerging from small studies would be incredi- or bigger than 2. Since small p-values can only be ble, regardless of their p-value, while very small ef- achieved in small sample sizes if the estimated ef- fects emerging with modest p-values from large stud- fect is large, these results are perceived to be less ies would provide credible evidence for association. credible in this framework. We should acknowledge that the distribution of ef- In other words, the Bayes Factor and thus the fect sizes of true associations is unknown, and there credibility of an association depends explicitly on is no guarantee that they would be similar for dif- what we assume for the typical magnitude of likely genetic effects. For example, if we assume that the ferent traits. The difficulty of arriving at the true average effect is not an odds ratio of 1.15 as in Fig- causal variants (which may have larger effect sizes ure 1, but an odds ratio of ORav = 1.5, then the than their markers) adds another layer of complex- prior odds of association will be less, because fewer ity. Moreover, given simple power considerations, it variants—with larger effects than in the ORav = is expected that a large proportion of the large ef- 1.15 scenario—would suffice to explain the genetic fects have been identified, while only a small pro- variability. A larger Bayes Factor would be needed to portion of the smaller effects and a negligible pro- reach a 3 : 1 posterior. Moreover, large effects emerg- portion of the tiny and infinitesimal effects are al- ing from small studies will be more credible than in ready discovered. With these caveats, most evidence the ORav = 1.15 scenario, while very small effects from GWA studies to date is more compatible with emerging with similar p-values from large studies the scenarios of ORav being in the range of 1.15 will be less credible [29]. [45], but the 1.02 scenario is not implausible, and 4 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS for some traits the 1.5 scenario may be operating, notably population stratification bias. For this rea- but we still have not identified the true variants. son, it is important to see the association in other Many research groups cannot afford to genotype studies conducted using a similar (but not identical) the large sample sizes needed to reliably detect ge- study base. In principle, careful design and anal- netic markers that are weakly associated with a trait ysis should eliminate or greatly reduce bias due to using a genome-wide platform. This has sparked in- population stratification in association studies using terest in multistage designs, where a subset of avail- unrelated individuals [17, 44, 52]—and, in practice, able samples are genotyped using the genome-wide these methods have effectively removed some wor- platform, and then a subset of the “most promis- rying systematic inflation in association statistics ing” markers (typically those with lowest p-values) [19]. Family-based designs can provide additional are genotyped using a custom platform. These de- evidence that an observed association is not due signs are reviewed in more detail elsewhere in this to population stratification bias, but these designs issue [60]. We should note that the primary motiva- are not cost-efficient, and have their own unique tion of multistage designs is not to increase power sources of bias. For example, nondifferential geno- by testing fewer hypotheses in the second stage sam- typing error can inflate Type I error rates in some ples, and hence paying a smaller penalty for multi- family-based analyses, although it does not change ple testing at the second stage. Rather the primary the Type I error rate [47]. goal of multistage designs is to save genotyping costs 2.3 Improving Effect Estimates or maximize power given a fixed genotyping bud- get. If genotyping costs were not an issue, then the Another reason to conduct replication studies is multistage approach is less powerful than simply to extend the generalizablity of the association. It testing all markers in the entire available sample is important to know if the association exists and [33, 58, 60]. As genotyping costs decrease, and as has similar magnitude in different environmental or more samples have been genotyped as part of pre- genetic backgrounds. It is particularly interesting to vious GWA analyses, single-stage analyses become know how these associations play out in populations more common [33]. of non-European ancestry, considering most GWA The appropriate threshold for claiming associa- studies to date have been conducted in European- tion depends also on the context and the relative ancestry samples. Differences in allele frequencies costs for false positive and false negative results. For and local linkage disequilibrium (LD) patterns across example, re-sequencing a region and conducting in populations present both challenges and opportuni- vivo and in vitro functional studies is quite expen- ties for replication and fine mapping. On the one sive, and will require convincing evidence that the hand, a marker allele that is strongly associated with observed association is true. On the other hand, in- a trait in one population may not have a detectable cluding a region in a predictive genetic risk score is association in another, as the allele frequency may relatively inexpensive, so a less stringent threshold be smaller or the LD with the (unknown) causal might suffice. This approach to replication is intu- variant may be much weaker. Thus, initial replica- itively Bayesian (although it need not use formal tion studies should focus on populations with ge- Bayesian methods): each successive study serves to netic ancestry similar to that sampled in the study update the prior for association in subsequent stud- that first observed the marker-trait association, us- ies. ing the exact strategy outlined in the next section. Once credible evidence for this association has been 2.2 Ruling Out Association Due to Artifact established, replication efforts in other populations Even when the initial association is unlikely to be should type not only the marker known to be asso- a stochastic artifact due to multiple testing, it may ciated in the original population, but other markers still be an artifact due to bias. For common vari- that “tag” common variation in a region surround- ants, the anticipated effects are modest—for binary ing the marker. For fine mapping, differences in LD traits, odds ratios smaller than 1.5; for continuous patterns across populations—notably the lower lev- traits, percent explained less than 0.5%— els of LD in African-ancestry populations—might and very similar in magnitude to the subtle biases lead to refined estimates of the position of causal that may affect genetic association studies—most variants [55, 62]. REPLICATION IN GWAS 5

Replication may also be useful in identifying a SNP in the region (not necessarily the SNP that was more reliable estimate of the effect size for the asso- statistically significant in other studies) would have ciation. Signals selected based on statistical signifi- p< 0.05, and this would be (incorrectly) proclaimed cance thresholds in underpowered settings are likely replication [49]. In response to this problem, guide- to have (on average) inflated effects due to the win- lines for replication in genetic association studies ner’s curse phenomenon [28, 69, 70, 75]. Replication now call for exact replication. The same marker— should take this into account during the sample size or, if technical difficulties preclude this, a perfect or calculations for the replication efforts; the effect es- near-perfect proxy for the original marker—should timate from the initial study may be inflated, lead- be genotyped across all studies and analyzed using ing to an under-estimate of the number of subjects the same genetic model. In this section we discuss needed to reliably detect it [69, 70, 75, 76]. Analytic prerequisites for exact replication. We use the term methods are available to adjust for winner’s-curse “exact replication” cautiously, recognizing that this bias, but studying the marker in additional samples is an unattainable goal in epidemiology (e.g., stud- (beyond those used to initially identify the marker) ies conducted by different investigators at different will help produce more unbiased estimates of the ge- times, let alone places, will sample from different netic effect. Accurate estimates of marker risks are populations) and that in some sense it is the “inex- actness” of replication studies that increases credi- important (even if the marker is only a surrogate for bility of the observed association (it is less likely to the as yet unknown causal variant), as they may be be an artifact due to a bias that is unique to the used for personalized predictive purposes [34, 68]. initial study). We use the term to emphasize the Finally, when there are several putative associa- danger of “moving the goalposts” so far that claims tion signals in a region of high LD, dense genotyp- of replication carry little weight. ing in replication studies may help elucidate whether they represent independent loci, each with its own 3.1 Test the Same Marker effect in the trait, or whether one or all are “passen- This should be done preferably by directly geno- ger” markers, which have no effect conditional on typing this marker. Currently-available imputation the true underlying causal variant. Detailed discus- methods are powerful and quite accurate for filling sion of fine mapping issues is beyond the scope of in information on missing common SNPs [12, 38, this review, but in light of the effort involved, such 39, 46, 50]. Even then, further confirmation by di- “fine mapping” efforts should arguably be reserved rect genotyping would be very useful. (In fact, to for loci with credible evidence for association, for ex- rule out technical artifact, some have argued that ample, loci with markers that have been replicated an associated SNP should be genotyped using two exactly, as discussed in the next section [8]. different genotyping technologies, or that a second SNP in the region that is in [near-] perfect LD with 3. PREREQUISITES FOR EXACT the associated SNP be genotyped [7].) Great caution REPLICATION OF A PUTATIVE is needed when “replicating” an association by find- ASSOCIATION FROM A GWA STUDY ing an association with a (different) nearby marker: if the new marker does not have perfect or almost One of the early difficulties in replicating genetic perfect LD with the previously discovered one, this associations observed in candidate gene studies was cannot be considered replication. Moreover, even for the fact that different groups would study different markers with seemingly perfect LD in a given sam- markers in the same region. Because the LD among ple, the LD may be far less than perfect in a different these markers was poorly understood, results from population and it may break completely in popula- multiple studies could increase rather than decrease tions of different ancestry. When a panel of markers confusion. The initial study may have seen an asso- spanning the whole locus is pursued (e.g., after rese- ciation with SNP A, but the second study did not quencing and fine mapping), different markers and genotype that SNP, and instead saw an association haplotypes may be found to be associated in dif- with SNP B, which was not genotyped in the original ferent populations. Evidence from different markers study. As the number of SNPs typed per region in- and haplotypes should not be combined in the same creased, “moving the goalposts” in this fashion con- meta-analysis. The consistency of each association tributed to the problem of persistent false positives can be formally assessed separately (see the section in the candidate gene literature; by chance, some on statistical heterogeneity). 6 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS

3.2 Use the Same Analytic Methods 3.3 Try to Use the Same Phenotype

If the initial results found an increased risk per For many traits, phenotype definitions may vary copy of, say, the A allele (additive model), then a considerably across studies, or there may be many significant increased risk for carriers of the T allele different options for defining the phenotypes of in- (dominant model, in other direction) does not con- terest within each study. Some of this variability is stitute replication. It is in principle possible that the unavoidable and results from differences in measure- direction of association can change due to differences ment protocols across studies. For example, disease in linkage disequilibrium across study populations. may be self-reported in some studies or clinician- However, this “flip flop” phenomenon can occur only diagnosed in others; waist:hip ratios may be self- reported or measured in a clinic, using different op- in very specific situations that are unlikely when the erational definitions of “waist”; etc. Characteristics study populations have similar continental ancestry of studied phenotype may also differ across stud- [41]. The burden of proof is on investigators to show ies: for example, because of the widespread use of evidence for how difference in LD in their study pop- Prostate-Specific Antigen (PSA) screening in the ulations could produce a “flip flop” if they wish to United States since the early 1990s, the proportion claim replication, even though different alleles are of early-stage prostate cancer cases in the US is associated with risk. Merely citing the possibility of higher than in Europe, where PSA screening is not “flip-flopping” does not suffice. as common. In the context of a prospective meta- Other analytical options include the statistical analysis, study investigators can discuss these issues model (e.g., for a binary outcome, whether it is and reach consensus on how to define phenotype so treated as simply yes/no or the time-to-event is also as to maximize relevant information while ensur- taken into account), the use of any covariates (e.g. ing as many studies can provide data as possible. for age, gender or topic-specific variables) and the In general, there is a trade off between more accu- use of corrections for relatedness. Usually, the im- rate (but more expensive and perhaps more inva- pact of these options is not major, but it can make sive) on fewer people and less accu- a difference for borderline associations which may rate (but cheaper) measurements on more people. seem to pass or not pass a desired p-value threshold. For example, although the Fagerstom Test may be This that both for GWA studies and subse- a “” measure of nicotine dependence, quent investigations, one should carefully report the currently only a few studies with available genome- methods in sufficient detail so they can be indepen- wide genotype data have collected data on this test; dently replicated by other researchers [42]. on the other hand, many studies have collected in- Modeling can have a much more profound im- formation about the number of cigarettes smoked pact in more complex associations than go beyond per day (a component of the Fagerstrom score) [6]. single markers, for example, with approaches that To maximize sample size, investigators may agree to analyze cigarettes per day (which then raises further try to model dozens and hundreds of gene variants issues such as what scale to use, whether and how that form a “pathway” [5, 35]. Such complex mod- to transform the raw data, how to reconcile contin- els may be built by MDR, kernel machines, step- uous with categorical data, etc.). Prospective meta- wise logistic regression or a diversity of other meth- analyses for height, BMI and fasting glucose have ods and it is important for the replication process dealt with the issue of phenotype harmonization in to use the same exact steps as the model build- a trait-by-trait basis [37, 43, 53, 67]. Other consortia ing. Even then, because these models are so flexi- and projects such as the Public Population Project ble, it is unclear whether a “significant” finding in a in Genomics (http://www.p3gconsortium.org/) second data set constitutes replication; the associa- and PhenX (www.phenx.org) aim to facilitate broad tion may be driven by different sets of SNPs in the collaboration among existing and future genome- different studies. Researchers who conduct complex wide association studies by making recommenda- model-selection/model-building analyses should re- tions for standard phenotyping protocols for many port their “final” model in as much detail as possi- diseases and traits. Still, despite best efforts to har- ble, so other investigators can judge the fit of that monize measures, some differences model in other data sets. across studies will persist, and investigators should REPLICATION IN GWAS 7 be aware of these as possible sources of heterogene- Some other caveats should be mentioned. The win- ity (see Sections 4 and 5). ner’s curse in the magnitude of the effect in the dis- Requiring that replication studies use the same covery phase may introduce spuriously inflated het- phenotype definition used in the initial study also erogeneity, when the discovery data are combined helps avoid false positives due to “data dredging,” with subsequent replication studies. In such two- the temptation to generate small p-values by testing stage approaches, between-study heterogeneity many different traits (different case subtypes, con- should best be estimated excluding the discovery tinuous traits dichotomize using different, arbitrary data. Conversely, if all datasets are measured with cut points, etc.) [27]. When many phenotypes or genome-wide platforms and GWA scan meta-analysis phenotype definitions and analyses are used, there is performed in all gene variants, this is no longer should be a penalty for multiple testing. Applying an issue. In fact, if the GWA scan meta-analysis this penalty is not always straightforward, given that uses random effects (see below), the emerging top hits from the GWA scan meta-analysis are likely most of the phenotypes and analyses are usually cor- to have, on average, deflated observed heterogene- related or even highly correlated. However, the dan- ity compared with the true heterogeneity. This is ger exists for an association to be claimed replicated, because underestimation of the between-study het- after searching through repeated modifications of erogeneity favors a variant to come to the top of the the phenotypes and analyses thereof. A p-value that list, since it does not get penalized by wider confi- has been obtained through such an iterative search- dence intervals in the random effects setting. ing path is not the same as one that was obtained However, we caution that when the number of from a single main analysis of a single phenotype. studies is relatively small, association tests based on random-effects meta-analysis may be deflated, as 4. REPLICATION METHODS AND the between-study variance τ 2 will be poorly esti- PRESENTATION OF RESULTS mated. This is illustrated in Figure 2, which shows quantile-quantile plots for fixed-effect and random- 4.1 Statistical Heterogeneity Across Datasets effects meta-analyses of data from PanScan collabo- There are several tests and metrics of between- ration, which involves 13 studies in the initial GWAS dataset heterogeneity, borrowed from applications scan. For the random effects analysis, the genomic- of meta-analysis in other fields. The most popular control “inflation factor” is in this case more aptly are Cochran’s Q test of homogeneity [9], the I2 met- named a “deflation factor”: λGC = 0.84, indicating ric [obtained by (Q-degrees of freedom)/Q] and the that the random effects p-values are larger than ex- between-study variance estimator τ 2 [21]. There are pected under the assumption that the vast major- shortcomings to all of them [26]. The Q test is un- ity of SNPs are not associated with pancreatic can- derpowered in the common situation where there are cer. Fixed-effect meta analysis is arguably more ap- propriate as an initial screening test for associated few datasets and may be overpowered when there markers, although because fixed-effect analysis can are many, large datasets. There are now readily- be highly significant when only one (relatively large) available approaches that can be used to compute study shows evidence for association, analyses that the power of the Q test to detect a given tau-squared 2 incorporate effect heterogeneity such as random ef- [4]. When the Q test is underpowered, the I met- fects meta-analysis should be reported for highly sig- ric has large uncertainty and this can be readily vi- nificant markers from fixed-effect analyses. sualized by computing its 95% confidence intervals 2 Finally, lack of demonstrable heterogeneity may [30]. Similarly, estimates of τ may have large uncer- be perceived as a criterion of credible replication tainty. One potentially useful approach may be to [31]. However, one should note that tests and mea- estimate the magnitude of between-study variabil- sures of heterogeneity address whether effect sizes ity compared with the observed effect size θ, that across different datasets vary, not whether they are 2 is, h = τ/θ. For a small effect size, even small τ consistently on the same side of the null. Dataset- may question the generalizability of the conclusion specific effects could vary a lot, but they may all that there is an association across all datasets. This still point to the same direction of effect. Given the conclusion would not be as easily challenged in the potential diversity of LD structure across popula- presence of a large effect size. tions, and differences in phenotype definitions and 8 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS

the uncertainty. Commonly, random effects are esti- mated with a 95% CI that captures the uncertainty about the effect, but ideally one should also examine the uncertainty of the distribution of effects across populations. This is provided by the predic- tion interval. An approximate (1 α)% for the effect in an unspecified− study can be obtained from the estimate of the mean effectµ ˆ , its estimated and the estimate of the between-study varianceτ ˆ2 by α 2 c 2 µˆ tk−2√ τˆ + SE(ˆµ) , ± { } α where tk−2 is the 100(1 α/2)% of the t-distribution with k 2− degrees of freedom [20]. It becomes implicit that− when an association has been probed in only a few datasets, then the prediction interval will be wider than the respective confidence Fig. 2. interval, even if there is no demonstrable between- Quantile-quantile plots for fixed-effect and random– 2 effect meta-analyses of the 13 studies in the initial PanScan study variance (i.e., τ = 0). Table 1 summarizes genome-wide association study of pancreatic cancer. The ge- some issues that arise in selecting, interpreting and nomic control inflation factors λGC for the fixed-effect and comparing the properties and results of various com- random effect analyses were 0.84 and 1.00, respectively. λGC monly used meta-analyses methods. was calculated as the observed chi-squared test statis- tic divided by the median of a chi-squared distribution with one degree of freedom. 5. REASONS FOR NONREPLICATION [T]here are often two or more hypothe- measurements across studies, between-study hetero- ses which account for all the known facts geneity should not dismiss an association because on some subject, and although, in such the effect sizes are not consistent, if the evidence for cases, men [sic] of science endeavour to rejection of the null hypothesis is strong. find facts which will rule out all the hy- potheses except one, there is no reason why 4.2 Models for Synthesis of Data from Multiple they should always succeed. — Bertram Replication Studies Russell [54] Data across studies can be combined at the level A variant observed to be associated with a trait of either p-values (probability pooler methods) or in an initial GWA may not be associated with the effect sizes (effect size meta-analysis) [12, 32, 61]. trait in subsequent studies, even though the original When p-values are combined, at a minimum one association was (nearly) “genome-wide significant.” should take into account also the direction of effects, There are a number of potential reasons for this non- but the magnitude of the effects is not taken into replication. account. When effect sizes are used, there are sev- eral models that can be used, depending on whether (a) The original observation was a false positive between-study heterogeneity is taken into account or due to error. This is the default explana- not, and if the former, how this is done. In general, tion, until proven otherwise. This is more likely for fixed effects approaches that ignore between-study associations that were not (or just barely) “genome- heterogeneity are better powered than random ef- wide significant” than for observations that were ex- fects approaches and thus more efficient for discov- tremely statistically significant. ery purposes. However, there is a trade-off for in- (b) The follow-up study had insufficient power. creased chances of false-positives. For effect estima- This problem can be avoided by ensuring the follow- tion and predicting what effects might be expected up study is large enough to reliably detect the ob- in future similar populations, random effects are in- served effect (after accounting for inflation due to tuitively superior in capturing better the extent of “winner’s curse”) [69, 70, 75, 76]. Moreover, if we REPLICATION IN GWAS 9

Table 1 Different methods for meta-analysis in the genome-wide association setting

Issues and caveats P-value Effect size meta-analysis meta-analysis Fixed effects Random effects

Direction of effect is consid- In some methods Yes Yes ered Effect size is considered No Yes Yes Summary p-value is obtained Yes Yes Yes Summary effect is obtained No Yes Yes Summary result can be No Yes Yes converted to credibility based on priors for the anticipated effect sizes Between-study heterogeneity No No Yes can be taken into account Between-study heterogeneity No Yes Yes can be estimated/tested Consensus on if/how datasets No Yes Yes should be weighted Commonly used weights None, SQRT(N), N Inverse vari- Inverse vari- ance ance Prior assumptions on the effect No In Bayesian In Bayesian size can be used meta-analysis meta-analysis Prior uncertainty on No No In Bayesian heterogeneity can be meta-analysis accommodated Prior uncertainty on the ge- No In Bayesian In Bayesian netic model can be accommo- M-A meta-analysis dated Normality assumptions typically Yes Yes Yes made within each study Normality assumptions within Yes, rarely done Yes, rarely each study easily testable done Normality assumptions for No effects assumed Single common Not easily distribution of effects across effect assumed testable studies easily testable (assumption may be visibly wrong) Heavy-tail alternative methods No Yes, rarely Yes, rarely exist used used Use with uncommon alleles Need to use exact methods Quite robust Between-study (small genotype groups, or variance even zero allele counts in estimation 2 × 2 tables) unstable Power for discovery Good Good Less than oth- ers False-positives from single Susceptible Susceptible Less suscepti- biased dataset ble False-positives when evidence Susceptible Susceptible More suscepti- from small studies is most ble biased False-positives when evidence Susceptible Susceptible Less suscepti- from large studies is most biased ble Can predict range of effect sizes No Too narrow Appropriate in future similar populations confidence with predictive intervals intervals Can convey uncertainty for Useless Inappropriate Most ap- practical applications (e.g., to propriate be used in clinical prediction with prediction test) intervals 10 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS consider the cumulative evidence (both the original studies. Post hoc explanations for subgroup differ- data plus the follow-up data) as an updated meta- ences, interactions and effect modification may be analysis, the cumulative evidence may still pass overfit to the observed data and may require fur- genome-wide significance or a sufficient Bayes Fac- ther prospective replication in further datasets be- tor threshold, even though the follow-up data are fore they can be relied upon. not formally (highly) significant, when seen in iso- lation. 6. THE WIDER PICTURE OF REPLICATION (c) The genotypic coding used in the initial study EFFORTS: CONSORTIA, DATA AVAILABILITY may not accurately reflect the true underlying asso- AND FIELD SYNOPSES ciation, leading to a loss of power. Ideally the follow- With the recent successes of GWA studies, the up study should be well powered to detect associa- field has realized that increasingly large sample sizes tions based on different genetic models (e.g., reces- are required to identify and replicate the increas- sive, dominant) that are consistent with the results ingly small effect sizes at common variants that re- observed in the first study. main undetected. Even wider networks will be re- (d) The variant may be a poor marker for the quired to facilitate the study of variation at the trait due to differences in linkage-disequilibrium lower end of the frequency spectrum (be it single structure between the studies. This is more likely base changes, copy number variants or otherwise). if the study populations have different ethnic back- Collaboration and data sharing are invaluable tools grounds. When discussing this as a possible rea- in achieving the necessary sample sizes for son for nonreplication, investigators should make a well-powered replication studies. The past few years good-faith effort to provide empirical data on how have witnessed a rapid rise in international con- linkage-disequilibrium patterns differ (e.g., using sortium formation and collaboration has taken a HapMap data) and how these differences would lead most prominent role in conducting research. Consor- to inconsistencies across studies. tia allow investigators to make some design choices (e) Differences in design or trait definition may up front (if only deciding which SNPs to attempt lead to inconsistencies. See Sections 6.1 and 6.2 for to replicate), and to work together to harmonize examples of how different or ascertain- phenotypes and analyses [71]. Several examples of ment schemes can affect estimates of marker-trait notable successes of consortium-coordinated efforts association. Again, when citing this as a reason for have started to emerge in the literature [2, 10, 66, nonreplication, investigators should as far as possi- 67, 74]. ble present arguments for the likelihood and mag- In silico replication of association signals has been nitude of differences due to design or measurement further facilitated by initiatives making genetic asso- differences. ciation study results and/or raw data publicly avail- (f) The absence of an association in the subse- able (or available through application to an access quent studies may be due to true etiologic hetero- committee), for example, the Wellcome Trust Case geneity. Sometimes, this may be driven by gene–gene Control Consortium (www.wtccc.org.uk), dbGAP or gene–environment . If cases in the orig- (http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap) inal study were required to have a family history of and the European Genotype Archive (EGA, disease, for example, or required to have a relatively http://www.ebi.ac.uk/ega). Several emerging con- rare exposure profile (e.g., male lifetime never smok- siderations, for example, with respect to the ers), then subsequent studies that do not impose anonymity of data [25], avenues for communication these restrictions may not see the association, if the between primary investigators and secondary users association is restricted to subgroups with a par- to facilitate a better understanding of the datasets ticular genetic or exposure background. However, and their appropriate uses, and suitable accredita- to date, gene–gene and gene–environment interac- tion of involved parties, require resolution in order tions have been notoriously difficult to document to optimize the use of publicly available raw data. robustly. Replication undoubtedly constitutes an evolving For the last three explanations, it is useful to clar- practice. The need to incorporate new data aris- ify if the explanation was offered a posteriori af- ing from further GWA scans, other replication stud- ter observing the inconsistent results in different ies, meta-analyses or all of the above leads to the REPLICATION IN GWAS 11

Table 2 Cumulative power to detect association (α = 5 × 10−8) at a risk allele with frequency 0.20 and 0.40, and allelic odds ratios of 1.1 and 1.2, given sample sizes for the WTCCC, DGI and FUSION studies

Studies Riskallele Allelic Cumulativen Cumulativen Power frequency odds ratio cases controls

WTCCC 0.20 1.10 1924 2938 0.0002 WTCCC + DGI 0.20 1.10 3388 4405 0.0011 WTCCC + DGI + FUSION 0.20 1.10 4549 5579 0.0033 WTCCC 0.40 1.10 1924 2938 0.0007 WTCCC + DGI 0.40 1.10 3388 4405 0.0054 WTCCC + DGI + FUSION 0.40 1.10 4549 5579 0.0166 WTCCC 0.20 1.20 1924 2938 0.0333 WTCCC + DGI 0.20 1.20 3388 4405 0.2078 WTCCC + DGI + FUSION 0.20 1.20 4549 5579 0.4426 WTCCC 0.40 1.20 1924 2938 0.1336 WTCCC + DGI 0.40 1.20 3388 4405 0.5468 WTCCC + DGI + FUSION 0.40 1.20 4549 5579 0.8219

emerging of conglomerate analyses. Field the FTO gene [13]. This inconsistency in observed synopses, for example, are efforts to integrate data associations could be ascribed to study design and, from diverse sources (GWA studies, consortia, sin- specifically, to matching cases and controls for BMI gle published studies) in the published literature (DGI study). The FTO signal was quickly identi- and to make them publicly available in electronic fied as the first robustly replicating association with databases that can be updatable. Examples include obesity, mediating its effect on T2D through BMI. A the field synopses on Alzheimer’s disease (AlzGene truly genome-wide meta-analysis of the three scans database), schizophrenia (SzGene database) and ensued, with large-scale replication efforts in inde- DNA repair genes [3, 63]. The results of the meta- pendent datasets of T2D cases and controls, all of analyses on the accumulated data can then also be European origin. This effort led to the identification graded for their epidemiological credibility, for ex- of further novel T2D susceptibility loci [74]. Table 2 ample, as proposed by the Venice criteria [31]. demonstrates the gains in power afforded by increas- ing sample size from a single scan to the synthesis 6.1 Example from the Field of Type 2 Diabetes of all three studies for a realistic common complex Researchers in the field of Type 2 diabetes (T2D) disease susceptibility locus. genetics were among the first to lead the way in 6.2 Anthropometrics and the Analysis of distributed collaborative networks, exemplified by “Secondary Traits” early efforts such as the International Type 2 Dia- betes Linkage Analysis Consortium and the Inter- The meta-analyses of body mass index and height national Type 2 Diabetes 1q Consortium [11, 16, conducted by the Genetic Investigation of ANthro- 72]. The advent of GWA scans was met by pre- pometric Traits (GIANT) consortium raised addi- publication data sharing between three large-scale tional issues [36, 43, 67]. Specifically, unlike the dia- studies, the WTCCC, DGI and FUSION scans [56, betes consortia, where each participating study was 57, 59, 73], leading to the formation of the DIA- designed with diabetes as its primary outcome, the GRAM Consortium (Diabetes Genetics Replication studies involved in GIANT were not originally de- and Meta-analysis). By exchanging information on signed to study determinants of BMI and height, top signals, the three studies obtained in silico repli- rather they were originally case-control studies of cation of individual scan findings and then further diabetes, prostate and breast cancers, and other dis- pursued de novo replication in additional sets of in- eases [15, 40]. In principle, if the studied trait is dependent samples. This endeavor additionally high- associated with disease risk, then conditioning on lighted examples of statistical heterogeneity across case-control status can create a spurious association the studies, notably with respect to one of the between a marker and the trait. In practice, only a WTCCC study’s strongest signals, residing within small number of markers will have an inflated Type 12 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS

I error rate—namely, those markers that are asso- [6] Caporaso, N. et al. (2009). Genome-wide and candidate ciated with disease risk but not directly with the gene association study of cigarette smoking behav- secondary trait—and the magnitude of the inflation iors. PLoS ONE 4 e4653. Chanock depends on both the strength of the association be- [7] , S. J. et al. (2007). Replicating genotype- phenotype associations. Nature 447 655–660. tween the secondary trait and disease (which could [8] Clarke, G. M. et al. (2007). Fine mapping versus repli- be modest or controversial, as in the case of smok- cation in whole-genome association studies. Am. J. ing and breast or prostate cancer, or quite strong, Hum. Genet. 81 995–1005. as in the case of BMI and T2D or smoking and lung [9] Cochran, W. (1954). The combination of estimates cancer) and the strength of the association between from different . Biometrics 10 101–129. MR0067428 the marker and disease (typically relatively weak) [10] Cooper, J. D. et al. (2008). Meta-analysis of genome- [15, 48]. Moreover, the risk of false positives may be wide association study data identifies additional further ameliorated by diversity of designs among type 1 diabetes risk loci. Nat. Genet. 40 1399–1401. the participating studies—some may have originally [11] Das, S. K. and Elbein, S. C. (2007). The search for type been case-control studies of different diseases, oth- 2 diabetes susceptibility loci: The chromosome 1q ers may have been cohort or cross-sectional stud- story. Curr. Diab. Rep. 7 154–164. [12] de Bakker, P. I. et al. (2008). Practical aspects of ies. Although there are analytic methods that can imputation-driven meta-analysis of genome-wide eliminate spurious association or bias due to case- association studies. Hum. Mol. Genet. 17 R122– control ascertainment in particular situations and R128. under particular assumptions [40, 48], these should [13] Frayling, T. M. et al. (2007). A common variant in the not replace careful consideration of potential biases FTO gene is associated with body mass index and and evaluation of heterogeneity in effect measures predisposes to childhood and adult obesity. Science 316 889–894. across studies with different designs. [14] Gibson, G. (2009). Decanalization and the origin of com- plex disease. Nat. Rev. Genet. 10 134–140. ACKNOWLEDGMENTS [15] Greenland, S. (2003). Quantifying biases in causal models: Classical vs collider- EZ is supported by the Wellcome Trust stratification bias. Epidemiology 14 300–306. (WT088885/Z/09/Z). We thank Mandy van Hoek [16] Guan, W. et al. (2008). Meta-analysis of 23 type 2 dia- for contributing to power calculations. Scientific sup- betes linkage studies from the International Type port for this project was provided through the Tufts 2 Diabetes Linkage Analysis Consortium. Hum. Clinical and Translational Science Institute (Tufts Hered. 66 35–49. [17] Guan, W. et al. (2009). Genotype-based matching to CTSI) under funding from the National Institute correct for population stratification in large-scale of Health/National Center for Research Resources case-control genetic association studies. Genet. Epi- (UL1 RR025752). Points of view or opinions in this demiol. 33 508–517. paper are those of the authors and do not necessar- [18] Gusnato, A. and Dudbridge, F. (2007). Estimating ily represent the official position or policies of the genome-wide significance levels for association. In Tufts CTSI. European Mathematical Genetics Meetings, Heidel- berg. [19] Han, J. et al. (2008). A genome-wide association study REFERENCES identifies novel alleles associated with hair color and skin pigmentation. PLoS Genet. 4 e1000074. [1] (1999). Freely associating. Nat. Genet. 22 1–2. [20] Higgins, J. (2009). A re-evaluation of random-effects [2] Barrett, J. C. et al. (2008). Genome-wide association meta-analysis. J. Roy. Statist. Soc. Ser. A 172 137– defines more than 30 distinct susceptibility loci for 159. Crohn’s disease. Nat. Genet. 40 955–962. [21] Higgins, J. and Thompson, S. (2002). Quantifying het- [3] Bertram, L. et al. (2007). Systematic meta-analyses of erogeneity in a meta-analysis. Stat. Med. 21 1539– Alzheimer disease genetic association studies: The 1558. AlzGene database. Nat. Genet. 39 17–23. [22] Hill, A. B. (1965). The environment and disease: Associ- [4] Biggerstaff, B. J. and Jackson, D. (2008). The exact ation or causation? Proc. R. Soc. Med. 58 295–300. distribution of Cochran’s heterogeneity in [23] Hirschhorn, J. N. and Altshuler, D. (2002). Once one-way random effects meta-analysis. Stat. Med. and again-issues surrounding replication in genetic 27 6093–6110. association studies. J. Clin. Endocrinol. Metab. 87 [5] Breitling, L., Steyerberg, E. and Brenner, H. 4438–4441. (2009). The novel genomic pathway approach to [24] Hoggart, C. J. et al. (2008). Genome-wide significance complex diseases: A reason for (over-) optimism? for dense SNP and resequencing data. Genet. Epi- Epidemiol. 20 500–507. demiol. 32 179–185. REPLICATION IN GWAS 13

[25] Homer, N. et al. (2008). Resolving individuals contribut- [44] Luca, D. et al. (2008). On the use of general control ing trace amounts of DNA to highly complex mix- samples for genome-wide association studies: Ge- tures using high-density SNP genotyping microar- netic matching highlights causal variants. Am. J. rays. PLoS Genet. 4 e1000167. Hum. Genet. 82 453–463. [26] Huedo-Medina, T. B. et al. (2006). Assessing hetero- [45] Manolio, T. A., Brooks, L. D. and Collins, F. S. geneity in meta-analysis: Q statistic or I2 index? (2008). A HapMap harvest of insights into the ge- Psychol. Methods 11 193–206. netics of common disease. J. Clin. Invest. 118 1590– [27] Ioannidis, J. P. (2005). Why most published research 1605. findings are false. PLoS. Med. 2 e124. MR2216666 [46] Marchini, J. et al. (2007). A new multipoint method [28] Ioannidis, J. P. (2008). Why most discovered true asso- for genome-wide association studies by imputation ciations are inflated. Epidemiology 19 640–648. of genotypes. Nat. Genet. 39 906–913. [29] Ioannidis, J. P. (2008). Effect of formal statistical sig- [47] Mitchell, A. A., Cutler, D. J. and Chakravarti, A. nificance on the credibility of observational associ- (2003). Undetected genotyping errors cause appar- ations. Am. J. Epidemiol. 168 374–383; discussion ent overtransmission of common alleles in the trans- 384–390. mission/disequilibrium test. Am. J. Hum. Genet. 72 [30] Ioannidis, J. P., Patsopoulos, N. A. and Evan- 598–610. Monsees Tamimi Kraft gelou, E. (2007). Uncertainty in heterogeneity es- [48] , G., , R. and , P. (2009). timates in meta-analyses. Bmj. 335 914–916. Genome-wide association scans for secondary traits [31] Ioannidis, J. P. et al. (2008). Assessment of cumulative using case-control studies. Genet. Epidemiol. 33 evidence on genetic associations: Interim guidelines. 717–728. Mutsuddi Int. J. Epidemiol. 37 120–132. [49] , M. et al. (2006). Analysis of high-resolution [32] Kavvoura, F. K. and Ioannidis, J. P. (2008). Methods HapMap of DTNBP1 (Dysbindin) suggests no con- for meta-analysis in genetic association studies: A sistency between reported common variant associ- 79 review of their potential and pitfalls. Hum. Genet. ations and schizophrenia. Am. J. Hum. Genet. 903–909. 123 1–14. [50] Nothnagel, M. et al. (2009). A comprehensive evalua- [33] Kraft, P. and Cox, D. G. (2008). Study designs for tion of SNP genotype imputation. Hum. Genet. 125 genome-wide association studies. Adv. Genet. 60 163–171. 465–504. [51] Pe’er, I. et al. (2008). Estimation of the multiple testing [34] Kraft, P. et al. (2009). Beyond odds ratios— burden for genomewide association studies of nearly communicating disease risk based on genetic pro- all common variants. Genet. Epidemiol. 32 381–385. files. Nat. Rev. Genet. 10 264–269. [52] Price, A. L. et al. (2006). Principal components analy- [35] Lesnick, T. G. et al. (2007). A genomic pathway ap- sis corrects for stratification in genome-wide associ- proach to a complex disease: Axon guidance and ation studies. Nat. Genet. 38 904–909. Parkinson disease. PLoS Genet. 3 e98. [53] Prokopenko, I. et al. (2009). Variants in MTNR1B in- [36] Lettre, G., Lange, C. and Hirschhorn, J. N. (2007). fluence fasting glucose levels. Nat. Genet. 41 77–81. Genetic model testing and statistical power in [54] Russell, B. (1959). The Problems of Philosophy. Oxford population-based association studies of quantitative Univ. Press, London. traits. Genet. Epidemiol. 31 358–362. [55] Saccone, N. L. et al. (2008). In search of causal variants: Lettre [37] , G. et al. (2008). Identification of ten loci associ- Refining disease association signals using cross- ated with height highlights new biological pathways population contrasts. BMC Genet. 9 58. in human growth. Nat. Genet. 40 584–591. [56] Saxena, R. et al. (2007). Genome-wide association anal- Li Abecasis [38] , Y. and , G. (2006). Mach 1.0: Rapid haplo- ysis identifies loci for type 2 diabetes and triglyc- type reconstruction and missing genotype inference. eride levels. Science 316 1331–1336. Am. J. Hum. Genet. S79 2290. [57] Scott, L. J. et al. (2007). A genome-wide association Li [39] , Y. et al. (2009). Markov model for rapid haplotyping study of type 2 diabetes in Finns detects multiple and genotype imputation in genome-wide studies. susceptibility variants. Science 316 1341–1345. To appear. [58] Skol, A. D. et al. (2006). Joint analysis is more effi- [40] Lin, D. Y. and Zeng, D. (2008). Proper analysis of sec- cient than replication-based analysis for two-stage ondary phenotype data in case-control association genome-wide association studies. Nat. Genet. 38 studies. Genet. Epidemiol. 33 256–265. 209–213. [41] Lin, P. I. et al. (2007). No gene is an island: The flip–flop [59] The Wellcome Trust Case Control Consortium phenomenon. Am. J. Hum. Genet. 80 531–538. (2007). Genome-wide association study of 14,000 [42] Little, J. et al. (2009). STrengthening the REporting cases of seven common diseases and 3,000 shared of Genetic Association Studies (STREGA): An ex- controls. Nature 447 661–678. tension of the STROBE statement. PLoS Med. 6 [60] Thomas, D. et al. (2009). Methodological issues in multi- e22. stage genome-wide association studies. Statist. Sci. [43] Loos, R. J. et al. (2008). Common variants near MC4R 24 414–429. are associated with fat mass, weight and risk of obe- [61] Trikalinos, T. A. et al. (2008). Meta-analysis methods. sity. Nat. Genet. 40 768–775. Adv. Genet. 60 311–334. 14 P. KRAFT, E. ZEGGINI AND J. P. A. IOANNIDIS

[62] Udler, M. S. et al. (2009). FGFR2 variants and [70] Yu, K. et al. (2007). Flexible design for following up breast cancer risk: Fine-scale mapping using African positive findings. Am. J. Hum. Genet. 81 540–551. American studies and analysis of chromatin confor- [71] Zeggini, E. and Ioannidis, J. P. (2009). Meta-analysis mation. Hum. Mol. Genet. 18 1692–1703. in genome-wide association studies. Pharmacoge- [63] Vineis, P. et al. (2009). A field synopsis on low- nomics 10 191–201. penetrance variants in DNA repair genes and cancer [72] Zeggini, E. et al. (2006). Variation within the gene en- susceptibility. J. Natl. Cancer. Inst. 101 24–36. coding the upstream stimulatory factor 1 does not [64] Wacholder, S. et al. (2004). Assessing the probabil- influence susceptibility to type 2 diabetes in sam- ity that a positive report is false: An approach for ples from populations with replicated evidence of molecular epidemiology studies. J. Natl. Cancer. linkage to chromosome 1q. Diabetes 55 2541–2548. Inst. 96 434–442. [73] Zeggini, E. et al. (2007). Replication of genome-wide [65] Wakefield, J. (2007). A Bayesian measure of the prob- association signals in UK samples reveals risk loci ability of false discovery in genetic epidemiology for type 2 diabetes. Science 316 1336–1341. studies. Am. J. Hum. Genet. 81 208–227. [74] Zeggini, E. et al. (2008). Meta-analysis of genome-wide [66] Weedon, M. N. et al. (2008). Genome-wide associa- association data and large-scale replication identi- tion analysis identifies 20 loci that influence adult fies additional susceptibility loci for type 2 diabetes. height. Nat. Genet. 40 575–583. Nat. Genet. 40 638–645. [67] Willer, C. J. et al. (2009). Six new loci associated with [75] Zhong, H. and Prentice, R. L. (2008). Bias-reduced body mass index highlight a neuronal influence on estimators and confidence intervals for odds ratios body weight regulation. Nat. Genet. 41 25–34. in genome-wide association studies. Biostatistics 9 [68] Wray, N. R., Goddard, M. E. and Visscher, P. 621–634. M. (2007). Prediction of individual genetic risk [76] Zollner, S. and Pritchard, J. K. (2007). Overcoming to disease from genome-wide association studies. the winner’s curse: Estimating penetrance parame- Genome. Res. 17 1520–1528. ters from case-control data. Am. J. Hum. Genet. 80 [69] Xiao, R. and Boehnke, M. (2009). Quantifying and cor- 605–615. recting for the winner’s curse in genetic association studies. Genet. Epidemiol. 33 453–462.