Estimating the number of unseen variants in the human genome

Iuliana Ionita-Laza1, Christoph Lange, and Nan M. Laird

Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved January 7, 2009 (received for review August 8, 2008)

The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals necessary to sequence in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.

beta-binomial model | CNVs | sequence data | SNP

Author contributions: I.I.-L. and N.M.L. designed research; I.I.-L. and N.M.L. performed research; I.I.-L. and C.L. contributed new reagents/analytic tools; I.I.-L., C.L., and N.M.L. analyzed data; and I.I.-L. and N.M.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
1To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0807815106/DCSupplemental.
© 2009 by The National Academy of Sciences of the USA

A main goal of the various human genome projects is to discover genetic variants in human genomes. The HapMap project (http://www.hapmap.org/) has contributed much to our understanding of the underlying genetic variation in diverse human populations and has facilitated the discovery of many loci robustly associated with common human diseases, such as diabetes, obesity, breast cancer, and many others (1–5). The recently launched 1000 Genomes Project (http://www.1000genomes.org/), by sequencing 1,000 genomes, aims to discover much of the existent common variation (frequency at least 1%), including both single-nucleotide polymorphisms (SNPs) and the less explored copy-number variants (6).

In this article, we provide a systematic way to predict the number of new variants with a specified minimum frequency to be identified in future datasets of specified sizes. In particular, based on sequence data for a set of individuals, can we predict how many more new variants will be found if we were to sequence a new set of individuals of a given size? This question is related to the species problem in ecology, where one is interested in estimating the number of species in a closed population. A particular example is estimating the number of words Shakespeare knew but did not use. Specifically, if a new volume by Shakespeare were to be discovered, how many new words would we expect to see? Efron and Thisted (7) used a Gamma-Poisson model to address this question. We adapt the approach in Efron and Thisted to the problem of predicting the number of genetic variants yet to be identified in future studies. The method also allows calculation of the number of individuals required to be sequenced in order to detect all (or a fraction of) the variants with a given minimum frequency. In the following sections we develop the method and show applications to several available sequence datasets: ENCODE, SeattleSNPs, and National Institute of Environmental Health Sciences (NIEHS) SNPs. Although these datasets contain only SNPs, the method can be applied to counting other types of variants, including copy-number variants.

1. Methods

First, we introduce some notation. For our purposes, an individual shows variation at a particular position if the corresponding allele is different from the ancestral allele. We say that a position is variable (or is a variant) in a sample if there is at least one individual in the sample that shows variation at that position.

Suppose we have data on $N_{ind}$ individuals; for example, each of the $N_{ind}$ individuals has been sequenced in a genomic region, and hence for each position we know whether an individual shows variation or not.

We follow the notation in ref. 7. Suppose the total number of variable positions in the genome is an unknown, fixed number, denoted here by $N$. Let $f_s$ be the unobserved frequency of variable position $s$, and let $x_s$ be the number of times variable position $s$ has been observed in the $N_{ind}$ individuals in our dataset. Then, $x_s \sim \mathrm{Bin}(N_{ind}, f_s)$. Of course, we can only observe those variable positions with $x_s > 0$. We assume that $f_s \sim \mathrm{Beta}(a, b)$. The Beta prior is not only mathematically convenient, but a good approximation for the distribution of allele frequencies at biallelic markers under a neutral selection and mutation-drift equilibrium model, as Wright (8) showed.

Let $n_x$ be the number of positions with exactly $x$ individuals showing variation at a position. Hence, $n_1$ is the number of variants that occur in only one individual, etc., and $\sum_{x=1}^{N_{ind}} n_x$ is the total number of variants observed. Also, let $\eta_x = E(n_x)$.

As in ref. 7, we want to estimate $\Delta(t)$, i.e., the number of new variants to be found in the next $t \cdot N_{ind}$ individuals (if we were to perform a new sequencing study of that size). For $t \ge 0$, we can write

$$\Delta(t) = N \int_0^1 (1-\theta)^{N_{ind}} \left[1 - (1-\theta)^{t \cdot N_{ind}}\right] f(\theta)\, d\theta = \eta_1 \cdot \frac{\int_0^1 (1-\theta)^{N_{ind}} \left[1 - (1-\theta)^{t \cdot N_{ind}}\right] f(\theta)\, d\theta}{N_{ind} \int_0^1 \theta\, (1-\theta)^{N_{ind}-1} f(\theta)\, d\theta}, \qquad [1]$$

where $f(\theta) = \theta^{a-1}(1-\theta)^{b-1}/B(a, b)$ is the density function for the Beta distribution and $B(a, b)$ is the beta function. The second equality uses $\eta_1 = N \cdot N_{ind} \int_0^1 \theta\, (1-\theta)^{N_{ind}-1} f(\theta)\, d\theta$.

Inserting the form for $f(\theta)$ into Eq. 1 we obtain:

$$\Delta(t) = \frac{\eta_1}{N_{ind}} \cdot \frac{B(a, N_{ind}+b) - B(a, (t+1) \cdot N_{ind}+b)}{B(a+1, N_{ind}+b-1)}.$$

Note that $\Delta(0) = 0$. Also, as $t \to \infty$,

$$\Delta(t) \to \eta_1 \cdot \frac{N_{ind}+b-1}{a \cdot N_{ind}}.$$

As expected, $\Delta(\infty) \to \infty$ when $a \to 0$. Detailed derivations are shown in the supporting information (SI) Appendix.

In addition, we can calculate $\Delta_f(t)$, i.e., the number of new variants with frequency at least $f$ expected to be found in a new set of $t \cdot N_{ind}$ individuals, as follows:

$$\Delta_f(t) = \frac{\eta_1}{N_{ind} \cdot B(a+1, N_{ind}+b-1)} \Big\{ B(a, N_{ind}+b)\,\big[1 - \mathrm{CDF}(f, a, N_{ind}+b)\big] - B(a, (t+1) \cdot N_{ind}+b)\,\big[1 - \mathrm{CDF}(f, a, (t+1) \cdot N_{ind}+b)\big] \Big\},$$

where CDF is the cumulative distribution function of the beta distribution. Detailed derivations are shown in SI Appendix.

It is also possible to predict the number of individuals necessary to capture essentially all variants with frequency at least $f$. This amounts to finding the smallest value of $t$ such that […] for some small, fixed threshold. Next, we show how we estimate parameters $a$ and $b$ from the available data.

1.1. Estimation of the Parameters for the Beta Distribution. To estimate $a$ and $b$, we use maximum-likelihood estimation. The probability that exactly $x$ individuals show variation at a position is:

$$P_x = \binom{N_{ind}}{x} \int_0^1 \theta^{x} (1-\theta)^{N_{ind}-x} f(\theta)\, d\theta = \binom{N_{ind}}{x} \frac{B(x+a, N_{ind}-x+b)}{B(a, b)}$$

for $x \ge 0$. As we already mentioned, the zero class is not observed and hence we need to fit the zero-truncated beta-binomial distribution. The probability distribution function for this truncated distribution becomes:

$$P_x^t = \frac{P_x}{1 - P_0}$$

for $x \ge 1$. The likelihood function can then be written as:

$$L(a, b) = \prod_{x=1}^{N_{ind}} \left(P_x^t\right)^{n_x},$$

and the log-likelihood function is:

$$LL(a, b) = \sum_{x=1}^{N_{ind}} n_x \log P_x^t.$$

We maximize $LL(a, b)$ to obtain the maximum-likelihood estimators (MLEs) for $a$ and $b$. The maximization is carried out through the Newton–Raphson method.

Note that in the likelihood function above, we assume that markers are in linkage equilibrium. Dependence among nearby markers makes the effective number of markers in the region look smaller, but does not lead to systematic bias.

2. Applications

We now present results when applying this method to several sequence datasets: ENCODE, SeattleSNPs, and NIEHS SNPs. Although these examples concern SNPs, the method is equally applicable to other types of variants, including copy-number variants, provided high-quality data are available.

2.1. ENCODE Data Application. We first applied the method to the sequencing data from the ENCODE project (http://www.hapmap.org/downloads/encode1.html.en). Ten 500-kb regions of the genome were sequenced in 48 unrelated DNA samples: 16 Yoruba (YRI), 16 CEPH European (CEPH), 8 Han Chinese (CHB), and 8 Japanese (JPT).† These regions were chosen to be representative of the genome in general, including various chromosomes, recombination rates, gene density, and values of nontranscribed conservation with mouse.

For a large proportion of SNPs, i.e., 88%, the ancestral allele is known and can be found from dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/). For those SNPs where the ancestral allele is not known, we consider the ancestral allele to be the major allele at the locus. This is reasonable, because the major allele at a SNP is typically the ancestral allele in the absence of selection (9, 10).

To make results comparable across the 4 populations (YRI, CEPH, CHB, and JPT), we considered only 7 individuals for each dataset (only 7 individuals were sequenced in CHB).

† For the YRI population, data on only 8 individuals were available, and similarly for the CHB population, data on only 7 individuals were available.

PNAS | March 31, 2009 | vol. 106 | no. 13 | 5008–5013 | www.pnas.org/cgi/doi/10.1073/pnas.0807815106
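The zero-truncated beta-binomial fit of Section 1.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`log_px`, `log_likelihood`, `fit_beta`) are hypothetical. A derivative-free compass search over (log a, log b) is swapped in for the Newton–Raphson iteration mentioned in the text, purely to keep the sketch free of external dependencies.

```python
import math


def log_beta(p, q):
    """Logarithm of the beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def log_px(x, n_ind, a, b):
    """log P_x: probability that exactly x of n_ind individuals show variation."""
    log_choose = (math.lgamma(n_ind + 1) - math.lgamma(x + 1)
                  - math.lgamma(n_ind - x + 1))
    return log_choose + log_beta(x + a, n_ind - x + b) - log_beta(a, b)


def log_likelihood(counts, a, b):
    """LL(a, b) = sum_x n_x log P_x^t, with counts[x - 1] = n_x for x = 1..N_ind."""
    n_ind = len(counts)
    # Zero truncation: P_x^t = P_x / (1 - P_0).
    log_norm = math.log(-math.expm1(log_px(0, n_ind, a, b)))
    return sum(n * (log_px(x, n_ind, a, b) - log_norm)
               for x, n in enumerate(counts, start=1) if n > 0)


def fit_beta(counts, a0=0.5, b0=0.5, step=1.0, tol=1e-6):
    """Maximize LL by compass search on (log a, log b).

    Assumption: this replaces the Newton-Raphson method named in the text;
    it only needs function values, and it keeps a and b positive.
    """
    la, lb = math.log(a0), math.log(b0)
    best = log_likelihood(counts, a0, b0)
    while step > tol:
        improved = False
        for da, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = log_likelihood(counts, math.exp(la + da), math.exp(lb + db))
            if cand > best:
                la, lb, best, improved = la + da, lb + db, cand, True
        if not improved:
            step /= 2.0  # shrink the search pattern once no move helps
    return math.exp(la), math.exp(lb)
```

Here `counts` is the observed frequency spectrum $(n_1, \ldots, n_{N_{ind}})$. Working on the log scale is a design choice that avoids explicit positivity constraints on $a$ and $b$.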

Fig. 1. YRI, CEPH, CHB, and JPT: Estimates of $\Delta_f(t)$ for values of $t \in [0.1, 16]$ and values of $f \in [0, 0.1]$ for the four populations.
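The quantities plotted in Fig. 1 can be evaluated directly from the beta-function expressions for $\Delta(t)$ and $\Delta_f(t)$. The sketch below is illustrative only. The function names are hypothetical, and the term $B(a, m)[1 - \mathrm{CDF}(f, a, m)]$ is computed by Simpson's rule after the substitution $u = \log\theta$, which is an implementation choice of this sketch rather than anything specified in the article.

```python
import math


def log_beta(p, q):
    """Logarithm of the beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def delta(t, eta1, n_ind, a, b):
    """Delta(t): expected number of new variants in t * N_ind more individuals."""
    scale = eta1 / (n_ind * math.exp(log_beta(a + 1.0, n_ind + b - 1.0)))
    unseen_now = math.exp(log_beta(a, n_ind + b))
    unseen_later = math.exp(log_beta(a, (t + 1.0) * n_ind + b))
    return scale * (unseen_now - unseen_later)


def delta_limit(eta1, n_ind, a, b):
    """Limit of Delta(t) as t grows: eta1 * (N_ind + b - 1) / (a * N_ind)."""
    return eta1 * (n_ind + b - 1.0) / (a * n_ind)


def upper_integral(f, a, m, steps=4000):
    """Integral of theta^(a-1) (1-theta)^(m-1) over [f, 1].

    Equals B(a, m) * (1 - CDF(f; a, m)). Requires 0 < f < 1, m > 1,
    and an even number of Simpson steps.
    """
    lo = math.log(f)
    h = -lo / steps

    def g(u):
        # Integrand after substituting theta = exp(u).
        return math.exp(a * u) * max(0.0, -math.expm1(u)) ** (m - 1.0)

    total = g(lo) + g(0.0)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0


def delta_f(t, f, eta1, n_ind, a, b):
    """Delta_f(t): expected new variants with frequency at least f."""
    scale = eta1 / (n_ind * math.exp(log_beta(a + 1.0, n_ind + b - 1.0)))
    return scale * (upper_integral(f, a, n_ind + b)
                    - upper_integral(f, a, (t + 1.0) * n_ind + b))
```

To trace one curve of the kind shown in Fig. 1, `delta_f` would be evaluated on a grid of `t` values for a fixed `f`, with `a` and `b` set to the fitted values and `eta1` replaced by the observed number of singletons $n_1$ (that substitution is an assumption of this sketch).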

In Fig. 1, for each population we plot the expected number of new variants with a specified minimum frequency to be discovered in future datasets, of sizes $t \cdot N_{ind}$ with $t \in [0.1, 16]$.‡ Overall, the results are as expected, with the African population the most diverse and the Asian populations the least diverse. Interestingly, the Chinese population is predicted to be more diverse than the Japanese population; we observe the same diversity pattern in an independent dataset, the NIEHS SNPs data, in a following section. The MLEs for $a$ and $b$ are: $(\hat a_{YRI}, \hat b_{YRI}) = (0.07, 0.97)$, $(\hat a_{CEU}, \hat b_{CEU}) = (0.14, 0.70)$, $(\hat a_{CHB}, \hat b_{CHB}) = (0.22, 0.77)$, and $(\hat a_{JPT}, \hat b_{JPT}) = (0.35, 0.85)$.

‡ We report these predicted numbers in a table in the SI Appendix.

We also estimated the minimum number of individuals necessary to sequence in order to detect all (or at least 80% of) the variation with a specified minimum frequency in each of the four ENCODE populations (Table 1). It is interesting to note the large difference between the number of individuals necessary to find all variants with frequency at least 0.001, e.g., 3,521 for the CEPH population, and the number of individuals necessary to detect 80% of all variants, e.g., 154 for CEPH; finding 99% of the variants in CEPH requires 1,008 individuals. The reason is that, as we add individuals, new variants are identified. The discovery process is very efficient in the beginning, but after many individuals are sequenced, each additional individual contributes less and less to the pool of the newly discovered variants (Fig. 2). This saturation effect is also noticeable in Fig. 1, especially for variants with a frequency of at least 0.05 or 0.10. For rarer variants (frequency <0.05) the asymptote is not yet reached for $t \le 16$.

Table 1. The minimum number of individuals necessary to sequence in order to capture all the variants with frequency at least f

Population   f = 0.001     f = 0.005   f = 0.01   f = 0.05   f = 0.10
YRI (7)      3,892 | 175   798 | 63    406 | 42   91 | 21    49 | <14
CEPH (7)     3,521 | 154   735 | 63    378 | 42   84 | 21    49 | <14
CHB (7)      3,241 | 140   693 | 56    357 | 42   84 | 21    49 | <14
JPT (7)      2,996 | 112   665 | 56    350 | 42   84 | 21    49 | <14

Each cell lists the number needed to capture all variants, followed by the number needed to capture 80%. Also shown is the minimum number of individuals necessary to sequence in order to capture 80% of the variants with a minimum frequency of f. The predictions are done based on 7 individuals already sequenced for each of four populations in the ENCODE project.

2.2. SeattleSNPs Data Application (Inflammatory Response Genes). We applied our method to a second dataset, the sequencing data available from the SeattleSNPs web site (http://pga.gs.washington.edu/). We used data on 24 Yoruban (YRI) and 23 CEPH unrelated samples (mostly different from the HapMap samples in the ENCODE dataset, with 4 YRI samples and 6 CEPH samples in common between the two datasets) sequenced at ∼1.6 Mb of reference sequence covering 76 genes related to inflammatory response (11) (the sequenced regions are different from the regions in the ENCODE dataset).

Given that the regions sequenced in this dataset were not overlapping with the regions in the ENCODE dataset, we used the ENCODE dataset to predict the total number of SNPs expected to exist in the SeattleSNPs dataset. To make the two datasets completely independent, we removed the 4 YRI samples and the 6 CEPH samples (shared between the two datasets) from the SeattleSNPs dataset, leaving us with 20 YRI samples and 17 CEPH samples. We ignored the small percentage of variations that were small diallelic insertion-deletions, and only considered SNPs. In total, we identified 6,494 SNPs in the YRI samples and 3,876 in the CEPH samples.

For YRI, given that 8 individuals were sequenced as part of the ENCODE project, the predicted number of SNPs in 20 individuals is simply the observed number of SNPs in the 8 sequenced individuals plus the estimated number of new variants to exist in 12 new individuals, i.e., $\Delta(1.5)$, which gives an estimated 12,973 SNPs. Similarly, for CEPH, given that 16 individuals were sequenced, the predicted number of SNPs in 17 individuals is the observed number of SNPs in 16 individuals plus $\Delta(0.06)$, resulting in an estimate of 10,225 SNPs.
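The kind of sample-size question summarized in Table 1 can be explored with a small simulation. The sketch below is hedged on two points. First, it uses a plain "expected fraction captured" criterion, which is an assumption and not necessarily the stopping rule used in the article. Second, it draws frequencies from Beta(a, b) by Monte Carlo instead of using the analytic expressions, so it should not be expected to reproduce the table entries. The function names are hypothetical.

```python
import random


def capture_fraction(n, freqs):
    """Expected share of variants that show variation in at least one of
    n sequenced individuals, given each variant's frequency."""
    return sum(1.0 - (1.0 - theta) ** n for theta in freqs) / len(freqs)


def individuals_needed(a, b, f, target, n_draws=20000, seed=0):
    """Smallest n whose expected capture fraction reaches `target` for
    variants with frequency >= f (requires f > 0 and 0 < target < 1)."""
    rng = random.Random(seed)
    draws = (rng.betavariate(a, b) for _ in range(n_draws))
    freqs = [theta for theta in draws if theta >= f]
    if not freqs:
        raise ValueError("no simulated variant has frequency >= f")
    # capture_fraction increases with n: double to bracket, then bisect.
    hi = 1
    while capture_fraction(hi, freqs) < target:
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if capture_fraction(mid, freqs) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

For example, `individuals_needed(0.14, 0.70, 0.01, 0.80)` uses the CEU estimates quoted above. Raising `target` toward 1 makes the required sample size grow quickly, which is the diminishing-returns behaviour described in the text.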

5010 www.pnas.org / cgi / doi / 10.1073 / pnas.0807815106 Ionita-Laza et al. Downloaded by guest on September 26, 2021 Downloaded by guest on September 26, 2021 h rdce ubro Nsi 7idvdasi h observed the is individuals plus 17 individuals 16 in in SNPs SNPs of number of number predicted the respectively. individuals, 1,000 and 150 at drawn are lines horizontal The 0.001. 2. Fig. Ionita-Laza fteHpa rjc 1) h rdce ubro Nsin SNPs 10,225 of /0.9 12,973 individuals, CEPH number becomes predicted individuals the Phase , YRI in (12)] 20 SNPs Project QC+ HapMap of the proportion of the passed I was SNPs this all [as of checks QC 90% quality were that total SNPs Assuming observed the the filtered. represent region; (quality-control) the not in do found SNPs these of However, number SNPs. 10,225 of mate r l eae oiflmaoyrsos.Smith response. data inflammatory SeattleSNPs to the related in all sequenced are genes were recombination the various whereas regions regions, etc.), rates, the nongene and data, the gene of features (both ENCODE various sequenced genome encompass the to regions fashion, For broader the a in projects. of selected two the plausible the with One in do (13). before to noticed has been with explanation has compared data data data. SeattleSNPs ENCODE the SeattleSNPs in the the variants in rare of numbers excess observed An the than data smaller ENCODE the are on based predicted observation. numbers the interesting that an appears reveal It does it caveats, many to region, ceptible 2). (Table 1.6-Mb CEPH in a SNPs for 3,635 and hence, YRI in long; SNPs Mb 4,612 predict 5 we is project ENCODE the ihteaeaedvriyi h eoe o h R hnfor than YRI the for compared genome) (when the larger in CEPH. the even diversity average is diversity the that genes with suggest related may inflammatory CEPH in with compared (Seat- data) YRI the in SNPs tleSNPs between of number difference observed the larger and number predicted much gene The 35 considered. 
among diversity categories sequence highest and disequilibrium link- lowest age the have genes immune-related and inflammatory that lhuhsc rs-ape rs-eincmaio ssus- is comparison cross-region cross-sample, a such Although R,CP,CB n P:Tenme fidvdasncsayt eunei re odtc ecnaeo l h aito ihfeunya least at frequency with variation the all of percentage a detect to order in sequence to necessary individuals of number The JPT: and CHB, CEPH, YRI, tal. et /0.9 = 131 h einsqecdin sequenced region The 11,361. 00) eutn na esti- an in resulting (0.06), = 444adi h 17 the in and 14,414 tal. et 20)found (2005) in eesbtnilysalrta h orsodn estimates corresponding the than smaller substantially were tions neetn omninta h Lsof MLEs the perhaps that is mention it ENCODE context, to the this In interesting on sample. African based the in predictions especially SNPs data, the NIEHS in with the diversity in compared greater noticeable sample dataset dataset, is Asian SeattleSNPs genes the response the in environmental with 12,036 As and 2). sample used sample, (Table European We African the the sample. in in Asian SNPs 18,224 14,147 the predict to in in data 16,644 14,407 ENCODE sample, the and European sample, the Hispanic in SNPs 15,116 the 26,114 sample, identified African we the summary, In in gene. of each Asian descent, number in of the African found counted 24 We SNPs Japanese). and of 12 descent, individuals and Hispanic Chinese 27 of (12 descent 22 in descent, Mb) European 5.95 of 22 of length (for total sequenced a been have genes response environmental 293 (15), IH aaAin(24) Asian Data NIEHS (22) European Data NIEHS (27) African Data NIEHS (17) CEPH Data SeattleSNPs (20) YRI Data SeattleSNPs Dataset SNPs in NIEHS SNPs of and number SeattleSNPs predicted datasets: vs. two SNPs of number Observed 2. Table SNPs. NIEHS 2.3. ubro niiul o hc eunedt eeaalbei hw in shown is available were data sequence which parentheses. 
for individuals of number h rdcin r oebsdo h NOEdt.Frec ape the sample, each For data. ENCODE the on based done are predictions The nadfeetsuy(http://egp.gs.washington.edu/), study different a In PNAS ac 1 2009 31, March Observed 14,407 26,114 15,116 6,494 3,876 a o.106 vol. o h orpopula- four the for o 13 no. Predicted 18,224 14,147 12,036 4,612 3,635 5011
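The two SeattleSNPs entries in the "Predicted" column of Table 2 follow from the arithmetic described in Section 2.2: divide the ENCODE-based prediction by the assumed 90% QC pass rate, then scale linearly from the 5-Mb ENCODE regions to the 1.6-Mb SeattleSNPs regions. A minimal sketch of that calculation, with a hypothetical helper name:

```python
def rescale_prediction(predicted_qc_pass, qc_pass_rate, source_mb, target_mb):
    """Turn a prediction of QC-passing SNPs in a source region into a
    prediction of all SNPs in a target region of a different length."""
    total_in_source = predicted_qc_pass / qc_pass_rate  # undo QC filtering
    return total_in_source * target_mb / source_mb      # linear length scaling


yri = rescale_prediction(12973, 0.9, source_mb=5.0, target_mb=1.6)
ceph = rescale_prediction(10225, 0.9, source_mb=5.0, target_mb=1.6)
print(int(yri), int(ceph))
```

Truncating the results to whole SNPs gives 4,612 and 3,635, the values reported in Table 2. The linear scaling with region length is the article's own simplification and is kept as-is here.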

In the NIEHS SNPs dataset, it is perhaps interesting to mention that the MLEs of $a$ for the four populations were substantially smaller than the corresponding estimates in the ENCODE dataset: $a_{African} = 0.036$, $a_{European} = 0.076$, and $a_{Asian} = 0.04$ (unlike the MLEs for $b$, which are similar between the two datasets). This observation can be interpreted as saying that on average variants in the regions sequenced in the NIEHS SNPs dataset have lower frequencies than those in the ENCODE dataset. Also, the Japanese population shows less variation than the Chinese population, as we previously found with the ENCODE dataset (SI Appendix).

2.4. Cross-Validation. Validation of the method using different datasets has inherent problems, and discrepancies may simply be a result of different underlying characteristics for the datasets, such as sequence data from different genomic regions, different sequencing accuracies, etc. To avoid these pitfalls, we tried to assess the performance of the method, using the same data. To this end we performed cross-validation for the African, European, and Asian populations in both the ENCODE and the NIEHS datasets. We randomly split each sample into a training set and a testing set of roughly equal sizes, a thousand times. For ENCODE, the average error in prediction is between 2 and 5% for the three populations, whereas for NIEHS, the average error is between 8 and 12% (see Tables 3 and 4).

Table 3. Cross-validation in the ENCODE dataset

Dataset                        Observed       Predicted
ENCODE Data YRI (4/4)          2,073 ± 130    1,968 ± 111
ENCODE Data CEU (8/8)          1,117 ± 127    1,142 ± 69
ENCODE Data CHB+JPT (8/7)        688 ± 86       723 ± 61

Observed number of new SNPs vs. predicted number of new SNPs in the testing set (mean ± standard deviation). For each sample, the number of individuals in the training set and the testing set are shown in parentheses.

Table 4. Cross-validation in the NIEHS SNPs dataset

Dataset                        Observed       Predicted
NIEHS Data African (14/13)     5,090 ± 179    4,704 ± 85
NIEHS Data European (11/11)    2,584 ± 127    2,304 ± 84
NIEHS Data Asian (12/12)       2,466 ± 98     2,226 ± 69

Observed number of new SNPs vs. predicted number of new SNPs in the testing set (mean ± standard deviation). For each sample, the number of individuals in the training set and the testing set are shown in parentheses.

3. Discussion

In this article, we proposed a method to predict the number of new, i.e., not yet seen, variants with a specified minimum frequency to be identified in a new dataset, comprising sequence data for a set of individuals. In addition, our method also allows prediction of the minimum number of individuals necessary to sequence in order to detect at least a specified fraction of all the variants. This is particularly relevant, because, in the context of genome-wide association studies, once a genomic region has been identified to be associated with disease, subsequent sequencing studies are necessary to determine the causal variant.

We applied the approach to three sequence datasets: ENCODE, SeattleSNPs, and NIEHS SNPs datasets. The results are consistent with the expected diversity pattern, with the African population having the largest number of SNPs, the Asian populations the least, and the European population being in-between. Notably, both the ENCODE and NIEHS SNPs datasets support the hypothesis that the Japanese are less diverse (in terms of number of variants) than the Chinese. The number of individuals necessary to capture all common variation (frequency at least 1%) is small and the 1000 Genomes Project is likely to find most of them, subject to sequence accuracy. Even for the rarer variants (frequency at least 0.1%), a large proportion of them can be found with small samples (in the low hundreds), but to find all of them, thousands of individuals are necessary.

Our method is based on a parametric assumption, namely the beta distribution as an approximation of the frequency distribution for biallelic variants. As we mentioned previously, this approximation is reasonable following population-genetics arguments and has been frequently used in literature (see for example ref. 16; see also the SI Appendix for goodness-of-fit results for the ENCODE and NIEHS SNPs datasets). Simulation results (see SI Appendix) show that our predictions are accurate as long as the Beta model holds; the accuracy does decrease as $t$ increases toward infinity and as the frequency threshold $f$ decreases toward 0. As argued in ref. 7, as $t \to \infty$ and $f \to 0$, $\Delta_f(t)$ is going to be an underestimate of the true number of variants. This in turn implies that the estimated number of individuals necessary to sequence to capture a certain fraction of the variation with minimum frequency $f$ is going to be an underestimate, especially when $f \to 0$; the bias gets smaller as we restrict attention to more common variants.

The Beta assumption is inherently a simplification of reality. Deviations from this model, for example, due to natural selection, will affect our predictions. In the regions sequenced in the NIEHS dataset, there is an excess of rare variants. As expected in this situation, our predictions represent underestimates of the real numbers. Indeed, cross-validation within the NIEHS dataset shows that our method systematically underestimates the true number of variants by ∼8–12% on average. By contrast, cross-validation within the smaller ENCODE dataset leads to a smaller average error (2–5%), with no systematic bias. It is important to emphasize that the sample sizes in our examples are fairly small (a few individuals for the ENCODE dataset), and that the prediction error is expected to become smaller as the original number of sequenced individuals becomes larger. In addition to cross-validation, we also predicted the number of variants in the two datasets, SeattleSNPs and NIEHS SNPs, based on the ENCODE data. Overall, the predictions for the European and the Asian sample were ∼7–17% lower, whereas for the African sample the prediction was ∼40% lower. Although cross-sample and cross-region predictions have many caveats, here the underestimation in our predictions is, in part, likely due to the expected higher diversity in environmental response genes compared with the average genome. Moreover, the much larger prediction error for the African population may be explained by African people being even more diverse in these regions, compared with European and Asian people, as a result of geographically localized patterns of selection. This is especially plausible because the cross-validation error in the NIEHS dataset for the African sample was much smaller, i.e., ∼8%, and not different for the African population compared with the European or Asian populations.

Future extensions of this method include incorporation of selection effects, sequencing errors, and nonparametric approaches. The question is whether it is possible to remove the parametric (beta model) assumption on the frequency distribution and recalculate $\Delta(t)$ in a nonparametric fashion. It would then be interesting to assess how the nonparametric approach compares with the parametric approach we developed in this article.

Although we applied the approach to SNP data, the method applies equally well to counting other types of variants, including copy-number variants. This is particularly useful because currently much less is known about copy-number variants than about SNPs.

ACKNOWLEDGMENTS. We thank two reviewers for comments that helped improve the manuscript. This work was supported by National Institutes of Health/National Institute of Mental Health R01 MH59532.
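One round of the cross-validation described in Section 2.4 can be sketched as follows. The data layout (one set of carrier indices per variable position) and the function name are assumptions made for illustration. The sketch only produces the training-set spectrum, the observed count of new variants in the testing set, and the ratio $t$. The prediction to compare against would come from fitting $a$ and $b$ on the training spectrum and evaluating $\Delta(t)$.

```python
import random


def split_once(carriers, n_individuals, rng):
    """One random training/testing split of roughly equal sizes.

    carriers: list of sets, one per variable position, holding the indices
    of the individuals that show variation there (hypothetical layout).
    Returns (spectrum, observed_new, t), where spectrum[x - 1] = n_x in the
    training set and t = testing size / training size.
    """
    ids = list(range(n_individuals))
    rng.shuffle(ids)
    half = n_individuals // 2
    train, test = set(ids[:half]), set(ids[half:])
    spectrum = [0] * len(train)
    observed_new = 0
    for who in carriers:
        x = len(who & train)
        if x > 0:
            spectrum[x - 1] += 1      # position already variable in training
        elif who & test:
            observed_new += 1         # variable only in the testing set
    return spectrum, observed_new, len(test) / len(train)


carriers = [{0}, {1, 2}, {3}, {0, 1, 2, 3}]
spectrum, observed_new, t = split_once(carriers, 4, random.Random(1))
```

Repeating `split_once` many times (the article uses a thousand splits) and averaging the observed and predicted counts would give summaries of the form reported in Tables 3 and 4.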

1. Klein RJ, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308:385–389.
2. Duerr RH, et al. (2006) A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314:1461–1463.
3. Herbert A, et al. (2006) A common genetic variant is associated with adult and childhood obesity. Science 312:279–283.
4. Hunter DJ, et al. (2007) A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39:870–874.
5. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678.
6. Sebat J, et al. (2004) Large-scale copy number polymorphism in the human genome. Science 305:525–528.
7. Efron B, Thisted R (1976) Estimating the number of unknown species: How many words did Shakespeare know? Biometrika 63:435–437.
8. Wright S (1951) The general structure of populations. Ann Eugenics 15:323–354.
9. Watterson GA, Guess HA (1977) Is the most frequent allele the oldest? Theor Popul Biol 11:141–160.
10. Hacia JG, et al. (1999) Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat Genet 22:164–167.
11. Bhangale TR, Rieder MJ, Nickerson DA (2008) Estimating coverage and power for genetic association studies using near-complete variation data. Nat Genet 40:841–843.
12. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320.
13. Wallace C, Dobson RJ, Munroe PB, Caulfield MJ (2007) Information capture using SNPs from HapMap and whole-genome chips differs in a sample of inflammatory and cardiovascular gene-centric regions from genome-wide estimates. Genome Res 17:1596–1602.
14. Smith AV, Thomas DJ, Munro HM, Abecasis GR (2005) Sequence features in regions of weak and strong linkage disequilibrium. Genome Res 15:1519–1534.
15. Nickerson DA, Rieder MJ, Crawford DC, Carlson CC, Livingston RJ (2005) An overview of the Environmental Genome Project. Essays on the Future of Environmental Health Research, 10.1289/ehp.7922.
16. Coram M, Tang H (2007) Improving population-specific allele frequency estimates by adapting supplemental data: an empirical Bayes approach. Ann Appl Stat 2:459–479.
