Estimating the Number of Unseen Variants in the Human Genome
Iuliana Ionita-Laza¹, Christoph Lange, and Nan M. Laird

Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, MA 02115

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved January 7, 2009 (received for review August 8, 2008)

The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric beta-binomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals necessary to sequence in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population in-between. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (>3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.

1000 Genomes Project | beta-binomial model | CNVs | sequence data | SNP

Author contributions: I.I.-L. and N.M.L. designed research; I.I.-L. and N.M.L. performed research; I.I.-L. and C.L. contributed new reagents/analytic tools; I.I.-L., C.L., and N.M.L. analyzed data; and I.I.-L. and N.M.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
¹To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0807815106/DCSupplemental.
© 2009 by The National Academy of Sciences of the USA
PNAS, March 31, 2009, vol. 106, no. 13, 5008–5013. www.pnas.org/cgi/doi/10.1073/pnas.0807815106

A main goal of the various human genome projects is to discover genetic variants in human genomes. The HapMap project (http://www.hapmap.org/) has contributed much to our understanding of the underlying genetic variation in diverse human populations and has facilitated the discovery of many loci robustly associated with common human diseases, such as diabetes, obesity, breast cancer, and many others (1–5). The recently launched 1000 Genomes Project (http://www.1000genomes.org/), by sequencing 1,000 genomes, aims to discover much of the existent common variation (frequency at least 1%), including both single-nucleotide polymorphisms (SNPs) and the less explored copy-number variants (6).

In this article, we provide a systematic way to predict the number of new variants with a specified minimum frequency to be identified in future datasets of specified sizes. In particular, based on sequence data for a set of individuals, can we predict how many more new variants will be found if we were to sequence a new set of individuals of a given size? This question is related to the species problem in ecology, where one is interested in estimating the number of species in a closed population. A particular example is estimating the number of words Shakespeare knew but did not use. Specifically, if a new volume by Shakespeare were to be discovered, how many new words would we expect to see? Efron and Thisted (7) used a Gamma-Poisson model to address this question. We adapt the approach in Efron and Thisted to the problem of predicting the number of genetic variants yet to be identified in future studies. The method also allows calculation of the number of individuals required to be sequenced in order to detect all (or a fraction of) the variants with a given minimum frequency. In the following sections we develop the method and show applications to several available sequence datasets: ENCODE, SeattleSNPs, and National Institute on Environmental Health Sciences (NIEHS) SNPs. Although these datasets contain only SNPs, the method can be applied to counting other types of variants, including copy-number variants.

1. Methods

First, we introduce some notation. For our purposes, an individual shows variation at a particular position if the corresponding allele is different from the ancestral allele. We say that a position is variable (or is a variant) in a sample if there is at least one individual in the sample that shows variation at that position.

Suppose we have data on $N_{\text{ind}}$ individuals; for example, each of the $N_{\text{ind}}$ individuals has been sequenced in a genomic region, and hence for each position we know whether an individual shows variation or not.

We follow the notation in ref. 7. Suppose the total number of variable positions in the human genome is an unknown, fixed number, denoted here by $N$. Let $f_s$ be the unobserved frequency of variable position $s$, and let $x_s$ be the number of times variable position $s$ has been observed in the $N_{\text{ind}}$ individuals in our dataset. Then, $x_s \sim \text{Bin}(N_{\text{ind}}, f_s)$. Of course, we can only observe those variable positions with $x_s > 0$. We assume that $f_s \sim \text{Beta}(a, b)$. The Beta prior is not only mathematically convenient, but a good approximation for the distribution of allele frequencies at biallelic markers under a neutral selection and mutation-drift equilibrium model, as Wright (8) showed.

Let $n_x$ be the number of positions with exactly $x$ individuals showing variation at a position. Hence, $n_1$ is the number of variants that occur in only one individual, etc., and $\sum_{x=1}^{N_{\text{ind}}} n_x$ is the total number of variants observed. Also, let $\eta_x = E(n_x)$.

As in ref. 7, we want to estimate $\Delta(t)$, i.e., the number of new variants to be found in the next $t \cdot N_{\text{ind}}$ individuals (if we were to perform a new sequencing study of that size). For $t \ge 0$, we can write

$$
\begin{aligned}
\Delta(t) &= N \int_0^1 \left[1 - (1-\theta)^{(t+1)\cdot N_{\text{ind}}}\right] f(\theta)\,d\theta - N \int_0^1 \left[1 - (1-\theta)^{N_{\text{ind}}}\right] f(\theta)\,d\theta \\
&= N \int_0^1 (1-\theta)^{N_{\text{ind}}} f(\theta)\,d\theta - N \int_0^1 (1-\theta)^{(t+1)\cdot N_{\text{ind}}} f(\theta)\,d\theta
\end{aligned}
\tag{1}
$$

where $f(\theta) = \theta^{a-1} \cdot (1-\theta)^{b-1} / B(a, b)$ is the density function for the Beta distribution and $B(a, b) = \int_0^1 \theta^{a-1} (1-\theta)^{b-1}\,d\theta$ is the beta function.

Inserting the form for $f(\theta)$ into Eq. 1 we obtain:

$$
\begin{aligned}
\Delta(t) &= \frac{N}{B(a, b)} \int_0^1 \theta^{a-1} (1-\theta)^{N_{\text{ind}}+b-1}\,d\theta - \frac{N}{B(a, b)} \int_0^1 \theta^{a-1} (1-\theta)^{(t+1)\cdot N_{\text{ind}}+b-1}\,d\theta \\
&= \frac{\eta_1}{a} \cdot \frac{N_{\text{ind}}+b-1}{N_{\text{ind}}} - \frac{\eta_1}{a} \cdot \frac{N_{\text{ind}}+b-1}{N_{\text{ind}}} \cdot \frac{B(a, (t+1)\cdot N_{\text{ind}}+b)}{B(a, N_{\text{ind}}+b)}
\end{aligned}
$$

Detailed derivations are shown in the supporting information (SI) Appendix. Note that $\Delta(0) = 0$ and

$$
\Delta(t) \to \frac{\eta_1}{a} \cdot \frac{N_{\text{ind}}+b-1}{N_{\text{ind}}}
$$

when $t \to \infty$.

In addition, we can calculate $\Delta_f(t)$, i.e., the number of new variants with frequency at least $f$ expected to be found in a new set of …

1.1. Estimation of the Parameters a and b of the Beta Distribution.

To estimate $a$ and $b$, we use maximum-likelihood estimation. The probability that exactly $x$ individuals show variation at a position is:

$$
\begin{aligned}
P_x &= \int_0^1 \binom{N_{\text{ind}}}{x} \theta^{x} (1-\theta)^{N_{\text{ind}}-x} f(\theta)\,d\theta \\
&= \binom{N_{\text{ind}}}{x} \int_0^1 \frac{\theta^{x+a-1} (1-\theta)^{N_{\text{ind}}-x+b-1}}{B(a, b)}\,d\theta \\
&= \binom{N_{\text{ind}}}{x} \frac{B(x+a, N_{\text{ind}}-x+b)}{B(a, b)}
\end{aligned}
$$

for $x \ge 0$. As we already mentioned, the zero class is not observed and hence we need to fit the zero-truncated beta-binomial distribution. The probability distribution function for this truncated distribution becomes:

$$
P^t_x = \frac{P_x}{\sum_{y=1}^{N_{\text{ind}}} P_y}
$$

for $x \ge 1$. The likelihood function can then be written as:

$$
L(a, b) = \prod_{x=1}^{N_{\text{ind}}} \left(P^t_x\right)^{n_x}
$$

and the log-likelihood function is:

$$
LL(a, b) = \sum_{x=1}^{N_{\text{ind}}} n_x \log P^t_x
$$

We maximize $LL(a, b)$ to obtain the maximum-likelihood estimators (MLEs) for $a$ and $b$. The maximization is carried out through the Newton–Raphson method.

Note that in the likelihood function above, we assume that markers are in linkage equilibrium. Dependence among nearby markers makes the effective number of markers in the region look smaller, but does not lead to systematic bias.