Sequence Variations in the Public Human Genome Data Reflect a Bottlenecked Population History
Total Page:16
File Type:pdf, Size:1020Kb
Sequence variations in the public human genome data reflect a bottlenecked population history Gabor Marth*†, Greg Schuler*, Raymond Yeh‡, Ruth Davenport§, Richa Agarwala*, Deanna Church*, Sarah Wheelan*¶, Jonathan Baker*, Ming Ward*, Michael Kholodov*, Lon Phan*, Eva Czabarka*, Janos Murvai*, David Cutlerʈ, Stephen Wooding**, Alan Rogers**, Aravinda Chakravartiʈ, Henry C. Harpending**, Pui-Yan Kwok†,††, and Stephen T. Sherry*† *National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894; ‡Department of Genetics, Washington University School of Medicine, St. Louis, MO 63130; §Division of Internal Medicine, Washington University School of Medicine, St. Louis, MO 63130; ¶Department of Molecular Biology and Genetics and ʈMcKusick–Nathans Institute of Genetic Medicine, The Johns Hopkins University School of Medicine, Baltimore, MD 21205; **Department of Anthropology, University of Utah, Salt Lake City, UT 84112; and ††Cardiovascular Research Institute and Department of Dermatology, University of California, San Francisco, CA 94143 Contributed by Henry C. Harpending, November 5, 2002 Single-nucleotide polymorphisms (SNPs) constitute the great ma- (density) distribution of genomic sequence variations. To this jority of variations in the human genome, and as heritable variable end, we built a set of reagents (pairwise sequence alignments landmarks they are useful markers for disease mapping and re- and their corresponding sets of variation) by analyzing the solving population structure. Redundant coverage in overlaps of overlapping regions of large-insert clones sequenced as part of large-insert genomic clones, sequenced as part of the Human the human genome project. These data provided marker Genome Project, comprises a quarter of the genome, and it is density observations grouped by overlap fragment length. representative in terms of base compositional and functional Extending previous methods (9, 10), we implemented simula- sequence features. We mined these regions to produce 500,000 tion and numerical techniques to estimate population genetic high-confidence SNP candidates as a uniform resource for describ- parameters that best describe these observed data. We report ing nucleotide diversity and its regional variation within the results indicating that both the effects of recombination and genome. Distributions of marker density observed at different substantial changes in effective population size are required to overlap length scales under a model of recombination and popu- fit models of neutral sequence evolution to observed marker lation size change show that the history of the population repre- densities. sented by the public genome sequence is one of collapse followed by a recent phase of mild size recovery. The inferred times of Methods collapse and recovery are Upper Paleolithic, in agreement with Overlap Detection, SNP Discovery, and Tabulation of Observed Marker archaeological evidence of the initial modern human colonization Density. The initial data consisted of genomic clones of either of Europe. finished or draft quality that were part of the September 5, 2000, genome data freeze. Regions of known human repeats and low nformation on the demographic history of a species is complexity sequence were masked with REPEATMASKER (Arian ͞͞ Iimprinted in the distribution of sequence variations in its Smit, http: repeatmasker.genome.washington.edu). Candidate genome. The completion of a draft sequence for the human sequence overlaps were determined by a fast initial similarity genome provides a useful substrate for both the detection of search with MEGABLAST (11), followed by pairwise alignment sequence variants and a study of their distribution. To date, the with the dynamic programming algorithm CROSS MATCH (Phil number of publicly available single-nucleotide polymorphisms Green, www.phrap.org). Draft quality sequence is often (SNPs) well exceeds two million (dbSNP build 105). The main composed of unordered fragments; hence an overlap between data sources for computational SNP discovery have been two such clones is broken up into a set of partial overlap expressed sequence tags (ESTs) (1, 2), genomic restriction fragments. Overlaps were retained for further analysis if: (i) both fragments (3), sequences aligned to genome both from the clones resided on the same chromosome, as could be determined Ͼ ends of bacterial artificial chromosomes (BACs) and from by physical mapping; and (ii) total overlap length was 6 kb, random shotgun sequences of clone sequence, and overlapping counting only overlap fragments longer than 3 kb in the total. regions of genomic clone sequences themselves (4, 5). Gen- Overlap fragments were analyzed with the POLYBAYES SNP- erally, SNPs from these data were detected in surveys of a few discovery program (12). An observed mismatch was called a chromosomes, an ascertainment strategy that biases allele candidate SNP if the corresponding POLYBAYES probability frequency patterns toward common variations (6), and thus value was at least 0.80, and there were no discrepancies in the five these data are expected to fall into a range that is unlikely to base pairs immediately flanking either side. To avoid false contain the majority of clinically important mutations (7, 8). positive predictions caused by the erroneous alignment of di- Under the ‘‘common disease, common allele’’ hypothesis, vergent copies of segmental duplications (sequence paralogy) we Ͼ however, these common variants may be of special importance. have excluded overlap fragments with 1 SNP per 400 nucleo- In either case, to assess the potential utility of these data for tides. This censorship procedure, necessary to maintain a high inferences of gene function or population history, one must quality for the candidate set, also removes overlap fragments in first understand its overall structure and distribution in the which the inherent polymorphism rate was genuinely high. The genome. Statistical power in such analyses requires a large resulting bias was estimated in subsequent analysis. An addi- amount of data, ascertained under uniform, well-characterized conditions. Clone overlaps and their derived variations are Abbreviations: SNPs, single-nucleotide polymorphisms; BAC, bacterial artificial chromo- well suited for this task, as long (up to 100 kb) regions of some. redundant sequence coverage distributed in roughly even Data deposition: SNPs discovered in this study are available from the dbSNP web site intervals (5), covering nearly a quarter of the genome. The fact (http:͞͞ncbi.nlm.nih.gov͞SNP), under the ‘‘KWOK’’ submitter handle (accession nos. that regions in a wide range of overlap length are available ss1566252–ss2075206). makes this set especially suited for studying the effects of †To whom correspondence and requests for materials may be addressed. E-mail: recombination and demographic size fluctuation on the spatial [email protected], [email protected], or [email protected]. 376–381 ͉ PNAS ͉ January 7, 2003 ͉ vol. 100 ͉ no. 1 www.pnas.org͞cgi͞doi͞10.1073͞pnas.222673099 Downloaded by guest on September 27, 2021 tional bias was introduced when regions of low-quality sequences were analyzed. These regions cannot be effectively evaluated for SNPs, as sequence differences are more likely to represent sequencing errors than true polymorphisms. We rectified this bias by adjusting the overlap interval to include only the high- quality portions of the overlaps [i.e., where the base quality value (13) was Ͼ35 in both sequences]. This procedure discarded Ϸ5% of the total overlap length. Integration with the Public Genome Assembly. To ensure an un- biased evaluation of the density distributions with respect to the reference genome sequence, we included only those portions of our overlaps that were also present in the genome assembly based on the September 5, 2000, data (14). We evaluated repeat content in the genome, as well as in the clone overlap regions by using REPEATMASKER. We used custom software to compute GϩC nucleotide and CpG dinucleotide content. SNP Validation and Estimation of Allele Frequency. The experimen- tal methods and conditions used to assess the candidate SNPs were fully described previously (15). Modeling Marker Density Distributions. Mismatch distributions de- scribe the likelihood of observing k (k ϭ 0, 1, 2, . .) polymorphic sites (mismatches) when n sample sequences of a given length, L, are compared (n ϭ 2 in this study). Traditionally, the opposing effects of meiotic recombination and co-ancestry have been studied under two simple, yet extreme, scenarios (Fig. 1a). A simple (first-order) model that ignores any structure imposed by demographic history and assumes complete independence be- tween the genealogies of neighboring sites because of recombi- nation (infinite recombination model) predicts a Poisson mis- match distribution driven solely by the mutation rate (16). Conversely, a first-order model that accounts for genealogical structure only through static demographic history and ignores recombination (zero-recombination model) predicts a geometric distribution of mutational differences (17). A detailed demographic history described by the time evolu- tion of effective population size, Ne, profoundly affects the distribution of polymorphic sites shared between individuals. In particular, a large increase of the effective population size yields an overabundance of new lineages that increase the likelihood that random sequence pairs will harbor one or more