Scan of Human Genome Reveals No New Loci Under Ancient Balancing Selection
Total Page:16
File Type:pdf, Size:1020Kb
Copyright Ó 2006 by the Genetics Society of America DOI: 10.1534/genetics.106.055715 Scan of Human Genome Reveals No New Loci Under Ancient Balancing Selection K. L. Bubb,*,1 D. Bovee,† D. Buckley,† E. Haugen,† M. Kibukawa,† M. Paddock,† A. Palmieri,† S. Subramanian,† Y. Zhou,† R. Kaul,† P. Green*,‡ and M. V. Olson* *Department of Genome Sciences, †University of Washington Genome Center, ‡Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195 Manuscript received January 12, 2006 Accepted for publication May 31, 2006 ABSTRACT There has been much speculation as to what role balancing selection has played in evolution. In an attempt to identify regions, such as HLA, at which polymorphism has been maintained in the human population for millions of years, we scanned the human genome for regions of high SNP density. We found 16 regions that, outside of HLA and ABO, are the most highly polymorphic regions yet described; however, evidence for balancing selection at these sites is notably lacking—indeed, whole-genome simulations indicate that our findings are expected under neutrality. We propose that (i) because it is rarely stable, long- term balancing selection is an evolutionary oddity, and (ii) when a balanced polymorphism is ancient in origin, the requirements for detection by means of SNP data alone will rarely be met. ENOMIC regions with high densities of single- 2005a). The elevated nucleotide diversity at HLA has G nucleotide polymorphisms (SNPs) may arise for been attributed to diversifying selection acting on one of three reasons: (1) a high mutation rate, (2) diver- functionally critical sites, coupled to extensive ‘‘hitch- sifying selection, or (3) chance. The latter category hiking’’ of neutral SNPs (Hughes and Yeager 1998; simply reflects neutral variation across the genome in Slatkin 2000). Similar explanations have been offered the time elapsed since the most recent common ances- for other known regions with high SNP densities, al- tor of a genomic segment (i.e., ‘‘deepest’’ coalescence though none is nearly as dramatic as HLA. For example, time). The level of diversity at any locus is often de- across the 25 kbp of ABO sequence, the most dissimi- scribed in terms of the average pairwise divergence, or lar haplotypes are roughly three times as divergent as the probability that two randomly chosen haplotypes expected under neutrality (6.25up compared to the neu- differ at a particular base in that locus (up). Population- tral expectation of 2up) [SeattleSNPs, National Heart, genetic theory predicts that, on average for neutrally Lung, and Blood Institute (NHLBI) Program for Geno- evolving regions, the most dissimilar haplotypes, cap- mic Applications, Seattle (http://pga.gs.washington.edu) turing the full depth of the evolutionary tree, will be (October 2005)]. At an ‘‘ancient’’ 900-kbp inversion about twice as divergent as a random pair of haplotypes on chromosome 17, spanning many genes, the maxi- (2up)(Chimpanzee Sequencing and Analysis mum pairwise difference is only 4.2up (Stefansson Consortium 2005). The variance of each measurement et al. 2005). This is an unusual case, because as recom- (average pairwise divergence and maximal pairwise bination is absent between inverted haplotypes, the divergence) decreases as the size of the locus examined depth of the coalescent is the same across the entire increases. Although there have been no systematic region. Across the genome, the average pairwise di- studies of regions of unusually high SNP density in the versity over tens of kilobase pairs rarely exceeds 4up human genome, a few such regions have been discov- [National Institute of Environmental Health Sciences ered during the analysis of variation in important genes. (NIEHS) Environmental Genome Project, University The most dramatic examples are in the HLA locus. In of Washington, Seattle (http://egp.gs.washington.edu) some regions of HLA, such as the DQB1–DQA1–DRB1 (October 2005)]. As a larger fraction of the genome is gene cluster, the sequence divergence between the most analyzed in multiple haplotypes, the number of regions dissimilar haplotypes across tens of thousands of base that, by chance alone, have ancient coalescence times pairs is nearly two orders of magnitude higher than the may overshadow the number of regions that are deep genome-average up (Stewart et al. 2004; Raymond et al. due to selection. This preponderance of false positives may be true for many tests of selection based on charac- teristics of the coalescent tree. 1Corresponding author: Department of Genome Sciences, University of In this study, we undertook a systematic search for Washington, Seattle, WA 98195. E-mail: [email protected] segments of the human genome that have high SNP Genetics 173: 2165–2177 (August 2006) 2166 K. L. Bubb et al. densities. In contrast to previous studies, our search putative SNP, reflecting the 10-fold higher mutation rate at was unbiased by functional considerations. We simply CpGs (Hwang and Green 2004). We identified 6395 align- ozen mined the same public databases of shotgun-sequencing ments that had $1% putative SNPs. Primer3 (R and Skaletsky 2000) was used on those 6395 regions to pick prim- reads that led to discovery of most of the SNPs presently ers for PCR amplification of the region; in 4255 of the cases, in the SNP database. By paying careful attention—- primers were successfully picked. To avoid sequencing the through computational filters and additional experi- same region multiple times, in cases where multiple reads im- mental sampling—to the many sources of false positives, plicated the same region as highly polymorphic, we PCR am- we identified 16 loci that, next to HLA and ABO, now plified only one arbitrarily chosen amplicon/10 kbp, for a total of 991 regions (see Figure 2a; see supplemental Table 1 at represent some of the highest SNP densities yet de- http://www.genetics.org/supplemental/ for list of primer se- scribed in the human genome across regions of many quences and expected amplicon lengths). thousands of base pairs. These regions do not seem to Experimental filtering (PCR and fosmid sequencing): PCR reflect the action of ancient balancing selection, but amplification was performed on a set of 10 DNA samples from rather illustrate the significant ‘‘noise’’ level of the self-identified African Americans, obtained from the Coriell repository (NA17031, NA17032, NA17033, NA17034, NA17035, variation in neutral coalescent depth. Our results NA17036, NA17037, NA17038, NA17039, NA17040). PCR pro- suggest a paucity of regions outside of HLA that are ducts amplified from these samples at the 991 test loci were under long-term balancing selection, as detectable by sequenced (see Figure 2b). Polyphred (Nickerson et al. 1997) polymorphism level. This result is in agreement with identified 208 regions that had more than three SNPs of rank 3 theoretical predictions that single-site balanced poly- or less, and the supporting trace data at these loci were manually examined. Loci in the HLA region, as defined by Stewart et al. morphisms would not be detectable due to recombina- (2004), were dropped from analysis at this point. We determined tional decay of the linked neutral region. Our analysis that 80 of the loci actually contained $1% SNPs with genotypes also highlights the various challenges associated with in approximate Hardy–Weinberg proportions (thereby indicat- genomewide scans for genomic segments under balanc- ing that they were not artifacts of co-amplified paralogous se- ing selection. quences), with a minimum of two occurrences of the minor allele for each SNP. To test whether the high SNP density extended over several MATERIALS AND METHODS kilobase pairs, for each locus with $1% SNPs in this sample population (again counting possible CpG mutations as one- Computational filtering: The SNP Consortium (Sachida- tenth SNP), we PCR amplified two flanking regions—one 3 kbp nandam et al. 2001) reads (release 10) were downloaded from upstream and one 3 kbp downstream from the initial read—- http://snp.cshl.org and ftp.ncbi.nih.gov. For each read, we re- from four individuals from our diversity panel (NA03715, quired one fasta file containing the base calls and one quality NA11373, NA11589, NA14660), three from the original file containing the base-by-base, log-transformed probability African-American sample (NA17031, NA17038, NA17039) that each call is incorrect (Ewing and Green 1998). In some and one from a chimpanzee (NA03646), as shown in Figure cases, paired fasta and quality files were downloaded, and in 2c. Loci at which at least one of the flanking PCR sites had some cases, the chromatogram was downloaded and fasta and $0.7% SNPs, with a minimum of two occurrences of the minor quality files were generated in-house using PHRED (Ewing allele per SNP, were considered further (80 loci). On the basis and Green 1998; Ewing et al. 1998). The reads were then of visual inspection, we selected one SNP from each originally trimmed for quality such that each read consisted only of a amplified locus as a ‘‘tag SNP’’ for purposes of defining a pair region with .95% Q20 bp. Reads with ,100 bp fitting this of diverged haplotypes; the tag SNP was required to be in criterion were removed from the pipeline, leaving a total of apparent linkage disequilibrium with other SNPs in both the 4,295,373 reads (see Figure 1 for filtering summary). originally amplified locus and the flanking regions, and in These trimmed reads were aligned to the human reference general it was the SNP with the highest minor allele frequency sequence (build30) using BLAST (Altschul et al. 1990). We (see asterisks in Figure 2). We sequenced additional individ- retained those reads that matched the reference for at least uals (NA01814, NA10471, NA10540, NA10543, NA10975, 300 bp (3,334,762 reads) and, in addition, disagreed with the NA10976, NA11321, NA11521, NA13838, NA14663) to iden- reference at 1% or more of the aligned base pairs (402,580 tify three samples from each of the two haplotypes defined as reads).