Purifying Selection Causes Widespread Distortions of Genealogical Structure on the Human X Chromosome
INVESTIGATION

Brendan O’Fallon¹
Associated Regional and University Pathologists, Institute for Clinical and Experimental Pathology, Salt Lake City, Utah 84112

ABSTRACT The extent to which selective forces shape patterns of genetic and genealogical variation is unknown in many species. Recent theoretical models have suggested that even relatively weak purifying selection may produce significant distortions in gene genealogies, but few studies have sought to quantify this effect in humans. Here, we employ a reconstruction method based on the ancestral recombination graph to infer genealogies across the length of the human X chromosome and to examine time to most recent common ancestor (TMRCA) and measures of tree imbalance at both broad and very fine scales. In agreement with theory, TMRCA is significantly reduced and genealogies are significantly more imbalanced in coding regions and introns when compared to intergenic regions, and these effects are increased in areas of greater evolutionary constraint. These distortions are present at multiple scales, and chromosomal regions as broad as 5 Mb show a significant negative correlation in TMRCA with exon density. We also show that areas of recent TMRCA are significantly associated with the disease-causing potential of a site as measured by the MutationTaster prediction algorithm. Together, these findings suggest that purifying selection has significantly distorted human genealogical structure on both broad and fine scales and that few chromosomal regions escape selection-induced distortions.

Copyright © 2013 by the Genetics Society of America
doi: 10.1534/genetics.113.152074
Manuscript received November 26, 2012; accepted for publication April 9, 2013
¹Address for correspondence: 500 Chipeta Way, Salt Lake City, UT 84108. E-mail: [email protected]
Genetics, Vol. 194, 485–492 June 2013

Understanding the impact of natural selection on patterns of genetic variation in humans has important implications for disease research, population genetics, and anthropology. For example, identification of functionally conserved regions aids in quantifying the effect of individual variants on disease phenotypes (Cooper et al. 2010). Similarly, delineation of regions that have recently experienced adaptive sweeps helps to illuminate traits underlying uniquely human features. Characterizing the selective forces operating across the genome requires knowledge of how selection alters patterns of not only nucleotide, but also genealogical, variation. The shapes of gene genealogies and their recombinant analogs are intimately connected to the character of nucleotide variability in a region, and differing forms of selection are known to produce distinct distortions to genealogical structure (e.g., Neuhauser and Krone 1997; Barton and Navarro 2002; Seger et al. 2010). These selection-induced alterations affect linked neutral variation, potentially generating a widespread signature of selection. Few studies, however, have sought to quantify patterns of genealogical variation on a large scale, and investigations are often limited to simulation and theory.

Recent years have seen a significant expansion in our understanding of how natural selection affects genealogical structure. For example, strong positive selection is understood to decrease nucleotide variability, shorten the time to most recent common ancestor (TMRCA), and increase the length of haplotype blocks (reviewed in Bamshad and Wooding 2003). Several statistical tests have been proposed that use these distortions to detect recent selective sweeps (e.g., Sabeti et al. 2007; Voight et al. 2006). In contrast to positive selection, balancing selection lengthens TMRCA, increases nucleotide variability, and leads to genealogies with long internal branches (e.g., Barton and Navarro 2002; Andres et al. 2009). Finally, while purifying selection was long thought to have negligible effects on genealogies, recent theoretical work has suggested that it may have a significant impact (Maia et al. 2004; Comeron et al. 2008; Seger et al. 2010; O’Fallon et al. 2010; Walczak et al. 2012). Simulations and modeling have suggested that selection at many closely linked sites leads to genealogies that are shallower and more imbalanced than expected under neutrality. Under these conditions coalescent rate increases gradually with backward time, the basal branches of genealogies are shortened more than the tipward branches, and the site frequency spectrum is skewed toward excess low-frequency polymorphism (Comeron et al. 2008). The most obvious feature of this process is an overall shortening of tree depth, which may be decreased by nearly 50% relative to the neutral expectation (Seger et al. 2010). Given that nearly all functional sequences in the human genome are likely to experience some degree of purifying selection, these effects have the potential to impact large regions of the human genome.

Several recent studies have identified unexpected patterns of nucleotide diversity on the human X chromosome, possibly resulting from purifying selection. In particular, Keinan et al. (2008) noted that variation on the X chromosome is reduced by a greater-than-expected degree when compared to the autosomes. Although demography is likely to play a role (Keinan and Reich 2010; Gottipati et al. 2011), Hammer et al. (2010) demonstrated that the reduction in diversity is dependent on the distance to the nearest gene and that the correlation between diversity and nearest gene distance is greater on the X than on the autosomes. The strong correlation between diversity and nearest gene distance may result from the reduced opportunity for recombination on the X, which increases linkage disequilibrium and, in areas where many sites are under selection, the potential for selection-induced genealogical distortion.

Investigation of the genealogical distortions induced by purifying selection has been hampered by the presence of recombination, and the extent to which theories developed under the assumption of no recombination are applicable to empirical, highly recombinant data sets is not well understood. When genetic sequences recombine, their ancestry is described not by a simple gene tree but by a more complex structure known as an ancestral recombination graph (ARG) (Griffiths and Marjoram 1996). In contrast to the well-developed methods for reconstruction of nonrecombining trees, methods for inferring ARGs are still in their infancy, and few tools exist to

explicitly considering the full ancestry of the samples, ARG-based approaches can accurately capture the statistical correlations in the data induced by shared ancestry. Similarly, inference of the ARG underlying the samples allows localization of individual recombinations and regions of contiguous TMRCA on a very fine scale, not merely the mean TMRCA over a predefined window.

In this study, we employ a recently developed ARG-based reconstruction technique (O’Fallon 2013) to examine genealogical structure across the human X chromosome. We investigate a sample of 12 males obtained from the Complete Genomics Diversity Panel (Complete Genomics, Mountain View, CA) (Drmanac et al. 2010) and infer likely ARGs ancestral to the samples. From the inferred ARGs we extract the TMRCA along the length of the sequence and show that selective constraint induces detectable distortions in ancestral structure. Coding regions are shown to differ significantly from noncoding regions, and increasing selective constraint increases the distortions within coding regions. We also identify several regions that have likely experienced recent positive selection.

Materials and Methods

Samples

We obtained 12 publicly available male samples from the Complete Genomics Diversity Panel (Drmanac et al. 2010) representing seven global populations (Table 1). Only the non-pseudoautosomal regions of the X chromosome were considered. Full nucleotide alignments were generated by comparing variant files to the human reference genome (Genome Reference Consortium v37). Because samples were obtained exclusively from males, no genotyping was necessary and haplotype phasing was unambiguous. Overall, we analyzed 125 Mb of sequence, 92.5% of the full 135-Mb sequence.

Table 1 Sample designations with source population and region

Sample     Population
NA20510    TSI (Toscan, Italy)
NA07357    MXL (Mexican, United States)
NA21737    MKK (Maasai, Kenya)
NA19670    MXL (Mexican, United States)
NA19649    MXL (Mexican, United States)
NA20846    GIH (Gujarati Indian, United States)
NA19025    LWK (Luhya, Kenya)
NA20850    GIH (Gujarati Indian, United States)
NA18501    YRI (Yoruban, Nigeria)
NA18940    JPT (Japanese, Tokyo)
NA20845    GIH (Gujarati Indian, United States)
NA18558    CHB (Han Chinese, China)

Reconstruction

The software tool ACG (O’Fallon 2013) was used to infer recombinant genealogies across the X chromosome. The program uses a Bayesian Markov chain Monte Carlo (MCMC) strategy to sample ARGs and other model parameters in proportion to their likelihood, given the nucleotide alignment, a description of the nucleotide substitution process, and a set of Bayesian priors on model parameters. The program operates similarly to other “genealogy samplers” such as BEAST (Drummond and Rambaut 2007), MrBayes (Huelsenbeck and Ronquist 2001), and LAMARC (Kuhner 2006).

Variant files were obtained from the Complete Genomics Web site (http://www.completegenomics.com/public-data/) and