INVESTIGATION

Purifying Selection Causes Widespread Distortions of Genealogical Structure on the Human X Chromosome

Brendan O’Fallon1 Associated Regional and University Pathologists, Institute for Clinical and Experimental Pathology, Salt Lake City, Utah 84112

ABSTRACT The extent to which selective forces shape patterns of genetic and genealogical variation is unknown in many species. Recent theoretical models have suggested that even relatively weak purifying selection may produce significant distortions in genealogies, but few studies have sought to quantify this effect in humans. Here, we employ a reconstruction method based on the ancestral recombination graph to infer genealogies across the length of the human X chromosome and to examine time to most recent common ancestor (TMRCA) and measures of tree imbalance at both broad and very fine scales. In agreement with theory, TMRCA is significantly reduced and genealogies are significantly more imbalanced in coding regions and introns when compared to intergenic regions, and these effects are increased in areas of greater evolutionary constraint. These distortions are present at multiple scales, and chromosomal regions as broad as 5 Mb show a significant negative correlation in TMRCA with exon density. We also show that areas of recent TMRCA are significantly associated with the disease-causing potential of sites as measured by the MutationTaster prediction algorithm. Together, these findings suggest that purifying selection has significantly distorted human genealogical structure on both broad and fine scales and that few chromosomal regions escape selection-induced distortions.

Copyright © 2013 by the Genetics Society of America
doi: 10.1534/genetics.113.152074
Manuscript received November 26, 2012; accepted for publication April 9, 2013
1Address for correspondence: 500 Chipeta Way, Salt Lake City, UT 84108. E-mail: [email protected]
Genetics, Vol. 194, 485–492 June 2013

Understanding the impact of natural selection on patterns of genetic variation in humans has important implications for disease research, population genetics, and anthropology. For example, identification of functionally conserved regions aids in quantifying the effect of individual variants on disease phenotypes (Cooper et al. 2010). Similarly, delineation of regions that have recently experienced adaptive sweeps helps to illuminate traits underlying uniquely human features. Characterizing the selective forces operating across the genome requires knowledge of how selection alters patterns of not only nucleotide, but also genealogical, variation. The shapes of gene genealogies and their recombinant analogs are intimately connected to the character of nucleotide variability in a region, and differing forms of selection are known to produce distinct distortions to genealogical structure (e.g., Neuhauser and Krone 1997; Barton and Navarro 2002; Seger et al. 2010). These selection-induced alterations affect linked neutral variation, potentially generating a widespread signature of selection. Few studies, however, have sought to quantify patterns of genealogical variation on a large scale, and investigations are often limited to simulation and theory.

Recent years have seen a significant expansion in our understanding of how natural selection affects genealogical structure. For example, strong positive selection is understood to decrease nucleotide variability, shorten the time to most recent common ancestor (TMRCA), and increase the length of haplotype blocks (reviewed in Bamshad and Wooding 2003). Several statistical tests have been proposed that use these distortions to detect recent selective sweeps (e.g., Sabeti et al. 2007; Voight et al. 2006). In contrast to positive selection, balancing selection lengthens TMRCA, increases nucleotide variability, and leads to genealogies with long internal branches (e.g., Barton and Navarro 2002; Andrés et al. 2009). Finally, while purifying selection was long thought to have negligible effects on genealogies, recent theoretical work has suggested that it may have a significant impact (Maia et al. 2004; Comeron et al. 2008; Seger et al. 2010; O’Fallon et al. 2010; Walczak et al. 2012). Simulations and modeling have suggested that selection at many closely linked sites leads to genealogies that are shallower and more imbalanced than expected under neutrality. Under these conditions coalescent rate increases gradually with backward time, the basal branches of genealogies are shortened more than the tipward branches, and the site frequency spectrum is skewed toward excess low-frequency polymorphism (Comeron et al. 2008). The most obvious feature of this process is an overall shortening of tree depth, which may be decreased by nearly 50% relative to the neutral expectation (Seger et al. 2010). Given that nearly all functional sequences in the genome are likely to experience some degree of purifying selection, these effects have the potential to impact large regions of the human genome.

Several recent studies have identified unexpected patterns of nucleotide diversity on the human X chromosome, possibly resulting from purifying selection. In particular, Keinan et al. (2008) noted that variation on the X chromosome is reduced by a greater-than-expected degree when compared to the autosomes. Although demography is likely to play a role (Keinan and Reich 2010; Gottipati et al. 2011), Hammer et al. (2010) demonstrated that the reduction in diversity is dependent on the distance to the nearest gene and that the correlation between diversity and nearest gene distance is greater on the X than on the autosomes. The strong correlation between diversity and nearest gene distance may result from the reduced opportunity for recombination on the X, which increases linkage disequilibrium and, in areas where many sites are under selection, the potential for selection-induced genealogical distortion.

Investigation of the genealogical distortions induced by purifying selection has been hampered by the presence of recombination, and the extent to which theories developed under the assumption of no recombination are applicable to empirical, highly recombinant data sets is not well understood. When genetic sequences recombine, their ancestry is described not by a simple gene tree but by a more complex structure known as an ancestral recombination graph (ARG) (Griffiths and Marjoram 1996). In contrast to the well-developed methods for reconstruction of nonrecombining trees, methods for inferring ARGs are still in their infancy, and few tools exist to infer, visualize, or extract information from ARGs. Nonetheless, ARG-based approaches offer some advantages over methods that do not explicitly consider genealogies (e.g., Templeton 1993; Tang et al. 2002). By explicitly considering the full ancestry of the samples, ARG-based approaches can accurately capture the statistical correlations in the data induced by shared ancestry. Similarly, inference of the ARG underlying the samples allows localization of individual recombinations and regions of contiguous TMRCA on a very fine scale, not merely the mean TMRCA over a predefined window.

In this study, we employ a recently developed ARG-based reconstruction technique (O’Fallon 2013) to examine genealogical structure across the human X chromosome. We investigate a sample of 12 males obtained from the Complete Genomics Diversity Panel (Complete Genomics, Mountain View, CA) (Drmanac et al. 2010) and infer likely ARGs ancestral to the samples. From the inferred ARGs we extract the TMRCA along the length of the sequence and show that selective constraint induces detectable distortions in ancestral structure. Coding regions are shown to differ significantly from noncoding regions, and increasing selective constraint increases the distortions within coding regions. We also identify several regions that have likely experienced recent positive selection.

Materials and Methods

Samples

We obtained 12 publicly available male samples from the Complete Genomics Diversity Panel (Drmanac et al. 2010) representing seven global populations (Table 1). Only the non-pseudoautosomal regions of the X chromosome were considered. Full nucleotide alignments were generated by comparing variant files to the human reference genome (Genome Reference Consortium v37). Because samples were obtained exclusively from males, no genotyping was necessary and haplotype phasing was unambiguous. Overall, we analyzed 125 Mb of sequence, 92.5% of the full 135-Mb sequence.

Table 1 Sample designations with source population and region

Sample     Population
NA20510    TSI (Toscan, Italy)
NA07357    MXL (Mexican, United States)
NA21737    MKK (Maasai, Kenya)
NA19670    MXL (Mexican, United States)
NA19649    MXL (Mexican, United States)
NA20846    GIH (Gujarati Indian, United States)
NA19025    LWK (Luhya, Kenya)
NA20850    GIH (Gujarati Indian, United States)
NA18501    YRI (Yoruban, Nigeria)
NA18940    JPT (Japanese, Tokyo)
NA20845    GIH (Gujarati Indian, United States)
NA18558    CHB (Han Chinese, China)

Reconstruction

The software tool ACG (O’Fallon 2013) was used to infer recombinant genealogies across the X chromosome. The program uses a Bayesian Markov chain Monte Carlo (MCMC) strategy to sample ARGs and other model parameters in proportion to their likelihood, given the nucleotide alignment, a description of the nucleotide substitution process, and a set of Bayesian priors on model parameters. The program operates similarly to other “genealogy samplers” such as BEAST (Drummond and Rambaut 2007), MrBayes (Huelsenbeck and Ronquist 2001), and LAMARC (Kuhner 2006).

Variant files were obtained from the Complete Genomics Web site (http://www.completegenomics.com/public-data/) and converted to fasta format in 50-kb windows that overlapped by 10 kb. Approximately 3100 input alignments were produced. Independent MCMC runs were performed for each alignment, and initial runs were conducted for 5 × 10⁷

Figure 1 Inferred TMRCA across the short arm (top) and long arm (bottom) of the X chromosome. Blue lines indicate TMRCA from maximally likely ARG. Red line indicates mean TMRCA in a 50-kb sliding window. Blue dots at bottom are locations of .
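The red line in Figure 1 is a 50-kb sliding-window mean of the inferred TMRCA. A minimal sketch of such a smoother is given below. It assumes, since the caption does not say, that the window is centred on each site and that TMRCA values are supplied at sorted coordinates; the function name `sliding_window_mean` and its arguments are illustrative and are not part of ACG.

```python
def sliding_window_mean(positions, tmrca, window_bp=50_000):
    """Mean TMRCA in a window of `window_bp` centred on each position.

    positions : sorted site coordinates in bp
    tmrca     : TMRCA value at each coordinate (same length)
    The centred window is an assumption made for illustration.
    """
    half = window_bp / 2
    means = []
    lo = hi = 0      # current window covers indices lo .. hi-1
    total = 0.0      # running sum of tmrca[lo:hi]
    n = len(positions)
    for i in range(n):
        # Extend the right edge to include sites within half a window.
        while hi < n and positions[hi] <= positions[i] + half:
            total += tmrca[hi]
            hi += 1
        # Advance the left edge past sites more than half a window behind.
        while positions[lo] < positions[i] - half:
            total -= tmrca[lo]
            lo += 1
        means.append(total / (hi - lo))
    return means


if __name__ == "__main__":
    # Toy data: one value every 100 bp, smoothed with a 200-bp window.
    print(sliding_window_mean([0, 100, 200, 300], [1, 2, 3, 4], window_bp=200))
```

Because both window edges only move forward, the whole chromosome is smoothed in a single pass rather than re-summing each window.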

MCMC steps. Convergence was assessed by discarding the first 50% of all steps and by examining the probability of the data given the genealogy (the “data likelihood”) for the remaining steps. The likelihood for the remaining states was split into four intervals (each containing likelihoods spanning 6,250,000 states), and the means and standard deviations of each interval were computed and compared. Runs in which the standard error of the means or standard deviations differed by >25% were labeled as nonconverging, and a new MCMC run was initiated using the most likely ARG identified in the previous run as the starting point for the new run.

To facilitate searching the complex ARG space, we employed a Metropolis-coupled scheme utilizing four chains in which the temperature of chain i was 1/exp(−λi), where λ was adjusted over time to ensure that cold chain swaps were accepted at least 20%, but not >80%, of the times such a swap was proposed. Temperature swaps were proposed every 500 MCMC states.

ARGs were sampled and collected every 20,000 MCMC steps from all runs, and the TMRCA every 100 bp along each ARG was computed by extracting the marginal tree (the nonrecombining tree ancestral to a single site) and finding the height of the shallowest node that was ancestral to all tips. Mean TMRCA at a site was then computed by examining the TMRCA at that site over all trees sampled. In areas of overlap between adjacent data windows, we employed a weighted-averaging scheme to smoothly transition the TMRCA value from one window to the next. For example, if a value was 2 kb from one end of the overlap and 8 kb from the other end, the computed value would be 80% of the value associated with the close window and 20% of the value associated with the far window.

Results

Chromosomal patterns of TMRCA

Our analysis indicates that TMRCA varies widely across the chromosome, with an average of 5.24 × 10⁻⁴ substitutions per site (Figure 1). Assuming the rate of de novo mutations is 1.2 × 10⁻⁸ mutations/site/generation (e.g., Roach et al. 2010), we then compute that, on average, lineages ancestral to our samples coalesced 1.1 MYA. The mean TMRCA varies widely on multiple scales, with substantial fluctuations at both small scales (tens of kilobases) as well as at much broader, multi-megabase scales. Nonetheless, 90% of the density of the TMRCA lies between 10⁻⁴ and 0.001 substitutions/site, suggesting that the majority of coalescences along the human X occurred between 200 KYA and 2 MYA. The lowest inferred TMRCA was 9.35 × 10⁻⁶ substitutions/site, or 20 KYA using the calibration above. This region is 30 kb in length and located in the GAGE-8 gene, which codes for a short peptide expressed in the testis.

On a fine scale, many narrow peaks of relatively high TMRCA are apparent. These peaks are typically several hundred base pairs in length, and several are deeper than 0.005 substitution/site (10 MYA using the calibration above). Some of these may represent alignment errors, which are likely to produce clusters of false SNP calls, thereby increasing the observed amount of variability and the inferred TMRCA. The narrow breadth of these regions is expected from the coalescent process, in which a greater total number of recombinations are ancestral to the samples in regions of high TMRCA, thereby reducing the breadth of regions of deep coalescence.

Comparison of coding vs. noncoding regions

Theoretical studies have suggested that weak to moderate strength purifying selection at closely linked sites may
distort genealogical structure, primarily by decreasing tree height (e.g., O’Fallon et al. 2010; Seger et al. 2010). We tested this hypothesis by comparing the TMRCA inferred from our data in the coding regions of genes (obtained from the University of California at Santa Cruz Table Browser) to the TMRCA in noncoding regions. Data were tabulated by traversing the chromosome in 100-bp steps and at each step computing the mean TMRCA from all ARGs sampled during the MCMC. TMRCAs in exons were found to be significantly more recent than TMRCAs computed in noncoding regions, consistent with theoretical predictions (Figure 2). The mean TMRCA found in exons was 4.16 × 10⁻⁴ (SD 2.3 × 10⁻⁴), 25% smaller than that for intergenic regions, which was 5.45 × 10⁻⁴ (SD 3.4 × 10⁻⁴). The distributions also differed greatly in range, with no exonic regions containing TMRCAs deeper than 0.0025 substitution/site, but numerous intergenic regions extending to depths >0.0025. It is possible that the results were influenced by using selected sites themselves to perform inference, as this may bias estimates of genealogical structure. However, the TMRCA computed in introns was similar to that found for exons (mean 4.5 × 10⁻⁴), suggesting that close linkage to a selected region also decreases TMRCA, an effect not predicted by the selection-induced inference error hypothesis.

Figure 2 Distribution of TMRCA assessed at single sites in exons (red), introns (green), and intergenic regions (blue).

We also quantified the degree of imbalance or “skewness” of genealogies in coding and noncoding regions. We examined marginal genealogies at 500-bp intervals extracted from the maximally likely ARG identified in each window. For each genealogy, we computed two classical measures of tree imbalance, the Colless index and Sackin’s index (see Rogers 1996 for discussion of tree-shape statistics). For each measure, genealogies in exons were significantly more skewed than genealogies in intronic and intergenic regions (Wilcoxon rank-sum test, P < 5 × 10⁻⁴ for both measures) (Figure 3). However, the absolute difference in mean skewness was modest, with the mean Colless index of trees in exons and nonexons being 2.17 and 2.10, respectively. In addition, we computed the distribution of the Colless index by simulating 1000 nonrecombinant genealogies under the Kingman model with no selection and constant population size. Empirical genealogies were significantly more imbalanced than the simulated, neutral genealogies (mean Colless index 1.57, P < 10⁻¹⁶). Because skewness is expected to change gradually across sites, the above analysis includes many genealogies from noncoding regions that may be skewed due to their proximity to an exon. To determine if these genealogies were biasing the results, we performed an additional analysis in which we included all genealogies in coding regions, but only genealogies from noncoding regions that were >25 kb from the nearest exon boundary. Excluding these genealogies further increased the difference in skewness between coding and noncoding regions (Wilcoxon rank-sum test, P < 10⁻⁸) (Table 2).

Figure 3 Distribution of Colless index of tree imbalance for simulated neutral genealogies, genealogies from coding regions, and genealogies from noncoding regions.

Large-scale patterns of TMRCA and exon density

The analysis above examines the relationship between TMRCA and purifying selection at the scale of individual sites. However, selection may also have broader consequences by reducing variation at linked regions. When selection acts at many distinct areas over a broad region, overall TMRCA may be significantly reduced. To examine the relationship between mean TMRCA and coding region density over broad regions, we divided the data into nonoverlapping windows and computed the average TMRCA and the fraction of bases contained in exons within each window. At all window sizes examined that were ≤5 Mb, we observed a significant correlation between exon density and TMRCA, as predicted by
theory (Figure 4 and Table 3). Even at large, multimegabase scales, areas with a relatively high fraction of coding sites were associated with relatively shallow TMRCA. The strength of the correlation increased with window size, indicating that exon density has a particularly strong effect on large windows. This effect translates into a substantial shortening of broad regions that are relatively exon-dense. For example, a 5-Mb region with an exon density of 0.025 has a TMRCA nearly 20% more recent than a 1-Mb window with a similar density.

Table 2 Average tree imbalance statistics for coding and noncoding regions

                      Mean Colless index   Mean Sackin’s index
Coding                2.17                 4.57
Noncoding (all)       2.10**               4.52**
Noncoding (>25 kb)    2.07***              4.50***

** P < 5 × 10⁻⁴. *** P < 10⁻⁸.

Figure 4 Correlation of exon density with TMRCA in large chromosomal regions for several different window sizes. Lines represent linear regressions of TMRCA on region width.

Correlation of TMRCA with evolutionary constraint

While the above analysis suggests that coding regions differ significantly from noncoding regions in terms of broad genealogical properties, an additional question is whether or not increasing degrees of selective constraint amount to greater distortion within coding regions. To investigate this question, we consulted the Genomic Evolutionary Rate Profile (GERP++) (Davydov et al. 2010) scores in coding regions. The GERP algorithm examines variation across 33 mammalian species and thus examines selective constraint on a broad, phylogenetic scale. We employed a database of precomputed GERP++ scores obtained from dbNSFP2.0 (Liu et al. 2011). While GERP++ scores are available at most exonic sites, due to concerns about non-independence of TMRCA at nearby sites, we downsampled the data set by selecting only one in every 1000 sites at which a GERP++ score was available. The resulting set included 2088 sites. We then computed the Spearman rank correlation coefficient of the GERP++ score on TMRCA, which was −0.064. To assess significance, we generated 100,000 permutations of the TMRCA values at the included sites across GERP++ scores, and for each permuted set we computed the Spearman correlation coefficient. Only 187 of the 100,000 coefficients from the permuted data sets were less than the observed value of −0.064, indicating significance at the P < 0.005 level. Thus, high GERP++ scores are significantly associated with shallow TMRCA, a phenomenon in agreement with theoretical predictions regarding the effects of purifying selection on gene genealogies.

Correlation of TMRCA with predicted disease-causing potential

Numerous algorithms exist to predict the likelihood that an individual nucleotide change will produce a human disease phenotype. Among these, MutationTaster yields a relatively high sensitivity and specificity (Schwarz et al. 2010). We sought to determine if the TMRCA inferred in the above analysis was correlated with the MutationTaster score (henceforth MT score) at individual sites. We again consulted the dbNSFP2.0 database (Liu et al. 2011) to obtain precomputed MT scores at nearly all exonic positions. On the X chromosome, some 1,058,000 sites were annotated with MT scores. Because the database contains the predicted effect of each possible nucleotide change at each site, we averaged these predictions together to produce an average MT score for each site. We then performed an analysis similar to that for the GERP++ scores in which the set was downsampled by scanning along the chromosome and including sites only if an MT score was available at that site and if the site was >5 kb from the previously included site. This downsampling strategy produced a data set with 2531 sites. The Spearman rank correlation coefficient was then computed for the empirical data set as well as for 100,000 permutations of the data. For the empirical (nonpermuted) data set the correlation coefficient was −0.067. Among the permuted data sets the mean correlation was 3.41 × 10⁻⁵ (SD 0.019). Only 42 of 100,000 correlation coefficients for the permuted data sets fell below the empirically observed value of −0.067, yielding a P-value of <0.0005. Thus, the MutationTaster values demonstrate a significant negative correlation with TMRCA in our data, such that areas of relatively recent common ancestry are more likely to harbor sites with high disease-causing potential.

Identification of candidate regions with recent selective sweeps

When a positively selected variant sweeps to fixation in a population, nearby variants may “hitchhike” to high
frequency with the causative variant. Depending on the rapidity of the sweep, the TMRCA may be significantly shortened, with more rapid and recent sweeps leading to broad areas of continuously shallow TMRCA. The genealogical analysis presented here allows for identification of regions that may have experienced particularly rapid or recent sweeps. By locating broad regions of continuous, shallow TMRCA, candidate genes may be identified. To identify regions of continuous TMRCA, we examined the set of maximally likely ARGs found during the MCMC run for each window, extracted the TMRCA along each ARG, and joined the results by allowing adjacent windows whose normalized difference in TMRCA in the overlapping area was less than a threshold value (0.35) to be combined into a single region. Plots of the length of each region by the TMRCA of the region demonstrate that the two measures are highly correlated, as expected from theory. The correlation is due to the fact that recombinations slowly break regions of continuous TMRCA over time, so very deep regions are expected to contain relatively narrow TMRCA blocks, while areas of recent TMRCA will contain longer blocks. Visual inspection of the distribution of block length and TMRCA then allows for isolation of candidate blocks (Figure 5).

Table 3 Correlation of exon density with TMRCA for different-sized windows

Window size   Correlation coefficient   P-value
250 kb        −0.29                     <10⁻¹⁰
1 Mb          −0.38                     <10⁻⁵
5 Mb          −0.63                     0.00126

Figure 5 Distribution of lengths of regions of contiguous TMRCA and TMRCA of the region. Red dots indicate regions containing genes; blue dots represent intergenic regions.

While we do not attempt a comprehensive categorization of all candidate regions, examination of several of the largest, most shallow regions suggests some trends. For example, four of the top five genes code for products that function on the cell surface, in one case on the sperm, suggesting a role in recognition by pathogens (Table 4). The sperm-head component, CYLC1, resides in the longest contiguous TMRCA block (346 kb), and genes involved in reproduction have been found to evolve particularly rapidly in several previous studies (e.g., Wyckoff et al. 2000). The size of our candidate regions is similar to, although slightly smaller than, those identified in a recent study using SNP data and an independent methodology (339–405 kb) (Tang et al. 2007).

This technique is capable of detecting only complete or nearly complete selective sweeps. Ongoing sweeps, in which the favored variant has not reached a high frequency in the global population, are not likely to be associated with recent TMRCA and thus will not be detected by TMRCA-based methods alone. The well-characterized gene G6PD, which is associated with malarial resistance and is thought to be under strong positive selection (e.g., Ruwende et al. 1995), falls into this category.

Discussion

Access to full chromosomal nucleotide sequences from multiple individuals allows examination of the genealogies uniting those at both broad and very fine scales. In this work, we have inferred the genealogical structure on the X chromosome from a worldwide sample of males and demonstrated that natural selection has significantly distorted that structure across several scales. Coding regions share significantly more recent common ancestors than do noncoding regions, and this effect is increased in regions that are under more significant evolutionary constraint, as measured by the GERP++ score (Davydov et al. 2010). Genealogies in coding regions are also significantly more imbalanced than those from noncoding regions. These distortions are detectable at multiple scales, and large chromosomal regions with relatively high exon density have significantly more recent common ancestors than do regions with low exon density. Finally, we have shown that the predicted disease-causing potential of individual sites is also associated with genealogical structure, and sites where mutations are predicted to cause disease appear more frequently in areas with recent common ancestors.

These findings corroborate recent theoretical work regarding the effects of purifying selection on the structure of gene genealogies (Comeron et al. 2008; O’Fallon et al. 2010; Seger et al. 2010; Walczak et al. 2012). These studies have shown that, in the absence of recombination, relatively weak purifying selection at closely linked sites causes genealogies to be shorter and more imbalanced than expected under neutral theory. That both of these effects can be detected in genealogies constructed from the human X chromosome strongly suggests that factors such as recombination and population structure do not significantly alter the primary predictions of the theoretical studies. Furthermore, because exon density is relatively low on the X chromosome (4%), selection-induced distortions may affect other chromosomes

Table 4 Candidate genes in regions showing evidence of selective sweeps

Gene      TMRCA         Length of tract (kb)   Description
CYLC1     2.36 × 10⁻⁴   346                    Sperm-head component
ARHGAP6   6.9 × 10⁻⁵    317                    Cell-membrane actin polymerization
BYCRN1    3.63 × 10⁻⁵   200                    Brain cytoplasmic RNA, dendritic biosynthesis
ENOX2     1.16 × 10⁻⁴   280                    Cell-surface protein disulphide exchange, growth and tumor-related
GPC3      4.74 × 10⁻⁵   155                    Cell-surface peptidoglycan
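The TMRCA values in Table 4 are in substitutions per site. A minimal sketch of the calibration used elsewhere in the text (a mutation rate of 1.2 × 10⁻⁸ substitutions/site/generation and a 25-year generation time) is shown below, applied to the Table 4 entries. The function name `tmrca_to_years` and the dictionary layout are illustrative only, and the conversion assumes a mutation rate that is constant across sites and over time.

```python
MUTATION_RATE = 1.2e-8    # substitutions/site/generation, the rate used in the text
GENERATION_YEARS = 25     # generation time in years, as assumed in the text


def tmrca_to_years(subs_per_site, mu=MUTATION_RATE, gen_years=GENERATION_YEARS):
    """Convert a TMRCA in substitutions/site to years.

    generations = TMRCA / mu ; years = generations * generation time.
    Assumes a constant mutation rate across sites and over time.
    """
    generations = subs_per_site / mu
    return generations * gen_years


# TMRCA values (substitutions/site) copied from Table 4.
TABLE4_TMRCA = {
    "CYLC1": 2.36e-4,
    "ARHGAP6": 6.9e-5,
    "BYCRN1": 3.63e-5,
    "ENOX2": 1.16e-4,
    "GPC3": 4.74e-5,
}

if __name__ == "__main__":
    for gene, t in TABLE4_TMRCA.items():
        print(f"{gene}: about {tmrca_to_years(t):,.0f} years")
```

Passing different `mu` or `gen_years` values shows how strongly the dates depend on the calibration chosen.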

with higher exon density more significantly. Finally, we note that traditional positive selection appears to have influenced genealogical structure in several areas, creating extended regions of recent TMRCA in several genes encoding products expressed on the cell surface as well as a sperm-head component.

Overall, TMRCA ranged from 9.3 × 10⁻⁶ substitutions/site to 0.0078 substitution/site. If the appropriately sex-averaged mutation rate is near 1.2 × 10⁻⁸ substitutions/site/generation, then the TMRCA in generations ranges from 775 to 650,000, or from 19,375 to 16,250,000 years assuming a generation time of 25 years. Similarly, the mean and median TMRCA in substitutions/site are 5.24 × 10⁻⁴ and 4.66 × 10⁻⁴, respectively, which translates to 1,100,000 and 971,000 years. Similar estimates were produced in an analysis by Blum and Jakobsson (2011), who found the mean TMRCA to be close to 1,000,000 years. However, such conversions implicitly assume that the mutation rate is constant across sites and over time, an assumption that is probably false. In particular, areas of deep TMRCA may represent mutational hotspots, clusters of false-positive SNP calls, or both. Additionally, the assumption of time-constant evolutionary rates may not hold in regions with many selected sites, and the conversion procedure above should be treated with some skepticism in these regions.

The TMRCAs that we estimated are substantially deeper than estimates from the mitochondrion and the nonrecombining portion of the Y chromosome (NRY). Estimates of TMRCA for the mitochondrion range from 100 to 250 KYA, depending on the calibration method employed (Tang et al. 2002; Endicott and Ho 2008). Similarly, TMRCA for the NRY is estimated to be 100–150 KYA (Tang et al. 2002; Cruciani et al. 2011). Both the mitochondrion and the NRY have an effective population size that is one-third that of the X chromosome, but correcting for this difference does not entirely resolve the discrepancy. One hypothesis is again purifying selection, which produces distortions that increase with mutation rate and lack of recombination. Because mitochondria have a relatively high mutation rate and no recombination, purifying selection may have led to a substantially more recent TMRCA than for the X.

One potentially confounding factor in this analysis is the inclusion of sites under purifying selection in the data used to infer branch lengths. At selected sites, the distribution of mutations in the true genealogy may be non-uniform, potentially leading to inferences of branch lengths that are too short (O’Fallon 2010). By including, for example, first and second codon positions in exons, estimates in these areas may be biased toward relatively small values. There are several reasons to believe that these effects did not significantly influence this analysis. First, introns were found to have a distribution of TMRCA similar to that of exons (Figure 2), a finding that is predicted if genealogical distortions at selected areas lead to distortions in closely linked sites, but not if inference in exons is erroneously biased toward small values. Second, the distortions are apparent even at very large scales (Figure 4). Because exons represent a very small fraction of all sites examined (1%, including third-codon positions and other synonymous sites at which selection is not expected), even fairly strong conservation at these sites would not be likely to generate the significant distortions seen when the mean TMRCA is taken over 100-kb or 1-Mb scales. Finally, inclusion of selected sites is not expected to lead to erroneous inference of skewness, one of the primary findings of this analysis.

Together, these results suggest that natural selection may play a significant role in shaping chromosome-wide patterns of genealogical and genetic variation. While recent selective sweeps appear to have influenced local structure in some genomic regions, purifying selection appears to generate detectable and pervasive distortions across large regions of the chromosome.

Acknowledgments

I thank Josh Akey, Leslie Emery, David Crockett, Mary Kuhner, and Joe Felsenstein for helpful insights about ancestral recombination graphs and purifying selection. The work was supported by National Science Foundation Postdoctoral Research Fellowship DBI-0906018, an Amazon Web Services Research in Education grant, and the Associated Regional and University Pathologists Institute for Clinical and Experimental Pathology.

Literature Cited

Andrés, A., M. J. Hubisz, A. Indap, D. G. Torgerson, J. D. Degenhardt et al., 2009 Targets of balancing selection in the human genome. Mol. Biol. Evol. 26(12): 2755–2764.
Bamshad, M., and S. Wooding, 2003 Signatures of natural selection in the human genome. Nat. Rev. Genet. 4(2): 99–111.
Barton, N., and A. Navarro, 2002 Extending the coalescent to multilocus systems: the case of balancing selection. Genet. Res. 79: 129–139.

Blum, M. G., and M. Jakobsson, 2011 Deep divergences of human gene trees and models of human origins. Mol. Biol. Evol. 28(2): 889–898.
Comeron, J. M., A. Williford, and R. M. Kliman, 2008 The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations. Heredity 100: 19–31.
Cooper, G. M., D. L. Goode, S. B. Ng, A. Sidow, M. J. Bamshad et al., 2010 Single-nucleotide evolutionary constraint scores highlight disease-causing mutations. Nat. Methods 7: 250–251.
Cruciani, F., B. Trombetta, A. Massaia, G. Destro-Bisol, D. Sellitto et al., 2011 A revised root for the human Y chromosomal phylogenetic tree: the origin of patrilineal diversity in Africa. Am. J. Hum. Genet. 88(6): 814–818.
Davydov, E. V., D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow et al., 2010 Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLOS Comput. Biol. 6(12): e1001025.
Drmanac, R., A. B. Sparks, M. J. Callow, A. L. Halpern, N. L. Burns et al., 2010 Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961): 78–81.
Drummond, A. J., and A. Rambaut, 2007 BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7: 214.
Endicott, P., and S. Y. Ho, 2008 A Bayesian evaluation of human mitochondrial substitution rates. Am. J. Hum. Genet. 82(4): 895–902.
Gottipati, S., L. Arbiza, A. Siepel, A. G. Clark, and A. Keinan, 2011 Analyses of X-linked and autosomal genetic variation in population-scale whole genome sequencing. Nat. Genet. 43(8): 741–743.
Griffiths, R. C., and P. Marjoram, 1996 Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3(4): 479–502.
Hammer, M. F., A. E. Woerner, F. L. Mendez, J. C. Watkins, M. P. Cox et al., 2010 The ratio of human X chromosome to autosome diversity is positively correlated with genetic distance from genes. Nat. Genet. 42(10): 830–831.
Huelsenbeck, J. P., and F. Ronquist, 2001 MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 17(8): 754–755.
Keinan, A., and D. Reich, 2010 Can a sex-biased human demography account for the reduced effective population size of chromosome X in non-Africans? Mol. Biol. Evol. 27(10): 2312–2321.
Keinan, A., J. C. Mullikin, N. Patterson, and D. Reich, 2008 Accelerated genetic drift on chromosome X during the human dispersal out of Africa. Nat. Genet. 41: 66–70.
Kuhner, M., 2006 LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 22(6): 768–770.
Liu, X., X. Jian, and E. Boerwinkle, 2011 dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Hum. Mutat. 32(8): 894–899.
Maia, L., A. Colato, and J. F. Fontanari, 2004 Effect of selection on the topology of genealogical trees. J. Theor. Biol. 226: 315–320.
Neuhauser, C., and S. M. Krone, 1997 The genealogy of samples in models with selection. Genetics 145: 519–534.
O’Fallon, B. D., 2010 A method to correct for the effects of purifying selection on genealogical inference. Mol. Biol. Evol. 27(10): 2406–2416.
O’Fallon, B. D., 2013 ACG: rapid inference of population history from recombining nucleotide sequences. BMC Bioinformatics 14: 40.
O’Fallon, B. D., J. Seger, and F. R. Adler, 2010 A continuous-state coalescent and the impact of weak selection on the structure of gene genealogies. Mol. Biol. Evol. 27(5): 1162–1172.
Roach, J. C., G. Glusman, A. F. A. Smit, C. D. Huff, R. Hubley et al., 2010 Analysis of genetic inheritance in a family quartet by whole genome sequencing. Science 328: 636–639.
Rogers, J. S., 1996 Central moments and probability distributions of three measures of phylogenetic tree imbalance. Syst. Biol. 45(1): 99–110.
Ruwende, C., S. C. Khoo, R. W. Snow, S. N. R. Yates, D. Kwiatkowski et al., 1995 Natural selection of hemi- and heterozygotes for G6PD deficiency in Africa by resistance to severe malaria. Nature 376: 246–249.
Sabeti, P. D., D. E. Reich, J. M. Higgins, H. Z. P. Levine, D. J. Richter et al., 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
Sabeti, P. D., P. Varilly, B. Fry, J. Lohmueller, E. Hostetter et al., 2007 Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.
Schwarz, J. M., C. Rodelsperger, M. Schuelke, and D. Seelow, 2010 MutationTaster evaluates disease-causing potential of sequence alterations. Nat. Methods 7(8): 575–576.
Seger, J., W. A. Smith, J. J. Perry, J. Hunn, Z. A. Kaliszewska et al., 2010 Gene genealogies strongly distorted by weakly interfering mutations in constant environments. Genetics 184: 529–545.
Tang, H., D. O. Siegmund, P. Shen, P. J. Oefner, and M. W. Feldman, 2002 Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition. Genetics 161: 447–459.
Tang, K., K. R. Thornton, and M. Stoneking, 2007 A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5(7): e171.
Templeton, A. R., 1993 The “Eve” hypothesis, a genetic critique and reanalysis. Amer. Anthropologist 95: 51–72.
Voight, B. F., S. Kudaravalli, X. Wen, and J. K. Pritchard, 2006 A map of recent positive selection in the human genome. PLoS Biol. 4(3): e72.
Walczak, A. M., L. E. Nicolaisen, J. B. Plotkin, and M. M. Desai, 2012 The structure of genealogies in the presence of purifying selection: “a fitness-class coalescent.” Genetics 190: 753–779.
Wyckoff, G. J., W. Wang, and C. I. Wu, 2000 Rapid evolution of male reproductive genes in the descent of man. Nature 403: 304–309.

Communicating editor: N. A. Rosenberg
