The Impact of Tree Topology on Neutrality Tests
INVESTIGATION

Decomposing the Site Frequency Spectrum: The Impact of Tree Topology on Neutrality Tests

Luca Ferretti,*,1 Alice Ledda,† Thomas Wiehe,‡ Guillaume Achaz,§,** and Sebastian E. Ramos-Onsins††

*The Pirbright Institute, Woking, GU24 0NF, United Kingdom, †Department of Infectious Disease Epidemiology, Imperial College, London, W2 1PG, United Kingdom, ‡Institute of Genetics, University of Cologne, D-50674, Germany, §Institut de Systématique, Evolution, Biodiversité, Unité Mixte de Recherche 7205, and **Centre Interdisciplinaire de Recherche en Biologie, Unité Mixte de Recherche 7241, Paris College de France, and ††Centre for Research in Agricultural Genomics (CRAG), Bellaterra, 08290 Barcelona, Spain

ABSTRACT We investigate the dependence of the site frequency spectrum on the topological structure of genealogical trees. We show that basic population genetic statistics, for instance, estimators of $\theta$ or neutrality tests such as Tajima’s D, can be decomposed into components of waiting times between coalescent events and of tree topology. Our results clarify the relative impact of the two components on these statistics. We provide a rigorous interpretation of positive or negative values of an important class of neutrality tests in terms of the underlying tree shape. In particular, we show that values of Tajima’s D and Fay and Wu’s H depend in a direct way on a peculiar measure of tree balance, which is mostly determined by the root balance of the tree. We present a new test for selection in the same class as Fay and Wu’s H and discuss its interpretation and power. Finally, we determine the trees corresponding to extreme expected values of these neutrality tests and present formulas for these extreme values as a function of sample size and number of segregating sites.

KEYWORDS coalescent theory; neutrality tests; site frequency spectrum; tree shape; tree balance

Coalescent theory (Kingman 1982; Hein et al.
2004; Wakeley 2009) provides a powerful framework to interpret the mutation patterns in a sample of DNA sequences. Grounded in the neutral theory of molecular evolution (Kimura 1985), binary coalescent trees are the dual backward representations of the continuous-forward-time diffusion model of genetic drift. In this view, sequences are related by a genealogical tree where leaf nodes represent the sampled sequences at present time, and internal nodes (coalescent events) represent last common ancestors of the leaves underneath. In particular, the root node represents the most recent common ancestor of the whole sample.

In species phylogeny and epidemiology, tree structure is often used to compare different models of evolution or to fit model parameters (Bouckaert et al. 2014). Two summary statistics are routinely used to characterize tree structure: the $\gamma$ statistic relates to the waiting times (Pybus et al. 2000) and the $\beta$ statistic to tree balance (Blum and François 2006). Importantly, these statistics can only be computed after the tree structure has been independently inferred, typically by phylogenetic reconstruction methods (Felsenstein 2004).

In population genetics, the historical relationship among nonrecombining sequences is represented by a single genealogical tree. The tree is completely determined by the waiting times and the branching order of coalescent events. The waiting times determine branch lengths; the branching order determines tree shape. Population genetic statistics, such as estimates of the scaled mutation rate or tests of the neutral evolution hypothesis (neutrality tests), are sensitive to waiting times and tree shape.

The site frequency spectrum (SFS) is one of the most-used statistics in population genetics. The unfolded SFS $\xi = (\xi_1, \ldots, \xi_{n-1})$ of a sample of $n$ sequences is defined as the vector of counts $\xi_i$, $i \in \{1, \ldots, n-1\}$, of all polymorphic sites with a derived allele (“mutation”) at frequency $i/n$. The SFS is a function of both tree structure and mutational process. For a given mutational process, the SFS carries information on the underlying, but not directly observable, genealogical trees and therefore on the forward process that has generated the trees. For a nonrecombining locus, the SFS carries information on the realized coalescent tree and can be used to estimate tree structure (both waiting times and topology).

Variation over time in the effective population size affects the expected waiting times between coalescent events. In the past, much attention in theoretical works has been paid to the relation between waiting times and population size variation. For example, skyline plots (Pybus et al. 2000) are directly used to infer variation of population size (Ho and Shapiro 2011), although care should be taken while using this approach (Lapierre et al. 2016). More generally, formulas of the SFS can be generalized to include deterministic changes of population size (Griffiths and Tavaré 1998; Zivkovic and Wiehe 2008; Liu and Fu 2015). In contrast, the influence of tree shape on the SFS has not yet been tackled analytically.

The shape of a tree can range from completely symmetric trees, in which all internal nodes evenly split the lineages, to caterpillar trees, in which each node isolates exactly one lineage. In the standard neutral model, as well as in any other equal-rates Markov or Yule model (Yule 1925), both of these extreme cases are very unlikely to appear by chance (Blum and François 2006). In fact, since the number of binary tree shapes [enumerated by the Wedderburn–Etherington numbers (Sloane and Plouffe 1995)] grows rapidly with the number of sequences $n$, any specific tree shape is arbitrarily improbable if $n$ is sufficiently large. Nonetheless, tree topology is a major determinant of the SFS. For example, a caterpillar shape leads to a large excess of singleton mutations, while a completely symmetric tree leads to an overrepresentation of intermediate frequency alleles.

This study aims at providing a systematic analysis of the impact of the structure of genealogical trees upon the SFS. First, we introduce the theoretical framework for neutrality tests and tree balance. In particular, we develop a new measure of imbalance appropriate for population genetics. Then, we present the decomposition of the SFS in terms of waiting times and tree shape. We discuss the case of a single nonrecombining locus, assuming a single realized tree (fixed topology). As recombination affects mostly lower branches of the tree, this also constitutes an excellent approximation for a locus with a low level of recombination.

We present a mathematically rigorous, yet intuitive, interpretation of neutrality tests in terms of tree topology and branch lengths. We focus on a subclass of tests of special interest and simplicity. A qualitative summary of the results about the interpretation of neutrality tests is given in Table 1. We also propose a new neutrality test, L, for selection.

Copyright © 2017 by the Genetics Society of America
doi: https://doi.org/10.1534/genetics.116.188763
Manuscript received March 1, 2016; accepted for publication May 19, 2017; published Early Online July 5, 2017.
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.188763/-/DC1.
1Corresponding author: The Pirbright Institute, Ash Rd., Pirbright, Woking, GU24 0NF, United Kingdom.
Genetics, Vol. 207, 229–240, September 2017

[...] internal nodes close to the root will be referred to as the “upper part” of the tree; conversely, the “lower part” is close to the leaves.

The waiting times between subsequent “binary” coalescent events, i.e., the level heights, are denoted by $t_k$. For trees with coalescent events involving multiple mergers, some of the binary waiting times could be null, i.e., $t_k = 0$. For example, if four lineages would coalesce together in a tree with five lineages, and then the two remaining lineages would coalesce to form the root, then $t_3 = 0$.

In a neutral, panmictic population of ploidy $p$ (typically $p = 1$ or $2$) and constant effective population size $N_e$ that can be modeled by the Kingman coalescent, the $t_k$ are exponentially distributed with parameter $k(k-1)$ when time is measured in units of $2pN_e$ generations (Wakeley 2009). Two summary tree statistics are the height $h = \sum_{k=2}^{n} t_k$, which is the time from the present to the most recent common ancestor, and the total tree length $l = \sum_{k=2}^{n} k\,t_k$. Basic coalescent theory states $E(h) = 1 - 1/n$ and $E(l) = a_n$, where $a_n = \sum_{i=1}^{n-1} 1/i$ is the $(n-1)$th harmonic number.

Tree imbalance per level

Following Fu (1995), we define the size $d_k$ of a branch from level $k$ as the number of leaves that descend from that branch. Any mutation on this branch is carried by $d_k$ sequences from the present sample. We denote by $P(d_k = i \mid T)$ the probability that a randomly chosen branch of level $k$ is of size $i$, given tree $T$. The complete set of distributions $P(d_k = i \mid T)$ for each $i$ and $k$ uniquely determines the shape of the tree $T$.

The mean number of descendants across all branches from level $k$ is $E(d_k) = \sum_{i=1}^{n-k+1} i\,P(d_k = i \mid T) = n/k$. This holds for any tree, since all $n$ present-day sequences must descend from one of the $k$ branches from that level.

In contrast, the size variance, $\mathrm{Var}(d_k)$, depends on the tree topology: at all levels, it is almost zero in completely balanced trees and maximal in caterpillar trees, where all nodes isolate one leaf from the remaining subtree. For this reason, we propose the variance $\mathrm{Var}(d_k)$ as the natural measure of imbalance for each level.

The bounds on $\mathrm{Var}(d_k)$, shown in Figure 1A, vary greatly from level to level: for example, the variance of the uppermost level is $\mathrm{Var}(d_2) \in [0, (n/2 - 1)^2]$, whereas $\mathrm{Var}(d_n) = 0$ (since $d_n = 1$ for all branches). More generally, the maximum variance at a given level $k$ is obtained in trees where $k - 1$ lineages lead to exactly one leaf and one lineage has