INVESTIGATION

Decomposing the Site Frequency Spectrum: The Impact of Tree Topology on Neutrality Tests

Luca Ferretti,*,1 Alice Ledda,† Thomas Wiehe,‡ Guillaume Achaz,§,** and Sebastian E. Ramos-Onsins†† *The Pirbright Institute, Woking, GU24 0NF, United Kingdom, †Department of Infectious Disease Epidemiology, Imperial College, London, W2 1PG, United Kingdom, ‡Institute of Genetics, University of Cologne, D-50674, Germany, §Institut de Systématique, Evolution, Biodiversité, Unité Mixte de Recherche 7205, and **Centre Interdisciplinaire de Recherche en Biologie, Unité Mixte de Recherche 7241, Paris College de France, and ††Centre for Research in Agricultural Genomics (CRAG), Bellaterra, 08290 Barcelona, Spain

ABSTRACT We investigate the dependence of the site frequency spectrum on the topological structure of genealogical trees. We show that basic population genetic statistics, for instance, estimators of θ or neutrality tests such as Tajima’s D, can be decomposed into components of waiting times between coalescent events and of tree topology. Our results clarify the relative impact of the two components on these statistics. We provide a rigorous interpretation of positive or negative values of an important class of neutrality tests in terms of the underlying tree shape. In particular, we show that values of Tajima’s D and Fay and Wu’s H depend in a direct way on a peculiar measure of tree balance, which is mostly determined by the root balance of the tree. We present a new test for selection in the same class as Fay and Wu’s H and discuss its interpretation and power. Finally, we determine the trees corresponding to extreme expected values of these neutrality tests and present formulas for these extreme values as a function of sample size and number of segregating sites.

KEYWORDS coalescent theory; neutrality tests; site frequency spectrum; tree shape; tree balance

Copyright © 2017 by the Genetics Society of America
doi: https://doi.org/10.1534/genetics.116.188763
Manuscript received March 1, 2016; accepted for publication May 19, 2017; published Early Online July 5, 2017.
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.188763/-/DC1.
1 Corresponding author: The Pirbright Institute, Ash Rd., Pirbright, Woking, GU24 0NF, United Kingdom. E-mail: [email protected]
Genetics, Vol. 207, 229–240 September 2017

Coalescent theory (Kingman 1982; Hein et al. 2004; Wakeley 2009) provides a powerful framework to interpret the mutation patterns in a sample of DNA sequences. Grounded in the neutral theory of molecular evolution (Kimura 1985), binary coalescent trees are the dual backward representations of the continuous-forward-time diffusion model of genetic drift. In this view, sequences are related by a genealogical tree where leaf nodes represent the sampled sequences at present time, and internal nodes (coalescent events) represent last common ancestors of the leaves underneath. In particular, the root node represents the most recent common ancestor of the whole sample.

In phylogeny and epidemiology, tree structure is often used to compare different models of evolution or to fit model parameters (Bouckaert et al. 2014). Two summary statistics are routinely used to characterize tree structure: the γ statistic relates to the waiting times (Pybus et al. 2000) and the β statistic to tree balance (Blum and François 2006). Importantly, these statistics can only be computed after the tree structure was independently inferred, typically by phylogenetic reconstruction methods (Felsenstein 2004).

In population genetics, the historical relationship among nonrecombining sequences is represented by a single genealogical tree. The tree is completely determined by the waiting times and the branching order of coalescent events. The waiting times determine branch lengths; the branching order determines tree shape. Population genetic statistics, such as estimates of the scaled mutation rate or tests of the neutral evolution hypothesis (neutrality tests), are sensitive to waiting times and tree shape.

The site frequency spectrum (SFS) is one of the most-used statistics in population genetics. The unfolded SFS $\xi = (\xi_1, \ldots, \xi_{n-1})$ of a sample of n sequences is defined as the vector of counts $\xi_i$, $i \in \{1, \ldots, n-1\}$, of all polymorphic sites with a derived allele (“mutation”) at frequency $i/n$. The SFS is a function of both tree structure and mutational process. For a given mutational process, the SFS carries information on the underlying, but not directly observable, genealogical trees and therefore on the forward process that has generated the trees. For a nonrecombining locus, the SFS carries information on the realized coalescent tree and can be used to estimate tree structure (both waiting times and topology).

Variation over time in the effective population size affects the expected waiting times between coalescent events. In the past, much attention in theoretical works has been paid to the relation between waiting times and population size variation. For example, skyline plots (Pybus et al. 2000) are directly used to infer variation of population size (Ho and Shapiro 2011), although care should be taken while using this approach (Lapierre et al. 2016). More generally, formulas of the SFS can be generalized to include deterministic changes of population size (Griffiths and Tavaré 1998; Zivkovic and Wiehe 2008; Liu and Fu 2015). In contrast, the influence of tree shape on the SFS has not yet been tackled analytically.

The shape of a tree can range from completely symmetric trees, in which all internal nodes evenly split the lineages, to caterpillar trees, in which each node isolates exactly one lineage. In the standard neutral model, as well as in any other equal-rates Markov or Yule model (Yule 1925), both of these extreme cases are very unlikely to appear by chance (Blum and François 2006). In fact, since the number of binary tree shapes [enumerated by the Wedderburn–Etherington numbers (Sloane and Plouffe 1995)] grows rapidly with the number of sequences n, any specific tree shape is arbitrarily improbable if n is sufficiently large. Nonetheless, tree topology is a major determinant of the SFS. For example, a caterpillar shape leads to a large excess of singleton mutations, while a completely symmetric tree leads to an overrepresentation of intermediate frequency alleles.

This study aims at providing a systematic analysis of the impact of the structure of genealogical trees upon the SFS. First, we introduce the theoretical framework for neutrality tests and tree balance. In particular, we develop a new measure of imbalance appropriate for population genetics. Then, we present the decomposition of the SFS in terms of waiting times and tree shape. We discuss the case of a single nonrecombining locus, assuming a single realized tree (fixed topology). As recombination affects mostly lower branches of the tree, this also constitutes an excellent approximation for a locus with a low level of recombination.

We present a mathematically rigorous, yet intuitive, interpretation of neutrality tests in terms of tree topology and branch lengths. We focus on a subclass of tests of special interest and simplicity. A qualitative summary of the results about the interpretation of neutrality tests is given in Table 1. We also propose a new neutrality test, L, for selection. Finally, we find the trees corresponding to the maximum and minimum expected values of the tests and provide explicit formulas for these extreme values.

Methods

Trees can be divided into time segments (“levels”) delimited by the nodes. Each level is unambiguously characterized by its number of lineages k, $2 \le k \le n$. The most recent level has n lineages; the most ancient level (from the root to the next internal node) has two lineages. Hereafter, the branches and internal nodes close to the root will be referred to as the “upper part” of the tree; conversely, the “lower part” is close to the leaves.

The waiting times between subsequent “binary” coalescent events, i.e., the level heights, are denoted by $t_k$. For trees with coalescent events involving multiple mergers, some of the binary waiting times could be null, i.e., $t_k = 0$. For example, if four lineages would coalesce together in a tree with five lineages, and then the two remaining lineages would coalesce to form the root, then $t_3 = 0$.

In a neutral, panmictic population of ploidy p (typically p = 1 or 2) and constant effective population size $N_e$ that can be modeled by the Kingman coalescent, the $t_k$ are exponentially distributed with parameter $k(k-1)$, when the time is measured in $2pN_e$ generations (Wakeley 2009). Two summary tree statistics are the height $h = \sum_{k=2}^{n} t_k$, which is the time from the present to the most recent common ancestor, and the total tree length $l = \sum_{k=2}^{n} k t_k$. Basic coalescent theory states $E(h) = 1 - 1/n$ and $E(l) = a_n$, where $a_n = \sum_{i=1}^{n-1} 1/i$ is the (n − 1)th harmonic number.

Tree imbalance per level

Following Fu (1995), we define the size $d_k$ of a branch from level k as the number of leaves that descend from that branch. Any mutation on this branch is carried by $d_k$ sequences from the present sample. We denote by $P(d_k = i|T)$ the probability that a randomly chosen branch of level k is of size i, given tree T. The complete set of distributions $P(d_k = i|T)$ for each i and k determines uniquely the shape of the tree T.

The mean number of descendants across all branches from level k is $E(d_k) = \sum_{i=1}^{n-k+1} i\,P(d_k = i|T) = n/k$. This holds for any tree, since all n present-day sequences must descend from one of the k branches from that level.

In contrast, the size variance, $\mathrm{Var}(d_k)$, depends on the tree topology: at all levels, it is almost zero in completely balanced trees and maximal in caterpillar trees, where all nodes isolate one leaf from the remaining subtree. For this reason, we propose the variance $\mathrm{Var}(d_k)$ as the natural measure of imbalance for each level.

The bounds on $\mathrm{Var}(d_k)$, shown in Figure 1A, vary greatly from level to level: for example, the variance of the uppermost level is $\mathrm{Var}(d_2) \in [0, (n/2 - 1)^2]$, whereas $\mathrm{Var}(d_n) = 0$ (since $d_n = 1$ for all branches). More generally, the maximum variance at a given level k is obtained in trees where k − 1 lineages lead to exactly one leaf and one lineage has n − k + 1 descendants. For this case, we compute

$$\max_T \mathrm{Var}(d_k) = \frac{k-1}{k}\,1^2 + \frac{1}{k}(n-k+1)^2 - \frac{n^2}{k^2} = (k-1)\left(\frac{n}{k} - 1\right)^2. \tag{1}$$

Minimum variance at level k is obtained when all lineages have either $\lfloor n/k \rfloor$ or $\lfloor n/k \rfloor + 1$ descendants (we denote by $\lfloor x \rfloor$ the floor of x, i.e., the largest integer smaller than or equal to x), and it is always small (i.e., $\le 1/4$):

$$\min_T \mathrm{Var}(d_k) = \left(\frac{n}{k} - \left\lfloor \frac{n}{k} \right\rfloor\right)\left(\left\lfloor \frac{n}{k} \right\rfloor + 1 - \frac{n}{k}\right) \le \frac{1}{4}. \tag{2}$$

Table 1 Interpreting neutrality tests

Tree imbalance in population genetics vs. phylogenetics

Measures of tree topology, and especially tree imbalance, have received considerable attention in the phylogenetic literature (Blum and François 2006). Several measures of imbalance have been proposed, among them the Sackin’s and Colless’ indices (Blum and François 2005; Blum et al. 2006), which depend only on tree topology and not on branch lengths. In the context of phylogenies of genes from different species, the divergence is expected to be large enough such that there would be substitutions on all branches to resolve, in principle, the tree topology. Some of these substitutions are expected to have a functional role (e.g., nonsynonymous substitutions, indels). Therefore, almost all splits between lineages should be detectable and correspond to a functional/phenotypic difference between species.
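The per-level bounds of Equations 1 and 2 can be checked numerically. The following is a minimal sketch, not part of the original article: the helper names (`variance`, `caterpillar_level`, `balanced_level`) are hypothetical, and a "level" is represented simply as the list of its k branch sizes, without checking that a single tree realizes all such levels at once.

```python
from fractions import Fraction

def variance(sizes):
    """Variance of branch sizes across the k branches of one level (exact)."""
    k = len(sizes)
    mean = Fraction(sum(sizes), k)          # always n/k
    return sum((s - mean) ** 2 for s in sizes) / k

def max_var(n, k):
    """Equation 1: (k - 1) * (n/k - 1)^2."""
    return (k - 1) * (Fraction(n, k) - 1) ** 2

def min_var(n, k):
    """Equation 2: (n/k - floor(n/k)) * (floor(n/k) + 1 - n/k)."""
    q = Fraction(n, k)
    f = n // k
    return (q - f) * (f + 1 - q)

def caterpillar_level(n, k):
    """k - 1 lineages with one leaf each, one lineage with n - k + 1 leaves."""
    return [1] * (k - 1) + [n - k + 1]

def balanced_level(n, k):
    """All lineages have floor(n/k) or floor(n/k) + 1 leaves."""
    f, r = divmod(n, k)
    return [f + 1] * r + [f] * (k - r)

n = 20
for k in range(2, n + 1):
    assert variance(caterpillar_level(n, k)) == max_var(n, k)
    assert variance(balanced_level(n, k)) == min_var(n, k)
    assert min_var(n, k) <= Fraction(1, 4)
```

Exact rational arithmetic is used so that the closed forms and the explicit configurations can be compared with `==` rather than with a floating-point tolerance.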

In theory, the same statistics could be applied to the genealogy of a single population or a sample. However, genealogical trees in population genetics are usually much shorter than phylogenetic trees. For short nonrecombining sequences, there could be many branches without any mutation event on them. This raises two issues: the detection of imbalance in trees with short branches, and its evolutionary meaning.

Regarding detection, neither a mutation-free branch nor the split above it could be detected from sequences. Hence, a split should be weighted by the probability of being detected through sequence comparison. For example, let $I_k$ be a measure of imbalance at the kth level. Using the probability that there is at least one mutation on level k, that is $1 - e^{-\theta k t_k}$, as a weight function, the combined statistics becomes

$$I = \frac{\sum_{k=2}^{n} \left(1 - e^{-\theta k t_k}\right) I_k}{\sum_{k=2}^{n} \left(1 - e^{-\theta k t_k}\right)} \approx \frac{\sum_{k=2}^{n} \theta k t_k I_k}{\sum_{k=2}^{n} \theta k t_k} = \frac{1}{l} \sum_{k=2}^{n} k t_k I_k \tag{3}$$

for small θ.

On the other hand, from an evolutionary point of view, the importance of a given branch, and of the adjacent splits in the tree, is related to the number of mutations on the branch. For example, consider a branch that is not supported by any mutation. Its significance for future evolution is null, since there is no selective difference between identical alleles and there is no effect on the genetic variability of the population. This branch could be contracted to zero length, and the splits collapsed into a polytomy of three lineages, without any effect on the present population or on future evolution (even a branch that is supported only by nonepistatic neutral mutations does not affect in any way the future selective processes, even if it has an impact on the genetic diversity of the population). Since mutation-free branches do not have evolutionary significance, their weight in imbalance measures should be low.

Since selective effects and effects on genetic diversity are both proportional to the number of mutations along the branches, it seems reasonable to weight local imbalance measures by the expected number of mutations in the branches supporting them. For example, a measure of imbalance $I_k$ for each level would be weighted by the expected number of mutations $\theta k t_k$ at that level. In this case, we obtain the same statistics $I = \sum_{k=2}^{n} k t_k I_k / l$ as in Equation 3 above.

A new measure of tree imbalance

We propose an informative statistics on tree balance based on $\mathrm{Var}(d_k)$ and the reasoning in the previous section. We can compute the variance in branch size for each level of the tree, then average it across levels. Fixing a tree T, the average variance in branch size across all levels k is

$$\mathrm{Var}(d) = \frac{\sum_{k=2}^{n} k t_k \mathrm{Var}(d_k)}{\sum_{k=2}^{n} k t_k} = \frac{1}{l} \sum_{k=2}^{n} k t_k \mathrm{Var}(d_k). \tag{4}$$

This summary statistic contains the natural weights $k t_k / l$, which is the fraction of branch lengths at level k, discussed in the previous section. Note that this average is different from the total variance in offspring number, i.e., when the variance of sizes is taken across all branches, irrespective of their level. The statistics $\mathrm{Var}(d)$ corresponds instead to the “within-level” component of variance.

Figure 1 Plots of contributions of different levels k = 2…20 of a tree of n = 20 individuals. (A) Mean, maximum, and minimum contributions of different levels to the variance $\mathrm{Var}(d)\,l$. In black, the contribution per unit waiting time; in blue, the total contribution per level in the Kingman coalescent. (B) Mean contribution to the value of the tests of each imbalance component $\mathrm{Var}(d_k)$ (blue) and each residual purely waiting time component $t_k$ (green) under neutrality (i.e., for the Kingman coalescent) with S = 20. The sum of all contributions for each test is zero.

To better understand $\mathrm{Var}(d)$, we study the extent to which each level contributes to the statistics. Figure 1A shows contributions per unit time and per whole level. In the first case, $\mathrm{Var}(d_k)$ are weighted by the number of lineages k, while in the second case they are weighted by the length at level k, $k E(t_k)$, which is $1/(k-1)$ for constant population size. Figure 1A shows that the largest contributions to $\mathrm{Var}(d)$ come from the levels close to the root. In particular, for the neutral model at constant population size, the dominant contribution comes from the uppermost level, i.e., from $\mathrm{Var}(d_2)$. This measure contains the same information as the root balance $\omega_1$, defined as the smaller of the two root branch sizes:

$$\mathrm{Var}(d_2) = \frac{1}{2}\left(\omega_1 - \frac{n}{2}\right)^2 + \frac{1}{2}\left(n - \omega_1 - \frac{n}{2}\right)^2 = \left(\frac{n}{2} - \omega_1\right)^2. \tag{5}$$

Hence the imbalance measure $\mathrm{Var}(d)$ depends strongly on the root balance $\omega_1$, which has been previously recognized as a meaningful global measure of tree balance (Ferretti et al. 2013; Li and Wiehe 2013), and on the imbalances of the first upper splits as well.

Estimators of θ and neutrality tests

A fundamental population genetic quantity is the scaled mutation rate $\theta = 2pN_e\mu$, where μ is the mutation rate per generation per sequence. θ is the key parameter of the neutral mutation–drift equilibrium. Usually, it cannot be measured directly, but can only be estimated from observable data. For example, under the standard neutral model (i.e., constant population size) an unbiased estimator of θ is Watterson’s $\hat\theta_W = S/a_n$, where S is the number of observed polymorphic sites in a sequence sample of size n (“segregating sites”) (Watterson 1975).

More generally, it has been shown that many of the well-known θ estimators can be expressed as linear combinations of the components $\xi_i$ of the SFS (Tajima 1983; Achaz 2009; Ferretti et al. 2010). For example, $\hat\theta_W = \sum_{i=1}^{n-1} (1/a_n)\,\xi_i$ or Tajima’s $\hat\theta_\pi = \sum_{i=1}^{n-1} \left(2i(n-i)/n(n-1)\right)\xi_i$. Other estimators are presented in Table 2. Furthermore, the classical neutrality tests (in their nonnormalized version) can be written as a difference between two θ estimators, hence as a linear combination of the $\xi_i$. For instance, the nonnormalized Tajima’s D (Tajima 1989) is $\hat\theta_\pi - \hat\theta_W$, while Fay and Wu’s H (Fay and Wu 2000) is $\hat\theta_\pi - \hat\theta_H$. The most common tests are presented in Table 3. Their expression as linear combinations of the $\xi_i$ helps to understand discrepancies between these tests through their weight functions. For instance, from the weight functions, it is immediately clear that H assigns large negative weight only to $\xi_i$ with large i (high-frequency derived alleles), while D assigns negative weight to $\xi_i$ with small and large i (rare alleles).

Table 2 Selected unbiased linear estimators of θ

| Estimator | Formula | Weights ($w_i$) | α | β | γ | Reference |
| $\hat\theta_W$ | $\frac{1}{a_n}\sum_{i=1}^{n-1} \xi_i$ | $1/(i a_n)$ | 0 | 0 | $1/a_n$ | Watterson (1975) |
| $\hat\theta_\pi$ | $\binom{n}{2}^{-1}\sum_{i=1}^{n-1} i(n-i)\,\xi_i$ | $(n-i)/\binom{n}{2}$ | $-2/n(n-1)$ | $2/(n-1)$ | 0 | Tajima (1983) |
| $\hat\theta_L$ | $\frac{1}{n-1}\sum_{i=1}^{n-1} i\,\xi_i$ | $1/(n-1)$ | 0 | $1/(n-1)$ | 0 | Zeng et al. (2006) |
| $\hat\theta_H$ | $\binom{n}{2}^{-1}\sum_{i=1}^{n-1} i^2\,\xi_i$ | $i/\binom{n}{2}$ | $2/n(n-1)$ | 0 | 0 | Fay and Wu (2000) |
| $\hat\theta_{\xi_1}$ | $\xi_1$ | $\delta_{i,1}$ | — | — | — | Fu and Li (1993) |

For each component $\xi_i$ of the SFS, the product $i\,\xi_i$ is an unbiased estimator of $\theta = 2pN_e\mu$. Hence, given weights $(w_1, \ldots, w_{n-1})$, the weighted linear combination

$$\hat\theta_w = \frac{1}{\sum_{i} w_i} \sum_{i=1}^{n-1} w_i\, i\, \xi_i \tag{6}$$

is also an unbiased estimator of θ. For instance, Watterson’s estimator $\hat\theta_W = S/a_n$ follows from setting $w_i = 1/i$ in Equation 6; Tajima’s estimator $\hat\theta_\pi$ (Tajima 1983) is obtained by letting $w_i = (n - i)$. In fact, one can write all usual θ estimators (Tajima 1989; Fu and Li 1993; Fay and Wu 2000) as linear combinations of the SFS with adequate weights (Achaz 2009) detailed in Table 2:

$$T_\Omega = \frac{1}{N_\Omega(S)} \sum_{i=1}^{n-1} \Omega_i\, i\, \xi_i, \tag{7}$$

where $N_\Omega(S) = \sqrt{\mathrm{Var}\left(\sum_{i=1}^{n-1} \Omega_i\, i\, \xi_i\right)}$. In this expression, θ is usually estimated by method of moment as $\hat\theta = S/a_n$, $\hat\theta^2 = S(S-1)/(a_n^2 + b_n)$ with $b_n = \sum_{i=1}^{n-1} 1/i^2$ (Tajima 1989). Hence the general form of $N_\Omega(S)$ is $N_\Omega(S) = \sqrt{\lambda_n^\Omega S + \kappa_n^\Omega S(S-1)}$ with appropriate coefficients $\lambda_n^\Omega$ and $\kappa_n^\Omega$, reported in Table 4 for some tests.

Decomposition of the SFS and its combinations

Here we discuss the dependence of the average spectrum $E(\xi)$ on tree topology and branch lengths.

The SFS is determined by the number of mutations of size i, $1 \le i \le n - 1$. A mutation has size i if it appears on a branch of size i. We assume that mutations occur along branches according to a homogeneous Poisson process with rate μ per unit time. Fixing a tree with respect to shape and branch lengths, we can average over the mutation process. Denoting by $E_\mu$ the expected value for the mutation process, we obtain for the mean frequency spectrum (Fu 1995)

$$E_\mu(\xi_i|T) = \theta \sum_{k=2}^{n} k t_k\, P(d_k = i|T), \tag{8}$$

where $P(d_k = i|T)$ is the distribution of $d_k$, the number of descendants of the branches of level k. These probabilities depend only on the shape of the tree T and not on waiting times. The full set of $P(d_k = i|T)$, k = 2…n, actually gives a complete description of the tree up to permutation of the leaves.
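Equation 8 and the linear estimators of Table 2 can be illustrated on a concrete tree. The sketch below is an illustration under stated assumptions, not the authors' code: the tree is a caterpillar whose waiting times are set to the Kingman expectations $E(t_k) = 1/(k(k-1))$, and the dictionary layout `{k: (t_k, sizes)}` and all function names are hypothetical.

```python
from fractions import Fraction

def caterpillar(n):
    """Levels of a caterpillar tree: {k: (t_k, branch sizes at level k)}.
    Waiting times are set to the Kingman expectation 1 / (k (k - 1))."""
    return {k: (Fraction(1, k * (k - 1)), [1] * (k - 1) + [n - k + 1])
            for k in range(2, n + 1)}

def expected_sfs(theta, levels, n):
    """Equation 8. Since P(d_k = i | T) = (branches of size i) / k, the term
    k * t_k * P(d_k = i | T) is t_k times the number of branches of size i."""
    xi = [Fraction(0)] * n                  # xi[i] for i = 1 .. n-1; xi[0] unused
    for k, (t_k, sizes) in levels.items():
        for s in sizes:
            xi[s] += theta * t_k
    return xi

def a(n):
    return sum(Fraction(1, i) for i in range(1, n))

def theta_W(xi, n):
    return sum(xi[1:n]) / a(n)

def theta_pi(xi, n):
    return sum(2 * i * (n - i) * xi[i] for i in range(1, n)) / (n * (n - 1))

def theta_L(xi, n):
    return sum(i * xi[i] for i in range(1, n)) / (n - 1)

def theta_H(xi, n):
    return sum(2 * i * i * xi[i] for i in range(1, n)) / (n * (n - 1))

n, theta = 10, Fraction(1)
xi = expected_sfs(theta, caterpillar(n), n)
print("excess of singletons:", float(xi[1]), "vs neutral expectation", float(theta))
print("nonnormalized D:", float(theta_pi(xi, n) - theta_W(xi, n)))
print("nonnormalized H:", float(theta_pi(xi, n) - theta_H(xi, n)))
```

Because the tree length and height are fixed to their neutral expectations here, $\hat\theta_W$ and $\hat\theta_L$ recover θ exactly, so any deviation of the other estimators in this example is due to topology alone.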

Table 3 Neutrality tests discussed in this article

| Test | Formula | Weights ($\Omega_i$) | α | β | γ | Reference |
| D | $\hat\theta_\pi - \hat\theta_W$ | $(n-i)/\binom{n}{2} - 1/(i a_n)$ | $-2/n(n-1)$ | $2/(n-1)$ | $-1/a_n$ | Tajima (1989) |
| H | $\hat\theta_\pi - \hat\theta_H$ | $(n-2i)/\binom{n}{2}$ | $-4/n(n-1)$ | $2/(n-1)$ | 0 | Fay and Wu (2000) |
| E | $\hat\theta_L - \hat\theta_W$ | $1/(n-1) - 1/(i a_n)$ | 0 | $1/(n-1)$ | $-1/a_n$ | Zeng et al. (2006) |
| L | $\hat\theta_W - \hat\theta_H$ | $1/(i a_n) - i/\binom{n}{2}$ | $-2/n(n-1)$ | 0 | $1/a_n$ | This study |
| $D_{FL}$ | $\hat\theta_{\xi_1} - \hat\theta_W$ | $\delta_{i,1} - 1/(i a_n)$ | — | — | — | Fu and Li (1993) |

Replacing $\xi_i$ by their mean according to Equation 8, we obtain the general expression for the mean of SFS-based θ estimators

$$E_\mu(\hat\theta_w|T) = \frac{\theta}{\sum_i w_i} \sum_{i=1}^{n-1} i\, w_i \sum_{k=2}^{n} k t_k\, P(d_k = i|T) \tag{9}$$

and tests

$$E_\mu(T_\Omega|T) = f_\Omega(\theta l)\, \frac{1}{l} \sum_{i=1}^{n-1} i\, \Omega_i \sum_{k=2}^{n} k t_k\, P(d_k = i|T), \tag{10}$$

where the normalization function $f_\Omega$ is defined by

$$f_\Omega(\theta l) = E_\mu\left(\frac{S}{N_\Omega(S)}\right) \tag{11}$$

and depends on $\theta l$ only, since S is a Poisson variable with parameter $\theta l$.

It is also possible to condition on S as well, obtaining

$$E_\mu(\xi_i|T, S) = \frac{S}{l} \sum_{k=2}^{n} k t_k\, P(d_k = i|T) \tag{12}$$

and

$$E_\mu(T_\Omega|T, S) = \frac{S}{N_\Omega(S)}\, \frac{1}{l} \sum_{i=1}^{n-1} i\, \Omega_i \sum_{k=2}^{n} k t_k\, P(d_k = i|T). \tag{13}$$

A new subclass of neutrality tests and their decomposition

Interestingly, several common tests (and estimators) are polynomials up to second order in the frequency of mutations, and hence can be written in terms of a general weight function of the form

$$\Omega_i = \alpha i + \beta + \gamma / i, \tag{14}$$

with appropriate values of α, β, and γ satisfying

$$\alpha\, \frac{n(n-1)}{2} + \beta (n-1) + \gamma a_n = 0 \quad (\text{or} = 1 \text{ for estimators}). \tag{15}$$

For instance, $\hat\theta_W$ has $\alpha = \beta = 0$ and $\gamma = 1/a_n$, while $\hat\theta_\pi$ has $\alpha = -2/n(n-1)$, $\beta = 2/(n-1)$, and $\gamma = 0$; hence their difference, Tajima’s D, has $\alpha = -2/n(n-1)$, $\beta = 2/(n-1)$, and $\gamma = -1/a_n$. Coefficients for other estimators and tests can be found in Table 2 and Table 3. With this special class of weights, Equation 10 becomes

$$E_\mu(T_\Omega|T) = \frac{f_\Omega(\theta l)}{l} \sum_{i=1}^{n-1} \left(\alpha i^2 + \beta i + \gamma\right) \sum_{k=2}^{n} k t_k\, P(d_k = i|T). \tag{16}$$

Using $\sum_{i=1}^{n-1} i\, P(d_k = i|T) = E(d_k) = n/k$ and $\sum_{i=1}^{n-1} i^2 P(d_k = i|T) = \mathrm{Var}(d_k) + E^2(d_k)$ and exchanging the order of the sums, this becomes

$$E_\mu(T_\Omega|T) = f_\Omega(\theta l) \left[ \alpha\, \mathrm{Var}(d) + \sum_{k=2}^{n} \frac{t_k}{l} \left( \alpha \frac{n^2}{k} + \beta n + \gamma k \right) \right]. \tag{17}$$

Data availability

The human SNP data are publicly available on the 1000 Genomes Project Web site. The tests discussed in this article are implemented in the mstatspop software, available at https://github.com/cragenomica/mstatspop.

Results

Interpretation of neutrality tests for a single locus

We consider the application of neutrality tests to a single locus without recombination, i.e., with a given genealogy. We show that some commonly used test statistics have a simple but rigorous interpretation in terms of tree imbalance and waiting times. The tests are summarized in Table 3 and their interpretation in Table 1. The weight of the different components is illustrated in Figure 1B.

Tajima’s D test: This is the most-used neutrality test. It is proportional to the difference $\hat\theta_\pi - \hat\theta_W$. If positive, it indicates an excess of common alleles; if negative, an excess of rare alleles.
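The step from Equation 16 to Equation 17 can be verified numerically on an arbitrary genealogy. The sketch below is an illustration only, with hypothetical names throughout: it builds a random binary tree with arbitrary positive waiting times (not Kingman-distributed), drops the common prefactor $f_\Omega(\theta l)$, and compares the double sum of Equation 16 with the decomposed form of Equation 17 for the (α, β, γ) triplets of Table 3.

```python
import random
from fractions import Fraction

def random_tree(n, rng):
    """Random binary genealogy as {k: (t_k, branch sizes at level k)}."""
    sizes = [1] * n
    levels = {}
    for k in range(n, 1, -1):
        t_k = Fraction(rng.randint(1, 100), 100)   # arbitrary positive waiting time
        levels[k] = (t_k, list(sizes))
        i, j = rng.sample(range(k), 2)              # the two lineages that coalesce
        merged = sizes[i] + sizes[j]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (i, j)]
        sizes.append(merged)
    return levels

def level_variance(sizes):
    k = len(sizes)
    mean = Fraction(sum(sizes), k)
    return sum((s - mean) ** 2 for s in sizes) / k

def tree_length(levels):
    return sum(k * t for k, (t, _) in levels.items())

def direct_mean(levels, alpha, beta, gamma):
    """Equation 16 without f_Omega: k t_k P(d_k = i|T) summed over i equals
    t_k summed over the branches of level k."""
    total = sum((alpha * s * s + beta * s + gamma) * t
                for t, sizes in levels.values() for s in sizes)
    return total / tree_length(levels)

def decomposed_mean(levels, n, alpha, beta, gamma):
    """Equation 17 without f_Omega: imbalance term plus waiting-time term."""
    l = tree_length(levels)
    var_d = sum(k * t * level_variance(sizes)
                for k, (t, sizes) in levels.items()) / l          # Equation 4
    rest = sum(t * (alpha * Fraction(n * n, k) + beta * n + gamma * k)
               for k, (t, _) in levels.items()) / l
    return alpha * var_d + rest

def table3_weights(n):
    a_n = sum(Fraction(1, i) for i in range(1, n))
    c = Fraction(2, n * (n - 1))
    return {"D": (-c, Fraction(2, n - 1), -1 / a_n),
            "H": (-2 * c, Fraction(2, n - 1), Fraction(0)),
            "E": (Fraction(0), Fraction(1, n - 1), -1 / a_n),
            "L": (-c, Fraction(0), 1 / a_n)}

n = 12
tree = random_tree(n, random.Random(1))
for name, (al, be, ga) in table3_weights(n).items():
    assert direct_mean(tree, al, be, ga) == decomposed_mean(tree, n, al, be, ga)
```

The equality is exact because rational arithmetic is used; with floating-point waiting times the two sides would agree only up to rounding.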

Table 4 Coefficients of the normalization of the neutrality tests discussed in this article

| Test | $\lambda_n^\Omega$ | $\kappa_n^\Omega$ |
| D | $\frac{n+1}{3(n-1)a_n} - \frac{1}{a_n^2}$ | $\frac{1}{a_n^2 + b_n}\left[\frac{2(n^2+n+3)}{9n(n-1)} - \frac{n+2}{n a_n} + \frac{b_n}{a_n^2}\right]$ |
| H | $\frac{n-2}{6(n-1)a_n}$ | $\frac{18n^2(3n+2)b_{n+1} - (88n^3 + 9n^2 - 13n + 6)}{9n(n-1)^2(a_n^2 + b_n)}$ |
| E | $\frac{n}{2(n-1)a_n} - \frac{1}{a_n^2}$ | $\frac{1}{a_n^2 + b_n}\left[\frac{b_n}{a_n^2} + 2\left(\frac{n}{n-1}\right)^2 b_n - \frac{2(n b_n - n + 1)}{(n-1)a_n} - \frac{3n+1}{n-1}\right]$ |
| L | $\frac{1}{a_n}\left(1 - \frac{1}{a_n}\right)$ | $\frac{1}{a_n^2 + b_n}\left[\frac{b_n}{a_n^2} + \frac{2\left(36n^2(2n+1)b_{n+1} - 116n^3 + 9n^2 + 2n - 3\right)}{9n(n-1)^2} - \frac{4n^2 b_n - (5n+2)(n-1)}{a_n\, n(n-1)}\right]$ |

Watterson’s estimator $\hat\theta_W$ itself has a simple interpretation. In fact, its average is $E_\mu(\hat\theta_W) = \theta\,(l/a_n)$, i.e., it is proportional to the total length of the tree, divided by the mean length. As such, it is independent from the tree topology. In more practical terms, it is independent of mutation frequencies.

Using the result of section A new subclass of neutrality tests and their decomposition with the weights in Table 3, we can reexpress the mean Tajima’s D as

$$E_\mu(D|T) = f_D(\theta l) \left\{ -\frac{2}{n(n-1)}\, \mathrm{Var}(d) + \frac{1}{l} \sum_{k=2}^{n} t_k \left[ \frac{2n}{n-1}\left(1 - \frac{1}{k}\right) - \frac{k}{a_n} \right] \right\}, \tag{18}$$

i.e., $E_\mu(D|T)$ can be decomposed into two components: one that is a linear combination of tree lengths, independent from the topology, plus a topological component that corresponds to the measure of tree imbalance $\mathrm{Var}(d)$ introduced before.

In qualitative terms, Tajima’s D is the sum of an imbalance term with negative sign plus terms that give positive weight to the ancient waiting times, and negative weight to the recent ones:

D ≈ − tree imbalance + length of upper branches − length of lower branches.

Therefore, Tajima’s D is large and positive when there are long branches close to the root. It is strongly negative when the tree is unbalanced and/or when recent branches are long. Tajima’s D is thus sensitive to both unbalanced trees and trees with long branches close to the leaves (when negative), and balanced trees with long branches close to the root (when positive). The former are typical trees for recently increasing populations or loci under directional selection, the latter are typical under balancing selection or for structured populations or contractions in population size.

Note that the definition of upper and lower branches, as well as their weighting, depends explicitly on the test and on the sample size n.

Fay and Wu’s H test: This statistic was specifically designed to detect selective sweeps at partially linked loci, as most weight is given to derived alleles with high frequency. Strongly negative H is caused by an excess of high-frequency derived alleles, which is a signature of a locus “hitchhiking” on a nearby sweep locus (Fay and Wu 2000). In this article, we always consider the normalized version of this test (Zeng et al. 2006). We can rewrite its mean value as

$$E_\mu(H|T) = f_H(\theta l) \left\{ -\frac{4}{n(n-1)}\, \mathrm{Var}(d) + \frac{2n}{n-1}\, \frac{1}{l} \sum_{k=2}^{n} t_k \left(1 - \frac{2}{k}\right) \right\}. \tag{19}$$

Like Tajima’s D, H contains the imbalance term with negative sign. However, it has another contribution that weights positively the waiting times close to the leaves, which is opposite to Tajima’s D:

H ≈ − tree imbalance + length of lower branches.

The only waiting time that has null weight is the root waiting time. Therefore, H is strongly negative for (i) large imbalance, and (ii) long branches close to the root. This is precisely the signal expected by hitchhiking in the proximity of strong selective sweeps, i.e., when the sweep locus itself is uncoupled from the locus under consideration by one (or a few) recombination event(s).

Zeng’s E test statistic: This is another test designed to detect selective sweeps. However, it is known to be less powerful than H (Zeng et al. 2006). It is defined by $\hat\theta_L - \hat\theta_W$, where the estimator $\hat\theta_L$ also has a simple interpretation: $E_\mu(\hat\theta_L) = \theta\,\frac{n}{n-1}\,h$ is the height h of the tree divided by the expected height. Unsurprisingly, the test is therefore a comparison of height and length of the tree:

$$E_\mu(E|T) = \frac{f_E(\theta l)}{l} \left[ \frac{n}{n-1}\, h - \frac{l}{a_n} \right] = \frac{f_E(\theta l)}{l} \sum_{k=2}^{n} \left[ \frac{n}{n-1} - \frac{k}{a_n} \right] t_k, \tag{20}$$

E ≈ + tree height − tree length,

which can be rephrased as

E ≈ + length of upper branches − length of lower branches.

Like Fay and Wu’s H, the E test is focused on high-frequency alleles. However, it uses no topological information, but depends only on waiting times. This explains its lower power for classical signatures of hitchhiking compared to other tests.

Furthermore, since E compares upper and lower branches, it can actually be naturally interpreted as a test for star-likeness of a tree. In star-like trees, the length is maximal with respect to the height (l = nh), corresponding to strongly negative values of E.

Fu and Li’s $D_{FL}$: Finally, we will discuss a common test not included in Equation 14. Fu and Li’s $D_{FL}$ is one of several tests based on singletons. Its mean is $E_\mu(D_{FL}|T) \propto l - \sum_{k=2}^{n} k t_k\, P(d_k = 1|T)$, hence this test should measure the relative contribution of external branches to total tree length:

$$D_{FL} \approx +\ \text{length of internal branches} - \text{length of external branches}, \tag{21}$$

i.e., negative values of this test should signal extremely unbalanced (caterpillar) trees or star-like trees. However, despite its intuitive interpretation, negative values of Fu and Li’s $D_{FL}$ can be misleading if interpreted in terms of tree shapes. The reason for this is that these values of the test can be a result of purifying selection, i.e., of nonneutral mutations that decrease fitness and therefore can only reach low frequencies before disappearing from the population. These mutations appear mostly as singletons concentrated on the lower branches. This scenario violates the assumption of mutational homogeneity along the tree and therefore the interpretation of Equation 21 is not valid anymore.

A new neutrality test for positive selection

The family of tests described by Equation 14 includes Tajima’s D, Fay and Wu’s H, and Zeng’s E among many others. These three tests are built as differences of four different estimators $\hat\theta_W$, $\hat\theta_\pi$, $\hat\theta_L$, and $\hat\theta_H$. However, they do not exhaust all combinations of these estimators. There is another combination that has not been studied previously and will be detailed in this section (since $\hat\theta_\pi = 2\hat\theta_L - \hat\theta_H$, the other two combinations $\hat\theta_L - \hat\theta_H$ and $\hat\theta_\pi - \hat\theta_L$ are equivalent to Fay and Wu’s H).

The new test is the difference between the Watterson estimator $\hat\theta_W$ and the Fay and Wu’s estimator $\hat\theta_H$. We denote this test by L:

$$L = \frac{\hat\theta_W - \hat\theta_H}{\sqrt{\mathrm{Var}\left(\hat\theta_W - \hat\theta_H\right)}}. \tag{22}$$

The test compares the amount of high-frequency polymorphisms with the total number of polymorphisms. The L test belongs to the family described by Equation 14, with weights $\alpha = -2/n(n-1)$, $\beta = 0$, and $\gamma = 1/a_n$. Its precise definition is

$$L = \frac{\sum_{i=1}^{n-1} \left[ \frac{1}{a_n} - \frac{2i^2}{n(n-1)} \right] \xi_i}{\sqrt{\lambda_n^L S + \kappa_n^L S(S-1)}}, \tag{23}$$

where the coefficients $\lambda_n^L$ and $\kappa_n^L$ are given in Table 4. Their derivation can be found in Supplemental Material, File S1.

The interpretation of the test can be read from Equation 17:

$$E_\mu(L|T) = f_L(\theta l) \left\{ -\frac{2}{n(n-1)}\, \mathrm{Var}(d) + \frac{1}{l} \sum_{k=2}^{n} t_k \left[ \frac{k}{a_n} - \frac{2n}{n-1}\, \frac{1}{k} \right] \right\}. \tag{24}$$

Its qualitative interpretation is different from all previous tests. It is the sum of an imbalance term with negative sign, plus negative weight to the ancient waiting times, and positive weight to the recent ones:

L ≈ − tree imbalance − length of upper branches + length of lower branches.

This interpretation and the presence of the Fay and Wu’s estimator $\hat\theta_H$ in the test suggest that this test could be most powerful in selective scenarios.

In fact, simulations of the statistical power of the test in Figure 2 show that the left tail of L has a power similar to the normalized Fay and Wu’s H test for hitchhiking (but slightly lower for most parameters). On the other hand, the right tail of L has a power similar to the left tail of Zeng’s E, performing well immediately after fixation and outperforming most other tests for an intermediate range of times after fixation. The test seems therefore to retain some of the advantages of Fay and Wu’s H, while being able to detect different selective signals as well.

Figure 2 Statistical power of L and other neutrality tests to detect hitchhiking against the standard neutral model. Solid line, left tail; dashed line, right tail. Coalescent simulations performed with mstatspop (Ramos-Onsins) for a sample of size n = 100 in a population of size $N_e = 10^6$, for sequences of length $10^5$ bp and $\theta = 10^{-3}$ per bp, located 1 Mbp away from a selected site with selection coefficient $4N_e s = 10^3$. (A) Recombination rate $4N_e r = 10$ with respect to the selected site, (B) recombination rate $4N_e r = 100$ with respect to the selected site, (C) immediately after fixation of the selected allele, (D) 0.4 coalescent times after fixation of the selected allele.

Extreme trees and extreme mean values of neutrality tests

Our precise interpretation of the expected values of neutrality tests in terms of tree shape and waiting times allows us to find both the extreme expected values of the tests and the corresponding “extreme” trees.

Figure 3 Maximum and minimum values of neutrality tests as a function of n for S = 10, 100. The minimum of Fay and Wu’s H is not shown since it decreases from about −10 to −30 in the range of sample sizes of the plot.

In this section, we will compute the maximum and minimum value of $E_\mu(T_\Omega|T, S)$, i.e., the maximum and minimum expected values of the test across all trees T, for a given number of mutations S, and a given sample size n. For large S, these

values depend only on the sample size. The extreme values are presented in Figure 3 as a function of n and for different values of S.

The expected value for all tests described by Equation 14 is a linear combination of imbalances $\mathrm{Var}(d_k)$ with coefficients of the same sign:

$$E_\mu(T_\Omega|T,S) = \frac{S}{N_\Omega(S)}\left[\sum_{k=2}^{n}\frac{k t_k}{l}\left(\alpha\,\mathrm{Var}(d_k) + \alpha\frac{n^2}{k^2} + \beta\frac{n}{k} + \gamma\right)\right]. \tag{25}$$

For this reason, maximum and minimum values correspond to maximally balanced or unbalanced topologies. Hence, to obtain these values, it is sufficient to replace $\mathrm{Var}(d_k)$ by its maximum or minimum, then maximize/minimize the result over the waiting times $t_k/l$ (see File S1). The maximum imbalance is given by Equation 1, while the minimum imbalance will be approximated by $\min_T \mathrm{Var}(d_k) \simeq 0$. In the following we will also use the related approximation $\lfloor n/k \rfloor \simeq n/k$. Both approximations are correct up to $O(1/n)$ for large trees.

Tajima's D: Its maximum corresponds to a tree with maximally balanced topology and length concentrated in the upmost branches (k = 2), while its minimum corresponds to all maximally unbalanced trees with length concentrated in the upmost and lowest branches (k = 2, n). The corresponding values are

$$\max_T E_\mu(D|T,S) = \frac{\left[\frac{n}{2(n-1)} - \frac{1}{a_n}\right] S}{\sqrt{\lambda_n^D S + \kappa_n^D S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{n}{2(n-1)} - \frac{1}{a_n}}{\sqrt{\kappa_n^D}} \xrightarrow[n\to\infty]{} \frac{3}{2\sqrt{2}}\log(n) \tag{26}$$

and

$$\min_T E_\mu(D|T,S) = \frac{\left[\frac{2}{n} - \frac{1}{a_n}\right] S}{\sqrt{\lambda_n^D S + \kappa_n^D S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{2}{n} - \frac{1}{a_n}}{\sqrt{\kappa_n^D}} \xrightarrow[n\to\infty]{} -\frac{3}{\sqrt{2}} \simeq -2.1, \tag{27}$$

where the first arrow in each equation represents the limit of a large number of segregating sites; and the second, the asymptotic behavior for a large sample size. The maximum and minimum values of $E_\mu(D|T,S)$ are also the absolute maximum and minimum values of D over all possible spectra.

Fay and Wu's H: Its maximum corresponds to a tree with maximally balanced topology and length concentrated (surprisingly) in branches at k = 4, while its minimum corresponds to a maximally unbalanced tree with length concentrated in the upmost branches (k = 2). The corresponding values are

$$\max_T E_\mu(H|T,S) = \frac{\frac{n}{4(n-1)}\, S}{\sqrt{\lambda_n^H S + \kappa_n^H S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{n}{4(n-1)}}{\sqrt{\kappa_n^H}} \xrightarrow[n\to\infty]{} \frac{\log(n)}{4\sqrt{\pi^2 - 88/9}} \tag{28}$$

and

$$\min_T E_\mu(H|T,S) = -\frac{\frac{(n-2)^2}{n(n-1)}\, S}{\sqrt{\lambda_n^H S + \kappa_n^H S(S-1)}} \xrightarrow[S\to\infty]{} -\frac{\frac{(n-2)^2}{n(n-1)}}{\sqrt{\kappa_n^H}} \xrightarrow[n\to\infty]{} -\frac{\log(n)}{\sqrt{\pi^2 - 88/9}}. \tag{29}$$

Zeng's E: Its maximum corresponds to a tree with length concentrated in the upper branches (k = 2), while its minimum corresponds to star-like trees (i.e., length concentrated in the lowest branches, k = n). The corresponding values are

$$\max_T E_\mu(E|T,S) = \frac{\left[\frac{n}{2(n-1)} - \frac{1}{a_n}\right] S}{\sqrt{\lambda_n^E S + \kappa_n^E S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{n}{2(n-1)} - \frac{1}{a_n}}{\sqrt{\kappa_n^E}} \xrightarrow[n\to\infty]{} \frac{1}{2}\sqrt{\frac{3}{\pi^2 - 9}}\,\log(n) \tag{30}$$

and

$$\min_T E_\mu(E|T,S) = \frac{\left[\frac{1}{n-1} - \frac{1}{a_n}\right] S}{\sqrt{\lambda_n^E S + \kappa_n^E S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{1}{n-1} - \frac{1}{a_n}}{\sqrt{\kappa_n^E}} \xrightarrow[n\to\infty]{} -\sqrt{\frac{3}{\pi^2 - 9}} \simeq -1.9. \tag{31}$$

The minimum value of $E_\mu(E|T,S)$ is also the absolute minimum value of E.

L test: Its maximum corresponds to a star-like tree with length concentrated in the lowest branches (k = n), while its minimum corresponds to a maximally unbalanced tree with length concentrated in the upmost branches (k = 2). The corresponding values are

$$\max_T E_\mu(L|T,S) = \frac{\left[\frac{1}{a_n} - \frac{2}{n(n-1)}\right] S}{\sqrt{\lambda_n^L S + \kappa_n^L S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{1}{a_n} - \frac{2}{n(n-1)}}{\sqrt{\kappa_n^L}} \xrightarrow[n\to\infty]{} \frac{3}{2\sqrt{6\pi^2 - 29}} \simeq 0.3 \tag{32}$$

and

$$\min_T E_\mu(L|T,S) = \frac{\left[\frac{1}{a_n} - \frac{(n-1)^2 + 1}{n(n-1)}\right] S}{\sqrt{\lambda_n^L S + \kappa_n^L S(S-1)}} \xrightarrow[S\to\infty]{} \frac{\frac{1}{a_n} - \frac{(n-1)^2 + 1}{n(n-1)}}{\sqrt{\kappa_n^L}} \xrightarrow[n\to\infty]{} -\frac{3}{2\sqrt{6\pi^2 - 29}}\,\log(n). \tag{33}$$

The maximum value of $E_\mu(L|T,S)$ is also the absolute maximum value of L.

The dependence of the extreme values on n and S is shown in Figure 3 for all the tests discussed above.

These results are useful to interpret the actual strength of the signal given by the tests. The normalization of neutrality tests suggests that values between −1 and 1 fall into the normal range for realizations of the neutral model without recombination. However, there is no indication of which values could be deemed "large" in absolute terms. The extreme values computed above fill this gap, since they give a natural reference in terms of extreme trees. These values can be used to see if a tree is close to one of the extreme trees for the test used, and to understand how large a signal of nonneutrality could be in theory.

As an example, consider the regions of the human genome shown in Figure S1 in File S1, based on data from 1000 Genomes Project Consortium et al. (2015). The strong signals
of selection in Central Europeans detected by Fay and Wu's H appear much less extreme when compared with the theoretical minimum, which is so low that it does not appear in the plot. On the other hand, the deviations from neutrality shown by Tajima's D around 136.4 Mbp of chromosome 2 and 29.4 and 30.6 Mbp of chromosome 6 in Central Europeans do not look impressive, unless we notice that they are pretty close to the minimum possible value for the test. As another example, the deviations from neutrality of L and Fay and Wu's H around 31.3 Mbp on chromosome 6 in Yoruba look similar, but the minimum of L is much closer.

The results of this section could also be used to renormalize neutrality tests in the spirit of Schaeffer (2002) (see File S1). However, our results could not actually show any improvement with respect to the usual normalization.

Discussion

The ancestry of the sequences in a sample from a single locus, or an asexual population, is described by a single genealogical tree. The same is not true for multi-locus analyses of sexual species: recombination generates different trees along the genome. Inferring these trees is possible only if there are enough mutations per branch. However, in most sexual and asexual populations, lower branches are typically short compared to the inverse mutation rate. Moreover, in many eukaryotic genomes, the mutation and recombination rates are of the same order of magnitude, which means that there are just a few segregating sites in each nonrecombining fragment of the genome. The paucity of mutations, caused by the interplay of genetic relatedness within a population (hence short branches) and recombination, does not allow a full reconstruction of the trees. Therefore, summary statistics are often used for population genetics analysis. These statistics are also computationally useful, since any given configuration of mutations has low probability and it is therefore hard to apply inference methods on the configuration itself. Moreover, they are more robust to details of the model like mutation and recombination rates.

Summary statistics are often more directly related to the mutation pattern of the sequences rather than to their genealogy. In this work, we clarified the precise correspondence between some SFS summary statistics and some features of the genealogical trees.

It is well known that the frequency spectrum is sensitive to tree topology and branch lengths. Interestingly, several estimators and neutrality tests built on the SFS, such as Watterson's $\hat\theta_W$, Tajima's D, and Fay and Wu's H, show quite a simple dependence on tree imbalance and waiting times. A new measure of tree imbalance, the variance in the number of descendants of a mutation at a given level, plays an important role in the interpretation of these neutrality tests. The simplicity of these results stems from the simple weights of these estimators and tests: the SFS is multiplied by functions of the frequency that are constant (Watterson), linear (Zeng), or quadratic polynomials (Fay and Wu and Tajima).

The interpretation of common estimators and tests is summarized in Table 1. This interpretation is rigorous and consistent with intuition. Our results help to understand the peculiarities of the different tests. For example, we reinterpret Zeng's E as a test for star-likeness, and understand its reduced power to detect selection compared to Fay and Wu's H as a consequence of its insensitivity to tree imbalance and of the compressed distributions of its negative values.

The imbalance measure $\mathrm{Var}(d)$ is also related to other balance statistics proposed recently, namely the root balance $\omega_1$ and the standardized sum $\omega_1 + \omega_2 + \omega_3$ (Li and Wiehe 2013), which can also be inferred quite reliably from sequence data. In contrast, balance statistics such as Colless' index (Colless 1982), which considers the average balance of the tree across all internal nodes, are less suited for population genetic applications, since balance at lower nodes cannot usually be estimated from sequence data, due to the paucity of polymorphisms which separate closely related sequences. Furthermore, recombination affects mostly the lower part of the tree, hence it introduces additional noise, preventing accurate reconstruction of its topology. Further studies of $\mathrm{Var}(d)$ and similar imbalance measures on phylodynamic trees could provide some interesting summary statistics.

The limitation of the approach presented here lies in the assumption that mutations are mostly neutral and the mutation rate is constant, i.e., mutations should occur randomly on the tree. This assumption fails for the case of purifying selection, when deleterious mutations can be more abundant than neutral ones and tend to accumulate on the lower branches of the tree. In fact, for sequences under purifying selection, the topology of the tree itself depends on the deleterious mutations. Therefore, our approach could not work for tests aimed at detecting rare alleles under purifying selection, like Fu and Li's tests (or extreme negative values of Tajima's D).

Beyond clarifying the interpretation of existing tests, our results open some possibilities for building new neutrality tests to explore different aspects of tree shape. Our new L test is a simple test for selection that shows an interesting behavior, with power similar to Fay and Wu's H in the left tail and to Zeng's E in the right tail, and therefore is able to detect deviations from neutrality in hitchhiking and selective scenarios at different times and different recombination rates. This new L test is in the same class as Tajima's D and the other tests, hence it is sensitive to the variance $\mathrm{Var}(d)$. New tests in the same class are possible, but one could imagine other tests sensitive, e.g., to different combinations of the variances $\mathrm{Var}(d_k)$ or to the skewness or kurtosis of $P(d_k = i|T)$ as well. While the variance $\mathrm{Var}(d)$ is a direct measure of imbalance, and is especially sensitive to the imbalance of the upper branches, other combinations could be sensitive to different tree features.

While our results help to interpret positive and negative values of the tests, they also provide information about the size of these values. Given the normalization of the tests, it is well known that the typical range of values of the standard neutral model is ±1, and confidence intervals can be computed by coalescent simulations, but this says nothing about the size of
deviations from this model. Our results on extreme trees and the corresponding extreme test values give some indication on the range of potential deviations from neutrality.

Finally, our approach can be used to understand the average structure of the genealogical trees generated by models for which the expected SFS is known. Some of our results could also find application in phylogenetic studies of closely related species or populations, where the reconstruction of the phylogenetic tree could be difficult or ambiguous.

Acknowledgments

This work was stimulated by discussions with Michael Blum and Filippo Disanto. We thank an anonymous reviewer for useful comments. A.L. is funded by the United Kingdom National Institute for Health Research, Health Protection Research Unit on Modelling Methodology (grant HPRU-2012-10080). L.F. and G.A. acknowledge support from the grant ANR-12-JSV7-0007 from Agence Nationale de Recherche (France). G.A. acknowledges support from the grant ANR-12-BSV7-0012-04 from Agence Nationale de Recherche (France). T.W. acknowledges support from DFG-SPP1590 by the German Science Foundation. S.E.R.O. is supported by grants CGL2009-09346 (MICINN, Spain), AGL2013-41834-R (MEC, Spain), by the CERCA Programme/Generalitat de Catalunya and acknowledges financial support from the Spanish Ministry of Economy and Competitiveness, through the Severo Ochoa Programme for Centres of Excellence in R&D 2016-2019 (SEV-2015-0533).

Literature Cited

Achaz, G., 2009 Frequency spectrum neutrality tests: one for all and all for one. Genetics 183: 249–258.
Blum, M. G., and O. François, 2005 On statistical tests of phylogenetic tree imbalance: the Sackin and other indices revisited. Math. Biosci. 195: 141–153.
Blum, M. G., and O. François, 2006 Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance. Syst. Biol. 55: 685–691.
Blum, M. G., O. François, and S. Janson, 2006 The mean, variance and limiting distribution of two statistics sensitive to phylogenetic tree balance. Ann. Appl. Probab. 16: 2195–2214.
Bouckaert, R., J. Heled, D. Kühnert, T. Vaughan, C.-H. Wu et al., 2014 BEAST 2: a software platform for Bayesian evolutionary analysis. PLOS Comput. Biol. 10: e1003537.
Colless, D., 1982 Review of Phylogenetics: The Theory and Practice of Phylogenetic Systematics. Syst. Zool. 31: 100–104.
Fay, J., and C.-I. Wu, 2000 Hitchhiking under positive Darwinian selection. Genetics 155: 1405.
Felsenstein, J., 2004 Inferring Phylogenies. Sinauer Associates, Sunderland, MA.
Ferretti, L., M. Perez-Enciso, and S. Ramos-Onsins, 2010 Optimal neutrality tests based on the frequency spectrum. Genetics 186: 353–365.
Ferretti, L., F. Disanto, and T. Wiehe, 2013 The effect of single recombination events on coalescent tree height and shape. PLoS One 8: e60123.
Fu, Y.-X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197.
Fu, Y.-X., and W.-H. Li, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.
1000 Genomes Project Consortium; Auton, A., L. D. Brooks, R. M. Durbin, E. P. Garrison et al., 2015 A global reference for human genetic variation. Nature 526: 68–74.
Griffiths, R., and S. Tavaré, 1998 The age of a mutation in a general coalescent tree. Stoch. Models 14: 273–295.
Hein, J., M. Schierup, and C. Wiuf, 2004 Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, Oxford.
Ho, S. Y., and B. Shapiro, 2011 Skyline-plot methods for estimating demographic history from nucleotide sequences. Mol. Ecol. Resour. 11: 423–434.
Kimura, M., 1985 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
Kingman, J. F., 1982 On the genealogy of large populations. J. Appl. Probab. 19: 27–43.
Lapierre, M., C. Blin, A. Lambert, G. Achaz, and E. P. Rocha, 2016 The impact of selection, gene conversion, and biased sampling on the assessment of microbial demography. Mol. Biol. Evol. 33: 1711–1725.
Li, H., and T. Wiehe, 2013 Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation. PLOS Comput. Biol. 9: e1003060.
Liu, X., and Y.-X. Fu, 2015 Exploring population size changes using SNP frequency spectra. Nat. Genet. 47: 555–559.
Pybus, O. G., A. Rambaut, and P. H. Harvey, 2000 An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics 155: 1429–1437.
Ramos-Onsins, S. E., 2017 Coalescent simulation software. Available at http://bioinformatics.cragenomica.es/numgenomics/people/sebas/software/software.html.
Schaeffer, S. W., 2002 Molecular population genetics of sequence length diversity in the Adh region of Drosophila pseudoobscura. Genet. Res. 80: 163–175.
Sloane, N., and S. Plouffe, 1995 The Encyclopedia of Integer Sequences. Academic Press, San Diego.
Tajima, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
Tajima, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585–595.
Wakeley, J., 2009 Coalescent Theory: An Introduction. Roberts & Company Publishers, Greenwood Village, CO.
Watterson, G., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276.
Yule, G. U., 1925 A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. Lond. B 213: 21–87.
Zeng, K., Y.-X. Fu, S. Shi, and C.-I. Wu, 2006 Statistical tests for detecting positive selection by utilizing high-frequency variants. Genetics 174: 1431–1439.
Zivkovic, D., and T. Wiehe, 2008 Second-order moments of segregating sites under variable population size. Genetics 180: 341–357.

Communicating editor: Y. S. Song
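
As a companion to the section "A new neutrality test for positive selection", the following minimal sketch shows how the four estimators and the numerator of the L test could be computed from an unfolded SFS. It assumes the standard frequency-spectrum weights for $\hat\theta_W$, $\hat\theta_\pi$, $\hat\theta_L$, and $\hat\theta_H$. The function names (`harmonic`, `theta_estimators`, `unnormalized_L`) are hypothetical and not taken from the article or its software. The normalization by $\lambda_n^L$ and $\kappa_n^L$ is omitted because it depends on the coefficients of Table 4.

```python
from fractions import Fraction


def harmonic(n):
    """a_n = 1 + 1/2 + ... + 1/(n-1), the Watterson normalization."""
    return sum(Fraction(1, i) for i in range(1, n))


def theta_estimators(sfs):
    """Return the four estimators from an unfolded SFS.

    sfs[i-1] is xi_i, the number of sites whose derived allele
    appears i times in a sample of n sequences (i = 1..n-1).
    Standard SFS weights are assumed here.
    """
    n = len(sfs) + 1
    pairs = Fraction(n * (n - 1), 2)
    counts = list(enumerate(sfs, start=1))
    return {
        "W": sum(sfs) / harmonic(n),
        "pi": sum(i * (n - i) * x for i, x in counts) / pairs,
        "L": sum(i * x for i, x in counts) / Fraction(n - 1),
        "H": sum(i * i * x for i, x in counts) / pairs,
    }


def unnormalized_L(sfs):
    """Numerator of the L test: theta_W - theta_H (no variance scaling)."""
    est = theta_estimators(sfs)
    return est["W"] - est["H"]
```

With these weights, a spectrum made only of singletons gives a positive numerator, while a spectrum made only of high-frequency derived alleles gives a negative one, in line with the qualitative interpretation of L given in the text.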

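
The bracketed terms in the extreme values of Equations 26 to 33 can be checked numerically with a short sketch built on the structure of Equation 25. Several things here are assumptions rather than material from the article: the triples (α, β, γ) are derived from the standard SFS weights of each test, written as α i² + β i + γ; the maximal imbalance at level k is taken to be that of a caterpillar-like configuration (k − 1 lineages with one leaf and one lineage with n − k + 1 leaves), which is assumed to coincide with Equation 1; and all function names are hypothetical.

```python
from fractions import Fraction


def a_n(n):
    """Harmonic number a_n = sum_{i=1}^{n-1} 1/i."""
    return sum(Fraction(1, i) for i in range(1, n))


def coefficients(test, n):
    """(alpha, beta, gamma) with SFS weight alpha*i^2 + beta*i + gamma.

    Assumed from the standard estimator weights, not copied from Eq. 14.
    """
    q = Fraction(2, n * (n - 1))
    inv_a = 1 / a_n(n)
    return {
        "D": (-q, q * n, -inv_a),                        # theta_pi - theta_W
        "H": (-2 * q, q * n, Fraction(0)),               # theta_pi - theta_H
        "E": (Fraction(0), Fraction(1, n - 1), -inv_a),  # theta_L - theta_W
        "L": (-q, Fraction(0), inv_a),                   # theta_W - theta_H
    }[test]


def caterpillar_var(n, k):
    """Var(d_k) when k-1 lineages carry 1 leaf and one carries n-k+1."""
    second_moment = Fraction((k - 1) + (n - k + 1) ** 2, k)
    return second_moment - Fraction(n, k) ** 2


def level_term(test, n, k, var_dk):
    """Bracket of Eq. 25 when all tree length sits at level k (k*t_k/l = 1)."""
    alpha, beta, gamma = coefficients(test, n)
    mean = Fraction(n, k)
    return alpha * (var_dk + mean ** 2) + beta * mean + gamma
```

For example, `level_term("D", n, 2, 0)` gives n/(2(n − 1)) − 1/a_n, the bracket in Equation 26, and `level_term("L", n, 2, caterpillar_var(n, 2))` gives the bracket in Equation 33. Exact rational arithmetic is used so that such identities can be compared without rounding error.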