
Species Tree Likelihood Computation Given SNP Data Using Ancestral Configurations

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Hang Fan, M.S.

Graduate Program in Statistics

The Ohio State University

2013

Dissertation Committee:

Professor Laura Kubatko, Advisor

Professor Bryan Carstens

Professor Radu Herbei


Copyright by

Hang Fan

2013


Abstract

Inferring trees from genetic data has long been a challenge in phylogenetics because of the computational intensity involved. Working in the coalescent framework, this dissertation proposes a new method for estimating the likelihood of a species tree directly from Single Nucleotide Polymorphism (SNP) data under a given nucleotide substitution model. The method uses the idea of Ancestral Configurations (Wu, 2011) to avoid the computational burden of enumerating coalescent histories. Importance sampling is used in Monte Carlo integration to approximate the expectations in the computation, and the accuracy of the approximation is tested under different tree models. The SNP data are processed beforehand, which vastly improves the efficiency of the method. Gene tree sampling given the species tree under the coalescent model is employed to make the computation feasible for large trees. Further, the branch lengths on the species tree are optimized according to the computed species tree likelihood, which provides the likelihood of the species tree topology given the SNP data.

For inference, this likelihood computation method is implemented in the stepwise addition algorithm to infer the maximum likelihood species tree in the tree space given the SNP data, and simulations are conducted to test the performance. We also apply this method to the problem of species delimitation, for the purpose of validating proposed species delimitations given the SNP data, and we run simulations to check the validation outcomes under different scenarios, such as in the presence of subsampling in the SNP data.

Dedication

This document is dedicated to my parents.


Acknowledgments

Looking back in time, just as in the coalescent theory used in my dissertation, I have enjoyed my PhD study in statistics so much. Moreover, I have gained the best experience of my life, and it will continue to benefit my future career and life. I am writing these acknowledgments at the last moment before submission, because they mark the end of my PhD odyssey, and it is certainly hard to say goodbye to the many people I am grateful to and to my time as a PhD student.

Life is full of decisions. There are not many people or opportunities, however, that can make life different. My dear advisor Professor Laura Kubatko is certainly one of my life-changing angels. Thanks to her, I made my transition from biology to statistics, the field I have a passion for. She guided me to discover the joy of research, and also showed me how to do things in a professional way. In personal life, she is also a great friend, and I am still amazed by how she built a balance between career and life. PhD study is a tough and challenging exploration. Hence, it is extremely important to me that Laura has always been supportive, encouraging, patient and understanding. Words cannot say enough about my gratitude to her. I feel so blessed to have Laura as my advisor, and she is no doubt my role model.

I want to thank Professor Radu Herbei, who is my favorite teacher and also sits on my committee. I took ALL the statistics classes he taught, including the probability theory series, stochastic processes, large sample theory, stochastic differential equations, etc. I truly love learning from him. He always explains theories intuitively, and he made the beauty of math and statistics visible and fascinating.

Professor Bryan Carstens, another committee member, also helped me a lot in my research. I highly appreciated Bryan’s flexibility and kindness in discussing research with me. With his help, I gained a much better understanding of species delimitation. Bryan also gave me many valuable suggestions from the perspective of an empirical biologist.

I have so many wonderful memories of the Department of Statistics at OSU. Professor Mark Berliner’s class simply sparked my interest in statistics. I enjoyed the classes taught by Professor Peter Craigmile, Professor Doug Wolfe, Professor Tom Santner, Professor Angela Dean and Professor Dennis Pearl. I also want to thank our kind staff for offering me so much help.

Outside school, I would like to thank my awesome friends for bringing sunshine and joy to my life. Special thanks to my friends, Dr. Chia-Hua Lin and Dr. Agus Munoz-Garcia. Their friendship accompanied me through the ups and downs of my PhD years.

At last, I want to thank my parents, Xiaohong Zhang and Yamin Fan, for EVERYTHING. In my last year of PhD, when I was under great pressure from research and job hunting, they once told me, “no matter what happens, there are always two people on the other side of the world, thinking of you.” My parents are respectful and supportive of all my decisions, and their love drives me to pursue my dreams. Now the dream has come true.

In the end, again, I want to thank the people here who have helped me. Thanks to you, I have become a Doctor of Philosophy and a person I’m proud of.

Vita

2003 ...... Tanglai High School

2007 ...... B.S. Biological Sciences, Tsinghua University

2010 ...... M.S., The Ohio State University

2010 to present ...... Graduate Teaching Associate, Department of Statistics, The Ohio State University

Publications

Helen Hang Fan and Laura S. Kubatko. 2011. Estimating species trees using approximate Bayesian computation. Molecular Phylogenetics and Evolution 59: 354-363.

Laura S. Kubatko and Helen Hang Fan. 2012. Reply to “Letter to the Editor on the article entitled ‘Estimating species trees using Approximate Bayesian Computation’ (Fan and Kubatko, Mol. Phylogenetics Evol. 59, 354-363)”. Molecular Phylogenetics and Evolution 66(1): 438-439.

Zexuan Li, Yishu Huang, Jing Ge, Hang Fan, Xiaohong Zhou, Shentao Li, Mark Bartlam, Honghai Wang, and Zihe Rao. 2007. The crystal structure of MCAT from Mycobacterium tuberculosis reveals three new catalytic models. Journal of Molecular Biology 371(4): 1075-1083.

Fields of Study

Major Field: Statistics


Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Table of Contents

List of Tables

List of Figures

Chapter 1: Background and Literature Review

1.1 A definition of species tree and gene tree

1.2 Models for SNP evolution along gene trees

1.2.1 Nucleotide substitution models

1.2.2 Computation of the likelihood of a gene tree given a species tree and SNP data

1.3 Models for generating gene trees from a species tree

1.4 The overall model for the evolution of SNP data given a species tree

1.5 A survey of methods used for inferring species trees from genetic data

1.5.1 Sequence-based methods

1.5.2 Summary statistic methods

1.6 Overview of this dissertation

Chapter 2: Likelihood Computation Using Ancestral Configurations

2.1 Definition of AC

2.2 Advantages of using AC

2.3 A method for the likelihood computation of a species tree given SNP data using ACs

2.4 SNP data processing and gene tree sampling to improve computational efficiency

2.5 Accuracy of likelihood estimation

2.6 Optimization of branch lengths on the species tree

Chapter 3: Species Tree Inference

3.1 Tree searching strategies

3.1.1 Exact methods

3.1.2 Heuristic methods

3.1.3 Stochastic methods

3.2 A likelihood-based tree searching approach using stepwise addition

3.2.1 Algorithm

3.2.2 Simulation

Chapter 4: Species Delimitation

4.1 Current validation methods in species delimitation

4.2 A likelihood method for species delimitation using SNP data

4.2.1 Algorithm

4.2.2 Simulation

4.2.2.1 Simulation design

4.2.2.2 Simulation results

Chapter 5: Discussion

References


List of Tables

Table 1. Comparison of the numbers of ACs and histories as the number of taxa (N) goes up (Wu, 2011).

Table 2. Branch length settings for the simulations in stepwise addition.

Table 3. Frequencies of the estimated true species trees and the averaged RF distances of the four simulation settings in stepwise addition.

Table 4. Percentages of the true species trees found in three additional sets of simulations with different initial trees.

Table 5. Results for Simulation I.

Table 6. Results for Simulation II with subsampling.

Table 7. Results for Simulation III with subsampling and long branches.

List of Figures

Figure 1. An example of a 3-taxon phylogenetic tree ((A, (B, C))).

Figure 2. A gene tree ((B, C), A) is contained in a species tree ((B, C), A).

Figure 3. A “unit” of tree for the pruning algorithm of Felsenstein (1981).

Figure 4. Discordance between the species tree (A, (B, C)) and the gene tree ((A, B), C) caused by deep coalescence.

Figure 5. ACs and OLs in an example 4-taxon species tree.

Figure 6. An interval composed of an AC and an OL in a general interval on the species tree.

Figure 7. Boxplots of the estimates of the negative log likelihood of the species tree for different numbers of samples (W) in importance sampling.

Figure 8. Plots of log likelihood estimates and one interval length.

Figure 9. A Branch and Bound example for a 4-taxon species tree.

Figure 10. Two 5-taxon species tree models, asymmetric and symmetric, in stepwise addition simulations.

Figure 11. Distributions of the RF distance between the ML species tree estimates and the true species tree under the four scenarios in stepwise addition simulation.

Figure 12. Two species delimitations and their associated species trees used in species delimitation simulations I, II, and III.

Chapter 1: Background and Literature Review

This dissertation develops a method to compute the likelihood of a species tree given SNP data under the coalescent model. The computed likelihood can then be used to infer the maximum likelihood species tree and can also be applied to species delimitation given SNP data. In Chapter 1 we describe the problem of species tree estimation given SNP data by reviewing the background and current methods. First, the concepts of a species tree and a gene tree are defined in Section 1.1. We examine probabilistic nucleotide substitution models for SNP evolution along the gene tree in Section 1.2. In Section 1.3, we study and model the incongruence between gene trees and species trees caused by deep coalescence. The overall model for SNP evolution given the species tree is presented in Section 1.4. Finally, we survey current methods of species tree estimation given genetic data in Section 1.5.

1.1 A definition of species tree and gene tree

As a unit used in biological classification, a taxon (plural: taxa) is a group of one or more populations of organisms. For example, a taxon might be a species or a population. A fundamental question of interest is how the relationships among a group of taxa may best be described. A branching diagram, called a phylogenetic tree, has been used to describe such relationships among taxa. A tip node of a phylogenetic tree represents an extant taxon, while an internal node is assumed to be the most recent common ancestor (MRCA) of all its descendant nodes. There are different types of phylogenetic trees, and here we use a bifurcating rooted tree, in which the root node is assumed to be the MRCA of all the taxa at the tips of the tree and each internal node has exactly two descendant nodes.

Figure 1. An example of a 3-taxon phylogenetic tree ((A, (B, C))).

Figure 1 shows a rooted bifurcating phylogenetic tree for taxa A, B, and C, which we denote in the Newick format, (A, (B, C)) (Felsenstein, J. The Newick tree format.). The tree is directed from past (top) to present time (bottom). There are three tip nodes, representing three extant taxa, A, B, and C. The internal node is the MRCA of taxa B and C, and the root node is the MRCA of all three taxa. The tree branches connect the nodes, and the branch lengths are proportional to the amount of evolutionary time. For example, the length of the branch between tip node A and the root node is measured in units of time. The topology and branch lengths fully determine a phylogenetic tree.
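To make the tree notation concrete, the following is a minimal sketch of how a rooted bifurcating topology could be stored and printed in Newick format. The nested-tuple representation and the function names `to_newick` and `tips` are assumptions made for illustration; they are not part of the dissertation's software, and branch lengths are left out for brevity.

```python
# Hypothetical representation: a tip is a string, an internal node is a
# pair (left, right) of subtrees. Branch lengths are omitted for brevity.

def to_newick(node):
    """Return the Newick string (without the trailing ';') for a subtree."""
    if isinstance(node, str):
        return node
    left, right = node
    return "(" + to_newick(left) + ", " + to_newick(right) + ")"


def tips(node):
    """List the tip labels below a node, from left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return tips(left) + tips(right)


tree = ("A", ("B", "C"))      # the 3-taxon tree of Figure 1
print(to_newick(tree))        # Newick string of the whole tree
print(tips(tree[1]))          # taxa descending from the MRCA of B and C
```

The internal node `tree[1]` plays the role of the MRCA of B and C, and `tree` itself is the root.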

If taxa are species, a phylogenetic species tree represents the true evolutionary relationship among the species, which is typically unobserved and is generally the object of interest. An internal node of a species tree denotes a speciation event, and a branch length is measured by the time between two speciation events.

A common goal in evolutionary biology is to estimate the species tree given some data about the sampled taxa. One common type of data is DNA sequence data.

Deoxyribonucleic acid (DNA) carries the genetic information in almost all living organisms. Researchers collect DNA samples from different species in order to study their evolutionary relationships. DNA is a double-stranded helix composed of chains of nucleotides. There are four different types of nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). For example, a sequence of DNA of length 10 might be “AGGCCTTTAC”. For all species, except for some viruses, a genome refers to the entire sequence of DNA. The DNA is organized into chromosomes. Over time, mutations in the DNA sequences occur.

In DNA replication, a DNA sequence gives copies to the descendants, making the genetic history a branching tree (Maddison, 1997). A genetic locus (plural: loci) is the location of a gene or DNA sequence on a chromosome. If some DNA sequences located on the same genetic locus are collected from a group of taxa, we can also use a phylogenetic tree to represent the evolutionary history, called the gene tree. Each genetic locus generates its own gene tree. A gene tree is also assumed to be a rooted bifurcating tree. The branches of a gene tree are called gene lineages. Considering all the genes in the genome for a species, the species tree is actually a bundle of gene histories or lineages, that is, the gene trees are contained in the species tree (Maddison, 1997).

Figure 2. A gene tree ((B, C), A) is contained in a species tree ((B, C), A). The speciation times $\tau_i$ give the lengths of the intervals between speciation events, which are denoted by dotted lines. Coalescent events are represented by $a$ and $b$, in intervals 5 and 4, respectively. $t_a$ and $t_b$ are the coalescent times associated with these coalescent events.

Figure 2 is an example of a gene tree contained in a species tree for 3 taxa, A, B, and C. The gene tree and species tree share the same topology (A, (B, C)). The five branches on the species tree are called intervals (marked by circled numbers), which are separated by the dashed lines marking the speciation events. Lengths of intervals will be called speciation times; for example, $\tau_4$ is the speciation time for interval 4. Intervals 1, 2 and 3 are tip intervals, representing the times during which the extant taxa A, B, and C existed as distinct species. Intervals 4 and 5 are internal intervals. Interval 5 is the root interval, where all the gene lineages join into one lineage, called the root lineage. All the intervals indicate populations living in the time intervals, present (tip intervals) or past (internal intervals), whose lengths are measured by the speciation times.

We assume that no gene flow has occurred between distinct species. In Figure 2, the gene tree ((B, C), A) has to satisfy the constraint imposed by the species tree. Viewing time backwards, when gene lineage 2 from taxon B and gene lineage 3 from taxon C coalesce at node $b$, the coalescent event (node $b$) has to be above the dashed line, that is, earlier than the speciation event of B and C. Similarly, node $a$ is the other coalescent event, which joins lineages 1 and 4, and it must occur in the root interval 5 under this assumption. $t_b$ and $t_a$ are the coalescent times at which the two coalescent events occur, which are defined relative to their respective intervals. Thus $t_b$ has to be in the range $[0, \tau_4]$, while $t_a$ is in the range $[0, \infty)$. Let $t$ be the number of generations and $\mu$ be the mutation rate per site per generation (assumed constant throughout time). Both speciation times and coalescent times will be measured here by $\mu t$, the expected number of mutations per site for $t$ generations.

1.2 Models for SNP evolution along gene trees

With DNA sequences collected from different taxa, we assume that they share the same ancestor. The sequences are aligned, and there are many methods to computationally align the sequences (Lipman et al., 1989; Chenna et al., 2003; Lipman and Pearson, 1985; Notredame et al., 2000). In this dissertation, we assume that each site of the aligned DNA sequences has a conditionally independent evolutionary history (gene tree) given the species tree; in practice, single-nucleotide polymorphisms (SNPs) best fit this assumption. A SNP is a genetic variation that occurs at a single nucleotide site among sampled individuals, caused by DNA mutations such as point mutation, insertion or deletion. For each taxon, we sample and align SNPs with an equal number of sites from one or more individuals. If we sample five SNPs from four individuals, the data form an array with four rows (one per individual) and five columns (one per SNP site).
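As an illustration of this data layout, the sketch below stores a small aligned SNP data set and extracts the per-site columns, since each site is later treated as having its own gene tree. The sample labels and nucleotides are invented for illustration only, and the helper name `site_column` is hypothetical.

```python
# Hypothetical data: four sampled individuals (rows) and five aligned SNP
# sites (columns). Labels and nucleotides are invented for illustration.
snps = {
    "A1": "AGCTA",
    "B1": "AGCTG",
    "B2": "ATTTG",
    "C1": "GTTCG",
}


def site_column(data, site):
    """Return the nucleotides observed at one site, in sample order."""
    return [seq[site] for seq in data.values()]


n_sites = len(next(iter(snps.values())))
# Every individual must contribute the same number of aligned sites.
assert all(len(seq) == n_sites for seq in snps.values())

columns = [site_column(snps, s) for s in range(n_sites)]
for s, column in enumerate(columns):
    print(s, column)
```

Each printed column is one SNP: a site at which at least two of the sampled individuals differ.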

1.2.1 Nucleotide substitution models

In order to estimate the species tree, we need a model for the evolution of SNPs over time. Here we focus on probabilistic nucleotide substitution models.

The mutation process along the lineages of the gene tree is modeled by a finite-state continuous time Markov process, represented by $\{X(t), t \ge 0\}$, where $X(t)$ is the state of the process at time $t$. For DNA data, $X(t) \in \{A, G, C, T\}$ for all $t$. Let $p_{ij}(t)$ be the probability of the process changing from state $i$ to state $j$ during a time interval $t$. We can store all the transition probabilities $(p_{ij}(t))$ in a 4×4 transition matrix $P(t)$.

This model assumes that the Markov property is satisfied: $P\{X(s+t) = j \mid X(s) = i, \mathcal{F}_s\} = P\{X(s+t) = j \mid X(s) = i\}$, where $\mathcal{F}_s$ contains all the information up to time $s$. Then, $p_{ij}(t) = P\{X(s+t) = j \mid X(s) = i\}$, which indicates that the probability of the future state of the process at time $s+t$ only depends on the current state at time $s$, and not the previous history, including how the process has reached the current state $i$. If all the states can communicate (irreducible, $i \leftrightarrow j$, which means there is a positive probability path between state $i$ and state $j$), there exists a limiting distribution $\pi_j = \lim_{t \to \infty} p_{ij}(t)$, with $\sum_j \pi_j = 1$, for all the states $i, j \in \{A, G, C, T\}$.

If $\pi_i p_{ij}(t) = \pi_j p_{ji}(t)$, the Markov process is reversible. If the Markov process is stationary, $\sum_i \pi_i p_{ij}(t) = \pi_j$. We can use the Markov property to get the Chapman-Kolmogorov equation: $p_{ij}(s+t) = \sum_k p_{ik}(s)\, p_{kj}(t)$, or $P(s+t) = P(s)P(t)$, in matrix notation. Since $P(t)$ is continuous and differentiable, we take the derivative of all the entries in $P(t)$ with respect to $t$, so we have $P'(t) = P(t)Q$, where $Q$ is the instantaneous rate matrix. Given the initial condition $P(0) = I$, where $I$ is the identity matrix, the solution to this differential equation is $P(t) = e^{Qt}$. In reverse, we can compute the instantaneous rate matrix by $Q = P'(0)$. All the off-diagonal entries of $Q$ are rates of substitutions per unit time. We can obtain the limiting distribution by solving $\pi Q = 0$.

$$Q = \begin{pmatrix} -q_{A\cdot} & q_{AG} & q_{AC} & q_{AT} \\ q_{GA} & -q_{G\cdot} & q_{GC} & q_{GT} \\ q_{CA} & q_{CG} & -q_{C\cdot} & q_{CT} \\ q_{TA} & q_{TG} & q_{TC} & -q_{T\cdot} \end{pmatrix}$$

In the instantaneous rate matrix, with rows and columns ordered A, G, C, T, for example, $q_{AG}$ is the substitution rate from A to G and $q_{A\cdot} = \sum_j q_{Aj}$, $j \in \{G, C, T\}$. Since the row sum must be 0, the (A, A) entry will be $-q_{A\cdot}$. Similarly, the (G, G) entry is $-q_{G\cdot}$, the (C, C) entry is $-q_{C\cdot}$, and the (T, T) entry is $-q_{T\cdot}$. There are 12 parameters (the off-diagonal entries) in the matrix $Q$, so 12 parameters will fully determine the rate matrix.
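The relation between the rate matrix and the transition matrix can be checked numerically. The sketch below approximates the matrix exponential of a rate matrix with a plain Taylor series combined with scaling and squaring, using only the standard library. The rate values are hypothetical, and the function names `mat_mul` and `transition_matrix` are assumptions made for illustration; a production analysis would more likely rely on a tested linear algebra library.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=20):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds a^k / k! after this update
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling
    return result


# Hypothetical 12-parameter rate matrix (order A, G, C, T); rows sum to 0.
Q = [
    [-0.9, 0.5, 0.2, 0.2],
    [0.4, -0.8, 0.1, 0.3],
    [0.3, 0.1, -1.0, 0.6],
    [0.2, 0.2, 0.5, -0.9],
]

P = transition_matrix(Q, 0.5)
for row in P:
    print([round(x, 4) for x in row])
```

Each row of the printed matrix is a probability distribution over the four states, and computing the matrix at times 0.3, 0.7 and 1.0 gives a direct numerical check of the Chapman-Kolmogorov equation.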

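Most commonly used models use fewer than 12 parameters by placing restrictions on some of the values to reflect biologically plausible assumptions. For example, the simplest model is the JC69 model (Jukes and Cantor, 1969), which assumes that the substitution rates among the nucleotides are all the same, and the limiting distribution has equal proportions, where $\pi_A = \pi_G = \pi_C = \pi_T = 1/4$. So in the model there is only one parameter, which is the substitution rate $\lambda$. The instantaneous rate matrix is

$$Q = \begin{pmatrix} -3\lambda & \lambda & \lambda & \lambda \\ \lambda & -3\lambda & \lambda & \lambda \\ \lambda & \lambda & -3\lambda & \lambda \\ \lambda & \lambda & \lambda & -3\lambda \end{pmatrix}$$

The transition probability matrix can be computed from $P(t) = e^{Qt}$, which gives

$$p_{ij}(t) = \begin{cases} \dfrac{1}{4} + \dfrac{3}{4} e^{-4\lambda t}, & i = j \\[2mm] \dfrac{1}{4} - \dfrac{1}{4} e^{-4\lambda t}, & i \neq j \end{cases}$$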
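The standard closed form of the JC69 transition probabilities is simple enough to code directly. The sketch below is a minimal illustration; the symbol `lam` for the single substitution rate and the function names are assumptions of this example.

```python
import math

STATES = "AGCT"


def jc69_prob(i, j, t, lam):
    """JC69 transition probability p_ij(t) with substitution rate lam."""
    decay = math.exp(-4.0 * lam * t)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc69_matrix(t, lam):
    """4x4 JC69 transition matrix with rows and columns ordered A, G, C, T."""
    return [[jc69_prob(i, j, t, lam) for j in STATES] for i in STATES]


for row in jc69_matrix(t=0.1, lam=1.0):
    print([round(x, 4) for x in row])
```

At $t = 0$ the matrix is the identity, and as $t$ grows every entry approaches the limiting value 1/4.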

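If we want to distinguish transitions (mutations between the purines A and G or between the pyrimidines C and T) and transversions (mutations from a purine to a pyrimidine or vice versa), the K80 model (Kimura, 1980) can be used. K80 also assumes equal base frequencies in the limiting distribution, that is, $\pi_A = \pi_G = \pi_C = \pi_T = 1/4$. So K80 has 2 parameters: the substitution rate $\alpha$ for transitions and the substitution rate $\beta$ for transversions. With rows and columns ordered A, G, C, T, the instantaneous rate matrix is as follows,

$$Q = \begin{pmatrix} -(\alpha + 2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha + 2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha + 2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha + 2\beta) \end{pmatrix}$$

The transition probability matrix is

$$p_{ij}(t) = \begin{cases} \dfrac{1}{4} + \dfrac{1}{4} e^{-4\beta t} + \dfrac{1}{2} e^{-2(\alpha + \beta) t}, & i = j \\[2mm] \dfrac{1}{4} + \dfrac{1}{4} e^{-4\beta t} - \dfrac{1}{2} e^{-2(\alpha + \beta) t}, & i \neq j \text{, transition} \\[2mm] \dfrac{1}{4} - \dfrac{1}{4} e^{-4\beta t}, & i \neq j \text{, transversion} \end{cases}$$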
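The three cases of the standard K80 transition probabilities can be sketched as follows. The parameter names `alpha` (transition rate) and `beta` (transversion rate) follow Kimura's usual notation and are assumptions of this example, as is the function name.

```python
import math

STATES = "AGCT"
PURINES = {"A", "G"}


def k80_prob(i, j, t, alpha, beta):
    """K80 transition probability p_ij(t)."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if i == j:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (i in PURINES) == (j in PURINES):
        # both purines or both pyrimidines: a transition
        return 0.25 + 0.25 * e1 - 0.5 * e2
    # one purine and one pyrimidine: a transversion
    return 0.25 - 0.25 * e1


row_a = [k80_prob("A", j, t=0.2, alpha=2.0, beta=0.5) for j in STATES]
print([round(x, 4) for x in row_a])
```

Setting `alpha` equal to `beta` collapses the three cases into the two cases of JC69, which is a convenient sanity check.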

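The F81 model (Felsenstein, 1981) allows different stationary probabilities ($\pi_A, \pi_G, \pi_C, \pi_T$) but keeps the same mutation rate ($\mu$). So F81 has 4 parameters. The instantaneous rate matrix is

$$Q = \begin{pmatrix} -\mu(1 - \pi_A) & \mu\pi_G & \mu\pi_C & \mu\pi_T \\ \mu\pi_A & -\mu(1 - \pi_G) & \mu\pi_C & \mu\pi_T \\ \mu\pi_A & \mu\pi_G & -\mu(1 - \pi_C) & \mu\pi_T \\ \mu\pi_A & \mu\pi_G & \mu\pi_C & -\mu(1 - \pi_T) \end{pmatrix}$$

The transition probability matrix is

$$p_{ij}(t) = \begin{cases} \pi_j + (1 - \pi_j) e^{-\mu t}, & i = j \\[2mm] \pi_j \left(1 - e^{-\mu t}\right), & i \neq j \end{cases}$$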

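If we extend the F81 model by differentiating between transitions and transversions, we have the HKY85 model (Hasegawa, Kishino and Yano, 1985). The HKY85 model assumes different stationary probabilities and different transition and transversion mutation rates. Let $\beta$ be the mutation rate for transversions and $\kappa$ be the ratio of the mutation rate of transitions over that of transversions. There are 5 parameters, and the $Q$ matrix is

$$Q = \begin{pmatrix} -\beta(\kappa\pi_G + \pi_C + \pi_T) & \kappa\beta\pi_G & \beta\pi_C & \beta\pi_T \\ \kappa\beta\pi_A & -\beta(\kappa\pi_A + \pi_C + \pi_T) & \beta\pi_C & \beta\pi_T \\ \beta\pi_A & \beta\pi_G & -\beta(\pi_A + \pi_G + \kappa\pi_T) & \kappa\beta\pi_T \\ \beta\pi_A & \beta\pi_G & \kappa\beta\pi_C & -\beta(\pi_A + \pi_G + \kappa\pi_C) \end{pmatrix}$$

By solving $P(t) = e^{Qt}$, we get the transition probability matrix. Another model, the TN93 model (Tamura and Nei, 1993), distinguishes the two different types of transitions (between A and G, versus between C and T), but uses the same transversion rate. The TN93 model is very similar to the HKY85 model.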

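The most general nucleotide substitution model considered here is the general time-reversible model (Tavare, 1986) (GTR model). Because of the assumed reversibility for the rate matrix, $\pi_i q_{ij} = \pi_j q_{ji}$, instead of considering 12 parameters, the GTR model involves only 9 parameters, including 3 different stationary probabilities and 6 different substitution rates among 4 states. Hence, writing the six rates as $a, b, c, d, e, f$, the instantaneous rate matrix $Q$ is

$$Q = \begin{pmatrix} -(a\pi_G + b\pi_C + c\pi_T) & a\pi_G & b\pi_C & c\pi_T \\ a\pi_A & -(a\pi_A + d\pi_C + e\pi_T) & d\pi_C & e\pi_T \\ b\pi_A & d\pi_G & -(b\pi_A + d\pi_G + f\pi_T) & f\pi_T \\ c\pi_A & e\pi_G & f\pi_C & -(c\pi_A + e\pi_G + f\pi_C) \end{pmatrix}$$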
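The reversibility condition and the stationarity condition $\pi Q = 0$ can both be verified numerically for a rate matrix of this form. The sketch below builds a GTR-style rate matrix from stationary probabilities and six symmetric rates, and obtains HKY85 as a special case. The dictionary layout, the function names, and all numeric values are hypothetical choices made for illustration.

```python
STATES = "AGCT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def gtr_rate_matrix(pi, rates):
    """Build Q with q_ij = rates[{i, j}] * pi[j] for i != j.

    pi maps each state to its stationary probability; rates maps each
    unordered pair of states (a frozenset) to its symmetric rate.
    """
    q = {i: {} for i in STATES}
    for i in STATES:
        for j in STATES:
            if i != j:
                q[i][j] = rates[frozenset((i, j))] * pi[j]
        # the diagonal makes every row sum to zero
        q[i][i] = -sum(q[i][j] for j in STATES if j != i)
    return q


def hky85_rates(kappa, beta):
    """HKY85 as a special case: rate kappa*beta for transitions, beta otherwise."""
    pairs = [frozenset((i, j)) for i in STATES for j in STATES if i < j]
    return {p: (kappa * beta if p in TRANSITIONS else beta) for p in pairs}


pi = {"A": 0.1, "G": 0.2, "C": 0.3, "T": 0.4}   # hypothetical frequencies
q = gtr_rate_matrix(pi, hky85_rates(kappa=2.0, beta=0.5))

# Reversibility (detailed balance): pi_i q_ij equals pi_j q_ji.
balanced = all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) < 1e-12
               for i in STATES for j in STATES)
print(balanced)
```

Because each off-diagonal entry is a symmetric rate multiplied by the target frequency, detailed balance holds by construction, which is why only 6 rates and 3 free frequencies are needed.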

Nucleotide substitution models have been extended in many ways to make them more biologically plausible. Some researchers considered that some nucleotide sites are invariant (unchanging) over time, and that the substitution rates of the variant sites can be modeled by a Gamma distribution (Nei and Gojobori, 1986; Yang, 1993; Tamura and Nei, 1993). An extension of the GTR model is called the GTR+I+$\Gamma$ model, where I stands for the proportion of invariant sites and $\Gamma$ stands for Gamma distributed rates among the variant sites.

1.2.2 Computation of the likelihood of a gene tree given a species tree and SNP data

Nucleotide substitution models provide the probabilistic models to infer gene trees from SNP data. Recall that we assume that each SNP has a conditionally independent gene tree given the species tree. Under a given nucleotide substitution model, we can compute the likelihood of a gene tree given SNPs. In addition, evolution along different lineages on the gene tree is assumed to be independent. Felsenstein (1981) provides a pruning or peeling algorithm, making it feasible to compute the likelihood of a fixed gene tree given a SNP.

The idea behind pruning is to do the calculation recursively from bottom to top, “unit by unit” of the tree. For illustration, a “unit” of the tree is composed of three nodes, one parent node and two descendant nodes (see Figure 3).

Figure 3. A “unit” of tree for the pruning algorithm of Felsenstein (1981). The parent node $p$ has the left child node $s$, with the time interval $t_s$ between them. The right child node $r$ has the time interval $t_r$ from the parent node $p$.

As shown in Figure 3, $p$ is the parent node, with two child nodes $s$ and $r$. Each node has one of the states A, G, C, or T. The two branch lengths from the parent to the child nodes are $t_s$ and $t_r$, respectively. The conditional likelihood of the data at the parent node or below (the dashed line in Figure 3) given that the parent node has state $i$, $L_p(i)$, is:

$$L_p(i) = \Big(\sum_{j} p_{ij}(t_s)\, L_s(j)\Big) \Big(\sum_{k} p_{ik}(t_r)\, L_r(k)\Big) \tag{1}$$

In equation (1), $L_s(j)$ is the conditional likelihood of the data at node $s$ or below given that node $s$ has state $j$. $L_r(k)$ is similarly defined as $L_s(j)$. If $s$ is a tip node (at the bottom of the tree) with an observed state $x_s$, the conditional likelihood of the observed data at tip node $s$ given that tip node $s$ has state $j$, $L_s(j)$, is an indicator function: if the observed state $x_s$ is not $j$, the conditional likelihood $L_s(j) = 0$; if the observed state $x_s$ is $j$, then $L_s(j) = 1$.

So starting from the tips (bottom), we recursively compute the conditional likelihoods for the parent nodes in the “units”, all the way up to the root. At the root, we obtain conditional likelihoods of the data at the root and below given each of the four possible states at the root, $L_{\text{root}}(i)$, where $i \in \{A, G, C, T\}$. Thus the overall likelihood of the gene tree ($G$) given a SNP ($D$) is

$$L(G \mid D) = \sum_{i} P(\text{root} = i)\, L_{\text{root}}(i) = \sum_{i} \pi_i\, L_{\text{root}}(i) \tag{2}$$

In equation (2), $L_{\text{root}}(i)$ can be recursively computed using equation (1). $P(\text{root} = i)$ is the probability of the root having state $i$. Since the Markov process for all the base substitution models is assumed to have reached the stationary distribution, the base frequency at the root node is just the base frequency in the stationary distribution, $\pi_i$, $i \in \{A, G, C, T\}$. It is worth noting that the likelihood of a gene tree given a SNP is taken as a function of the gene tree (a summary statistic), instead of a function of the data (Felsenstein, 1981). The same pruning idea will be used in this dissertation.
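The pruning recursion described above can be sketched in a few lines. This example assumes the JC69 model with a hypothetical rate of 1, a hypothetical gene tree with made-up branch lengths, and a nested-tuple tree representation; none of these choices comes from the dissertation's implementation.

```python
import math

STATES = "AGCT"


def jc69_prob(i, j, t, lam=1.0):
    """JC69 transition probability (the rate lam is a hypothetical value)."""
    decay = math.exp(-4.0 * lam * t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def conditional_likelihoods(node, observed):
    """Return {state: conditional likelihood at this node} by pruning.

    A tip is a taxon label (str); an internal node is a tuple
    (left_child, t_left, right_child, t_right).
    """
    if isinstance(node, str):
        # Tip: indicator of the observed nucleotide.
        return {i: 1.0 if observed[node] == i else 0.0 for i in STATES}
    left, t_left, right, t_right = node
    l_left = conditional_likelihoods(left, observed)
    l_right = conditional_likelihoods(right, observed)
    return {
        i: sum(jc69_prob(i, j, t_left) * l_left[j] for j in STATES)
        * sum(jc69_prob(i, k, t_right) * l_right[k] for k in STATES)
        for i in STATES
    }


def gene_tree_likelihood(tree, observed):
    """Weight the root conditional likelihoods by the stationary
    frequencies, which are all 1/4 under JC69."""
    root = conditional_likelihoods(tree, observed)
    return sum(0.25 * root[i] for i in STATES)


# Hypothetical gene tree (A, (B, C)) with made-up branch lengths.
gene_tree = ("A", 0.3, ("B", 0.1, "C", 0.1), 0.2)
site = {"A": "G", "B": "A", "C": "A"}   # one hypothetical SNP
print(gene_tree_likelihood(gene_tree, site))
```

Summing the likelihood over all $4^3$ possible site patterns on the three tips should return 1, which is a useful correctness check for any implementation of the recursion.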

1.3 Models for generating gene trees from a species tree

In Figure 2, the gene tree and the species tree share the same topology, so the gene tree describes the true evolutionary relationship among A, B and C. However, in practice, a gene tree can disagree with its species tree topologically for the following reasons: lineage sorting/deep coalescence, horizontal gene transfer/hybridization, gene duplication/extinction, and nonneutral evolution (Syvanen, 1990; Maddison, 1997; Linz et al., 2007).

In this project, we focus only on the discordance between the gene tree and the species tree caused by deep coalescence, which has been widely studied (Maddison, 1997; Felsenstein, 2004; Maddison and Knowles, 2006; Edwards, 2009). When viewing the evolutionary process backward in time, gene lineages may coalesce deeper than (that is, earlier than) the speciation event, resulting in deep coalescence. For example, in the species tree (A, (B, C)) of Figure 4, the gene lineage from taxon B fails to coalesce with the gene lineage from taxon C, but coalesces with the gene lineage from taxon A first, resulting in a gene tree ((A, B), C) that differs from the species tree. This process can be affected by the width of a branch, which represents the effective population size, and the length of a branch, which represents the speciation time. If the branch is short and wide, it is more likely for gene lineages to fail to coalesce during the speciation time, leading to deep coalescence (Maddison, 1997).

Figure 4. Discordance between the species tree (A, (B, C)) and the gene tree ((A, B), C) caused by deep coalescence. The dotted lines mark the speciation events.

The discrete time coalescent models are built from models in population genetics. One of the most popular models in population genetics is the Wright-Fisher model. It assumes that all generations have the same size and that all the individuals in one generation are replaced by their offspring in the next generation. The next generation is produced by selecting individuals at random with replacement from the current generation. Hence generations do not overlap and remain the same size throughout time. Another famous model, the Moran model, also assumes that the size of the population is constant. The individuals in the population reproduce as follows: in every discrete time interval, two individuals are randomly chosen with replacement; the first one reproduces itself (becoming two copies of the first gene copy) and the second one dies. So the generations in the Moran model are overlapping, in contrast to the non-overlapping generations in the Wright-Fisher model.

Here we focus only on the model developed from the Wright-Fisher model. Under the Wright-Fisher model, assuming the population size of the gene copies from diploid individuals is $2N$, we know that the probability of two gene copies sharing the same parent in the previous generation is $1/(2N)$. So the number of generations until two genes first share a parent follows a Geometric distribution, with success probability $1/(2N)$. In a sample of $k$ gene copies, the number of generations until at least two genes share a common ancestor follows a Geometric distribution, with success probability approximately $\binom{k}{2}/(2N)$ when $k$ is much smaller than $2N$.
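The per-generation success probability can be checked by comparing the exact Wright-Fisher calculation with its first-order approximation. In the sketch below, `two_n` denotes the number of gene copies in the population (written $2N$ for diploids), and the function names are hypothetical.

```python
from math import comb


def prob_some_pair_shares_parent(k, two_n):
    """Exact probability that k gene copies do not all have distinct parents
    when each copy picks its parent uniformly among two_n gene copies."""
    all_distinct = 1.0
    for i in range(1, k):
        all_distinct *= 1.0 - i / two_n
    return 1.0 - all_distinct


def first_order_approximation(k, two_n):
    """Leading term: the number of pairs divided by the population size."""
    return comb(k, 2) / two_n


k, two_n = 5, 20000   # hypothetical sample and population sizes
exact = prob_some_pair_shares_parent(k, two_n)
approx = first_order_approximation(k, two_n)
print(exact, approx)
```

For two gene copies the two expressions coincide exactly; for larger samples the approximation is accurate as long as the sample is small relative to the population, which is the regime in which the coalescent approximation is used.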

Kingman (1982a, b, c) derived the continuous time approximation to the discrete-time Wright-Fisher model by showing that, as the population size goes to infinity, the time to the first coalescent event in a sample of $k$ lineages (with the number of generations rescaled by $2N$) follows an exponential distribution, independently of the other coalescent times. So when $k$ gene lineages coalesce into $k-1$ lineages, this coalescent time is exponentially distributed, with mean $\frac{2}{k(k-1)}$ (Wakeley, 2008).
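A short simulation makes the exponential waiting times concrete. The sketch below draws the successive coalescent waiting times for a sample of lineages in a single population and compares the simulated time to the most recent common ancestor with its expectation, $\sum_{j=2}^{k} \frac{2}{j(j-1)} = 2(1 - 1/k)$. It assumes time is measured in units of $2N$ generations; the function names and the seed are arbitrary choices for illustration.

```python
import random


def sample_coalescent_times(k, rng):
    """Draw the waiting times while there are k, k-1, ..., 2 lineages.

    With j lineages the waiting time is exponential with rate j(j-1)/2,
    that is, with mean 2/(j(j-1)), assuming time units of 2N generations.
    """
    times = []
    for j in range(k, 1, -1):
        rate = j * (j - 1) / 2.0
        times.append(rng.expovariate(rate))
    return times


def expected_tmrca(k):
    """Expected total time until k lineages reach their common ancestor."""
    return sum(2.0 / (j * (j - 1)) for j in range(2, k + 1))


rng = random.Random(2013)   # arbitrary seed for reproducibility
replicates = 20000
mean_tmrca = sum(sum(sample_coalescent_times(5, rng))
                 for _ in range(replicates)) / replicates
print(mean_tmrca, expected_tmrca(5))
```

The last waiting time, when only two lineages remain, has mean 1 and therefore accounts for more than half of the expected total, regardless of the sample size.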


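Under the coalescent model, we are able to obtain the distribution of gene trees given a species tree ($S$), $f(\mathbf{G} \mid S)$. $\mathbf{G}$ contains all the possible gene trees given the species tree $S$, $\mathbf{G} = \{G_1, G_2, \dots, G_M\}$. Every gene tree $G_i$ includes the tree topology $T_i$ and the coalescent times $\mathbf{t}_i$ along the gene tree, where $i = 1, \dots, M$. Assuming that gene trees given a species tree are conditionally independent, we have

$$f(\mathbf{G} \mid S) = \prod_{i=1}^{M} f(G_i \mid S) = \prod_{i=1}^{M} f(T_i, \mathbf{t}_i \mid S) \tag{3}$$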

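Rannala and Yang (2003) derive the probability density of a gene tree with specified tree topology, given a species tree, $f(G \mid S)$. One parameter of a species tree is the effective size of the population, $N_e$, for all the populations. We re-parameterize $N_e$ as $\theta = 4N_e\mu$, where $\mu$ is the mutation rate, and in this dissertation we consider $\theta$ as a known constant throughout time.

We will assume that coalescent events in different intervals are independent. Thus we show work on only one interval $v$. In this interval $v$, let the number of gene lineages at the start of the interval be $m$, the number of lineages at the end of the interval be $n$, and the length of the interval be $\tau$. Note that there are $m - n$ coalescent events in this interval. Let $t_j$ be the waiting time until $j$ lineages coalesce into $(j-1)$ lineages, where $j = n+1, \dots, m$. The joint probability density of the tree structure of $m$ lineages joining into $n$ lineages in the interval of length $\tau$ and the coalescent times $(t_m, t_{m-1}, \dots, t_{n+1})$ is

$$\exp\left\{ -\frac{n(n-1)}{\theta} \left[ \tau - (t_m + t_{m-1} + \dots + t_{n+1}) \right] \right\} \prod_{j=n+1}^{m} \left[ \frac{2}{\theta} \exp\left\{ -\frac{j(j-1)}{\theta}\, t_j \right\} \right] \tag{4}$$

Because the coalescent events in different intervals are independent, the probability density of the whole gene tree $G$, including the topology $T$ and branch lengths $\mathbf{t}$, given a species tree with $V$ intervals, would be the product of (4) over all the intervals:

$$f(G \mid S) = f(T, \mathbf{t} \mid S) = \prod_{v=1}^{V} \left\{ \exp\left\{ -\frac{n_v(n_v-1)}{\theta} \left[ \tau_v - (t_{v,m_v} + \dots + t_{v,n_v+1}) \right] \right\} \prod_{j=n_v+1}^{m_v} \left[ \frac{2}{\theta} \exp\left\{ -\frac{j(j-1)}{\theta}\, t_{v,j} \right\} \right] \right\} \tag{5}$$

where the subscript $v$ marks the quantities $m$, $n$, $\tau$ and $t_j$ of interval $v$. This density is usually called the coalescent density. If we integrate the coalescent density over the coalescent times, the integral will give us the probability of a gene tree topology given a species tree.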
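The per-interval density of Rannala and Yang (2003) is easiest to handle on the log scale. The sketch below evaluates the log density for one interval and sums over intervals for a whole gene tree. The argument names, the ordering convention for the waiting times, and the numeric values in the usage lines are assumptions made for this illustration.

```python
import math


def interval_log_density(m, n, tau, waits, theta):
    """Log coalescent density for one interval of the species tree.

    m and n are the numbers of lineages entering and leaving the interval
    (backward in time), tau is its length, theta = 4 * N_e * mu, and waits
    lists the waiting times t_m, t_{m-1}, ..., t_{n+1} in that order.
    """
    if len(waits) != m - n:
        raise ValueError("expected m - n waiting times")
    if sum(waits) > tau:
        raise ValueError("coalescent events must fit inside the interval")
    log_f = 0.0
    remaining = tau
    for j, t_j in zip(range(m, n, -1), waits):
        log_f += math.log(2.0 / theta) - j * (j - 1) / theta * t_j
        remaining -= t_j
    if n > 1:
        # The n lineages left do not coalesce in the rest of the interval.
        log_f -= n * (n - 1) / theta * remaining
    return log_f


def gene_tree_log_density(intervals, theta):
    """Sum the interval log densities; intervals holds (m, n, tau, waits)."""
    return sum(interval_log_density(m, n, tau, waits, theta)
               for m, n, tau, waits in intervals)


theta = 0.5   # hypothetical value
# Two lineages that fail to coalesce in an interval of length 0.3:
print(math.exp(interval_log_density(2, 2, 0.3, [], theta)))
# Root interval (infinite length): three lineages coalesce down to one.
print(interval_log_density(3, 1, math.inf, [0.05, 0.2], theta))
```

For the root interval $n = 1$, so the no-coalescence factor vanishes and an infinite interval length causes no difficulty. Integrating the density over the waiting time for two lineages and adding the no-coalescence probability gives 1, as expected.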

In equation (5), coalescent events have been assigned to certain intervals, but because of deep coalescence, the coalescent events can also take place in deeper intervals. For example, in Figure 2, coalescent event $b$ occurs in interval 4 (say, possibility I), and $t_b$ is defined within interval 4. However, for the same gene tree topology and the species tree, $b$ can also take place in the root interval 5 (say, possibility II). So the coalescent density describing the gene tree and species tree setup in Figure 2 can only represent possibility I of the gene tree, given the species tree.

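If we use “coalescent histories” to represent the “possibilities” under a gene tree topology, given a species tree, one gene tree topology can have different histories (Degnan and Salter, 2005). Therefore, the coalescent density (shown in equation (5)) is actually the probability density for a coalescent history of a gene tree, given a species tree. Instead of using gene trees, if there are $H$ possible coalescent histories for species tree $S$, for the history $h$, $h = 1, \dots, H$, we have the probability density function as

$$f(h \mid S) = \prod_{v=1}^{V} \left\{ \exp\left\{ -\frac{n_v(n_v-1)}{\theta} \left[ \tau_v - (t_{v,m_v} + \dots + t_{v,n_v+1}) \right] \right\} \prod_{j=n_v+1}^{m_v} \left[ \frac{2}{\theta} \exp\left\{ -\frac{j(j-1)}{\theta}\, t_{v,j} \right\} \right] \right\} \tag{6}$$

where $V$ is the number of intervals on the species tree, and $m_v$, $n_v$, $\tau_v$ and $t_{v,j}$ are the numbers of lineages entering and leaving interval $v$, the length of interval $v$, and the coalescent waiting times in interval $v$ under history $h$.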

Degnan and Salter (2005) further derived the marginal distribution (discrete distribution) of gene tree topologies given a species tree, by summing over the probabilities for all possible coalescent histories.

For a gene tree with $M$ lineages, there are $M-1$ coalescent events, and we would like to store the intervals in which the coalescent events take place. Because the last coalescent event must occur in the root interval, we do not need to store it. A coalescent history for a gene tree is thus a vector with $M-2$ elements, where each element records the interval where the coalescent event takes place. For instance, for the 4-taxon gene tree in Figure 2, two coalescent events need to be recorded. There are two coalescent histories: the history shown in Figure 2 is $(4, 5)$, and the other possible history is $(5, 5)$, since the first event can happen in interval 4 or 5 but the second event can only happen in root interval 5. In Degnan and Salter (2005), a probability mass function for a gene tree topology given a species tree is explicitly given by summing over all the histories under the constraint of the gene tree. Degnan and Salter (2005) implemented this algorithm in a program called COAL to provide a gene tree topology distribution given a species tree.

However, when dealing with large trees (gene tree or species tree with a large number of taxa), COAL is limited by the enumeration of coalescent histories. The number of histories increases dramatically as the number of taxa increases. According to Degnan and Salter (2005), for a 20-taxon asymmetric species tree, when the gene tree has the same topology as the species tree, there are more than 1.7 billion histories to enumerate, so the computation cost is extremely high for large tree problems.

1.4 The overall model for the evolution of SNP data given a species tree

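Now we can formulate the likelihood of a species tree given SNP data, by combining the models from Section 1.2 and Section 1.3. According to Maddison (1997), for K possible gene trees given the species tree, the likelihood of the species tree ($S$) given SNPs ($D$) at $L$ sites, $\mathcal{L}(S \mid D)$, is given by

$$\mathcal{L}(S \mid D) = \prod_{l=1}^{L}\left\{\sum_{k=1}^{K}\int \left[P(D_l \mid G_k)\, f(G_k \mid S)\right] d\mathbf{t}\right\} \qquad (7)$$

In equation (7), $P(D_l \mid G_k)$ is from SNP evolution along the gene tree and $f(G_k \mid S)$ is the coalescent density.

Since each gene tree includes multiple coalescent histories, we will modify equation (7). Instead of summing over gene trees, we need to sum over the $H$ possible gene tree histories given the species tree $S$.

$$\mathcal{L}(S \mid D) = \prod_{l=1}^{L}\left\{\sum_{h=1}^{H}\int \left[P(D_l \mid G_h)\, f(G_h \mid S)\right] d\mathbf{t}\right\} \qquad (8)$$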

The likelihood of a species tree can be used as the basis for inference. One immediate application would be to search for the maximum likelihood species tree given data. Bayesian approaches use the likelihood function to give a posterior distribution for the parameters on the species tree and the species tree itself.

According to Degnan and Salter (2005), the number of coalescent histories is largest when the gene tree and the species tree are both maximally asymmetric. Note that when the gene tree and the species tree do not match in topology, the number of histories will be smaller (Degnan and Salter, 2005). However, this does not change the fact that the total number of gene histories given the species tree is still too large to enumerate. To overcome the issue of history enumeration, Wu (2011) proposed an algorithm to infer the distribution of gene tree topologies given the species tree by summing over what he called ancestral configurations (ACs), rather than summing over histories as in Degnan and Salter (2005). This algorithm is implemented in the program STELLS, which runs faster than COAL when computing the probability distribution of gene tree topologies given a species tree.

1.5 A survey of methods used for inferring species trees from genetic data


Recall that our goal is to estimate the species tree using genetic data at the tips of the tree. In this section, we give a brief literature review of some of the currently available methods for doing this. These can be broadly divided into sequence-based methods and summary statistic methods. Each class of methods is discussed below.

1.5.1 Sequence-based methods

Concatenation methods (Huelsenbeck et al., 1996) simplify the whole process by making the assumption that all genes have the same evolutionary history ($G$) given the true species tree $S$, so $\mathcal{L}(S \mid D) = P(D \mid G)$. This method concatenates all the sequences into a long sequence. Although concatenation methods greatly lower the computational burden of the problem, they ignore the coalescent part of the process and assume that all genes share the same evolutionary history. A few studies have shown that when deep coalescence occurs in the short intervals on the species tree, concatenation methods will likely result in estimation of a wrong tree (Carstens and Knowles, 2007; Kolaczkowski and Thornton, 2004; Kubatko and Degnan, 2007; Mossel and Vigoda, 2005).

In a probabilistic framework, we can use equation (8) to compute the likelihood of a species tree given DNA sequence data, and infer the maximum likelihood species tree. However, the challenge in implementation comes from the enormous number of gene tree histories given a large species tree, which affects the efficiency with which the likelihood can be computed. In addition, to infer a maximum likelihood species tree given data, we need to search over all of the tree topologies and optimize the branch lengths on each of the species tree topologies based on the likelihood. As one would expect, inferring a maximum likelihood tree is extremely computationally intensive when the tree is large.

Since the maximum likelihood method is hard to implement, Bayesian approaches are applied to give a posterior distribution of the species tree given data. Liu (2008) developed the software Bayesian Estimation of Species Trees (BEST), which employs a two-step procedure to estimate the posterior distribution of species trees under the coalescent model. The first step uses MrBayes (phylogenetic software for Bayesian inference; Ronquist et al., 2012) to infer the posterior distribution for each individual gene tree. Next, Markov chain Monte Carlo (MCMC) is used with importance sampling to estimate the posterior distribution of the species trees. Instead of using multi-stage estimation as in BEST, *BEAST (Bayesian Evolutionary Analysis Sampling Trees; Heled and Drummond, 2010) uses a one-stage Bayesian MCMC analysis to estimate the joint posterior distribution of the gene trees and the species tree. Both Bayesian approaches can estimate the parameters of the species tree given DNA sequence data, but both suffer from the common disadvantages of MCMC: high computational cost and long run times.

To avoid the enumeration of gene tree histories, Bryant et al. (2012) developed a method to compute the likelihood of a species tree directly from unlinked biallelic SNPs, without using gene trees or gene tree histories. The likelihood obtained from this method is exact and has been used in an MCMC approach to infer the joint posterior distribution of the parameters of the species tree. The method is implemented in the program SNAPP (SNP and AFLP Phylogenies; Bryant et al., 2012).

1.5.2 Summary statistic methods

The estimates of gene trees from sequence data can be viewed as summary statistics of the data. Given a user-specified nucleotide substitution model, the likelihood of a gene tree can be computed and the maximum likelihood estimate of a gene tree can be used as the estimate for the gene tree (Felsenstein, 1981).

Treating summary statistics as input data to infer the species tree can greatly reduce the computational burden. Species Tree Estimation using Maximum Likelihood (STEM; Kubatko et al., 2009) uses a sample of gene trees with branch lengths to estimate the maximum likelihood species tree. Liu et al. (2009) proposed two methods, Species Tree estimation using Average Ranks of coalescences (STAR) and Species Tree Estimation using Average Coalescent times (STEAC). STAR uses average ranks of coalescences of estimated gene trees to construct a neighbor-joining (NJ) tree (tree topology). STEAC uses a distance matrix whose entries are twice the average coalescent times over the estimated gene trees to build a distance tree topology as the estimate for the species tree. Liu et al. (2011) computed the pseudo-likelihood of the species tree using gene trees to infer the maximum pseudo-likelihood species tree, implemented in the program Maximum Pseudo-likelihood for Estimating Species Trees (MP-EST). Fan and Kubatko (2011) applied the idea of Approximate Bayesian Computation to species tree estimation, using the gene tree topology distribution as the summary statistic (ST-ABC).

STEM, MP-EST, and ST-ABC give an estimate of the species tree with branch lengths, while STAR and STEAC only estimate the topology of the species tree. ST-ABC requires only gene tree topologies, while STEM and STEAC require gene tree estimates with branch lengths. MP-EST can use gene trees with or without branch lengths. All the methods that directly use gene trees assume that, under a coalescent model, the gene trees are estimated correctly. However, in reality, the estimation of gene trees inevitably comes with uncertainty, especially in the estimation of the branch lengths of the gene trees.

1.6 Overview of this dissertation

In Chapter 2, I will propose a method using the idea of ACs to estimate the likelihood of a species tree, given SNP data. This method of species tree likelihood estimation will be used to infer the maximum likelihood species tree given data in Chapter 3. Chapter 4 will be an application of my method to the problem of species delimitation. In the end, I will discuss the methods I have developed in Chapter 5.


Chapter 2: Likelihood Computation Using Ancestral Configurations

In this chapter, we will introduce a method to compute the species tree likelihood given SNP data using ACs. ACs will be defined in Section 2.1, and Section 2.2 will compare ACs with gene tree histories. In Section 2.3, we will illustrate how to use ACs in the likelihood computation. The efficiency of the method can be improved by processing the data and by sampling gene trees, as described in Section 2.4. Section 2.5 will test the accuracy of this method. Using the estimated species tree likelihood, we will optimize the branch lengths on the species tree in Section 2.6.

2.1 Definition of AC

The concept of an AC was introduced by Wu (2011) as a method of speeding the computation of gene tree probabilities given a species tree under the coalescent model. In this section, the concept of an AC is reviewed, and the method used by Wu (2011) for enumerating ACs is explained.

Recall in chapter 1 that we considered gene histories nested within a species tree. Here we partition a species tree, along with its gene histories, into a collection of intervals, where each interval is defined to begin at the bottom and to end at the top.


We define an AC as a set of gene lineages at the bottom of one interval, right before the speciation time. Every interval has its own set of ACs. In a tip interval, since there is only one possible set of gene lineages observed from data, there is only one AC. Internal intervals, including the root interval, can have multiple ACs. Depending on the gene tree being considered, the gene lineages in an AC may or may not coalesce during the interval. So at the end of the interval (top of the interval), different possible sets of outgoing lineages (OLs) can be produced. OLs from two children intervals will contribute to the ACs in the parent interval. An AC in the parent interval is composed of an OL from the left child interval and an OL from the right child interval.

To clarify the notation, let $v$ denote an arbitrary interval of the species tree. At the beginning (bottom) of interval $v$, let $AC_{v,i}$ denote the $i$th AC, and let $\mathbf{AC}_v$ denote the set of all ACs in interval $v$. At the end (top) of interval $v$, let $OL_{v,j}$ denote the $j$th OL in the interval and $\mathbf{OL}_v$ denote the set of all OLs generated in the interval.

As an example, consider the symmetric 4-taxon species tree shown in Figure 5(a) and the gene tree with the same topology in Figure 5(b). A post-order traversal of the tree can be used to find all possible ACs and OLs within the intervals of the species tree, as shown in Figure 5(c).

In Figure 5, we only have gene lineage 1 observed from the data in tip interval 1, and so $\mathbf{AC}_1 = \{\{1\}\}$. There is one outgoing lineage set in this interval, $\mathbf{OL}_1 = \{\{1\}\}$, since no coalescent events are possible when there is only a single lineage. Similarly, in tip interval 2 we have $\mathbf{AC}_2 = \{\{2\}\}$ and $\mathbf{OL}_2 = \{\{2\}\}$. Note that interval 6 is the parent interval for child intervals 1 and 2, and each of the child intervals only has one OL. Hence, for parent interval 6, we only have one AC, which is $\{1, 2\}$. According to the gene tree in Figure 5(b), the lineages 1 and 2 in this AC can coalesce, but the coalescent event does not necessarily take place in interval 6. So if lineages 1 and 2 coalesce into lineage 6 within the interval, we have the OL $\{6\}$; if they do not coalesce, we have the OL $\{1, 2\}$. Similarly, on the other side of the species tree, in interval 5, starting with the AC $\{3, 4\}$, we end with two possible OLs that satisfy the constraint of the gene tree, $\{5\}$ and $\{3, 4\}$.

In interval 7 of Figure 5(c), two OLs from interval 5 and two OLs from interval 6 independently contribute to the ACs in their parent interval 7, so there are four ACs. Note that interval 7 is a root interval, where the gene lineages in each AC will coalesce into the root lineage 7, so the four OLs formed from the four ACs hold the same root lineage, $\{7\}$, shown in Figure 5(c).

Figure 5. ACs and OLs in an example 4-taxon species tree. In the intervals of the species tree ((A, B), (C, D)) in (a), all the possible ACs and OLs are listed in (c), with the constraint of the gene tree ((A, B), (C, D)) in (b).
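The post-order enumeration of ACs and OLs in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the STELLS implementation: trees are written as nested tuples, one individual is sampled per taxon so a tip lineage carries its taxon label, each gene lineage is represented by the set of tips below it, and all function names are hypothetical.

```python
from itertools import product

def leaf_set(tree):
    """Set of tip labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])

def gene_clades(tree, out=None):
    """Map every internal gene-tree lineage (a clade) to its two child lineages."""
    out = {} if out is None else out
    if not isinstance(tree, str):
        left, right = leaf_set(tree[0]), leaf_set(tree[1])
        out[left | right] = (left, right)
        gene_clades(tree[0], out)
        gene_clades(tree[1], out)
    return out

def outgoing(ac, clades, is_root):
    """All lineage sets reachable from one AC by coalescences the gene tree allows."""
    seen, stack = {ac}, [ac]
    while stack:
        current = stack.pop()
        for parent, (left, right) in clades.items():
            if left in current and right in current:
                merged = (current - {left, right}) | {parent}
                if merged not in seen:
                    seen.add(merged)
                    stack.append(merged)
    # In the root interval every lineage must coalesce into the root lineage.
    return {s for s in seen if len(s) == 1} if is_root else seen

def enumerate_intervals(species, clades, table, is_root=True):
    """Post-order traversal filling table[interval] = (ACs, OLs); returns the OLs."""
    if isinstance(species, str):
        acs = {frozenset([frozenset([species])])}
    else:
        left = enumerate_intervals(species[0], clades, table, False)
        right = enumerate_intervals(species[1], clades, table, False)
        acs = {a | b for a, b in product(left, right)}
    ols = set()
    for ac in acs:
        ols |= outgoing(ac, clades, is_root)
    table[leaf_set(species)] = (acs, ols)
    return ols

gene = (("A", "B"), ("C", "D"))
species = (("A", "B"), ("C", "D"))
table = {}
enumerate_intervals(species, gene_clades(gene), table)
for interval, (acs, ols) in table.items():
    print(sorted(interval), len(acs), "ACs,", len(ols), "OLs")
```

For the symmetric example this gives one AC and two OLs in each of the two cherry intervals and four ACs with a single OL in the root interval, as described in the text. Each interval is keyed by the set of taxa below it rather than by the interval numbers used in Figure 5.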


After generating the ACs and OLs, Wu (2011) used these to compute the probability of a gene tree topology given the species tree. In the spirit of the pruning algorithm (Felsenstein, 1981), we begin with the tip intervals, traverse all of the internal intervals in a post-order traversal and stop at the root interval. In a general interval $v$ with length $\tau_v$ in the species tree S, let $p_{m,n}(\tau_v)$ be the probability of changing the number of gene lineages from $m$ to $n$ (Rosenberg, 2002). The probability of $OL_{v,j}$, $P(OL_{v,j} \mid S)$, conditioning on the species tree is given by the probability of the $AC_{v,i}$, $P(AC_{v,i} \mid S)$, updated by the probability $p_{m,n}(\tau_v)$.

$$P(OL_{v,j} \mid S) = P(AC_{v,i} \mid S)\; p_{m,n}(\tau_v) \qquad (9)$$

In the root interval (call it interval $r$), we sum over the probabilities of OLs to get the probability of a gene tree topology given the species tree. Assume we have $J$ OLs in interval $r$. The probability of the gene tree topology $G$ given the species tree is then obtained by combining the probabilities of all the OLs in the root interval, as given in equation (10),

$$P(G \mid S) = \sum_{j=1}^{J} P(OL_{r,j} \mid S) \qquad (10)$$
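The probability that the number of lineages changes within an interval has a standard closed form (Tavaré, 1984), which is the quantity used in Rosenberg (2002). The sketch below evaluates it; the function name is hypothetical, and time `T` is assumed to be measured in coalescent units, in which each pair of lineages coalesces at rate 1. Wu's update for a specific AC and OL may include additional topology-dependent factors that are not shown here.

```python
import math

def lineage_transition_prob(i, j, T):
    """P(i lineages coalesce to exactly j lineages within time T).

    T is in coalescent units (pairwise coalescence rate 1).
    """
    if not 1 <= j <= i:
        return 0.0
    total = 0.0
    for k in range(j, i + 1):
        prod = 1.0
        for y in range(k):
            prod *= (j + y) * (i - y) / (i + y)
        total += (
            math.exp(-k * (k - 1) * T / 2.0)
            * (2 * k - 1)
            * (-1) ** (k - j)
            / (math.factorial(j) * math.factorial(k - j) * (j + k - 1))
            * prod
        )
    return total

# Two lineages coalesce within time T with probability 1 - exp(-T).
print(lineage_transition_prob(2, 1, 1.0), 1.0 - math.exp(-1.0))
# Starting from 5 lineages, the probabilities over j = 1..5 sum to 1.
print(sum(lineage_transition_prob(5, j, 0.7) for j in range(1, 6)))
```

The two printed checks are basic sanity properties: agreement with the two-lineage case, and a total probability of one over all possible numbers of outgoing lineages.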


2.2 Advantages of using AC

After introducing ACs and OLs, it is useful to compare the efficiency of calculations based on ACs and gene tree histories. When using ACs, the probabilities for all the possible configurations within each interval are computed. Hence we only consider each interval once during the calculation of the probability of a gene tree given a species tree.

In contrast, Degnan and Salter (2005) define a gene tree history as a gene tree topology with coalescent times. The times are relative to the intervals in which coalescent events take place. In the algorithm using histories (Degnan and Salter, 2005), every time we compute the probability for one history, we have to go over all the intervals. So if a gene tree has H histories, we have to consider all of the intervals H times. Further, if multiple histories share the same AC in some interval, we have to repeatedly calculate the probability for the same AC. Therefore summing over ACs reduces the computational burden in comparison to summing over histories. Using ACs in the computation will thus be advantageous when the tree is large.

Wu (2011) compares the number of ACs and the number of histories when the species tree and the gene tree share the same topology, shown in Table 1. Let N be the number of taxa in the species tree. In Table 1, as N goes up, the number of ACs grows much more slowly than the number of histories, for both the symmetric and the asymmetric case. Therefore, for a particular gene tree and species tree with many taxa, the number of ACs is much smaller than the number of histories. The computation using ACs is thus more efficient than the one using histories. As shown in Wu (2011), STELLS runs faster than COAL (Degnan and Salter, 2005) in getting the probability distribution of gene tree topologies given the species tree.

Table 1. Comparison of the numbers of ACs and histories as the number of taxa (N) goes up (Wu, 2011).

         Number of ACs             Number of Histories
N     Asymmetric   Symmetric    Asymmetric       Symmetric
4         10          10                 5               4
5         15          15                14              10
6         21          21                42              25
7         28          28               132              65
8         36          36               429             169
9         40          49             1,430             481
10        55          63             4,862           1,369
12        78          90            58,786          11,236
16       138         193         9,694,845       1,020,100
20       210         555     1,767,263,190     100,360,324
30       465       4,425
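The asymmetric history counts in Table 1 coincide with the Catalan numbers $C_{N-1}$, which gives a quick way to see how fast history enumeration grows. The short sketch below reproduces that column; the helper name `catalan` is introduced here for illustration only.

```python
from math import comb

def catalan(k):
    """k-th Catalan number, C_k = binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

# Histories for matching asymmetric trees with N taxa, compared with Table 1.
for n_taxa in (4, 5, 6, 7, 8, 9, 10, 12, 16, 20):
    print(n_taxa, catalan(n_taxa - 1))
```

Because Catalan numbers grow roughly like $4^N$, each additional taxon multiplies the number of histories by almost four, while the AC counts in the table grow only polynomially in the asymmetric case.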

2.3 A method for the likelihood computation of a species tree given SNP data using ACs


I propose a new method to compute the likelihood of a single species tree $S$ given SNP data $D$, $\mathcal{L}(S \mid D)$, under the coalescent model. My method uses ideas from peeling (Felsenstein, 1981) and ACs (Wu, 2011).

My method is specifically designed for SNP data. Suppose that $M$ individuals are sampled from N taxa, with at least one individual sampled from each taxon, so $M \ge N$. A SNP dataset is composed of SNPs collected from each of the $M$ individuals. Recall that in a tip interval, there is only one AC. In the tip interval corresponding to a taxon, if one lineage in the AC, say lineage $a$, has observed data $d_a$ at one SNP site, where $d_a \in \{A, C, G, T\}$, then the likelihood for lineage $a$ having state $x$ is

$$\mathcal{L}_a(x) = \begin{cases} 1, & x = d_a \\ 0, & x \neq d_a \end{cases}$$

For the method proposed here, we assume that

I. Each SNP is assumed to have a conditionally independent gene tree with $M$ tips given the species tree $S$;

II. The incongruence between gene trees and species tree exclusively results from incomplete lineage sorting; we employ the coalescent model and the coalescent density of a gene tree given the species tree is provided by Rannala and Yang (2003), shown in equation (5);

III. SNP evolution along the gene trees follows the same nucleotide substitution model (e.g., the GTR model or one of its sub-models; Section 1.2).

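In the method proposed here, we first compute the likelihood of the gene tree topology $G_k$ and the species tree $S$ given data at SNP $l$, $\mathcal{L}(G_k, S \mid D_l)$, by enumerating ACs in each interval of the species tree in a post-order traversal. Peeling is then used to compute the overall likelihood for gene tree $G_k$. Next, we add up $\mathcal{L}(G_k, S \mid D_l)$ over all of the possible gene tree topologies to obtain the likelihood of the species tree given the SNP data $D_l$, $\mathcal{L}(S \mid D_l)$. Finally, we multiply $\mathcal{L}(S \mid D_l)$ over all of the SNPs and re-write $\mathcal{L}(S \mid D)$ as follows,

$$\mathcal{L}(S \mid D) = \prod_{l=1}^{L} \mathcal{L}(S \mid D_l) = \prod_{l=1}^{L} \sum_{k=1}^{K} \underbrace{\int \left[P(D_l \mid G_k)\, f(G_k \mid S)\right] d\mathbf{t}}_{\mathcal{L}(G_k, S \mid D_l)} \qquad (11)$$

$P(D_l \mid G_k)$ can be computed from the tips to the root interval in a post-order traversal, similar to equation (10); $f(G_k \mid S)$ is the product of coalescent densities over the intervals, shown in (5). As a result, $\mathcal{L}(G_k, S \mid D_l)$ has a nested structure that can be computed recursively by intervals.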

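We now define the calculation of the conditional likelihood of an AC having certain states for an arbitrary interval. Suppose that a parent interval $v$ has its left child interval $v_1$ with length $\tau_1$ and right child interval $v_2$ with length $\tau_2$. In the left child interval, the OL ($OL_{v_1}$) is generated from the AC ($AC_{v_1}$); in the right interval, the OL ($OL_{v_2}$) is generated from the AC ($AC_{v_2}$). Next, $OL_{v_1}$ and $OL_{v_2}$ combine into the AC in the parent interval ($AC_v$).

In the parent interval, the conditional likelihood of $AC_v$ having certain states given data $D_l$, $\mathcal{L}(AC_v \mid D_l)$, is combined from the conditional likelihood of $OL_{v_1}$ having certain states, $\mathcal{L}(OL_{v_1} \mid D_l)$, and the conditional likelihood of $OL_{v_2}$ having certain states, $\mathcal{L}(OL_{v_2} \mid D_l)$, shown as below.

$$\begin{aligned}
\mathcal{L}(AC_v \mid D_l) &= \mathcal{L}(OL_{v_1} \mid D_l)\,\mathcal{L}(OL_{v_2} \mid D_l), \\
\mathcal{L}(OL_{v_1} \mid D_l) &= \mathcal{L}(AC_{v_1} \mid D_l)\, P(AC_{v_1} \to OL_{v_1};\, \tau_1), \\
\mathcal{L}(OL_{v_2} \mid D_l) &= \mathcal{L}(AC_{v_2} \mid D_l)\, P(AC_{v_2} \to OL_{v_2};\, \tau_2)
\end{aligned} \qquad (12)$$

In (12), recursively, $\mathcal{L}(AC_{v_1} \mid D_l)$ and $\mathcal{L}(AC_{v_2} \mid D_l)$ can be computed conditioning on their children intervals respectively. Here $P(AC_{v_2} \to OL_{v_2};\, \tau_2)$ is the transition probability of the lineages in $OL_{v_2}$ having certain states, changed from the lineages in $AC_{v_2}$ having certain states, in the right child interval with length $\tau_2$, given the tree structure in the interval; $P(AC_{v_1} \to OL_{v_1};\, \tau_1)$ is defined in the same way for the left child interval.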


Figure 6. An interval composed of an AC and an OL in a general interval on the species tree. In the interval of the species tree, one AC composed of $m$ lineages goes into the interval, and after coalescing, develops into one OL composed of $n$ lineages going out of the interval.

Next I will illustrate the details of the recursion in one general interval $v$, with branch length $\tau_v$, shown in Figure 6. In interval $v$, assume that there are I ACs and J OLs. Consider the $i$th AC, $AC_{v,i}$, which is composed of $m$ input lineages. $AC_{v,i}$ produces the $j$th OL, $OL_{v,j}$, composed of $n$ outgoing lineages, where $n \le m$. Given that $AC_{v,i}$ and $OL_{v,j}$ are both specified, referring to the gene tree topology, a unique tree structure (TS) given the species tree (S) is determined in the interval. If $m > n$, (m-n) coalescent events have taken place and there are (m-n) coalescent times, $\mathbf{t} = (t_m, \ldots, t_{n+1})$, associated with the coalescent events. Given the tree structure, the coalescent times follow the distribution with the coalescent density function $f(\mathbf{t} \mid TS)$ for one interval under the coalescent model, which can be found in equation (4) in Section 1.3.

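At the beginning of the interval, we have the conditional likelihood of the lineages in $AC_{v,i}$ carrying certain states given data $D_l$, $\mathcal{L}(AC_{v,i} \mid D_l)$. Given the determined tree structure in the interval, if we know the coalescent times $\mathbf{t}$ from the density function $f(\mathbf{t} \mid TS)$, we can compute the site pattern probability $g(\mathbf{t})$ (Felsenstein, 1981), under the chosen substitution model. If we integrate with respect to $\mathbf{t}$, we will have the transition probability of $AC_{v,i}$ having certain states changing into $OL_{v,j}$ having certain states.

$$P(AC_{v,i} \to OL_{v,j}) = \int g(\mathbf{t})\, f(\mathbf{t} \mid TS)\, d\mathbf{t} \qquad (13)$$

Multiplying equation (13) by $\mathcal{L}(AC_{v,i} \mid D_l)$, we can compute the conditional likelihood of the lineages in $OL_{v,j}$ carrying certain states given data $D_l$.

$$\mathcal{L}(OL_{v,j} \mid D_l) = \mathcal{L}(AC_{v,i} \mid D_l)\; P(AC_{v,i} \to OL_{v,j}) \qquad (14)$$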

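Consider equation (13) further. Given a specific tree structure $TS$, $f(\mathbf{t} \mid TS)$ is the probability density function for the coalescent times $\mathbf{t}$, and $g(\mathbf{t})$ is a function of $\mathbf{t}$, so equation (13) can be viewed as the expectation of $g(\mathbf{t})$ with respect to the random vector $\mathbf{t}$,

$$P(AC_{v,i} \to OL_{v,j}) = \int g(\mathbf{t})\, f(\mathbf{t} \mid TS)\, d\mathbf{t} = E\left[g(\mathbf{t})\right] \qquad (15)$$

However, the high-dimensional integral does not have a general closed-form expression, since it depends on the specific TS. Thus, Monte Carlo integration, a useful numerical technique for evaluating high-dimensional integrals, can be used to approximate the value of the integral. W random samples of coalescent time vectors, $\mathbf{t}_1, \ldots, \mathbf{t}_W$, are generated according to the coalescent density $f(\mathbf{t} \mid TS)$, and $g(\mathbf{t})$ is evaluated at each of the time vectors in the sample. The average $\frac{1}{W}\sum_{w=1}^{W} g(\mathbf{t}_w)$ is taken as the estimate for the integral in equation (15), since the sample average converges to its expectation almost surely as the sample size goes to infinity.

$$\frac{1}{W}\sum_{w=1}^{W} g(\mathbf{t}_w) \;\to\; E\left[g(\mathbf{t})\right] \qquad (16)$$

Based on equation (16), we can approximate equation (15) by

$$P(AC_{v,i} \to OL_{v,j}) = \int g(\mathbf{t})\, f(\mathbf{t} \mid TS)\, d\mathbf{t} = E\left[g(\mathbf{t})\right] \approx \frac{1}{W}\sum_{w=1}^{W} g(\mathbf{t}_w) \qquad (17)$$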

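The next question is how to generate coalescent times from the coalescent density $f(\mathbf{t} \mid TS)$. Note that the density differs based on the tree structures within the interval given the length of the interval. So I will use importance sampling to overcome this issue. The idea is that if the target density is hard to sample from, we instead sample from a similar distribution which is easier to sample from.

The target density here is $f(\mathbf{t} \mid TS)$ in one interval of length $\tau$ with $m$ incoming and $n$ outgoing lineages, which is rewritten as below.

$$\begin{aligned}
f(\mathbf{t} \mid TS) &= \exp\left\{-\frac{n(n-1)}{\theta}\left[\tau - \sum_{j=n+1}^{m} t_j\right]\right\} \prod_{j=n+1}^{m}\left[\frac{2}{\theta}\exp\left\{-\frac{j(j-1)}{\theta}\,t_j\right\}\right] \\
&= \exp\left\{-\frac{n(n-1)}{\theta}\,\tau\right\} \prod_{j=n+1}^{m}\left[\frac{2}{\theta}\exp\left\{-\frac{j(j-1) - n(n-1)}{\theta}\,t_j\right\}\right]
\end{aligned} \qquad (18)$$

From equation (18), we can see that the coalescent density belongs to the exponential family, and the interval length of the species tree restricts the coalescent times. Hence we generate the coalescent times from truncated exponential distributions. The easy density function is a joint truncated exponential density for the vector $\mathbf{t}$,

$$q(\mathbf{t} \mid TS) = \prod_{j=n+1}^{m}\left[\frac{\lambda_j \exp\{-\lambda_j t_j\}}{1 - \exp\{-\lambda_j b_j\}}\right] \qquad (19)$$

where $\lambda_j$ is the rate and $b_j \le \tau$ is the truncation bound of the truncated exponential distribution used for $t_j$.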

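Then the weight for the importance sampling used here is

$$\frac{f(\mathbf{t} \mid TS)}{q(\mathbf{t} \mid TS)} = \exp\left\{-\frac{n(n-1)}{\theta}\left[\tau - \sum_{j=n+1}^{m} t_j\right]\right\} \prod_{j=n+1}^{m}\left\{\left[\frac{1 - \exp\{-\lambda_j b_j\}}{\lambda_j}\right]\left[\frac{2}{\theta}\exp\left\{-\left(\frac{j(j-1)}{\theta} - \lambda_j\right) t_j\right\}\right]\right\} \qquad (20)$$

where $q(\mathbf{t} \mid TS)$ is the joint truncated exponential density, with $\lambda_j$ and $b_j$ the rate and truncation bound used for $t_j$.

Hence, if we randomly sample W coalescent time vectors $\mathbf{t}_1, \ldots, \mathbf{t}_W$ from the density function $q(\mathbf{t} \mid TS)$ in equation (19), we can estimate the expectation in equation (16) and rewrite equation (17) as

$$P(AC_{v,i} \to OL_{v,j}) \approx \frac{1}{W}\sum_{w=1}^{W}\left[g(\mathbf{t}_w)\,\frac{f(\mathbf{t}_w \mid TS)}{q(\mathbf{t}_w \mid TS)}\right] \qquad (21)$$

In equation (21), the weight $f(\mathbf{t}_w \mid TS) / q(\mathbf{t}_w \mid TS)$ can be computed from equation (20).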
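The importance sampling step can be illustrated with a minimal Python sketch. Several choices here are assumptions made only for the example and are not taken from the text: the target is the Rannala and Yang (2003) interval density with rates $j(j-1)/\theta$, the proposal draws the waiting times sequentially from exponentials with those same rates truncated to the time remaining in the interval, and all function names are hypothetical. The function `g` stands for the site pattern probability as a function of the coalescent times.

```python
import math
import random

def sample_trunc_exp(rate, upper, rng):
    """Inverse-CDF draw from an exponential(rate) truncated to [0, upper]."""
    u = rng.random()
    return -math.log1p(u * math.expm1(-rate * upper)) / rate

def log_trunc_exp_pdf(t, rate, upper):
    """Log density of the truncated exponential at t."""
    return math.log(rate) - rate * t - math.log(-math.expm1(-rate * upper))

def log_target(waits, m, n, tau, theta):
    """Log coalescent density of one interval (assumed Rannala-Yang form)."""
    log_f = -n * (n - 1) / theta * max(tau - sum(waits), 0.0)
    for j, t in zip(range(m, n, -1), waits):
        log_f += math.log(2.0 / theta) - j * (j - 1) / theta * t
    return log_f

def importance_estimate(g, m, n, tau, theta, W, rng):
    """Estimate the integral of g(t) * f(t | TS) with W weighted proposal draws."""
    total = 0.0
    for _ in range(W):
        waits, log_q, remaining = [], 0.0, tau
        for j in range(m, n, -1):
            rate = j * (j - 1) / theta
            t = sample_trunc_exp(rate, remaining, rng)
            log_q += log_trunc_exp_pdf(t, rate, remaining)
            waits.append(t)
            # Guard against a zero-length remainder caused by rounding.
            remaining = max(remaining - t, 1e-300)
        weight = math.exp(log_target(waits, m, n, tau, theta) - log_q)
        total += g(waits) * weight
    return total / W

# With g = 1 the integral has a closed form for m = 3, n = 1, which gives a check.
rng = random.Random(1)
estimate = importance_estimate(lambda waits: 1.0, 3, 1, 0.01, 0.01, 20000, rng)
exact = (1.0 - 1.5 * math.exp(-2.0) + 0.5 * math.exp(-6.0)) / 3.0
print(estimate, exact)
```

Setting `g` to the constant 1 turns the estimate into the integral of the density itself, which can be computed by hand for small cases; in the actual method `g` would be the site pattern probability under the substitution model. Working with log densities before exponentiating the weight avoids underflow when many coalescent events fall in one interval.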

Thus far, we have computed the conditional likelihood of one OL given the likelihood of one AC, $\mathcal{L}(OL_{v,j} \mid D_l)$ in equation (14), in one general interval $v$. Then we can compute the likelihoods of ACs and OLs having certain states for all of the intervals, in a post-order traversal. Finally, in the root interval, summing over the likelihood of the root lineage having each of the four states, weighted by the probability of the root lineage being in each state ($\pi_x = 1/4$ under the JC69 model), we have

$$\mathcal{L}(G_k, S \mid D_l) = \sum_{x} \pi_x\, \mathcal{L}_{\text{root}}(x), \quad \text{where } x \in \{A, C, G, T\}.$$

I will now introduce an algorithm for computing $\mathcal{L}(G_k, S \mid D_l)$, given a particular gene tree topology $G_k$ and a species tree $S$ for one SNP site $l$.

1) Set the likelihood of ACs having certain states for all the tip intervals, as input data;

2) Always start from a pair of tip intervals. In a post-order traversal, visit one interval only if both of its children intervals have been visited. Stop after the visit to the root interval; for a general interval v:

2.1) Obtain the set of ACs, $\mathbf{AC}_v$, by taking $OL_{v_1}$ from the left child interval and $OL_{v_2}$ from the right child interval, where $OL_{v_1} \in \mathbf{OL}_{v_1}$ and $OL_{v_2} \in \mathbf{OL}_{v_2}$; obtain $\mathcal{L}(AC_{v,i} \mid D_l)$, as in equation (12);

2.2) From each $AC_{v,i}$, generate the possible OLs; for each generated $OL_{v,j}$, compute the transition probability $P(AC_{v,i} \to OL_{v,j})$ according to equation (21) and save $\mathcal{L}(OL_{v,j} \mid D_l)$ in equation (14) with $OL_{v,j}$;

3) In the root interval, obtain the set of ACs as in step (2.1). Note that in the root interval, each AC only generates one OL, which only has the root lineage going out. Compute the likelihood for the root lineage in each OL. The weighted sum of the likelihoods for root lineages will be $\mathcal{L}(G_k, S \mid D_l)$.

After computation of $\mathcal{L}(G_k, S \mid D_l)$, as in equation (11), we can sum over the K gene tree topologies and then multiply results over the L sites to get the likelihood of a specified species tree given SNP data, $\mathcal{L}(S \mid D)$.

2.4 SNP data processing and gene tree sampling to improve computational efficiency

After computing $\mathcal{L}(G_k, S \mid D_l)$ as just described, we need to repeat the computation for K gene tree topologies and L sites of SNP data, in total $K \times L$ times. For SNP data, L is usually large, and for a large $M$-tip gene tree, K is also large. If we can reduce the amount of computation in these two aspects, the computational efficiency can be greatly improved.

First, if we have L sites of SNP data, each site has its site pattern, and there are usually repeated site patterns in the dataset. For instance, given four individuals, if the SNP for one site is AAAT, the pattern is XXXY ($X, Y \in \{A, C, G, T\}$, $X \neq Y$), because the likelihood of the species tree given any of the possibilities in the pattern XXXY (AAAT, CCCG, etc.) is the same under the JC69 model. Under the JC69 model, for four tips, there are $4^4 = 256$ site patterns, but only 15 have unique probabilities, and for five tips, there are 24 site patterns having unique probabilities. So before likelihood computation, we will process the SNP data first and detect the distinct site patterns and their counts. If there are L sites in data $D$, after processing, we have $R$ site patterns having unique probabilities, $d_1, \ldots, d_R$, and site pattern counts $c_1, \ldots, c_R$, where $\sum_{r=1}^{R} c_r = L$. So equation (11) can be rewritten on the log scale as

$$\log \mathcal{L}(S \mid D) = \sum_{r=1}^{R} c_r \log \overbrace{\sum_{k=1}^{K} \underbrace{\int \left[P(d_r \mid G_k)\, f(G_k \mid S)\right] d\mathbf{t}}_{\mathcal{L}(G_k, S \mid d_r)}}^{\mathcal{L}(S \mid d_r)} \qquad (22)$$

Note that for models besides JC69, the number of site patterns having unique probabilities will increase.
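The pattern-collapsing step can be sketched as follows. The relabeling rule (states renamed in order of first appearance) and the function names are assumptions made for this illustration; the text only states that patterns such as AAAT and CCCG are treated as the same pattern XXXY under JC69.

```python
from collections import Counter
from itertools import product

def canonical_pattern(site):
    """Relabel states in order of first appearance, e.g. 'AAAT' -> 'XXXY'."""
    labels = "XYZW"
    mapping = {}
    out = []
    for state in site:
        if state not in mapping:
            mapping[state] = labels[len(mapping)]
        out.append(mapping[state])
    return "".join(out)

def pattern_counts(sites):
    """Count how many sites share each canonical pattern."""
    return Counter(canonical_pattern(site) for site in sites)

# Toy data for four individuals: five sites collapse to three patterns.
sites = ["AAAT", "CCCG", "AAAA", "GGGG", "ACAC"]
print(pattern_counts(sites))

# All 4^4 = 256 possible sites for four tips collapse to 15 patterns.
all_sites = ("".join(p) for p in product("ACGT", repeat=4))
print(len({canonical_pattern(s) for s in all_sites}))
```

With the counts in hand, the likelihood only needs to be evaluated once per distinct pattern and raised to the power of its count, which is what makes the preprocessing worthwhile when L is large.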

Second, we need to sum over the gene tree topologies to get the likelihood for the species tree given data. However, for large gene trees (more than 5 tips), there are too many topologies to sum over. For an $M$-tip rooted bifurcating tree ($M \ge 2$), there are $(2M-3)!! = 1 \times 3 \times \cdots \times (2M-3)$ possible topologies. So if $M = 10$, there are 34,459,425 topologies. Hence, for large trees, to estimate the inner sum in equation (22), we will sample gene tree topologies with replacement from the distribution of the gene tree topologies given a species tree under the coalescent model (Degnan and Salter, 2005). Note that given a species tree, the gene tree topologies have unequal probabilities. To estimate the population total by sampling from the categories with unequal probabilities, we use the unbiased Horvitz-Thompson estimator (HT estimator; Horvitz and Thompson, 1952).
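The growth in the number of rooted bifurcating topologies is easy to check numerically. The sketch below uses the standard double-factorial count for rooted binary trees with labelled tips; the function name is hypothetical.

```python
def rooted_topology_count(n_tips):
    """Number of rooted bifurcating topologies with n labelled tips: (2n - 3)!!."""
    if n_tips < 2:
        raise ValueError("need at least two tips")
    count = 1
    for odd in range(3, 2 * n_tips - 2, 2):
        count *= odd
    return count

for n_tips in (4, 5, 6, 10):
    print(n_tips, rooted_topology_count(n_tips))
```

Ten tips already give 34,459,425 topologies, the figure quoted above, which is why exhaustive summation over gene tree topologies is restricted to small trees.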


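We sample U (here we use U=100) gene tree topologies given a species tree, using the software COAL (Degnan and Salter, 2005), and also obtain the probabilities associated with the sampled gene tree topologies. Among these U gene tree topologies, we only pick out the Z distinct gene tree topologies, $G_1, \ldots, G_Z$, with their respective probabilities, $p_1, \ldots, p_Z$. We calculate the likelihood of each of the Z sampled distinct gene tree topologies and the species tree given data, $\mathcal{L}(G_z, S \mid d_r)$. The HT estimator for the population total in equation (22) is

$$\hat{\mathcal{L}}(S \mid d_r) = \sum_{z=1}^{Z} \frac{\mathcal{L}(G_z, S \mid d_r)}{\pi_z} \qquad (23)$$

where $\pi_z$ is the probability that topology $G_z$ is included in the sample. Thus after sampling gene tree topologies, the log likelihood of a species tree given data (equation (22)) is

$$\log \mathcal{L}(S \mid D) \approx \sum_{r=1}^{R} c_r \log\left\{\sum_{z=1}^{Z} \frac{\mathcal{L}(G_z, S \mid d_r)}{\pi_z}\right\} \qquad (24)$$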

Therefore, for small trees we will use equation (22) and for large trees we will use equation (24) to compute the likelihood of a specified species tree given data.
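A minimal sketch of the Horvitz-Thompson step is given below. It assumes that the inclusion probability of a topology with probability $p$ in a sample of U draws with replacement is $1 - (1 - p)^U$, which is the standard choice for this sampling design; whether this exact form is the one used in the dissertation is an assumption, and the function name and numbers are illustrative only.

```python
def horvitz_thompson_total(values, probs, n_draws):
    """HT estimate of a population total from the distinct sampled units.

    values: quantity of interest for each distinct sampled unit.
    probs: single-draw selection probability of each of those units.
    n_draws: number of draws made with replacement.
    """
    total = 0.0
    for y, p in zip(values, probs):
        inclusion = 1.0 - (1.0 - p) ** n_draws  # P(unit appears at least once)
        total += y / inclusion
    return total

# Two equally likely topologies, both almost surely seen in 100 draws:
# the estimate is essentially the plain sum of their likelihoods.
print(horvitz_thompson_total([1.0, 2.0], [0.5, 0.5], 100))
# A single draw of a topology with probability 0.25 is up-weighted by 4.
print(horvitz_thompson_total([2.0], [0.25], 1))
```

Dividing by the inclusion probability up-weights rarely sampled topologies, so that the estimator targets the sum over all topologies even though only the Z distinct sampled ones are evaluated.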

2.5 Accuracy of likelihood estimation


As described above, we use importance sampling to estimate the expectation in equation (15). The estimate in equation (21) will converge to the expectation as the sample size goes to infinity. Thus we expect that the larger the sample size, the more accurate the estimate will be. We must therefore choose a proper sample size to balance the accuracy of the estimate against the computation time. Let W be the number of repetitions in the importance sampling. We will study how W affects the accuracy of likelihood estimation.

First, we examine trees with fixed branch lengths: two 4-taxon trees, the symmetric tree ((A, B), (C, D)) and the asymmetric tree (((A, B), C), D), and one asymmetric 5-taxon tree (((A, B), C), (D, E)). The choices for W are 100, 250 and 500. For each case, we repeat the computation 100 times and present the results as boxplots (Figure 7). For the 4-taxon trees, we compare the estimated likelihood with the exact likelihood (Chifman and Kubatko, 2013); we do not have the exact likelihood formula for the 5-taxon tree, so we simply present the results without comparison (Figure 7).

Figure 7. Boxplots of the estimates of the negative log likelihood of the species tree for different numbers of samples (W) in importance sampling. The boxplot in (a) is for the 4-taxon symmetric species tree ((A, B), (C, D)), (b) is for the 4-taxon asymmetric species tree (((A, B), C), D) and (c) is for the asymmetric 5-taxon tree (((A, B), C), (D, E)). The dashed lines in (a) and (b) mark the true likelihoods for the 4-taxon species trees.

From Figure 7, we can see a similar pattern in all three boxplots: as W goes up, the negative log likelihood estimates show less variation, and the 4-taxon tree likelihood estimates get closer to the exact likelihood (Figure 7(a) and 7(b)). These results confirm our expectation for importance sampling. Also, even when W is 100, the spread for the 4-taxon symmetric tree is about 80, the spread for the 4-taxon asymmetric tree is about 60, and the spread for the 5-taxon tree is about 30, which are all very small compared to the scale of the true negative log likelihood.

Next, since the likelihood function of a species tree is an exponential function of the branch lengths, we expect the log likelihood function of the species tree to be a concave function of the branch lengths. We therefore examine the relationship between the log likelihood and the branch lengths for different sample sizes. For the same 4-taxon species trees, we focus on only one interval length on each tree. We divide the range between 0.001 and 0.05 into 100 subintervals, and we compute and plot the likelihood at each subinterval for different choices of W. We also compare the results with the exact likelihood (Chifman and Kubatko, 2013) for each tree, as shown in Figure 8.

Figure 8. Plots of the log likelihood estimates against one interval length. The three plots on the left are for the 4-taxon symmetric species tree ((A, B), (C, D)), where the red solid lines are the log likelihood estimates. The three plots on the right are for the 4-taxon asymmetric tree (((A, B), C), D), where the blue solid lines are the log likelihood estimates. For all six plots, the black dashed lines are the true log likelihoods for the respective species trees.


From Figure 8, for both trees and for all sample sizes, the plots are concave curves, and the estimated log likelihood curves show trends similar to the exact log likelihood curve. As W increases, the estimated likelihood curves become smoother and closer to the exact curve, which again fits our expectation and is consistent with the results shown in Figure 7. For the symmetric tree, the estimates show some difficulty at small branch lengths.

Therefore, our expectation is confirmed. In terms of the number of importance samples, we can see an improvement in the estimation accuracy as the sample size increases.

Considering the computation time, we can see that the results for 250 and 500 samples are quite similar, so we would recommend using $W = 250$. For large trees, we can choose to use $W = 100$ for the purpose of computational efficiency, without losing very much estimation accuracy.

2.6 Optimization of branch lengths on the species tree

The algorithm described above gives a method for computing the species tree likelihood given data, $L(S \mid D)$. The specified species tree $S$ includes the topology, which we denote $T$, and the branch lengths, a vector of speciation times, denoted $\boldsymbol{\tau}$. Given $T$, we would like to optimize the species tree branch lengths in order to maximize the species tree likelihood for the fixed topology given the data.

The vector $\boldsymbol{\tau}$ consists of the speciation times, which are all continuous positive random variables. Because there is no closed-form density function for this multidimensional random vector $\boldsymbol{\tau}$, differentiation cannot be used here. Instead, to optimize without using derivatives, Brent’s method (Brent, 1973) is used.

Brent’s method minimizes a convex function of a single variable over a given interval. From Figure 8, we note that the log likelihood, viewed as a function of a single branch length for a fixed tree topology with all other branch lengths fixed, is concave. Thus Brent’s method (Brent, 1973) can be used to minimize the negative log likelihood function (a convex function) of that branch length. Since Brent’s method can only be used for a single variable, we optimize one branch length at a time, holding all of the other branch lengths fixed, and cycle through all the branch lengths.

Notice that here we treat the branch lengths of the species tree as affecting the likelihood independently, which is an approximation made for tractability. To reduce the potential influence of dependence among the branch lengths while keeping the computation time reasonable, we cycle through all the branch lengths twice.
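The cycling scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this dissertation: a plain golden-section search stands in for Brent’s method (which combines golden-section steps with parabolic interpolation), and `toy_neg_log_lik` is a hypothetical convex function of two branch lengths that replaces the estimated negative log likelihood. The search interval [0.001, 0.05] and the two sweeps follow the text.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618


def golden_section_minimize(f, lower, upper, tol=1e-8):
    """Minimize a unimodal function f on [lower, upper] without derivatives."""
    a, b = lower, upper
    c = b - GOLDEN * (b - a)
    d = a + GOLDEN * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - GOLDEN * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + GOLDEN * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def optimize_branch_lengths(neg_log_lik, start, lower=0.001, upper=0.05, sweeps=2):
    """Optimize one branch length at a time, cycling through all of them."""
    lengths = list(start)
    for _ in range(sweeps):
        for i in range(len(lengths)):
            def one_dim(x, i=i):
                # Vary branch i only; all other branch lengths stay fixed.
                return neg_log_lik(lengths[:i] + [x] + lengths[i + 1:])
            lengths[i] = golden_section_minimize(one_dim, lower, upper)
    return lengths


def toy_neg_log_lik(t):
    """Hypothetical convex surface with mild dependence between two branches."""
    u, v = t[0] - 0.01, t[1] - 0.02
    return u * u + v * v + 0.5 * u * v


if __name__ == "__main__":
    print(optimize_branch_lengths(toy_neg_log_lik, [0.03, 0.03]))
```

On this toy surface the minimum is at (0.01, 0.02). Because of the cross term, a single sweep does not reach it exactly, and the second sweep moves the estimate closer, which mirrors the reason given above for optimizing all branch lengths twice.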

After optimizing the branch lengths of the species tree, we obtain the likelihood of the species tree with optimized branch lengths given the data, denoted by $L(T \mid D)$.

In this chapter, the method of computing the likelihood of a species tree given SNP data is described. The performance of this method is tested in different scenarios. The use of SNP data processing and gene tree sampling can greatly improve the computational efficiency. The species tree likelihood for the fixed species tree topology is optimized with respect to the branch lengths, which will be used in the inference and application later in this dissertation.


Chapter 3: Species Tree Inference

In Chapter 2, we proposed a method for computing the likelihood of a species tree with a specified topology and optimized branch lengths, given SNP data. Now the natural question is how to use this method to infer the maximum likelihood (ML) species tree given SNP data. In this chapter, we investigate the problem of searching for the ML species tree. We start with a review of tree searching algorithms in Section 3.1. In Section 3.2, we apply one of the tree-searching algorithms to our problem. We also run simulations to test the performance of the proposed method.

3.1 Tree searching strategies

Considering all of the possible trees with the same number of taxa, we want to find the tree with the highest likelihood, the ML tree. However, recall that there are many possible tree topologies. For a rooted bifurcating tree with $N$ tips ($N \geq 2$), there are

$$\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$$

possible topologies, and for each topology there are $(N-1)$ branch lengths along the tree. For example, a rooted 5-taxon tree has 105 possible topologies, while a rooted 10-taxon tree has 34,459,425 possible topologies. In addition, given a tree topology, we need to optimize the branch lengths of the tree, and larger trees have more branch lengths to optimize. Consequently, searching the entire space of trees for the particular tree or trees that maximize the likelihood is a very challenging problem. Many algorithms have been developed to address this problem. They can be divided into three categories: exact methods, heuristic methods, and stochastic methods.
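The topology counts quoted above can be checked with a few lines of code. The sketch below uses the standard product form of the count: each time the k-th taxon is added to a rooted bifurcating tree there are 2k - 3 branches (including the root branch) on which to attach it. The function name is an illustrative choice.

```python
def count_rooted_topologies(n_tips):
    """Number of rooted bifurcating tree topologies with n_tips labeled tips."""
    if n_tips < 2:
        raise ValueError("need at least two tips")
    count = 1
    for k in range(3, n_tips + 1):
        count *= 2 * k - 3  # attachment points available for the k-th taxon
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_rooted_topologies(n))
```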

3.1.1 Exact methods

Based on the computed likelihood for a specified species tree topology, we can simply evaluate the entire set of tree topologies, which is called an exhaustive search. An exhaustive search is guaranteed to find the ML tree, since it evaluates all possible trees. However, as described above, the number of topologies increases dramatically as the tree gets larger. Hence, an exhaustive search is only feasible for small trees.

To avoid enumerating all possible topologies, an alternative exact algorithm, Branch and Bound (BB), can be used to find the optimal tree (Hendy and Penny, 1981). BB starts with an initial tree (a 2-taxon tree) and a predetermined bound. At each step, a new branch (taxon) is added at each of the possible branching locations of the current tree. If the resulting tree is worse than the bound according to some criterion, this “bad” branching direction is terminated; if it is better than the bound, this “good” branching direction is kept for further branching. From the “good” trees, we repeat the branching and bounding process until we reach the desired number of taxa. The “best” tree in the final step is picked as the estimate of the optimal tree.

As an example, we demonstrate the process of BB for the 4-taxon rooted tree in Figure 9. We have a predetermined bound and randomly pick a pair of taxa for the initial 2-taxon tree. We choose taxa 1 and 2 to form the initial tree (1, 2). Let 5 be the MRCA of all 4 taxa, which is the root of the tree. There are three branching locations (labeled with circles) on this initial tree, and we add a new branch (taxon 3) at each of the branching locations, forming three distinct 3-taxon trees. We compare the three 3-taxon trees with the bound, and retain only the trees that are better than the bound according to some criterion. We may retain one, two or even all of the trees in this step. If all three trees are retained, BB becomes an exhaustive search. For the sake of this example, assume that only the tree marked in red (Fig. 9(a)) is kept. Then we add another branch (taxon 4) onto each of the five branches of the marked tree ((1, 3), 2), making five 4-taxon trees, listed in Fig. 9(b). Here we stop branching and obtain the optimal tree among the five 4-taxon trees according to the criterion. In this example, we are guaranteed to find the optimal tree after two levels of branching.
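The branching step in this example can be reproduced with a short recursive function. This is an illustrative sketch in which a rooted tree is represented as nested pairs (an assumption of this example, not a format used elsewhere in the dissertation): a new taxon can be attached above the root of the current tree or on any branch inside either subtree.

```python
def insert_taxon(tree, taxon):
    """Return every rooted tree obtained by attaching taxon to one branch of tree."""
    results = [(tree, taxon)]  # attach on the branch above this (sub)tree
    if isinstance(tree, tuple):
        left, right = tree
        results += [(new_left, right) for new_left in insert_taxon(left, taxon)]
        results += [(left, new_right) for new_right in insert_taxon(right, taxon)]
    return results


if __name__ == "__main__":
    three_taxon = insert_taxon((1, 2), 3)
    print(three_taxon)                       # three 3-taxon trees
    print(insert_taxon(((1, 3), 2), 4))      # five 4-taxon trees
```

Starting from (1, 2), the function returns three 3-taxon trees, one of which is ((1, 3), 2); adding taxon 4 to that tree returns five 4-taxon trees, matching the counts in the example.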

Figure 9. A Branch and Bound example for a 4-taxon species tree. In (a), by adding taxon 3, the initial tree is branched into three 3-taxon trees. In the top left tree in (a), marked with the red circle, taxon 3 has been inserted on the external branch leading to taxon 1. Adding taxon 4 to the 3-taxon tree marked with the red circle in (a) generates the five 4-taxon trees in (b).

BB is guaranteed to find the globally optimal tree. From the example, we can see that it is essential to pick an efficient bound so that we have the least amount of branching (only one 3-taxon tree is better than the bound in Figure 9). If the bound is less efficient, we have to evaluate more trees, and the worst case of BB is an exhaustive search. For an N-taxon tree problem, the bound is usually the value of the criterion for some N-taxon tree.

Furthermore, the criterion used for comparison needs to satisfy the requirement that the computed values of trees change monotonically as more taxa are added to the tree. Researchers have used parsimony scores of gene trees as the criterion in BB to search for the most parsimonious gene tree. For example, Felsenstein has implemented BB under the parsimony criterion in the program PENNY, included in the software package PHYLIP (version 3.5c, 1993).

3.1.2 Heuristic methods

Several types of heuristic methods have been developed to reduce the portion of the tree space that is searched. The obvious strength of the heuristic approaches is time efficiency, which comes at the cost of no guarantee of finding the globally optimal tree.

One type of heuristic method is the branch-swapping method. It starts with a randomly chosen initial tree and examines only the neighborhood of the initial tree, by rearranging the branches. In other words, branch-swapping methods search only some local tree space, and therefore find only a locally optimal tree. Some common types of branch-swapping methods include nearest neighbor interchange (NNI), subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR).

The second type of heuristic method is called divide-and-conquer. It divides the tree into several subtrees and, for each subtree, searches the subtree space to pick the locally optimal tree. The tree reassembled from the locally optimal subtrees is then taken as the inferred optimal tree. The drawback of this method is that the subjective decisions made in dividing the tree into subtrees may lead to inconsistent estimates of the optimal tree.

The third and most popular heuristic method is called stepwise addition, which is based on BB. It also starts with a 2-taxon initial tree. Each time we add one taxon, it compares all the trees that can be built from the current tree and retains only the best of these trees. This process is continued until the entire tree is built; the best tree after all taxa have been added is the estimate of the true tree. For example, in Figure 9(a), the red-circled tree is the best tree among the three 3-taxon trees considered, and is retained for branching into five 4-taxon trees. The best of these five trees is the estimated tree. Stepwise addition is computationally efficient if the number of taxa is not too large, but it is not guaranteed to find the globally optimal tree.


3.1.3 Stochastic methods

In addition to exact and heuristic methods, researchers have also employed probabilistic models in tree searching. Simulated annealing is one good representative of stochastic methods. Salter and Pearl (2001) apply the simulated annealing algorithm to infer the ML gene tree. The idea is that we move randomly, step by step, through the tree space. By applying a random branch swap to the current tree, we move to a new tree in its neighborhood. We compare the current tree and the new tree according to some criterion, which leads to two cases. If the new tree is better, we move to the new tree with certainty; if the new tree is worse, we still accept the new tree with some probability that depends on the likelihood difference between the two trees, which potentially increases our search area in the tree space. Simulated annealing is more likely to find the globally optimal tree than the heuristic methods, yet this is still not guaranteed.
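The acceptance step can be sketched as follows. The sketch assumes the standard Metropolis-style rule, in which a worse tree is accepted with probability exp(difference in log likelihood / temperature); this is a common choice for simulated annealing and is not necessarily the exact rule of Salter and Pearl (2001). The function and parameter names are hypothetical.

```python
import math
import random


def accept_move(current_loglik, proposed_loglik, temperature, rng=random):
    """Decide whether to move from the current tree to the proposed tree."""
    if proposed_loglik >= current_loglik:
        return True  # a better (or equal) tree is always accepted
    # A worse tree is accepted with a probability that shrinks as the
    # log likelihood gap grows or as the temperature is lowered.
    probability = math.exp((proposed_loglik - current_loglik) / temperature)
    return rng.random() < probability


if __name__ == "__main__":
    rng = random.Random(0)
    accepted = sum(accept_move(-100.0, -101.0, 1.0, rng) for _ in range(10000))
    print(accepted / 10000)  # close to exp(-1)
```

Lowering the temperature over the course of the search makes worse moves increasingly unlikely, so the search becomes more local over time.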

One of the most popular stochastic search methods is the genetic algorithm used in the software GARLI (Zwickl, 2006). Tree searching in the genetic algorithm mimics the evolution of a population under natural selection, where the individuals in the population are trees composed of topologies, branch lengths and a set of model parameters. Starting with a random initial population, in every generation random mutations are applied to individuals: a mutation on a tree topology can be an NNI or SPR rearrangement, and a mutation on branch lengths or model parameters can be multiplication by a Gamma-distributed factor. After mutation, the fitness of each individual is evaluated by some criterion acting as natural selection (for example, the likelihood), and each individual is randomly chosen to be a parent for the next generation with a chance proportional to its fitness. The individual with the highest fitness is always kept in the population to avoid genetic drift. As a result, after many generations, the population of trees evolves towards trees with higher fitness (e.g., higher likelihood), where hopefully we can find the optimal tree. Hence the genetic algorithm is also heuristic, with no assurance of finding the globally optimal tree.

3.2 A likelihood-based tree searching approach using stepwise addition

3.2.1 Algorithm

To infer the ML species tree given SNP data, $D$, I will use the method of stepwise addition. For an N-taxon tree problem, let $k$ be the current level (number of taxa) of the species tree, where $2 \leq k \leq N$. We denote a species tree of level $k$ as $S_k$, and the likelihood of $S_k$ is denoted by $L(S_k \mid D)$, with the branch lengths optimized. At level $k$, there are $(2k-3)$ trees that can be formed from the selected tree of level $k-1$. Denote the tree with the highest likelihood at level $k$ by $S_k^{*}$; the tree with the highest likelihood at level $N$ is the inferred maximum likelihood tree, $S_{ML}$.

To infer $S_{ML}$, we do the following:

(1) Start from level 2, and add one more taxon to all of the possible branches of the initial tree $S_2$. Set $k = 3$.

(2) At level $k$, compute the likelihoods of the $(2k-3)$ trees, $L(S_k^{(1)} \mid D), \ldots, L(S_k^{(2k-3)} \mid D)$; retain the tree with the highest likelihood, recorded as $S_k^{*}$. Add one more taxon to all of the possible branches of $S_k^{*}$ to move to the next level, $k+1$.

(3) Set $k = k+1$. Repeat step (2) until $k = N$. Compute the likelihoods of the $(2N-3)$ trees. The tree with the highest likelihood will be our inferred ML tree.
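The steps above can be sketched in code. This is a minimal illustration, not the implementation used for the simulations: trees are nested pairs, and the scoring function is a hypothetical stand-in for the optimized species tree likelihood. Here the toy score counts how many clades of a candidate tree agree with a hidden target tree, so that the sketch can run without SNP data.

```python
def insert_taxon(tree, taxon):
    """All rooted trees obtained by attaching taxon to one branch of tree."""
    results = [(tree, taxon)]
    if isinstance(tree, tuple):
        left, right = tree
        results += [(t, right) for t in insert_taxon(left, taxon)]
        results += [(left, t) for t in insert_taxon(right, taxon)]
    return results


def leaves(tree):
    """Set of tip labels below a node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return frozenset([tree])


def clades(tree):
    """Tip sets of all internal nodes, including the root."""
    if not isinstance(tree, tuple):
        return set()
    return {leaves(tree)} | clades(tree[0]) | clades(tree[1])


def stepwise_addition(taxa, score):
    """Greedy search: keep only the best-scoring tree at each level."""
    current = (taxa[0], taxa[1])                 # level 2: initial tree
    for taxon in taxa[2:]:                       # levels 3, ..., N
        candidates = insert_taxon(current, taxon)    # 2k - 3 trees at level k
        current = max(candidates, key=score)         # retain the best tree
    return current


def make_toy_score(target):
    """Hypothetical score: number of clades shared with the target tree,
    after restricting the target clades to the taxa present so far."""
    target_clades = clades(target)

    def score(tree):
        present = leaves(tree)
        restricted = {c & present for c in target_clades}
        return len(clades(tree) & restricted)

    return score


if __name__ == "__main__":
    target = ((("A", "B"), "C"), ("D", "E"))
    print(stepwise_addition(["A", "B", "C", "D", "E"], make_toy_score(target)))
```

In an actual analysis, the score of each candidate would be its species tree likelihood with optimized branch lengths, which is far more expensive to evaluate than the toy score and, unlike the toy score, does not guarantee that the greedy path reaches the best tree.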

3.2.2 Simulation

We test the performance of this algorithm by inferring a 5-taxon ML species tree given simulated data. We have two tree models for 5-taxon trees: an asymmetric tree, ((((A, B), C), D), E), and a symmetric tree (not perfectly symmetric), (((A, B), C), (D, E)), shown in Figure 10. Both species trees have four branch lengths, $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$, labeled on the respective topologies in Figure 10. For both topologies, we consider two sets of branch lengths, long and short, listed in Table 2, so we have four scenarios in this simulation. When the branch lengths of the species tree are short, deep coalescence is more likely to take place, leading to discordance between the gene tree and the species tree, and finding the ML tree given SNP data is more challenging.

Table 2. Branch length settings for the simulations in stepwise addition.

Branch length    Asymmetric Tree        Symmetric Tree
                 Long      Short        Long      Short
$\tau_1$         0.01      0.005        0.01      0.005
$\tau_2$         0.02      0.005        0.01      0.005
$\tau_3$         0.03      0.005        0.015     0.005
$\tau_4$         0.01      0.005        0.025     0.005

Figure 10. Two 5-taxon species tree models, asymmetric and symmetric, used in the stepwise addition simulations. (a) is the asymmetric tree model and (b) is the symmetric tree model. The branch lengths are labeled on the species trees.

Under each of the four scenarios, we have 100 replicates, one for each of 100 simulated SNP datasets. For one SNP dataset, we simulate 5,000 SNPs for five individuals, one from each taxon. We first generate 5,000 gene trees with 5 tips from the given 5-taxon species tree using the program ms (Hudson, 2002), with the following commands (“Normal” denotes the long branch length setting in Table 2):

Asymmetric Normal: ./ms 5 5000 -T -I 5 1 1 1 1 1 -ej 0.5 1 2 -ej 1.5 2 3 -ej 3.0 3 4 -ej 3.5 4 5

Asymmetric Short: ./ms 5 5000 -T -I 5 1 1 1 1 1 -ej 0.25 1 2 -ej 0.5 2 3 -ej 0.75 3 4 -ej 1.0 4 5

Symmetric Normal: ./ms 5 5000 -T -I 5 1 1 1 1 1 -ej 0.5 1 2 -ej 1.25 2 3 -ej 2.5 3 5 -ej 0.5 4 5 -ej 2.5 5 3

Symmetric Short: ./ms 5 5000 -T -I 5 1 1 1 1 1 -ej 0.25 1 2 -ej 0.5 2 3 -ej 0.75 3 5 -ej 0.25 4 5 -ej 0.75 5 3

For each gene tree, we generate a SNP in R using the function simSeq (package “phangorn”, version 1.7-4).

For each SNP dataset, we apply the stepwise addition algorithm to estimate the ML species tree. Next, in each scenario, we compare the 100 ML species tree estimates with the true species tree from which the SNP datasets were simulated.

Given two bifurcating trees with the same set of taxa, we can use the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) to quantify the dissimilarity between the two tree topologies, denoted by $d_{RF}(T_1, T_2)$, where $T_1$ and $T_2$ represent the two bifurcating trees. If the two bifurcating trees are both rooted, we list the clades on each tree, where a clade is defined as a group including an ancestor and all its descendants. Then the RF distance for two rooted trees is the number of clades that appear in only one of the trees. As an example, for the two 5-taxon species trees in Figure 10, the clades for the asymmetric tree ((((A, B), C), D), E) in (a) are (A, B), (A, B, C) and (A, B, C, D), and in (b), the symmetric tree (((A, B), C), (D, E)) has clades (A, B), (A, B, C) and (D, E). The clades that exist in only one of the two trees are (A, B, C, D) from the asymmetric tree and (D, E) from the symmetric tree. Hence the RF distance between these two trees is 2. Note that the RF distance between two rooted bifurcating trees that share the same set of taxa is always an even number. The smallest RF distance between two different trees is 2; the RF distance between a tree and itself is 0.
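The clade-counting definition of the rooted RF distance translates directly into code. In this sketch, trees are written as nested pairs (an assumption of the example), clades are collected as sets of tip labels, and the distance is the size of the symmetric difference of the two clade sets. The clade containing all taxa is shared by both trees, so excluding it, as in the text, does not change the result.

```python
def leaves(tree):
    """Set of tip labels below a node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return frozenset([tree])


def proper_clades(tree):
    """Clades of a rooted tree, excluding single tips and the full taxon set."""
    def collect(node):
        if not isinstance(node, tuple):
            return set()
        return {leaves(node)} | collect(node[0]) | collect(node[1])
    return collect(tree) - {leaves(tree)}


def rf_distance(tree1, tree2):
    """Number of clades that appear in exactly one of the two rooted trees."""
    if leaves(tree1) != leaves(tree2):
        raise ValueError("trees must share the same set of taxa")
    return len(proper_clades(tree1) ^ proper_clades(tree2))


if __name__ == "__main__":
    asymmetric = (((("A", "B"), "C"), "D"), "E")
    symmetric = ((("A", "B"), "C"), ("D", "E"))
    print(rf_distance(asymmetric, symmetric))   # 2
    print(rf_distance(asymmetric, asymmetric))  # 0
```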

Table 3. Frequencies with which the true species tree is estimated and average RF distances for the four simulation settings in stepwise addition.

                     Asymmetric   Asymmetric   Symmetric   Symmetric
                     Long         Short        Long        Short
True tree %          85%          65%          100%        35%
Mean RF distance     0.30         0.82         0.00        1.40

Under each of the four scenarios, we compute the frequency with which the true species tree is estimated among the 100 estimated ML species trees (True tree % in Table 3). In both the asymmetric and symmetric tree models, when the branch lengths of the species tree are shorter, the number of times the true species tree is found decreases. When the trees have “normal” branch lengths (the long setting in Table 2), most of the ML tree estimates correctly find the true tree, with 100% accuracy in the symmetric tree model. It seems more difficult to estimate the true symmetric tree when branch lengths are short (35%), compared with the asymmetric short case (65%), although in the symmetric short case the true tree is still the most frequently occurring tree estimate.


Figure 11. Distributions of the RF distance between the ML species tree estimates and the true species tree under the four scenarios in stepwise addition simulation.

We also compute the RF distances between the estimated ML species trees and the true tree for all 100 replicates. An RF distance of 0 means that the true tree has been found, and the smaller the RF distance, the more similar the estimated tree is to the true tree.

The mean of the RF distances ($\overline{d_{RF}}$) can be found in Table 3, and the distributions of the RF distances are shown in Figure 11. The results for the RF distances are consistent with the frequencies of the correctly estimated species trees. When the species trees have “normal” branch lengths, for the asymmetric tree model we have a small average RF distance ($\overline{d_{RF}} = 0.30$; Table 3), with 85% of the RF distances being 0; for the symmetric model, all 100 RF distances are 0. When branch lengths are shorter, the estimated ML species trees have a higher average RF distance from the true species tree, which indicates that the estimated trees tend to differ more from the true tree. Symmetric short is still the most difficult case, with $\overline{d_{RF}} = 1.40$. From the two plots on the right in Figure 11, even for short branch lengths in both models, only a few tree estimates have an RF distance of 4 (6 out of 100 for the asymmetric model, and 5 out of 100 for the symmetric model), and the rest of the tree estimates have RF distances of 0 or 2.

The above set of simulations all start with the same 2-taxon initial tree, formed by taxa 1 and 2. To investigate the influence of the initial tree, three additional sets of simulations are conducted with different initial trees. The three additional sets of simulations use different starting pairs of taxa in the initial trees and use 100 samples in the importance sampling. The remaining settings are the same as in the previous set of simulations.


Table 4. Percentages of the true species trees found in three additional sets of simulations with different initial trees.

Initial Tree    Asymmetric   Asymmetric   Symmetric   Symmetric
                Long         Short        Long        Short
(1,3)           80%          56%          97%         80%
(2,4)           82%          54%          98%         76%
(2,5)           100%         64%          95%         61%

Table 4 shows the results for the three sets of simulations with different initial trees. The patterns of the percentages of true trees found are similar across the three additional sets of simulations. The symmetric tree model tends to do better than the asymmetric tree model. Longer branch lengths give higher accuracy in finding the true species tree. Even for the asymmetric short case, the percentages of finding the true tree are above 50%. In the symmetric short case, the results in these three sets of simulations (61% to 80%) are better than the 35% in the previous simulation (Table 3), which might be due to random errors brought by short branch lengths. In addition, although the sample size in importance sampling used here is 100, the results look consistent with the previous set of simulations, which indicates that a sample size of 100 is sufficient with regard to accuracy.

To infer a 5-taxon ML species tree (one replicate in our simulation) with an importance sampling size of 250, our method takes about 6 hours for the symmetric species tree and about 8 hours for the asymmetric species tree; with a sample size of 100, the symmetric tree model takes about 1 hour and the asymmetric tree model takes about 1.5 hours. Inferring a 4-taxon ML species tree with sample size 250 takes about 50 minutes.

In summary, when the branch lengths are not too short, our method using stepwise addition performs fairly well in inferring the ML species tree given data, in both accuracy and time efficiency. If the inferred ML species tree has a symmetric topology and short optimized branch lengths, the result should be treated with more caution, and repeating the search with different initial trees is recommended.

In addition to stepwise addition, I also attempted to apply BB to infer the ML tree using my likelihood computation method. For an N-taxon tree, the likelihood of a random N-taxon tree was used as the initial bound. However, I was not able to find a bound that would allow me to eliminate much of the tree space, and thus BB effectively became an exhaustive search.

Chapter 3 addresses the problem of estimating the ML species tree in the tree space using my likelihood computation method given SNP data. The current tree search strategies are surveyed. A likelihood-based tree searching method using the stepwise addition algorithm is proposed and tested by simulations. The simulations show satisfactory results when the branch lengths are not too short.


Chapter 4: Species Delimitation

In this chapter, the method of likelihood computation of a species tree will be applied to species delimitation under the coalescent model. Section 4.1 will be an introduction to the problem of species delimitation and will also describe the current approaches to address the problem. Thereupon, I will apply my method to the species delimitation problem and test the algorithm with simulations in Section 4.2. Finally, the results from the simulations will be discussed in Section 4.3.

4.1 Current validation methods in species delimitation

Species delimitation refers to the determination of the species boundaries between populations of organisms, which can then be used to classify individuals into species. Proper assignment of individuals to species is important, because it is a necessary first step in estimating a species tree. From a biological perspective, proper species definitions are also important, for example, in conservation.

Genetic data are commonly used to delimit species. Given a sample of genetic data collected from different populations, a species delimitation can be viewed as an assignment of the individuals in the sample to species. According to the review by Carstens et al. (2013), based on whether there is prior information about the sample assignment, there are two classes of methods for the species delimitation problem: validation and discovery. Validation methods require sample assignments beforehand and seek to confirm them under some criterion. Discovery methods explore the possible sample assignments using some criterion with the goal of defining species boundaries.

With an a priori sample assignment, validation methods effectively serve as hypothesis tests. Only validation methods are discussed here. In addition, we consider only the coalescent model as the platform to develop a probabilistic model for species delimitation (Knowles and Carstens, 2007). All of the existing validation methods are also coalescent-based (Carstens et al., 2013). There are different categories of validation methods based on different criteria and different types of genetic data, and we give a brief review of these methods here.

Most methods for species delimitation that use gene trees as input data employ the likelihood criterion. Knowles and Carstens (2007) compute the likelihood of a hypothesized species delimitation, which assigns gene lineages to different species, given a sample of gene tree topologies, using the software COAL (Degnan and Salter, 2005), and then use the likelihood-ratio test to test the null hypothesis that all the gene lineages belong to one species. Carstens and Dewey (2010) use gene trees with branch lengths as the input data, obtain a species tree estimate using STEM (Kubatko et al., 2009) and BEST (Liu and Pearl, 2007; Edwards et al., 2007), and compare different species delimitation models according to a likelihood-based information criterion, such as the Akaike information criterion (AIC; Akaike, 1973). Ence and Carstens (2011) have also developed software called SpedeSTEM that uses gene trees with branch lengths as input data for species delimitation analysis. Given a specified species tree, they generate models with all the possible lineage assignments by hierarchical (within the species) permutations of the lineages, leading to different subspecies tree models to validate. STEM (Kubatko et al., 2009) is used to compute the likelihood of the species trees for all the models, which are compared using information-theoretic statistics, such as AIC, AIC differences ($\Delta_i$), and Akaike weights ($w_i$) (Anderson, 2008).

Yang and Rannala (2010) develop a method for species delimitation using a reversible-jump Bayesian MCMC (rjMCMC) approach given multi-locus DNA sequence data, implemented in the C program bpp. Given a user-specified species tree (guide tree), rjMCMC randomly splits and merges the internal nodes on the guide tree, resulting in different collapsed subtrees that correspond to different species delimitations. The posterior probabilities of the species delimitations (subtrees) are then calculated given the species tree (guide tree). This method is affected by the choice of the guide tree.

The likelihood approaches using gene trees as input data inevitably ignore the variability brought by gene tree estimation. The program bpp, which deploys rjMCMC, can have trouble with convergence, as is common for Bayesian approaches. I will now apply my method to species delimitation using SNP data, which incorporates all the variability of the data. At the same time, my method provides a good approximation to the likelihood of the species tree, without suffering from the convergence issues of Bayesian approaches.

4.2 A likelihood method for species delimitation using SNP data

4.2.1 Algorithm

Notice that there is no direct method for computing the likelihood of a species delimitation given SNP data. I will apply my method for species tree likelihood computation to develop a novel likelihood-based approach for species delimitation using SNP data.

This method is a validation method, which requires a species delimitation beforehand. Suppose we collect a sample of SNPs from $n$ individuals from $N$ taxa, $D = (d_1, \ldots, d_n)$, where $d_i$ is the row of SNPs collected for the $i$th individual ($i = 1, \ldots, n$). We define the assignment of the $n$ individuals into $N$ species as a species delimitation, denoted by $\delta$, where $\sum_{j=1}^{N} n_j = n$ and $n_j$ is the number of individuals in the $j$th species. Here we always have a species tree for a species delimitation. Given the species tree, a species delimitation is the assignment of the individuals to the tip intervals of the species tree. Therefore, delimiting the species is an extension of inferring the species tree. The likelihood of a species delimitation ($\delta$) given SNP data ($D$) is just the likelihood of the associated species tree ($S_\delta$) with optimized branch lengths given the data, that is,

$$L(\delta \mid D) = L(S_\delta \mid D). \qquad (25)$$

Equation (25) can be computed using the method proposed in Chapter 2.

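Given SNP data $D$ and $M$ candidate species delimitation models with their associated species trees, we compute the likelihood $L(\delta_i \mid D)$ for the $i$th species delimitation model given the data, where $i = 1, \ldots, M$. Next we select the best model for the data from the $M$ species delimitation models by using the following likelihood-based information-theoretic selection statistics (Anderson, 2008): AIC ($\mathrm{AIC}_i$), AIC difference ($\Delta_i$) and Akaike weight ($w_i$), where $i = 1, \ldots, M$:

$$\mathrm{AIC}_i = -2 \ln L(\delta_i \mid D) + 2Z, \qquad (26)$$

$$\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}, \qquad (27)$$

$$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{M} \exp(-\Delta_r / 2)}. \qquad (28)$$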

In equation (26), $Z$ is the total number of parameters in the model, which is the sum of the number of branch lengths and the number of species, and $\mathrm{AIC}_{\min}$ in equation (27) is the minimum AIC value among the $M$ models. AIC estimates the relative expected Kullback-Leibler information (Kullback and Leibler, 1951), which measures the information lost when the current model is used to approximate the truth given the data. Hence we prefer the model with the smaller AIC, which indicates less information loss than the other models. Since AIC values can vary widely among the models, the AIC difference can be viewed as a rescaling of the AIC values for all the candidate models; the best model has an AIC difference of 0. The Akaike weight in equation (28) is a normalized likelihood of the model given the data and can be interpreted as the model probability.
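The three selection statistics are simple to compute once the log likelihoods are available. The sketch below assumes the standard definitions from Anderson (2008): AIC is minus twice the log likelihood plus twice the number of parameters, the AIC difference subtracts the smallest AIC, and the Akaike weight normalizes exp(-difference / 2) over the candidate models. The log likelihood values in the usage example are hypothetical and serve only to show the arithmetic; the parameter counts 5 and 7 are those of a 3-taxon and a 4-taxon species tree under the counting rule above.

```python
import math


def akaike_table(log_likelihoods, n_params):
    """Return (AIC values, AIC differences, Akaike weights) for candidate models."""
    aic = [-2.0 * ll + 2.0 * z for ll, z in zip(log_likelihoods, n_params)]
    best = min(aic)
    delta = [a - best for a in aic]                  # the best model has delta 0
    relative = [math.exp(-0.5 * d) for d in delta]   # relative model likelihoods
    total = sum(relative)
    weights = [r / total for r in relative]          # weights sum to 1
    return aic, delta, weights


if __name__ == "__main__":
    # Hypothetical log likelihoods for a 3-taxon and a 4-taxon delimitation.
    aic, delta, weights = akaike_table([-10000.0, -9990.0], [5, 7])
    print(aic)      # [20010.0, 19994.0]
    print(delta)    # [16.0, 0.0]
    print(weights)
```

In this made-up case the second model has the smaller AIC despite its two extra parameters, so it receives nearly all of the Akaike weight.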

Figure 12. Two species delimitations and their associated species trees used in species delimitation simulations I, II, and III. (a) is the species delimitation associated with the 3-taxon species tree ((AB, C), D). (b) is the species delimitation associated with the 4-taxon species tree (((A, B), C), D). The species delimitations in (a) and (b) share the same SNP dataset, $D = (d_1, \ldots, d_7)$.

4.2.2 Simulation

4.2.2.1 Simulation design

Simulations are performed to compare two species delimitations using this method. In Figure 12, we have two species tree models: a 3-taxon species tree model with two branch lengths and five parameters (Z = 5), and a 4-taxon species tree model with three branch lengths and seven parameters (Z = 7). Under each tree model, we have 5000 SNPs from seven individuals, so one SNP dataset consists of 5000 aligned SNPs for seven individuals. For the 3-taxon species tree, the species delimitation assigns the seven individuals to three taxa (three individuals in AB, two in C, and two in D). Similarly, for the 4-taxon species tree, the species delimitation is the assignment of the seven individuals to four taxa (one individual in A and two each in B, C, and D).

In the species delimitation of Figure 12(a), species AB has three individuals. In the species delimitation of Figure 12(b), species AB of Figure 12(a) is split into two separate species: species A with one individual and species B with two individuals. The rest of the species assignments are the same for both models: species C and D have the same species delimitations and relationships to the rest of the tree. The comparison between the species delimitation of the 3-taxon species tree and that of the 4-taxon species tree tests whether species A and B should be distinguished within their joint population given the collected data.

There are three simulation settings (I, II and III) with similar structures. For each simulation setting, there are two comparison models: the 3-taxon tree model and the 4-taxon tree model. Under each tree model, 100 SNP datasets are simulated, one for each of 100 replicates. In each replicate, we compute the likelihoods of the two species delimitations, that of the 3-taxon species tree and that of the 4-taxon species tree, given the same SNP dataset by equation (25), and we compare them according to the model selection statistics in equations (26)-(28). Since there are 7 individuals in one species delimitation, which implies that the gene tree has 7 tips, we use both importance sampling and gene tree sampling in the species tree likelihood computation (see Sections 2.4 and 2.5).

Simulations I, II and III differ only in the method by which the data are generated. Simulations I and II share the same branch length settings for the species trees, while Simulation III uses longer branch lengths. In contrast to Simulation I, Simulations II and III take subsamples from the data. In the following, I will describe the generation of data for each of the simulations.

In Simulation I, first under the 3-taxon species tree model in Figure 12(a), for each of the 100 replicates, based on the species delimitation and the branch lengths of the species tree (in units of expected number of mutations per site), we generate 5000 gene trees with 7 tips using the program ms (./ms 7 5000 -T -I 3 3 2 2 -ej 1.0 1 2 -ej 2.0 2 3; Hudson, 2002). From each gene tree, we simulate one SNP using the R function simSeq (package “phangorn”, 1.7-4). From the 5000 gene trees, we obtain a SNP dataset composed of 5000 SNPs for seven individuals. We generate 100 SNP datasets for the 100 replicates under the 3-taxon species tree model.

Similarly, under the 4-taxon species tree model and its species delimitation (Figure 12(b)), we set the three branch lengths of the species tree and use the program ms to simulate 5000 gene trees with 7 tips (./ms 7 5000 -T -I 4 1 2 2 2 -ej 1.0 1 2 -ej 2.0 2 3 -ej 3.0 3 4; Hudson, 2002). Using the R function simSeq (package “phangorn”, 1.7-4), we simulate one SNP site for 7 individuals given one gene tree, and obtain a SNP dataset with 5000 SNPs. With 100 replicates, we have 100 SNP datasets with 5000 SNPs each under the 4-taxon species tree model.
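The ms command lines quoted above all follow one pattern: the total number of tips, the number of gene trees, -T to output trees, -I with the number of populations and their sample sizes, and one -ej event per species tree node that merges population k into population k + 1. The short Python helper below only assembles such a command string; the helper name and its arguments are hypothetical, and it uses no ms options beyond those that appear in the text.

```python
def ms_command(sample_sizes, join_times, n_trees=5000, ms_path="./ms"):
    """Build an ms command string for a caterpillar species tree.

    sample_sizes: number of sampled individuals per population, in ms order.
    join_times:   time at which population k is merged into population k + 1
                  (k = 1, 2, ...), one entry per internal node.
    """
    n_tips = sum(sample_sizes)
    args = [ms_path, str(n_tips), str(n_trees), "-T",
            "-I", str(len(sample_sizes))]
    args += [str(n) for n in sample_sizes]
    for k, t in enumerate(join_times, start=1):
        # -ej t k k+1 : at time t, move all lineages of population k into k+1
        args += ["-ej", str(t), str(k), str(k + 1)]
    return " ".join(args)

# Reproduces the two Simulation I command lines quoted in the text.
cmd_3taxon = ms_command([3, 2, 2], [1.0, 2.0])
cmd_4taxon = ms_command([1, 2, 2, 2], [1.0, 2.0, 3.0])
```

Generating the command from the delimitation in this way keeps the sample sizes passed to ms consistent with the number of individuals assigned to each taxon.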

In reality, researchers often collect SNPs from a large sample of individuals from the species of interest. Recall that the number of individuals is the number of tips on a gene tree, so more individuals imply a larger gene tree. To avoid the computational inefficiency brought by a large sample size, we can subsample the individuals in the sample collected from the species (Ence and Carstens, 2011). To investigate the influence of subsampling on my method, we subsample the individuals in each taxon of the species tree in Simulations II and III.

To generate data, Simulation II uses the same branch length settings as Simulation I, except that each sample now has 250 individuals (rather than 7 individuals). Under the 3-taxon species tree model, the species delimitation with all the individuals in the sample assigns 100 individuals to species AB, 100 to species C, and 50 to species D. 5000 gene trees with 250 tips are simulated with ms (./ms 250 5000 -T -I 3 100 100 50 -ej 1.0 1 2 -ej 2.0 2 3; Hudson, 2002), and from these gene trees we simulate 5000 SNPs for 250 individuals (one SNP per gene tree), using simSeq in R (package “phangorn”, 1.7-4). For each dataset, after subsampling 7 of the 250 individuals, we obtain a species delimitation with the same structure as in Simulation I.

Similarly, under the 4-taxon species tree model, the species delimitation for the entire sample assigns 50 individuals to species A, 50 to species B, 100 to species C, and 50 to species D. By using ms (./ms 250 5000 -T -I 4 50 50 100 50 -ej 1.0 1 2 -ej 2.0 2 3 -ej 3.0 3 4; Hudson, 2002), we generate 5000 gene trees with 250 tips each. In R, we use simSeq (package “phangorn”, 1.7-4) to generate 5000 SNPs for 250 individuals. We again subsample 7 individuals, so that the resulting species delimitation is the same as in Simulation I. Under both models, we compare the log likelihoods and model selection statistics between the two species delimitations as in Simulation I.
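The subsampling step can be sketched as follows. This is only an illustration under the assumption that individuals are drawn without replacement within each taxon so that the retained sample has the same per-taxon counts as in Simulation I; the function name, the taxon labels used as dictionary keys, and the random seed are hypothetical.

```python
import random

def subsample_individuals(taxon_of, keep_per_taxon, rng):
    """Pick individuals without replacement within each taxon.

    taxon_of:       list with the taxon label of each individual (row index).
    keep_per_taxon: dict mapping taxon label to the number of individuals kept.
    Returns the sorted row indices of the retained individuals.
    """
    kept = []
    for taxon, k in keep_per_taxon.items():
        members = [i for i, t in enumerate(taxon_of) if t == taxon]
        kept.extend(rng.sample(members, k))
    return sorted(kept)

# Full 3-taxon sample of Simulation II: 100 + 100 + 50 = 250 individuals.
taxon_of = ["AB"] * 100 + ["C"] * 100 + ["D"] * 50
rng = random.Random(2013)  # fixed seed only to make the sketch repeatable
kept = subsample_individuals(taxon_of, {"AB": 3, "C": 2, "D": 2}, rng)
```

The returned indices would then be used to select the corresponding 7 rows of the SNP alignment before the likelihood computation.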

The data generation used in Simulation III differs from that in Simulation II only by using longer branch lengths for both species trees. With the longer branch lengths for the 3-taxon species tree model, we generate 5000 gene trees using ms (./ms 250 5000 -T -I 3 100 100 50 -ej 2.0 1 2 -ej 4.0 2 3; Hudson, 2002). For the 4-taxon species tree, we likewise use longer branch lengths and generate 5000 gene trees with ms (./ms 250 5000 -T -I 4 50 50 100 50 -ej 2.0 1 2 -ej 4.0 2 3 -ej 6.0 3 4; Hudson, 2002).

4.2.2.2 Simulation results

For each of the two tree models in each simulation, we compute the log likelihoods and the model selection statistics for the two species delimitations in each of the 100 replicates. Note that in the comparison between the two delimitations, the species delimitation from which the data are generated is the true species delimitation. In each comparison, the species delimitation with the higher log likelihood is the estimated species delimitation given data. If the estimate is the true species delimitation, we call it a “correct” estimate, and we count the number of correct estimates out of the 100 replicates. We present the frequencies of the correct estimates, the averaged log likelihoods and the averaged model selection statistics over the 100 replicates under each simulation setting in Tables 5, 6 and 7.
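The counting rule described above amounts to the following small Python sketch. The function name is hypothetical and the log likelihood values in the example are made up purely to show the bookkeeping; they are not simulation results.

```python
def correct_frequency(loglik_true, loglik_other):
    """Fraction of replicates in which the true delimitation scores higher.

    loglik_true[r] and loglik_other[r] are the log likelihoods of the true
    and the competing species delimitation in replicate r.
    """
    correct = sum(1 for t, o in zip(loglik_true, loglik_other) if t > o)
    return correct / len(loglik_true)

# Made-up values for four replicates (illustration only).
loglik_true = [-10.0, -12.0, -9.0, -11.0]
loglik_other = [-11.0, -11.5, -9.5, -12.0]
freq = correct_frequency(loglik_true, loglik_other)
```

In this toy input the true delimitation has the higher log likelihood in three of the four replicates.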

Table 5. Results for Simulation I.

Tree Model  Species Delimitation  Log-Likelihood  Correct%  AIC       AIC difference  Akaike weight
3-taxon     (true)                -9769.89        97%       19549.78  1.07            0.970
3-taxon     (proposed)            -9879.96                  19773.92  225.21          0
4-taxon     (true)                -10799.18       100%      21608.35  0               0.999
4-taxon     (proposed)            -10915.22                 21844.45  236.10          0.001

In Simulation I (Table 5), both models are very accurate in estimating the true species delimitation given data. The 4-taxon tree model has 100% correct estimates, which is slightly higher than the 97% of the 3-taxon tree model. In the comparison under each model, the true species delimitation has the higher averaged log likelihood and the lower averaged AIC; its averaged AIC difference is close to 0 and its model probability is almost 1. The model selection statistics indicate the good performance of my method in estimating the true species delimitation, which is consistent with the frequency of the correct estimates.

Table 6. Results for Simulation II with subsampling.

Tree Model  Species Delimitation  Log-Likelihood  Correct%  AIC       AIC difference  Akaike weight
3-taxon     (true)                -10235.80       79%       20481.60  87.70           0.79
3-taxon     (proposed)            -10368.03                 20750.06  356.17          0
4-taxon     (true)                -11012.88       94%       22035.76  13.22           0.94
4-taxon     (proposed)            -11946.05                 23906.10  1883.55         0.06

Table 7. Results for Simulation III with subsampling and long branches.

Tree Model  Species Delimitation  Log-Likelihood  Correct%  AIC       AIC difference  Akaike weight
3-taxon     (true)                -12112.79       76%       24235.57  320.45          0.76
3-taxon     (proposed)            -12226.96                 24467.92  552.80          0
4-taxon     (true)                -12547.91       97%       25105.82  3.9312          0.98
4-taxon     (proposed)            -14692.04                 29398.08  4296.19         0.02

Adding subsampling to the data has little effect on the 4-taxon species tree model in both Simulations II and III. In Simulation II (Table 6), the 4-taxon species tree model still has a 94% correct rate in estimating the true species delimitation, while the 4-taxon model in Simulation III (Table 7) reaches 97% accuracy in species delimitation estimation. For the true species delimitation in the 4-taxon models of the two simulations, the averaged AIC differences are small and the model probabilities are close to 1, which is consistent with the results for the 4-taxon model in Simulation I.

However, subsampling affects the accuracy for the 3-taxon models in both Simulations II and III. The 3-taxon model in Simulation II (Table 6) has a 79% frequency of estimating the true species delimitation, similar to the accuracy of 76% for the 3-taxon model in Simulation III (Table 7). The averaged AIC differences for the true species delimitation of the 3-taxon model in both simulations are larger than that in Simulation I.

Although we use different branch length settings for Simulations II and III given data with subsampling, the results from the two simulations are similar. It seems that the impact of the branch length settings on the species delimitation estimation is not prominent.

Overall, the application of my method to species delimitation performs well when the entire sample of data is used. The 4-taxon tree model has a higher accuracy than the 3-taxon tree model in estimating the true species delimitation, especially when the data are subsampled, which might imply that it is easier to recognize the split species (4-taxon model) than the merged species (3-taxon model). The average time used for the likelihood computation of a single species delimitation is only about one hour, which shows the time efficiency of my method.

In Chapter 4, we reviewed the problem of species delimitation and the current methods used in this field. Then the algorithm for estimating the likelihood of the species tree given SNP data was applied to the species delimitation problem. Simulations were designed and conducted to test the performance of this method.

Chapter 5: Discussion

In my dissertation, I proposed a new method of computing the likelihood of a species tree given SNP data using ACs under the coalescent model. Based on this likelihood computation method, I developed an algorithm using stepwise addition to infer the ML species tree given data. I also implemented my method to evaluate proposed species delimitations by estimating their likelihoods given data, and compared them according to information-theoretic selection statistics.

In the likelihood computation of the species tree, using ACs can avoid the issue of enumeration of gene tree histories, which greatly reduces the burden in the likelihood computation for large trees. In this method, we visited the intervals on the species trees in a post-order traversal, and in each interval, we used Monte Carlo integration and importance sampling to approximate the integral in equation (16). This method estimates the likelihood of the species tree directly from SNP data, which can be used for likelihood-based inferences.
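To illustrate the Monte Carlo integration with importance sampling mentioned above, here is a generic Python sketch. It does not reproduce the integral in equation (16); instead it approximates a simple one-dimensional integral, the integral of x exp(-x) over x > 0, whose exact value is 1, using an exponential proposal density. The function names, the choice of proposal, and the sample size are all assumptions made for the illustration.

```python
import math
import random

def importance_sampling_integral(h, q_pdf, q_sample, n, rng):
    """Approximate the integral of h by averaging h(x) / q(x), x drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += h(x) / q_pdf(x)  # importance weight times integrand
    return total / n

rate = 0.5  # rate of the exponential proposal density (assumed)

def h(x):
    # Integrand: x * exp(-x), which integrates to 1 over (0, infinity)
    return x * math.exp(-x)

def q_pdf(x):
    return rate * math.exp(-rate * x)

def q_sample(rng):
    return rng.expovariate(rate)

estimate = importance_sampling_integral(h, q_pdf, q_sample, 20000,
                                        random.Random(1))
```

The proposal has heavier tails than the integrand, which keeps the importance weights bounded; a poorly matched proposal would inflate the variance of the estimate.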

Alternatively, SNAPP (Bryant et al., 2012) infers the species tree by computing the exact likelihood of the species tree using biallelic markers, where a single biallelic marker has only two states. For closely related species or within a species, most SNPs appear as biallelic markers, so both methods can be used. However, when the species of interest are not closely related, one SNP site is likely to have more than two states, in which case my method is the only option available to approximate the likelihood.

We processed the SNP data by sorting out the distinct site patterns under the JC69 model, which greatly improves the efficiency of this method. Throughout the simulations used in my dissertation, we simulated 5,000 SNPs. For a given number of individuals in our sample, the number of SNPs collected does not affect the efficiency of the method. In other words, if we have 7 individuals in our sample, 10,000 SNPs and 1,000,000 SNPs will have the same site patterns, leading to the same efficiency. Yet more individuals in the sample will generate more site patterns, and the computation time is linearly related to the number of site patterns (see equation (22) and equation (24)). Note that the number of individuals in the sample is the number of tips in the gene tree. Hence, even with data processing, the efficiency of this method still depends on the size of the tree. Therefore, for large tree problems, we sample the gene trees given the species tree and use HT estimators to estimate the species tree likelihood. If six or more individuals are sampled in the SNP data, gene tree sampling is recommended.
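One way to picture the data processing step is the Python sketch below, which collapses an alignment into distinct site patterns with their counts. The optional relabeling treats two columns as the same pattern when they differ only by a renaming of nucleotides, which leaves the site likelihood unchanged under JC69 because that model is symmetric in the four bases. Whether the dissertation's processing uses exactly this canonical form is an assumption, as are the function name and the toy alignment.

```python
from collections import Counter

def site_pattern_counts(alignment, jc69_relabel=True):
    """Count distinct site patterns in an alignment.

    alignment: list of equal-length strings, one per individual.
    Returns a Counter mapping each pattern (a tuple) to its number of sites.
    """
    n_sites = len(alignment[0])
    counts = Counter()
    for j in range(n_sites):
        column = tuple(seq[j] for seq in alignment)
        if jc69_relabel:
            # Relabel bases by order of first appearance, e.g. (A, A, G) and
            # (A, A, C) both become (0, 0, 1).
            labels = {}
            column = tuple(labels.setdefault(b, len(labels)) for b in column)
        counts[column] += 1
    return counts

# Toy alignment (assumed): 3 individuals, 3 variable sites.
alignment = ["AGA", "ATA", "GTC"]
counts = site_pattern_counts(alignment)
```

The site likelihood then needs to be computed only once per distinct pattern and raised to the power of its count, so the cost grows with the number of patterns rather than with the number of SNPs.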

The accuracy of the method was tested in different ways. Given a specific set of branch lengths, we compared the estimated log likelihood with the exact likelihood for 4-taxon species trees (Chifman and Kubatko, 2013) (Figure 7). For 4-taxon species trees, we also plotted both the estimated log likelihoods and the exact likelihoods (Chifman and Kubatko, 2013) over a branch length on the species tree, where the curves are concave (Figure 8). Meanwhile, we investigated the influence of the sample size in importance sampling on the accuracy, for both the 4-taxon species tree and the 5-taxon species tree (Figure 7 and Figure 8). To balance computational efficiency and accuracy, we recommend using a sample size of 250 for small trees and a sample size of 100 for large trees.

Since the likelihood function of the species tree is a concave function with respect to one branch length, we optimized the branch lengths of the species tree one by one using Brent’s method (Brent, 1973). Hence, we obtained the likelihood of the species tree topology given SNP data, which can be used in tree search to find the ML species tree.
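The one-by-one optimization of branch lengths can be sketched as a coordinate-wise search. To keep the example free of external libraries, the one-dimensional step below uses golden-section search as a simple stand-in for Brent's method, and the objective is a toy concave function rather than the species tree likelihood. All function names, bounds, and the number of sweeps are assumptions.

```python
import math

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0

def optimize_one_by_one(loglik, start, bounds, sweeps=20):
    """Cycle through the branch lengths, optimizing one while fixing the rest."""
    x = list(start)
    for _ in range(sweeps):
        for i, (lo, hi) in enumerate(bounds):
            def along_axis(v, i=i):
                return loglik(x[:i] + [v] + x[i + 1:])
            x[i] = golden_section_max(along_axis, lo, hi)
    return x, loglik(x)

def toy_loglik(t):
    # Toy concave surface (assumed) with its maximum at (1.0, 2.0).
    return (-(t[0] - 1.0) ** 2 - (t[1] - 2.0) ** 2
            - 0.5 * (t[0] - 1.0) * (t[1] - 2.0))

best, value = optimize_one_by_one(toy_loglik, [0.5, 0.5],
                                  [(0.01, 5.0), (0.01, 5.0)])
```

Because each one-dimensional slice of a concave function is itself concave, every coordinate step has a single maximum, which is what makes a bracketing method such as Brent's appropriate here.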

Using my method in Chapter 2 and stepwise addition, I developed a likelihood-based heuristic approach to infer the ML species tree given SNP data. Four simulations were conducted on two 5-taxon species trees (Symmetric and Asymmetric) with “Normal” or “Short” branch lengths (Table 2 and Figure 10). We found that when the branch lengths are not too short, most of the time the inferred ML tree is the true tree, which supports the reliability of this approach. The values of the RF distance between the inferred species trees and the true tree are all quite small (< 2), which means the inferred ML species trees are very close to the true tree in the RF measure. Short branch lengths could reduce the accuracy in inferring the true species tree, especially for symmetric species trees.
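For readers unfamiliar with the RF measure (Robinson and Foulds, 1981), the following Python sketch computes a clade-based version of the distance for rooted trees: the number of clades that appear in one tree but not in the other. The nested-tuple tree representation, the function names, and the use of the rooted variant are assumptions for this illustration; the simulations may have used a different implementation.

```python
def clades(tree):
    """Return the set of non-trivial clades of a rooted tree.

    The tree is a nested tuple whose leaves are taxon names (strings).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(leaves(child) for child in node))
        found.add(tips)
        return tips

    all_tips = leaves(tree)
    found.discard(all_tips)  # the full taxon set is shared by every tree
    return found

def rf_distance(tree1, tree2):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree1) ^ clades(tree2))

# Two hypothetical 5-taxon trees differing in one cherry.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = rf_distance(true_tree, inferred_tree)
```

Here the two trees share the clades {A, B, C} and {D, E} and differ only in {A, B} versus {A, C}, so the distance is 2, while identical trees have distance 0.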

In Chapter 4, we applied the likelihood computation method to the problem of species delimitation with SNP data. In the simulation study, we compared a proposed species delimitation and the true one according to their log likelihoods and model selection statistics. The comparison was tested under different species tree models: the 3-taxon tree model and the 4-taxon tree model. Note that the 3-taxon model compared the split species (taxon A and taxon B) with the original joined species (taxon AB, true species delimitation); the 4-taxon model compared the merged species (taxon AB) with the original split species (taxon A and taxon B, true species delimitation). We also considered subsampling of the individuals in the SNP data. The simulation results show that the 4-taxon tree model always performs well in estimating the true species delimitation, regardless of the subsampling in the data or the branch lengths of the species tree. The 3-taxon tree model performs well, with 97% accuracy, when there is no subsampling in the data. The accuracy of the 3-taxon tree model is reduced by about 20 percentage points by the subsampling of data, and the branch lengths of the 3-taxon species tree do not seem to have much effect. It seems that with subsampled data, my method potentially favors splitting the species.

In the future we can extend this method to multi-locus DNA sequence data. In this method, we assume constant model parameters across the species tree; we can relax this assumption and obtain the ML estimates of the parameters in different intervals of the species tree. To infer the ML species tree, since my method computes the likelihood of species tree topologies given SNP data, we can explore stochastic tree search strategies, such as the genetic algorithm and simulated annealing. For the species delimitation problem, we can further investigate the influence of the species tree topology and branch lengths, and of the sampling scheme of the data, on the likelihood of a species delimitation.

In all, my dissertation provides a new likelihood-based approach to infer species trees given SNP data. In this dissertation, this method has been applied to tree search inference and to the species delimitation problem. This likelihood method can also be used in other inferences in phylogenetics and population genetics.

References

Akaike, H. 1973. Information theory as an extension of the maximum likelihood principle. Pages 267-281 in B. N. Petrov and F. Csaki, editors. Second international symposium on information theory. Akademiai Kiado, Budapest.

Anderson, D. R. 2008. Model Based Inference in the Life Sciences. Springer.

Brent, R. P. 1973. "Chapter 4", Algorithms for Minimization without Derivatives, Englewood Cliffs, NJ: Prentice-Hall, ISBN 0-13-022335-2.

Bryant D., Bouckaert R., Felsenstein J., Rosenberg N.A., and RoyChoudhury A. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution 29:1917–1932.

Carstens B.C. and Dewey T.A. 2010. Species delimitation using a combined coalescent and information theoretic approach: An example from North American Myotis bats. Systematic Biology 59, 400-414.

Carstens, B. C. and Knowles, L. L. 2007. Estimating species phylogeny from gene-tree probabilities despite incomplete lineage sorting: an example from Melanoplus grasshoppers. Systematic Biology 56, 400-411.

Carstens, B. C., Pelletier, T. A., Reid, N. M. and Satler, J. D. (2013), How to fail at species delimitation. Molecular Ecology. doi: 10.1111/mec.12413

Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Res 31 (13): 3497–3500.

Chifman, J. and L. Kubatko. 2013. Identifiability of species phylogenies from DNA sequences under the coalescent model. (In preparation)

Degnan, J and L. Salter. 2005. Gene tree distributions under the coalescent process. Evolution 59(1): 24-37.

Edwards, S. V., L. Liu, and D. K. Pearl. 2007. High resolution species trees without concatenation. Proceedings of the National Academy of Sciences USA 104: 5936–5941.

Ence, D. D. and B. C. Carstens. 2011. SpedeSTEM: A rapid and accurate method for species delimitation. Molecular Ecology Resources 11: 473-480.

Fan, H. and L. S. Kubatko. 2011. Estimating species trees using approximate Bayesian computation. Molecular Phylogenetics and Evolution 59: 354-363.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17: 368-376

Griffiths, R. C. and S. Tavare. 1997. Computational methods for the coalescent. Progress in Population Genetics and Human Evolution 87: 165-182.

Heled, J. and A. J. Drummond. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27(3): 570-580.

Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290.

Hudson, R. R. 2002. Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:337-8.

Kingman, J. F. C. 1982. The coalescent. Stochastic Processes and their Applications. 13:235–248.

Knowles, L. L. and B. C. Carstens. 2007. Delimiting species without monophyletic gene trees. Systematic Biology 56(6): 887-895.

Kolaczkowski, B. and Thornton, J. W. 2004. Performance of maximum parsimony and maximum likelihood phylogenetics when evolution is heterogeneous. Nature 431: 980-984.

Kubatko, L., B. C. Carstens, and L. L. Knowles. 2009. STEM: Species Tree Estimation using Maximum likelihood for gene trees under coalescence. Bioinformatics 25(7): 971-973.

Kubatko, L. and J. Degnan. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic Biology 56: 17-24.

Kullback, S. and Leibler, R.A. 1951. On Information and Sufficiency. Annals of Mathematical Statistics 22 (1): 79–86. doi:10.1214/aoms/1177729694. MR 39968

Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science 227 (4693): 1435–41.

Lipman DJ, Altschul SF, Kececioglu JD. A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences of the USA. 1989; 86:4412-4415.

Liu, L., and D. K. Pearl. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology 56:504–514.

Liu, L. 2008. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 24:2542-2543.

Liu, L., L. Yu, and S. V. Edwards. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evolutionary Biology 10: 302.

Liu, L., L Yu, D. K. Pearl and S. V. Edwards. 2009. Estimating species phylogenies using coalescence times among sequences. Systematic Biology 58(5): 468-477.

Maddison, W. 1997. Gene trees in species trees. Systematic Biology 46:523–536.

Mossel, E. and Vigoda, E. 2005. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science 309: 2207-2209.

Pamilo, P. and M. Nei. 1988. Relationships between gene trees and species trees. Mol. Biol. Evol. 5: 568–583.

Robinson, D. F. and L. R. Foulds. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53: 131-147.

Ronquist, F., M. Teslenko, P. van der Mark, D.L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M.A. Suchard and J.P. Huelsenbeck. 2012. MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice Across a Large Model Space. Systematic Biology 61(3):539–542.

Salter, L. and D. Pearl. 2001. Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Systematic Biology 50(1): 7-17.

Tavare, S. 1984. Line-of-descent and genealogical processes, and their applications in population genetics model. Theoretical Population Biology 26: 119-164

Wakeley, J. 2008. Coalescent theory: an introduction. Roberts and Company Publishers. Greenwood Village, CO.

Wu, Y. 2011. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3): 763-775.

Yang, Z. and B. Rannala. 2010. Bayesian species delimitation using multilocus sequence data. Proceedings of the National Academy of Sciences USA 107: 9264-9269.

Zwickl, D. J. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. dissertation, The University of Texas at Austin.
