
Simulation-Based Estimation of Phylogenetic Trees

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Laura A. Salter, B.A., M.S.

* * * * *

The Ohio State University

1999

Dissertation Committee:

Professor Dennis K. Pearl, Adviser
Professor L. Mark Berliner
Professor Paul Fuerst
Professor Joseph Verducci

Approved by

Adviser
Department of Statistics

ABSTRACT

A common goal in the analysis of nucleotide sequence data is the inference of the phylogenetic history of the sequences under consideration. Many criteria for the selection of a phylogenetic representation of the data have been developed. We focus here on two criteria: the maximum likelihood criteria and the parsimony criteria.

The maximum likelihood method of phylogenetic tree construction has several advantages over other criteria, including the interpretability of the underlying Markov models, consistency in the statistical sense, and the possibility of statistical testing of hypotheses using the likelihood framework. However, use of the maximum likelihood method in practice has been limited because the method is computationally intensive, especially when the number of sequences under consideration is large. We therefore propose a stochastic search algorithm for estimation of the maximum likelihood tree. The method significantly reduces the computation time involved in constructing the maximum likelihood tree, and in many cases returns an estimate of the phylogeny that has a higher likelihood than those returned by the methods currently in use. We give some convergence results for the algorithm and apply it to several theoretical and real data sets. The algorithm can also be extended to allow for simultaneous estimation of the tree and the model parameters. Examples of the application of the extended algorithm are also given.

Parsimony is currently one of the most widely used phylogenetic tree construction methods. However, current implementations of the parsimony method can be shown to give locally optimal estimates of the phylogeny when a large number of sequences is considered. We have developed a simulated annealing algorithm for estimation of phylogenetic trees under the parsimony criteria, in the hope that such an algorithm would be less prone to entrapment in local minima. Though our algorithm does show reasonable ability to locate the most parsimonious tree, it does so at the expense of computing time. This result is in agreement with previous literature concerning the use of simulated annealing in estimating phylogenetic trees under the parsimony criteria. We provide convergence results for our algorithm and apply the method to several examples.

This is dedicated to my parents and my sister

ACKNOWLEDGMENTS

I would like to thank my family, my friends, and especially Justin, for the constant encouragement and support they have given me. I am grateful to the members of my committee for their insight and suggestions concerning this research. I would especially like to thank Dr. Berliner for his assistance with the results in Chapter 4.

Finally, I am extremely grateful to Dr. Pearl for his continued advice, encouragement, dedication, and understanding throughout my thesis work.

VITA

April 11, 1972 ...... Born - New Orleans, Louisiana USA

1994 ...... B.A. Biology, Mathematics

1996 ...... M.S. Statistics

1997-present ...... Graduate Research Associate, The Ohio State University.

FIELDS OF STUDY

Major Field: Biostatistics

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... ix

List of Figures ...... x

Chapters:

1. Introduction and Literature Review ...... 1

1.1 Phylogenetic Trees and Reconstruction Methods ...... 1
1.2 Simulated Annealing and Stochastic Probing ...... 8

2. A Stochastic Search Strategy for Estimation of the Maximum Likelihood Tree ...... 14

2.1 Calculation of the Likelihood ...... 14
2.2 A Stochastic Search Algorithm ...... 21
2.2.1 The Generation Scheme ...... 22
2.2.2 The Cooling Schedule ...... 26
2.2.3 The Stopping Rule ...... 27
2.3 Simultaneous Estimation of the Tree and the Substitution Model Parameters ...... 29
2.3.1 Estimation of the Nucleotide Frequency Parameters ...... 30
2.3.2 Estimation of Other Substitution Model Parameters ...... 31
2.4 Computer Implementation ...... 40

3. A Simulated Annealing Algorithm for Estimation of Phylogenetic Trees Under the Parsimony Criteria ...... 43

3.1 The Parsimony Criteria ...... 43
3.2 Estimating the Most Parsimonious Tree(s) Using Simulated Annealing ...... 47
3.2.1 A New Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s) ...... 49
3.2.2 Computer Implementation ...... 52

4. Properties of the Algorithms ...... 53

4.1 Convergence Results for the Stochastic Search Algorithm ...... 53
4.2 Convergence Results for the Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s) ...... 66

5. Applications ...... 73

5.1 Estimation of the ML Tree for Fixed Parameter Values ...... 73
5.1.1 Theoretical Data ...... 73
5.1.2 Mitochondrial DNA Sequences ...... 76
5.1.3 Group A Papillomavirus Sequences ...... 80
5.1.4 Analysis of the env Region for 30 HIV Sequences ...... 84
5.2 Simultaneous Estimation of the Tree and Substitution Model Parameters ...... 87
5.3 Estimation of the Most Parsimonious Tree(s) ...... 91

6. Conclusion and Future Directions ...... 96

Appendices:

A. Data Sets ...... 99

B. Newton-Raphson Calculations ...... 103

B.1 The Newton-Raphson Method for the Assignment of a Time to the Target Node in the Local Rearrangement Strategy ...... 103
B.2 The Newton-Raphson Method for Estimation of μ and K in the F84 Substitution Model ...... 106

Bibliography ...... 109

LIST OF TABLES

Table Page

1.1 DNA sequences for a portion of the L1 gene for seven Group A9 human papillomaviruses ...... 2

3.1 Data from Table 1.1 with site 15 highlighted...... 45

5.1 Summary of the tree estimation results for the theoretical data examples 74

5.2 Summary of the tree estimation results for the real data examples . . 78

5.3 Group A papillomaviruses and genetic subtypes ...... 80

5.4 Summary of the tree estimation results for the HPV example ...... 81

5.5 Genetic subtype groupings for the HIV data set ...... 86

5.6 Summary of the tree estimation results for the HIV exam ple ...... 87

5.7 Parameter estimates for the mtDNA data set ...... 89

A.1 Theoretical data used in the examples in Section 5.1.1 ...... 102

A.2 Theoretical data used in Section 2.3 ...... 102

LIST OF FIGURES

Figure Page

1.1 Example rooted and unrooted trees for the seven sequences in Table 1.1 3

2.1 The local rearrangement strategy ...... 23

2.2 The log likelihood for Trees 1 and 2 for various values of the transition/transversion ratio for a theoretical data set ...... 31

2.3 The log likelihood for varying values of the transition/transversion ratio for the 14-sequence mtDNA data set ...... 32

2.4 Two trees that may each be the ML estimate for differing values of R for the 30-sequence HPV data set ...... 33

3.1 Demonstration of the calculation of the length of a tree ...... 46

5.1 Plot of the log of time against the log of the number of sequences for the theoretical data examples ...... 75

5.2 NJ tree for the 12 primate species from Hayasaka, Gojobori, and Horai [27] ...... 77

5.3 Three trees of high likelihood found by SSA for the 14-sequence mtDNA data set ...... 79

5.4 Four trees of high likelihood found by SSA for the HPV data set. . . 83

5.5 Three trees of high likelihood found by SSA for the HIV data set. . . 88

5.6 Three trees of high likelihood found by SSA when parameters are also estimated ...... 90

5.7 MP tree for the HPV data set ...... 94

5.8 MP tree for the HPV data set ...... 95

B.1 Subtree defined by the target node and its descendants ...... 105

CHAPTER 1

INTRODUCTION AND LITERATURE REVIEW

Study of the evolutionary relationships among organisms has been of interest to scientists for over 100 years. Historical attempts at inferring evolutionary history have relied on observable species characteristics. Modern molecular techniques, however, have made available an abundance of DNA sequence data which can be used to study these relationships. The results of such studies have implications for the understanding of not only the evolution of plants and animals, but also viral transmission and the evolution of resistance to treatment. Clearly a need for effective methods of analyzing DNA sequence data exists. We discuss here one important aspect of the problem, the reconstruction of a phylogeny from DNA sequence data.

1.1 Phylogenetic Trees and Reconstruction Methods

Deoxyribonucleic acid (DNA) molecules specify the hereditary information for nearly all living organisms, with the exception of some viruses. DNA molecules consist of two complementary chains that are each composed of four nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T). Adenine and guanine are called purines, and cytosine and thymine are called pyrimidines. As an example, consider the data in Table 1.1, which shows DNA sequences for a portion of the L1 gene of seven human papillomaviruses.

HPV16  ATGTGGCTGCCTAGTGAGGCCACTGTCTACTTGCCTCCTGTCCAGTATCTAAGGTTG
HPV35h ATGTGGCGGTCTAACGAAGCCACTGTCTACCTGCCTCCAGTTCAGTGTCTAAGGTTG
HPV31  ATGTGGCGGCCTAGCGAGGCTACTGTCTACTTACCACCTGTCCAGTGTCTAAAGTTG
HPV52  ATGTGGCGGCCTAGTGAGGCCACTGTGTACCTGCCTCCTGTCCTGTCTCTAAGGTTG
HPV33  ATGTGGCGGCCTAGTGAGGCCACAGTGTACCTGCCTCCTGTCCTGTATCTAAAGTTG
HPV58  ATGTGGCGGCCTAGTGAGGCCACTGTGTACCTGCCTCCTGTCCTGTGTCTAAGGTTG
RhPV1  ATGTGGCGGCCTAGTGACTCCAAGGTCTACCTACCACCTGTCCTGTGTCTAAGGTGG

Table 1.1: DNA sequences for a portion of the L1 gene for seven Group A9 human papillomaviruses. See Section 5.1.3 for a description of the data.

Given DNA sequence data such as that in Table 1.1, it is often of interest to infer the evolutionary relationships among the sequences that are most consistent with this data. A common representation of such evolutionary relationships is by means of a phylogenetic tree. A phylogenetic tree contains both nodes and branches. The nodes represent taxonomic units, and the branches connect the nodes according to descent-ancestry relationships. We distinguish between external nodes, which are the nodes at the tips of the tree representing the observed taxonomic units, or taxa, under consideration, and internal nodes, which represent ancestors of the taxa under consideration. The overall branching pattern of the tree, including the assignment of taxa to the external nodes of the tree, will be called the topology.

Phylogenetic trees are called rooted when the location of the common ancestor of all the taxa in the tree is identified, or unrooted when no such common ancestor is identified. Figure 1.1 shows examples of both rooted and unrooted trees for seven taxa.

Figure 1.1: Example rooted (a) and unrooted (b) trees for the seven sequences in Table 1.1.

Rooted phylogenetic trees may or may not satisfy the assumption of a molecular clock. The molecular clock hypothesis states that the rate of evolution is approximately constant over time. When all of the sequences in the tree are contemporaneous, this assumption restricts the lengths of the branches so that the sum of the branch lengths connecting each taxon to the root is the same for all taxa. The units for the branch lengths are the expected number of nucleotide substitutions per site.

As the number of taxa under consideration grows, the number of distinct topologies increases rapidly. For n taxa, the number of distinct rooted topologies is

$$\prod_{i=1}^{n-1} (2i - 1). \qquad (1.1)$$
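To give a sense of how quickly this count grows, the short Python sketch below evaluates the product in (1.1) for a few values of n; the function names are our own illustrative choices and are not part of the original text.

```python
# Sketch: evaluate the count in (1.1) for a few values of n.
# Function names are ours, not from the dissertation.

def num_rooted_topologies(n):
    """Number of distinct rooted bifurcating topologies for n taxa: prod_{i=1}^{n-1} (2i - 1)."""
    count = 1
    for i in range(1, n):
        count *= 2 * i - 1
    return count

def num_unrooted_topologies(n):
    """Equal to the number of rooted topologies for n - 1 taxa."""
    return num_rooted_topologies(n - 1)

if __name__ == "__main__":
    for n in (4, 7, 10, 20, 30):
        print(n, num_rooted_topologies(n), num_unrooted_topologies(n))
```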

The number of unrooted topologies for n taxa is equal to the number of rooted topologies for n - 1 taxa. Many methods of phylogenetic tree construction from nucleotide sequence data have been proposed. These can generally be classified into one of three groups: distance methods, parsimony, and maximum likelihood (ML). Methods based on distances between DNA sequences are the most computationally efficient, but have other undesirable properties, such as inconsistency. Parsimony methods have been the most widely used methods to date, but these methods are generally computationally intensive and can also produce inconsistent estimates in any situation in which the mutation rate is not low. The maximum likelihood approach has a more sound statistical basis than the other methods, but its use has been limited by its computational complexity. We briefly discuss these three classes of methods.

Distance methods use information about the evolutionary distance between pairs of sequences in constructing phylogenetic trees. Any type of distance measure can be used, but the distances typically considered are functions of the number of nucleotide or amino acid differences between the two sequences. The most popular of the distance methods are the unweighted pair-group method with arithmetic mean (UPGMA) [47] and the neighbor-joining (NJ) method [58]. The UPGMA method sequentially joins nodes that are most similar (i.e., have the smallest distance). Once two nodes have been joined, they are treated as a single node, and distances are recomputed. The process continues until all nodes have been added to the tree. The NJ method begins with a star tree (a tree with only one internal node) and then sequentially separates pairs of taxa from the other taxa in such a way that the sum of all the branch lengths within the tree is minimized. The process continues until a bifurcating tree (one in which all ancestral nodes have exactly two descendants) has been obtained. Distance methods are statistically consistent if the rates of evolution among the lineages are equal. However, such methods are inconsistent when there are unequal rates of evolution [38].
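To make the UPGMA joining procedure described above concrete, here is a minimal Python sketch of the sequential clustering step, assuming a precomputed matrix of pairwise distances. The function name and dict-based distance matrix are our own illustrative choices, and branch-length bookkeeping is omitted.

```python
# Minimal UPGMA sketch: repeatedly join the two closest clusters and
# recompute distances to the new cluster as the size-weighted average of
# the two old distances (arithmetic mean over member pairs).

def upgma_merge_order(dist, labels):
    """Return the sequence of merges as (cluster_a, cluster_b, distance) tuples."""
    clusters = {lab: [lab] for lab in labels}                  # name -> member taxa
    d = {tuple(sorted((a, b))): dist[a][b]
         for a in labels for b in labels if a < b}
    merges = []
    while len(clusters) > 1:
        a, b = min((pair for pair in d if pair[0] in clusters and pair[1] in clusters),
                   key=lambda pair: d[pair])
        merges.append((a, b, d[(a, b)]))
        new = a + "+" + b
        for c in clusters:
            if c in (a, b):
                continue
            # UPGMA: average distance, weighted by the sizes of the joined clusters.
            d[tuple(sorted((new, c)))] = (
                len(clusters[a]) * d[tuple(sorted((a, c)))] +
                len(clusters[b]) * d[tuple(sorted((b, c)))]
            ) / (len(clusters[a]) + len(clusters[b]))
        clusters[new] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
    return merges

if __name__ == "__main__":
    labels = ["HPV16", "HPV31", "HPV33", "HPV58"]              # toy distances only
    dist = {"HPV16": {"HPV31": 0.12, "HPV33": 0.16, "HPV58": 0.17},
            "HPV31": {"HPV16": 0.12, "HPV33": 0.15, "HPV58": 0.16},
            "HPV33": {"HPV16": 0.16, "HPV31": 0.15, "HPV58": 0.10},
            "HPV58": {"HPV16": 0.17, "HPV31": 0.16, "HPV33": 0.10}}
    print(upgma_merge_order(dist, labels))
```

Under a molecular clock, each merge in this sequence would also define an internal node placed at half the merging distance, which is how UPGMA assigns branch lengths.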

Parsimony methods attempt to minimize the number of evolutionary changes required to explain the differences in the nucleotide sequences of the taxa under consideration. For many data sets, there may be more than one most parsimonious (MP) tree. Methods based on the idea of maximum parsimony have been criticized for several reasons. The first is that such methods assume that the evolutionary history representing the least amount of change is the one that is most likely to have occurred, which may not always be the case. A second and very important criticism is that the parsimony method is inconsistent when the rates of evolution are not small and/or are unequal between lineages [17, 38]. A third problem is that finding the MP tree(s) is difficult when the number of taxa under consideration is large, due to the fact that an exhaustive evaluation of the amount of evolutionary change for all possible topologies is computationally impossible. Despite these difficulties, parsimony is one of the most widely-used criteria for phylogenetic inference, due to both its availability in most of the popular software programs, and its ease of use even in the case of data sets that include a large number of taxa. Parsimony will be discussed in more detail in Chapter 3.

The maximum likelihood method was first applied to phylogenetic tree construction by Cavalli-Sforza and Edwards [8] for gene frequency data and by Felsenstein [15, 16] for nucleotide data. Since then, various algorithms for estimation of the ML tree have been proposed. The most popular algorithms currently in use are those developed by Felsenstein [19] in his programs DNAML and DNAMLK and those developed by Swofford [60] and implemented in PAUP*. In the case of unrooted trees, Olsen et al. [49] provided an algorithm (fastDNAML) that is nearly identical to Felsenstein's program DNAML, but is able to find an estimate more quickly. These algorithms are all similar, and involve first building an initial tree by stepwise addition of sequences to the tree. At each stage of addition of a new taxon, various rearrangements of the topology are attempted, and any rearrangement which results in an increase in likelihood is accepted. Throughout the process, branch lengths which maximize the likelihood for a specific placement of the new taxon are computed using the Newton-Raphson method. After the entire tree has been constructed, branch lengths are further optimized using the Newton-Raphson method by moving through the tree and adjusting the branch lengths one at a time. This process continues until an entire traversal can be made through the tree without a significant increase in the likelihood.

These algorithms have several disadvantages. The first is that they can be quite slow when the number of taxa under consideration is large. The time required for the programs DNAML and DNAMLK in PHYLIP to produce a single estimate of the maximum likelihood tree using this algorithm is proportional to the cube of the number of taxa under consideration and is linear in the number of variable sites in the nucleotide sequence [19]. Olsen et al. [49] report that the time required for DNAML to find a single estimate of the topology for 16 taxa and 2,413 sites averaged 24,246 seconds. Because of the computational cost involved, the program is not usable for even moderately large numbers of taxa (say > 35) when there are many distinct variable sites in the nucleotide sequences.

The second disadvantage is that the method is completely dependent on the order in which the taxa are added to the tree. It is tempting therefore to try all possible orders of addition of the taxa in the hope of finding the maximum likelihood tree. This approach will not work for two reasons. The first is that it is infeasible given the amount of time involved in the computation of a single estimate. The second and more important reason is that the stepwise addition method may become trapped in local maxima if the data is sufficiently complex, and therefore even the consideration of all possible orders of addition does not guarantee finding the maximum [19]. For example, the program DNAMLK failed to find the maximum likelihood topology on all of 17 attempted runs for the 16 sequence example described in Olsen et al. [49].

Aside from the difficulties involved in its implementation, there are several advantages to the likelihood method of phylogenetic tree reconstruction. The first is that the maximum likelihood method provides a consistent estimate of the topology, which was a shortcoming of both the distance-based and parsimony methods discussed above. The method also shows some degree of robustness to violations of the assumptions used in its models [65, 25]. In addition, the maximum likelihood method may be used to construct trees with or without the assumption of a molecular clock, and construction of trees under both methods for a data set allows for a likelihood-based test of the molecular clock hypothesis for the data set under consideration. We note that when trees are constructed without the assumption of a molecular clock, the maximum likelihood method allows the expected number of substitutions to be different for different branches of the tree, which was not possible for the parsimony and UPGMA methods.

The maximum likelihood method has an additional advantage over the parsimony method in that it considers information about the lengths of the branches in constructing trees, rather than simply considering the branching pattern. It also uses information from all the sites in the nucleotide sequence in the calculation process, rather than just the phylogenetically informative sites. The maximum likelihood method will be discussed in more detail in Chapter 2.

Considerable recent attention has been given to stochastic search strategies as a means of estimating the ML tree. The genetic algorithms of Lewis [35] for nucleotide sequence data and of Matsuda [42] for amino acid sequence data seem promising. Several Markov Chain Monte Carlo (MCMC) methods [37, 43, 69] have been proposed when an estimate of the posterior distribution of phylogenetic trees under specific prior assumptions is desired. However, when a large number of sequences is considered, MCMC methods give poor estimates of the posterior probability of any individual tree and are not designed to estimate the ML tree.

1.2 Simulated Annealing and Stochastic Probing

The method of simulated annealing was developed through the generalization of an algorithm proposed by Metropolis et al. in 1953 [45] to simulate the changes in the energy of a substance as it was being cooled. The goal of the cooling process is to obtain a solid that is in its ground state, i.e., the state at which the solid has minimum energy. However, the cooling process has the property that if the temperature is lowered too quickly, the resulting solid can become trapped in a meta-stable state that is not its ground state. Several authors, notably Kirkpatrick et al. [31] and Cerny [9], noticed the analogy between the cooling of a substance to its minimum energy state and the minimization of a function using a stochastic search strategy. In this analogy, the meta-stable states represent local minima, the ground state represents the global minimum, and the rate of lowering of the temperature corresponds to some parameter which controls the possible solutions examined by the search procedure. Using this analogy and the details of the Metropolis algorithm, the simulated annealing method for function optimization was developed.

To describe the annealing algorithm in general, we suppose that the goal of the algorithm is the maximization of a function. The algorithm works by moving through the solution space in such a way that solutions which increase the value of the objective function are always accepted while those which decrease the value of the objective function are accepted with a varying probability, which depends on the amount that the objective function would be decreased. The probability of accepting a solution which lowers the value of the objective function is decreased as the algorithm proceeds according to a sequence of control parameters. This is analogous to the method of slowly decreasing the temperature in the cooling of a substance in the chemical physics setting described above. The idea behind the algorithm is that accepting poorer solutions with a certain probability will help the process avoid becoming trapped in local maxima.

We can formally define the simulated annealing algorithm as follows. Let f(·) be the function to be maximized, let x be an element of the solution space, S, and let c_0 be the initial value of the control parameter. The following steps are repeated until the value of the control parameter is sufficiently small or until the same solution is repeatedly generated in many consecutive iterations.

Step 1. From the current solution, x_i, generate a potential solution, x_j, according to a specific generation scheme.

Step 2. If f(x_j) > f(x_i), then set x_{i+1} = x_j. Otherwise, set x_{i+1} = x_j with probability exp{(f(x_j) - f(x_i))/c_i}.

Step 3. Update the value of the control parameter, c_i, and set i to i + 1. Go to Step 1.
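As a minimal illustration of Steps 1-3 above, the following Python sketch runs the generic annealing loop for maximization. The proposal mechanism, the geometric cooling factor, and all names here are our own illustrative assumptions; they are not the schedule or generation scheme used later in this work.

```python
import math
import random

# Sketch of the generic annealing loop (Steps 1-3) for maximizing f.
# The geometric cooling and the proposal function are illustrative only.

def simulated_annealing(f, x0, propose, c0=1.0, cooling=0.999, n_iter=100000):
    x, c = x0, c0
    best, best_val = x0, f(x0)
    for _ in range(n_iter):
        cand = propose(x)                       # Step 1: generate a candidate
        delta = f(cand) - f(x)
        if delta > 0 or random.random() < math.exp(delta / c):
            x = cand                            # Step 2: accept (always if better)
        if f(x) > best_val:
            best, best_val = x, f(x)
        c *= cooling                            # Step 3: lower the control parameter
    return best, best_val

if __name__ == "__main__":
    # Example: maximize a bumpy one-dimensional function.
    f = lambda x: -(x - 2.0) ** 2 + math.sin(5.0 * x)
    propose = lambda x: x + random.uniform(-0.5, 0.5)
    print(simulated_annealing(f, x0=-5.0, propose=propose))
```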

Since the outcome of an iteration depends only on the outcome of the previous iteration, the simulated annealing algorithm generates a Markov chain whose transition probabilities depend on the generation scheme and on the probability of accepting a new solution into the chain. As it is described above, the process will generate a time-inhomogeneous Markov chain. If the value of the control parameter is held fixed for many iterations and then subsequently lowered, the process will generate a series of time-homogeneous Markov chains, which can then be combined into a single time-inhomogeneous Markov chain. Using Markov chain theory, the above algorithm can be shown to converge to a stationary distribution for which the set of optimal solutions has probability one, under certain conditions on both the sequence of control parameters and on the generation scheme [41, 46, 2, 22, 39, 5].

We note that the algorithm requires several things. First, a method of generating a candidate solution from any given solution must be determined, and this generation scheme must satisfy the requirement that every state is reachable from every other state in a finite number of applications of the generation procedure. Next, an initial value of the control parameter and the way that the control parameter will be updated at each iteration of the algorithm must be specified. This is often called the cooling schedule in the spirit of the analogy of the cooling of a substance, since this sequence of parameters determines the change in the probability with which solutions that decrease the value of the objective function are accepted (which corresponds to the rate at which the substance is cooled). Finally, the stopping criteria must be determined. This is usually done by specification of either a final value of the control parameter or a bound on the number of consecutive unsuccessful moves attempted.

Although the algorithm can be shown to converge under appropriate conditions, the number of iterations required to find a solution arbitrarily close to the optimal solution can be quite large for some choices of the cooling schedule and stopping rule. Therefore, several cooling schedules which seem to provide near-optimal results in reasonable time have been proposed [40, 41, 31, 3].

The simulated annealing algorithm has several advantages. The first is that it is a very general algorithm that can easily be applied to many optimization problems. Secondly, the annealing algorithm provides an effective method of searching the solution space while avoiding the possibility of entrapment in local optima, even in situations where the solution space is extremely complex. It is therefore very useful in combinatorial optimization problems, where an exhaustive search of the solution space is impossible due to the large number of possible solutions. Further, the convergence of the annealing method is independent of the starting point, and so a starting point for the algorithm can be generated from any valid point in the solution space in whatever manner is most convenient for the problem under consideration.

The annealing method also has several disadvantages. One is that the determination of an appropriate cooling schedule and the specification of the parameters within that cooling schedule is often difficult and problem-specific. A second disadvantage is that the algorithm in general is found to require many iterations to find near-optimal solutions. For certain problems, it has been found that the algorithm is less efficient than complete enumeration of the solution space [2]. In spite of these difficulties, the algorithm has still proven useful in some problems.

The simulated annealing method can also be modified to give other useful simulation-based methods. One modification was the proposal of a generalized simulated annealing method by Bohachevsky, Johnson, and Stein [6]. These authors modified the annealing algorithm by replacing the probability of acceptance of a solution which results in an increase of the objective function (in the case of minimization of a function) by

$$\exp\{-\beta\,(f(x_j) - f_{\min})^{g}\,(f(x_j) - f(x_i))\} \qquad (1.2)$$

where f(x) is the value of the objective function at point x, f_min is the minimum value of the function, and β and g are parameters that must be determined. Note that this method requires knowledge of the minimum value of the function. The authors discuss modifications that can be made when this value is not known.

Another simulation-based estimation method related to the annealing algorithm is a procedure developed by Laud, Berliner, and Goel [33] called stochastic probing. The method incorporates problem-specific information about the location of the optimal point into the search procedure in the form of a probing distribution. The probing distribution is used to guide the search procedure to regions in the solution space which are likely to contain good solutions in the hope of speeding up the convergence. The probing distribution is updated as the search proceeds, so that information gathered during the search is incorporated. The probing algorithm was found to be more efficient than the simulated annealing method for some problems.

In this thesis, we propose simulation-based methods of estimating phylogenetic trees under both the maximum likelihood criteria and the parsimony criteria. In Chapter 2, we give the details of the maximum likelihood method of phylogenetic tree construction and introduce a stochastic search strategy for estimation of the maximum likelihood phylogenetic tree. The method incorporates features of both the simulated annealing algorithm and the stochastic probing algorithm. We also give an extension of the method which allows for simultaneous estimation of the tree and the substitution model parameters. In Chapter 3, we describe the parsimony criteria, and propose a simulated annealing algorithm for estimation of phylogenetic trees under the parsimony criteria. In Chapter 4, we describe some of the properties of the two methods. Chapter 5 provides some applications of each of the methods to both theoretical and real data examples. Chapter 6 gives a summary of this work and some directions for future research.

CHAPTER 2

A STOCHASTIC SEARCH STRATEGY FOR ESTIMATION OF THE MAXIMUM LIKELIHOOD TREE

We begin the description of our proposed stochastic search strategy with an explanation of how the likelihood is calculated for any given phylogenetic tree. We then describe our algorithm, and give the details of its implementation.

2.1 Calculation of the Likelihood

The first step in the calculation of the likelihood for a phylogenetic tree is the specification of a model, usually called a substitution model, for the change of one nucleotide sequence to another over time. The substitution models most commonly used are Markov models, which have the property that the probability of a nucleotide at a particular site changing from i to j does not depend on the past history of nucleotides at the site. The probability depends only on the nucleotide i that currently occupies the site and on the time over which the change has occurred. It is also generally assumed that the substitution probabilities do not vary throughout the tree, so that homogeneous Markov processes are used to model sequence change over time. A further assumption is time-reversibility, i.e., that the overall rate of change from nucleotide i to nucleotide j in time t is the same as the rate of change from nucleotide j to nucleotide i in time t. Four of the commonly used substitution models are described below. The tree estimation procedure described here is general enough to include each of these models.

The Jukes-Cantor Model

The Jukes-Cantor model (JC) [29] is the simplest of the substitution models, since it contains only one parameter. The model assumes that all nucleotide substitutions occur at the same rate, μ, and that the equilibrium frequencies of the nucleotides are all equal (and therefore all equal to 1/4). The instantaneous rate matrix for this model is shown below.

$$Q = \begin{pmatrix} -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{3}{4}\mu \end{pmatrix} \qquad (2.1)$$

Using this instantaneous rate matrix, we can calculate the transition probabilities as follows. Denote the instantaneous rate matrix defined in (2.1) above by Q. The matrix of transition probabilities is then given by

$$P(t) = e^{Qt}. \qquad (2.2)$$

To evaluate this expression, we use the fact that the matrix Q can be written as

$$Q = B D B^{-1}, \qquad (2.3)$$

where B is a square matrix whose columns are the eigenvectors of Q, and D is a diagonal matrix with the eigenvalues of Q, which will be denoted by λ_i, on the diagonal. We can therefore evaluate (2.2) by taking

$$P(t) = e^{Qt} = B e^{Dt} B^{-1}, \qquad (2.4)$$

where e^{Dt} has the form

$$e^{Dt} = \mathrm{diag}\left(e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, e^{\lambda_4 t}\right).$$

For the specific Q matrix defined in (2.1), we would have

$$P(t) = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & e^{-\mu t} & 0 & 0 \\ 0 & 0 & e^{-\mu t} & 0 \\ 0 & 0 & 0 & e^{-\mu t} \end{pmatrix} \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}. \qquad (2.5)$$

The matrix of transition probabilities thus has (i, j) entry

$$[P(t)]_{ij} = P_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t} + \delta_{ij}\, e^{-\mu t}, \qquad (2.6)$$

where δ_ij is the indicator function that is 1 when i = j and 0 otherwise.
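The short Python sketch below evaluates the JC transition probabilities both from the closed form in (2.6) and numerically via the eigendecomposition of (2.3)-(2.5), as a check that the two agree. The function names and the values of μ and t are our own illustrative choices.

```python
import numpy as np

# Sketch: JC transition probabilities from (2.6) versus direct evaluation of
# P(t) = exp(Qt) by eigendecomposition.  mu and t are arbitrary values.

def jc_prob_closed_form(mu, t):
    e = np.exp(-mu * t)
    return 0.25 - 0.25 * e + np.eye(4) * e          # entry (i, j) of P(t)

def jc_prob_eigen(mu, t):
    Q = np.full((4, 4), mu / 4.0)
    np.fill_diagonal(Q, -3.0 * mu / 4.0)
    lam, B = np.linalg.eig(Q)                       # Q = B diag(lam) B^{-1}
    return (B @ np.diag(np.exp(lam * t)) @ np.linalg.inv(B)).real

if __name__ == "__main__":
    mu, t = 1.0, 0.3
    print(np.allclose(jc_prob_closed_form(mu, t), jc_prob_eigen(mu, t)))  # True
    print(jc_prob_closed_form(mu, t).sum(axis=1))   # each row sums to 1
```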

Kimura’s Two Parameter Model

The assumption that all nucleotide substitutions occur at the same rate is very restrictive since it is known that transitions (changes from A ↔ G and C ↔ T) are more likely to occur than transversions (changes from A ↔ C, A ↔ T, G ↔ C, G ↔ T). Thus Kimura [30] introduced a model (K2P) which included a transition/transversion parameter κ. The instantaneous rate matrix for this model is shown below, with rows and columns ordered A, G, C, T.

$$Q = \begin{pmatrix} -\tfrac{1}{4}\mu(\kappa + 2) & \tfrac{1}{4}\mu\kappa & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu\kappa & -\tfrac{1}{4}\mu(\kappa + 2) & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & -\tfrac{1}{4}\mu(\kappa + 2) & \tfrac{1}{4}\mu\kappa \\ \tfrac{1}{4}\mu & \tfrac{1}{4}\mu & \tfrac{1}{4}\mu\kappa & -\tfrac{1}{4}\mu(\kappa + 2) \end{pmatrix}$$

From this instantaneous rate matrix, we can calculate the transition probabilities as above,

$$P_{ij}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu t} + \tfrac{1}{2}\left(\delta_{ij} + \epsilon_{ij}\right) e^{-\mu t} + \tfrac{1}{2}\left(\delta_{ij} - \epsilon_{ij}\right) e^{-\mu(\kappa+1)t/2}, \qquad (2.7)$$

where δ_ij is an indicator function as above, and ε_ij is 1 if the change from i to j is a transition and 0 otherwise. We note that setting κ = 1 gives the JC model.

Hasegawa, Kishino and Yano’s Five Parameter Model

Kimura’s two parameter model can be generalized to incorporate differences in the equilibrium frequencies of the nucleotides. This generalization results in the five parameter model of Hasegawa, Kishino, and Yano (HKY85) [26] (there are five parameters instead of six because the four equilibrium frequencies must sum to 1).

The instantaneous rate matrix for this model is shown below, with rows and columns ordered A, G, C, T.

$$Q = \mu \begin{pmatrix} -(\kappa\pi_G + \pi_Y) & \kappa\pi_G & \pi_C & \pi_T \\ \kappa\pi_A & -(\kappa\pi_A + \pi_Y) & \pi_C & \pi_T \\ \pi_A & \pi_G & -(\pi_R + \kappa\pi_T) & \kappa\pi_T \\ \pi_A & \pi_G & \kappa\pi_C & -(\pi_R + \kappa\pi_C) \end{pmatrix}$$

where π_j is the equilibrium frequency of nucleotide j, π_R = π_A + π_G, and π_Y = π_C + π_T.

From the instantaneous rate matrix, we can calculate the transition probabilities as above,

$$P_{ij}(t) = \pi_j - \pi_j e^{-\mu t} + \frac{\pi_j}{\Pi_j}\left(e^{-\mu t} - e^{-\mu t[1 + \Pi_j(\kappa - 1)]}\right)\gamma_{ij} + \delta_{ij}\, e^{-\mu t[1 + \Pi_j(\kappa - 1)]}, \qquad (2.8)$$

where Π_j = π_R if j is a purine and Π_j = π_Y if j is a pyrimidine, and γ_ij is 1 if i and j are both purines or both pyrimidines and 0 otherwise. We note that setting π_j = 1/4 for all j gives the K2P model and additionally letting κ = 1 gives the JC model.

Felsenstein’s Five Parameter Model

Felsenstein [18] also proposed a five parameter model (the F84 model) as a generalization of Kimura's model. Although the two models are similar, they are not equivalent, except in the case that π_j = 0.25 for all j. The idea behind the F84 model is that the substitution process can be thought of as having two separate components: an overall substitution rate which produces both transitions and transversions, and a within-group substitution rate which produces only transitions. Modeling the substitution process in this way makes the model computationally simpler than the HKY85 model, while still including the important features of the model.

The instantaneous rate matrix for this model is shown below.

$$Q = \mu \begin{pmatrix} \cdot & \pi_G\!\left(1 + \tfrac{K}{\pi_R}\right) & \pi_C & \pi_T \\ \pi_A\!\left(1 + \tfrac{K}{\pi_R}\right) & \cdot & \pi_C & \pi_T \\ \pi_A & \pi_G & \cdot & \pi_T\!\left(1 + \tfrac{K}{\pi_Y}\right) \\ \pi_A & \pi_G & \pi_C\!\left(1 + \tfrac{K}{\pi_Y}\right) & \cdot \end{pmatrix}$$

where rows and columns are ordered A, G, C, T, π_j is the equilibrium frequency of nucleotide j, π_R = π_A + π_G, π_Y = π_C + π_T, and the diagonal elements (marked ·) are the negative of the sum of the corresponding row.

From the instantaneous rate matrix, we can calculate the transition probabilities as above:

$$P_{ij}(t) = \pi_j - \pi_j e^{-\mu t} + \frac{\pi_j}{\Pi_j}\left(e^{-\mu t} - e^{-\mu(1+K)t}\right)\gamma_{ij} + \delta_{ij}\, e^{-\mu(1+K)t}, \qquad (2.9)$$

where Π_j = π_R if j is a purine and Π_j = π_Y if j is a pyrimidine, and γ_ij is as defined above. We note that setting π_j = 1/4 for all j gives the K2P model and additionally letting K = 0 gives the JC model.
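The following Python sketch computes F84 transition probabilities directly from the two-process description given in the text (a general process at rate μ that draws a replacement from the equilibrium frequencies, plus a within-group process at rate μK that produces only transitions). The function name, dictionary layout, and the numerical values are our own illustrative assumptions.

```python
import numpy as np

# Sketch of F84 transition probabilities built from the two-process
# description above.  pi holds equilibrium frequencies keyed by base.

PURINES = {"A", "G"}
BASES = ["A", "G", "C", "T"]

def f84_prob(i, j, t, mu, K, pi):
    """P_ij(t) under F84 for one branch of length (time) t."""
    same_group = (i in PURINES) == (j in PURINES)
    group_freq = pi["A"] + pi["G"] if j in PURINES else pi["C"] + pi["T"]
    e1 = np.exp(-mu * t)                 # probability of no general event
    e2 = np.exp(-mu * (1.0 + K) * t)     # probability of no event of either kind
    p = pi[j] * (1.0 - e1)               # at least one general event occurred
    if same_group:
        p += (pi[j] / group_freq) * (e1 - e2)   # only within-group events occurred
    if i == j:
        p += e2                          # no change at all
    return p

if __name__ == "__main__":
    pi = {"A": 0.3, "G": 0.2, "C": 0.2, "T": 0.3}    # illustrative frequencies
    rows = [[f84_prob(i, j, 0.4, mu=1.0, K=2.0, pi=pi) for j in BASES] for i in BASES]
    print([round(sum(r), 6) for r in rows])          # each row should sum to 1.0
```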

In applying these substitution models to the estimation of phylogenetic trees under the maximum likelihood criteria, it is generally assumed that the substitution rate does not vary across sites. This assumption of rate homogeneity across sites is unrealistic, and can be relaxed in several ways. One possibility is to include a relative rate component in the transition probabilities. For example, in the JC model we could let r_h be the rate component at site h and modify the transition probability so that at site h we have

$$P_{ij}(t, r_h) = \tfrac{1}{4} - \tfrac{1}{4} e^{-\mu r_h t} + \delta_{ij}\, e^{-\mu r_h t}.$$

Alternatively, we can model the rate variation over sites by some probability distribution (discrete or continuous) and then modify the calculation of sitewise likelihoods (see below) by either summing or integrating over this distribution. For example, the gamma distribution has been commonly used to model rate variation across sites [63].

Another common feature in the data which is not accounted for in the substitution models described above is the presence of insertions or deletions in the nucleotide sequences. An insertion is the addition of one or more nucleotides to a sequence, while a deletion is the removal of one or more nucleotides from a sequence. Most phylogenetic reconstruction methods in common use either exclude sites which contain an insertion or deletion from the data set that is used in constructing the tree or ignore the particular branch leading to an external node with an insertion or deletion in the likelihood calculation for that site. At the present time, we proceed in the same manner, with each of the above possibilities as an option, though we note that a substitution model which takes into account the insertion/deletion process would be much more realistic.

After the substitution model has been specified, the likelihood function can be calculated. Before demonstrating the calculation of the likelihood of a tree, we note some of the properties of the calculation. First, under the assumption of time-reversibility, the calculation of the likelihood does not depend on whether the tree is considered as rooted or unrooted. Therefore the first step in most calculations of the likelihood is to place the root at some point in the tree if it is unrooted. Second, under the assumption that sites evolve independently of one another, the likelihood can be calculated by considering the sites one at a time and then forming the overall likelihood as the product of the sitewise likelihoods. Formally,

$$l(\tau, \mathbf{t} \mid D) = \prod_{j=1}^{m} l(\tau, \mathbf{t} \mid D_j) \qquad (2.10)$$

where l(·) is the likelihood function, τ is the topology, t is a vector of branch lengths, m is the number of sites in the nucleotide sequences under consideration, D is the data for all of the sites, and D_j is the data for site j only. Usually it is easier to work with the log of the likelihood, so that we are considering

$$L(\tau, \mathbf{t} \mid D) = \ln l(\tau, \mathbf{t} \mid D) = \sum_{j=1}^{m} \ln l(\tau, \mathbf{t} \mid D_j). \qquad (2.11)$$

To calculate the likelihood of tree topology τ with branch lengths t for site j, we must sum the probabilities of all possible scenarios by which the taxa at the tips of the tree could have evolved. Fortunately, Felsenstein [15, 16] provided a simplified version of the calculation, which he termed "pruning," which allows the calculation to be completed without the enumeration of all these possible scenarios. Let $l_j^k(s)$ be the likelihood for node k at site j given that node k has state s (s = A, G, C, or T).

m . T (r,tlD ) = ln l(r,t|D ) = ^ l n l ( r , t |D j ) (2.11) j=i To calculate the likelihood of tree topology r with branch lengths t for site j, we must sum the probabilities of all possible scenarios by which the taxa at the tips of the tree could have evolved. Fortunately, Felsenstein [15, 16] provided a simplified version of the calculation which he termed “pruning” which allows the calculation to be completed without the enumeration of all these possible scenarios. Let Z^(s) be the likelihood for node k at site j given that node k has state s (s = A, G, C, or T). This

2 0 is actually the conditional likelihood of the subtree descending from node k given that

node k is in state s. For the tips of the tree, l].{s) will be 1 if the observed nucleotide

at site j is s and 0 otherwise. For the internal nodes of the tree, the conditional

likelihood will be

'i ( s ) = Y , Y (2.12) r y where x and y take on the values A, G, C, and T, dl and d2 are the descendants of

node k, ti and tz are the lengths of the branches connecting dl and k, and d2 and

k, respectively, and Pab{i) is the transition probability for the change from a to ô in

time t under the chosen substitution model. Once the conditional likelihoods for all

nodes at all sites have been calculated, the overall log likelihood for the tree is given

by m lnZ(r, t|D) = y~^lnZ(r, t|Dj) 1=1

= (2.13) 1=1 I s 1 where r is the root node and tTs is the equilibrium frequency of nucleotide s.

2.2 A Stochastic Search Algorithm

Once we can evaluate the likelihood for any given topology and branch lengths, the

problem then becomes the location of the particular topology and branch lengths that

maximize the likelihood. Here we consider only rooted trees under the assumption of a molecular clock, though the method can be generalized to include estimation of unrooted trees as well. We begin with some notation.

Let T_n be the set of all rooted topologies with n external nodes, and let t be the (n - 2)-dimensional vector of evolutionary splitting times of the n - 2 internal nodes, excluding the root. Without loss of generality, we set the time of the root node to be 0 and the time of the external nodes to be 1.0, so that the total time represented by the tree is 1.0. The estimate of the parameter μ in the substitution model (see Section 2.1) will then act as a scaling factor for the total time of the tree. Note that the splitting time of each internal node is constrained to be between the time of its ancestor and the time of the closest of its two descendants, and that the length of each branch in the tree is determined by taking the difference in the times of the two nodes on either end of the branch. The units of time are the expected number of nucleotide substitutions per site.

We recall from Section 1.2 that development of a simulated annealing algorithm requires specification of three components of the algorithm: the generation scheme, the cooling schedule, and the stopping rule. We describe each of these components in the sections that follow, and finally provide an overall outline of the algorithm.

2.2.1 The Generation Scheme

The generation scheme proposed here is a stochastic modification of the Nearest Neighbor Interchange strategy used by the PAUP* software (also called the "local rearrangement" strategy in the stochastic search methods used by Kuhner, Yamato, and Felsenstein [32] and by Li, Pearl, and Doss [37]). This generation scheme satisfies the requirement that any tree be accessible from any other tree in a finite number of applications of the process [37]. The steps involved in the local rearrangement strategy in this setting for a tree which includes n sequences are as follows (see Figure 2.1 for an illustration).

Step 1. From the set of n - 2 internal nodes (excluding the root), select one of the internal nodes at random (each equally likely). Call this node the target node.

Step 2. From the set containing the two children nodes of the target node and the sibling of the target node, randomly select two of the three members of the set to become the new children of the target node. The subtrees descending from these children are carried along in the rearrangement process.

Step 3. Generate a new time for the target node.


Figure 2.1: Illustration of the local rearrangement strategy. First, a target node is chosen from the set of internal nodes, and the children, sibling, and parent of the target node are identified (a). Two new children of the target node are then selected, resulting in trees (b) or (c), in which the topology has changed, or in tree (d), in which the topology remains the same. The time of the target node is changed regardless of which new tree topology is produced. From Li, Pearl, and Doss [37].
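The Python sketch below carries out Steps 1-3 of the local rearrangement on a simple node/pointer tree. The Node class and the uniform redraw of the target's time are our own simplifications; as described in the text, the algorithm actually draws the new time from a beta distribution guided by a one-step Newton-Raphson estimate.

```python
import random

# Sketch of the local rearrangement move (Steps 1-3) on a rooted binary tree.
# Times increase from 0 at the root to 1.0 at the tips, as in the text.

class Node:
    def __init__(self, name=None, time=0.0):
        self.name, self.time = name, time
        self.parent, self.children = None, []

def local_rearrangement(internal_nodes):
    target = random.choice(internal_nodes)          # Step 1 (root excluded)
    parent = target.parent
    sibling = next(c for c in parent.children if c is not target)
    pool = target.children + [sibling]              # two children plus the sibling
    new_children = random.sample(pool, 2)           # Step 2
    new_sibling = next(c for c in pool if c not in new_children)
    target.children = new_children                  # reattach the three subtrees
    parent.children = [target, new_sibling]
    for c in new_children:
        c.parent = target
    new_sibling.parent = parent
    # Step 3: the new time must lie between the parent's time and the
    # earlier of the two new children's times (uniform draw here only).
    lo = parent.time
    hi = min(c.time for c in new_children)
    target.time = random.uniform(lo, hi)
    return target
```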

23 Note that the time of the target node is constrained to fall between the time of the target’s parent node and the time of the closest of the two new children. To assign a time to the target node following local rearrangement, we draw a random number from a beta distribution whose mean is suggested by a single step Newton-Raphson

(NR1) estimate of the time (see Appendix B for the details of the calculation). The actual value of the mean is set to be a specified fraction of the distance from the current time of the target node to the NR1 estimate if this estimate falls within the allowable interval, and the same fraction of the distance to the endpoint of the interval in the direction of the NR1 estimate otherwise. For computational speed, the NR1 estimate is based only on the subtree descending from the target node. This simplifies the computation enormously, since by the pruning algorithm of Felsenstein [16] the likelihood of the subtree depends only on the likelihoods of the two children, which have already been calculated. The first and second derivative of the likelihood of the subtree are thus easily calculable. The variance of the beta distribution used to place the time within the allowable interval is decreased as the algorithm proceeds according to the formula

$$\sigma_i^2 = \frac{1}{w + i} \qquad (2.14)$$

where i is the number of iterations in the annealing algorithm and w must be > 51 to ensure that a unimodal beta distribution is obtained. The rationale behind decreasing the variance of the beta distribution is that we rely more and more on the information from the Newton-Raphson calculation as the algorithm proceeds. We note that in this step we have generalized the simulated annealing algorithm somewhat, since most instances of the algorithm have generation probabilities that depend only on the current solution. Because our generation probabilities depend also on the iteration number, our stochastic optimization algorithm shares some of the features

of the stochastic probing algorithm described by Laud et al. [33]. In particular, the decreasing of the variance of the distribution used to determine the next candidate point in the algorithm is analogous to their use of a "probing distribution" to generate a set of candidate points at each iteration.

It is worth noting that we do not spend too much effort on the estimation of the node times and have chosen to estimate only one node time at each iteration. There are several reasons for this. First, the topology of the tree contributes more to the overall log likelihood than do the node times within the tree. In addition, having a tree with node times that have been chosen to maximize the likelihood could result in the rejection of moves that would eventually be beneficial, since the probability of accepting a move depends on the difference in log likelihood between the candidate tree and the current tree. Finally, optimization of node times is computationally intensive and would slow the algorithm, making it less applicable to large data sets [49]. In spite of placing only a limited effort on the estimation of optimal node times, the algorithm generally moves toward the optimal estimates as it proceeds. The final trees reported by the algorithm will typically have node times that are close to the optimal values, though they will not be exact due to the stochastic nature of the algorithm.

A tree on which to start the local rearrangement procedure must also be specified. Since convergence of the annealing algorithm occurs independently of the starting point, we choose to start with a tree that is randomly generated from a Yule model [24]. The total time represented by the tree is initially set to 1.0, and initial node times are spaced evenly throughout the tree. That convergence of the algorithm as implemented here is independent of the starting point will be examined with several examples in Chapter 5.

2.2.2 The Cooling Schedule

The cooling schedule used in the stochastic search algorithm is modeled after that used by Lundy [40] and Lundy and Mees [41]. The updating of the control parameter is accomplished by

$$c_{i+1} = \frac{c_i}{1 + \beta c_i / U} \qquad (2.15)$$

where U is an upper bound on the change in the likelihood in one application of the generation scheme and β is a parameter whose value is less than one. We note that the parameter β controls the rate of cooling. Values of β close to 1 will result in faster cooling but will increase the chance of becoming trapped in local maxima, while values of β much less than 1 will result in slower cooling, which can lead to longer run times.
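The short sketch below simply iterates the reciprocal-type control-parameter update given above so its slowly decaying behavior can be seen; the starting values of U and β are arbitrary illustrative choices (in the algorithm they are set from a burn-in period, as described next).

```python
# Sketch: behavior of the control-parameter update c_{i+1} = c_i / (1 + beta*c_i/U).

def cool(c, beta, U):
    return c / (1.0 + beta * c / U)

if __name__ == "__main__":
    U = 50.0          # illustrative upper bound on the one-move change
    beta = 0.01       # illustrative cooling parameter (< 1)
    c = U             # the text sets the initial control parameter to U
    for i in range(6):
        print(i, round(c, 4))
        c = cool(c, beta, U)
```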

To set reasonable values of β and U, we run the algorithm for an initial "burn-in" period. Once this burn-in period has been completed, we set U to the maximum change in the log likelihood observed during the burn-in. The initial value of the control parameter, c_0, is set to U, and β is determined by

$$\beta = \frac{d}{(1-a)\,n + a\,(L/m)} \qquad (2.16)$$

where n is the total number of sequences, m is the number of sites, d and a are between 0 and 1, and L is the negative of the log likelihood of the UPGMA tree

[47]. We note that the term L/m represents, on average, the log likelihood of each site in the sequence for a tree that will generally have a log likelihood that is close

to that of the ML tree. The motivation for setting β in this manner is that the

difficulty of the problem is affected by both the number of sequences in the tree and

the magnitude of the log likelihood attributed to each site (lower likelihoods imply less phylogenetic information). Since lower values of β will provide slower cooling, which is desirable in more difficult problems, β is related to the reciprocal of a linear combination of these two factors. The constant d in the numerator allows the user to alter the rate of cooling while not changing the particular linear combination selected in the denominator. For most problems, including those considered in Chapter 5, d = a = 1/2 seems to work well.

2.2.3 The Stopping Rule

Although many different stopping ciiteria are possible, the rule used here specifies that the algorithm terminate when a bound on the number of iterations since the proposal of an “uninvestigated” topology is reached. An uninvestigated topology is a topology that either has never been previously proposed or has not been eligible for local rearrangement an adequate number of times. The number of times a topology must be eligible for local rearrangement to be classified as “investigated” is determined by setting a bound on the probability that an internal node would not have been selected as the target node in the local rearrangement. For a tree with n sequences, there are n — 2 internal nodes eligible for local rearrangement, and so the probability that an arbitrary internal node is not selected for local rearrangement in xn trials is

(1 - - L - f " (2.17) n — z

Specifying this probability will therefore determine x. The default in our implemen­ tation is to set this probability to 0.05. The bound on the number of iterations since

27 the proposed of an uninvestigated topology is also taken to be a multiple of n. The

default we have set in this case is lOn iterations.

To use this stochastic search method for inference of the ML tree, the likelihood

must be recalculated after every local rearrangement and compared to the previous

likelihood. However, this does not require complete recalculation of the likelihood

from the tips of the tree. Because of the pruning algorithm [16], the only nodes

whose likelihoods will change after rearrangement are the target and the target’s

ancestors. The likelihood of the rest of the nodes in the tree will not be affected.

Thus storing the likelihoods for all the nodes in the tree at each step in the algorithm

and only recalculating likelihoods for nodes affected by a local rearrangement results

in a considerable economy in terms of computation time.

Additionally, the program has an option that allows for the k best trees encoun­

tered by the algorithm to be saved and printed to a file in Ne wick format, where k can

be set by the user. This allows the program to obtain information about competing

trees of high likelihood, which is an additional advantage over existing algorithms

which only provide a single estimate of the ML tree. As mentioned previously, esti­

mates of the node times for these k trees will generally be close to optimal, though

they will not be exact. Therefore, the Newton-Raphson algorithm should be used to

optimize the node times for these trees.

We can summarize the stochastic search algorithm (SSA) as follows. Let L(-) represent the log likelihood function, and let Ti 6 and r* G r„. The stochastic search algorithm consists of the following four steps:

Step 1. From tree Tj, generate candidate tree r* using the local rearrangement strat­

egy'-

28 Step 2. If L(r*) > L{ri), set — r*. If L(r*) < L{Ti), set = r* with, probability

Step 3. Update the value of the control parameter, Q, and set i to i + 1. If the

stopping criteria are not satisfied, return to Step 1. If the stopping criteria

are satisfied, return the k best trees found by the algorithm and proceed to

Step 4.

Step 4. Optimize node times for the k best trees found by the algorithm, and deter­

mine the ML estimate.

2.3 Simultaneous Estimation of the Tree and the Substitu­ tion Model Parameters

In the preceding discussion, we assumed that the parameters in the various sub­ stitution models in Section 2.1 (e.g., fi, k or K. and the tTj’s) were known. These values are generally not known, and are usually estimated from the data in some manner and then fixed during the tree search procedure. Thus it is of interest to provide simultaneous estimates of the tree and the substitution model parameters.

We propose a modification of the algorithm described above to include estimation of the substitution model parameters as part of the tree estimation procedure.

Before describing this modification, we demonstrate that proper estimation of these parameters may be essential to the determination of which tree is the maximum likelihood tree. Consider the estimation of the transition/ transversion parameter, K, in the F84 model. Figure 2.2 plots the value of the log likelihood against the value of the transition/trans version ratio, R, (which is a function of K) for a contrived data set of four taxa (see Appendix A). We notice that for small values of R {R < ~ 2.1),

Tree 1 is the ML tree, while for larger values of R, Tree 2 is the ML tree. We can demonstrate a similar effect with a real data set consisting of 14 mitochondrial DNA sequences (see Section 5.1.2 for a description of the data). Figure 2.3(a) shows that for small values of R (R between 0.5 and 0.54), Tree 2 is preferred, while for larger values of R (R > 2), Tree 1 is the ML tree (Figure 2.3(b)). An even more convincing example is given by a 30-sequence real data set consisting of papillomavirus sequences

(see Section 5.1.3 for a description of the data). When the two trees in Figure 2.4 are examined for R = 1.12, the tree in (a) has a much higher log likelihood (-27,756.50) than the tree in (b) (-27,768.14). However, if R is changed to 2.0, the tree in (a) will have log likelihood -27,976.52, while the log likelihood of the tree in (b) is -27,977.63.

Thus the tree in (a) would be quite strongly preferred if R = 1.12, while the tree in (b) would be preferred if R = 2.0. While it does seem that estimation of the tree is somewhat robust to the specification of the parameters in the substitution models (see, for example, [25] and [28]), proper estimation of these parameters can be important.

2.3.1 Estimation of the Nucleotide Frequency Parameters

Two of the four models described in Section 2.1, the F84 model and the HKY85 model, include nucleotide frequency parameters, π_i for i = A, G, C, or T. These parameters are commonly estimated by their empirical frequency in the data set, i.e.,

$$\hat{\pi}_A = \frac{\text{number of occurrences of nucleotide A}}{\text{total number of nucleotides in the data set}},$$

where the total number of nucleotides is just the product of the number of sites and the number of sequences. Several authors (e.g., [19] and [60]) state that estimates obtained in this way are "close" to the ML estimates. Therefore, we will generally


Figure 2.2: The log likelihood for Trees 1 and 2 for various values of the transition/transversion ratio for a theoretical data set.

be content to use these empirical estimates for the $\pi_i$'s. However, the program for the SSA contains an option to obtain ML estimates of these parameters using a derivative-free estimation procedure developed by M.J.D. Powell [53, 54]. The estimation method assumes fixed values of the other parameters (i.e., we assume that the tree, the node times, and the values of the remaining substitution model parameters are fixed). When this option is selected, the $\pi_i$'s are estimated at the beginning of the algorithm, and then held fixed throughout the tree search procedure.
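Both options can be illustrated with a short sketch. The empirical estimates follow directly from the counts in the alignment; the ML refinement is shown with SciPy's general-purpose Powell optimizer as a stand-in for the IMSL routine actually used, so the optimizer call, the `log_likelihood` callable, and the weight reparameterization are illustrative assumptions rather than the dissertation's code.

```python
from collections import Counter
import numpy as np
from scipy.optimize import minimize

BASES = "ACGT"

def empirical_frequencies(sequences):
    """Empirical pi_i: count of each base over all sites and all sequences."""
    counts = Counter(b for seq in sequences for b in seq if b in BASES)
    total = sum(counts[b] for b in BASES)
    return np.array([counts[b] / total for b in BASES])

def ml_frequencies(log_likelihood, pi_start):
    """Derivative-free refinement of pi (tree, node times, other parameters held fixed).

    log_likelihood(pi) is a user-supplied callable.  The frequencies are passed through
    unnormalized positive weights so the optimizer works on an unconstrained scale; this
    is an illustrative device, not Powell's own constrained formulation.
    """
    def neg_loglik(w):
        pi = np.exp(w) / np.exp(w).sum()       # map weights to a probability vector
        return -log_likelihood(pi)

    res = minimize(neg_loglik, np.log(pi_start), method="Powell")
    w = res.x
    return np.exp(w) / np.exp(w).sum()
```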

2.3.2 Estimation of Other Substitution Model Parameters

Before describing the methods we use to estimate the other substitution model parameters (i.e., fs and K) we discuss some of the other methods that can be used to estimate these parameters. First, we note that in the context of phylogenetic tree


Figure 2.3: The log likelihood for Trees 1 and 2 for values of the transition/transversion ratio between 0.5 and 0.54 (a) and between 1.0 and 10.0 (b) for the 14-sequence mtDNA data set.

Figure 2.4: Two trees that may each be the ML estimate for differing values of R for the 30-sequence HPV data set. If R is 1.12, the tree in (a) is preferred. If R is 2.0, the tree in (b) is preferred.

estimation, the parameter $\mu$ cannot be estimated separately from the node times, in

the case of rooted trees, or from the branch lengths, in the case of unrooted trees. In

our stochastic search algorithm described above for rooted trees, we fixed the total

time of the tree to be 1.0, which then allowed estimation of the parameter $\mu$. Most

other phylogenetic programs, e.g., PHYLIP [19] and PAUP* [60], fix the parameter

$\mu$ according to

$$\mu = \frac{1}{2\pi_R\pi_Y(1+R)},$$

where R is the transition/transversion ratio which, along with the nucleotide frequency parameters, determines both K in the F84 model and $\kappa$ in the HKY85 model.

They then estimate either the total time of the tree, in the case of rooted trees, or

branch lengths, in the case of unrooted trees, using the Newton-Raphson method.

This procedure is equivalent to our approach, e.g., an estimate obtained from one of

these programs can be made comparable to an estimate obtained from SSA by simply

rescaling the node times so that the total time of the tree is 1.0 and then adjusting

$\mu$ accordingly.
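As a concrete illustration of that equivalence, the following sketch rescales a set of node times so that the total time of the tree is 1.0 and adjusts the rate so that the products $\mu t$, and hence the transition probabilities, are unchanged. The function and variable names are illustrative only.

```python
def rescale_to_unit_tree(node_times, mu):
    """Rescale node times so the total (root-to-tip) time is 1.0.

    Multiplying mu by the same factor leaves every product mu * t, and therefore
    every transition probability, unchanged, so the rescaled estimate is directly
    comparable to one produced by the SSA.
    """
    total_time = max(node_times.values())          # assumed root-to-tip depth of the tree
    scaled_times = {node: t / total_time for node, t in node_times.items()}
    scaled_mu = mu * total_time
    return scaled_times, scaled_mu

# Example: a tree of total depth 0.25 estimated with rate 1.0 becomes a depth-1.0 tree
# with rate 0.25, leaving all expected numbers of substitutions unchanged.
times, mu = rescale_to_unit_tree({"root": 0.0, "node8": 0.1, "tip1": 0.25}, mu=1.0)
```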

The classical method of estimating the transition/transversion ratio R (and there­

fore of estimating K in the F84 model or k in the HKY85 model) is to consider the

ratio of the observed number of transitions to the observed number of transversions, with the idea that these observed numbers of transitions and transversions should reflect the true rates of transitional and transversional changes. Various methods can

be used to count the numbers of transitions and transversions. One is that pairs of sequences can be considered, and a ratio of the numbers of transitions to transver­ sions can then be constructed for each pair. However, it is not clear how to combine each of these pairwise estimates into an overall estimate of R for the entire data set

under consideration. An alternative method of counting the numbers of transitions and transversions is to use parsimony to reconstruct the states at the internal nodes for a particular tree. For each site in the data set, the number of transitions and transversions is then the average of the numbers observed for each most parsimonious reconstruction of the states at the internal nodes for that site. We note that this method requires that the topology be specified prior to estimation of R. See

Wakeley [62] for examples of these two methods.

Wakeley [62, 61] points out that when the sequences under consideration are substantially diverged from one another, both the pairwise method and the parsimony method give underestimates of the transition/transversion ratio. Even when recently diverged sequences are considered, the estimate of the transition/transversion ratio can be biased and quite variable [62, 61]. Furthermore, failure to take rate variation among sites into consideration can also result in underestimation of the transition/transversion ratio [62, 61, 67].

Several authors [52, 68, 55, 28] have proposed alternative methods of estimating the transition/transversion ratio. Pollock and Goldstein's [52] method is a modification of the pairwise method mentioned above which involves taking a particular weighted average of the ratios calculated for all pairs of sequences in the data set. Ina

[28] also modified the pairwise method, by using Haldane’s correction for each of the pairwise ratios obtained from the data. Variances for these proposed estimates are also given. Purvis and Bromham [55] used iterative-weighted least-squares regression to estimate the instantaneous transversion rate for a particular data set, assuming

that the topology and branch lengths of the tree representing the evolutionary relationships among the sequences were known. Yang and Kumar [68] proposed a modification of the parsimony method described above to correct the observed numbers of transitions and transversions for multiple substitutions.

Maximum likelihood has also been used previously [66, 56, 63, 60, 64] to estimate the transition/transversion ratio. Swofford's program PAUP* [60], Yang's program PAML [66], and Rambaut's program SPOT [56] will calculate maximum likelihood estimates of the transition/transversion ratio when the topologies and branch lengths are fixed. PAUP* and PAML then incorporate estimation of the transition/transversion ratio into the tree search procedure by periodically estimating this parameter for the current tree. However, both of these methods have the disadvantage that when the number of sequences under consideration is large, they have difficulty in estimating the maximum likelihood tree (see Chapter 5 for examples). Given the examples above which show that estimation of the tree and the transition/transversion ratio are related problems, we might be suspicious of estimates obtained in this manner. SPOT, likewise, iterates between tree estimation and parameter estimation by calling the program fastDNAML [49], and will subsequently have the same difficulties as PAUP* and PAML.

The method we propose here also involves iterating between parameter estimation and tree estimation. However, we will see in Chapter 5 that the SSA proposed in

Section 2.2 is better able to estimate the tree for large data sets. We might therefore expect the method to be better able to provide a simultaneous estimate of the tree and the parameters. We will introduce several modifications of the algorithm to allow for simultaneous estimation of the tree and the parameters $\mu$ and K in the

F84 model and its associated sub-models. The general idea is that the algorithm will

periodically pause in its cycling of the steps described at the end of Section 2.2 to

estimate these two parameters simultaneously using the Newton-Raphson method,

for the current values of the topology and node times. The details of the Newton-

Raphson calculations are given in Appendix B. The specific modifications of the

algorithm will be described below.

We begin by estimating $\mu$ and K prior to the start of the algorithm. These initial

estimates are then used as the values of the parameters as we iterate through Steps

1 - 3 of the algorithm. At each iteration, we compare the current value of the log

likelihood with the log likelihood that resulted from the last estimation of $\mu$ and

K. When the difference between the current log likelihood and the log likelihood

at the time of last estimation of these parameters exceeds p % of the value of the

log likelihood at the time of last estimation, we pause to estimate the values of the

parameters again before continuing the iterations of the algorithm. The idea is that

the parameters $\mu$ and K do not have so large an effect on the log likelihood that they

need to be estimated at each iteration. Instead, it is sufficient to estimate them only when the log likelihood has changed substantially since they were last estimated.
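The re-estimation trigger just described can be expressed compactly. The sketch below assumes that the log likelihoods are negative (as in the examples in this dissertation) and compares the absolute change against p% of the magnitude of the log likelihood at the time of the last estimation; the function name is a placeholder.

```python
def should_reestimate(current_loglik, loglik_at_last_estimate, p=0.25):
    """Return True when the log likelihood has drifted by more than p percent
    of its value at the time mu and K were last estimated."""
    change = abs(current_loglik - loglik_at_last_estimate)
    return change > (p / 100.0) * abs(loglik_at_last_estimate)

# Example: with p = 0.25, a drift from -27,768 to -27,680 (88 units, about 0.32%)
# would trigger a re-estimation of mu and K.
```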

Once the stopping criteria have been satisfied, we optimize the branch lengths of the k best trees found by the algorithm. We perform this optimization under two conditions: $\mu$ and K fixed at their current values, and $\mu$ and K estimated simultaneously with the node times. We then set our current values of the tree, node times, and parameters to the best estimates obtained in this initial search, and then perform a shorter version of the search to ensure that with our newly updated values there is no tree with a higher log likelihood than that of the current estimates. Therefore, we

must give some consideration to how to set the parameters of the cooling schedule so

that this second iteration through the steps of the SSA provides an efficient search.

We have chosen to keep the $\beta$ and U values in the cooling schedule the same and to

determine the starting value $c_0$ by letting

$$P(\text{accept } \Delta L) = \begin{cases} p_1, & \text{if tree 1 is the same as tree 2,} \\ p_2, & \text{if tree 1 is different than tree 2,} \end{cases} \qquad (2.20)$$

where tree 1 is the tree with the maximum log likelihood (of the k best trees) after

optimization of node times with the current parameter values, and tree 2 is the tree

with the maximum log likelihood (of the k best trees) after simultaneous optimization

of node times and parameters. Generally, $p_2 > p_1$, since we would want to cool more slowly if simultaneous optimization of parameter values for the given trees changed which tree would be considered the ML tree. If the two trees are the same, then we can cool more quickly, since we feel more confident that we have already located the

ML tree.

In the formulation described above, p, $p_1$, $p_2$, and $\Delta L$ are all quantities which must be specified. For all of the problems we have considered, including those in Chapter

5, p = 0.25, $p_1$ = 0.8, $p_2$ = 0.9, and $\Delta L$ = 3.0.
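Equation 2.20 fixes the starting control parameter implicitly. Assuming the simulated-annealing acceptance form $\exp(-\Delta L / c_0)$ used elsewhere in the algorithm, $c_0$ is the value that makes a change of $\Delta L$ acceptable with probability $p_1$ or $p_2$, i.e., $c_0 = -\Delta L / \ln p$. The sketch below simply encodes that relationship.

```python
import math

def starting_control_parameter(delta_L=3.0, same_tree=True, p1=0.8, p2=0.9):
    """Solve exp(-delta_L / c0) = p for c0 (Equation 2.20).

    p1 applies when the two optimized trees agree (cool quickly);
    p2 applies when they differ (cool more slowly).
    """
    p = p1 if same_tree else p2
    return -delta_L / math.log(p)

# With the defaults delta_L = 3.0: c0 is about 13.4 when the trees agree (p1 = 0.8)
# and about 28.5 when they differ (p2 = 0.9), so disagreement starts the search hotter.
```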

To summarize, the stochastic search algorithm to simultaneously estimate the tree and the parameters $\mu$ and K in the F84 substitution model consists of the following steps.

Step 1. From tree $\tau_i$, generate candidate tree $\tau^*$ using the local rearrangement strategy.

Step 2. If $L(\tau^*) > L(\tau_i)$, set $\tau_{i+1} = \tau^*$. If $L(\tau^*) < L(\tau_i)$, set $\tau_{i+1} = \tau^*$ with probability $\exp\{-(L(\tau_i) - L(\tau^*))/c_i\}$.

Step 3. Update the value of the control parameter, $c_i$, and set i to i + 1.

(a) If the log likelihood has changed by more than p % since the parameters

were last estimated, estimate the parameters $\mu$ and K.

(b) If the stopping criteria are not satisfied, return to Step 1. If the stopping

criteria are satisfied, return the k best trees found by the algorithm and

proceed to Step 4.

Step 4. Optimize node times for the k best trees found by the algorithm.

Step 5. Optimize node times and parameters for the k best trees found by the algo­

rithm. Set cooling schedule parameters using Equation 2.20 and set i to 0.

Set the current tree, $\tau_i$, to the best tree found by the algorithm.

Step 6. From tree $\tau_i$, generate candidate tree $\tau^*$ using the local rearrangement strategy.

Step 7. If $L(\tau^*) > L(\tau_i)$, set $\tau_{i+1} = \tau^*$. If $L(\tau^*) < L(\tau_i)$, set $\tau_{i+1} = \tau^*$ with probability $\exp\{-(L(\tau_i) - L(\tau^*))/c_i\}$.

Step 8. Update the value of the control parameter, $c_i$, and set i to i + 1.

(a) If the log likelihood has changed by more than p % since the parameters

were last estimated, estimate the parameters $\mu$ and K.

(b) If the stopping criteria are not satisfied, return to Step 6. If the stopping

criteria are satisfied, return the k best trees found by the algorithm and

proceed to Step 9.

Step 9. Optimize node times and parameters for the k best trees found by the algo­

rithm, and return the ML estimates.

2.4 Computer Implementation

We discuss here several components of the computer implementation of the algo­

rithm which contribute to its efficiency. We begin by noting the method we use for

computer storage of phylogenetic trees.

Let the external nodes of the tree be numbered 1 through n, where n is the number

of taxa to be placed in the tree. Let the root node be numbered n + 1, and assign the

numbers n + 2 through 2n − 1 to the other internal nodes of the tree. Note that we

consider here only bifurcating trees, so that every node has either two descendents

or no descendents. The tree topology is stored using what is called the three-row

representation. The three-row representation consists of a 3 × (n − 1)-dimensional

matrix. The first row of the matrix contains the numbers n + 1 through 2n − 1, which correspond to the ancestral nodes within the tree. The second two rows contain the numbers of the nodes of the two descendents of the nodes listed in the first row. As an example, the three-row representation for the tree in Figure 1.1(a) is shown below.

 8  9 10 11 12 13
 9  1  3 12  5  2
11 10  7  4 13  6

The three-row representation is convenient for several reasons. First, it is quite efficient in terms of memory, since we really need only store the bottom two rows of the matrix. Secondly, any two trees which differ from one another by only a single local rearrangement will have three-row representations which only differ in that two of the

columns have been switched. This makes the three-row representation a convenient

format for the tree when the local rearrangement strategy is being applied.

The three-row representation, however, is not a unique representation for the tree.

Therefore, we must transform the three-row representation into a form which does provide a unique representation of the tree so that we can save the k best unique trees found by the algorithm. To uniquely represent a phylogenetic tree, we use the

internodal distance matrix (IND) representation. The IND representation consists of an n × n-dimensional matrix whose $(i,j)^{th}$ element represents the number of nodes separating nodes i and j. The IND matrix for the tree in Figure 1.1(a) is shown below.

[7 × 7 IND matrix of internodal distances for the tree in Figure 1.1(a)]
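A short sketch makes the two representations concrete. The counting convention assumed below — the number of internal nodes strictly between two nodes on the path connecting them — is one reasonable reading of "number of nodes separating nodes i and j"; the original program may count slightly differently, and the helper names are illustrative.

```python
import numpy as np

# Three-row representation of the 7-taxon tree in Figure 1.1(a):
# row 0 holds the ancestral nodes n+1..2n-1, rows 1-2 hold their two descendants.
THREE_ROW = np.array([[ 8,  9, 10, 11, 12, 13],
                      [ 9,  1,  3, 12,  5,  2],
                      [11, 10,  7,  4, 13,  6]])

def parent_map(three_row):
    """Map each node to its parent (the root has no entry)."""
    parents = {}
    for ancestor, left, right in three_row.T:
        parents[left] = parents[right] = ancestor
    return parents

def path_to_root(node, parents):
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def ind_matrix(three_row, n_taxa):
    """Internodal distances between the external nodes 1..n_taxa."""
    parents = parent_map(three_row)
    ind = np.zeros((n_taxa, n_taxa), dtype=int)
    for i in range(1, n_taxa + 1):
        for j in range(i + 1, n_taxa + 1):
            pi, pj = path_to_root(i, parents), path_to_root(j, parents)
            shared = len(set(pi) & set(pj))            # nodes common to both root paths
            # internal nodes strictly between i and j (assumed convention)
            separating = (len(pi) - 1) + (len(pj) - 1) - 2 * shared + 1
            ind[i - 1, j - 1] = ind[j - 1, i - 1] = separating
    return ind
```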

The method we use to store and update the likelihood function is also worth noting. As mentioned in Section 2.2.1, we store the conditional likelihood given each of the four nucleotides for every node of the tree at every iteration of the algorithm.

This simplifies the overall computation in two ways. First, we note that following local rearrangement, only those nodes which are ancestral to the target node need to have conditional likelihoods recalculated - the conditional likelihood for all descendents of the target and for nodes which share no ancestors except the root with the target will remain unchanged. Second, information from the stored conditional likelihoods

is used in the computation of the NR1 estimates described in Section 2.2.1 (see also

Appendix B).
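The saving from this storage scheme can be seen in a small sketch of the update step. The sketch assumes a Jukes-Cantor transition probability and per-node arrays of conditional likelihoods indexed by nucleotide; only the target node's ancestors are revisited after a rearrangement. The data structures and function names are illustrative, not the dissertation's implementation.

```python
import math
import numpy as np

def jc_transition(t, mu=1.0):
    """4x4 Jukes-Cantor transition probability matrix for elapsed time t."""
    e = math.exp(-4.0 * mu * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return np.full((4, 4), diff) + np.eye(4) * (same - diff)

def recompute_ancestors(target, parents, children, times, cond_lik, mu=1.0):
    """Refresh stored conditional likelihoods only along the target-to-root path.

    cond_lik[node] is a (sites x 4) array; times[node] is the node's time.
    Likelihoods for all other nodes are left untouched, as described in the text.
    """
    node = parents.get(target)
    while node is not None:
        result = np.ones_like(cond_lik[node])
        for child in children[node]:
            branch = abs(times[child] - times[node])
            P = jc_transition(branch, mu)
            # for each site, sum over the child's states weighted by the branch's
            # transition probabilities, then multiply across the two children
            result *= cond_lik[child] @ P.T
        cond_lik[node] = result
        node = parents.get(node)
    return cond_lik
```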

We next note the computational differences between the substitution models described in Section 2.1. Although both the F84 model and the HKY85 model are

scribed in Section 2.1. Although both the F84 model and the HKY85 model are

five parameter models, the F84 model is computationally simpler than the HKY85

model, which is why it is the model we have used in our estimation of parameters.

The simplification comes from the fact that the F84 model does not include nucleotide

frequency parameters in the exponential terms in the transition probabilities, as does

the HKY85 model. This simplifies computation of both transition probabilities used

in calculation of the likelihood and derivatives of the transition probabilities, which appear in many of the expressions in Appendix B.

Finally, we note that several other packages and programs are used during the algorithm. Generation of random variables from a specified beta distribution is accomplished through the "imsls_f_random_beta" subroutine in IMSL. Derivative-free estimation of the $\pi_i$'s is done using the "imsl_f_min_con_gen_lin" subroutine in IMSL.

Estimation of node times alone or node times and parameters together is done by calling the program SPOT [56], which has been modified to make it compatible with

SSA. Note, however, that estimation of parameters for a given topology and node times is done using the Newton-Raphson method, as shown in Appendix B.

CHAPTER 3

A SIMULATED ANNEALING ALGORITHM FOR ESTIMATION OF PHYLOGENETIC TREES UNDER THE PARSIMONY CRITERIA

Before describing our proposed simulated annealing algorithm, we briefly explain how the number of evolutionary changes required by a phylogenetic tree is calculated according to the parsimony criteria. We also describe other approaches to the problem of finding the most parsimonious (MP) tree using simulated annealing, including those by Felsenstein and Barker [4]. We then describe our algorithm and give the details of its implementation.

3.1 The Parsimony Criteria

As mentioned in Chapter 1, parsimony is a very commonly used method of phylogenetic tree reconstruction. The concept of parsimony in general is that simpler hypotheses should be preferred over more complex hypotheses. In the context of the reconstruction of phylogenetic trees, "simplicity" is translated to mean the "least amount of evolutionary change". Thus trees that minimize the total amount of evolutionary change for a given data set are preferred, and the tree with the minimum amount of evolutionary change is the MP tree. We note that there is not necessarily one MP tree, since it is often the case that two or more trees may share the same

minimal length. For this reason, it is common to report a set of MP trees, rather than a single MP topology.

Parsimony methods involve several assumptions. Like maximum likelihood, parsimony assumes that sites evolve independently of one another and that lineages evolve independently of one another. Parsimony also assumes that the probability of a nucleotide change at a given site is small over the length of time represented by the branches of the tree. Finally, it is assumed that the expected amount of change does not vary over different branches of the tree or over different sites in the sequence [19].

To use the parsimony method for nucleotide sequence data, we must first define how the amount of evolutionary change required by a given tree, which is often called the length of the tree, will be calculated. The most common definition is to add one to the length of the tree whenever a nucleotide substitution would be required to explain the data. We note that since parsimony assumes that sites are independent, the length of a given tree can be found by calculating the length for each individual site and then adding the lengths across sites. For an individual site, the following algorithm can then be used to calculate the length of a given tree. First, represent the state at each external node by a set containing only the actual observed state at that node. Then, the set at each internal node in the tree is the intersection of the two sets for the node’s two immediate descendants if the intersection is nonempty; otherwise, it is the union of the two sets for the node’s immediate descendants. Once sets have been formed for each of the internal nodes on the tree, the length of the tree is simply the sum of the number of sets which were formed by taking unions rather than intersections, since unions represent the event that an evolutionary change must have occurred [38].
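The set-intersection procedure just described (essentially Fitch's algorithm) is easy to state in code. The sketch below assumes a rooted bifurcating topology given as a children map and scores one site at a time; the names are illustrative.

```python
def site_length(children, root, leaf_states):
    """Number of unions (i.e., required substitutions) for one site, per the
    intersection/union rule described in the text."""
    changes = 0

    def state_set(node):
        nonlocal changes
        if node in leaf_states:                       # external node: observed state
            return {leaf_states[node]}
        left, right = (state_set(c) for c in children[node])
        common = left & right
        if common:
            return common
        changes += 1                                  # a union means one substitution
        return left | right

    state_set(root)
    return changes

def tree_length(children, root, alignment):
    """Parsimony length: sum of per-site lengths (sites are independent)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        site_length(children, root, {taxon: seq[site] for taxon, seq in alignment.items()})
        for site in range(n_sites)
    )

# Toy usage: a four-taxon tree ((1,2),(3,4)) with states A, A, C, C at one site needs one change.
children = {5: (1, 2), 6: (3, 4), 7: (5, 6)}
print(site_length(children, 7, {1: "A", 2: "A", 3: "C", 4: "C"}))   # -> 1
```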

1 HPV16   ATGTGGCTGCCTAG GAGGCCACTGTCTACTTGCCTCCTGTCCAGTATCTAAGGTTG
2 HPV35h  ATGTGGCGGTCTAA GAAGCCACTGTCTACCTGCCTCCAGTTCAGTGTCTAAGGTTG
3 HPV31   ATGTGGCGGCCTAG GAGGCTACTGTCTACTTACCACCTGTCCAGTGTCTAAAGTTG
4 HPV52   ATGTGGCGGCCTAG GAGGCCACTGTGTACCTGCCTCCTGTCCTGTCTCTAAGGTTG
5 HPV33   ATGTGGCGGCCTAG GAGGCCACAGTGTACCTGCCTCCTGTCCTGTATCTAAAGTTG
6 HPV58   ATGTGGGGGCCTAG GAGGCCACTGTGTACCTGCCTCCTGTCCTGTGTCTAAGGTTG
7 RhPV1   ATGTGGCGGCCTAG GACTCCAAGGTCTACCTACCACCTGTCCTGTGTCTAAGGTGG

Table 3.1: Data from Table 1.1 with site 15 highlighted

To demonstrate the calculation of the length of a phylogenetic tree under the parsimony criteria, consider the example data set from Chapter 1 (see Table 1.1).

This data is reproduced in Table 3.1, with site 15 highlighted. Since we assume that the sites evolve independently of one another, each site may be considered separately in the calculation of the length. Figure 3.1 shows the calculation of the length for site 15 according to the algorithm described above for two possible trees. Based on the data for this site alone, the topology in Figure 3.1(b) is the preferred topology of the two shown.

From the algorithm discussed above, we note several things. First, the length of the tree is not affected by whether the tree is considered to be rooted or unrooted.

The second is that some of the sites in the sequences will be noninformative in the sense that they do not favor one topology over any of the others. For example, in the data shown above, the first 14 sites are all noninformative. Noninformative sites may be omitted from the parsimony analysis, although their inclusion will not change the determination of which trees are MP (including these sites may change the length of the tree, however). Third, we note that this algorithm gives us a method only of evaluating the length of a given topology for the sequences under consideration. It

(a) Length = 2    (b) Length = 1

Figure 3.1: Lengths of two possible trees for site 15 of the example data in Table 3.1. The tree in (b) is preferred by the parsimony criteria.

does not provide a method for finding the topologies that are MP. Ideally we would calculate the length of all possible topologies and then select the ones which have minimum length to be the MP topologies.

Several of the currently available phylogenetic programs include procedures for estimation of phylogenetic trees under the parsimony criteria. The two most popular are the package PAUP* [60] and the program DNAPARS in the PHYLIP package [19].

The methods used by these algorithms are similar to those used for the estimation of trees under the maximum likelihood criteria, except that in the case of parsimony the length is minimized, rather than the likelihood being maximized. Although these programs are much quicker in the case of parsimony than for maximum likelihood, they may require long run times for very large data sets. The far more important problem in this case, however, is that the tree-building strategy used quite often leads

to entrapment in local optima, as was the case for the maximum likelihood problem.

This will be explored in more detail in Section 5.3.

3.2 Estimating the Most Parsimonious Tree(s) Using Simulated Annealing

In addition to the simulated annealing algorithm that will be proposed here, sev­ eral other attempts at estimating the set of most parsimonious trees using the general simulated annealing framework have been made. We begin by briefly reviewing some of these.

In early releases of his phylogenetic analysis package PHYLIP, Felsenstein included a simulated annealing type algorithm for estimation of phylogenetic trees under the parsimony criteria, called METRO (presumably because the method was derived from the Metropolis algorithm). The details of this algorithm were never published, and the program is now obsolete. However, a review of the program given in [20] states that METRO was both very time-consuming and did not generally yield trees with shorter lengths than those found by other programs for the data sets examined by the author. Felsenstein cites both of these shortcomings as reasons that METRO was excluded from future releases of PHYLIP [19].

A more recent simulated annealing algorithm for estimation of trees under the parsimony criteria is the program LVB (which was named in honor of Ludwig van

Beethoven) written by Daniel Barker [4]. LVB implements the simulated annealing algorithm using a cooling schedule for which the sequence of control parameters is not adjusted at every step in the algorithm. Rather, the value of the control parameter is held fixed until either a bound on the number of trees accepted by the algorithm is reached or a bound on the number of trees proposed by the algorithm is reached. The

control parameter is then lowered and once again held fixed until one of these criteria is satisfied. This procedure continues until the tree at the time the control parameter value is changed remains the same in a certain number of successive changes in the control parameter. This is similar to the logarithmic cooling schedule of Kirkpatrick et al. [31].

Several other components of the implementation of the simulated annealing algo­ rithm in LVB are of interest. One is the method that is used to generate candidate trees for the algorithm. LVB uses a strategy in which an external branch in the tree

(i.e., a branch which links an external and an internal node) is selected at random from the tree, disconnected from the tree, and then re-attached at a new random point within the tree. This is what is called the pendant branch transition strategy by Li [36]. A second point worth noting is that LVB uses the homoplasy index [60], instead of the length of the tree, in the criteria for determining whether or not a tree will be accepted by the algorithm. The homoplasy index is the ratio of the length of the tree with no homoplasy to the observed length of the tree, subtracted from 1 .

The index is a number between 0 and 1, with the property that the index will be smaller for shorter trees.
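As a small illustration of the index, the sketch below computes it from per-site lengths, assuming (as is standard) that the length of a tree with no homoplasy is the sum over sites of the number of distinct observed states minus one; the function name is illustrative.

```python
def homoplasy_index(site_lengths, site_states):
    """1 - (no-homoplasy length / observed length) for a given tree.

    site_lengths: observed parsimony length at each site on the tree.
    site_states:  the set of distinct nucleotides observed at each site.
    """
    observed = sum(site_lengths)
    minimum = sum(len(states) - 1 for states in site_states)   # assumed no-homoplasy length
    return 1.0 - minimum / observed

# Example: observed site lengths (2, 1, 3) with 2, 2, and 3 distinct states give
# 1 - 4/6 = 1/3; a shorter (better) tree drives the index toward 0.
```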

The author of LVB [4] gives some conditions under which the program might be useful, stating that LVB is helpful only when the number of sequences under consid­ eration is so large that the algorithms currently in use (e.g., PAUP* and DNAPARS in PHYLIP) are likely to fail. He further states that the performance of the algorithm may be “poor” and “inefficient” when the number of sequences is less than 50.

Two authors have applied the simulated annealing algorithm to the Steiner tree problem, which is similar to the parsimony problem. Dress and Kruger [12] use a

cooling schedule similar to the logarithmic cooling schedule of Kirkpatrick et al. [31].

Lundy [40] proposed a new cooling schedule. Some of the theory underlying a similar family of cooling schedules is then provided in a subsequent paper [41]. The cooling schedule of Lundy and Mees [41] is the basis for the cooling schedule we have chosen to implement for both the stochastic search algorithm in the case of the maximum likelihood criteria and the simulated annealing algorithm for the parsimony criteria, which will be described below.

3.2.1 A New Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s)

We have also implemented a simulated annealing algorithm for estimation of the set of most parsimonious phylogenetic trees. This algorithm is similar to the algorithm used for estimation of phylogenetic trees under the maximum likelihood criteria in

Chapter 2. As in Chapter 2, we let $\tau_n$ be the set of all rooted topologies with n external nodes. The components of the algorithm are described below.

The Generation Scheme

The generation scheme used here is similar to that used in the stochastic search strategy for estimation of the maximum likelihood tree in Section 2.2.1. In particular, the first two steps described there are used to generate candidate topologies. We note that branch lengths are not used in the parsimony criteria, so that the third step in the local rearrangement strategy in Section 2.2.1 is unnecessary. We note also that identical topologies will have identical lengths, and therefore we consider only moves which actually change the topology, i.e., at every application of the local

49 rearrangement strategy, we generate either the tree in Figure 2.1(b) or the tree in

Figure 2.1(c) if we start with the tree in Figure 2.1(a).

There is one additional modification of the strategy that is used to generate a new candidate topology in this case. Prior to the application of the local rearrangement strategy, a random number, x, is drawn from a Poisson distribution with parameter $\lambda$.

This number determines the number of times that the local rearrangement strategy will be applied before the new proposal tree is obtained, i.e., the local rearrangement strategy will be applied x + 1 times. The motivation behind this process is that allowing the local rearrangement strategy to be applied more than once at each iteration allows for better mixing in topology space, and helps to prevent the algorithm from becoming trapped or from bouncing back and forth between trees with only locally minimum lengths. The parameter $\lambda$ can be set by the user. The default is to set $\lambda$ = 0.28, so that most often only one local rearrangement is performed, but sometimes two or more local rearrangements are used.

The Cooling Schedule

As in the estimation of phylogenetic trees under the maximum likelihood criteria, the cooling schedule is modeled after that used by Lundy [40] and Lundy and Mees

[41]. The updating of the control parameter is accomplished as in (2.15), with U now representing an upper bound on the change in the length in one application of the generation scheme. The interpretation of the $\beta$ parameter remains the same. As in the former case, U is estimated during an initial burn-in period at the start of the algorithm, and the initial value of the control parameter, $c_0$, is set to U.

The parameter $\beta$ is determined in a manner similar to (2.16), using

$$\beta = \frac{d}{(1-a)\,n + a\,\ell},$$

where n is the total number of sequences, m is the number of sites, d and a are

between 0 and 1, and $\ell$ is the length of the UPGMA tree [47]. The motivation for

setting $\beta$ in this way is the same as in the case of maximum likelihood estimation -

that more difficult problems require smaller values of $\beta$, and that two indications of

the difficulty of a problem are its size (n) and the amount of phylogenetic information

the problem contains ($\ell$). In implementing this cooling schedule, we generally set

d = 0.10 and a = 0.25.

The Stopping Rule

The stopping rule used here is identical to that described in Section 2.2.3, namely

that we stop when a bound on the number of iterations since the proposal of an

“uninvestigated” topology is reached. The default settings for this stopping rule are the same as those specified in that section.

The steps involved in the simulated annealing algorithm for estimation of the set of most parsimonious trees can be summarized as follows. Let $\ell(\cdot)$ represent the length, and let $\tau_i \in \tau_n$ and $\tau^* \in \tau_n$. The simulated annealing algorithm consists of the following three steps:

Step 1. From tree $\tau_i$, generate candidate tree $\tau^*$ by generating an observation, x, from a Poisson($\lambda$) distribution and applying the local rearrangement strategy x + 1 times.

Step 2. If $\ell(\tau^*) < \ell(\tau_i)$, set $\tau_{i+1} = \tau^*$. If $\ell(\tau^*) > \ell(\tau_i)$, set $\tau_{i+1} = \tau^*$ with probability $\exp\{-(\ell(\tau^*) - \ell(\tau_i))/c_i\}$.

Step 3. Update the value of the control parameter, $c_i$, and set i to i + 1. If the

stopping criteria are not satisfied, return to Step 1. If the stopping criteria

are satisfied, return the k best trees found by the algorithm.
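The three steps can be sketched as a single loop. The sketch below treats the tree length, the local rearrangement move, and the stopping rule as user-supplied callables, draws the number of rearrangements from a Poisson($\lambda$) distribution as described above, and uses a Lundy-and-Mees-style update $c \leftarrow c/(1+\beta c)$ for the control parameter; that update, like the helper names, is an assumption standing in for the schedule defined in (2.15).

```python
import math
import random

def poisson_draw(lam):
    """Draw from a Poisson(lam) distribution (Knuth's simple method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def parsimony_annealing(tree0, c0, beta, length, rearrange, stop, lam=0.28, k_best=5):
    """Minimal sketch of the three-step simulated annealing search for MP trees."""
    tree, c, i = tree0, c0, 0
    best = []                               # k_best shortest (length, tree) pairs seen
    while not stop(i):
        # Step 1: apply the local rearrangement x + 1 times, x ~ Poisson(lam)
        candidate = tree
        for _ in range(poisson_draw(lam) + 1):
            candidate = rearrange(candidate)
        cur, new = length(tree), length(candidate)
        # Step 2: accept shorter trees; accept longer ones with prob exp(-(new-cur)/c)
        if new < cur or random.random() < math.exp(-(new - cur) / c):
            tree = candidate
        best = sorted(best + [(new, candidate)], key=lambda p: p[0])[:k_best]
        # Step 3: lower the control parameter (assumed Lundy-and-Mees-style update)
        c, i = c / (1.0 + beta * c), i + 1
    return [t for _, t in best]
```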

3.2.2 Computer Implementation

Much of the computer implementation for the simulated annealing algorithm discussed in this chapter is the same as that described in Section 2.4 for the stochastic search algorithm. We note additionally only that the "imsls_random_poisson" subroutine in IMSL is used to generate observations from a Poisson distribution with specified parameter $\lambda$.

CHAPTER 4

PROPERTIES OF THE ALGORITHMS

This chapter provides some convergence results for the algorithms discussed in

Chapters 2 and 3. It is shown that under certain conditions, each of the algorithms converges to the set of optimal trees. We begin with some notation.

A phylogenetic tree consists of a topology and a set of evolutionary splitting times for each of the nodes within that topology. Let $\tau_n$ be the set of all topologies with n external nodes. Under the assumption of a molecular clock as discussed in Chapter 2, the root node is assigned a time of 0 and the external nodes are each assigned a time of 1.0, so it remains only to specify the evolutionary splitting times of the other n − 2 internal nodes. Let t be an (n − 2)-dimensional vector containing the times of these internal nodes, and note that $t \in (0,1)^{n-2}$. A phylogenetic tree is then represented by a pair $(\tau, t)$, where $\tau \in \tau_n$ and $t \in (0,1)^{n-2}$. The space of phylogenetic trees having n external nodes is given by $S_n = \tau_n \times (0,1)^{n-2}$. The natural $\sigma$-field on $S_n$ is thus the product of the Borel fields on $\tau_n$ and $(0,1)^{n-2}$.

4.1 Convergence Results for the Stochastic Search Algorithm

In this section we give some convergence results for the stochastic search algorithm for the estimation of the maximum likelihood phylogenetic tree on the space $S_n$.

In particular, we show that the stochastic search algorithm generates a Harris recurrent Markov chain (see Definition 4.1.1) on the space, which leads to an invariant distribution for each fixed value of the control parameter. The limit of the invariant distribution is then examined as the control parameter decreases toward 0, and it is shown that this limit is a degenerate probability distribution concentrated on the set of optimal solutions. We begin by formally defining the Markov chain generated by the stochastic search algorithm.

Let x and y be arbitrary elements of the set $S = S_n$, and let $A \subset S$. For a fixed value, c, of the control parameter, define the transition kernel on the state space S by

$$K(x, A) = \int_A a(y \mid x)\, p(dy \mid x) \;+\; \mathbb{1}_A(x) \int_S \bigl(1 - a(y \mid x)\bigr)\, p(dy \mid x), \qquad (4.1)$$

where $a(\cdot \mid x)$ is the acceptance distribution, $p(\cdot \mid \cdot)$ is the generating distribution, and

$$\mathbb{1}_A(x) = \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.2)$$

The acceptance distribution gives the probability of accepting tree y generated from tree x, and is given by

$$a(y \mid x) = \exp\left(-\frac{\bigl(L(x) - L(y)\bigr)^+}{c}\right), \qquad (4.3)$$

where $L(\cdot)$ is the log likelihood function discussed in Chapter 2.

We note that every point x = (τ, t) in S is composed of a topology, τ, and a vector of node times, t, corresponding to that topology. The generation distribution is thus a "mixed" distribution, which consists of the product of a probability mass function on the discrete set of topologies and a continuous density on a single element of the vector of node times. As described in Section 2.2.1, a topology, $\tau^*$, is first obtained by randomly selecting a target node, $j^*$, and rearranging the current topology in the area of this target node. The new vector of node times, $t^*$, is then obtained by altering the time $t^*_{j^*}$ for node $j^*$. Let the allowable interval into which the node time must be placed be $(P^*, C^*)$, where $P^*$ is the time of the parent of the target node, and $C^*$ is the time of the closest of the two children in the new tree $y = (\tau^*, t^*)$. Here we consider instead placing the node time into the interval $[P^* + \varepsilon, C^* - \varepsilon]$ for some small $\varepsilon > 0$, which introduces the constraint that any two node times in the tree must be separated by a distance of at least $2\varepsilon$. From now on, we restrict our space S to include only trees that satisfy this constraint. As $\varepsilon \downarrow 0$, this algorithm approximates the algorithm that was used to generate the results discussed in Chapter 5.

The generation distribution is thus given by

$$p(y \mid x) = f(t^* \mid \tau^*, t)\, P(\tau^* \mid \tau), \qquad (4.4)$$

where $P(\tau^* \mid \tau)$ is the probability with which tree $\tau^*$ is obtained from tree $\tau$ by making a local rearrangement with node $j^*$ as the target. To generate a time for $t^*_{j^*}$ in the interval $[P^* + \varepsilon, C^* - \varepsilon]$ as described above, we generate an observation from a truncated beta distribution on the interval $[p(\varepsilon), c(\varepsilon)]$, where $p(\varepsilon)$ and $c(\varepsilon)$ are the endpoints $P^* + \varepsilon$ and $C^* - \varepsilon$ expressed on the scale of the beta distribution. The density $f(t^* \mid \tau^*, t)$ is then given by

$$f(t^* \mid \tau^*, t) = \frac{g(t^*)}{G(c(\varepsilon)) - G(p(\varepsilon))}, \qquad (4.5)$$

where $g(\cdot)$ is the beta density described in Section 2.2.1 with fixed variance (we will consider decreasing the variance as the algorithm proceeds at the end of this section) and $G(\cdot)$ is the cumulative distribution function for $g(\cdot)$. The distribution for $g(\cdot)$

55 has been truncated here, rather than using a strict beta distribution over the entire

allowable interval as described in Section 2.2.1, so that the density will be bounded

above 0 over this interval. This feature will be useful in establishing the results which

follow. We note in particular that the density in (4.5) implies the existence of some $\delta > 0$ such that $f(t^* \mid \tau^*, t) \ge \delta$ for all $t^* \in [P^* + \varepsilon, C^* - \varepsilon]$.

Our goal will be to show that the Markov chain with transition kernel defined

above is Harris recurrent. We provide a definition of Harris recurrence below. See

Grey [51], Revuz [57], Nummelin [48], and Fei [13] for more details.

Definition 4.1.1 A Markov chain $X_n$ is said to be Harris recurrent if there exists a positive, $\sigma$-finite measure $\varphi$ on the space S such that $\varphi(A) > 0$ implies that for any x,

$$P\{X_n \in A \ \text{i.o.} \mid X_0 = x\} = 1, \qquad (4.6)$$

where “i.o.” stands for infinitely often.

We will see that under suitable conditions on the transition kernel defined in

(4.1), the Markov chain considered here will be Harris recurrent. The two required conditions are given below, and it is verified that each of the conditions holds for the transition kernel defined here.

Condition 4.1.1 There is a probability measure $\varphi$ on the space S and an integer $n_0$ such that for any $A \subset S$ and $x \in S$, $\varphi(A) > 0$ implies $K^{n_0}(x, A) > 0$.

Verification of Condition 4.1.1: We note that the transition kernel in (4.1) is composed of the acceptance distribution and the generation distribution. Since the acceptance probabilities are always positive, it remains to show that the generation

probabilities for a sequence of moves from x into A are also positive. First, we note that Li, Pearl, and Doss [37] showed that for any two topologies $\tau$ and $\tau'$ in S, topology $\tau$ can be transformed into topology $\tau'$ in a finite number of applications of the local rearrangement strategy defined in Chapter 2. This means that there exists an integer $n_\tau \ge 1$ such that for any x in S and any set $A \subset S$, the topology $\tau$ can be converted to some topology in the set A in $n_\tau$ or fewer steps. Since each topological change happens with equal probability and the acceptance probabilities are always positive, this sequence of changes has positive probability. Next, note that once a topology in the set A has been attained, there is a sequence of n − 2 node time adjustments that are sufficient to move the node times into the set A. If $\varphi(A) > 0$, then this sequence of changes happens with positive probability. Letting $n_0 = n_\tau + n - 2$, we have established Condition 4.1.1. ■

Condition 4.1.2 For any increasing sequence of subsets $S_i$ of S such that $S_i \uparrow S$, there is some $i_0$ such that

$$\inf_{x \in S} K^{n_0}(x, S_{i_0}) > 0. \qquad (4.7)$$

Verification of Condition 4.1.2: First, we note that since the number of topologies is finite for a fixed number of sequences, and since $S_i \uparrow S$, there exists an $i_0$ large enough so that all topologies are contained in $S_{i_0}$. For any topology $\tau \in \tau_n$, let $\{t_\tau(S_{i_0})\}$ denote the set of all node time vectors t which correspond to points $(\tau, t)$ which are contained in $S_{i_0}$, i.e., $\{t_\tau(S_{i_0})\} = \{t : (\tau, t) \in S_{i_0}\}$.

Now consider the probability associated with moving a single node time, $t_j$, into the set of node times for the $j^{th}$ component that are included in $\{t_\tau(S_{i_0})\}$. Let $\{t_\tau(S_{i_0})\}_j$ denote this set. One way to do this would be to keep the topology of the tree fixed, which happens with probability 1/3. We would also need to select the appropriate node time to move, which has probability 1/(n − 2). We consider the probability associated with generating a time for $t_j$ that falls in the set $\{t_\tau(S_{i_0})\}_j$. Because the probability of generating a time within the allowable interval for every component of t is always at least $\delta$ by the definition of the generation density given in (4.5), this probability must be at least $\delta\,\lambda(\{t_\tau(S_{i_0})\}_j)$, where $\lambda(\cdot)$ represents Lebesgue measure. Note that for any $\xi > 0$, we can pick $i_0' > i_0$ large enough so that $\lambda(\{t_\tau(S_{i_0'})\}_j) > \xi$ for all j since $S_i \uparrow S$ (see Stromberg, 1981 [59], Theorem 6.62(iv)). We thus replace $i_0$ by $i_0'$.

Having generated a node time in the set $\{t_\tau(S_{i_0})\}_j$, we next consider the probability of accepting the newly generated time. To bound the acceptance probability (4.3), we define

$$M_j(\tau, t) = \sup_{t_j,\, t_j^* \in [P^* + \varepsilon,\, C^* - \varepsilon]} \bigl| L(\tau, t(t_j)) - L(\tau, t(t_j^*)) \bigr| \quad \text{for } 1 \le j \le n - 2, \qquad (4.8)$$

where $t_j$ is the current time and $t_j^*$ is the newly generated time. That $M_j(\tau, t)$ exists and is finite results from the continuity of the log likelihood function $L(\cdot)$ as a function of $t_j$ over the compact interval $[P^* + \varepsilon, C^* - \varepsilon]$. For each topology $\tau$, we can then define

$$M(\tau) = \max_{1 \le j \le n-2} M_j(\tau, t). \qquad (4.9)$$

Finally, we provide an overall bound on the difference in log likelihood by

$$M = \max_{\tau \in \tau_n} M(\tau). \qquad (4.10)$$

Then for any $x, y \in S$, the acceptance probabilities are bounded away from 0 by

$$a(y \mid x) = \exp\left(-\frac{(L(x) - L(y))^+}{c}\right) \;\ge\; \exp\left(-\frac{M}{c}\right) \;=\; M_a(c) > 0. \qquad (4.11)$$

Hence, the generation probabilities are strictly positive also.

Next, we note that for any $x \in S$, there exists a sequence of n − 2 node time adjustments that can be used to move the node times for the current tree x into the set $\{t_\tau(S_{i_0})\}$. Let $j_1(x)$ be the first node to be selected, $j_2(x)$ be the second, etc. We note that this ordering is not necessarily unique. We will consider systematically adjusting the node times so that this particular sequence of n − 2 steps is used to change the times for x to node times for the same topology that are in $\{t_\tau(S_{i_0})\}$ (and therefore to change tree x to a tree that is in $S_{i_0}$).

Let $S^{(l)}_{j_i}$ be the set of all trees with the same topology as x and with the time of node $j_k(x)$ in $\{t_\tau(S_{i_0})\}_{j_k(x)}$ for all $k \le i$. The superscript $l$ denotes the fact that these sets have at most $l$ node times different from the set $\{t_\tau(S_{i_0})\}$. We note that the $S^{(l)}_{j_i}$ are the sequence of sets that we move through in order to reach $S_{i_0}$ by making the sequence of node time adjustments $j_1(x), j_2(x), \ldots, j_{n-2}(x)$ described above. For any i, $0 \le i \le n - 2$, consider moving the time for node $j_i(x)$ into the set $\{t_\tau(S_{i_0})\}_{j_i(x)}$. For any $x \in S^{(n-i-1)}_{j_{i-1}}$,

this probability is given by

$$\begin{aligned} K\bigl(x, S^{(n-i-2)}_{j_i}\bigr) &= \int_{S^{(n-i-2)}_{j_i}} a(y \mid x)\, p(dy \mid x) + \mathbb{1}_{S^{(n-i-2)}_{j_i}}(x) \int_S \bigl(1 - a(y \mid x)\bigr)\, p(dy \mid x) \\ &\ge \int_{S^{(n-i-2)}_{j_i}} a(y \mid x)\, p(dy \mid x) \;\ge\; \int_{S^{(n-i-2)}_{j_i}} M_a(c)\, p(dy \mid x) \;=\; W(c) > 0. \end{aligned} \qquad (4.12)$$

In particular,

$$K\bigl(x, S^{(n-3)}_{j_1}\bigr) \ge W(c) > 0. \qquad (4.13)$$

Now consider moving from any $x \in S$ into the set with at most n − 4 node times different from $\{t_\tau(S_{i_0})\}$ by moving node $j_1(x)$ followed by node $j_2(x)$ into the set. This probability is given by

$$\begin{aligned} K^2\bigl(x, S^{(n-4)}_{j_2}\bigr) &= \int_S K\bigl(y, S^{(n-4)}_{j_2}\bigr)\, K(x, dy) \;\ge\; \int_{S^{(n-3)}_{j_1}} K\bigl(y, S^{(n-4)}_{j_2}\bigr)\, K(x, dy) \\ &\ge\; W(c) \int_{S^{(n-3)}_{j_1}} K(x, dy) \quad \text{using (4.12)} \\ &=\; W(c)\, K\bigl(x, S^{(n-3)}_{j_1}\bigr) \;\ge\; W(c)^2 \quad \text{using (4.13)} \\ &>\; 0. \end{aligned} \qquad (4.14)$$

Proceeding recursively, we have

$$\begin{aligned} K^{n-2}\bigl(x, S_{i_0}\bigr) &= \int_S K\bigl(y, S_{i_0}\bigr)\, K^{n-3}(x, dy) \;\ge\; \int_{S^{(1)}_{j_{n-3}}} K\bigl(y, S_{i_0}\bigr)\, K^{n-3}(x, dy) \\ &\ge\; W(c)\, K^{n-3}\bigl(x, S^{(1)}_{j_{n-3}}\bigr) \;\ge\; W(c)^{n-2} \;>\; 0. \end{aligned} \qquad (4.15)$$

Using (4.15), we then have

$$K^{n_0}\bigl(x, S_{i_0}\bigr) = \int_S K^{n-2}\bigl(y, S_{i_0}\bigr)\, K^{n_0-(n-2)}(x, dy) \;\ge\; \bigl(W(c)\bigr)^{n-2} \int_S K^{n_0-(n-2)}(x, dy) \;>\; 0, \qquad (4.16)$$

for all $x \in S$. Hence $\inf_{x \in S} K^{n_0}(x, S_{i_0}) > 0$ and Condition 4.1.2 holds. ■

The following theorem from Fei [13] can then be applied to show the Harris recurrence.

Theorem 4.1.1 For each fixed value of the control parameter c > 0, assuming that Conditions (4.1.1) and (4.1.2) hold, the Markov chain with transition kernel defined in (4.1) is an irreducible, aperiodic, positive Harris recurrent Markov chain on the state space S.

Proof: Fei (1992), pp. 34-35. ■

Harris recurrence guarantees the existence of an invariant distribution for each fixed value of the control parameter, c. For fixed c > 0, let the invariant measure be denoted by $\mu_c$ and note that this implies that for every $A \subset S$,

$$\mu_c(A) = \int_S K(x, A)\, \mu_c(dx). \qquad (4.17)$$

We next consider the limiting behavior of the $\mu_c$ as $c \downarrow 0$. We will show that, under certain conditions on the transition kernel, $\mu_c$ converges to a measure that is concentrated on $S_{opt}$, the set of optimal solutions, as $c \downarrow 0$. We first prove that the two required conditions are satisfied by the transition kernel defined in (4.1).

Let $S_{opt}$ be the set of optimal solutions, and note that each y in $S_{opt}$ can be expressed as $y = (\tau_{opt}, t_{opt})$. For $\gamma > 0$, define the set $S^{\gamma}_{opt} = \{(\tau_{opt}, t^{\gamma}) : \text{the topology } \tau_{opt} \text{ matches a topology in } S_{opt} \text{ and } |(t^{\gamma})_j - (t_{opt})_j| \le \gamma \text{ for all } j\}$. Let $C^{\gamma}_{opt} = S \setminus S^{\gamma}_{opt}$.

Condition 4.1.3 For some $n_0$, $\inf_x K^{n_0}(x, S^{\gamma}_{opt}) > 0$.

Verification of Condition 4.1.3: This condition can be established in a manner similar to the verification of Condition 4.1.2. Note first that, as discussed in the verification of Condition 4.1.1, there exists an integer $n_\tau \ge 1$ such that any topology $x \in C^{\gamma}_{opt}$ can be changed to a topology $y \in S^{\gamma}_{opt}$ in $n_\tau$ or fewer steps, each of which is selected with probability $\frac{1}{3(n-2)}$. To establish a bound on the acceptance probability for each of these steps, we consider, for every pair of topologies $\tau_x$ and $\tau_y$ in $\tau_n$ and all pairs of node times i and j within trees x and y,

$$Q_{ij}(\tau_x, \tau_y) = \sup \bigl| L(\tau_x, t_x(t_i)) - L(\tau_y, t_y(t_j)) \bigr|, \qquad (4.18)$$

i.e., $Q_{ij}(\tau_x, \tau_y)$ is an upper bound on the difference in log likelihood for any two topologies when one node time within each topology is varied.

We can then find the maximum over all pairs of node times to obtain

$$Q(\tau_x, \tau_y) = \max_{i,j} Q_{ij}(\tau_x, \tau_y) \qquad (4.19)$$

for each pair of topologies $\tau_x$ and $\tau_y$ in $\tau_n$. Finally, we take

$$Q = \max_{\tau_x, \tau_y \in \tau_n} Q(\tau_x, \tau_y) \qquad (4.20)$$

to obtain the following bound on the acceptance probability,

$$a(y \mid x) = \exp\left(-\frac{(L(x) - L(y))^+}{c}\right) \;\ge\; \exp\left(-\frac{Q}{c}\right) \;=\; Q_a(c) > 0. \qquad (4.21)$$

Thus the acceptance probabilities associated with each of the topology changes are bounded above 0.

Once we have obtained a topology that is contained in the set $S^{\gamma}_{opt}$, it remains only to adjust the node times so that they fall within the set $t^{\gamma}$. We proceed as in the verification of Condition 4.1.2 to select a sequence of n − 2 node time changes that will move the node times for any such x into the set $t^{\gamma}$. Let $j_1(x)$, $j_2(x)$, ..., $j_{n-2}(x)$ be the sequence of node times selected. The acceptance probabilities are bounded as in (4.11), and the probability associated with generating each of the node time changes required is bounded away from 0, using an argument similar to that used in verifying Condition 4.1.2.

Let $S^{\gamma}_{opt}(\tau_{opt})$ be the set of all trees that have topologies that are found in the set $S^{\gamma}_{opt}$. Setting $n_0 = n_\tau + n - 2$ and using a recursive argument as in the verification of Condition 4.1.2, we have, for all $x \in C^{\gamma}_{opt}$,

$$K^{n_0}\bigl(x, S^{\gamma}_{opt}\bigr) = \int_S K^{n-2}\bigl(y, S^{\gamma}_{opt}\bigr)\, K^{n_\tau}(x, dy) \;\ge\; \int_{S^{\gamma}_{opt}(\tau_{opt})} K^{n-2}\bigl(y, S^{\gamma}_{opt}\bigr)\, K^{n_\tau}(x, dy) \;>\; 0. \qquad (4.22)$$

Hence, Condition 4.1.3 holds. ■

Condition 4.1.4 As $c \downarrow 0$, $\sup_{x \in S^{\gamma}_{opt}} K^{n_0}(x, C^{\gamma}_{opt}) \to 0$.

Verification of Condition 4.1.4: We first consider leaving the set $S^{\gamma}_{opt}$ and moving to $C^{\gamma}_{opt}$ in one step. Let $x \in S^{\gamma}_{opt}$. For $\gamma$ sufficiently small, $L(x) > L(y)$ for all $y \in C^{\gamma}_{opt}$.

Then for any $x \in S^{\gamma}_{opt}$,

$$K\bigl(x, C^{\gamma}_{opt}\bigr) = \int_{C^{\gamma}_{opt}} a(y \mid x)\, p(dy \mid x), \qquad (4.23)$$

since $x \in S^{\gamma}_{opt}$.

Now we note that $\lim_{c \downarrow 0} a(y \mid x) = 0$ since $L(x) > L(y)$. Then we have that

$$\lim_{c \downarrow 0} K\bigl(x, C^{\gamma}_{opt}\bigr) = \lim_{c \downarrow 0} \int_{C^{\gamma}_{opt}} a(y \mid x)\, p(dy \mid x) = \int_{C^{\gamma}_{opt}} p(dy \mid x)\, \lim_{c \downarrow 0} a(y \mid x) = 0. \qquad (4.24)$$

Hence, $\lim_{c \downarrow 0} K(x, C^{\gamma}_{opt}) = 0$ for all $x \in S^{\gamma}_{opt}$. We now consider $n_0 = 2$,

$$K^2\bigl(x, C^{\gamma}_{opt}\bigr) = \int_S K\bigl(y, C^{\gamma}_{opt}\bigr)\, K(x, dy) = \int_{S^{\gamma}_{opt}} K\bigl(y, C^{\gamma}_{opt}\bigr)\, K(x, dy) + \int_{C^{\gamma}_{opt}} K\bigl(y, C^{\gamma}_{opt}\bigr)\, K(x, dy). \qquad (4.25)$$

Since both terms in (4.25) include a term involving a move from the set $S^{\gamma}_{opt}$ to $C^{\gamma}_{opt}$ in one step, and the probability of such moves goes to 0 as in (4.24), we have $\lim_{c \downarrow 0} K^2(x, C^{\gamma}_{opt}) = 0$ for all $x \in S^{\gamma}_{opt}$ also. Proceeding recursively gives Condition

4.1.4. ■

Conditions 4.1.3 and 4.1.4 can then be used to show the convergence of the se­ quence of stationary distributions to a distribution that is concentrated on the set of near-optimal solutions, using the following result from Fei [13].

Theorem 4.1.2 Suppose that Conditions (4.1.3) and (4.1.4) hold for all $\gamma > 0$. Then for all $\gamma > 0$, as $c \downarrow 0$,

$$\mu_c\bigl(S^{\gamma}_{opt}\bigr) \to 1. \qquad (4.26)$$

Proof: The proof is identical to Fei (1992), pp. 36-37, with $S_{opt}$ replaced by $S^{\gamma}_{opt}$ and $C_{opt}$ replaced by $C^{\gamma}_{opt}$. ■

Since the result in Theorem 4.1.2 holds for every $\gamma > 0$, we have convergence to the set of optimal solutions, i.e., $\mu_c(S_{opt}) \to 1$ as $c \downarrow 0$.

Finally, we show that the convergence results still apply even when the generat­ ing distribution is updated at each iteration of the algorithm. Thus decreasing the

65 variance of the beta distribution used to select the new time of the target node in

the local rearrangement strategy will still give convergence of the algorithm. The

following theorem from Fei [13] gives the desired result.

Theorem 4.1.3 Suppose that Conditions (4.1.3) and (4.1.4) are satisfied uniformly in the generating distributions given in (4.4). Then, as $c \downarrow 0$, the result of Theorem 4.1.2 holds.

Proof: Fei (1992), pp. 43. ■

4.2 Convergence Results for the Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s)

The parsimony criteria does not incorporate branch lengths into the calculation

of the length of the tree, so the space of possible trees in the parsimony case is

simply the space of all topologies having n external nodes, which will be denoted

by $\tau_n$ as above. Recall from Chapter 3 that at each iteration of the algorithm, a

new topology is selected by applying the local rearrangement strategy to the current tree x times, where x is drawn from a Poisson distribution with mean $\lambda$, and then determining whether the newly generated topology should be accepted. Since the outcome of any iteration of the algorithm depends only on the topology that resulted from the previous iteration, the simulated annealing algorithm generates a Markov chain on the topology space $\tau_n$. We begin by giving some convergence results for the general simulated annealing algorithm on a finite state space. We then show that the simulated annealing algorithm proposed here satisfies the criteria for convergence.

To begin, define the transition probability for the $k^{th}$ trial of a Markov chain for a pair of outcomes i and j by

$$P_{ij}(k) = P\{X_k = j \mid X_{k-1} = i\}. \qquad (4.27)$$

A Markov chain is called finite if it is defined on a finite set of outcomes. If the

transition probabilities associated with a Markov chain do not depend on the trial

number k, then the chain is called homogeneous. Otherwise, the chain is called

inhomogeneous. For the simulated annealing algorithm at a fixed value, c, of the control parameter, we have a finite homogeneous Markov chain on the state space S, with transition probabilities defined by

$$P_{ij}(k) = P_{ij}(c, k) = \begin{cases} G_{ij}(c, k)\, A_{ij}(c, k), & \text{if } i \ne j, \\ 1 - \sum_{l \ne i} G_{il}(c, k)\, A_{il}(c, k), & \text{if } i = j, \end{cases}$$

where the $G_{ij}(c, k)$ are the generation probabilities, i.e., the probability of generating state i from state j at trial k when the control parameter has value c, and the $A_{ij}(c, k)$ are the acceptance probabilities, i.e., the probability of accepting state j over state i at trial k when the control parameter has value c.

The generation probabilities are assumed here to satisfy the condition that $G_{ij}(c, k) = G_{ji}(c, k)$. The acceptance probabilities are given by

$$A_{ij}(c, k) = \exp\left(-\frac{(f(j) - f(i))^+}{c}\right), \qquad (4.29)$$

where $i \in S$, $j \in S$, $f(\cdot)$ is the function to be optimized, and for all $a \in \mathbb{R}$, $a^+ = a$ if $a > 0$, and $a^+ = 0$ otherwise. Using the generation probabilities and acceptance probabilities defined here, the stationary distribution of the Markov chain for the simulated annealing algorithm can be determined. We first provide a definition of the stationary distribution, and then give some results for the simulated annealing algorithm.

Definition 4.2.1 [Feller, 1950] The stationary distribution of a finite homogeneous Markov chain with transition matrix P is defined as the vector q, whose $i^{th}$ component is given by

$$q_i = \lim_{k \to \infty} P\{X_k = i \mid X_0 = j\}, \quad \text{for all } j. \qquad (4.30)$$

The following theorem gives conditions for the existence of a stationary distribu­ tion for the simulated annealing algorithm at a fixed value of the control parameter, and gives the form of the stationary distribution as a function of this control param­ eter.

Theorem 4.2.1 Let $P_{ij}(c, k)$ be a Markov chain for the simulated annealing algorithm with acceptance probabilities $A_{ij}(c, k)$ as in (4.29) and generation probabilities $G_{ij}(c, k)$ which satisfy $G_{ij} = G_{ji}$. Furthermore, let the following condition be satisfied:

$$\forall\, i, j \in S,\ \exists\, p \ge 1,\ \exists\, l_0, l_1, \ldots, l_p \in S, \ \text{with } l_0 = i,\ l_p = j,\ \text{and } G_{l_k l_{k+1}} > 0,\ k = 0, 1, \ldots, p - 1. \qquad (4.31)$$

Then the Markov chain has a stationary distribution q(c), whose components are given by

$$q_i(c) = \frac{1}{N_0(c)} \exp\left(-\frac{f(i)}{c}\right), \quad \text{for all } i \in S, \qquad (4.32)$$

where

$$N_0(c) = \sum_{j \in S} \exp\left(-\frac{f(j)}{c}\right). \qquad (4.33)$$

Proof: Aarts and Korst (1989), pp. 38-40. ■

Given the stationary distribution for any fixed value of the control parameter, we

now examine the behavior of the stationary distribution as the value of the control

parameter decreases. In particular, we see that the stationary distribution converges

to a distribution which is concentrated on the set of optimal solutions as the control

parameter decreases to 0 .

Corollary 4.2.1 For the stationary distribution defined in (4.32),

$$\lim_{c \downarrow 0} q_i(c) = \frac{1}{|S_{opt}|}\, \chi_{S_{opt}}(i), \qquad (4.34)$$

where $S_{opt}$ denotes the set of globally optimal solutions, and

$$\chi_{S_{opt}}(i) = \begin{cases} 1, & \text{if } i \in S_{opt}, \\ 0, & \text{otherwise.} \end{cases} \qquad (4.35)$$

Proof: Aarts and Korst (1989), pp. 18-19. ■

Thus we see that the simulated annealing algorithm converges asymptotically

to the set of globally optimal solutions, provided that the stationary distribution

given in (4.32) is attained at each value of the control parameter. Several cooling

schedules which specify a fixed number of iterations at each value of c and still give

asymptotic convergence have been proposed in the literature (for example, [46, 23, 3]).

In practice, however, we follow the approach of Lundy and Mees [41] and choose to

use the cooling schedule described in Chapter 3, which involves lowering the value of

the control parameter very slightly at each iteration of the algorithm.

We now show that the simulated annealing algorithm proposed here for the parsimony problem shares the convergence properties of the general simulated annealing algorithm described above. In this case, the state space S is simply $\tau_n$, the

69 space of all topologies with n external nodes. The generation probabilities are given

by

$$\begin{aligned} G_{ij}(c, k) = G_{ij}(k) &= P(\text{moving from topology } i \text{ to topology } j) \\ &= \sum_x P(\text{moving from } i \text{ to } j \mid x \text{ rearrangements})\, P(x \text{ rearrangements}) \\ &= \sum_{x} \frac{\phi_{ij}(x)}{(2n-4)^x}\, \frac{e^{-\lambda} \lambda^x}{x!}, \end{aligned} \qquad (4.36)$$

where $i \in \tau_n$, $j \in \tau_n$, and $\phi_{ij}(x)$ is the number of ways of moving from topology i to topology j in x applications of the local rearrangement strategy. The terms involving the factor $2n - 4$ result from the fact that there are $n - 2$ internal nodes eligible for local rearrangement, and there are two possible topology changes associated with each of these nodes, resulting in $2n - 4$ possible local rearrangements of any given topology. Each of these rearrangements is chosen with equal probability. We check to see that the matrix of generation probabilities is stochastic:

$$\sum_j G_{ij}(k) = \sum_j \sum_x \frac{\phi_{ij}(x)}{(2n-4)^x}\, \frac{e^{-\lambda} \lambda^x}{x!} = \sum_x \frac{e^{-\lambda} \lambda^x}{x!}\, \frac{\sum_j \phi_{ij}(x)}{(2n-4)^x} = \sum_x \frac{e^{-\lambda} \lambda^x}{x!} = 1,$$

since for each fixed value of x, $\sum_j \phi_{ij}(x) = (2n-4)^x$.

The acceptance probabilities for the simulated annealing algorithm for the parsimony problem are given by

$$A_{ij}(c, k) = \exp\left(-\frac{(\ell(j) - \ell(i))^+}{c}\right), \qquad (4.37)$$

where $i \in \tau_n$, $j \in \tau_n$, $\ell(\cdot)$ is the length function defined in Chapter 3, and $a^+$ is defined as above. We can now state the following result.

Corollary 4.2.2 For the simulated annealing algorithm on the space $\tau_n$ defined by (4.36) and (4.37), the results of Theorem 4.2.1 and Corollary 4.2.1 hold, i.e., there exists a stationary distribution q such that

$$\lim_{c \downarrow 0} q_i(c) = \frac{1}{|\tau_{n\,opt}|}\, \chi_{\tau_{n\,opt}}(i), \qquad (4.38)$$

where $\tau_{n\,opt}$ is the set of topologies having minimum length, and $\chi_{\tau_{n\,opt}}(i)$ is defined as in (4.35).

Proof: First, we note that the condition $G_{ij} = G_{ji}$ in Theorem 4.2.1 is satisfied, since

$$G_{ij}(k) = \sum_x \frac{\phi_{ij}(x)}{(2n-4)^x}\, \frac{e^{-\lambda} \lambda^x}{x!} = \sum_x \frac{\phi_{ji}(x)}{(2n-4)^x}\, \frac{e^{-\lambda} \lambda^x}{x!} = G_{ji}(k), \qquad (4.39)$$

where $\phi_{ij}(x) = \phi_{ji}(x)$ since every path from i to j in x applications of the local rearrangement strategy can be reversed to give a path from j to i in x applications.

Next, we consider the condition in (4.31). This condition states that for any two topologies i and j in $\tau_n$, we must be able to move from i to j in a finite number of applications of the generation procedure, which in this setting is the local rearrangement procedure. That this holds for the local rearrangement strategy as described here was shown by Li, Pearl, and Doss [37]. Finally, we see that the acceptance probabilities specified by (4.37) are identical to those specified in (4.29) with $f(\cdot)$ replaced by $\ell(\cdot)$.

The conditions of Theorem 4.2.1 and Corollary 4.2.1 are therefore satisfied, and thus their results hold. ■
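To make the algorithm concrete, the following is a minimal sketch of a single iteration of the simulated annealing procedure defined by (4.36) and (4.37). The helper functions tree_length (the length function of Chapter 3) and local_rearrangement (one of the 2n - 4 equally likely local rearrangements) are assumed rather than implemented here, and the Poisson draw for the number of rearrangements reflects the form of (4.36) as reconstructed above.

    import math
    import random

    def poisson(lam):
        # Simple Poisson sampler (Knuth's method); adequate for small lam > 0.
        L, k, p = math.exp(-lam), 0, 1.0
        while p > L:
            k += 1
            p *= random.random()
        return k - 1

    # One iteration of simulated annealing for the parsimony problem.
    # tree_length() and local_rearrangement() are assumed helper functions;
    # lam is the mean number of rearrangements applied per proposal.
    def sa_step(topology, c, tree_length, local_rearrangement, lam=1.0):
        # Generate a candidate by applying a random number of local
        # rearrangements (x = 0 proposes the current topology unchanged).
        x = poisson(lam)
        candidate = topology
        for _ in range(x):
            candidate = local_rearrangement(candidate)
        # Accept with probability exp(-(l(j) - l(i))^+ / c), as in (4.37).
        delta = tree_length(candidate) - tree_length(topology)
        if delta <= 0 or random.random() < math.exp(-delta / c):
            return candidate
        return topology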

CHAPTER 5

APPLICATIONS

5.1 Estimation of the ML Tree for Fixed Parameter Values

We now demonstrate the application of the SSA to several theoretical and real data examples in the case where we assume that the parameters in the substitution model have been fixed prior to the search for the ML tree and node times. We compare the

SSA in all cases to one of the most commonly used algorithms for ML tree estimation, the DNAMLK program in PHYLIP [19]. These examples demonstrate that the SSA is able to (1) find an estimate of the ML tree more quickly than existing algorithms;

(2) return an estimate of the ML tree with a higher likelihood than that returned by existing algorithms in many cases; and (3) provide additional information about other trees of high likelihood. We begin by examining several theoretical data sets.

5.1.1 Theoretical Data

It is desirable to create a data set for which the ML topology is known in order to test our stochastic optimization method and to compare the method to an existing ML program (Felsenstein's DNAMLK). To do this, data were specified with the number of sequences, n, equal to $2^l$, $l = 2, 3, \ldots, 7$, so that a completely symmetric tree must be the ML

tree under the Jukes-Cantor (JC) model [29]. For each n, the number of sites in the

data set was 128. Table A.1 in Appendix A shows the data set used for the largest

example, n = 128. The other data sets follow this general pattern. Ten trials for

each data set were performed using both DNAMLK and SSA, with different starting

points. For DNAMLK, different starting points means that different orderings of the

sequences were used (i.e., the "jumble" option in PHYLIP), while SSA was started

from different random trees as described previously (see Section 2.2.1). The number

of times the ML topology was recovered for each value of n was recorded, as well as

the cpu time required for each trial (Table 5.1). All computations were performed

on an HP Series 9000 machine performing 24 million floating point operations per second.

Number of    Time (seconds)            Number Correct        Maximum Log Likelihood
Taxa         SSA        DNAMLK         SSA       DNAMLK      SSA           DNAMLK
4            0.33       0.18           10/10     10/10       -547.77       -547.77
8            1.22       2.41           10/10     10/10       -798.88       -798.88
16           4.26       26.79          10/10     10/10       -1,085.24     -1,085.24
32           24.22      258.39         10/10     10/10       -1,405.92     -1,405.92
64           128.46     3,072.32       10/10     3/10        -1,760.38     -1,764.52
128          946.51     26,092.22      10/10     0/10        -2,148.38     -2,160.94

Table 5.1: Summary of the tree estimation results for the theoretical data examples. All results are averages over the 10 runs.

The cpu time (averaged over the runs) used by the two programs for each value of n is shown in Table 5.1. We observe that SSA is faster than DNAMLK for n ≥ 8. The difference is especially notable for the largest example attempted, n = 128, where SSA took an average of 16 minutes, as compared to over 7 hours for DNAMLK. Figure 5.1 graphs the log of the cpu times in seconds against the log of the number of sequences.


Figure 5.1: Plot of the log of the cpu time in seconds against the log of the number of sequences for both DNAMLK and SSA for the theoretical data. The plot indicates that the time is proportional to the cube of the number of sequences for DNAMLK and to the square of the number of sequences for SSA.

Felsenstein [19] states that the time required by DNAMLK is proportional to the cube of the number of sequences, which is supported by the fact that the slope of the line representing the times for DNAMLK in Figure 5.1 is approximately three. The line representing the times for SSA has a slope of approximately two, which indicates that the time required by the algorithm is proportional to the square of the number of sequences for the cooling schedule and stopping rule considered here.
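The slopes quoted above can be checked directly from the averages in Table 5.1. The following short calculation, which simply fits a least-squares line to the logged times, is included only as an illustration of that check.

    import math

    # Rough check of the slopes discussed above: least-squares fit of
    # log(cpu time) against log(number of sequences), using the averages
    # reported in Table 5.1.
    n = [4, 8, 16, 32, 64, 128]
    ssa = [0.33, 1.22, 4.26, 24.22, 128.46, 946.51]
    dnamlk = [0.18, 2.41, 26.79, 258.39, 3072.32, 26092.22]

    def slope(xs, ys):
        # Least-squares slope of log(y) on log(x).
        lx = [math.log(x) for x in xs]
        ly = [math.log(y) for y in ys]
        mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
        num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
        den = sum((a - mx) ** 2 for a in lx)
        return num / den

    # The fitted slopes approximate the order of polynomial growth of the
    # running time (roughly quadratic for SSA, roughly cubic for DNAMLK).
    print(slope(n, ssa), slope(n, dnamlk))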

The two methods can also be compared in terms of their ability to recover the ML tree (Table 5.1). We note that SSA recovered the ML tree for all values of n, while

DNAMLK found non-optimal topologies for n = 64 and n = 128. This indicates that

SSA is better able to avoid entrapment in local maxima than are existing algorithms.

It also demonstrates that convergence of the algorithm occurs independently of the

initial tree.

5.1.2 Mitochondrial DNA Sequences

We next consider a real data set consisting of mitochondrial DNA (mtDNA) sequences for 14 primate species. The data used here is a subset of that discussed

by Hayasaka, Gojobori, and Horai [27], who present a 0.9-kb fragment of mtDNA

sequences for 12 of the 14 species discussed here. In this study, a smaller data set

consisting of 231 of the available sites, selected because they are believed to be "clock-like", was obtained from Joseph Felsenstein [14]. Hayasaka, Gojobori, and Horai [27]

use the NJ method of phylogenetic tree construction (see Chapter 1) to infer the

phylogenetic relationships among this group of species. Their resulting phylogenetic

tree is shown in Figure 5.2.

We compared SSA and DNAMLK for the 14 species data set, which includes the

12 species shown in Figure 5.2 along with Bovine and Mouse sequences, for the 231 sites which are believed to follow a molecular clock. Ten trials were performed for both

SSA with random starting trees as described in Section 2.2.1 and for DNAMLK with the jumble option assuming the F84 model with transition/transversion parameter

2.0. SSA performed an average of 3,609 iterations, and took an average of 539.41 seconds (see Table 5.2). We note that one of the ten trials took more than double the time required for the next longest trial. Therefore, the median time required by the algorithm, 450.70 seconds, may provide a better representation of the time required.

In all of the ten trials, a tree with a log likelihood of -2,677.22, which is presumed to be the ML tree, was listed in the top ten trees found by the algorithm. Two trees


Figure 5.2: NJ tree for the 12 primate species based on 0.9-kb region of mtDNA. From Hayasaka, Gojobori, and Horai [27]

with log likelihoods nearly as high, -2,677.88 and -2,678.11, were found in the ten best trees on eight and five of the trials, respectively. These three trees are shown in

Figure 5.3. Twenty other trees appeared in the top ten trees on more than one trial, with only two of these occurring more than five times. A total of 34 unique trees were listed in the ten best trees over the ten trials performed.

DNAMLK took an average of 940.34 seconds to estimate the ML tree, and always returned the ML tree, which is shown in Figure 5.3(a). We find that the two programs always returned the same estimate of the ML tree, but that SSA was able to do so in about 1/2 of the time. In addition, SSA gives valuable information about other trees of high likelihood. For example, inspection of the trees in Figure 5.3 shows that these three trees differ only in the relationship between Bovine, Tarsier, and Lemur. These

observations support several of the statements made by Hayasaka, Gojobori, and

Horai [27]. The authors comment that they omitted the Bovine and Mouse sequences from their analysis since "... when we used the nucleotide sequences of mouse and cow in the phylogenetic analysis, the primate species were not monophyletic ...”. A similar type of grouping is observed here also since the three trees in Figure 5.3 show the Bovine sequence clustering with Lemur and Tarsier. Also, the authors state that

“the phylogenetic relationship between the tarsier and the other primate groups has long been controversial”, which perhaps gives some indication of why rearranging the area of the tree which contains the Tarsier sequence yields three differing trees, all of which have high likelihood.

Data Set                       Time (seconds)            Number Correct*      Maximum Log Likelihood
                               SSA         DNAMLK        SSA      DNAMLK      SSA            DNAMLK
mtDNA (n=14, 231 sites)        539.41      940.34        10/10    10/10       -2,677.22      -2,677.22
HPV data (n=30, 1,379 sites)   8,145.60    50,297.70     5/5      3/5         -27,976.52     -27,976.91
HIV data (n=30, 2,245 sites)   18,169.22   42,828.77     5/5      4/5         -24,793.13     -24,792.75

*The number of times the tree with the highest known likelihood is found.

Table 5.2: Summary of the tree estimation results for the real data examples. All results are averages over the total number of runs.

(a) Tree 1: ln likelihood = -2,677.22    (b) Tree 2: ln likelihood = -2,677.88    (c) Tree 3: ln likelihood = -2,678.11

Figure 5.3: Three trees of high likelihood found by SSA for the 14-sequence mtDNA data set.

5.1.3 Group A Papillomavirus Sequences

Papillomaviruses are known to cause benign and malignant neoplasia in humans, other mammals, and birds [11]. Genetically distinct papillomavirus strains are classified into groups based on sequence similarity, host type, and pathogenic characteristics. Group A papillomaviruses are those which are commonly associated with mucosal or genital neoplasias, primarily in humans. Within Group A, the papillomaviruses are classified into genetic subtypes, which are labeled A1 - A14 [1]. DNA sequences for several regions of the genome have been sequenced and are available on the internet for many of these genetic subtypes. In this study, we consider aligned

DNA sequences for 30 Group A papillomaviruses (28 human papillomaviruses (HPVs), a rhesus papillomavirus, and a pygmy chimpanzee papillomavirus), which were obtained from the Los Alamos National Database website (http://hpv-web.lanl.gov).

A portion of the L1 gene, 1,379 nucleotides in length, from which all insertion and deletion sites in the sequences have been removed, was used. Table 5.3 shows the sequences used and the genetic subtype classification for each.

Consensus    Genetic    Consensus    Genetic    Consensus    Genetic
Sequence     Subtype    Sequence     Subtype    Sequence     Subtype
HPV32        A1         HPV53        A6         HPV31        A9
HPV42        A1         HPV56        A6         HPV52        A9
HPV3         A2         HPV18        A7         HPV33        A9
HPV10        A2         HPV45        A7         HPV58        A9
HPV2a        A4         HPV39        A7         RhPV1        A9
HPV27        A4         HPV59        A7         HPV6b        A10
HPV57        A4         HPV7         A8         HPV11        A10
HPV26        A5         HPV40        A8         HPV13        A10
HPV51        A5         HPV16        A9         PCPV1        A10
HPV30        A6         HPV35h       A9         HPV34        A11

Table 5.3: Group A papillomaviruses and genetic subtypes

Similar phylogenetic studies of the evolutionary relationships among papillomaviruses have been conducted by Chan et al. [10], Chan et al. [11], and Ong et al. [50]. In particular, Ong et al. [50] used phylogenetic tree construction methods to test the molecular clock hypothesis in the region of the L1 gene for the 30 sequences in Table 5.3, and to study the rate of population growth among group A papillomaviruses.

Based on results obtained with the programs DNAML and DNAMLK in the PHYLIP package [19], the authors conclude that (1) evolution of this particular gene is consistent with the molecular clock hypothesis, and (2) group A papillomavirus populations seem to have grown at an exponential rate, though there is evidence that the population size is no longer increasing as rapidly.

SSA and DNAMLK were both used to estimate the ML phylogenetic tree for these 30 sequences under the F84 model with transition/transversion ratio 2.0. Five trials were performed with each of the methods. Random starting trees as described previously were used as the initial trees for our algorithm, and the jumble option was used to generate a random order of addition of sequences for DNAMLK. The results are summarized in Tables 5.2 and 5.4.

Tree    Log Likelihood    Times Found
                          SSA    DNAMLK
1       -27,976.52        5      3
2       -27,976.62        5      0
3       -27,977.16        0      2
4       -27,977.69        5      0

Table 5.4: Summary of the tree estimation results for the HPV example.

SSA performed an average of 11,196 iterations, which took an average of 8,145.60

seconds (about 2.3 hours). The highest log likelihood found by the algorithm was

-27,976.52 (see Figure 5.4(a)), which was listed in the top

ten trees found in all of the five trials. Two other trees of high likelihood, -27,976.62

and -27,977.69, were always listed in the top ten trees found by SSA (see Tables 5.2

and 5.4, and Figure 5.4). Nine other trees were found more than once, with only one

of those appearing in all five runs, and a total of 20 unique trees were found in the

five trials.

DNAMLK took an average of 50,297.70 seconds (about 14 hours) to estimate the

ML tree. It returned the tree found by SSA to be the ML tree (Figure 5.4(a)) on three of the five trials, and returned a tree of lower log likelihood (-27,977.16) on the two remaining trials (Figure 5.4(c)). Curiously, this tree is never listed in the top ten trees found by SSA. We do note, however, that SSA always found the presumed ML tree, and took only about 1/6 of the time required by DNAMLK.

Examination of the four trees in Figure 5.4 reveals some interesting features of the data. First, we note that the sequences generally cluster in the tree according to the genetic subtype relationships identified in Table 5.3, with the exception of the RhPV1 (rhesus papillomavirus) sequence. In the trees obtained by SSA (trees 1,

2, and 4), RhPV1 appears to be more closely related to the subtype A11 sequence

HPV34 than to the other A9 sequences. The RhPV1-HPV34 clade then joins the subtype A9 group. In tree 3, however, the RhPV1 sequence clusters most closely with the subtype A4 sequences. These results suggest that, at least when the L1 region of the genome is considered, the phylogenetic position of the RhPV1 sequence is somewhat uncertain.

(a) Tree 1: ln likelihood = -27,976.52    (b) Tree 2: ln likelihood = -27,976.62
(c) Tree 3: ln likelihood = -27,977.16    (d) Tree 4: ln likelihood = -27,977.69

Figure 5.4: Four trees of high likelihood found by SSA for the HPV data set.

It is also interesting to note that the three trees found by SSA (trees 1, 2, and 4) differ only in the arrangement of HPV types 26, 51, 3, and 10, in relation to the two groups they are most closely related to (HPV types 56, 30, and 53, and HPV types 39,

59, 45, and 18). This suggests to us that the evolutionary relationships in the other parts of the tree may be very well-established, but that there is some uncertainty in this portion of the tree for the L1 gene. Interestingly, tree 3 (the tree identified by

DNAMLK on two of the trials, but never found by SSA) is identical to tree 2 in the portion of the tree just discussed. It differs from the other three trees, however, in the placement of the Rhesus papillomavirus sequence, RhPV1, as discussed above.

Discussion of the results and comparison of trees with high likelihood is not possible with programs such as DNAMLK which report only a single estimate of the tree.

Thus, we see the advantage of returning information concerning not only the ML tree, but also other trees of high likelihood.

5.1.4 Analysis of the env Region for 30 HIV Sequences

Human Immunodeficiency Viruses (HIV) are broadly classified into two distinct groups, HIV-1 and HIV-2, which are believed to have different evolutionary origins in humans. Within HIV-1, viral strains are further classified into two groups: group

M (main), which contains the majority of all HIV-1 strains, and group O (outlier), which contains only a few strains. The O group sequences differ significantly from the M group strains, but are still believed to be HIV-1 viruses. Within group M,

HIV-1 strains are classified into genetic subtypes, in a manner similar to the HPV viruses discussed in Section 5.1.3. Ten genetic subtypes within group M have now been identified [21], and sequencing of viruses of these particular subtypes has been

underway. In 1997, it was estimated that about 50% of all reported HIV-1 strains

were of subtype B, while an additional 45% of all reported strains consisted of

subtypes A (16%), G (9%), D (10%), and E (10%) [7]. We note, however,

the difficulty in obtaining accurate estimates of HIV-1 subtype prevalence, due to

underrepresentation of HIV-1 strains from less-developed countries [7]. More recent

estimates [21] show an increased occurrence of subtype C viruses, estimating that

subtype C accounts for about 36% of all globally circulating HIV-1 sequences. For

further reviews, see [44, 7, 21].

The HIV database maintained by Los Alamos National Labs (http://hiv-web.lanl.gov)

contains DNA sequence data for various regions of the genome for many of these HIV-

1 genetic subtypes. We obtained aligned sequences in the env region for 30 HIV-1

genetic subtypes. These sequences are listed in Table 5.5, where they are classified by

genetic subtype and country of origin. The region considered here is 2,245 nucleotides

in length and includes insertion and deletion sites. As above, both SSA and DNAMLK

were used to analyze the data under the F84 model with transition/transversion ratio

1.5, as described in [34]. Five trials were performed with each of the methods with

random starting points for each, as described previously. The results are summarized

in Tables 5.2 and 5.6.

SSA took an average of 18,169.22 seconds (about 5 hours). The highest log likelihood

found by the algorithm was -24,792.75 (see Figure

5.5(a)), which was listed in the top ten trees found in all of the five trials. Two other trees of high likelihood, -24,793.93 and -24,796.12, were listed in the top ten trees found by SSA for five and four of the trials, respectively (see Table 5.6 and Figure

5.5). Fifteen other trees were found more than once, with none of those appearing in all five runs, and a total of 20 unique trees were found in the five trials.

Sequence       Genetic Subtype    Sampling country (origin)
U455           A                  Uganda
92UG037.1      A                  Uganda
K89            A                  Kenya
SF170          A                  Rwanda
HXB2           B                  France
JRFL           B                  US
OYI            B                  Gabon
RF             B                  US (Haiti)
ETH2220        C                  Ethiopia
92BR025.8      C                  Brazil
UG268          C                  Uganda
ZAM18          C                  Zambia
NDK            D                  Zaire
Z2Z6           D                  Zaire
ELI            D                  Zaire
94UG114.1      D                  Uganda
CM240          E                  Thailand
TN235          E                  Thailand
90CR402.1      E                  Central African Republic
93TH253.3      E                  Thailand
BZ163          F                  Brazil
BZ126          F                  Brazil
93BR020.1      F                  Brazil
92NG003.2      G                  Nigeria
92NG083.1      G                  Nigeria
92UG975.10     G                  Uganda
92RU131.9      G                  Russia
90CF056.1      H                  Central African Republic
ANT70          O group            Cameroon
MVP5180        O group            Cameroon

Table 5.5: Genetic subtype groupings for the HIV data set.

DNAMLK took an average of 42,828.77 seconds (about 12 hours) to estimate the

ML tree. It returned the tree found by SSA to be the ML tree (Figure 5.5(a)) on four of the five trials, and returned a tree of lower log likelihood (-24,793.93) on the remaining trial (Figure 5.5(b)). We note that SSA always found the presumed ML tree, and took less than half of the time required by DNAMLK.

Tree    Log Likelihood    Times Found
                          SSA    DNAMLK
1       -24,792.75        5      4
2       -24,793.93        5      1
3       -24,796.12        4      0

Table 5.6: Summary of the tree estimation results for the HIV example.

Comparison of the three trees in Figure 5.5 gives some interesting information.

We see that these trees differ only in the placement of the C-H clade in relation to the other portions of the tree. This suggests that the evolutionary origin of the subtypes

C and H may not be well-established, at least for the env gene. In light of the recently estimated increase in occurrence of subtype C strains of the virus [21], understanding of the evolutionary origins of this viral subtype may be very important.

5.2 Simultaneous Estimation of the Tree and Substitution Model Parameters

We now consider the problem of simultaneous estimation of the tree and the parameters in the underlying substitution models. In the examples which follow, we consider estimation of the parameters $\mu$ and $K$ in the F84 model. The results obtained with SSA will be compared with those obtained by the program PAUP*

(a) Tree 1: ln likelihood = -24,792.75    (b) Tree 2: ln likelihood = -24,793.93
(c) Tree 3: ln likelihood = -24,796.12

Figure 5.5: Three trees of high likelihood found by SSA for the HIV data set.

[60], using a beta release of the program for the Macintosh. Results for PAUP* will thus be obtained from runs on the Power Macintosh G3, while those for SSA will be obtained from an HP Series 9000 machine. Thus, timing data for the two methods will not be comparable. Nevertheless, we have reported information about run times in the results which follow.

For the mtDNA data set (see Section 5.1.2), both programs always find the tree with the highest known likelihood (-2,612.79), which is shown in Figure 5.6(a). The parameter estimates obtained using each of the methods are very close also (see Table

5.7). For SSA, the ten best trees found by the algorithm were saved. Comparison of these trees for each of the five runs shows that there are two trees with likelihoods nearly as high as that of the optimal tree (-2,613.1633 and -2,613.1646), which are also shown in Figure 5.6. These trees were listed in the top ten trees found by the algorithm in all five of the trials. There were eight trees which appeared more than once, with only four of these appearing on four or more of the trials. A total of 20 unique trees appeared in the top ten trees found by the algorithm. SSA took an average of 5,426.40 seconds to run on an HP machine, and PAUP* took an average of 26,653.08 seconds on a Macintosh G3.

Parameter    SSA       PAUP*
K            40.24     39.53
μ            0.7083    0.7082

Table 5.7: Parameter estimates for the mtDNA data set. Values shown in the table are averages over the five runs.

(a) Tree 1: ln likelihood = -2,612.7952    (b) Tree 2: ln likelihood = -2,613.1633    (c) Tree 3: ln likelihood = -2,613.1646

Figure 5.6: Three trees of high likelihood found by SSA for the 14-sequence mtDNA data set when the substitution model parameters are estimated simultaneously.

PAUP* also has an option which allows the k best trees found by the algorithm to be saved. However, this option is not available when a random order of addition of sequences to the initial tree is used, and therefore this option was not in effect for the runs discussed above. PAUP* was thus also run for a non-random order of addition of sequences to the tree, and the ten best trees found by the algorithm were saved.

This run took approximately 46,694.60 seconds (about 13 hours) on the Macintosh.

The three trees shown in Figure 5.6 were all listed in the ten best trees found by

PAUP*.

For the 14-sequence mtDNA set considered here, it seems that SSA performs comparably to PAUP*, with the additional advantage of being able to return several trees of high likelihood in addition to the ML tree.

5.3 Estimation of the Most Parsimonious Tree(s)

We next apply the simulated annealing algorithm for the parsimony problem to the same three real data sets considered in Section 5.1. For comparison, we have considered here the program DNAPARS in PHYLIP [19] and the program PAUP*

[60] (but note that the times for PAUP* are from the Macintosh and are therefore not comparable with times for the simulated annealing algorithm (SA) and DNAPARS).

Ten trials were performed for each of the data sets with each of the programs. For

SA, random starting trees were generated from a Yule model [24], while for both

DNAPARS and PAUP* random orders of addition of sequences to the trees were used. The results are shown in Table 5.8.
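Before turning to those results, a minimal sketch of how a random starting topology can be drawn from a Yule model [24] is given below: lineages are simply joined two at a time, each pair chosen uniformly at random. The nested-tuple representation and the function name are illustrative assumptions only.

    import random

    # Draw a random rooted topology under a Yule model by successively
    # joining two uniformly chosen lineages until one remains.
    def random_yule_topology(taxa):
        lineages = list(taxa)
        while len(lineages) > 1:
            i, j = random.sample(range(len(lineages)), 2)
            a, b = lineages[i], lineages[j]
            lineages = [x for k, x in enumerate(lineages) if k not in (i, j)]
            lineages.append((a, b))
        return lineages[0]

    # Example: a random rooted topology for five taxa.
    print(random_yule_topology(["A", "B", "C", "D", "E"]))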

Data Set                       Time (seconds)                       Number Correct*              Minimum Length
                               SA          DNAPARS     PAUP*        SA      DNAPARS    PAUP*     SA         DNAPARS    PAUP*
mtDNA (n=14, 231 sites)        119.80      1.77        0.06         10/10   10/10      10/10     742        742        742
HPV data (n=30, 1,379 sites)   7,131.21    65.04       0.58         9/10    7/10       9/10      6,058.2    6,058.0    6,057.3
HIV data (n=30, 2,245 sites)   16,273.27   85.66       0.34         9/10    8/10       10/10     4,524.2    4,523.9    4,523.0

*The number of times the tree with the minimum known length is found.

All three of the programs returned trees that are only locally optimal for at least one of the trials for the larger data sets. This problem appears to be most severe for

DNAPARS and least severe for PAUP*. SA performed nearly as well as PAUP*, and better than DNAPARS, but at the expense of a very large increase in computation time. That the simulated annealing algorithm sometimes leads to long run times to obtain an optimal solution is well-documented [2]. However, the simulated annealing algorithm may have an advantage in very large problems, where methods like those used in DNAPARS and PAUP* will almost certainly obtain sub-optimal trees. Although SA in this type of setting may require long run times, it may find solutions of shorter length than those found by the existing methods. This type of behavior is what Daniel Barker observed with his simulated annealing program LVB [4].

It is also interesting to compare the trees obtained by the parsimony method with those obtained using ML in Sections 5.1 and 5.2. For the 14-sequence mtDNA data set, the MP tree shows the same topology as the tree with the second highest log likelihood, shown in Figure 5.3(b). For comparison, we can compute the lengths of the trees in Figure 5.3(a) and (c). We see that the tree with the highest log likelihood

(Figure 5.3(a)) has a length of 747, while the tree with the third highest log likelihood

(Figure 5.3(c)) has a length of 749. Because the mutation rate is estimated to be quite high for mtDNA (see Section 5.2) we do not expect the ML and MP trees to agree with one another. We prefer the ML tree as an estimate of the evolutionary relationships among these 14 species.
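The tree lengths quoted above (742, 747, and 749) are values of the parsimony length function of Chapter 3. As an illustration only, the following sketch computes the Fitch parsimony length of a single site; summing such per-site values over all sites gives a tree length of this kind. The nested-tuple topology and the state mapping are assumed conventions, and the length function used in this work may differ in detail.

    # Fitch algorithm for the parsimony length of one site.  `topology` is a
    # nested tuple of taxon labels and `states` maps each taxon to its
    # nucleotide at the site (illustrative assumptions).
    def fitch_site_length(topology, states):
        def post_order(node):
            if not isinstance(node, tuple):                # leaf
                return {states[node]}, 0
            left_set, left_len = post_order(node[0])
            right_set, right_len = post_order(node[1])
            common = left_set & right_set
            if common:                                     # no change required
                return common, left_len + right_len
            return left_set | right_set, left_len + right_len + 1
        return post_order(topology)[1]

    # Example: one site on a four-taxon tree requires a single change.
    tree = (("A", "B"), ("C", "D"))
    site = {"A": "A", "B": "A", "C": "G", "D": "G"}
    print(fitch_site_length(tree, site))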

For the HPV data set, the MP tree differs from all of the trees shown in Figure

5.4 and is shown in Figure 5.7. Grouping into the various subtypes defined in Table

5.3 is maintained in this tree, but the placement of the subtypes in relation to one


Figure 5.7: MP tree for the HPV data set.

another is very different. One of the most striking differences is that in the ML trees

in Figure 5.4 the subtype A4 sequences are the “outgroup” sequences, while in the

MP tree, the A1 sequences are most distantly related to the others. We also note that the placement of the subtype A9 sequence RhPV1 with the subtype A4 clade is in agreement with the tree which had the third highest log likelihood, and differed from the other three trees of high likelihood.

As was the case for the HPV data set, the MP tree for the HIV data set differs from all three of the trees of high likelihood found in Section 5.1.4 and is shown in

Figure 5.8. Comparison of the MP tree with the ML trees in Figure 5.5 again shows some interesting relationships among the C-H subtypes and the remaining sequences.

Figure 5.8: MP tree for the HIV data set.

In the MP tree, the H sequence is grouped with the two subtype O sequences, which are then grouped with the four subtype C sequences. This tree is most similar to the tree with the third highest likelihood, since in that tree the C-H clade was closest to the O subtype sequences. Other groupings and relationships among clades are the same in the MP tree as in the ML trees in Figure 5.5.

CHAPTER 6

CONCLUSION AND FUTURE DIRECTIONS

The stochastic search algorithm for estimation of maximum likelihood phylogenetic trees has been shown to perform quite well in comparison with existing methods.

The algorithm is able to reduce the time required to estimate the maximum likeli­ hood tree, while at the same time showing an improved ability to locate the true maximum. A further advantage to the method is that it allows information about multiple trees to be returned. Such information often provides interesting insight into the phylogenetic relationships observed in the maximum likelihood tree.

It would be useful to extend the stochastic search algorithm to the case of unrooted phylogenetic trees as well. Because there are no constraints on branch lengths in unrooted trees, the local rearrangement strategy would need to be modified to allow individual branch lengths, rather than node times, to be adjusted. Since there is no longer an “allowable interval" to be considered, the beta distribution could not as easily be used to suggest changes in branch lengths, and perhaps some other random perturbation of branch lengths might be more appropriate. Otherwise, however, the algorithm would remain essentially unchanged, and so we might expect it to be a useful improvement in the estimation of unrooted trees as well.

The stochastic search algorithm could further be improved by incorporating features that would allow some of the assumptions made in the calculation of the likelihood to be relaxed. For example, implementation of a method which allows site-to-site rate variation to be modeled in some manner would be useful. It would also be beneficial to develop a substitution model which takes insertion and deletion events into account. Further, the assumption of site-wise independence should be considered.

In Chapter 2, we demonstrated that the values of the substitution model parameters can have an important effect on the resulting estimate of the phylogenetic tree.

For this reason, it is important that careful attention be given to the problem of simultaneous estimation of the tree and the substitution model parameters. Although the PAUP* package does seem to do a reasonable job of simultaneously estimating the tree and the transition/transversion ratio for the 14-sequence mtDNA data set, the HPV and HIV examples in Sections 5.1.3 and 5.1.4 show that such an algorithm may return suboptimal estimates of the ML tree when the number of sequences is large. It is expected that this problem will grow worse when a larger number of sequences are examined. Our proposed algorithm shows reasonable performance for the 14-sequence mtDNA example. It will be interesting to test the method for larger examples, and to compare these results to currently used methods such as those used by PAUP*.

The simulated annealing algorithm for the estimation of phylogenetic trees under the parsimony criteria shows the expected performance based on the previous literature in this area. Although the algorithm does seem to show increased performance over the DNAPARS program in PHYLIP, its performance is very similar to that of the package PAUP*. However, it is much more time-consuming than either of these

programs, and so it seems that, at least for examples that are not extremely large, a more beneficial strategy for estimation of phylogenetic trees under the parsimony criteria might be to perform many repeated runs of a deterministic algorithm with varied starting points. Whether a simulation-based algorithm such as simulated annealing would be helpful in a very large example, say over 100 sequences, has been unexplored in the literature and is, perhaps, worth investigating further.

In conclusion, the simulation-based methods for estimation of phylogenetic trees proposed here should be a useful contribution to the collection of phylogenetic tree reconstruction methods. As more DNA sequence data is accumulated and we seek to apply phylogenetic methods to larger and larger data sets, the algorithms available today will almost certainly fail. Methods such as those proposed here are therefore essential to our future understanding and interpretation of large DNA sequence data sets.

APPENDIX A

Sequence  Site Pattern      Sequence  Site Pattern      Sequence  Site Pattern
69   ACGTTGCAAGCA           89   ACGTTGCGTCAA           109  ACGTTGTCGTCA
70   ACGTTGCAAGCG           90   ACGTTGCGTCAG           110  ACGTTGTCGTCG
71   ACGTTGCAAGTC           91   ACGTTGCGTCGC           111  ACGTTGTCGTTC
72   ACGTTGCAAGTT           92   ACGTTGCGTCGT           112  ACGTTGTCGTTT
73   ACGTTGCAGCAA           93   ACGTTGCGTTCA           113  ACGTTGTTCAAA
74   ACGTTGCAGCAG           94   ACGTTGCGTTCG           114  ACGTTGTTCAAG
75   ACGTTGCAGCGC           95   ACGTTGCGTTTC           115  ACGTTGTTCAGC
76   ACGTTGCAGCGT           96   ACGTTGCGTTTT           116  ACGTTGTTCAGT
77   ACGTTGCAGTCA           97   ACGTTGTCAAAA           117  ACGTTGTTCGCA
78   ACGTTGCAGTCG           98   ACGTTGTCAAAG           118  ACGTTGTTCGCG
79   ACGTTGCAGTTC           99   ACGTTGTCAAGC           119  ACGTTGTTCGTC
80   ACGTTGCAGTTT           100  ACGTTGTCAAGT           120  ACGTTGTTCGTT
81   ACGTTGCGCAAA           101  ACGTTGTCAGCA           121  ACGTTGTTTCAA
82   ACGTTGCGCAAG           102  ACGTTGTCAGCG           122  ACGTTGTTTCAG
83   ACGTTGCGCAGC           103  ACGTTGTCAGTC           123  ACGTTGTTTCGC
84   ACGTTGCGCAGT           104  ACGTTGTCAGTT           124  ACGTTGTTTCGT
85   ACGTTGCGCGCA           105  ACGTTGTCGCAA           125  ACGTTGTTTTCA
86   ACGTTGCGCGCG           106  ACGTTGTCGCAG           126  ACGTTGTTTTCG
87   ACGTTGCGCGTC           107  ACGTTGTCGCGC           127  ACGTTGTTTTTC
88   ACGTTGCGCGTT           108  ACGTTGTCGCGT           128  ACGTTGTTTTTT

Times repeated (per site pattern position): 16 16 16 16 16 16 16 8 4 2 1 1

Table A.1: Theoretical data used in the examples in Section 5.1.1 for n = 128. The smaller theoretical data sets are similar.

Sequence    Site Pattern
A           A G C T C A A
B           A G C T C A G
C           A G C T C G C
D           A G C T C G T
Times repeated: 16 16 16 16 16 16 32

Table A.2: Theoretical data used in demonstrating the effect of the transition/transversion ratio in Section 2.3.

APPENDIX B

NEWTON-RAPHSON CALCULATIONS

B.1 The Newton-Raphson Method for the Assignment of a Time to the Target Node in the Local Rearrangement Strategy

Let $t$ denote the target node, let c1 and c2 denote the children of the target node after local rearrangement, and let $t_i$ denote the time assigned to node $i$. We consider only the subtree descending from node $t$, and we wish to use the Newton-Raphson method to find the value of $t_t$ which optimizes the likelihood of this subtree (see Figure B.1).

For the calculation, we assume that the conditional likelihoods for nodes c1 and c2 are known.

First, we consider the likelihood for site j.

\[
l^j = \sum_x \pi_x \left[ \sum_y P_{xy}(t_t - t_{c1})\, l^j_{c1}(y) \right] \left[ \sum_z P_{xz}(t_t - t_{c2})\, l^j_{c2}(z) \right].
\]

103 Thus the first derivative of the likelihood for site j is given by

\[
\frac{\partial l^j}{\partial t_t} = \sum_x \pi_x \left\{ \left[ \sum_y P_{xy}(t_t - t_{c1})\, l^j_{c1}(y) \right] \sum_z l^j_{c2}(z) \frac{\partial}{\partial t_t} P_{xz}(t_t - t_{c2})
+ \left[ \sum_z P_{xz}(t_t - t_{c2})\, l^j_{c2}(z) \right] \sum_y l^j_{c1}(y) \frac{\partial}{\partial t_t} P_{xy}(t_t - t_{c1}) \right\},
\]

where the $P_{ij}(t)$ are the transition probabilities of the substitution model (see Section 2.1).

Similarly, the second derivative is given by

\[
\frac{\partial^2 l^j}{\partial t_t^2} = \sum_x \pi_x \left\{ C_{c1,t}(x,y) \sum_z l^j_{c2}(z) \frac{\partial^2}{\partial t_t^2} P_{xz}(t_t - t_{c2})
+ C_{c2,t}(x,z) \sum_y l^j_{c1}(y) \frac{\partial^2}{\partial t_t^2} P_{xy}(t_t - t_{c1})
+ 2 \left[ \sum_y l^j_{c1}(y) \frac{\partial}{\partial t_t} P_{xy}(t_t - t_{c1}) \right] \left[ \sum_z l^j_{c2}(z) \frac{\partial}{\partial t_t} P_{xz}(t_t - t_{c2}) \right] \right\},
\]

where

\[
C_{c1,t}(x,y) = \sum_y P_{xy}(t_t - t_{c1})\, l^j_{c1}(y),
\]

and $C_{c2,t}(x,z)$ is a similar term.

Then for all sites combined, the overall log likelihood would be

\[
L = \ln l = \sum_{i=1}^{nsites} \ln l^i.
\]

Thus the first derivative of the log likelihood is given by

\[
\frac{\partial \ln l}{\partial t_t} = \sum_{i=1}^{nsites} \frac{1}{l^i} \frac{\partial l^i}{\partial t_t},
\]

and the second derivative of the log likelihood is

\[
\frac{\partial^2 \ln l}{\partial t_t^2} = \sum_{i=1}^{nsites} \left[ \frac{1}{l^i} \frac{\partial^2 l^i}{\partial t_t^2} - \frac{1}{(l^i)^2} \left( \frac{\partial l^i}{\partial t_t} \right)^2 \right].
\]

Figure B.1: Subtree defined by the target node and its descendants. Note that the nodes c1 and c2 may have nodes descending from them.

Let $t_0$ be an initial value for the node time. We generally set $t_0$ to be the mid-point of the "allowable" interval, i.e., the interval between the time of the parent node and the time of the closest of the two children following local rearrangement. We then set the estimate of the time of the target node, $\hat{t}_t$, to be the value given in a single step of the Newton-Raphson procedure, i.e.,

\[
\hat{t}_t = t_0 - \left. \frac{\partial \ln l}{\partial t_t} \right|_{t_0} \bigg/ \left. \frac{\partial^2 \ln l}{\partial t_t^2} \right|_{t_0}.
\]

The value of $\hat{t}_t$ is then used to determine the mean of the beta distribution from which the time assigned to the target node will be drawn. Specifically, we set the mean to a fixed fraction of the distance from $t_0$ to $\hat{t}_t$ if $\hat{t}_t$ falls within the allowable interval, and to the same fraction of the distance to the endpoint of the interval in the direction of $\hat{t}_t$ otherwise. The variance of the beta distribution is given in (2.14).
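A minimal sketch of this single Newton-Raphson step is given below. The functions dlogL and d2logL, which evaluate the first and second derivatives of the subtree log likelihood at a candidate node time, are assumed to be built from the per-site formulas above; all names are illustrative.

    # Single Newton-Raphson step for the target-node time, started from the
    # mid-point of the allowable interval.  dlogL and d2logL are assumed to
    # return the first and second derivatives of the subtree log likelihood.
    def newton_step_node_time(t_parent, t_child_closest, dlogL, d2logL):
        t0 = 0.5 * (t_parent + t_child_closest)
        t_hat = t0 - dlogL(t0) / d2logL(t0)
        return t0, t_hat

    # t_hat is then used to set the mean of the beta distribution from which
    # the new node time is drawn, with variance as in (2.14).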

B.2 The Newton-Raphson Method for Estimation of $\mu$ and $K$ in the F84 Substitution Model

To calculate derivatives of the log likelihood with respect to $\mu$ and $K$, we first

examine the derivatives of the conditional likelihood of an arbitrary node $k$ for site

$j$ given state $s$. Again, let nodes c1 and c2 be the descendants of the node under

consideration, $k$, and suppose that the conditional likelihoods of c1 and c2 have

already been calculated. We note first that

\[
l^j_k(s) = \left[ \sum_x P_{sx}(t_{c1} - t_k)\, l^j_{c1}(x) \right] \left[ \sum_y P_{sy}(t_{c2} - t_k)\, l^j_{c2}(y) \right] = C_{c1,k}(s,x)\, C_{c2,k}(s,y),
\]

where $C_{c1,k}(s,x)$ and $C_{c2,k}(s,y)$ are as defined above.

Then, the derivative of $l^j_k(s)$ with respect to $\mu$ is

\[
\frac{\partial l^j_k(s)}{\partial \mu} = C_{c1,k}(s,x) \frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\} + C_{c2,k}(s,y) \frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\},
\]

where

\[
\frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\} = \sum_x \left[ P_{sx}(t_{c1} - t_k) \frac{\partial}{\partial \mu}\{l^j_{c1}(x)\} + l^j_{c1}(x) \frac{\partial}{\partial \mu}\{P_{sx}(t_{c1} - t_k)\} \right],
\]

and $\frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\}$ is a similar term.

We note from the above equations that the derivative of the conditional likelihood

for any internal node is a function of the derivatives of the conditional likelihoods of its two descendent nodes. Therefore, the derivative of the likelihood of the entire tree

106 can be calculated using the pruning algorithm of Felsenstein [16] in the same way

that the likelihood of the tree was calculated (see equation (2.13)). We also note that

the calculation of the first derivative with respect to the parameter K is identical to

that described above for $\mu$.
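The recursion just described can be sketched as a single post-order (pruning) pass that returns both the conditional likelihoods and their derivatives with respect to $\mu$ for one site. The node representation and the assumed transition-probability functions P and dP_dmu below are illustrative only.

    BASES = "ACGT"

    # One post-order pass carrying conditional likelihoods and their
    # mu-derivatives for a single site.  Nodes are dicts with a 'time' and
    # either a 'state' (leaf) or 'children' (internal node).  P(s, x, t) and
    # dP_dmu(s, x, t) are assumed to give the transition probability and its
    # derivative with respect to mu under the substitution model.
    def conditional_likelihoods(node, P, dP_dmu):
        if "state" in node:                        # leaf: indicator likelihood
            lik = {s: 1.0 if s == node["state"] else 0.0 for s in BASES}
            dlik = {s: 0.0 for s in BASES}
            return lik, dlik
        c1, c2 = node["children"]
        lik1, dlik1 = conditional_likelihoods(c1, P, dP_dmu)
        lik2, dlik2 = conditional_likelihoods(c2, P, dP_dmu)
        t1 = c1["time"] - node["time"]
        t2 = c2["time"] - node["time"]
        lik, dlik = {}, {}
        for s in BASES:
            C1 = sum(P(s, x, t1) * lik1[x] for x in BASES)
            C2 = sum(P(s, y, t2) * lik2[y] for y in BASES)
            dC1 = sum(P(s, x, t1) * dlik1[x] + lik1[x] * dP_dmu(s, x, t1)
                      for x in BASES)
            dC2 = sum(P(s, y, t2) * dlik2[y] + lik2[y] * dP_dmu(s, y, t2)
                      for y in BASES)
            lik[s] = C1 * C2
            dlik[s] = C1 * dC2 + C2 * dC1          # product rule, as above
        return lik, dlik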

The second derivative of the conditional likelihood at node k for site j given

nucleotide s is given by

\[
\frac{\partial^2 l^j_k(s)}{\partial \mu^2} = \frac{\partial}{\partial \mu}\left\{ C_{c1,k}(s,x) \frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\} + C_{c2,k}(s,y) \frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\} \right\}
\]
\[
= C_{c1,k}(s,x) \frac{\partial^2}{\partial \mu^2}\{C_{c2,k}(s,y)\} + C_{c2,k}(s,y) \frac{\partial^2}{\partial \mu^2}\{C_{c1,k}(s,x)\}
+ 2\, \frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\} \frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\},
\]

where

\[
\frac{\partial^2}{\partial \mu^2}\{C_{c1,k}(s,x)\} = \frac{\partial}{\partial \mu}\left\{ \sum_x \left[ P_{sx}(t_{c1} - t_k) \frac{\partial}{\partial \mu}\{l^j_{c1}(x)\} + l^j_{c1}(x) \frac{\partial}{\partial \mu}\{P_{sx}(t_{c1} - t_k)\} \right] \right\}
\]
\[
= \sum_x \left[ P_{sx}(t_{c1} - t_k) \frac{\partial^2}{\partial \mu^2}\{l^j_{c1}(x)\} + l^j_{c1}(x) \frac{\partial^2}{\partial \mu^2}\{P_{sx}(t_{c1} - t_k)\}
+ 2\, \frac{\partial}{\partial \mu}\{l^j_{c1}(x)\} \frac{\partial}{\partial \mu}\{P_{sx}(t_{c1} - t_k)\} \right],
\]

and $\frac{\partial^2}{\partial \mu^2}\{C_{c2,k}(s,y)\}$ is a similar term.

We note that the calculation of the second derivative of the conditional likelihood with respect to $K$ is similar to the calculation shown above for $\mu$. As in the case of the first derivatives, the second derivatives of the entire tree with respect to both $\mu$ and $K$ are then calculated according to the pruning algorithm of Felsenstein [16].

Next, we consider the mixed partial derivative,

\[
\frac{\partial^2 l^j_k(s)}{\partial K\, \partial \mu} = \frac{\partial}{\partial K}\left\{ C_{c1,k}(s,x) \frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\} + C_{c2,k}(s,y) \frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\} \right\}
\]
\[
= \frac{\partial}{\partial K}\{C_{c1,k}(s,x)\} \frac{\partial}{\partial \mu}\{C_{c2,k}(s,y)\} + C_{c1,k}(s,x) \frac{\partial^2}{\partial K\, \partial \mu}\{C_{c2,k}(s,y)\}
+ \frac{\partial}{\partial K}\{C_{c2,k}(s,y)\} \frac{\partial}{\partial \mu}\{C_{c1,k}(s,x)\} + C_{c2,k}(s,y) \frac{\partial^2}{\partial K\, \partial \mu}\{C_{c1,k}(s,x)\},
\]

where

\[
\frac{\partial^2}{\partial K\, \partial \mu}\{C_{c1,k}(s,x)\} = \frac{\partial}{\partial K}\left\{ \sum_x \left[ P_{sx}(t_{c1} - t_k) \frac{\partial}{\partial \mu}\{l^j_{c1}(x)\} + l^j_{c1}(x) \frac{\partial}{\partial \mu}\{P_{sx}(t_{c1} - t_k)\} \right] \right\}
\]
\[
= \sum_x \left[ \frac{\partial}{\partial K}\{P_{sx}(t_{c1} - t_k)\} \frac{\partial}{\partial \mu}\{l^j_{c1}(x)\} + P_{sx}(t_{c1} - t_k) \frac{\partial^2}{\partial K\, \partial \mu}\{l^j_{c1}(x)\}
+ \frac{\partial}{\partial K}\{l^j_{c1}(x)\} \frac{\partial}{\partial \mu}\{P_{sx}(t_{c1} - t_k)\} + l^j_{c1}(x) \frac{\partial^2}{\partial K\, \partial \mu}\{P_{sx}(t_{c1} - t_k)\} \right].
\]

We note that the partial derivative for the entire tree is found using the pruning

algorithm of Felsenstein [16], as in the case of the first and second derivatives.

To find simultaneous estimates of $\mu$ and $K$ for a given tree topology and set of node

times, we iterate through the steps of the Newton-Raphson method, i.e., we repeat

the following calculation until the change in the values of $\mu$ and $K$ in subsequent

steps falls within a certain tolerance level,

\[
\begin{pmatrix} \mu_{i+1} \\ K_{i+1} \end{pmatrix} = \begin{pmatrix} \mu_i \\ K_i \end{pmatrix}
- \begin{pmatrix} \dfrac{\partial^2 \ln l}{\partial \mu^2} & \dfrac{\partial^2 \ln l}{\partial \mu\, \partial K} \\ \dfrac{\partial^2 \ln l}{\partial K\, \partial \mu} & \dfrac{\partial^2 \ln l}{\partial K^2} \end{pmatrix}^{-1}
\begin{pmatrix} \dfrac{\partial \ln l}{\partial \mu} \\ \dfrac{\partial \ln l}{\partial K} \end{pmatrix},
\]

where $\mu_i$ and $K_i$ are the values of $\mu$ and $K$ at iteration $i$. We take as initial values, $\mu_0$ and $K_0$, the current values of $\mu$ and $K$ in the SSA.
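A minimal sketch of this iteration is given below. The functions grad and hess are assumed to return the gradient and the 2x2 matrix of second derivatives of the log likelihood with respect to $(\mu, K)$, computed via the pruning algorithm as described above; all names are illustrative.

    # Newton-Raphson iteration for simultaneous estimation of mu and K.
    # grad(mu, K) returns (dlnl/dmu, dlnl/dK); hess(mu, K) returns the 2x2
    # matrix of second derivatives.  Iteration stops when successive values
    # change by less than tol.
    def newton_mu_kappa(mu, K, grad, hess, tol=1e-6, max_iter=100):
        for _ in range(max_iter):
            g1, g2 = grad(mu, K)
            (h11, h12), (h21, h22) = hess(mu, K)
            det = h11 * h22 - h12 * h21
            # Solve the 2x2 system H * step = g directly.
            step_mu = (h22 * g1 - h12 * g2) / det
            step_K = (-h21 * g1 + h11 * g2) / det
            mu_new, K_new = mu - step_mu, K - step_K
            if abs(mu_new - mu) < tol and abs(K_new - K) < tol:
                return mu_new, K_new
            mu, K = mu_new, K_new
        return mu, K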

BIBLIOGRAPHY

[1 ] Human Papillomaviruses: A Compilation and Analysis of Nucleic Acid and Amino Acid Sequences, 1997.

[2] E. Aarts and J. Korst. Simulated Annealing and Boltzmann Machines. Wiley and Sons, First edition, 1989.

[3] E. Aarts and P. J. VanLaarhoven. A new polynomial time cooling schedule. Proc. IEEE Int. Conf. On Computer-Aided Design, Santa Clara, x:206-208, 1989.

[4] D. Barker. LVB 1.0: Reconstructing Evolution with Parsimony and Simulated Annealing. University of Edinburgh, 1997.

[5] C. J. P. Belisle. Convergence theorems for a class of simulated annealing algorithms on $R^d$. Journal of Applied Probability, 29:885-895, 1992.

[6] I. O. Bohachevsky, M. E. Johnson, and M. L. Stein. Generalized simulated annealing for function optimization. Technometrics, 28(3):209-217, 1986.

[7] D. S. Burke and F. E. McCutchan. Global Distribution of Human Immunodeficiency Virus-1 Clades. In V. T. DeVita, S. Hellman, and S. A. Rosenberg, editors, AIDS: Etiology, Diagnosis, Treatment, and Prevention, pages 119-126. Lippincott-Raven Publishers, fourth edition, 1997.

[8] L. L. Cavalli-Sforza and A. W. F. Edwards. Phylogenetic analysis: models and estimation procedures. Evolution, 32:550-570, 1967.

[9] V. Cerny. Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications, 45:41-51, 1985.

[10] S.-Y. Chan, H.-U. Bernard, C.-K. Ong, S.-P. Chan, B. Hoffman, and H. Delius. Phylogenetic analysis of 48 papillomavirus types and 28 subtypes and variants: a showcase for the molecular evolution of DNA viruses. Journal of Virology, 66(10):5714-5725, 1992.

[11] S.-Y. Chan, H. Delius, A. Halpern, and H.-U. Bernard. Analysis of genomic sequences of 95 papillomavirus types: uniting typing, phylogeny, and taxonomy. Journal of Virology, 69(5):3074-3082, 1995.

[12] A. Dress and M. Kruger. Parsimonious Phylogenetic Trees in Metric Spaces and Simulated Annealing. Advances in Applied Mathematics, 8:8-37, 1987.

[13] L. Fei. On a Stochastic Optimization Technique: Stochastic Probing. PhD thesis, The Ohio State University, Columbus, OH 43210, 1992.

[14] J. Felsenstein. Personal communication.

[15] J. Felsenstein. Maximum-likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology, 22:240-249, 1973.

[16] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17:368-376, 1981.

[17] J. Felsenstein. “Statistical inference of phylogenies”. Journal of the Royal Sta­ tistical Society Series A, 146:246-272, 1983.

[18] J. Felsenstein. Distance methods for inferring phylogenies: A justification. Evo­ lution, 38:16-24, 1984.

[19] J. Felsenstein. Phylogenetic Inference Package (PHYLIP), Version 3.5. Univer­ sity of Washington, Seattle, 1993.

[20] W. L. Fink. Microcomputers and Phylogenetic Analysis. Science, 234:1135-1139, 1986.

[21] F. Gao, D. Robertson, C. Carruthers, S. Morrison, B. Jian, Y. Chen, F. Barre-Sinoussi, M. Girard, A. Srinivasan, A. Abimiku, G. Shaw, P. Sharp, and B. Hahn. A comprehensive panel of near-full-length clones and reference sequences for non-subtype B isolates of Human Immunodeficiency Virus type 1. Journal of Virology, 72(7):5680-5698, 1998.

[22] H. Haario and E. Saksman. Simulated annealing process in general state space. Advances in Applied Probability, 23:866-893, 1991.

[23] B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311-329, 1988.

[24] E. F. Harding. The probabilities of rooted trees-shapes generated by random bifurcation. Advances in Applied Probability, 3:44-77, 1971.

[25] M. Hasegawa, H. Kishino, and N. Saitou. On the maximum likelihood method in molecular phylogenetics. Journal of Molecular Evolution, 32:443-445, 1991.

[26] M. Hasegawa, H. Kishino, and T.-A. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 21:160-174, 1985.

[27] K. Hayasaka, T. Gojobori, and S. Horai. Molecular phylogeny and evolution of primate mitochondrial DNA. Molecular Biology Evolution, 5(6):626-644, 1988.

[28] Y. Ina. Estimation of the transition/transversion ratio. Journal of Molecular Evolution, 46:521-533, 1998.

[29] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian Protein Metabolism, pages 21-132. Academic Press, New York, 1969.

[30] M. Kimura. A simple method for estimating evolutionary rate of base substitu­ tions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111-120, 1980.

[31] S. Kirkpatrick, C. Gelatt, and M. P. Vecchi. Optimization by simulated anneal­ ing. Science, 220:671-680, 1983.

[32] M. K. Kuhner, J. Yamato, and J. Felsenstein. Estimating effective population size and mutation rate for sequence data using Metropolis-Hastings sampling. Genetics, 140:1421-1430, 1995.

[33] P. W. Laud, L. M. Berliner, and P. K. Goel. A stochastic probing algorithm for global optimization. Journal of Global Optimization, 2:209-224, 1992.

[34] T. Leitner, B. Korber, D. Robertson, F. Gao, and B. Hahn. Updated proposal of reference sequences of HIV-1 genetic subtypes. In B. Korber, B. Hahn, B. Foley, J. Mellors, T. Leitner, G. Myers, F. McCutchan, and C. Kuiken, editors, Human Retroviruses and AIDS 1997: A Compilation and Analysis of Nucleic Acid and Amino Acid Sequences, pages 19-24. Los Alamos National Laboratory, 1997.

[35] P. Lewis. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Molecular Biology Evolution, 15(3):277—283, 1998.

[36] S. Li. Phylogenetic Tree Construction Using Markov Chain Monte Carlo. PhD thesis, The Ohio State University, Columbus, OH 43210, 1996.

[37] S. Li, D. Pearl, and H. Doss. Phylogenetic tree construction using Markov Chain Monte Carlo. Technical Report 583, The Ohio State University, August 1996. (Revision 1, 1998).

[38] W.-H. Li. Molecular Evolution. Sinauer Associates, First edition, 1997.

[39] M. Locatelli. Convergence properties of simulated annealing for continuous global optimization. Journal of Applied Probability, 33:1127-1140, 1996.

[40] M. Lundy. Applications of the annealing algorithm to combinatorial problems in statistics. Biometrika, 72(1):191-198, 1985.

[41] M. Lundy and A. Mees. Convergence of an annealing algorithm. Mathematical Programming, 34:111-124, 1986.

[42] H. Matsuda. Protein phylogenetic inference using maximum likelihood with a genetic algorithm. In Pacific Symposium on Biocomputing, pages 512-523, 1996.

[43] B. Mau, M. Newton, and B. Larget. Bayesian phylogenetic inference via Markov Chain Monte Carlo methods. Technical Report 961, Department of Statistics, University of Wisconsin-Madison, 1996.

[44] F. E. McCutchan, M. O. Salminen, J. K. Carr, and D. S. Burke. HIV-1 genetic diversity. AIDS, 10 (suppl 3):S13-S20, 1996.

[45] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087-1092, 1953.

[46] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Advances in Applied Probability, 18:747-771, 1986.

[47] M. Nei. Molecular Population Genetics and Evolution. North-Holland, First edition, 1975.

[48] E. Nummelin. General Irreducible Markov Chains and Non-Negative Operators. Cambridge University Press, 1984.

[49] G. Olsen, H. Matsuda, R. Hagstrom, and R. Overbeek. fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in the Biosciences, 10(1):41-48, 1994.

[50] C.-K. Ong, S. Nee, A. Rambaut, H.-U. Bernard, and P. H. Harvey. Elucidating the population histories and transmission dynamics of papillomaviruses using phylogenetic trees. Journal of Molecular Evolution, 44:199-206, 1997.

[51] S. Orey. Limit Theorems for Markov Chain Transition Probabilities. Van Nostrand Reinhold Company, 1971.

[52] D. B. Pollock and D. D. Goldstein. A comparison of two methods for reconstructing evolutionary distances from a weighted contribution of transition and transversion differences. Molecular Biology Evolution, 12:713-717, 1995.

[53] M. Powell. A tolerant algorithm for linearly constrained optimizations calcula­ tions. DAMTP Report NA17, University of Cambridge, England, 1988.

[54] M. Powell. TOLMIN: A fortran package for linearly constrained optimizations calculations. DAMTP Report NA2, University of Cambridge, England, 1989.

[55] A. Purvis and L. Bromham. Estimating the transition/transversion ratio from independent pairwise comparisons with an assumed phylogeny. Journal of Molec­ ular Evolution, 44:112—119, 1997.

[56] A. Rambaut and N. Grassly. Sequence Parameters of Trees: SPOT. Department of Zoology, University of Oxford, 1996.

[57] D. Revuz. Markov Chains. North Holland, second edition, 1984.

[58] N. Saitou and M. Nei. The neighbor-joining method: a new method for recon­ structing phylogenetic trees. Molecular Biology Evolution, 4:406-425, 1987.

[59] K. Stromberg. An Introduction to Classical Real Analysis. Wadsworth, Inc., first edition, 1981.

[60] D. Swofford. PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer Associates, 1998.

[61] J. Wakeley. Substitution-rate variation among sites and the estimation of transition bias. Molecular Biology Evolution, 11(3):436-442, 1994.

[62] J. Wakeley. The excess of transitions among nucleotide substitutions: new meth­ ods of estimating transition bias underscore its significance. TREE, 11(4):158- 163, 1996.

[63] Z. Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology Evolution, 10(6):1396-1401, 1993.

[64] Z. Yang. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution, 39:306-314, 1994.

[65] Z. Yang. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Systematic Biology, 43(3):329-342, 1994.

[66] Z. Yang. Phylogenetic Analysis by Maximum Likelihood (PAML), Version 1.3. Department of Integrative Biology, University of California at Berkeley, 1997.

[67] Z. Yang, N. Goldman, and A. Friday. Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Molecular Biology Evolution, 11:316-324, 1994.

[68] Z. Yang and S. Kumar. New parsimony-based methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites and comparison with likelihood methods. Molecular Biology Evolution, 13:650-659, 1996.

[69] Z. Yang and B. Rannala. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Molecular Biology Evolution, 14(7) :717— 724, 1997.
