Proquest Dissertations
Simulation-Based Estimation of Phylogenetic Trees

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Laura A. Salter, B.A., M.S.

* * * * *

The Ohio State University
1999

Dissertation Committee:
Professor Dennis K. Pearl, Adviser
Professor L. Mark Berliner
Professor Paul Fuerst
Professor Joseph Verducci

Approved by
Adviser
Department of Statistics

UMI Number: 9931673
UMI Microform 9931673. Copyright 1999, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103

ABSTRACT

A common goal in the analysis of nucleotide sequence data is the inference of the phylogenetic history of the sequences under consideration. Many criteria for the selection of a phylogenetic representation of the data have been developed. We focus here on two criteria: the maximum likelihood criterion and the parsimony criterion. The maximum likelihood method of phylogenetic tree construction has several advantages over other criteria, including the interpretability of the underlying Markov models, consistency in the statistical sense, and the possibility of statistical testing of hypotheses using the likelihood framework. However, use of the maximum likelihood method in practice has been limited because the method is computationally intensive, especially when the number of sequences under consideration is large. We therefore propose a stochastic search algorithm for estimation of the maximum likelihood tree. The method significantly reduces the computation time involved in constructing the maximum likelihood tree, and in many cases returns an estimate of the phylogeny that has a higher likelihood than those returned by the methods currently in use. We give some convergence results for the algorithm and apply it to several theoretical and real data sets. The algorithm can also be extended to allow for simultaneous estimation of the tree and the model parameters. Examples of the application of the extended algorithm are also given.

Parsimony is currently one of the most widely used phylogenetic tree construction methods.
However, current implementations of the parsimony method can be shown to give locally optimal estimates of the phylogeny when a large number of sequences are considered. We have developed a simulated annealing algorithm for estimation of phylogenetic trees under the parsimony criterion, in the hope that such an algorithm would be less prone to entrapment in local minima. Though our algorithm does show reasonable ability to locate the most parsimonious tree, it does so at the expense of computing time. This result is in agreement with previous literature concerning the use of simulated annealing in estimating phylogenetic trees under the parsimony criterion. We provide convergence results for our algorithm and apply the method to several examples.

This is dedicated to my parents and my sister

ACKNOWLEDGMENTS

I would like to thank my family, my friends, and especially Justin, for the constant encouragement and support they have given me.

I am grateful to the members of my committee for their insight and suggestions concerning this research. I would especially like to thank Dr. Berliner for his assistance with the results in Chapter 4.

Finally, I am extremely grateful to Dr. Pearl for his continued advice, encouragement, dedication, and understanding throughout my thesis work.

VITA

April 11, 1972 .... Born - New Orleans, Louisiana, USA
1994 .............. B.A. Biology, Mathematics
1996 .............. M.S. Statistics
1997-present ...... Graduate Research Associate, The Ohio State University

FIELDS OF STUDY

Major Field: Biostatistics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction and Literature Review
   1.1 Phylogenetic Trees and Reconstruction Methods
   1.2 Simulated Annealing and Stochastic Probing
2. A Stochastic Search Strategy for Estimation of the Maximum Likelihood Tree
   2.1 Calculation of the Likelihood
   2.2 A Stochastic Search Algorithm
       2.2.1 The Generation Scheme
       2.2.2 The Cooling Schedule
       2.2.3 The Stopping Rule
   2.3 Simultaneous Estimation of the Tree and the Substitution Model Parameters
       2.3.1 Estimation of the Nucleotide Frequency Parameters
       2.3.2 Estimation of Other Substitution Model Parameters
   2.4 Computer Implementation
3. A Simulated Annealing Algorithm for Estimation of Phylogenetic Trees Under the Parsimony Criteria
   3.1 The Parsimony Criteria
   3.2 Estimating the Most Parsimonious Tree(s) Using Simulated Annealing
       3.2.1 A New Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s)
       3.2.2 Computer Implementation
4. Properties of the Algorithms
   4.1 Convergence Results for the Stochastic Search Algorithm
   4.2 Convergence Results for the Simulated Annealing Algorithm for Estimation of the Most Parsimonious Tree(s)
5. Applications
   5.1 Estimation of the ML Tree for Fixed Parameter Values
       5.1.1 Theoretical Data
       5.1.2 Mitochondrial DNA Sequences
       5.1.3 Group A Papillomavirus Sequences
       5.1.4 Analysis of the env Region for 30 HIV Sequences
   5.2 Simultaneous Estimation of the Tree and Substitution Model Parameters
   5.3 Estimation of the Most Parsimonious Tree(s)
6. Conclusion and