Investigating Determinants of Phylogenetic Accuracy
IMPACT OF MOLECULAR EVOLUTIONARY FOOTPRINTS ON PHYLOGENETIC ACCURACY – A SIMULATION STUDY

Dissertation
Submitted to The College of Arts and Sciences of the UNIVERSITY OF DAYTON
In Partial Fulfillment of the Requirements for The Degree Doctor of Philosophy in Biology
by Bhakti Dwivedi
UNIVERSITY OF DAYTON
August, 2009

APPROVED BY:

_________________________
Gadagkar, R. Sudhindra Ph.D.
Major Advisor

_________________________
Robinson, Jayne Ph.D.
Committee Member
Chair, Department of Biology

_________________________
Nielsen, R. Mark Ph.D.
Committee Member

_________________________
Rowe, J. John Ph.D.
Committee Member

_________________________
Goldman, Dan Ph.D.
Committee Member

ABSTRACT

IMPACT OF MOLECULAR EVOLUTIONARY FOOTPRINTS ON PHYLOGENETIC ACCURACY – A SIMULATION STUDY

Dwivedi, Bhakti
University of Dayton

Advisor: Dr. Sudhindra R. Gadagkar

An accurately inferred phylogeny is important to the study of molecular evolution. Factors impacting the accuracy of a phylogenetic tree can be traced to several consecutive steps leading to the inference of the phylogeny. In this simulation-based study, our focus is on the impact of certain evolutionary features of the nucleotide sequences themselves in the alignment, rather than on any source of error during the process of sequence alignment or due to the choice of the method of phylogenetic inference. Nucleotide sequences can be characterized by summary statistics such as sequence length and base composition. When two or more such sequences need to be compared to each other (as in an alignment prior to phylogenetic analysis), additional evolutionary features come into play, such as the overall rate of nucleotide substitution, the ratio of two specific instantaneous rates of substitution (the rates at which transitions and transversions occur), and the shape parameter of the gamma distribution (which quantifies the extent of heterogeneity in substitution rate among sites in an alignment).
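As an illustration of the distinction between transitions and transversions mentioned above, the following minimal sketch counts both kinds of difference between two aligned sequences. It is not the simulation code used in this study. The function name count_ts_tv and the example sequences are hypothetical. The observed counts are only a rough descriptive proxy, since the transition-transversion rate ratio is a parameter of the substitution model rather than a simple count ratio.

```python
# Hypothetical illustration: classify observed differences between two
# aligned nucleotide sequences as transitions or transversions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical sites, gaps, and ambiguity codes.
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue
        # A transition stays within a chemical class (A<->G or C<->T);
        # a transversion crosses between purines and pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# Made-up example sequences (assumption, not data from the study).
ts, tv = count_ts_tv("ACGTACGT", "GCGTTCAT")
print(ts, tv)
```

Sites with gaps or ambiguity codes are skipped here for simplicity; a real analysis would normally estimate the rate ratio under an explicit substitution model.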
We studied the implications of the following five sequence parameters, individually and in combination: sequence length, substitution rate, nucleotide base composition, the transition-transversion rate ratio, and the rate heterogeneity among sites. We found that the transition-transversion rate ratio (kappa) has a significant impact on phylogenetic accuracy, with a strong positive interaction with accuracy at high substitution rates, contrary to general belief. This work, based on a known expected tree, has implications for researchers in the field and would enable them to choose from among the multiple genes typically available today for accurate phylogenetic inference.

DNA sequences also diverge from their ancestral sequences by means of evolutionary events other than those mentioned above, such as deletions (the removal of one or more nucleotides from a sequence) or insertions (the addition of one or more nucleotides to a sequence), which commonly appear as gaps in a sequence alignment. We have also investigated the relationship between the number of gaps and phylogenetic accuracy when the gaps are introduced in an alignment to reflect indel (insertion/deletion) events during the evolution of DNA sequences. DNA sequence alignments were generated using computer simulation, varying several sequence parameters and introducing both substitution and insertion/deletion events along a 16-taxon model tree, while systematically varying the expected proportion of gapped sites. The resulting alignments were subjected to commonly used gap treatment methods and methods of phylogenetic inference. The results showed that, in general, there is a strong, almost deterministic relationship between the amount of gaps in the data and the level of phylogenetic accuracy when the amount of gaps is high.
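To make the notion of a proportion of gapped sites concrete, the sketch below computes it for a small alignment. It is an illustration only, not the procedure used in this study. It assumes that a gapped site is an alignment column in which at least one sequence has a gap character. The function name gapped_site_proportion and the toy alignment are hypothetical.

```python
# Hypothetical illustration: fraction of alignment columns with a gap.
def gapped_site_proportion(alignment, gap_char="-"):
    """Return the fraction of columns containing at least one gap.

    `alignment` is a list of equal-length strings, one per taxon.
    """
    if not alignment or len(alignment[0]) == 0:
        raise ValueError("alignment must contain at least one site")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    # zip(*alignment) iterates over the alignment column by column.
    gapped = sum(1 for column in zip(*alignment) if gap_char in column)
    return gapped / length


# Made-up four-taxon alignment (assumption, not data from the study).
toy_alignment = [
    "ACG-TACGTA",
    "ACGTTAC-TA",
    "AC--TACGTA",
    "ACGTTACGTA",
]
print(gapped_site_proportion(toy_alignment))
```

Other definitions are possible, for example the fraction of all alignment cells that are gaps; the column-based definition is used here only because it matches the phrase "gapped sites" most directly.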
Our results also suggest that, as long as the gaps in the alignment are a consequence of indel events in the evolutionary history of the sequences, the accuracy of phylogenetic analysis is likely to improve if alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, and if the phylogenetic signal provided by indels is harnessed, for example, by treating the gaps as binary characters in Bayesian or Maximum Parsimony analyses, or in an integrated manner along with substitution events.

ACKNOWLEDGEMENTS

I earnestly wish to express my reverential thanks to Dr. Sudhindra R. Gadagkar for providing me with this opportunity and for his subsequent criticism, guidance, and skillful supervision. I really appreciate his patience in proofreading my abstracts, manuscripts, and thesis.

I would like to thank Dr. Panagiotis Tsonis for giving me the opportunity to work with him. His determination to publish the project report from my course work was critical in starting my career in the research world.

I am highly indebted to my committee members, Dr. Jayne Robinson, Dr. Mark Nielsen, Dr. John Rowe, and Dr. Dan Goldman, for their benevolent supervision and valuable suggestions over all these years.

I am thankful to Karen Bahr and Lynda Routley for their unlimited help with all the paperwork.

I would like to thank my father, Dr. Indresh H. Dwivedi, and my mother, Abha Dwivedi, for their unconditional and endless love and support in every endeavor of my life. I would also like to thank my brother, Gyanesh H. Dwivedi, without whose encouragement I do not think I would be here, and my sister-in-law, Swarnima Dwivedi, for always being there for me.

I will remember the time that I spent at the University of Dayton; it will always be close to my heart. I wish the best for everyone. Thank you.

TABLE OF CONTENTS

INTRODUCTION

CHAPTERS

1. REVIEW OF LITERATURE
  1.1 Phylogenetic inference
    1.1.1 Phylogenetic inference
    1.1.2 Data used for phylogenetic inference
    1.1.3 Types of phylogenetic trees
    1.1.4 Phylogenetic tree representation styles
    1.1.5 Application of phylogenetic inference
  1.2 The process of phylogenetic inference
    1.2.1 Data collection
    1.2.2 Sequence alignment
      1.2.2.1 Pairwise sequence alignment
      1.2.2.2 Multiple sequence alignment
      1.2.2.3 Choosing an alignment method
      1.2.2.4 Alignment gap treatments
    1.2.3 Model selection in phylogenetics
      1.2.3.1 Nucleotide substitutions
      1.2.3.2 Model of nucleotide substitutions
      1.2.3.3 Choosing a model
    1.2.4 Phylogenetic tree reconstruction methods
      1.2.4.1 Distance-based methods
      1.2.4.2 Maximum parsimony
      1.2.4.3 Maximum likelihood
      1.2.4.4 Bayesian analysis
    1.2.5 Assessing phylogenetic accuracy
      1.2.5.1 Topological distances
      1.2.5.2 Assessing tree quality

2. RESEARCH ARTICLES
  2.1 Phylogenetic inference under varying proportions of indel-induced alignment gaps
  2.2 The impact of sequence parameter values on phylogenetic accuracy

3. OTHER
  3.1 Whole genome comparison of H1N1 and H3N2 influenza A virus
  3.2 Molecular Mimicry: structural camouflage of proteins and nucleic acids

SUMMARY

LITERATURE CITED

LIST OF FIGURES

CHAPTER 1.1
1. A completely resolved bifurcating phylogenetic tree
2. Example of Polytomy
3. Orthologous and Paralogous genes
4. Comparison between Monophyletic, Paraphyletic, and Polyphyletic groups
5. Unrooted and Rooted tree
6. Gene tree and Species tree
7. Phylogenetic tree representation styles