Using Structural Information in Modeling and Multiple Alignments for Phylogenetics

Using Structural Information in Modeling and Multiple Alignments for Phylogenetics DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Xueliang Pan, M.A.S., M.S. ***** The Ohio State University 2008 Dissertation Committee: Approved by Dennis K. Pearl, Adviser Laura Kubatko Adviser Steven N. MacEachern GRADUATE PROGRAM in BIOSTATISTICS c Copyright by Xueliang Pan 2008 ABSTRACT Phylogenetic studies are increasingly based on structural biological data and on statistical formalization. That leads to the study of improved models and of extract- ing the maximum information from sequence data. In this research, I have proposed to incorporate the structural information in two areas that relate to phylogenetic inference: one is to use a spatial dependent substitution model for likelihood calculation in phylogenetic inference; the other is to use a gap distance measure for MSA evaluation. While the first application is to using an improved substitution models in phylogenetic inference, the second one focuses on the quality of the MSA produced by different alignment procedures. The proposed spatial dependent model was based on our observation that the amino acids close to the functional core region tend to be conservative and these on the periphery are likely subject to mutation. So we proposed a substitution model with its rate for each amino acid dependent on its distance to the catalytic active center, or the functional core of the protein. The SD model has been implemented in the framework of Bayesian hierarchical model, the posterior distribution of the model parameters and the phylogenetic inference was estimated simultaneously using the MCMC Metropolis-Hastings algorithm. The SD model has been applied to 11 enzymes that are primarily to central metabolism that are found in species from all ii Kingdoms. The SD model is much better than the currently available substitution models in terms of fitness consistently for all examples. Besides the modeling, we also use the structural information of the sequences for MSA evaluation. The fixed alignments used in phylogenetic studies are derived in advance of phylogenetic analysis. There are many different ways to construct these alignments. The gap measurement proposed here is based on the assumption of structural superposition, and it not only evaluates the alignment quality of those sequences with structural information, but also those sequences without structural information. This measurement can be used to select a better MSA for our phylogenetic analysis. Furthermore, it may lead to improvement of the sequence alignment. iii To my daughter, my wife and my parents. iv ACKNOWLEDGMENTS I first thank God, who has shown His amazing grace to me throughout the years of my graduate studies at the Ohio State University. I thank my advisor, Dr. Dennis K. Pearl, for his inspiring and encouraging advice to guide me to a deeper understanding of the research, and his invaluable comments during the dissertation work. I also give special thanks to Dr. J. Dennis Pollack for his long term collaboration and for providing the amino acid datasets. I am very grateful to Dr. Laura Kubatko and Dr. Steven N. MacEachern. They have given me valuable feedback and suggestions on my research and dissertation. Thanks to my friend Hui Cao at the Department of Computer Science and En- gineering for his help in programming. I thank the OSU Mathematical Bioscience Institute for providing a super-computer account. My gratitude also goes to my friends on and off OSU campus for their friendship and support. Finally I thank my dearest wife, my best friend, Jun Liu. I would never have been able to finish this work without her help, support, and encouragement. v VITA 1993 . .B.S. Environmental Engineering 2002 . .M.S. Civil Engineering 2002 . .Master of Applied Statistics PUBLICATIONS Research Publications Liu, J., Li, J., Rosol, T. J., Pan, X., and Voorhees, J. L. (2007).Biodegradable Nanoparticles for Targeted Ultrasound Imaging of Breast Cancer Cells in Vitro, Physics in Medicine and Biology, 52 4739-4747 Liu, J., He, X., Pan X., and Roberts, C. (2007) Ultrasonic Model and System for Mea- surement of Corneal Biomechanical Properties and Validation on Phantoms, Journal of Biomechanics, Vol. 40, Issue 5, 1177-1182 Liu, J., Levine, A.L., Mattoon, J.S., Yamaguchi, M., Lee, R.J., Pan, X. , Rosol, T.J.(2006) Nanoparticles as Image Enhancing Agents for Ultrasonography, Physics in Medicine and Biology, 51, 2179-2189 Wu, Z., Pan, X., Zhong, L., Shi, Y.,and Tan, T.(1996) Experimental Study of Wet Flue Gas Desulfurization with Steel Work’s Dusts as Absorbent, Environmental En- gineering (in Chinese), Vol.14, No.6, 17-22 Wu, Z., Tan, T., andPan, X. (1995) Experiment Studies of Wet FGD with Lime in a Rotating System Tray Scrubber, ACTA Scientific Circumstantiate (in Chinese), Vol. 15, No.3, 336-43 vi FIELDS OF STUDY Major Field: Biostatistics Studies in: Phylogenetics Prof. Dennis K. Pearl vii TABLE OF CONTENTS Page ABSTRACT…………………………………………………………………………….iii DEDICATION…………………………………………………………………………..v ACKNOWLEDGMENTS……………………………………………………………....vi VITA……………………………………………………………………………………vii LIST OF FIGURES………………………………………………………………………x LIST OF TABLES……………………………………………………………………....xii CHAPTER 1………………………………………………………………………………1 INTRODUCTION ………………………………………………………………………..1 1.1 Amino Acid Sequence and Crystallographic Information.…………..……………2 1.2 Phylogeny Inference…………………………………………………..…………..4 1.3 Protein Structure and the Phylogeny Inference ..............................................12 CHAPTER 2…………………………………………………………..............................17 SPATIAL DEPENDENT SUBSTITUTION MODEL …………………………...…...17 2.1 Substitution Model…………………………………………………………..…..18 2.1.1 The Development of Substitution Models ………………………...…….18 2.1.2 Amino Acid Substitution Models………………………………………..20 2.1.3 Spatial Dependent Substitution Model ……………………………….....27 2.2 Likelihood Calculation and Posterior Tree Distribution.............. ………………34 2.3 Bayesian Inference with MCMC Method ....………………………………...….38 2.4 Implementation of the SD Substitution Model.....................................................48 2.4.1 Parameter Estimation ................................................................................48 2.4.2 MCMC Convergence Diagnosis ...............................................................51 2.4.3 Model Comparison .................................................................................58 viii CHAPTER 3 .....................................................................................................................62 APPLICATION OF THE SD MODEL……………………………………….................62 3.1 The Data…………………………………………………....................................62 3.1.1 Catalytic-active enzyme center (CAC) ...................................................64 3.1.2 Site Distance Determination ...................................................................67 3.2 Discrete SD model Application.....……………………………………………...68 3.2.1 Application to Small PGK Dataset .........................................................68 3.2.2 Application to Large Datasets .................................................................75 3.3 Continuous SD model.......………………………………………………………79 CHAPTER 4……………………………………………………………………………..87 GAP DISTANCE MEASUREMENT AND ITS APPLICATION………………..…….87 4.1 Multiple Sequence Alignment and Quality Evaluation....……………………….89 4.2 Gap Distance Measure (GD_P) and Gap Distance Score........……………….….93 4.2.1 Gap Distance Measure for Each Gap(GD_P)...............................................95 4.2.2 Gap Distance Score for the MSA (GDS) .....................................................97 4.3 Application of GD_P and GDS................................................………………….99 4.3.1 GD_P Application ......................................................................................101 4.3.2 GDS Application and Comparison with Other Evaluation Methods .........106 CHAPTER 5……………………………………………………………………………111 CONCLUSION AND FURTURE RESEARCH.........………………………………...111 5.1 Spatial Dependent Model................................................………………………111 5.2 Gap Measurement…………………………………....…………………………114 BIBLIGRAPHY.......………………………………………………………………....…116 ix LIST OF FIGURES Page Figure 1.1 A segment of an amino acid alignment of some PGK sequences …................2 Figure 1.2 Four distinct aspects of a protein's structure......................................................3 Figure 1.3 Bifurcating phylogenetic tree with five external nodes (a) Unrooted tree (b) Rooted tree………………………………………………………...........................8 Figure 2.1 The crystallographic structure of a PGK (1VPE) molecule……....................24 Figure 2.2 Correlation of averaged distance from amino acid alpha carbons (Cα) to the anchor-atoms at their catalytic active centers (C/AC) vs the positional-ConSurf conservation score (PCCS)....................................................................................26 Figure 2.3 Some examples of rate factor f(r) as a function of r. All the f(r) functions have an overall mean of 1......................................................................................29 Figure 2.4 Example of a rooted tree with five species (leaves) for deriving the likelihood ..............................................................................................................34 Figure 2.5 The local rearrangement transition strategy in Li, Pearl

Using Structural Information in Modeling and Multiple Alignments for Phylogenetics

Mutations, the Molecular Clock, and Models of Sequence Evolution

Phylogeny Codon Models • Last Lecture: Poor Man’S Way of Calculating Dn/Ds (Ka/Ks) • Tabulate Synonymous/Non-Synonymous Substitutions • Normalize by the Possibilities

The Phylogenetic Handbook: a Practical Approach to Phylogenetic Analysis and Hypothesis Testing, Philippe Lemey, Marco Salemi, and Anne-Mieke Vandamme (Eds.)

A Review of Molecular-Clock Calibrations and Substitution Rates In

Relative Model Selection Can Be Sensitive to Multiple Sequence Alignment Uncertainty

Empirical Models for Substitution in Ribosomal RNA

Local and Relaxed Clocks: the Best of Both Worlds

Bayesian Codon Substitution Modelling to Identify Sources of Pathogen Evolutionary Rate Variation

Assessing the State of Substitution Models Describing Noncoding RNA Evolution

Effects of Parameter Estimation on Maximum-Likelihood Bootstrap Analysis

Evolutionary and Functional Lessons from Human-Specific Amino-Acid

Empirical Models for Substitution in Ribosomal RNA