Detecting Multiple Protein Folding Trajectories and Structural Alignment
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of the Ohio State University

By Hong Sun, PhD
Graduate Program in Computer Science and Engineering
The Ohio State University
2011

Dissertation Committee:
Hakan Ferhatosmanoglu and Yusu Wang, Advisor
Srinivasan Parthasarathy
Chenglong Li

© Copyright by Hong Sun 2011

ABSTRACT

Many types of molecular biological data, such as sequence data, protein data, and simulation data, have grown rapidly with the advent of powerful computation and high-throughput data collection techniques. Analyzing these data to gain insight into the data themselves and into the scientific phenomena they model is becoming increasingly challenging. One common and effective approach to analyzing such massive data is to compare and align multiple objects in order to identify motifs and to discover similar or divergent sub-domains.

In this thesis, we focus on developing frameworks for comparing and aligning multiple geometric shape data. In particular, the research covers two main subjects:

(1) Analysis of protein folding trajectories by aligning multiple folding trajectories modeled as multiple high dimensional curves. We develop a novel method, called the Enhanced Partial Order (EPO) algorithm, that helps to mine folding convergence rules dynamically by exploring vital sub-structures and tracking their folding orders. The EPO algorithm is very effective at identifying structural similarities even when the degree of similarity is low. Hence it can potentially discover critical folding events that conventional curve alignment algorithms cannot yet discover.

(2) A multiple protein structure alignment framework. The framework, called Spatial Motifs based Protein Multiple Structural Alignment (Smolign), is a complete package that includes both alignment and superimposition tools.
We first introduce a contact-window based motif library of three-dimensional molecular structures. The retrieved motifs are potentially conserved within specific spatial folds and need not be sequentially related. Structurally similar seeds are then selected from this library and extended with a complex heuristic algorithm. Next, we develop an optimal global alignment and superimposition algorithm based on the seeds selected in the previous step. Because the states of protein folding trajectories and protein structures are similar in their contact map representations, and because the above techniques were applied successfully to protein trajectory analysis, we further extend EPO to the domain of protein structure alignment. Building on a slightly modified EPO, Smolign can detect multiple correspondences simultaneously, capture alignments globally, collect subset alignments, and support flexible alignments. Our method yields better alignment results than other popular MSTA methods on several protein structure datasets that span various structural folds and represent different protein similarity levels. Of particular interest is that Smolign can discover similarities among protein structures even under very low similarity conditions.

Our approach achieves high efficiency with reasonably high accuracy and will benefit the study of high-throughput protein structure-function evolutionary relationships. A web-based alignment tool, as well as a set of downloadable executables and detailed alignment results for the datasets used in this thesis, are available at http://bio.cse.ohio-state.edu/Smolign and http://sacan.biomed.drexel.edu/Smolign

This work is dedicated to my family:
Qun Zhao
Anton Sun

VITA

1970: Born in Beijing, China
1992: Bachelor of Science in Electrical Engineering, University of Science and Technology; Beijing, China
1997: Master of Science in Industrial and Systems Engineering, The Ohio State University; Columbus, OH
1999-2008: System Engineer, The Office of Treasurer, The Ohio State University
2008-2009: Sr. Developer, Nationwide Insurance
2009-Present: Research Scientist, SRA, Inc. / NIEHS

PUBLICATIONS

Smolign: A Spatial Motifs Based Protein Multiple Structural Alignment Method. Hong Sun, Ahmet Sacan, Yusu Wang, and H. Ferhatosmanoglu. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2011.

An Enhanced Partial Order Curve Comparison Algorithm and Its Application to Analyzing Protein Folding Trajectories. Hong Sun, Hakan Ferhatosmanoglu, Motonori Ota, and Yusu Wang. BMC Bioinformatics 2008, 9:344.

An Enhanced Partial Order Multiple Curve Comparison Algorithm for Analyzing High Dimensional Trajectories. Hong Sun, Yusu Wang, and H. Ferhatosmanoglu. Computer Society Bioinformatics Conference (CSB 2007), UCSD, CA.

A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries. Hong Sun, O. Ozturk, and H. Ferhatosmanoglu. IEEE Computer Society Bioinformatics Conference (CSB '03), Stanford, CA, August 2003, pp. 553-558.

FIELDS OF STUDY

Major Field: Computer Science and Engineering
Specialization: Software Systems

TABLE OF CONTENTS

Abstract
Dedication
Vita
List of Figures

CHAPTER

1 INTRODUCTION
  1.1 Motivation
  1.2 Overview of Our Research
    1.2.1 Objective
    1.2.2 Contribution
  1.3 Outline
2 PROTEIN BACKGROUND PRELIMINARIES AND BASIC TOOLS
  2.1 Principle of Protein Structure
    2.1.1 Overview
    2.1.2 Protein Structure Hierarchy
      2.1.2.1 Primary Structure
      2.1.2.2 Secondary Structure
      2.1.2.3 Tertiary Structure
      2.1.2.4 Quaternary Structure
  2.2 Protein Folding
    2.2.1 Introduction
    2.2.2 Protein Folding Data Modeling
  2.3 Dynamic Programming
  2.4 Partial Order Graph and Tool
3 PROTEIN STRUCTURAL COMPARISON
  3.1 Overview
  3.2 Protein Structure Data Modeling
    3.2.1 Geometric Vector Representation
    3.2.2 Bio-property Vector Representation
    3.2.3 Distance Matrix and Its Variants Representation
  3.3 Structural Alignment Methods
    3.3.1 Progressive Alignment
    3.3.2 Simultaneous Alignment
  3.4 Measurement of Structural Alignment Quality
4 EPO: ENHANCED PARTIAL ORDER CURVE COMPARISON
  4.1 Introduction
    4.1.1 Overview
    4.1.2 Challenges and Goals
  4.2 Methods
    4.2.1 Input Data Modeling
    4.2.2 Notations and Algorithm Overview
    4.2.3 Initial POG Construction
      4.2.3.1 A Clustering Preprocessing Stage
      4.2.3.2 Scoring Function
    4.2.4 Merging Stage
  4.3 EPO Implementation on Protein Folding Data
    4.3.1 Background of Dataset
    4.3.2 Experimental Setting
    4.3.3 Investigation on Entire Protein Structure
    4.3.4 Investigation on Substructures
      4.3.4.1 Alpha-helix Substructure
      4.3.4.2 Ring-substructure
    4.3.5 Timing of EPO
5 SMOLIGN: A SPATIAL MOTIFS BASED PROTEIN MULTIPLE STRUCTURAL ALIGNMENT METHOD
  5.1 Introduction
    5.1.1 Overview
    5.1.2 Challenges and Goals
  5.2 Methods
    5.2.1 Algorithm Overview
    5.2.2 Construction of the SML
    5.2.3 Obtaining Seed Alignments
      5.2.3.1 Selection of Seed Motifs Set
      5.2.3.2 Seeds Pruning by Biological Constraints
      5.2.3.3 Alignment of Candidate Seeds
    5.2.4 Extending the Seed Alignments
    5.2.5 Global Alignment by EPO
    5.2.6 Flexible Alignments
  5.3 Experimental Evaluation of Smolign
    5.3.1 Sample Alignments
    5.3.2 Flexible Alignments
    5.3.3 Homstrad Benchmark
    5.3.4 Additional Datasets
    5.3.5 Effects of a Few Key Techniques
      5.3.5.1 Seeds Selection
      5.3.5.2 Bio-constraints Affection
      5.3.5.3 Extended Seeds
      5.3.5.4 EPO Iterations
    5.3.6 Summary
6 DISCUSSION AND FUTURE RESEARCH
  6.1 EPO Algorithm
  6.2 Smolign Framework
  6.3 Future Work
  6.4 Summary
Bibliography

LIST OF FIGURES

FIGURE

1.1 Yearly Growth of Protein Structures
2.1 Amino acids map to tri-nucleotide sequences, also called codons.
2.2 Formation of a peptide bond through condensation of two amino acids.
2.3 Protein Hierarchy (source: [31])
2.4 Two basic secondary structure types
2.5 A tertiary structure sample including 5 alpha helices and 6 beta sheets.
2.6 An example of protein folding (source: [33])
2.7 Basic dynamic programming on genome sequence alignment.
2.8 Comparison of alignment representations between the traditional method and POG.
2.9 Comparison of basic dynamic programming alignment and POA.
3.1 Contact map of 1trmA (PDB code).
3.2 The construction of a base footprint.
3.3 Demo of Base Bucket
4.1 EPO flow chart
4.2 A POG demo.
4.3 Comparison of a linear graph and a POG.
4.4 EPO method overview
4.5 An example for scoring function.
4.6 Empty and solid points are aligned to the nodes oa and ob, respectively, while points in the dotted region should be grouped together.
4.7 NMR structure of trp-cage protein 1l2y.
4.8 Distribution of aligned nodes.
4.9 Visualizing of vital events.
5.1 Smolign flow chart.