DETECTING MULTIPLE FOLDING TRAJECTORIES AND STRUCTURAL ALIGNMENT

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Hong Sun, PhD

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Dissertation Committee:

Hakan Ferhatosmanoglu and Yusu Wang, Advisor

Srinivasan Parthasarathy

Chenglong Li

© Copyright by

Hong Sun

2011

ABSTRACT

To date, the volume of many types of molecular biological data, such as sequence data, protein data, and simulation data, has grown rapidly with the advent of powerful computing and high-throughput data collection techniques. Analyzing these data to gain insight into the data and into the scientific phenomena they model is increasingly a challenge. One common and effective approach to analyzing such massive data is to compare and align multiple objects in order to identify motifs and discover similar and/or divergent sub-domains.

In this thesis, we focus on developing frameworks for comparing and aligning multiple geometric shape data. In particular, the research covers two main subjects:

(1) Analysis of trajectories by aligning multiple folding trajectories modeled as multiple high-dimensional curves. We develop a novel method, called the EPO algorithm, that helps to mine convergent folding rules dynamically by exploring vital sub-structures and tracking their folding order. Our EPO algorithm is very effective at identifying structural similarities even when the degree of similarity is low. Hence it can potentially discover critical folding events that conventional curve alignment algorithms cannot yet discover.

(2) A multiple protein structure alignment framework. The framework, called Spatial Motifs based Protein Multiple Structural Alignment (Smolign), is a complete package that includes both alignment and superimposition tools. We first introduce a contact-window based motif library of three-dimensional molecular structures. The retrieved motifs are potentially conserved to specific spatial folds and are non-sequentially related. Structurally similar seeds are then selected from this library and extended with a complex heuristic algorithm. Next, we develop an optimal global alignment and superimposition algorithm based on the seeds selected in the first step. Because the states of protein folding trajectories and protein structures are similar in their contact map representations, and because the above techniques were applied successfully to protein trajectory analysis, we further extend EPO to the domain of protein structure alignment. Slightly modified from EPO, Smolign can detect multiple correspondences simultaneously, capture alignments globally, collect subset alignments, and support flexible alignments. Our method yields better alignment results than other popular MSTA methods on several protein structure datasets that span various structural folds and represent different levels of protein similarity. Of particular interest is that Smolign can discover similarities among protein structures even under very low similarity conditions.

Our research exhibits significantly high efficiency with reasonably high accuracy and will benefit high-throughput studies of evolutionary relationships between protein structure and function. A web-based alignment tool, as well as a set of downloadable executables and detailed alignment results for the datasets used in this thesis, are available at http://bio.cse.ohio-state.edu/Smolign and http://sacan.biomed.drexel.edu/Smolign.

This work is dedicated to my family:

Qun Zhao

Anton Sun

VITA

1970 ...... Born in Beijing, China

1992 ...... Bachelor of Science in Electrical Engineering, University of Science and Technology; Beijing, China

1997 ...... Master of Science in Industrial and Systems Engineering, The Ohio State University; Columbus, OH

1999-2008 ...... System Engineer, The Office of Treasurer, The Ohio State University

2008-2009 ...... Sr. Developer, Nationwide Insurance

2009-Present ...... Research Scientist, SRA, Inc. / NIEHS

PUBLICATIONS

Smolign: A Spatial Motifs Based Protein Multiple Structural Alignment Method. Hong Sun, Ahmet Sacan, Yusu Wang and H. Ferhatosmanoglu. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011.

An enhanced partial order curve comparison algorithm and its application to analyzing protein folding trajectories. Hong Sun, Hakan Ferhatosmanoglu, Motonori Ota and Yusu Wang. BMC Bioinformatics 2008, 9:344.

An Enhanced Partial Order Multiple Curve Comparison Algorithm for Analyzing High Dimensional Trajectories. Hong Sun, Yusu Wang and H. Ferhatosmanoglu. Computer Society Bioinformatics Conference CSB 2007. UCSD, CA.

A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries. Hong Sun, O. Ozturk and H. Ferhatosmanoglu. IEEE Computer Society Bioinformatics Conference (CSB '03). Stanford, CA. August 2003, pp. 553-558.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

Specialization: Software Systems

TABLE OF CONTENTS

Abstract
Dedication
Vita
List of Figures

CHAPTER

1 INTRODUCTION
    1.1 Motivation
    1.2 Overview of Our Research
        1.2.1 Objective
        1.2.2 Contribution
    1.3 Outline

2 PROTEIN BACKGROUND PRELIMINARIES AND BASIC TOOLS
    2.1 Principle of Protein Structure
        2.1.1 Overview
        2.1.2 Protein Structure Hierarchy
            2.1.2.1 Primary Structure
            2.1.2.2 Secondary Structure
            2.1.2.3 Tertiary Structure
            2.1.2.4 Quaternary Structure
    2.2 Protein Folding
        2.2.1 Introduction
        2.2.2 Protein Folding Data Modeling
    2.3 Dynamic Programming
    2.4 Partial Order Graph and Tool

3 PROTEIN STRUCTURAL COMPARISON
    3.1 Overview
    3.2 Protein Structure Data Modeling
        3.2.1 Geometric Vector Representation
        3.2.2 Bio-property Vector Representation
        3.2.3 Distance Matrix and Its Variants Representation
    3.3 Structural alignment methods
        3.3.1 Progressive alignment
        3.3.2 Simultaneous alignment
    3.4 Measurement of Structural Alignment Quality

4 EPO: ENHANCED PARTIAL ORDER CURVE COMPARISON
    4.1 Introduction
        4.1.1 Overview
        4.1.2 Challenges and goals
    4.2 Methods
        4.2.1 Input data modeling
        4.2.2 Notations and Algorithm Overview
        4.2.3 Initial POG Construction
            4.2.3.1 A Clustering Preprocessing Stage
            4.2.3.2 Scoring Function
        4.2.4 Merging Stage
    4.3 EPO implementation on protein folding data
        4.3.1 Background of Dataset
        4.3.2 Experimental Setting
        4.3.3 Investigation on Entire Protein Structure
        4.3.4 Investigation on Substructures
            4.3.4.1 Alpha-helix substructure
            4.3.4.2 Ring-substructure
        4.3.5 Timing of EPO

5 SMOLIGN: A SPATIAL MOTIFS BASED PROTEIN MULTIPLE STRUCTURAL ALIGNMENT METHOD
    5.1 Introduction
        5.1.1 Overview
        5.1.2 Challenges and goals
    5.2 Methods
        5.2.1 Algorithm Overview
        5.2.2 Construction of the SML
        5.2.3 Obtaining seed alignments
            5.2.3.1 Selection of seed motifs set
            5.2.3.2 Seeds pruning by biological constraints
            5.2.3.3 Alignment of candidate seeds
        5.2.4 Extending the seed alignments
        5.2.5 Global alignment by EPO
        5.2.6 Flexible alignments
    5.3 Experimental Evaluation Of Smolign
        5.3.1 Sample Alignments
        5.3.2 Flexible Alignments
        5.3.3 Homstrad Benchmark
        5.3.4 Additional Datasets
        5.3.5 Effects of a few Key Techniques
            5.3.5.1 Seeds selection
            5.3.5.2 Bio-constraints affection
            5.3.5.3 Extended seeds
            5.3.5.4 EPO iterations
        5.3.6 Summary

6 DISCUSSION AND FUTURE RESEARCH
    6.1 EPO algorithm
    6.2 Smolign Framework
    6.3 Future Work
    6.4 Summary

Bibliography

LIST OF FIGURES

FIGURE

1.1 Yearly Growth of Protein Structures
2.1 Amino acids mapped to tri-nucleotide sequences, also called codons.
2.2 Formation of a peptide bond through condensation of two amino acids.
2.3 Protein Hierarchy (source: [31])
2.4 Two basic secondary structure types
2.5 A tertiary structure sample including 5 alpha helices and 6 beta sheets.
2.6 An example of protein folding (source: [33])
2.7 Basic dynamic programming on genome sequence alignment.
2.8 Compare alignment representations between traditional method and POG.
2.9 Compare basic dynamic programming alignment and POA.
3.1 Contact map of 1trmA (PDB code).
3.2 The construction of a base footprint.
3.3 Demo of Base Bucket.
4.1 EPO flow chart
4.2 A POG demo.
4.3 Compare linear graph and POG.
4.4 EPO method overview
4.5 An example for scoring function.
4.6 Empty and solid points are aligned to the nodes $o_a$ and $o_b$, respectively, while points in the dotted region should be grouped together.
4.7 NMR structure of trp-cage protein 1l2y.
4.8 Distribution of aligned nodes.
4.9 Visualizing of vital events.
5.1 Smolign flow chart.
5.2 AFP Alignment examples.
5.3 Overview of the algorithm.
5.4 Selection of seed motifs set.
5.5 property table
5.6 Set 2 alignments by different methods.
5.7 Set 3 alignments by different methods.
5.8 Rigid and flexible alignments of set 2.
5.9 Running time distribution on Homstrad families.
5.10 Comparison between MISTRAL and Smolign.
5.11 An extended seed in set1
5.12 EPO iterations on 5 demo sets

CHAPTER 1

INTRODUCTION

1.1 Motivation

With the emergence of high-throughput computing, huge amounts of data have been produced for applications such as medical diagnosis, object registration and alignment, and the analysis of pedestrian trajectories extracted from surveillance videos [1]. In computational biology in particular, massive amounts of DNA sequence data, protein structural data, and molecular simulation data are being produced. One fundamental way to analyze such large amounts of data is to align multiple instances of them in order to, for example, identify motifs (similar and/or divergent sub-structures). Given the critical role of multiple object alignment methods in analyzing data from a variety of domains, there is a pressing need for a sensitive and robust automatic framework that can detect similarities and dissimilarities among multiple objects.

In this thesis, we consider the scenario where the input data to be analyzed are geometric objects. Specifically, we focus on two types of data: multiple protein structures and multiple molecular simulation trajectories. Despite many advances, the multiple structure alignment problem remains very challenging. In particular, its two main components, the alignment problem and the superposition problem, are NP-hard in many settings and thus computationally expensive to solve. Therefore, task-specific and domain-specific heuristic solutions are often required to bring down the high computational complexity. In this thesis, we present two efficient and effective task-specific frameworks, one for aligning multiple molecular simulation data and one for aligning multiple protein structures.

Motivation for studying molecular simulation and protein structure data.

Proteins carry out their specific biological roles through interactions with other proteins or other macromolecules. These interactions are determined largely by the three-dimensional structures of the molecules. Therefore, two important directions toward understanding how proteins function are to study their folding trajectories and to analyze their structures. First, aligning multiple protein simulation trajectories can help to extract sequences of critical events that lead to successful folding. Second, obtaining both similar (convergent) and dissimilar (divergent) sub-structures from multiple protein structure alignments can identify biologically significant structural motifs and reveal distant evolutionary relationships that may not be detectable from sequence information alone.

To date, more than 36,000 protein structures have been identified, and large volumes of folding simulation data have emerged in the past decades. As the ability to discover and generate massive protein structure data increases with advancing technology (see Figure 1.1), analyzing large data sets to determine the function of identified proteins with traditional genetics research methods is becoming more difficult, costly, and time-consuming, because traditional biological science focuses only on developing experimental methods and enabling tools that facilitate the simulation of chemical processes and phenomena. As an indicator, the Brookhaven Protein Data Bank [2], the publicly available protein structure database, has been growing rapidly and currently holds more than 66,700 molecular structures (as of March 2011). On the other hand, only a small fraction of these structures have been characterized from physical and functional standpoints.

Another massive data source is protein folding simulations, which are time-consuming and data-intensive. Research groups such as folding@home [3, 4] and P-found [5] usually have to create distributed public repositories for storing results, since one simulation can easily comprise 1 GB to 10 GB of data. For instance, capturing just 8 ns of time frames (about 8000 conformations) in a single simulation run for a small protein of only about 100 residues generates over 300 MB of data [6]. This massive data volume prevents currently used experimental techniques, such as X-ray crystallography or nuclear magnetic resonance (NMR), from keeping pace in analyzing the kinetic aspects of the protein folding process efficiently.

Consequently, an effective tool for solving the MCC problem in the domain of structural biology is essential for discovering and analyzing significant structural motifs that can help predict protein folding trajectories and support the functional annotation of proteins that share low sequence similarity.

To meet the need for fast and precise analysis of protein data, this thesis presents an algorithmic approach for computing reliable alignments of multiple structural data sets of potentially long and diverse molecules with reasonable consumption of computational resources.

1.2 Overview of Our Research

1.2.1 Objective

In general, the MCC problem [7, 8, 9] (i.e., the simultaneous analysis of multiple structures) arises frequently in diverse areas such as the study of network topology and image pattern analysis. In the common setting, a spatial structure is a point cloud in Euclidean space. Given a collection of m multi-dimensional point clouds, the task of the MCC problem is to detect the largest point set of which a congruent copy appears in each of the input sets. Since prior work has shown the MCC problem to be NP-hard [10], heuristic procedures tailored to each domain must be applied. In this thesis, we tackle the MCC problem in the domain of molecular biology by presenting a framework for multiple protein structural comparison and its applications. Specifically, we focus on two kinds of molecular biology data: protein folding trajectory data and molecular structure data.

Figure 1.1: Yearly Growth of Protein Structures

A folding trajectory is a sequence of conformations (structures) of a protein chain, representing the states of the protein at different time points during the simulation of its folding process. If each conformation is represented by the distance map between its alpha-carbon atoms, so that it is invariant under rigid transformations, then each trajectory of m conformations becomes a curve in $n^2$-dimensional space (a sequence of m points), where n is the number of amino acids in the protein.
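The mapping from a trajectory to a high-dimensional curve can be sketched in a few lines of Python. This is only an illustration of the representation described above, not the thesis implementation: the function names, the toy coordinates, and the row-by-row flattening order are assumptions made for the example.

```python
import math

def distance_map(ca_coords):
    """Return the n x n matrix of pairwise distances between C-alpha atoms."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]

def conformation_to_point(ca_coords):
    """Flatten the distance map row by row into one point with n*n coordinates."""
    return [d for row in distance_map(ca_coords) for d in row]

def trajectory_to_curve(trajectory):
    """Map m conformations to a curve: a list of m points in n^2 dimensions."""
    return [conformation_to_point(conf) for conf in trajectory]

def rotate_z_and_shift(coords, angle, shift):
    """Apply a rigid motion: rotation about the z axis, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

# Toy trajectory (hypothetical coordinates): m = 2 conformations, n = 3 residues.
conf_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
curve = trajectory_to_curve([conf_a, conf_b])

# A rigidly moved copy of conf_a maps to the same point on the curve.
moved = rotate_z_and_shift(conf_a, 0.7, (1.0, -2.0, 5.0))
```

Because only inter-atomic distances enter the representation, `conformation_to_point(moved)` agrees with `conformation_to_point(conf_a)` up to floating-point error, which is the invariance under rigid transformations that the text relies on.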

On the other hand, a protein structure p is represented by the 3D coordinates of its building blocks, the amino acids. In general, only the 3D position of the alpha carbon (Cα) of each amino acid residue is used.

Given these characteristics of the data, the framework addresses two correlated components, namely alignment and superposition.

The first component, alignment, focuses on finding the corresponding residues among multiple structures. The essential problem of multiple protein structural comparison is to detect correspondences of homologous residues across the structures such that every corresponding set fulfils an equivalent structural role. Unfortunately, general shape comparison frameworks do not provide an acceptable method for creating such correspondences optimally. Instead, domain-dependent heuristic approaches have to be considered in most situations. A good detection method not only captures sequential and local correspondences, but also returns correspondences that are potentially conserved to specific spatial folds and non-sequentially related, which allows further optimization later. A good detection method is also capable of capturing small and unstructured protein spatial segments, and developing such a method is one of the main goals of this thesis.

The second component, superposition, is a general tool for comparing all kinds of spatial curves. Optimal superposition of multiple curves/structures is a very hard problem. Even the pairwise problem of aligning two structures A and B is believed to be NP-hard, since one has to optimize the correspondence between A and B and the relative transformation of one structure with respect to the other simultaneously. Numerous heuristic algorithms have been developed in practice for this fundamental problem [11, 12, 13, 14, 15, 16, 17]. If we have a set of k > 2 structures, then even the problem of aligning them optimally without considering transformations becomes intractable: it takes $\Omega(n^k)$ time using the standard dynamic programming algorithm, where n is the size of each protein involved. This thesis presents a promising strategy that performs multiple structure superposition with reasonable consumption of computational resources. As we shall see, the results support this approach.
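To make the growth of the dynamic programming table concrete, the following minimal sketch shows the textbook pairwise case (k = 2) on plain label sequences, together with a helper that counts the cells a k-dimensional table would need. It is not the algorithm developed in this thesis; the scoring values (match, mismatch, gap) and the function names are assumptions chosen for illustration.

```python
def align_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Global alignment score of two sequences with a full (n+1) x (m+1) table."""
    n, m = len(a), len(b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: a prefix aligned to gaps only
        table[i][0] = i * gap
    for j in range(1, m + 1):          # first row: b prefix aligned to gaps only
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              table[i - 1][j] + gap,     # gap in b
                              table[i][j - 1] + gap)     # gap in a
    return table[n][m]

def table_cells(n, k):
    """Number of cells in the DP table for k sequences of length n."""
    return (n + 1) ** k

# Hypothetical secondary-structure label strings (H = helix, E = strand, C = coil).
score_same = align_score("HEC", "HEC")
score_gap = align_score("HEC", "HC")
```

For sequences of length 100, `table_cells(100, 2)` is 10201 while `table_cells(100, 3)` is already 1030301, and every additional sequence multiplies the table size by roughly n again. This is why exact simultaneous alignment is abandoned in favor of heuristics once k grows.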

1.2.2 Contribution

In this thesis, we propose novel algorithms and tools for comparing protein folding trajectories and aligning multiple protein structures. More specifically, we introduce a framework, named Smolign [18], whose contributions are listed below:

• Alongside the Smolign framework, we first introduce the EPO algorithm [19] (An Enhanced Partial Order Curve Comparison Algorithm), based on a precursor concept, the Partial Order Graph (POG), as an independent tool for analyzing protein folding trajectories. Generally speaking, EPO takes a different approach to detecting common spatial conformations than most preceding work: it optimally overlays multiple curves and extracts a set of segments whose spatial shapes are similar in every structure. The idea was initially motivated by the difficulty of analyzing multiple protein folding simulation data sets with low similarity. Previously, folding simulation analysis was performed mainly to test various protein folding models [20, 21, 22], such as the folding pathway model and the funnel model, and/or to study energetic aspects of folding kinetics [23, 24, 25, 26]. The geometric shapes of the conformations involved in folding trajectories have not been widely explored [27, 28, 29], despite their important role in folding, and an automatic tool to facilitate folding simulation analysis at large scale is still missing. EPO provides an important step towards the problem of optimal superposition by modeling folding trajectories as curves and using our multiple curve/structure comparison algorithm to detect critical folding events. Among traditional progressive methods, both the center-star approach and the hierarchical approach lose significant information at each step, for small curve segments as well as for partially matched conformations, because of their progressive and pairwise (when applying dynamic programming) nature. EPO, on the other hand, uses a Partial Order Graph (POG) to store every point of each curve, so that small segments or partially matching conformations can potentially be detected. Consequently, multi-dimensional dynamic programming is used instead of the pairwise one. Furthermore, we employ a pre-clustering strategy to overcome the problem of POG size. Since we are dealing with multi-dimensional spatial data, which is much more complex than one-dimensional sequence data, we create a novel two-level scoring function that reduces the rate of missed matches and intelligently chooses the covering range during POG construction. Standard POG-based methods have a large advantage over traditional pairwise methods. However, the approach is inherently progressive, so a suboptimal processing order and unpredictable new input curves may still hurt the final result. We therefore develop a post-merging method for refinement purposes: a merging stage after the POG is constructed can greatly increase the matching rate, reduce the matching error, and tighten the aligned nodes.

• A contact-map-based spatial motif detection method is created in the Smolign framework. This technique extracts key motifs from both the Secondary Structure Elements (SSEs) and other characterized geometric cores, and then maps them into compact feature vector spaces to facilitate the construction of index structures with sensitive, accurate, and efficient filtering capabilities. Unlike the common methods used in previous studies, our method explores the structural information directly instead of being limited to sequence fragments. This makes it possible to lift the restriction of sequence order and to reveal biologically meaningful functional sites purely from a structural point of view. Such a motif detection method is also useful for building a broader database that can serve multiple other applications.

• This thesis proposes a new heuristic method that simultaneously detects and optimizes the maximum correspondences (geometric cores) among the given set of structures. Using a pre-constructed Spatial Motif based Library (SML) for protein structure data, we can efficiently select valid motif seeds from an exponential number of candidates. We implement a sophisticated pruning and evaluation algorithm that quickly retrieves high-quality seeds from the candidate pool and optimally extends the size of the selected spatial motifs. Compared to other seed selection approaches in the domain of multiple structure alignment, our method is the first that considers all similar geometric cores together in order to avoid the local minimum issue. Owing to the inherent properties of Smolign, motif seeds can easily be expanded to flexible/non-rigid alignments, which provides a powerful tool for detecting all functional sites simultaneously.

• Lastly, we extend the implementation of EPO to serve as Smolign's superposition module, which performs the global superposition task for a given set of structures. When aligning multiple structures, especially those with low similarity, the local minimum issue exists at both the seed detection (alignment) stage and the global superposition stage. We have invented a novel simultaneous seed detection method for the first stage, and we apply a new strategy at the later stage to further overcome suboptimal or missed superpositions. In this strategy, EPO is applied iteratively to the given structure set. The first iteration of EPO obtains the initial translated and rotated structure positions from the seed alignment stage; the POG constructed and tuned in that iteration then provides a very informative guide for the next EPO iteration. Iterative EPO gives Smolign the ability to discover small geometric cores that were not revealed in the seed selection stage and to minimize superposition distances.

1.3 Outline

The rest of this thesis is organized as follows. In Chapter 2, we introduce the essential molecular biology background. In Chapter 3, we classify and analyze a set of typical algorithms and tools in the domain of structure alignment and explain the problems that have arisen over the course of this research. In Chapter 4, we describe the construction of EPO and the details of its application to trp-cage protein simulations. We also demonstrate the advantages of EPO and present interesting discoveries from the experimental results. In Chapter 5, we present the design of the Smolign framework, including motif detection, library construction, alignment seed detection, global alignment, and the experimental evaluation of Smolign. In Chapter 6, we summarize our current work and describe directions for future research.

CHAPTER 2

PROTEIN BACKGROUND PRELIMINARIES AND BASIC TOOLS

2.1 Principle of Protein Structure

2.1.1 Overview

Proteins are the main agents in cells and carry out nearly every task in living cells. All functions of living organisms are related to proteins, and each protein or group of proteins is responsible for its own specific function. Among the major biochemical functions of proteins are catalysis, regulation (hormones), transporting chemical compounds, storing energy, converting energy, signaling, immune responses, and cell adhesion. The genetic code (Figure 2.1), expressed as either RNA codons or DNA codons, specifies the basic building blocks of proteins, the amino acids [30]. Proteins, classified by their physical size as nanoparticles (1-100 nm), are polymers, also known as polypeptides, made of 20 different amino acids joined together by peptide bonds. In the following section, we discuss the different levels of protein structure.

2.1.2 Protein Structure Hierarchy

Protein structure is described by a hierarchy of four levels: primary, secondary, tertiary, and quaternary (see Figure 2.3). Each level is briefly described below.

Figure 2.1: Amino acids mapped to tri-nucleotide sequences, also called codons. Note that most of the amino acids are encoded by more than one codon.

2.1.2.1 Primary Structure

The general chemical structure of an alpha amino acid, shown in Figure 2.2a, contains an amine group and a carboxylic acid group, which all amino acids share, and a side chain that varies between amino acids. The side chains are what make each amino acid different from the others. The 20 amino acids used to make proteins fall into three groups: ionic, polar, and non-polar. These names refer to the way the side groups (sometimes called "R" groups) interact with the environment. Of particular interest, polar amino acids tend to orient themselves in a particular direction, whereas non-polar amino acids are largely indifferent to their surroundings.

As both the amine and carboxylic acid groups of amino acids can react to form amide bonds, one amino acid molecule can react with another to form a peptide bond and a molecule of water. This polymerization of amino acids, shown in Figure 2.2b, is what creates proteins.

The linear sequence of the different amino acids that comprises one polypeptide chain is called the primary structure [32] of the protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end.

(a) Single amino acid (b) peptide

Figure 2.2: Formation of a peptide bond through condensation of two amino acids. The Rx groups on each amino acid represent the variable side-chains.

Amino acids represent quite a broad spectrum of different chemical structures. Upon the generation of a protein with a specific amino acid sequence using essentially the genetic information present in the DNA, the link between genetic and functional information is complete.

2.1.2.2 Secondary Structure

Each amino acid sequence forms a unique, stable spatial structure in its native environment, which can vary considerably among cell compartments and extracellular fluid. Four typical local structure types are recognized by specific backbone torsion angles and specific main-chain hydrogen bond pairings, namely the alpha helix, beta sheet, beta turn, and loop.

Secondary structures are usually held together by hydrogen bonds between the carbonyl oxygen and the amide hydrogen of the peptide bond. The alpha helix shown in Figure 2.4a is a right-handed coiled or spiral conformation stabilized by hydrogen bonding. This bonding occurs between the C=O group of one amino acid and the N-H group of the amino acid four residues further along the chain, and the helix makes a complete turn every 3.6 amino acids.

Figure 2.3: Protein Hierarchy (source: [31])
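The helix geometry can be illustrated with a small sketch that places Cα atoms on an ideal right-handed helix and lists the hydrogen-bonded residue pairs. Only the 3.6 residues per turn and the four-residue bonding offset come from the text; the rise of 1.5 Å per residue and the Cα radius of 2.3 Å are commonly quoted textbook values assumed here, and the function names are hypothetical.

```python
import math

RESIDUES_PER_TURN = 3.6   # from the text
RISE_PER_RESIDUE = 1.5    # angstroms along the axis (assumed textbook value)
CA_RADIUS = 2.3           # angstroms from the helix axis (assumed textbook value)

def ideal_helix_ca(n_residues):
    """C-alpha coordinates of an ideal right-handed helix along the z axis."""
    step = 2.0 * math.pi / RESIDUES_PER_TURN   # 100 degrees per residue
    return [(CA_RADIUS * math.cos(i * step),
             CA_RADIUS * math.sin(i * step),
             RISE_PER_RESIDUE * i)
            for i in range(n_residues)]

def hbond_pairs(n_residues):
    """Pairs (i, i + 4): C=O of residue i bonded to N-H of residue i + 4."""
    return [(i, i + 4) for i in range(n_residues - 4)]

helix = ideal_helix_ca(19)
pairs = hbond_pairs(6)
```

Since 3.6 residues per turn is 18 residues per 5 turns, residue 18 sits directly above residue 0 in this idealized model, and consecutive Cα atoms come out roughly 3.8 Å apart, which matches the usual Cα spacing in real chains.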

The beta sheet shown in Figure 2.4b is the second type of regular secondary structure in proteins. Beta sheets are made of beta strands that are connected laterally by at least two or three backbone hydrogen bonds, forming a generally twisted, pleated sheet. A beta strand is a stretch of polypeptide chain, typically 3 to 10 amino acids long, with its backbone in an almost fully extended conformation. In many human diseases such as the amyloidoses (e.g., Alzheimer's disease), protein aggregates and fibrils form through the higher-level association of beta sheets.

(a) Alpha helix structure (b) Beta sheet structure

Figure 2.4: Two basic secondary structure types

A beta-turn involves four amino acid residues and may or may not be stabilized by an intra-turn hydrogen bond between the backbone CO of the first residue and the backbone NH of the fourth. Moreover, the distance between the first and the fourth amino acid is less than 7 Å.
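The distance criterion above translates directly into a simple geometric filter. The sketch below flags every window of four consecutive residues whose first and fourth Cα atoms are closer than 7 Å. Measuring the distance between Cα atoms is an assumption (the text does not name the atom), the coordinates are made up, and a complete turn detector would also need to exclude helical segments, which this example does not do.

```python
import math

def beta_turn_candidates(ca_coords, cutoff=7.0):
    """Start indices i of windows i..i+3 with C-alpha(i) to C-alpha(i+3) below cutoff."""
    return [i for i in range(len(ca_coords) - 3)
            if math.dist(ca_coords[i], ca_coords[i + 3]) < cutoff]

# Hypothetical chains with 3.8 angstrom spacing between consecutive C-alpha atoms.
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]   # straight, fully extended
hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0),
           (0.0, 3.8, 0.0), (-3.8, 3.8, 0.0)]        # chain reverses direction

turns_extended = beta_turn_candidates(extended)
turns_hairpin = beta_turn_candidates(hairpin)
```

In the extended chain, residues i and i + 3 are 11.4 Å apart, so no window qualifies, while the hairpin yields a single candidate starting at residue 0.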

A loop is a connecting segment between alpha helices or beta strands and is often less stable than other secondary structures. Loops have been classified into five types (alpha-alpha, beta-beta links, beta-beta hairpins, alpha-beta, and beta-alpha) according to the secondary structures they connect. Three angles and one distance between the secondary structure elements are usually used to define a loop. Loop prediction plays a critical role in protein structure prediction methods.

2.1.2.3 Tertiary Structure

Tertiary structure (e.g., Figure 2.5) refers to the 3D structure of the entire polypeptide. The tertiary structure of a protein describes the folding of its secondary structural elements, linked by turns and loops, and specifies the super-secondary structures or domain structures in the protein. It also includes the position of each amino acid's side chain, whose stability is determined by non-bonding interactions and disulfide bonds. The known protein structures have come to light through X-ray crystallographic or nuclear magnetic resonance (NMR) studies, and the common features of tertiary structure reveal much about the biological functions of proteins and their evolutionary origins. In this thesis, tertiary structure plays a fundamental role in our framework, as we will see below.

Figure 2.5: A sample tertiary structure including 5 alpha helices and 6 beta sheets.

2.1.2.4 Quaternary Structure

Quaternary structure is at the top of the protein structure hierarchy. Some proteins contain two or more different polypeptide chains, and the interaction between these polypeptides forms the quaternary structure.

2.2 Protein Folding

2.2.1 Introduction

As we mentioned above, all proteins in nature are made of polypeptide chains assembled from amino acids. Cells create proteins by "translating" them from RNA sequences (which are themselves transcribed from DNA sequences). When proteins are translated from RNA they start out as linear sequences of amino acids. Because the amino acids that make up a protein have various electrostatic and mechanical properties, a protein does not stay in its original form for long and begins to fold up into an eventually stable three-dimensional structure. It is this three-dimensional structure (as well as the mechanical and electrostatic properties of the amino acid sequence) that gives the protein its functionality.

Each protein has a distinct and characteristic solubility in a defined environment, and any change to those conditions (buffer or solvent type, pH, ionic strength, temperature, etc.) can cause proteins to lose their solubility and precipitate out of solution. The environment can be manipulated to bring about a separation of proteins. For example, the ionic strength of the solution can be increased or decreased, which will change the solubility of some proteins.

Proteins can denature, or unfold, so that their three-dimensional structure is altered but their primary structure remains intact. Many of the interactions that stabilize the 3D conformation of a protein are relatively weak and are sensitive to various environmental factors, including high temperature, low or high pH and high ionic strength. Proteins vary greatly in their sensitivity to these factors. Sometimes proteins can be renatured, but often the denaturation is irreversible.

Incorrect protein folding leads to a number of proteopathies, such as antitrypsin-associated emphysema, cystic fibrosis and the lysosomal storage diseases. While protein replacement therapy has historically been used to treat the latter disorders, an emerging approach is to use pharmaceutical chaperones to fold mutated proteins and render them functional.

Figure 2.6: An example of protein folding (source: [33])

2.2.2 Protein Folding Data Modeling

Folding or unfolding of a protein is a very fast process that takes place on a microsecond or sub-microsecond timescale. To observe how a protein chain moves, researchers use molecular dynamics (MD) [34][35], which was originally conceived within theoretical physics in the late 1950s and early 1960s and is applied today mostly to the modeling of biomolecules.

MD is a specialized discipline of molecular modeling and computer simulation based on statistical mechanics. A simulation may take a few months on modern supercomputers and produces a series of consecutive folding stages that preserve information on the physical folding pathway. Given a protein p that is l amino acids long and an MD simulation that samples n folding stages, the i-th folding stage is p_i, where 1 ≤ i ≤ n; the three-dimensional coordinates of p_i form a single point in 3 × l dimensions (see Figure 2.6). Connecting the n folding stages in time order yields a high-dimensional curve T = p_1, ..., p_n, which is the usual data model in folding trajectory analysis.
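The data model above can be illustrated with a minimal sketch. It assumes each folding stage is given as a list of l coordinate triples (one representative position per amino acid); the function names `stage_to_point`, `trajectory_curve` and `step_lengths` are hypothetical and are not part of any MD package.

```python
import math

def stage_to_point(coords):
    """Flatten l (x, y, z) triples into one point with 3*l components."""
    return [c for xyz in coords for c in xyz]

def trajectory_curve(stages):
    """Turn n time-ordered folding stages into the curve T = p_1, ..., p_n."""
    points = [stage_to_point(stage) for stage in stages]
    dim = len(points[0])
    if any(len(p) != dim for p in points):
        raise ValueError("every stage must contain the same number of residues")
    return points

def step_lengths(curve):
    """Euclidean distance between consecutive points of the curve."""
    return [math.dist(p, q) for p, q in zip(curve, curve[1:])]
```

With l = 2 residues and n = 3 stages, the curve consists of three points in 6 dimensions, and `step_lengths` gives a rough view of how far the conformation moves between samples.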

The study of the animation of curve T revealed a hydrophobic-core driven folding mechanism. Because the task is to identify and capture representative intermediate configurations, high-dimensional curve comparison methods are fundamental in this area. Since working in the structure space of the protein is extremely complex, researchers often identify a few key characteristic features of the protein, the so-called reaction coordinates, and study the trends and variations in these reaction coordinates.

2.3 Dynamic Programming

In the early 1970s, dynamic programming [36] was introduced to biologists by Saul B. Needleman and Christian D. Wunsch. As the most popular tool in computational molecular biology, the method was first used to find the optimal alignment of nucleotide and amino acid sequences. Later, many variants [37][38][39] appeared and the concept was further applied to the comparison of 3-dimensional molecular structures, known as homology modeling [40][41]. In the context of sequence alignment, dynamic programming focuses on pairing the most similar residues between two sequences; in the context of structure alignment, it finds the closest pairs of residues after translating and rotating one structure onto another. Given two proteins with lengths m and n, an implementation of dynamic programming starts from a score function S and a gap penalty value d. In protein sequence alignment, S is a similarity matrix, called a substitution matrix: a 20 × 20 table over the amino acids whose entries specify the score for aligning each amino acid with another. Commonly used substitution matrices are PAM250 [42], BLOSUM62 [43], etc. In structural alignment, on the other hand, S is usually a distance function that measures the distance between two residues in space. The value d is the penalty assigned to a gap, that is, to an insertion or a deletion. For example, Figure 2.7 aligns two genome sequences a and b in an alignment matrix F. The entry in row i and column j is denoted here by F_{i,j}. The dynamic programming algorithm fills the alignment matrix F by:

F_{0,j} = d \cdot j    (2.3.1)

F_{i,0} = d \cdot i    (2.3.2)

F_{i,j} = \max( F_{i-1,j-1} + s(a_i, b_j), \; F_{i,j-1} + d, \; F_{i-1,j} + d )    (2.3.3)

where s(a, b) is the similarity score from the similarity matrix (e.g. an entry in Figure 2.7a). Equations 2.3.1 and 2.3.2 initialize the algorithm, and Equation 2.3.3 fills in each entry recursively based on the principle of optimality [44].

Once each entry in the alignment matrix is filled, we can trace back from the bottom-right entry by comparing the three possible scores, where F_{i-1,j-1} is a match, F_{i-1,j} is a deletion and F_{i,j-1} is an insertion, to get the alignment path. Figure 2.7c shows the final best alignment with total score s(A, C) + s(G, G) + s(A, A) + 3 × d + s(G, G) + s(T, A) + s(T, C) + s(A, G) + s(C, T) = 1, where the penalty d = −5 in this case. Note that the above basic dynamic programming takes O(mn) time and space to obtain the globally optimal alignment of two sequences.
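The fill and traceback steps just described can be sketched as follows. This is a minimal illustration with a linear gap penalty, not the implementation used in the thesis; the function name `needleman_wunsch` and the callable `score` (standing in for the similarity matrix s) are assumptions made for the example.

```python
def needleman_wunsch(a, b, score, gap):
    """Global alignment of sequences a and b.

    score(x, y) plays the role of s(a_i, b_j); gap is the penalty d.
    Returns (optimal score, aligned a, aligned b) with '-' marking gaps.
    """
    m, n = len(a), len(b)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: leading gaps in b
        F[i][0] = gap * i
    for j in range(1, n + 1):          # first row: leading gaps in a
        F[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)

    # Trace back from the bottom-right entry to recover one optimal path.
    i, j = m, n
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return F[m][n], ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

When several paths have the same score, this sketch prefers the diagonal move; other tie-breaking rules give equally optimal alignments.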

An important variant of basic dynamic programming is the Smith-Waterman algorithm [37], which addresses locally optimal alignment. The Smith-Waterman algorithm requires every cumulative value in the matrix to be greater than or equal to 0. When filling an entry, if the cumulative value falls below 0 we set it to 0, which marks a possible "start point" of a local alignment. Once the matrix is filled, we begin the traceback at the maximum value found anywhere in the matrix and continue until the value falls to 0.
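A minimal sketch of this local variant is given below, again with a linear gap penalty. The function name `smith_waterman` and the callable `score` are assumptions made for illustration.

```python
def smith_waterman(a, b, score, gap):
    """Local alignment of a and b; returns (best score, local a, local b)."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Values below 0 are clamped to 0, which restarts the alignment.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i][j - 1] + gap,
                          H[i - 1][j] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum entry until the value falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))
```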

(a) Similarity matrix (b) Alignment matrix (c) Alignment result

Figure 2.7: An example of basic dynamic programming on genome sequence alignment. (a) A sample similarity matrix. Such matrices are usually generated by statistical methods [45]. (b) A two-dimensional matrix F is allocated to show the highest-scoring alignment procedure on two sample genome sequences, a and b, listed in columns and rows respectively. (c) The final alignment of sequences a and b with 3 insertions.

2.4 Partial Order Graph and Tool

Dynamic programming can be naturally extended to multiple sequence/structure alignment (MSA) by progressively aligning each new sequence onto the previous alignment result. However, a couple of crucial shortcomings make it impractical in real-world applications.

Scalability: Given N sequences, and assuming without loss of generality that each sequence has length L, aligning all sequences simultaneously requires O(L^N) time, since basic dynamic programming already requires O(L^2) time for a pairwise alignment. Such exponential running time is generally unacceptable even for modest cases. Note that some heuristic methods have been developed to attack this issue [46, 47].

Local minimum and Stability: In general progressive methods, a pairwise alignment is executed at each stage between a new sequence and the previously generated consensus sequence or alignment profile. Mismatches introduced at any stage, such as insertions and deletions in the sequence domain or alignment errors in the structure domain, cannot be corrected later and accumulate over the whole iterative alignment procedure. As a consequence of progressively aligning to a consensus sequence, the final result may fall into a local minimum [48][49] instead of the global optimum. As evidence of this, the order of the inputs can determine the quality of the alignment result.

An alternative to progressive alignment methods that was developed recently is Partial Order Alignment (POA) [50][51][52][53][54]. The POA approach shows great advantages [55] over traditional progressive methods and has quickly spread from multiple sequence alignment to multiple structure alignment. In particular, one of our contributions in the framework is to extend and enhance the POA algorithm for structure alignment. In this section, we only use sequence alignment to demonstrate the method.

POA, as its name implies, utilizes a Partial Order Graph (POG) [56] to represent the alignment result. A POG is a directed acyclic graph that may contain pairs of nodes i, j with no path of directed edges between them. We will give a formal definition in the next chapter before we apply it in our framework.

Unlike progressive alignment, POA aligns multiple sequences without losing information. Figure 2.8a shows that traditional pairwise alignment generates a representation with insertion and deletion gaps, which are eventually eliminated in MSA to form a consensus sequence; the eliminated residues cannot be recovered later. In fact, a consensus sequence can be considered a total order graph in which any two nodes i, j are guaranteed to satisfy either i < j or j < i, but not both. In contrast, POA keeps all those residues as partial order nodes without losing them. In Figure 2.8c, nodes without dashed circles cannot be aligned at this moment but are kept in the graph. Meanwhile, nodes (I, V) from the first sequence are temporarily aligned to nodes (L, I) from the second sequence. Since all information from previous alignments is available to the next iteration, it is possible to rearrange the nodes to achieve global optimization. Figure 2.8d gives a compact version of the POA representation in which identical nodes are merged together.
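The difference between a total order (a consensus sequence) and a partial order (a POG) can be made concrete with a small sketch. It assumes the graph is stored as a dictionary mapping each node to its successors; the node names and the helpers `reachable` and `incomparable_pairs` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def reachable(graph, src, dst):
    """True if a directed path leads from src to dst (depth-first search)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False

def incomparable_pairs(graph):
    """Pairs of nodes with no directed path between them in either direction.

    An empty result means the graph is a total order; any pair returned
    is only partially ordered, as allowed in a POG.
    """
    nodes = sorted(set(graph) | {v for succ in graph.values() for v in succ})
    return [(u, v) for u, v in combinations(nodes, 2)
            if not reachable(graph, u, v) and not reachable(graph, v, u)]
```

For a simple chain the function returns an empty list, while a graph with two parallel branches between a shared start and end node reports the two branch nodes as incomparable.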

POA extends the basic dynamic programming alignment method on top of the POG representation. Instead of operating on a purely 2D matrix in which each axis represents an individual sequence, as basic dynamic programming does (see Figure 2.9a), POA works on a high-dimensional matrix. Figure 2.9b gives a demonstration of a 3-sequence alignment in which the partial order nodes from the previously generated POG create additional dimensions or independent paths (e.g. the circled sub-sequence a-c-a-t-g-g-a-c). The third sequence is aligned along each dimension or path of the POG just as in basic dynamic programming, except at bifurcation points, where multiple dimensions or paths merge. At a bifurcation point, the possible incoming moves are not fixed to the three moves of a 2D matrix (vertical, horizontal and diagonal) but depend on the number of dimensions or paths (e.g. there are 5 moves in this demonstration).

(a) Traditional pairwise representation

(b) One sequence in PO representation

(c) A pair of aligned sequences in PO representation

(d) PO representation with merged nodes

Figure 2.8: Comparison of alignment representations between the traditional method and POG. (source: [53])
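The way the set of incoming moves grows with the number of predecessor paths can be sketched as a score-only dynamic program over a graph. This is a simplified illustration, not the POA implementation of [53]: it assumes the POG nodes are supplied in topological order, uses a linear gap penalty, and omits the traceback and the fusing of the new sequence into the graph. The names `align_to_pog`, `preds` and `labels` are hypothetical.

```python
def align_to_pog(nodes, preds, labels, seq, score, gap):
    """Best global score for aligning seq against any full path through a POG.

    nodes  : node ids in topological order
    preds  : dict node -> list of predecessor ids (missing or empty for sources)
    labels : dict node -> residue letter stored in that node
    """
    START = None                       # virtual node preceding all sources
    n = len(seq)
    F = {START: [gap * j for j in range(n + 1)]}
    for v in nodes:
        ps = preds.get(v) or [START]
        row = [max(F[p][0] for p in ps) + gap]
        for j in range(1, n + 1):
            best = row[j - 1] + gap    # horizontal move: gap in the graph
            for p in ps:               # one diagonal and one vertical move per predecessor
                best = max(best,
                           F[p][j - 1] + score(labels[v], seq[j - 1]),
                           F[p][j] + gap)
            row.append(best)
        F[v] = row
    # A node is a sink if it is nobody's predecessor.
    sinks = [v for v in nodes if not any(v in preds.get(w, ()) for w in nodes)]
    return max(F[v][n] for v in sinks)
```

A node with k predecessors has 2k + 1 incoming moves, so a node where two paths merge has 5 moves, and a linear chain reduces to the ordinary 3-move recurrence.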

Compared to traditional MSA methods, POA has a couple of distinct advantages:

Global optimization: When dynamic programming is applied on a POG, the final traceback no longer works on a 2D matrix but on a high-dimensional matrix in which the information from all sequences is preserved, including the inserted or deleted positions that would previously have been lost. In addition, there are opportunities to adjust the alignment during and after the construction of the POG. POA therefore achieves better global alignment than traditional progressive methods. A more detailed discussion is given in a later chapter.

Subset alignment: In a POG built from the alignment of n sequences, each node holds residues from between 1 and n of the sequences, which opens an opportunity to collect subset alignments. For example, besides the nodes that contain residues from all n sequences, there may exist a group of nodes in the POG that contain residues from a subset a of the sequences and another group of nodes that contain residues from a different subset b, where a ≠ b. Depending on their biological meaning as motifs, groups a and b may reveal functional and homologous relationships within protein families or superfamilies [57][54]. Sometimes detecting such differences and distances plays a more important role than the full-set alignment.

With such robust and flexible properties, POA has been successfully applied to multiple sequence alignment. However, the one-dimensional nature of protein sequences limits the capabilities of POA. In later chapters we will show the power of POA on protein structures, which involve points in multi-dimensional space.

(a) Basic dynamic programming matrix (b) One sequence in PO representation

Figure 2.9: Compare basic dynamic programming alignment and POA. (source: [53])

CHAPTER 3

PROTEIN STRUCTURAL COMPARISON

3.1 Overview

Proteins carry out their specific biological roles through interaction with other proteins or other macromolecules. This interaction is determined largely by the three-dimensional structures of the molecules. The structure of a protein is relatively stable compared with its polypeptide sequence. During the course of evolution, the protein 3D fold is better preserved than its primary sequence [58]. A substantial number of amino acids may mutate over the generations while the stabilized structures are maintained among homologous species. For example, a DNA sequencing study [59] confirmed that a regulatory protein called neuronal Cdk5 activator, Nck5a, adopts a conformation similar to that of cyclin A although the two actually share very little sequence similarity. Such facts were explored and quantified in [58] [60] [61] [62].

Therefore, an important direction toward understanding how proteins function and how proteins evolve from a common evolutionary origin is to study and analyze their structures. In particular, one fundamental task involved in such an analysis is structural alignment, where the proteins are superimposed in order to find the similarities and differences in their structures. Alignment and comparison of protein structures can help discover biologically significant structural motifs and reveal distant evolutionary relationships that may not be detectable from the sequence information alone.

To date, hundreds of protein structure alignment and comparison methods have emerged over the past 40 years. The common goals in this area are to identify the equivalences between pairs or sets of amino acid residues from a given set of proteins and, at the same time, to superimpose the identified residues under certain geometric constraints. The field has evolved from pairwise structure alignment to multiple structure alignment, driven by the demands of analysis.

First, we break the structure alignment and comparison problem into 3 aspects that most existing methods have to deal with.

Structure Data Modeling: The way to convert a structure's raw 3-dimensional data into a convenient format suitable for comparison or alignment.

Pattern Retrieving: Alignment methods that select possibly similar fragments from the whole alignment space. The final global alignment is developed from these initially discovered fragments.

Result Measuring: Geometric measures and biologically relevant measures. The measurement method is usually used to control the alignment procedure and to indicate the quality of the result as well.

Secondly, we outline and classify some widely accepted and used methods according to the 3 aspects we just mentioned. Lastly, we review in more detail some important methods whose concepts or properties are related to our current research.

3.2 Protein Structure Data Modeling

The raw protein structure data is captured by X-ray crystallography [63] or NMR spectroscopy [64, Chapter 2]. Generally, X-ray crystallography can obtain the whole 3D structure of a molecule of any size by systematic analysis of a well-crystallized sample. NMR, on the other hand, is primarily limited to relatively small proteins but enables us to observe the chemical kinetics and the motion of segments (domains). Either method, combined with computational analysis tools, returns the 3-dimensional coordinates of each atom inside a protein.

In order to compare the structural similarity of given proteins from raw 3-dimensional data, most of the developed methods start by converting the raw data into certain well-organized formats [65]. The converted structure representations, which serve as the entry point for pattern retrieval and comparison in the later steps, should efficiently describe the distinguishing features of each protein, such as its topology, geometry or biological properties. Despite the large number of structure representation variants, most of them fall into one of the few categories that we describe below:

3.2.1 Geometric Vector Representation

The first commonly used way is to convert protein data into different kinds of vectors based on their geometric properties. Given an ensemble of amino acid residues in 3-dimensional space, researchers usually pick the position of one atom, the Cα (Cβ is another popular choice), to represent the whole residue. In this way, the geometric characteristics of a suitable number of spatial points are used to form various feature vectors that describe the local spatial arrangement. A list of such vectors gives a picture of the whole protein structure.

Sequential Structure Alignment Program (SSAP) [17] builds the "view" of each residue by constructing a set of vectors from its own Cβ to the Cβ atoms of all other residues. Together, these views capture the direction and position of each protein structure effectively, and the later pattern detection and comparison all work on this set of geometric "view" vectors.

Vector Alignment Search Tool (VAST) [12][66] adopted the most conserved fragments of SSEs as core of vector construction. The vector records the spatial orientations and connectivity of SSEs. VAST data model simplifies data presentation and omits all non-structural fragments.

MAMMOTH [67][68] is a recently developed method that takes every successive pair of Cα atoms along a protein backbone as a unit vector; all unit vectors are then placed at the origin to form a unit sphere. Although the unit sphere loses the global geometric relationships of each structure, it does simplify the protein structure's representation without involving any sequence information or molecular properties. Structure comparison by unit-vector root mean square (URMS) distance is later applied to all pairs of unit spheres.

However, the most popular geometric vector representation is the so-called Aligned Fragment Pair (AFP), which was first used by CE [16] and its multiple alignment version CE-MC [69], followed by FATCAT [70] with its successor POSA [54], MULTIPROT [71], MATT [72], MUSTANG [73], etc. All of these methods use a fixed-size fragment of consecutive residues (Cα positions) directly as a vector-like unit. Given a protein of length n and a fragment size m, we get n − m + 1 possible overlapping fragments. A pair of such fragments, one from each of two different structures, with a small RMSD (root mean square deviation after optimal superposition) is denoted an Aligned Fragment Pair and reflects local geometric similarity. A block of compatible AFPs describes a possible ordered alignment path between two structures.
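Fragment enumeration and AFP detection can be sketched as follows. To keep the example free of a superposition step, it swaps in a distance-based RMSD (dRMSD, which compares intra-fragment distances) for the RMSD after optimal superposition that the methods above use; the function names and the `cutoff` parameter are assumptions made for illustration, and the fragment size m must be at least 2.

```python
import math
from itertools import combinations

def fragments(coords, m):
    """All overlapping fragments of m consecutive residues (n - m + 1 of them)."""
    return [coords[i:i + m] for i in range(len(coords) - m + 1)]

def drmsd(frag_a, frag_b):
    """Superposition-free dissimilarity between two equal-length fragments."""
    pairs = list(combinations(range(len(frag_a)), 2))
    total = sum((math.dist(frag_a[i], frag_a[j]) -
                 math.dist(frag_b[i], frag_b[j])) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))

def aligned_fragment_pairs(a, b, m, cutoff):
    """Index pairs (i, j) whose fragments a[i:i+m] and b[j:j+m] look alike."""
    frags_a, frags_b = fragments(a, m), fragments(b, m)
    return [(i, j)
            for i, fa in enumerate(frags_a)
            for j, fb in enumerate(frags_b)
            if drmsd(fa, fb) <= cutoff]
```

Because dRMSD depends only on internal distances, it is unchanged by rotating or translating either structure, but unlike superposition RMSD it cannot distinguish a fragment from its mirror image.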

3.2.2 Bio-property Vector Representation

The second way to represent protein structure data is to use the biological properties of the protein structure. In addition to primary structure, the sequence description, a protein can also be described by its secondary structure, tertiary structure and quaternary structure according to its local as well as global folding. Researchers try to build property vectors by collecting folding characteristics such as secondary structure type, relative location measurements, the charge characteristics of residues, etc. Usually, these vectors provide various levels of insight into the organization of the structure.

SARF2 [74] uses typical secondary structure conformations to retrieve α-helix and β-sheet fragments from a given protein structure, then creates a vector between two SSE fragments {S_i, S_j} using properties such as:

the angle Γ_ij between them, the shortest distance between their axes, the closest points on the axes, and the minimum (D_min) and the maximum (D_max) distances from each SSE to their medium line.

This property vector discards all global geometric details of the structure but gains more local topology information, which can be applied efficiently in the later comparison stage.

MASS [75], much like SARF2, builds a set of "bases" for each protein structure. Each base is a pair of SSEs that is classified by a vector-like fingerprint. The properties included in this fingerprint are: 1) the type of each SSE; 2) the angle between the two SSEs; 3) the mid-point distance between the axes of the two SSEs; 4) the closest distance between the least-squares lines of the two SSEs. The bases are clustered by a hashing function for later comparison and alignment.
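Two of the geometric properties listed above, the inter-axis angle and the mid-point distance, can be computed with a short sketch. It assumes each SSE axis is already available as a pair of 3D end points; the function name `sse_pair_features` is hypothetical and does not reproduce the fingerprint code of SARF2 or MASS.

```python
import math

def sse_pair_features(axis_a, axis_b):
    """Angle (degrees) and mid-point distance between two SSE axes.

    Each axis is a (start, end) pair of 3D points approximating the SSE.
    """
    (a0, a1), (b0, b1) = axis_a, axis_b
    va = [q - p for p, q in zip(a0, a1)]          # direction of axis a
    vb = [q - p for p, q in zip(b0, b1)]          # direction of axis b
    cos = sum(x * y for x, y in zip(va, vb)) / (math.hypot(*va) * math.hypot(*vb))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    mid_a = [(p + q) / 2 for p, q in zip(a0, a1)]
    mid_b = [(p + q) / 2 for p, q in zip(b0, b1)]
    return angle, math.dist(mid_a, mid_b)
```

Both values are unchanged by rigid motion of the whole structure, which is what makes such fingerprints suitable as hashing keys.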

3.2.3 Distance Matrix and Its Variants Representation

The distance matrix [76][77][78][79][80] was initially studied around the 1930s in the area of graph theory. Given a set of N points in space, an N × N matrix contains the distances between all pairs of points. In particular, a Euclidean distance matrix is symmetric and nonnegative, and has zeroes along the diagonal. When the distance matrix is applied to molecular structure data, the units are the consecutive amino acid residues of the protein sequence, usually with the Cα position as the representative point, and entry (i, j) holds the distance between residue i and residue j. The advantage is that the distance matrix captures structural and connectivity information and provides a complete representation of the protein structure that is invariant under rigid transformations [81]. Note that it is possible to reconstruct the 3-dimensional structure of a protein from its distance matrix [82].
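As a minimal sketch of this representation, the following function builds the Euclidean distance matrix from a list of representative coordinates (for example Cα positions). The function name `distance_matrix` is an assumption for illustration.

```python
import math

def distance_matrix(coords):
    """N x N matrix whose entry (i, j) is the distance between points i and j."""
    return [[math.dist(p, q) for q in coords] for p in coords]
```

Translating every point by the same vector leaves the matrix unchanged, which illustrates the invariance under rigid transformations mentioned above.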

The distance matrix has an important variant, the contact map [83]. Figure 3.1 shows an example contact map of protein 1trmA. Instead of the Euclidean distance calculated for each matrix entry, the contact map uses a binary two-dimensional representation. For a pair of residues i and j, the entry (i, j) of the matrix is 1 if the distance between C_α^i and C_α^j is less than a predetermined threshold, and 0 otherwise. The contact map inherits the transformation-invariant property of the distance matrix and further reveals less obvious biological characteristics such as secondary and tertiary structures.
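The thresholding step can be sketched in a few lines. The function name `contact_map` is an assumption; the 12 Å value used in the test is the threshold quoted in the caption of Figure 3.1, and other thresholds are equally possible.

```python
import math

def contact_map(coords, threshold):
    """Binary N x N matrix: 1 where two representative atoms are closer than threshold."""
    return [[1 if math.dist(p, q) < threshold else 0 for q in coords]
            for p in coords]
```

Note that the diagonal is always 1 in this sketch, since every residue is at distance 0 from itself.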

Figure 3.1: Contact map of 1trmA (PDB code). The gray dots are entries with Distance(C_α^i, C_α^j) < 12 Å. Specifically, the gray patterns along the diagonal indicate α-helices, the gray patterns parallel or perpendicular to the diagonal are parallel and anti-parallel β-sheets, and other less regular patterns of residue contacts correspond to small loops and turns. (source: [84])

3.3 Structural alignment methods

As we mentioned before, structure comparison and alignment can recognize the important correlation between structure and function and reveal distant evolutionary relationships that are undetectable in sequence comparison and alignment. The research started in the 1970s [85], but the practical algorithms were mostly developed after the 1990s [12][65][86]. Over the past twenty years, there has been a large volume of research on the structural alignment problem. Early research focused primarily on the pairwise structure alignment problem (see [87] for a survey), where an optimal superposition of two protein structures is sought to minimize a given geometric distance measure. The quality of an alignment is generally quantified by two parameters: the number of corresponding residues and the root mean square distance (RMSD, see Equation 3.4.1) between the atomic coordinates of these corresponding residues. Whereas finding the optimal superimposition is a relatively simple task if the set of correspondences is already known [88], finding the optimal superimposition and the correspondences simultaneously is NP-hard [89]. Nevertheless, various heuristics have been developed and successfully applied to the pairwise alignment problem [13, 90, 17, 16, 91, 67, 92, 93, 94].

Recently, a more complex problem, the multiple structure alignment problem (MSTA), was introduced because of its advantages in detecting similarities and differences across a set of proteins together. Structural alignment of a set of related proteins helps find the conserved cores shared by all or a subset of the proteins and gives better insight into the significance of these structural cores than pairwise alignment does. Unfortunately, MSTA is computationally a very difficult problem. Even for a fixed transformation, finding the optimal correspondences among residues from k proteins of average length L takes O(L^k) time under most standard distance measures.

Based on the various data models described in Section 3.2, two major MSTA approaches have formed, roughly in chronological order. Below, we introduce a few typical methods that represent the two approaches.

3.3.1 Progressive alignment

The most intuitive and direct approach to comparing multiple structures is to use pairwise alignment as the cornerstone: each new structure is aligned with the previously generated consensus structure, and a new consensus structure is then formed.

One such natural extension of pairwise comparison is SSAPm [95], which is developed on top of SSAP. Taking as input the residue view vectors of each structure, as mentioned in the data modeling section, SSAPm first performs all pairwise structural alignments using double dynamic programming. Then the highest-scoring pair is selected to build the consensus structure as the geometrical average of the pair. Iteratively, the consensus structure is aligned with the closest of the remaining structures using double dynamic programming.

The above progressive MSTA framework is a greedy approach. A new consensus structure is recomputed in every iteration, and the geometric position of each residue is shifted according to the newly joined structure. Consequently, the solution obtained is not robust: it depends heavily on the order in which structures are joined as well as on the consensus structure maintained. Various heuristics have been exploited to find a good order for the progressive alignments. Note that this order can also be guided by a tree [46, 47] instead of a linear sequence, which removes the need to choose a seed structure. The progressive procedure may also be iterated several times to locally refine the multiple structure alignments.
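The greedy joining order can be illustrated with a short sketch. It is a simplification, not SSAPm itself: instead of re-aligning the consensus structure against every remaining structure, it approximates closeness to the consensus by the best precomputed pairwise score against any structure already joined. The function name `progressive_order` and the score matrix are hypothetical.

```python
def progressive_order(scores):
    """Greedy joining order from a symmetric matrix of pairwise similarity scores."""
    n = len(scores)
    # Seed with the highest-scoring pair.
    i, j = max(((p, q) for p in range(n) for q in range(p + 1, n)),
               key=lambda pq: scores[pq[0]][pq[1]])
    order = [i, j]
    remaining = set(range(n)) - {i, j}
    while remaining:
        # Next comes the structure most similar to anything already joined.
        nxt = max(sorted(remaining),
                  key=lambda k: max(scores[k][u] for u in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Changing a single pairwise score can change the seed pair and hence the whole order, which is one way to see why the result of a progressive method depends on the order of joining.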

In order to reduce the computational complexity, most early approaches build a multiple alignment by progressively aligning the inputs in a pairwise manner [95]. For example, the center-star approach used by [11] maintains a consensus template, and at each step a new input structure is aligned to this consensus by a pairwise alignment method. Alternatively, one can construct a consensus template hierarchically using a binary similarity tree, where each leaf represents an input structure and each internal node aligns the two structures from its children [96, 95]. One of the main limitations of these greedy methods is that following locally (pairwise) optimal solutions may not lead to a globally optimal solution. As a result, these methods are not effective at detecting low levels of similarity, because an incorrect decision made early on may cause them to miss the few correspondences that would otherwise have led to the globally optimal solution.

CE-MC [97] uses the CE [16] algorithm to perform all pairwise alignments, which are then progressively combined following the order defined by the UPGMA guide tree [98] of the pairwise alignments. The progressive alignments are refined using Monte Carlo simulations. The CE [16] pairwise alignment algorithm that forms the basis of CE-MC uses short backbone segments as aligned fragment pairs (AFPs), which are combined using combinatorial extension. In addition, MUSTANG and MATT also adopt the AFP data model to build locally similar substructures. Both of them also create a binary tree to guide the multiple alignment; the only difference is the approach used to build such trees.

MAMMOTH-mult follows an approach similar to CE-MC. It generates a guide tree by applying average-linkage clustering to all pairwise alignments, where each pairwise alignment is produced using the MAMMOTH [67] pairwise alignment method. In particular, MAMMOTH uses the unit-vector root mean square (URMS) distance [28] between hepta-peptide segments as the main mechanism to detect corresponding residues. At each level of the guide tree, MAMMOTH-mult also employs a SIMPLEX [99] optimization of the multiple alignment to counteract the greediness of the progressive alignment. Note that MAMMOTH-mult, like CE-MC, is a fragment-based alignment method.
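Guide-tree construction by average linkage (UPGMA) can be sketched as follows. The sketch assumes the pairwise alignment results have already been converted into a symmetric distance matrix (smaller means more similar); the function name `upgma` and the nested-tuple tree format are choices made for illustration and do not reproduce the CE-MC or MAMMOTH-mult code.

```python
def upgma(dist):
    """Average-linkage guide tree from a symmetric distance matrix.

    Leaves are the indices 0..n-1; the tree is returned as nested tuples.
    """
    n = len(dist)
    clusters = {i: (i, 1) for i in range(n)}      # id -> (subtree, leaf count)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps the distance an average over leaves.
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

The progressive alignment then proceeds bottom-up along this tree, aligning the two children of each internal node.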

The purpose of the guide tree in all of these methods is to mitigate the error accumulation that is critical in basic progressive methods. Generally, the guide tree is built on top of an all-against-all pairwise alignment, and the pairs with the highest similarity scores are combined to form upper-level nodes. However, this kind of approach still has a fundamental limitation.

Lemma 3.3.1. Given a guide tree G = {n_0, n_1, ..., n_i, ...}, where n_i is a node of the tree and n_0 is the root node in particular, let |n_i| be the size of the alignment core in n_i. We conclude: 1) the minimum alignment core size in G is |n_0|; 2) |n_i| is always determined by the component node of n_i with the smallest alignment core.

It follows that a guide tree optimizes the local alignment at each node but has difficulty detecting small structurally similar motifs, because only the best solution is selected at each hierarchical stage and the alignment core is fixed at each node. Consequently, it lacks control over the arrangement of the alignment across the whole structure set. The experimental data will later show that CE-MC and MAMMOTH-mult have limited ability to align data sets with very diverse structures. Note that if the optimal alignment core as well as other possible alignment cores were all kept in a node, there might be an opportunity to perform global alignment selection, but the potentially exponential computing time has prevented any practical method from emerging.

During the same period in which hierarchical guide-tree progressive methods were developed, center-star approaches [69, 100, 101, 102, 103] were also very popular.

The method described in [100] is a typical center-star algorithm. It performs all pair-wise alignments within the structure set by multiple dynamic programming. It then selects as the center the structure that is closest to all other structures. Finally, it consistently combines all the other aligned structures with this center structure. Center-star avoids constructing a consensus structure or profile at each pair-wise alignment stage and instead concentrates all alignment work on the center structure. However, the selection of the center structure becomes the bottleneck, causing local minimum and low sensitivity issues. MALECON [104] builds a tuple of three structures, rather than a single structure, as the alignment center, at the cost of enumerating all possible local alignments.
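The center selection step of a center-star method can be sketched as follows. This is a minimal illustration that assumes the all-against-all pair-wise stage has already produced a symmetric matrix of pair-wise distances (for example RMSD values), and that "closest to all other structures" means the smallest total distance. The function name `pick_center` and the toy matrix are hypothetical and are not taken from [100].

```python
import numpy as np


def pick_center(dist):
    """Return the index of the structure with the smallest total distance to all others."""
    dist = np.asarray(dist, dtype=float)
    return int(np.argmin(dist.sum(axis=1)))


# Toy symmetric distance matrix for four structures (made-up values).
pairwise = [
    [0.0, 2.0, 3.0, 4.0],
    [2.0, 0.0, 1.0, 2.0],
    [3.0, 1.0, 0.0, 3.0],
    [4.0, 2.0, 3.0, 0.0],
]
center = pick_center(pairwise)
# Every other structure is then aligned against the center only.
star_pairs = [(center, k) for k in range(len(pairwise)) if k != center]
print(center, star_pairs)
```

Because every later alignment goes through the chosen center, a poor choice at this step affects the whole result, which is the bottleneck noted above.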

3.3.2 Simultaneous alignment

The motivation for developing simultaneous alignment tools comes from the fact that the local minimum problem is fundamentally unavoidable in all progressive alignment methods.

Therefore, a new approach has recently emerged that considers all the given molecules simultaneously rather than starting from pair-wise alignments and iteratively combining pair-wise results. Multiprot [71], MASS [75], POSA [54] and our current research successfully adopt this strategy. The skeletons of these tools are similar: they first collect substructures from the given set of molecules and organize them into a clustered pool according to their alignment properties; then they search the clustered pool for optimal locally similar seeds; finally they perform global alignment on top of the selected seed. Results are evaluated by the alignment core size and the superposition quality. By their nature, these methods are capable of detecting alignments of subsets, which makes them insensitive to the presence of structurally dissimilar molecules and may help with data sets of low similarity. To collect substructures, some methods such as Multiprot and POSA utilize AFP chains. In contrast to this sequence-oriented detection of common fragments, MASS and our method use spatial motifs as the alignment anchors, which effectively avoids the restriction of sequence order. In particular, MASS focuses entirely on secondary structure elements (SSEs), while our method also considers strongly correlated loops and coils. Below, we introduce a few widely used methods in detail and emphasize the technical components that are also of interest in our framework.

Multiprot is a pivot-oriented algorithm in which a pivot molecule participates in the alignment of all the remaining molecules.

The initial step is to detect all fragment pairs between the pivot molecule $M_p$ and each of the remaining molecules $M_k$ from the set $S$. To save time, the AFPs are constructed on a 2D matrix whose axes represent the indices of $M_p$ and $M_k$. From a starting entry in the matrix, the compared fragments are extended on both sides along the diagonal direction, under the control of a pre-set threshold $\epsilon$.

The second step consists of plotting a 2-dimensional chart in which the x axis represents the indices of $M_p$ and the y axis lists the remaining molecules. Each $M_k$ forms a bin that contains all possible aligned fragments at the corresponding indices of $M_p$. From the plot, a cut can be detected by scanning the x axis from left to right.

We define a locally maximal cut Cut[α, β] as an interval [α, β] such that for any δ > 0, Cut[α − δ, β] and Cut[α, β + δ] contain different fragments than Cut[α, β].

Finally, the global multiple alignment is performed on top of the previously detected Cuts. If a Cut includes fragments from all molecules, a full-set alignment can be achieved; otherwise, a subset alignment is obtained. However, an issue common to all simultaneous alignment methods is very likely to occur during the alignment procedure. A selected Cut may include multiple fragments from one molecule, in which case the number of alignment choices is $\prod_i k_{M_i}$, where $k_{M_i}$ is the number of fragments from $M_i$. Faced with this unavoidably exponential number of alignment choices, Multiprot selects one with a simple heuristic that picks, for each $M_k$, only the fragment with the best pair-wise alignment result. Note that this solution may again cause a local minimum issue, but this can be compensated for by iteratively selecting $M_p$ from the whole set.
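The size of the choice space and the greedy heuristic described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout of a Cut, the fragment names and the convention that a higher pair-wise score is better are all assumptions made here, not details of the Multiprot implementation.

```python
from math import prod


def count_choices(cut):
    """Number of alignment choices: the product of the fragment counts per molecule."""
    return prod(len(fragments) for fragments in cut.values())


def greedy_pick(cut):
    """For each molecule keep only the fragment with the best pair-wise score."""
    return {mol: max(fragments, key=lambda f: f[1])[0] for mol, fragments in cut.items()}


# Hypothetical Cut: molecule id -> list of (fragment id, pair-wise score against the pivot).
cut = {
    "M1": [("a", 0.9), ("b", 0.5)],
    "M2": [("c", 0.7)],
    "M3": [("d", 0.2), ("e", 0.8), ("f", 0.4)],
}
print(count_choices(cut))
print(greedy_pick(cut))
```

The greedy pick inspects each molecule independently, so its cost is linear in the number of fragments, but it never evaluates a combination of fragments as a whole, which is why a local minimum can result.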

Our framework shares some common interests with Multiprot. The first is the pivoting technique. The pivot molecule is much like the center structure in the center-star approach mentioned above, which avoids progressively joining new molecules to a consensus structure. The difference is that a substructure of the pivot molecule is used as an anchor to hook up the counterparts in the other molecules simultaneously. Multiprot iteratively selects the pivot at the full-molecule level to avoid the local minimum issue, at the cost of being highly computationally demanding. In a later chapter we will show that our framework iteratively selects the pivot at the substructure level to speed up the whole procedure. The second is the use of the so-called bio-core filter. Bio-core helps to prune AFPs that are not biologically meaningful according to the chemical properties of the residues. We find that the earlier the false-positive similar substructures are filtered out, the easier the later global multiple alignment becomes.

Figure 3.2: The construction of a base footprint. Each of the secondary structure elements $\vec{A}$ and $\vec{B}$ can be either an α-helix or a β-strand (source: [75]).

MASS is a hybrid method that combines simultaneous substructure collection with progressive alignment.

Figure 3.3: Demo of Base Bucket to get an initial multiple structure alignment from pairs of SSE elements. (source: [75])

Initially, pairs of SSEs, named bases, are detected in each molecule and represented by a 4-tuple footprint vector (see figure 3.2) consisting of: the type (helix or strand), the angle between the SSEs' axial vectors, the distance between the midpoints of the axial vectors, and the line distance between the axial vectors.
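As an illustration of the footprint described above, the following Python sketch computes the four components for a pair of SSEs, each given by its type and the two end points of its axis. The function name `footprint`, the argument layout and the tolerance used to detect parallel axes are assumptions made for this sketch and are not taken from the MASS implementation.

```python
import numpy as np


def footprint(type_a, a_start, a_end, type_b, b_start, b_end):
    """Return (types, angle in degrees, midpoint distance, line distance) for two SSE axes."""
    a0, a1, b0, b1 = (np.asarray(p, dtype=float) for p in (a_start, a_end, b_start, b_end))
    u, v = a1 - a0, b1 - b0  # axial vectors
    cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    mid_dist = np.linalg.norm((a0 + a1) / 2 - (b0 + b1) / 2)
    normal = np.cross(u, v)
    w = b0 - a0
    if np.linalg.norm(normal) < 1e-9:
        # Parallel axes: distance from a point on one line to the other line.
        line_dist = np.linalg.norm(np.cross(w, u)) / np.linalg.norm(u)
    else:
        # Skew or intersecting axes: project w onto the common normal.
        line_dist = abs(np.dot(w, normal)) / np.linalg.norm(normal)
    return (type_a, type_b), float(angle), float(mid_dist), float(line_dist)


fp = footprint("helix", (0, 0, 0), (2, 0, 0), "strand", (0, 0, 3), (0, 2, 3))
print(fp)
```

All four components are invariant under rigid motion of the pair, which is what allows bases from different molecules to be compared without superposing them first.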

Then, similar bases are put into a Base Bucket (BB) (see figure 3.3), and each column of the BB contains the bases from the same molecule. The organization of the BB is like a Cut in Multiprot's 2-dimensional plot; the difference is that the x-axis of the BB represents the number of molecules participating in the alignment and the y-axis contains the buckets that store all similar bases. Therefore, both face the same issue: an exponential number of alignment choices. To solve this problem, MASS progressively adds a base from each new molecule by selecting the one that most increases a pre-defined multiple alignment score, $k\binom{l}{2}$, where $k$ is the current number of molecules and $l$ is the size of the aligned core. Similar to Multiprot, MASS iteratively picks a base as the pivot to alleviate the local minimum issue.

A significant contribution of MASS is its two-level data modeling strategy. MASS is one of the first methods to use a protein's secondary structure elements as the input format for detecting substructure similarities. Compared with the 200 to 300 residues of an average protein molecule, there are only about 15 SSEs in an average molecular structure. SSEs are therefore a good filter for pruning unnecessary alignment effort on the massive 3D residue position data.

Furthermore, SSEs provide an opportunity to align spatial data directly without the limitation of sequence order. The base construction (figure 3.2) indicates that a spatial core can be detected and quantified by purely geometric measurements.

To perform multiple structural alignment on top of a set of bases, one from each molecule (such as the bases linked by colored paths in figure 3.3), the data format for comparing similarities among bases is switched at this step to the positions of the Cα atoms. Since an SSE filter has been applied beforehand, the amount of such expensive distance similarity calculation is greatly reduced.

We observe that Multiprot and MASS adopt similar substructure clustering methods, Cuts in Multiprot and the Base Bucket in MASS, to reduce the number of similarity comparisons. The number of possible alignments can still be exponential, and the heuristic solutions are unfortunately too naive to prevent the local minimum issue. This difficulty is unavoidable once a progressive method is involved. For example, given a set of N molecules, the sub-alignment is always determined at the current ith molecule and ignores the possibly better choices that arise when the remaining N − i molecules join. We will address this issue in our framework in a later chapter with a novel solution.

POSA is an evolution of FATCAT [70, 105, 106] that uses a POG to align multiple molecules.

First, fixed-size AFPs (8 residues per fragment) are collected from all molecules with the general procedure used in [16].

Secondly, a POG is constructed in the order provided by a previously built guide tree, and each input molecule sequence is converted into a sequence of fixed-size fragments extracted from the AFP collections, e.g., a molecule structure $M_i = \{f_i^1, f_i^2, \ldots, f_i^j, \ldots\}$, where $f_i^j$ is the $j$th fragment in $M_i$. On top of this 1-dimensional data format, a score function based on the AFP similarity score is applied during the POG dynamic programming. In particular, the fragment representation ignores the spatial relationship among AFPs. Consequently, it naturally allows flexible alignment. For example, the connecting region between $f_i^j$ and $f_i^{j+1}$ can be twisted to satisfy a consistent transformation and rotation.

Although POSA states that it progressively aligns input structures in the order given by the guide tree, the POG provides all the characteristics of simultaneous alignment. Namely, tracing back through the POG enables global alignment; partial order nodes make subset alignments detectable; and the fragment representation allows flexible alignment.

Our framework also takes advantage of the POG. From POSA, we learn that the fragment representation used in the POG has a two-sided effect on the alignment. The advantage is its computational efficiency, since a 1-dimensional sequence and a score function similar to a substitution matrix are simple to work with. The limitation is its insensitivity to small regional adjustments such as expanding or shifting matched fragments. Instead, our framework adopts the 3D positions of individual residues as the input data of the POG, and the POG is constructed on top of geometric distance comparison, which relaxes the restriction to fragments and may find better options during multiple alignment.

Another interesting property of POSA is flexible alignment. It applies in situations where structural rearrangements make purely rigid structure alignment unable to detect similar functional cores, even in homologous proteins. There are two kinds of flexible alignment. One twists the connecting region between two potentially aligned fragments along the molecule sequence, as POSA does. The other breaks the structural conformation globally to align substructures in 3D space. Note that our framework falls into the second kind; a more detailed analysis will be given in a later chapter.

In this section, we described two major categories of multiple structural alignment methods: progressive and simultaneous alignment. Progressive alignment is extended from pair-wise alignment; it is usually intuitive and easy to implement but has an unavoidable shortcoming. Simultaneous alignment, on the other hand, collects common substructures and aligns whole structures on top of a set of selected substructures. It may be complex but has its own natural advantages. Since our research belongs to the latter category, we introduced a few related methods in detail. In particular, we are interested in some unique characteristics of each of them, such as the pivot technique in Multiprot and MASS, the bio-core filter in Multiprot, the secondary structure element bases in MASS, and the POG technique and flexible alignment concept in POSA. However, there is still considerable room for improvement. Since our research is more recent than any of them, we have the chance to weigh all of these advantages and disadvantages together to create a better framework.

3.4 Measurement of Structural Alignment Quality

To date, there are hundreds of tools and algorithms in the area of protein structural alignment, and correspondingly many scoring schemes, differing significantly, for evaluating structural alignments. However, they share two basic goals: to find the maximum alignment core and to minimize the geometric distance between superposed molecules. The usefulness of a method is judged by the size of the alignment core that it can detect, while its quality is determined by the geometric distance measurement.

Among the various measures of similarity or deviation between aligned structures, scientists prefer the Root-Mean-Square-Distance (RMSD) measurement for assessing structural alignments. In protein alignments in particular, a molecular structure is treated as a 3-dimensional curve in which the center of each residue is a vertex of the curve.

Protein folding simulations, on the other hand, treat a folding motion as a high dimensional curve in time order, if we consider each intermediate conformation as a high dimensional vertex. Since alignment focuses on the similarity of structures or conformations without considering the edges of the curve, RMSD and its variants are well suited to the situation. Computing RMSD is straightforward and takes O(N) time, where N is the total number of correspondences.

In all the alignment methods mentioned in the previous section, the most intuitive way to measure similarity is to put the structures on top of each other so that the equivalent elements come as close as possible. The RMSD can then be used to quantify the similarity and to score the equivalence. This is called superposition of structures, and if the geometry of the structures is not changed in the process, it is referred to as rigid-body superposition. Algorithms exist for superposing structure A on structure B by finding the superposition (a translation of 3 distances and a rotation of 3 angles) that minimizes the RMSD.

The most commonly used one is the coordinate distance root mean squared (cRMS) deviation [107, 108], which is based on comparing inter-set point distance matrices.

For example, for the matrix of distances between all corresponding points of two structures, provided that they have been superposed, it is defined as:

$$\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert p_i - T q_i \rVert^2} \tag{3.4.1}$$

where $p_i \in P$ and $q_i \in Q$ are the amino acids (points) in proteins P and Q respectively, and T is a rigid transformation. To extend cRMS to multiple structure alignment, two measures have been developed. First, the core RMSD (cRMSD) [109] is defined as:

$$\mathrm{cRMSD} = \frac{2}{n(n-1)} \sum_{i} \sum_{j} \mathrm{cRMS}_{i,j} \tag{3.4.2}$$
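Equations (3.4.1) and (3.4.2) can be sketched with NumPy as follows. The minimization over the rigid transformation T is carried out here with the Kabsch procedure (centering followed by a singular value decomposition), which is one standard way to solve it and is an assumption of this sketch rather than the method prescribed in [107, 108]. In `crmsd`, the sum is assumed to run over unordered pairs of structures, with n the number of structures, so that the factor 2/(n(n − 1)) yields an average. The function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def crms(P, Q):
    """cRMS between corresponding point sets P and Q after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)  # the optimal translation aligns the centroids
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    # Force a proper rotation (determinant +1), i.e. exclude reflections.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    Q_fit = Qc @ R.T  # rotate every q_i onto the frame of P
    return float(np.sqrt(np.mean(np.sum((Pc - Q_fit) ** 2, axis=1))))


def crmsd(structures):
    """Average pair-wise cRMS over a list of structures with corresponding points."""
    n = len(structures)
    total = sum(crms(a, b) for a, b in combinations(structures, 2))
    return 2.0 * total / (n * (n - 1))


Q = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)  # 90 degrees about z
P = Q @ Rz.T + np.array([5.0, -2.0, 1.0])  # rotated and translated copy of Q
print(crms(P, Q), crmsd([P, Q, Q]))
```

A rotated and translated copy of a structure gives a cRMS of zero up to rounding, which is the rigid-body invariance that the superposition is meant to provide.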

cRMS requires that the target molecules be superposed before measurement, which may be inefficient or even unrealistic for some intermediate alignment stages. The distance root mean squared (dRMS) deviation, on the other hand, measures the difference between the distance matrices of the participating molecules:

$$\mathrm{dRMS} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left(d_{ij}^{A} - d_{ij}^{B}\right)^2} \tag{3.4.4}$$

where $d_{ij}^{A}$ and $d_{ij}^{B}$ are the distances between residues i and j in molecules A and B respectively. Like cRMS, dRMS can also be extended to multiple alignment.

In summary, RMSD is by no means the only way to score similarity, and there is no consensus on what the best method is, but RMSD does have the advantage of being computationally convenient. Beyond these basic mathematical measures of geometric similarity, some molecular alignment methods adopt advanced scoring schemes [110, 111, 112] that balance the size of the alignment core against the RMSD of the aligned residues among molecules.
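The dRMS measure of equation (3.4.4) needs no superposition, as the following minimal NumPy sketch shows. It follows the normalization 1/(n(n − 1)) exactly as written in the equation; the function names `distance_matrix` and `drms` and the toy coordinates are hypothetical.

```python
import numpy as np


def distance_matrix(X):
    """All pair-wise Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def drms(A, B):
    """dRMS between two structures with the same number of corresponding residues."""
    dA, dB = distance_matrix(A), distance_matrix(B)
    n = len(dA)
    upper = np.triu_indices(n, k=1)  # index pairs with i < j
    return float(np.sqrt(((dA[upper] - dB[upper]) ** 2).sum() / (n * (n - 1))))


A = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
B = [[0, 0, 0], [2, 0, 0], [0, 1, 0]]
print(drms(A, A), drms(A, B))
```

Since only internal distances enter the formula, the value is unchanged when either structure is rotated or translated.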

Consider a pair-wise alignment of two molecules of lengths $L_1$ and $L_2$ that have been optimally superposed on top of each other. There are four popular geometric measures: the similarity index (SI) [113], the match index (MI) [113], the structural alignment score (SAS) [114] and the gapped SAS (GSAS), listed below. Note that these scoring schemes extend naturally to multiple alignment.

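$$\mathrm{SI} = \frac{\mathrm{cRMS} \times \min(L_1, L_2)}{N_{mat}} \tag{3.4.5}$$

$$\mathrm{MI} = 1 - \frac{1 + N_{mat}}{(1 + \mathrm{cRMS}/\omega_0)(1 + \min(L_1, L_2))} \tag{3.4.6}$$

$$\mathrm{SAS} = \frac{\mathrm{cRMS} \times 100}{N_{mat}} \tag{3.4.7}$$

$$\mathrm{GSAS} = \begin{cases} \dfrac{\mathrm{cRMS} \times 100}{N_{mat} - N_{gap}} & \text{if } N_{mat} > N_{gap} \\ 99.9 & \text{otherwise} \end{cases} \tag{3.4.8}$$

where $N_{mat}$ is the size of the aligned core and $N_{gap}$ is the total number of gap openings in both molecules.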
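The four measures of equations (3.4.5) to (3.4.8) translate directly into code. The sketch below is a plain transcription; the function names are hypothetical, and the default value of `w0` (1.5, in the same length unit as cRMS) is an assumption, since the value of $\omega_0$ is not specified in this section.

```python
def similarity_index(crms, n_mat, l1, l2):
    """SI, equation (3.4.5): lower is better."""
    return crms * min(l1, l2) / n_mat


def match_index(crms, n_mat, l1, l2, w0=1.5):
    """MI, equation (3.4.6); w0 is an assumed normalizing constant."""
    return 1.0 - (1.0 + n_mat) / ((1.0 + crms / w0) * (1.0 + min(l1, l2)))


def sas(crms, n_mat):
    """SAS, equation (3.4.7)."""
    return crms * 100.0 / n_mat


def gsas(crms, n_mat, n_gap):
    """GSAS, equation (3.4.8), with the fixed penalty 99.9 when gaps dominate."""
    if n_mat > n_gap:
        return crms * 100.0 / (n_mat - n_gap)
    return 99.9


# Toy alignment: cRMS of 2.0, 80 matched residues, lengths 100 and 120, 5 gap openings.
print(similarity_index(2.0, 80, 100, 120))
print(match_index(2.0, 80, 100, 120))
print(sas(2.0, 80), gsas(2.0, 80, 5))
```

All four scores reward a larger aligned core and a smaller cRMS at the same time, which is the balance between core size and geometric deviation discussed above.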

The above measures, which combine both alignment core size and RMSD, provide common standards for evaluating the quality of structure comparison and alignment methods. The performance of different alignment methods is ultimately determined by the underlying alignment strategies. For instance, MASS works on top of SSEs and usually achieves good RMSD values but a limited alignment core size, even when a core expansion stage is applied. POSA's flexibility, on the other hand, allows bending between aligned fragments, which explicitly enlarges the alignment core. In our framework, we try to make maximal use of the user's pre-defined RMSD threshold to achieve a better alignment core size.

CHAPTER 4

EPO: ENHANCED PARTIAL ORDER CURVE COMPARISON

4.1 Introduction

4.1.1 Overview

Proteins are the main agents in cells. Understanding how they function is essential to understanding life at the molecular level. From a chemical point of view, a protein molecule is a linear sequence of amino acids. Under appropriate physicochemical conditions, this linear sequence rapidly folds into a unique native structure (a demo is shown in figure 2.6). Understanding the folding process is of paramount importance, especially since its outcome, namely the three dimensional protein structure, to a large extent decides the functionality of the molecule. Hence a lot of research has been devoted to investigating the kinetics of protein folding. In particular, modern (parallel) computation power makes it possible to perform large-scale folding simulations. As a result, interpreting the huge amount of simulation data becomes a crucial issue.

Previously, analysis of folding simulations was performed mainly to test various protein folding models [20, 21, 22], such as the folding pathway model and the funnel model, and/or to study energetic aspects of folding kinetics [23, 24, 25, 26]. The geometric shapes of the conformations involved in folding trajectories have not been widely explored [27, 28, 29] despite their important role in folding. A particularly interesting work in this direction is by Ota et al. [29], who investigated the folding trajectories of the mini-protein Trp-cage using a phylogenetic tree combined with expert knowledge. In general, however, an automatic tool to facilitate the analysis of folding simulations at large scale is still missing. This chapter provides an important step towards this goal by modeling folding trajectories as curves and using a new multiple curve comparison (MCC) algorithm to detect critical folding events.

In this chapter, we model each protein folding trajectory as a multi-dimensional curve on which each vertex is an intermediate configuration, and then present a novel MCC algorithm, called the enhanced partial order (EPO) algorithm, to identify critical information from a set of diverse folding trajectory curves in an automatic manner. The EPO algorithm addresses several new challenges presented by comparing high dimensional curves coming from folding trajectories. A detailed case study on the mini-protein Trp-cage [115] demonstrates that our algorithm can detect similarities at a rather low level and extract biologically meaningful folding events.

4.1.2 Challenges and goals

When translated from a sequence of mRNA into a linear chain of amino acids, a given protein p initially exists as an unfolded polypeptide or random coil. The amino acids that make up p then quickly interact with each other to produce a well-defined three-dimensional structure, known as the native state, on a nanosecond scale. A single run simulating the physical motion of protein folding generates a sequence of snapshots of intermediate conformations, also called a trajectory. Because of the highly stochastic character of protein folding motion, the pathways or trajectories of the simulation runs are diverse. The study of protein folding usually relies on an ensemble of folding simulations including both successful and unsuccessful runs, that is, trajectories that do or do not include a sequence of conformations leading to a near-native conformation. Given such a diverse data set, scientists wish to answer questions such as: what causes the folding process to end in different results, and what common properties are shared by the successful runs but not by the unsuccessful ones?

To this end, it is highly desirable to be able to compare multiple folding trajectories and extract useful information from them.

The challenges in comparing and aligning protein folding trajectories lie in three issues.

The first is the noise in the simulation output. Because of the highly stochastic nature of protein folding motion, a folding trajectory may not only lead to the native state but may also get stuck at certain random states or even unfold from a sub-native state at some point. A single scan of a trajectory cannot extract the critical features of the folding process.

The second is the high dimensional data format. The distance map representation of a protein p with L amino acids forms an $L^2$-dimensional vertex on the folding trajectory. Compared with methods that project folding conformations into a low dimensional space, the high dimensional data provide enough resolution to trace the detailed dynamics of the folding motion, but they increase the complexity and computational cost dramatically.

The third and most critical one is the complexity of multiple-curve comparison.

In computational biology, the MSTA problem is the closest relative of the MCC problem. To align a family of protein structures in MSTA, each structure is modeled as a three dimensional polygonal curve representing its backbone. MCC in protein folding, on the other hand, targets a set of simulation trajectories, each composed of a sequence of conformations that serve as high dimensional vertices. MSTA, as introduced in the previous chapter, is a very hard problem. In fact, even the pairwise problem of aligning two structures A and B is believed to be NP-hard, since one has to optimize simultaneously both the correspondence between A and B and the relative transformation of one structure with respect to the other. Numerous heuristic-based algorithms have been developed in practice for this fundamental problem, such as [11, 12, 13, 14, 15, 16, 17], which we introduced earlier. A similar issue occurs in the MCC approach to protein folding. If we have a set of k > 0 trajectories, then even the problem of aligning them optimally without considering transformations becomes intractable: it takes $\Omega(n^k)$ time using the standard dynamic programming algorithm, where n is the length of each trajectory involved. In practice, progressive methods are widely used to tackle the MSTA problem [116]. For example, given a set of structures, many approaches start with a seed structure and then progressively align the remaining structures onto it one by one [117, 97, 118, 119, 120, 121, 54].

A consensus or core structure is typically built throughout to maintain the common substructures among the proteins that are already aligned. At each round, usually only a pairwise structure comparison is performed to align the current consensus with a new structure. However, progressive methods cause the critical issues described earlier.

The EPO algorithm (figure 4.1), which is built on the concept of the partial order graph, adopts a novel approach to address these three challenges. First, a preprocessing stage that includes both intra-curve and inter-curve clustering controls the length of each trajectory and prunes outlier (noise) points from the compared trajectories.

Secondly, without building a consensus structure, and with the comparison of the whole data set in view, a POG constructed with a novel two-level scoring function effectively holds all the alignment information from each progressively joined curve. Finally, a merging stage after the POG construction greatly improves alignment quality in several respects, especially in its sensitivity in detecting low levels of similarity and its capability of handling high dimensional curves. Applying EPO to the folding trajectories of the mini-protein Trp-cage [115] shows that, even at low similarity, it is able to automatically detect critical folding events that were observed earlier [29] by biological methods.

The final goal of our algorithm is to extract lists of ordered events common to successful runs but not to unsuccessful ones. One such ordered event could be the discovery that a conformation B is always formed after A and followed by a conformation C before a successful folding conformation is reached. (Conformations A, B, and C may not be consecutive.) In addition, our EPO algorithm is general; we demonstrate its generality and effectiveness in this chapter with protein folding data, and later we also apply it to aligning multiple protein structures.

Figure 4.1: EPO flow chart.

4.2 Methods

4.2.1 Input data modeling

In this section, we describe our EPO algorithm for comparing a set of general high dimensional curves. Given a set of protein folding data, we first convert each folding trajectory to a high dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing different states of this protein at different time steps during the simulation of its folding process. We represent each conformation using the distance map between its alpha-carbon atoms so that it is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is an n × n matrix M where M[i][j] equals the distance between the ith and jth alpha-carbon atoms along the protein backbone.

This matrix can then be considered as a point in $n^2$ dimensions. In this way, we map each trajectory of m conformations to a curve in $\mathbb{R}^{n^2}$ with m vertices. We remark that one can also encode the side-chain information into the high dimensional curves, or map the trajectory of a substructure into a high dimensional curve. We will use such more refined high dimensional curves in most of our experiments as well. In the remaining part of this thesis, we use the terms trajectories and curves interchangeably.
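The mapping from a trajectory to a curve in $n^2$ dimensions can be sketched as follows. This is a minimal illustration with NumPy that assumes each conformation is given as an n × 3 array of alpha-carbon coordinates; the function names and the toy coordinates are hypothetical.

```python
import numpy as np


def distance_map_vector(ca_coords):
    """Flatten the n x n alpha-carbon distance map of one conformation into n*n numbers."""
    X = np.asarray(ca_coords, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).ravel()


def trajectory_to_curve(conformations):
    """Map a trajectory of m conformations to an m x n^2 array, one row per curve vertex."""
    return np.stack([distance_map_vector(c) for c in conformations])


conf = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = conf @ Rz.T + np.array([4.0, 2.0, -1.0])  # same shape, rotated and shifted
curve = trajectory_to_curve([conf, moved])
print(curve.shape, np.allclose(curve[0], curve[1]))
```

The two rows are identical although the second conformation was rotated and translated, which illustrates why no transformation needs to be optimized when such curves are compared.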

4.2.2 Notations and Algorithm Overview

Before we formally define the MCC problem, we introduce some necessary notations.

Definition 4.2.1. Given a set of elements $V = \{v_1, \ldots, v_l\}$, a relation $\prec$ over V is transitive if $v_i \prec v_j$ and $v_j \prec v_k$ imply that $v_i \prec v_k$. In this thesis, we also refer to $v_i \prec v_j$ as a partial order constraint. A partial order graph (POG) G = (V, E) is a directed acyclic graph with $V = \{v_1, \ldots, v_l\}$, where $v_i \prec v_j$ if there is an edge $(v_i, v_j)$. Note that by the transitivity of this relation, two nodes may have a partial order constraint even when there is no edge between them in G. Let R be the set of partial order constraints induced by G. We say that a permutation Π(V) of V is a partial order list w.r.t. G if for any $v_i \prec v_j \in R$, we have that $v_i$ appears before $v_j$ in the permutation Π(V). In other words, the linear order in Π(V) is a total order satisfying all partial order constraints induced from G. See figure 4.2 for an example.

Figure 4.2: A POG G of 5 nodes. Note that there is a partial order constraint a ≺ d even though there is no edge between them. Both ⟨a, b, c, d, e⟩ and ⟨a, c, b, d, e⟩ are valid partial order lists w.r.t. G.
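Definition 4.2.1 can be checked mechanically, as in the following Python sketch. The edge list `EDGES` is a hypothetical 5-node graph chosen only to be consistent with the caption of figure 4.2 (the actual edges of the figure are not reproduced here), and the function names are likewise assumptions. Checking the edges alone is sufficient, because every induced constraint follows from a chain of edges.

```python
def precedes(edges, u, v):
    """True if u precedes v, i.e. v is reachable from u along directed edges."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
    stack, seen = [u], set()
    while stack:
        x = stack.pop()
        for y in successors.get(x, []):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def is_partial_order_list(edges, order):
    """True if the permutation `order` respects every edge, hence every induced constraint."""
    position = {node: k for k, node in enumerate(order)}
    if len(position) != len(order):
        return False  # repeated node, not a permutation
    return all(a in position and b in position and position[a] < position[b]
               for a, b in edges)


# Hypothetical edges, consistent with the caption of figure 4.2.
EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(precedes(EDGES, "a", "d"), ("a", "d") in EDGES)
print(is_partial_order_list(EDGES, list("abcde")), is_partial_order_list(EDGES, list("acbde")))
print(is_partial_order_list(EDGES, list("adbce")))
```

Nodes b and c are not ordered with respect to each other in this graph, which is why both lists from the caption are accepted while a list that places d before b is rejected.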

Let $\mathcal{T} = \{T_1, \ldots, T_N\}$ be a set of N trajectories in $\mathbb{R}^d$, where each trajectory $T_i$ is an ordered sequence of n points $p_1^i, \ldots, p_n^i$. (For simplicity, we assume without loss of generality that all $T_i$s have the same length n.) The goal of the MCC algorithm is to find aligned sub-sequences from $\mathcal{T}$.

More formally, an aligned node o is a collection of vertices from the $T_i$s, with at most one point from each $T_i$. Given a 3-tuple $(\mathcal{T}, \tau, \varepsilon)$, where τ and ε are input thresholds, an alignment of $\mathcal{T}$ is a POG G with the corresponding set of partial order constraints R and a partial order list of aligned nodes $O = \{o_1, \ldots, o_L\}$ such that the following three criteria are satisfied:

C1. $|o_k| \geq \tau$, for any $k \in [1, L]$;

C2. for any $p_j^i, p_{j'}^{i'} \in o_k$, $\lVert p_j^i - p_{j'}^{i'} \rVert \leq \varepsilon$;

C3. if $p_j^i \in o_{k_1}$ and $p_{j'}^i \in o_{k_2}$ with $o_{k_1} \prec o_{k_2}$, then $j < j'$.

(C1) indicates that the number of vertices of input curves aligned to each aligned node $o_k$ is at least a size threshold τ. (C2) means that these aligned points are tightly clustered together (i.e., their diameter is bounded by a distance threshold ε). (C3) enforces that points in different aligned nodes still maintain their order along their respective trajectories, which means that the order of the $o_k$s is inherited from, and thus consistent with, the order of the points in each trajectory. Our goal is to maximize L, the size of such an alignment O. See figure 4.3b for an example of an alignment graph.
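The three criteria can be verified for a candidate alignment with a short Python sketch. The data layout is an assumption made for illustration: an aligned node is a dictionary from trajectory ID to point index (so it holds at most one point per trajectory), the curves are NumPy arrays, and the partial order constraints are given as pairs of node positions. The function names and the toy one-dimensional curves are hypothetical.

```python
from itertools import combinations

import numpy as np


def satisfies_c1_c2(node, curves, tau, eps):
    """C1: at least tau points in the node. C2: all pair-wise distances at most eps."""
    if len(node) < tau:
        return False
    points = [curves[t][j] for t, j in node.items()]
    return all(np.linalg.norm(p - q) <= eps for p, q in combinations(points, 2))


def satisfies_c3(nodes, constraints):
    """C3: for every constraint k1 before k2, shared trajectories keep increasing indices."""
    for k1, k2 in constraints:
        for t, j in nodes[k1].items():
            if t in nodes[k2] and not j < nodes[k2][t]:
                return False
    return True


# Toy curves in one dimension (made-up values), keyed by trajectory ID.
curves = {
    1: np.array([[0.0], [1.0], [2.0]]),
    2: np.array([[0.1], [1.1], [5.0]]),
    3: np.array([[0.05], [3.0], [2.1]]),
}
o1 = {1: 0, 2: 0, 3: 0}  # first point of each trajectory
o2 = {1: 2, 3: 2}        # third point of trajectories 1 and 3
print(satisfies_c1_c2(o1, curves, tau=3, eps=0.2), satisfies_c1_c2(o2, curves, tau=3, eps=0.2))
print(satisfies_c3([o1, o2], {(0, 1)}), satisfies_c3([o1, o2], {(1, 0)}))
```

Node `o2` fails C1 for τ = 3 but would be accepted for τ = 2, which shows how the size threshold controls whether alignments among a subset of the trajectories are kept.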

(a) Linear Alignment (b) Partial Order Alignment

Figure 4.3: Aligning five trajectories (IDs 1 to 5) using (a) a linear graph, and (b) a partial order graph. Symbols in the circles are the node IDs and numbers on edges are trajectory IDs. Note that the linear alignment in (a) will not be able to record the partial similarity between curves 3 and 4, which is maintained in (b) (i.e, node d).

So far, we have formally defined the POG and stated the target of our algorithm. Below, we introduce the EPO algorithm, which captures similarities and dissimilarities among a set of input folding trajectories.

At a high level, the EPO algorithm has two stages (see figure 4.4): (S1) an initial POG construction stage and (S2) a merging stage. The first stage generates an initial alignment for $\mathcal{T}$, encoded in a POG G. The procedure has the same framework as the POA algorithm but uses an entirely different input data model. Unlike the structural fragments used in POA, which are 1-dimensional data, EPO adopts high dimensional points as the input data of the POG. Its performance, especially when the similarity is low, is significantly improved through the use of a clustering preprocessing step and a new two-level scoring function. In the second stage, we develop a novel and effective procedure to merge nodes of G and output a better final alignment G∗.

Below, we describe each stage in detail.

(a) Initial POG (b) POG before merging (c) POG after merging

Figure 4.4: Symbols inside the circles are the aligned node IDs. The table associated with each node encodes the set of points aligned to it. In particular, each row represents a point with its trajectory ID (T column) and its index along the trajectory (S column). For example, the entry (1, 2) associated with node b in (a) means that the aligned node b currently includes the point $p_2^1$, the second point from trajectory 1. In (a), a POG is initialized by the trajectory $T_1$. An example of a POG after aligning a few trajectories is shown in (b). Note that a new node/branch is created when a point cannot be aligned to any existing node. For example, node e was created when $p_3^2$ (i.e., the 3rd point of $T_2$) was inserted. (c) shows the POG after merging point $p_2^1$ from node b into node e, constrained by the distance threshold ε.

4.2.3 Initial POG Construction

Standard dynamic programming (DP) [36, 37] is an effective method for pairwise comparison between sequences. It produces an optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequence alignment progressively based on this DP pairwise comparison method. Roughly speaking, in the ith round of the algorithm, the alignment of the first i − 1 sequences is represented in a consensus sequence. The algorithm then updates this consensus by aligning it with the ith sequence $S_i$ using the standard DP algorithm. Information from $S_i$ that is not aligned to the consensus sequence is essentially lost (see figure 4.3a).
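The pairwise DP step can be illustrated on two sequences of points. The sketch below uses a deliberately simple scoring function, one unit for every pair of points within distance ε and no gap penalty, which is an assumption made here for illustration and is not the two-level scoring function of EPO. The function name `pairwise_dp` and the toy data are hypothetical.

```python
import numpy as np


def pairwise_dp(A, B, eps):
    """Align two point sequences, maximizing the number of order-preserving matches within eps."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n, m = len(A), len(B)
    close = [[np.linalg.norm(A[i] - B[j]) <= eps for j in range(m)] for i in range(n)]
    score = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1, j - 1] + int(close[i - 1][j - 1])
            score[i, j] = max(match, score[i - 1, j], score[i, j - 1])
    # Trace back from the bottom-right corner to recover the matched index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if close[i - 1][j - 1] and score[i, j] == score[i - 1, j - 1] + 1:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i - 1, j] >= score[i, j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


A = [[0.0], [1.0], [2.0], [3.0]]
B = [[0.1], [2.1], [3.05]]
print(pairwise_dp(A, B, eps=0.2))
```

The second point of `A` has no partner in `B` and is simply skipped. In a progressive scheme with a linear consensus, this kind of unmatched information is what gets lost from round to round.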

The partial order alignment (POA) algorithm [53] alleviates this problem by encoding the consensus in a POG instead of a linear sequence (see figure 4.3b). That is, the alignment of S_1, ..., S_{i−1} is encoded in a partial order graph G_i, which is then updated to G_{i+1} by aligning it with S_i. Due to the partial order in a POG, the alignment between G_i and S_i can still be computed by a DP algorithm. The POA algorithm reduces the influence of the order in which the sequences are aligned, and is able to capture alignments among a subset of the sequences. More details of the POA algorithm and its variants can be found in [52, 53].

In our case, each trajectory is mapped to an ordered sequence of points (i.e., a polygonal curve), and a similar algorithm can be applied to our trajectory data: instead of the usual 1D sequences, we now have d-dimensional sequences, where d is the dimension of each point. Note that since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves. The first stage of our EPO algorithm constructs a POG G with respect to the input set of trajectories T using a modified POA algorithm. Below we explain the main differences between the two algorithms.

4.2.3.1 A Clustering Preprocessing Stage

The first problem with the standard POA algorithm is that the size of the POG grows quickly when the level of similarity is low. For example, suppose we are updating the current POG G_i to G_{i+1} by aligning it with a new curve T_i. If a point p ∈ T_i cannot be aligned to any node in G_i, then it creates a new node in G_{i+1}, as this node may potentially be aligned later with the remaining curves. Consequently, if the similarity is sparse, many new nodes are created that never become densely aligned, and the size of the POG increases rapidly. This induces high computational complexity.

To address this problem, our algorithm preprocesses the input curves in two different ways: within each curve (intra-curve) and across curves (inter-curve).

First, we observe that protein folding does not proceed at a constant velocity, and some consecutive conformations are kinetically silent within a folding trajectory. We therefore scan each trajectory T_i and group similar consecutive conformations into a consensus one using a predefined threshold γ. For example, if the pairwise RMSD distances of the consecutive points p_m^i, ..., p_n^i ∈ T_i are all within γ, we replace them with a single representative point, the median one. Note that the number of points compressed varies, and usually more points are compressed close to the native state of a trajectory.
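The intra-curve compression can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the text: points are plain coordinate tuples, the Euclidean distance stands in for the RMSD, "within γ" is read as "at most γ", runs are formed greedily from left to right, and the "median" point is taken to be the middle element of each run. The names `dist` and `compress_trajectory` are hypothetical.

```python
import math


def dist(a, b):
    # Euclidean distance; a stand-in (assumption) for the RMSD between conformations.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compress_trajectory(points, gamma):
    """Replace each run of mutually close consecutive points by its middle point."""
    representatives = []
    run = []  # current group of consecutive points, pairwise within gamma
    for p in points:
        if run and any(dist(p, q) > gamma for q in run):
            representatives.append(run[len(run) // 2])  # close the current run
            run = []
        run.append(p)
    if run:
        representatives.append(run[len(run) // 2])
    return representatives


# Toy 1D trajectory (made-up values).
trajectory = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (9.0,)]
compressed = compress_trajectory(trajectory, gamma=0.5)
```

On this toy input the six points collapse into three representatives, one per kinetically silent stretch.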

Second, we preprocess all points from the input curves T by clustering them into groups [122] whose diameter is smaller than a user-defined threshold, fixed to the distance threshold ε in our experiments. Based on the clustering result, we keep only those points that belong to a cluster containing points from at least τ curves. For example, if τ = 3 (i.e., we require each aligned node to align points from at least 3 curves), we prune the points in clusters that cover fewer than 3 curves. Meanwhile, we collect the cluster centers in C = {c_1, ..., c_r}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes.

If, in the process of aligning T_i with G_i, a point p ∈ T_i is not aligned to any node in G_i, then we insert a new node in G_{i+1} only if p is within distance ε of some canonical center in C. If p is far from all the canonical cluster centers, then there is little chance that p can form a significant alignment with points from later curves, as that would have implied that p belongs to a dense cluster. We remark that this set of canonical cluster centers is used not only to shrink the size of the POG, but also as a predictor in the new two-level scoring function that we introduce shortly. Pruning unpromising points brings a further advantage in the second, merging stage of the EPO algorithm.
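A rough sketch of the inter-curve preprocessing follows. The clustering method of [122] is not described here, so the sketch swaps in a simple greedy "leader" clustering as an assumption: a point joins the first cluster whose leader lies within ε/2, which keeps every cluster diameter at most ε by the triangle inequality. The leader is used as the cluster center, and the function names are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canonical_centers(curves, eps, tau):
    """Cluster all points, then keep clusters that cover at least tau curves.

    curves: list of trajectories, each a list of coordinate tuples.
    Returns (centers, kept), where kept lists (curve id, index) pairs.
    """
    clusters = []  # each cluster: {"leader": point, "members": [(cid, idx)]}
    for cid, curve in enumerate(curves):
        for idx, p in enumerate(curve):
            for c in clusters:
                if dist(p, c["leader"]) <= eps / 2:  # assumption: leader clustering
                    c["members"].append((cid, idx))
                    break
            else:
                clusters.append({"leader": p, "members": [(cid, idx)]})
    centers, kept = [], []
    for c in clusters:
        if len({cid for cid, _ in c["members"]}) >= tau:
            centers.append(c["leader"])
            kept.extend(c["members"])
    return centers, kept


def near_canonical_center(p, centers, eps):
    """Node-creation test: True if p lies within eps of some canonical center."""
    return any(dist(p, c) <= eps for c in centers)


# Three toy 2D curves (made-up values).
curves = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(0.1, 0.0), (20.0, 0.0)],
    [(0.0, 0.1), (10.1, 0.0)],
]
centers, kept = canonical_centers(curves, eps=1.0, tau=3)
```

Only the cluster around the origin covers all three curves, so it provides the single canonical center; a point near (10, 0) would not be allowed to create a new node.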

4.2.3.2 Scoring Function

The choice of the scoring function used when aligning G_i = (V_i, E_i) with T_i is in general a crucial aspect of an alignment algorithm. A good scoring function aligns as many points as possible globally. Given a point p ∈ T_i and a node o ∈ G_i, let δ(o, p) be the similarity between p and o, whose definition is given shortly. The score of aligning p with o is usually defined as:

\[ \mathrm{Score}(o, p) = \max\Big\{ \max_{(o', o) \in E_i} \big( \mathrm{Score}(o', q) + \delta(o, p) \big),\ \max_{(o', o) \in E_i} \mathrm{Score}(o', p),\ \mathrm{Score}(o, q) \Big\} \tag{4.2.1} \]

where q is the parent of the point p along T_i, and o' ranges over all immediate predecessors of o in the POG G_i. It is easy to verify that such scores can be computed by a dynamic programming procedure due to the inherent order existing in both the trajectory and the POG.
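To make the recurrence of Equation 4.2.1 concrete, here is a minimal sketch of the table-filling order. It is an illustration under stated assumptions, not the implementation used in the experiments: the POG is given as a topologically ordered node list plus predecessor lists, missing entries (before the first point, or for nodes without predecessors) are treated as 0, and the similarity δ is passed in as a function. All names are hypothetical.

```python
def pog_dp_scores(topo_nodes, preds, traj, delta):
    """Fill the table Score(o, p) of Equation 4.2.1.

    topo_nodes: node ids of the POG in topological order.
    preds: dict mapping a node id to the list of its immediate predecessors.
    traj: list of points of the trajectory being aligned.
    delta: function (node id, point) -> similarity.
    """
    score = {}

    def get(o, j):
        # Assumption: entries before the first point or node count as 0.
        return score.get((o, j), 0.0)

    for j, p in enumerate(traj):
        for o in topo_nodes:
            before = preds.get(o, [])
            align = max((get(op, j - 1) for op in before), default=0.0) + delta(o, p)
            skip_node = max((get(op, j) for op in before), default=0.0)
            skip_point = get(o, j - 1)
            score[(o, j)] = max(align, skip_node, skip_point)
    return score


# Toy 1D example: node centers and a 0/1 similarity (made-up values).
node_centers = {"a": 0.0, "b": 1.0, "c": 2.0}


def toy_delta(o, p):
    return 1.0 if abs(p - node_centers[o]) < 0.5 else 0.0


# Linear POG a -> b -> c aligned with the trajectory 0, 1, 2.
chain = pog_dp_scores(["a", "b", "c"], {"b": ["a"], "c": ["b"]}, [0.0, 1.0, 2.0], toy_delta)

# Branching POG a -> b and a -> c aligned with the trajectory 0, 2.
branch = pog_dp_scores(["a", "b", "c"], {"b": ["a"], "c": ["a"]}, [0.0, 2.0], toy_delta)
```

Iterating over trajectory points in the outer loop and over nodes in topological order in the inner loop guarantees that the three entries on the right-hand side are available when Score(o, p) is computed.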

A common way to define δ(o, p), the similarity between o and p, is as follows.

Assume that each node o is associated with a node center ω(o) that represents all the points aligned to this node. Then

\[ \delta(o, p) = \begin{cases} \varepsilon - \|p - \omega(o)\| & \text{if } \|p - \omega(o)\| < \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{4.2.2} \]

An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o.

Figure 4.5: Empty and solid points are aligned to the nodes o_a and o_b, respectively. For a new point p (the star), although it is closer to ω(o_b), it is better grouped with the points aligned to o_a. Hence, ideally, it should be aligned to o_a instead of o_b.

Natural choices for the node center ω(o) of o include an earlier computed canonical cluster center, or the center of the minimum enclosing ball of the points already aligned to this node (or some weighted variant of it), which is a dynamic point that depends on the alignment order. The advantage of the former is that canonical cluster centers tend to spread apart, which helps to increase the coverage of aligned nodes. Furthermore, the canonical cluster centers serve as good candidates for node centers, as we already know that there are many points around them. The disadvantage is that this choice does not consider the distribution of the points already aligned to the node. See figure 4.5, where, without considering the distribution of points aligned to o_a and o_b, the new point p will be aligned to o_b even though o_a is a better choice. Using the center of the minimum enclosing ball alleviates this problem. However, the influence regions of nodes produced this way tend to overlap much more than those obtained with the canonical cluster centers, and the positions of these centers also depend heavily on the order in which the curves are aligned. We combine the advantages of both approaches in the following two-level scoring function for measuring the similarity δ(o, p).

Specifically, for a node o, let q be the first point aligned to this node. This means that, at the time we examined q, it could not be aligned to any existing node in the POG. Let c_k ∈ C be the canonical cluster center nearest to q; recall that the node o was created because ||q − c_k|| ≤ ε. We add c_k as a point aligned to this node, and at any time the center of the minimum enclosing ball of the currently aligned points, including c_k, is used as the node center ω(o). Now let

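\[ D(o) = \max_{q, q' \in o} \|q - q'\| \tag{4.2.3} \]

be the diameter of the points currently aligned to o. We define:

\[ \delta(o, p) = \begin{cases} 2\varepsilon & \text{if } \|p - \omega(o)\| < D(o) \\ \varepsilon & \text{else if } \|p - \omega(o)\| < \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{4.2.4} \]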

In other words, the new scoring function prefers node centers that lie around previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives a higher similarity score to points that are more tightly grouped with those already aligned to the current node, addressing the problem shown in Figure 4.5. Our experimental tests have shown that this two-level scoring function significantly outperforms scoring functions that use only the canonical centers or only the centers of minimum enclosing balls. We remark that it is possible to use variants of the above two-level scoring function, such as making it continuous (instead of a step function). We choose the current form for its simplicity. Furthermore, experiments show only a marginal difference when the continuous version is used.
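The step function of Equation 4.2.4 can be sketched as follows. The sketch assumes that the node center ω(o) has already been computed elsewhere and is passed in; computing the minimum enclosing ball is outside its scope. In the toy example the node holds two points, for which the center of the minimum enclosing ball is simply their midpoint. The function names are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def node_diameter(members):
    """D(o) of Equation 4.2.3: largest pairwise distance among aligned points."""
    return max((dist(a, b) for a in members for b in members), default=0.0)


def two_level_delta(p, members, center, eps):
    """Two-level similarity of Equation 4.2.4 for a given node center."""
    d = dist(p, center)
    if d < node_diameter(members):
        return 2 * eps  # p falls inside the tight group already aligned to the node
    if d < eps:
        return eps      # p is only inside the influence region of the node
    return 0.0


# Toy node with two aligned points (made-up values).
members = [(0.0, 0.0), (0.4, 0.0)]
center = (0.2, 0.0)
eps = 1.0
tight = two_level_delta((0.3, 0.0), members, center, eps)
loose = two_level_delta((0.9, 0.0), members, center, eps)
outside = two_level_delta((2.0, 0.0), members, center, eps)
```

The three calls exercise the three branches: a point inside the tight group, a point inside the influence region only, and a point outside it.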

4.2.4 Merging Stage

In the first stage, we applied a progressive method to align the trajectories onto an alignment graph one by one. In the i-th iteration, each point of T_i is either aligned to the best-matching node in the current POG G_i, used to create a new node containing this point and the corresponding canonical cluster center, or discarded.

After processing all of the N trajectories in order, we return a POG G = G_N. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G by using a novel merging process.

Given the greedy nature of the POA algorithm, the alignment obtained in G is not necessarily optimal and depends on the alignment order. Furthermore, since the influence regions of different aligned nodes may overlap, no matter how we improve the scoring function, it is sometimes simply ambiguous where to align a new incoming point based on local information, and a wrong decision may have serious consequences later.

Figure 4.6: Empty and solid points are aligned to the nodes o_a and o_b, respectively, while points in the dotted region should be grouped together.

For example, see figure 4.6, where the set of points P (enclosed in the dotted circle) should have been aligned to one node. However, suppose the nodes o_a and o_b already exist before any point in P is inserted. Then, as points from P come in, it is rather likely that they are distributed evenly between o_a and o_b. This problem becomes much more severe in higher dimensions, where P can be distributed over several nodes whose centers are well separated around P but whose influence regions still cover some points of P (the number of such regions grows exponentially with the dimension d). Hence, instead of being captured in one heavily aligned node, P is broken into many small nodes. Our experimental tests confirm that this happens rather commonly in both the standard and our modified POA algorithms.

To address this problem, we propose a novel postprocessing on G. The goal is to merge qualified points from neighboring less-aligned nodes to augment more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:

(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;

(I2) The partial order constraints induced by the POG graph are always consistent with the order of points along each trajectory.

The second criterion means that at any time in the POG graph G', if p ∈ o_1, q ∈ o_2, p, q ∈ T_i, and p precedes q along the trajectory T_i, then either o_1 ≺ o_2, or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of T with respect to the same thresholds.

As an example, see figure 4.4, where the point p_2^1 (i.e., the second point of the trajectory T_1) in node b in (b) is moved to node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)) in order to maintain the invariants (I1) and (I2). When all points aligned to a specific node o have been merged (and thus moved) into other nodes (i.e., o becomes empty), we delete o, and its successors in the POG then become the successors of its parent.

A high-level pseudocode of the merging process is shown in algorithm 1. It augments better-aligned nodes of the current POG G by processing the larger nodes first. We repeat this procedure a few times until there is no significant increase in the quality of the resulting alignment. In practice, to speed up the algorithm, we merge neighbors into a node o only if its size is greater than some threshold (fixed at half of the size threshold, i.e., τ/2, in our experiments), as otherwise there is a low probability that o will become a heavy node later.

Algorithm 1: MergingProcessing

Input: G = {o_1, ..., o_m, ...}, with |o_m| ≥ |o_{m+1}|

Output: G_new: the new G after merging.

while significant progress do
    foreach o_m ∈ G in increasing order of m do
        foreach neighbor o_n with |o_n| < |o_m| do
            foreach t^{o_n} ∈ o_n do
                if mergeOK() then
                    merge t^{o_n} → o_m;
// mergeOK() checks whether the two invariants can be maintained if the candidate merging operation is performed.
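The check performed by mergeOK() is not spelled out in the pseudocode; one possible reading is sketched below. The data layout is hypothetical: `nodes` maps a node id to its aligned (trajectory id, index) pairs, `succ` holds the POG edges, and `coords` maps each pair to its coordinates. The sketch also assumes that a node holds at most one point per trajectory, that the existing members of the target already satisfy (I1), and it only tests the invariants against the current edges; it does not update the graph after a merge.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reachable(succ, src, dst):
    """True if dst can be reached from src along one or more POG edges."""
    stack, seen = list(succ.get(src, [])), set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(succ.get(n, []))
    return False


def merge_ok(item, target, nodes, succ, coords, max_diameter):
    """Check invariants (I1) and (I2) for moving `item` into node `target`."""
    tid, idx = item
    # (I1): the target node must stay within the diameter bound.
    if any(dist(coords[item], coords[m]) > max_diameter for m in nodes[target]):
        return False
    # (I2): the POG order must agree with the order along trajectory `tid`.
    for node, members in nodes.items():
        for (t2, i2) in members:
            if t2 != tid or i2 == idx:
                continue
            if node == target:
                return False  # assumption: one point per trajectory per node
            if i2 < idx and reachable(succ, target, node):
                return False  # an earlier point would sit after the target
            if i2 > idx and reachable(succ, node, target):
                return False  # a later point would sit before the target
    return True


# Toy POG loosely modeled on figure 4.4 (made-up coordinates).
nodes = {"a": [(1, 1)], "b": [(1, 2)], "e": [(2, 3)], "c": [(1, 3), (2, 4)], "f": [(3, 1)]}
succ = {"a": ["b", "e"], "b": ["c"], "e": ["c"], "c": ["f"]}
coords = {(1, 1): (0.0,), (1, 2): (1.0,), (2, 3): (1.2,),
          (1, 3): (3.0,), (2, 4): (3.1,), (3, 1): (1.1,)}
move_b_to_e = merge_ok((1, 2), "e", nodes, succ, coords, max_diameter=0.5)
too_far = merge_ok((1, 1), "e", nodes, succ, coords, max_diameter=0.5)
wrong_order = merge_ok((1, 2), "f", nodes, succ, coords, max_diameter=0.5)
```

The first call corresponds to moving the second point of trajectory 1 from node b to node e; the other two calls fail on (I1) and (I2), respectively.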

4.3 EPO implementation on protein folding data

In this section, we report a systematic performance study on a biological dataset that contains 200 molecular dynamics simulations. The experiments achieve the following goals: First, we show that the quality of the alignments produced by our EPO algorithm is significantly better than that of the original POA algorithm. Second, we demonstrate the effectiveness of our algorithm by applying it to real protein simulation data and obtaining biologically meaningful results that are consistent with previous discoveries [29].

4.3.1 Background of Dataset

Our input dataset includes 200 simulated folding trajectories of a particular protein called Trp-cage. The dataset was provided by Ota's lab [29]. The folding simulations were performed at 325 K using the AMBER99 force field with a small modification and the generalized Born implicit solvent model. Trp-cage (see Figure 4.7) is a mini-protein consisting of 20 amino acids. It has been widely used in folding studies because of its short, simple sequence and its fast folding kinetics. Following the definition from [123], a successful folding event has to satisfy the following two criteria:

• The RMSD of a conformation from the native NMR structure [115] is less than 2 Å.

• A subsequence of such near-native conformations holds for at least 200 ps.
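The two criteria can be checked on a per-trajectory series of RMSD values, as in the sketch below. It assumes the 20 ps sampling interval reported for this dataset and, as a labeled assumption, that k consecutive frames span (k − 1) × 20 ps, so that 11 consecutive near-native frames are needed; under the alternative convention of k × 20 ps, 10 frames would suffice. The function name is hypothetical.

```python
def has_folding_event(rmsd_series, cutoff=2.0, min_duration_ps=200.0, step_ps=20.0):
    """True if a run of consecutive near-native frames spans min_duration_ps."""
    run = 0  # length of the current run of frames with RMSD below the cutoff
    for rmsd in rmsd_series:
        run = run + 1 if rmsd < cutoff else 0
        # Assumption: k consecutive frames span (k - 1) * step_ps picoseconds.
        if run > 0 and (run - 1) * step_ps >= min_duration_ps:
            return True
    return False


# Made-up RMSD series in angstroms.
folded = has_folding_event([3.0] * 5 + [1.5] * 11)
too_short = has_folding_event([3.0] * 5 + [1.5] * 10)
interrupted = has_folding_event([1.5] * 6 + [2.5] + [1.5] * 6)
```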

In [29], 58 trajectories reaching successful folding events are identified, and each trajectory includes 101 successive conformations sampled at 20 ps intervals. Furthermore, there are two crucial observations in [29] that we will examine in our experiments. First, before moving to the native conformation, a “ring” sub-structure (see Figure 4.7) has to be formed. Second, the distinction between native and pseudo-native conformations relies heavily on the side-chain positions of the “ring” sub-structure. Ota et al. [29] obtained the above results by first aligning each pair of trajectories and then applying a neighbor-joining method to group similar trajectories together. However, this semi-automatic approach requires dedicated expert knowledge. The following experiments, applied to the same dataset, show that our EPO algorithm can automatically detect the above folding events with little prior knowledge.

Figure 4.7: NMR structure of the Trp-cage protein 1l2y. Labels on the graph mark amino acids (AAs). AA2 to AA7 roughly form an alpha-helix. AA2 to AA19 form a ring-type structure. In particular, AA2 to AA5 and AA16 to AA19 form the “neck” of this ring.

4.3.2 Experimental Setting

In order to be consistent with the results from [29], we select all 58 successful folding trajectories and call this set SuccData. We also randomly select 58 unsuccessful folding trajectories, each containing 101 conformations, and collect them in a set called FailData. The union of the successful and unsuccessful data is referred to as MixData. We set the distance threshold ε = 1 Å and τ = 40 in the following experiments, unless specified otherwise.

Figure 4.8: Distribution of aligned nodes produced by the EPO algorithm, EPO-NoMerge (i.e., the first stage of the EPO algorithm), and the traditional POA algorithm. The histogram shows the number of aligned nodes (y-axis) versus the size of aligned nodes (x-axis).

4.3.3 Investigation on Entire Protein Structure

In the first set of experiments, we convert each conformation to a high-dimensional point (i.e., a 20 × 20 = 400 dimensional point) based on the distance matrix between all of the alpha-carbon atoms. Figure 4.8 compares the quality of the alignments of SuccData produced by the POA algorithm, our EPO algorithm without the merging procedure (EPO-NoMerge), and the EPO algorithm. It shows the number of aligned nodes (y-axis) versus the size of aligned nodes (x-axis). Note that EPO-NoMerge is essentially POA with the clustering preprocessing and the new two-level scoring function.
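The conversion of a conformation into a 400-dimensional point can be sketched as follows. The function name and the toy coordinates are hypothetical; the only assumption about the input is that it lists one 3D alpha-carbon position per residue. Because the point is built from pairwise distances, a rigidly moved copy of the same conformation maps to the same point, which is why no transformation is needed when comparing curves.

```python
import math


def distance_map_point(ca_coords):
    """Flatten the pairwise alpha-carbon distance matrix into one point."""
    n = len(ca_coords)
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(ca_coords[i], ca_coords[j])))
        for i in range(n)
        for j in range(n)
    ]


# Hypothetical conformation: 20 alpha-carbon positions along a straight line.
conformation = [(3.8 * k, 0.0, 0.0) for k in range(20)]
# The same conformation after a translation.
shifted = [(x + 5.0, y - 2.0, z + 1.0) for (x, y, z) in conformation]

point = distance_map_point(conformation)
shifted_point = distance_map_point(shifted)
```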

The similarity level between these trajectories is low (i.e., the number of aligned nodes with large size is small). It is clear from this histogram that our EPO algorithm significantly outperforms the other two by producing more aligned nodes of large size. The comparison between EPO and EPO-NoMerge demonstrates the effectiveness of our merging procedure, and the fact that EPO-NoMerge is better than POA shows that the two-level scoring function and the clustering preprocessing greatly enhance the performance. We have also performed experiments showing that, compared to the POA algorithm, EPO-NoMerge is much less sensitive to the order in which the curves are aligned. Comparing the three algorithms on MixData produces a similar result, and the majority of points aligned to heavy nodes (i.e., |o| ≥ 40) are from successful runs (the results are not shown in this report).

We also observe that most of the heavily aligned nodes are close to the end of the trajectories for SuccData. In fact, many aligned points have conformation IDs around or greater than 90, which is indeed the time at which the folding starts to stabilize. More specifically, consider the set of aligned nodes of size greater than 40 for SuccData. Among all points aligned to these nodes, 67.2% have a conformation ID greater than 90, and 24.4% have an ID between 80 and 90. This implies that our algorithm has the potential to detect the stabilization of successful folding events automatically.

This also implies that using the entire protein structure may be too coarse to detect critical folding events, as they are usually induced by small key motifs. In what follows, we map only a substructure of the input protein into a multi-dimensional point and provide more detailed analysis of this folding data.

4.3.4 Investigation on Substructures

It is usually believed that certain critical motifs play important roles in stabilizing the whole structure during the folding process [21, 22]. We wish to have a tool that can identify such critical motifs (substructures) automatically. We define a candidate motif to be two subchains of Trp-cage, each of length 4. These two pieces induce a sub-window in the distance map of each conformation of the protein. We further require that the number of contacts in this sub-window, with respect to the distance map of the native structure, is at least 4, where a contact corresponds to two alpha-carbon atoms within a distance of 6 Å. We collect a set of candidate motifs based on these criteria.
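The selection of candidate motifs can be sketched as an enumeration over pairs of length-4 subchains. The sketch assumes 0-based residue indices, that the two subchains do not overlap (the second starts after the first ends), and that "within distance 6 Å" means at most 6 Å. The function name and the toy distance map are hypothetical.

```python
def candidate_motifs(native_dist, length=4, contact_cutoff=6.0, min_contacts=4):
    """Return start-index pairs (a, b) of two subchains whose sub-window in the
    native distance map holds at least min_contacts contacts."""
    n = len(native_dist)
    motifs = []
    for a in range(n - length + 1):
        # Assumption: the second subchain starts after the first one ends.
        for b in range(a + length, n - length + 1):
            contacts = sum(
                1
                for i in range(a, a + length)
                for j in range(b, b + length)
                if native_dist[i][j] <= contact_cutoff
            )
            if contacts >= min_contacts:
                motifs.append((a, b))
    return motifs


# Toy 10-residue distance map (made-up values, in angstroms).
toy = [[0.0 if i == j else 10.0 for j in range(10)] for i in range(10)]
for i, j in [(0, 9), (1, 8), (2, 7), (3, 6)]:
    toy[i][j] = toy[j][i] = 5.0

found = candidate_motifs(toy)
```

In the toy map the four contacts form an anti-parallel pairing between residues 0 to 3 and 6 to 9, so only that pair of subchains qualifies. With 0-based indices, a motif made of amino acids No. 2-5 and No. 16-19 would correspond to the pair (1, 15).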

Now, for a candidate motif and each of its conformations along the trajectory, we consider the distance matrix between its alpha-carbon atoms as before, and convert the folding trajectory of this motif into a curve in the 4 × 4 = 16 dimensional space.

In order to be more discriminative, we also introduce a side-chain weighting factor α, ranging from 0 to 1, to include side-chain information when comparing two high-dimensional points; α = 0 means that side-chain information is completely ignored. Roughly speaking, for every conformation we also record, for each residue, the relative position of the centroid of its side chain with respect to its alpha-carbon atom. This provides another high-dimensional point that we call a side-chain point. The distance between two conformations then also incorporates the distance between their side-chain points, weighted by the factor α. We perform our EPO algorithm on both SuccData and MixData, and two motifs especially stand out, which we describe below.
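The exact formula used to combine the two distances is not given in the text. The sketch below uses a convex combination as a purely illustrative assumption; it is one of several forms that ignore the side chains at α = 0. The function names and the toy points are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def conformation_distance(backbone_a, side_a, backbone_b, side_b, alpha):
    """Blend backbone and side-chain distances with weight alpha in [0, 1].

    Assumption: a convex combination; the text only fixes the alpha = 0 case.
    """
    return (1.0 - alpha) * dist(backbone_a, backbone_b) + alpha * dist(side_a, side_b)


# Made-up distance-map points and side-chain points of two conformations.
bb_a, bb_b = (0.0, 0.0), (3.0, 4.0)   # backbone distance 5
sc_a, sc_b = (0.0,), (1.0,)           # side-chain distance 1

backbone_only = conformation_distance(bb_a, sc_a, bb_b, sc_b, alpha=0.0)
balanced = conformation_distance(bb_a, sc_a, bb_b, sc_b, alpha=0.5)
```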

4.3.4.1 Alpha-helix substructure

The first one corresponds to an alpha-helix substructure. In Figure 4.7, six successive amino acids (No. 2-7) form an alpha-helix, which is a simple, self-contained secondary structure element (SSE) [115]. From the results returned by our EPO algorithm, we note that this alpha-helix is consistently formed rather early in both successful and unsuccessful runs. Once formed, it remains stable. This is consistent with the common understanding that, due to its chemical properties, the alpha-helix is a stable secondary structure and can form quickly. Hence the formation of the alpha-helix cannot be used to differentiate successful runs from unsuccessful ones.

4.3.4.2 Ring-substructure

The second motif corresponds to the neck of a ring structure. In particular, it consists of the sub-chains of amino acids No. 2-5 and No. 16-19. The following results demonstrate that EPO can automatically not only find but also track the formation of such fingerprint sub-structures (critical motifs).

First, we observe from Table 4.1 that when the EPO algorithm is applied to MixData (with the side-chain weighting factor α = 0.9), significant alignments involve mainly trajectories from SuccData. For example, the last row of Table 4.1 shows that among the 62 points (from 62 trajectories) aligned to a particular node, 58 are from SuccData and the remaining 4 are from FailData. Hence this motif is potentially critical to the successful folding of Trp-cage. It also suggests that we can automatically classify MixData into SuccData and FailData with few false positives based on this ring-neck motif, whereas the classification in the input data was previously obtained with a few expert-defined rules.

Second, when the side-chain weighting factor α = 0.9, it turns out that 49.6% of the significant aligned nodes are formed before conformation ID 85 (compare with the results of Section 4.3.3). For example, there are two aligned nodes from the successful runs in which 80% of the aligned points (i.e., trajectories) have a conformation ID between 75 and 85. This implies that the complete formation of this ring-neck usually immediately precedes the stabilization of the folded structure (which occurs roughly at conformation ID 90 for successful trajectories).

Aligned      |o_i| ≥ τ         Classification         D(o) (Å)
Node ID      (MixData)     SuccData    FailData
1            49            27          22             1.852
2            45            28          17             1.798
3            41            29          12             2.189
4            40            31          9              1.447
5            48            31          16             1.761
6            40            32          8              1.322
7            47            34          13             1.133
8            42            35          7              1.923
9            44            36          8              0.873
10           49            42          7              1.428
11           54            48          6              1.020
12           59            50          9              1.294
13           60            51          9              0.932
14           56            52          4              1.255
15           62            56          6              1.782
16           62            58          4              1.503

Table 4.1: EPO on the ring structure (MixData). Columns 2 to 4 show the size of an aligned node (i.e., the number of points aligned to this node) from MixData, SuccData, and FailData, respectively. Column 5 shows the diameter of this node (note that the distance threshold ε = 1 Å means that the diameter of a node can be up to 2 Å).

(a) Structure folding (b) Ring forming (c) Ring adjusting (d) Final structure

Figure 4.9: Visualization of the vital events listed in Table 4.1 during the folding procedure. Purple: α-helix, blue: 3-10 helix, cyan: turn, lime: coil. Corresponding to Table 4.1, the aligned node IDs are: (a) 1 and 2, (b) 3 and 4, (c) 5 to 15, (d) 16.

When the side-chain weighting factor α is reduced to 0.5, we naturally find more aligned nodes. In particular, other than the cluster with conformation IDs around 80, we observe more significant clusters involving conformations with IDs from 50 to 70. By comparing the conformations of the ring-neck motif in these clusters with those in the aligned nodes around 80, we found that the backbone structures are rather similar but the side chains have different orientations. In other words, the shape of the ring-neck motif is first stabilized by the backbone structure, and then the side chains gradually move into the right positions. There are a few trajectories in which the side chains eventually move to the mirror image of their correct positions, leading to pseudo-native conformations that can be detected when the side chains are considered.

Figure 4.9 displays several groups of critical events identified by the EPO algorithm (corresponding to the aligned nodes shown in Table 4.1). In particular, Figure 4.9a includes two events that occur close together during the early stage of the folding procedure (one of the conformations is selected from aligned node 1 and the other from aligned node 2). At this time, the sequence has started to fold and one can observe the helical structure, but the ring is not yet formed. Figure 4.9b presents two conformations representing aligned nodes 3 and 4, respectively. We observe that the ring has started to form at this point but is still not stable. Figure 4.9c shows several conformations (one each from aligned nodes 5 to 15) that occur in order in most successful runs. At this stage, the ring is adjusted and stabilized. The adjustment mainly happens around the turn area and in the side chains. Figure 4.9d shows the final successful structure.

The above results are consistent with the results from [29], where such a ring-shaped substructure was discovered semi-automatically by pairwise structure comparisons together with expert knowledge.

4.3.5 Timing of EPO

The above experiments were run on a Windows XP machine with a 1.5 GHz CPU and 512 MB of memory. Table 4.2 compares the running times of the three methods: EPO, EPO without merging, and the POA algorithm (re-implemented by us). The EPO algorithm is faster than the traditional POA algorithm. This is because EPO preprocesses all points from the input curves by clustering them into groups and creates a new aligned node only if it has good potential to become heavily aligned. Thus it is not surprising that the EPO algorithm takes less time to align FailData than SuccData, while the POA algorithm takes longer (as it creates many aligned nodes, leading to large partial order graphs). This interesting property implies that the EPO algorithm is effective at aligning curves with a low level of similarity.

(unit: min)         MixData    SuccData    FailData
EPO                 ≈ 30       ≈ 14        ≈ 7
EPO w/o merging     ≈ 26       ≈ 12        ≈ 6
POA                 ≈ 49       ≈ 23        ≈ 33

Table 4.2: Processing times of the EPO algorithm, EPO-NoMerge, and the traditional POA algorithm on three datasets. MixData includes 116 trajectories, SuccData includes 58 trajectories, and FailData includes 58 trajectories.

CHAPTER 5

SMOLIGN: A SPATIAL MOTIFS BASED PROTEIN MULTIPLE STRUCTURAL ALIGNMENT METHOD

5.1 Introduction

5.1.1 Overview

Proteins carry out their specific biological roles through interaction with other proteins or other macromolecules. This interaction is determined largely by the three-dimensional structures of the molecules. Therefore, an important direction toward understanding protein function is to study and analyze protein structures. In particular, since structurally similar proteins tend to share common functionalities, one fundamental task in such an analysis is the structural alignment problem, where the proteins are superimposed in order to find the similarities and differences in their structures. Alignment and comparison of protein structures can help discover biologically significant structural motifs and reveal distant evolutionary relationships that may not be detectable from the sequence information alone.

In this chapter, we propose and develop a robust MSTA framework (see the flow chart in figure 5.1) that addresses the aforementioned limitations and challenges. In particular, for each input protein, we construct a small set of biologically meaningful motifs based on interacting windows in its contact map (figure 5.1 a and b). The contact map motifs are able to capture features from both SSEs and the residues that do not form distinct SSEs. Additionally, they are spatially constructed to encode geometrical and functional information not available in sequence-fragment-based motifs. We then develop a novel multi-level extension algorithm that rapidly extends seed alignments from contact-map motifs to global alignments among multiple structures (figure 5.1 c and d). Finally, we iteratively improve the resulting alignments with the EPO method [19], which further optimizes the correspondences among proteins (figure 5.1 e).

Figure 5.1

This strategy yields a sensitive and robust automated algorithm that can detect similarities among multiple protein structures even under low-similarity conditions.

The success of our method is demonstrated on several protein structure datasets that have previously been used in the context of MSTA, span various structural folds, and represent different levels of protein similarity. On all of the datasets, our method in general yields better alignment results than other popular MSTA methods. Our resulting software is available both as a downloadable binary and as a web service at http://bio.cse.ohio-state.edu/Smolign.

5.1.2 Challenges and goals

In chapter 3, we introduced a recently developed MSTA concept, namely the simultaneous alignment approach. In contrast to progressive pairwise methods, most algorithms that follow the simultaneous alignment concept share a common first step: they break each input structure into a set of small motifs, such as short fragments of protein backbones (AFPs) [71] or secondary structure elements (SSEs) [75]. Motifs shared by all proteins are then assembled in a geometrically consistent manner. Since the motifs are much smaller than the whole protein, one can afford to use more accurate methods to align them. Furthermore, using the alignments between motifs as seeds to align the entire structures in different ways helps detect partial local similarities among the input structures, yielding rigid or flexible alignments. While the current simultaneous alignment methods tend to be more effective at aligning proteins with diverse structures, they still present limitations and challenges.

Local sequence dependency: The AFP is a popular substructure format in simultaneous alignment methods such as [71, 54]. We observe that the performance of the AFP methods relies heavily on the quality of the representation provided by the fragments. Using backbone fragments tends to produce motifs that are built only from local sequence fragments, which hardly reflect spatial similarity. Figure 5.2 shows such an example. In figure 5.2a, AFP m aligns two fragments, one from each protein, with a low RMSD value. However, the alignment from AFP n will be missed if we optimally superimpose the two proteins using only the corresponding residues of AFP m. In practice, this is an unavoidable conflict: the better a pair of short local fragments is aligned, the worse the global result tends to be. In this case, [71] has to iteratively pick a pivot protein for each Cut to recover the globally optimal transformation that may have been missed, and [54] has to bend the connections between fragments to align m and n together. MASS [75] adopts SSEs (or relations between SSEs) to solve the issue, as shown in figure 5.2b. It combines an alpha-helix and a beta-strand to make a spatial motif, which also brings an extra benefit for the issue discussed next, as it greatly reduces the number of candidate alignment seeds. However, MASS misses motifs that are not based on secondary structures, which is one of the remaining challenges. Furthermore, the extension of the seed fragment alignments to global alignments also remains a challenging problem. The filtering applied to the possible seeds and the geometric constraints imposed during the extension stage, in most cases, speed up the process at the cost of missing better global alignments. Therefore, an immediate task is to create bigger motifs that give a better indication of the globally optimal alignment. Since both the AFPs in Multiprot and the SSEs in MASS are small motifs that are aligned independently in Cuts or buckets, the generated global alignment will be more promising if a combination method can be found that joins two or more motifs across Cuts or buckets.

(a) Alignment on a single AFP (b) Alignment on both AFPs

Figure 5.2: A closer look at the alignment produced by AFPs. (a) A transformation based on AFP m only may lose the alignment from AFP n. (b) If both AFPs are combined, we can get a better alignment.

Progressive approach on motif set selection. According to the detailed discussion in section 3.3.2, all simultaneous alignment methods face a common NP-hard problem: how to select the best alignment seed for superimposing structures from a collection of similar substructures. For instance, MASS has to enumerate all possible bucket combinations, in exponential time, if it seeks the best solution. Both Multiprot and MASS are therefore forced to adopt heuristic solutions. Basically, they both use a simple center-star method to progressively join new structures, as we described in section 3.3.1. However, this naive heuristic again causes the local minimum issue. Further study of this critical issue in simultaneous alignment methods leads to a challenging question: Is there a heuristic solution that retains the simultaneous manner while selecting the optimal alignment seed?

After carefully studying all of the above issues, we set the goals of our framework as follows:

At the data modeling stage, we wish to find a concise (so that the computational cost remains low), yet complete (so that we do not miss important structural similarities) set of motifs. Such motifs should cover enough of the structural space that the final global alignment is sequence independent; in addition, alignment seeds are created simultaneously in order to greatly alleviate the local minimum effect. At the global alignment stage, we wish to consider all structures together without restrictions on alignment order. Meanwhile, we also want to automatically obtain subset alignment information in this stage. At the final tune-up stage, we want to capture the irregular alignment portions that cannot be revealed by the previous motif detection stage.

5.2 Methods

5.2.1 Algorithm Overview

The objective of our algorithm is to find the largest multiple alignment among k protein structures while maintaining a cumulative error below a threshold ε. This error is quantified by the multiple RMSD (mRMSD) measure (shown in equation 3.4.3), which computes the average of the RMSD values between the aligned residues of a pivot protein Pp and the corresponding residues of the other proteins.
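The verbal description of mRMSD above can be illustrated with a minimal Python sketch. It assumes that the aligned residues have already been superimposed and that column c of every protein belongs to the same correspondence; the function names and the toy coordinates are hypothetical, and equation 3.4.3 itself is not reproduced here.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of 3D points."""
    if len(a) != len(b) or not a:
        raise ValueError("point lists must be non-empty and of equal length")
    sq = sum((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
             for (x1, y1, z1), (x2, y2, z2) in zip(a, b))
    return math.sqrt(sq / len(a))

def mrmsd(aligned, pivot):
    """Average RMSD between the pivot protein and every other protein.

    aligned[i] holds the coordinates of the aligned residues of protein i,
    already superimposed onto a common frame (an assumption of this sketch).
    """
    others = [p for i, p in enumerate(aligned) if i != pivot]
    return sum(rmsd(aligned[pivot], p) for p in others) / len(others)

# Three toy "proteins" with two aligned residues each (hypothetical data).
proteins = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # shifted by 1 along z
    [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],  # shifted by 3 along z
]
print(mrmsd(proteins, pivot=0))
```

With protein 0 as the pivot, the two pairwise RMSD values are 1 and 3, so the sketch reports their average.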

Figure 5.3: Overview of the algorithm. (a) Input protein structures. (b) An example contact map. The contact cells are shown as dots in the corresponding matrix entries. The sub-windows are extracted to cover the spatial patterns in the contact map. (c) Spatial Motif Library composed of motifs extracted from the contact maps. (d) Seed alignment of an αβ motif. (e) Extended seed alignment. (f) Refined alignment.

A high-level description of our algorithm is shown in Figure 5.3. From a dataset of k protein structures, we first extract contact window patterns from the distance map of each protein. These patterns provide a transformation-invariant representation of local structures. We observe that pairs of contact windows present a good balance between sensitivity and specificity of fragments to be utilized in multiple structure alignment. Therefore, the contact window patterns in a distance map that are in close proximity are paired up into linked motifs, which make up the Spatial Motif Library (SML). Compatible motifs common to all proteins are identified from the SML using a dynamic filtering procedure, and a fast distance-map based alignment method is used to build the seed alignments. The seeds that satisfy a predefined mRMSD threshold are merged to rapidly form larger extended alignments. For rigid alignments, a single set of correspondences defined by the extended alignments is refined using EPO, an enhanced partial order curve comparison algorithm [19]. For flexible alignments, multiple non-conflicting correspondence sets are used in the refinement. In the following sections, we describe each of these steps in detail.

5.2.2 Construction of the SML

The residue-contact patterns of protein structures are the most conserved features of distantly-related proteins [124], which motivates us to capture and use such patterns for aligning multiple structures. We represent each protein structure using the distance matrix [76] of its alpha-carbon atoms. The distance matrix captures the structure and connectivity information and provides a complete representation of the protein structure that is invariant under rigid transformations [81].

The entries of the distance matrix that are less than a predefined threshold (typically 6 Å) are denoted as contact cells, and they correspond to residues that are in close proximity in the 3D structure. The collection of these cells gives the contact map of the protein (Figure 5.3b), which can be used to identify SSEs or other structural patterns. Specifically, the fragments along the diagonal are alpha-helices (α), the fragments parallel or perpendicular to the diagonal are parallel and anti-parallel beta-sheets (β+ and β−), and other less regular fragments of residue contacts correspond to small loops (L) and free shapes (F). We utilize the distance and contact maps to extract and classify similar structural motifs that constitute the Spatial Motif Library (SML).
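As an illustration of the definitions above, the following Python sketch builds a distance matrix from alpha-carbon coordinates and thresholds it into a contact map. The function names and the toy coordinates are hypothetical. The 6 Å default follows the typical value quoted in the text, and the diagonal is kept as contact cells simply because a zero distance is below the threshold.

```python
import math

def distance_matrix(ca_coords):
    """Pairwise Euclidean distances between alpha-carbon coordinates."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

def contact_map(dist, threshold=6.0):
    """Mark every entry of the distance matrix below the threshold as a contact cell."""
    return [[d < threshold for d in row] for row in dist]

# Four residues placed on a straight line, 3.8 Angstrom apart (toy data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(distance_matrix(coords))
for row in cmap:
    print("".join("#" if c else "." for c in row))
```

Because the distance matrix depends only on pairwise distances, the resulting map is unchanged if all coordinates are rotated or translated together.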

Contact windows. An initial 4 × 4 sliding window is used to scan the distance map for detecting any of the SSEs and other significant patterns. We then expand the initial size of the captured window row and column-wise simultaneously until such an expansion no longer incorporates a new contact cell.
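The window expansion step can be sketched as follows. This is only one possible reading of the description: the sketch assumes that the window is anchored at its top-left cell and grows by one row and one column toward increasing indices, stopping as soon as the newly added strip contains no contact cell. The function name and the toy contact map are hypothetical.

```python
def expand_window(cm, top, left, size=4):
    """Grow a size x size window over the contact map cm.

    The window is extended by one row and one column at a time
    (an assumed growth direction) until the added L-shaped strip
    brings in no new contact cell or the map border is reached.
    Returns (top, left, height, width).
    """
    n_rows, n_cols = len(cm), len(cm[0])
    h = w = size
    while top + h < n_rows and left + w < n_cols:
        new_row = cm[top + h][left:left + w + 1]             # includes the corner cell
        new_col = [cm[r][left + w] for r in range(top, top + h)]
        if not (any(new_row) or any(new_col)):
            break
        h += 1
        w += 1
    return top, left, h, w

# Toy 8 x 8 contact map with a diagonal run of six contact cells.
cm = [[False] * 8 for _ in range(8)]
for i in range(6):
    cm[i][i] = True
print(expand_window(cm, 0, 0))
```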

Note that individual contact windows by themselves do not in general provide a sensitive representation to be used for structural alignment. Because of the regularities in SSEs, many of the contact windows from multiple proteins would align well, but would not necessarily induce a good alignment for the rest of the protein. On the other hand, using pairs of contact windows as seed motifs greatly increases the discrimination power of such motifs. One can use even higher order motifs by combining multiple contact windows; however, this risks being too restrictive and it may not be possible to find such higher order motifs shared by all proteins. Therefore, we use pairs of contact windows as our primary spatial motifs, to serve as seed alignments.

Pairs of structural fragments have previously been utilized by one of the earlier MSTA methods, MASS [75], where SSEs are represented as line segments and pairs of SSEs are used to provide seed alignments. Using contact windows instead of SSEs provides a more descriptive representation of motifs and captures spatial arrangements that do not form distinct SSEs.

Spatial Motifs. Pairs of interacting and compatible contact windows are linked to form the Spatial Motifs (Figure 5.3c). A regular spatial motif is formed by linking two α helices (αα), or an α helix and a β sheet (αβ), or two β sheets (ββ). The actual Spatial Motif types are listed in Table 5.1.

Motif type   Description                                                                    Demo Graph
αα           two alpha-helices close to each other                                          (graph)
αβ+          an alpha-helix and a parallel beta sheet                                       (graph)
αβ−          an alpha-helix and an anti-parallel beta sheet                                 (graph)
β+β+₁        two parallel beta sheets share the first strand (in the sequence order)        (graph)
β+β+₂        two parallel beta sheets share the second strand                               (graph)
β+β+₃        two parallel beta sheets share the last strand                                 (graph)
β+β−₁        a parallel beta sheet and an anti-parallel beta sheet share the first strand   (graph)
β+β−₂        a parallel beta sheet and an anti-parallel beta sheet share the second strand  (graph)
β+β−₃        a parallel beta sheet and an anti-parallel beta sheet share the last strand    (graph)
β−β−₁        two anti-parallel beta sheets share the first strand                           (graph)
β−β−₂        two anti-parallel beta sheets share the second strand                          (graph)
β−β−₃        two anti-parallel beta sheets share the last strand                            (graph)
α            single alpha-helix                                                             -
β+           single parallel beta sheet                                                     -
β−           single anti-parallel beta sheet                                                -
L            a loop (usually exists between two SSEs)                                       -
F            two random fragments close to each other                                       -

Table 5.1: In total, 17 motif types are recognized in the SML. A collection, called a bucket, is used to efficiently store each type of motif. (1) The fragments saved in a Spatial Motif are ordered by sequence order from N-terminus to C-terminus. (2) The last 5 motif types in the table are directly detected from the contact map. Usually, they are not involved in seed alignment unless not enough Spatial Motifs are found in a given data set. (3) The top 12 Spatial Motif types are obtained from the linked α (blue column), β+ and β− (green column, shared strand is marked with a bold line). In particular, αα, αβ+ and αβ− require D < 13 Å, where D is calculated between α and the first β strand; δ is the number of residues between the end of the first fragment and the start of the second fragment; θ is only calculated in those Spatial Motifs where α is involved, using the least-squares line.

To ensure that the linked contact windows are interacting in the 3D structure, we further require that the fragments represented by the contact windows are closer than a predefined threshold (typically 13 Å) and, in the case of β sheets, that they share one of their strands. In order to facilitate efficient identification and fast alignment of compatible motifs, we attach the following geometric features to each motif:

• The protein identification (id).

• The minimum Euclidean distance (D) between the amino acid residues of the pair of contact windows (only recorded in αα, αβ+ and αβ− motif types).

• The number of amino acid residues in the motif (s), i.e., the size of the motif.

• The position information (p) of the fragments along the protein sequence, which indicates the relative location of the motif along the sequence.

• The angle (θ) between the backbone segments of the two contact windows (only recorded in αα, αβ+ and αβ− motif types).

• The number of amino acid residues (δ) separating the contact windows along the backbone.

Note that for some sets of proteins, the regular motifs formed by α and β contact windows may not be sufficient to induce a global alignment. Moreover, the SSE assignments are error-prone and may not be consistent across related proteins. In order to handle such cases, we store the irregular contact windows from loops (L) and free shapes (F) along with single SSEs α, β+ and β− as part of the SML, and resort to these motifs if the regular motifs do not provide satisfactory alignment seeds. The detailed definition of the SML is given below with reference to Table 5.1:

Definition 5.2.1 (SML). Given a protein set, its SML includes 17 buckets, SML = {B_{m_type}}, where m_type ∈ {αα, αβ+, αβ−, β+β+₁, β+β+₂, β+β+₃, β+β−₁, β+β−₂, β+β−₃, β−β−₁, β−β−₂, β−β−₃, α, β+, β−, L, F}. Each B_{m_type} = {m(id, D, s, p, θ, δ)}, where m is a motif associated with 6 properties.
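One way to hold such a library in memory is sketched below. The ASCII bucket names, the `Motif` class and the `build_sml` helper are hypothetical, and storing p as a relative position between 0 and 1 is an assumption rather than something the definition states; D and θ are optional because they are only recorded for α-related motif types.

```python
from dataclasses import dataclass
from typing import Optional

# ASCII stand-ins for the 17 motif types of Definition 5.2.1.
MOTIF_TYPES = (
    "aa", "ab+", "ab-",
    "b+b+1", "b+b+2", "b+b+3",
    "b+b-1", "b+b-2", "b+b-3",
    "b-b-1", "b-b-2", "b-b-3",
    "a", "b+", "b-", "L", "F",
)

@dataclass(frozen=True)
class Motif:
    id: str                  # protein identification
    D: Optional[float]       # minimum distance between the two windows (alpha-related types only)
    s: int                   # number of residues in the motif
    p: float                 # relative position along the sequence (assumed to lie in [0, 1])
    theta: Optional[float]   # angle between backbone segments (alpha-related types only)
    delta: int               # residues separating the two windows along the backbone

def build_sml(typed_motifs):
    """Group (motif_type, Motif) pairs into one bucket per motif type."""
    sml = {t: [] for t in MOTIF_TYPES}
    for mtype, motif in typed_motifs:
        sml[mtype].append(motif)
    return sml

# Toy motifs with made-up feature values.
sml = build_sml([
    ("aa", Motif("1cseE", 8.5, 24, 0.30, 0.4, 6)),
    ("aa", Motif("1sbnE", 9.0, 22, 0.32, 0.5, 5)),
    ("b+b+1", Motif("1cseE", None, 15, 0.60, None, 3)),
])
print({t: len(b) for t, b in sml.items() if b})
```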

5.2.3 Obtaining seed alignments

Aligning similar motifs from the SML, one from each protein structure, provides seed alignments around which the rest of the protein structures can be aligned. However, determining similarity involves the expensive operations of finding the optimal combination of Spatial Motifs and performing structural alignment.

5.2.3.1 Selection of seed motifs set

As with all other simultaneous alignment methods mentioned in the previous section, we have to adopt a heuristic approach to the seed selection task. Unlike previous methods, our strategy is more elaborate: it relies on heuristics using the SSE types and the D, s, p, θ and δ feature values of the motifs. We explain this selection step in more detail below.

Given an SML created from a protein set of size k, we first pick only those buckets that include motifs from all k proteins. Then, for each surviving bucket, we perform a dynamic, multi-level and iterative adjustment of a set of pruning thresholds associated with the 5 geometric feature values. The pruning thresholds used in our algorithm are defined in Table 5.2.

The tune-up procedure for the pruning thresholds is shown in Figure 5.4. At the beginning of every tune-up iteration, the input is a bucket of motifs and a set of pruning thresholds whose values come either from the previous iteration or from the pre-defined initial values. We start by sorting all motifs on one of the geometric features, e.g. the D value in Figure 5.4(a), and then apply an extendable scanning window along the sorted sequence, in which the extending conditions are:

(i) the window {m_{D_i}} includes at least one motif from each of the k proteins;

(ii) $|D^{m_s} - D^{m_e}| \le D_\lambda$, where m_s and m_e are the first and last motifs of the window.

For each scanning window {m_{D_i}}, we first attach an important property to it, namely the number of possible motif combinations:

$$NComb\{m_{D_i}\} = \prod_{j=1}^{k} |m_j| \qquad (5.2.1)$$

where |m_j| is the number of motifs from protein j in this scanning window. Then, we sort the window on one of the remaining geometric features at the next level, create scanning windows, and attach NComb properties iteratively, as shown in Figure 5.4(b)(c)(d)(e).

Pruning threshold   Value range    Initial value   ↑ factor   Memo
Dλ                  [1 Å, 13 Å]    1 Å             2 Å        Applied only in α related motifs.
sλ                  [4AA, 56AA]    20              4          Control the size difference between two similar motifs.
pλ                  [0.05, 0.6]    0.05            0.05       Control the relative position difference between two similar motifs.
θλ                  [0.05, 1]      0.05            0.05       Applied only in α related motifs.
δλ                  [4AA, 56AA]    18              4          Control the distribution difference between two similar motifs.

Table 5.2: List of pruning thresholds. Each threshold controls the similarity of one particular geometric feature between compared motifs. E.g., given Dλ = 1 Å, if we consider two motifs a and b to be similar in terms of D value, then $|D^a - D^b| \le D_\lambda$.

Figure 5.4: Selection of seed motifs set. (a) Sorted motifs list from bucket {m(id, D, s, p, θ, δ)} on feature value D (this pruning level is only applied to α related buckets and is omitted in all ββ buckets). The rectangle represents a scanning window {m_{D_i}}, which is a subset of this bucket. (b), (c), (d), (e) Each is a sorted list from the previously generated motifs subset.
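One pruning level of this procedure can be sketched in Python as follows. The text does not spell out how the extendable window is advanced, so the sketch makes an assumption: it reports every maximal run of the sorted list whose feature spread stays within the threshold and that contains at least one motif from each of the k proteins. The data layout (pairs of protein id and feature dictionary) and the function names are hypothetical; NComb follows equation (5.2.1).

```python
from collections import Counter
from math import prod

def scanning_windows(motifs, feature, lam, k):
    """Maximal windows over the motifs sorted by one geometric feature.

    A window is kept when its first and last feature values differ by at
    most lam and it holds at least one motif from each of the k proteins.
    """
    ordered = sorted(motifs, key=lambda m: m[1][feature])
    windows, prev_end, end = [], -1, 0
    for start in range(len(ordered)):
        end = max(end, start)
        while (end + 1 < len(ordered)
               and ordered[end + 1][1][feature] - ordered[start][1][feature] <= lam):
            end += 1
        if end > prev_end:              # otherwise contained in an earlier window
            prev_end = end
            window = ordered[start:end + 1]
            if len({pid for pid, _ in window}) == k:
                windows.append(window)
    return windows

def ncomb(window):
    """Number of ways to pick exactly one motif per protein from the window."""
    counts = Counter(pid for pid, _ in window)
    return prod(counts.values())

# Toy bucket with two proteins A and B and made-up D values.
bucket = [("A", {"D": 1.0}), ("B", {"D": 1.5}), ("A", {"D": 1.8}),
          ("B", {"D": 5.0}), ("A", {"D": 5.2}), ("A", {"D": 9.0})]
windows = scanning_windows(bucket, "D", lam=1.0, k=2)
print([ncomb(w) for w in windows])
```

The same two functions can be applied again to each resulting window with the next feature and its threshold, which mirrors the multi-level structure of Figure 5.4.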

At the end of every tune-up iteration, we obtain a group of {m_δ} windows. By observing the NComb properties attached to them, we can decide the tune-up direction at this moment. If the sum of NComb is too small, e.g. less than a pre-defined value, which means the algorithm still has enough capacity to cover more possible alignment options, we gradually relax the threshold values and start the next tune-up iteration. On the other hand, the sum of NComb may be too large to be handled by the algorithm; in that case we trace back the largest scanning window, in terms of NComb, and its parent (the upper-level scanning window), tighten the threshold values, and start the next tune-up iteration in the opposite direction.

The above procedure keeps running automatically until a desired number of high-quality seed alignments and a suitable set of pruning threshold values are generated.

The advantages of dynamically adjusting the pruning thresholds lie in two aspects. First, a set of suitable pruning threshold values keeps optimal motif combinations and pushes aside less similar ones, which lets us avoid costly enumeration. Second, we can control the number of motif combinations (seeds) on which we will perform the actual alignment later. Usually, a set of strict pruning threshold values leaves fewer remaining seeds. In contrast, relaxed threshold values (obtained by applying the ↑ factor) increase the number of seeds but let us cover more possible alignment options.
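The relax-or-tighten decision can be sketched as below. The ranges, initial values and step sizes are copied from Table 5.2; everything else is an assumption made for illustration: the lower and upper bounds on the NComb sum, the choice of which threshold to adjust, and the use of the same step size for tightening as for relaxing are not specified in the text.

```python
# name: (minimum, maximum, initial value, up-arrow factor), copied from Table 5.2
THRESHOLDS = {
    "D": (1.0, 13.0, 1.0, 2.0),
    "s": (4, 56, 20, 4),
    "p": (0.05, 0.6, 0.05, 0.05),
    "theta": (0.05, 1.0, 0.05, 0.05),
    "delta": (4, 56, 18, 4),
}

def initial_thresholds():
    return {name: spec[2] for name, spec in THRESHOLDS.items()}

def adjust(values, name, relax):
    """Move one threshold by its step, clamped to its allowed range."""
    lo, hi, _, step = THRESHOLDS[name]
    moved = values[name] + step if relax else values[name] - step
    updated = dict(values)
    updated[name] = min(hi, max(lo, moved))
    return updated

def tune_step(values, total_ncomb, low, high, name):
    """Relax when too few combinations survive, tighten when too many do.

    low and high are hypothetical bounds on the sum of NComb.
    """
    if total_ncomb < low:
        return adjust(values, name, relax=True)
    if total_ncomb > high:
        return adjust(values, name, relax=False)
    return values

start = initial_thresholds()
print(tune_step(start, total_ncomb=3, low=10, high=1000, name="D")["D"])
```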

In addition, the pruning order shown in Figure 5.4 is not random. We decide it by the reliability of each feature, from high to low. For example, we observe alignment results from previous works and find that motifs with similar sizes are aligned better than those with divergent sizes. Therefore, we push feature s to a higher pruning level, where it has more power to control the number of seeds while having less chance of missing the optimal alignment. θ is generated from the least-squares lines of two fragments, and it is not very reliable since a fragment can be bent into many shapes in its physicochemical environment. Thus, we place it at a lower pruning level and expect a relaxed threshold value, which gives more coverage of possible alignment options.

5.2.3.2 Seeds pruning by biological constraints

So far, our seed selection method depends purely on geometric properties. However, protein conformations are dynamic in solvent and our measurements are not always accurate, so in some situations we have to relax our geometric constraints to cover loose but potentially optimal alignments. As a consequence, a large number of false positive seed candidates may emerge during this relaxation. To address this issue, we consider a different pruning scheme, which requires the potentially aligned amino acids to be of similar biological type. In particular, the different properties of amino acids result from variations in the structures of their R groups, which are often referred to as amino acid side chains. RASMOL [125] classifies amino acids according to their side chains, as shown in Figure 5.5. We adopt the following exclusive classifications: (1) surface/buried, (2) polar/hydrophobic, (3) acyclic/cyclic.

In section 5.2.2, we build an SML for a given protein dataset, with each type of spatial motif saved in a bucket. The method in section 5.2.3.1 detects a set of potentially optimal seed alignments by applying scanning window pruning to the motifs' geometric properties. Similarly, we can apply biological constraints (bio-constraints) as a supplementary pruning scheme for the extremely difficult situations arising in that section.

One of the difficult cases in the above pruning method is that the distributions of geometric properties may be highly skewed. For instance, we may get only a small NComb in one iteration, but once the thresholds are relaxed in the next iteration, huge NCombs suddenly appear in some scanning windows. In such a situation, bio-constraints can kick in right away and provide a more robust amino acid characteristic than an over-relaxed geometric property. Later in this chapter, we show the effects of geometric pruning as well as bio-constraints.

Although this strategy can effectively select alignment seeds from a potentially enormous number of motif combinations in most cases, we have to point out that it is still a heuristic approach that cannot guarantee the best solution. The unavoidable issue is that the pruning features are neither exact nor deterministic, so optimal alignment seeds may still be missed.

5.2.3.3 Alignment of candidate seeds.

After the pruning step, we obtain a set of candidate seeds, where each seed consists of k similar motifs, exactly one from each protein. Note that all detection and pruning tasks in the above sections are performed at the motif level, which is fast considering the relatively small number of spatial motifs in protein structures. From now on, we treat each candidate seed separately and align its member motifs to generate and identify the seed alignments satisfying the mRMSD criterion. The alignment of the spatial motifs involves identifying individual residue correspondences and, from these correspondences, calculating the superimposition that minimizes the mRMSD measure.

The beta-sheets possess relatively well-defined shapes. Thus, for those 9 β±β± categories, we simply select the smallest motif to be the central motif and slide it over the rest of the motifs in the candidate seed to generate gapless alignments. We then apply Quaternion transformation and rotation [126] based on the correspondences induced by each alignment and identify the seed alignments that satisfy the mRMSD criteria.

Figure 5.5: Amino acid property table cited from [125]. We use 3 pairs of exclusive properties and assign a binary value to each: (1) surface/buried (1/0), (2) polar/hydrophobic (1/0), (3) acyclic/cyclic (1/0). To quantify the 3 property pairs V_surface/buried, V_polar/hydrophobic and V_acyclic/cyclic in a motif, we sum up the property values of all amino acids and divide by the total number of amino acids. E.g., for a given motif {VAL, PRO, TYR, GLY, VAL, SER}, V_surface/buried = (0 + 1 + 1 + 1 + 0 + 1)/6 ≈ 0.67, V_polar/hydrophobic = (0 + 0 + 0 + 0 + 0 + 1)/6 ≈ 0.17 and V_acyclic/cyclic = (1 + 0 + 0 + 1 + 1 + 1)/6 ≈ 0.67.
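The worked example in the caption of Figure 5.5 can be reproduced with a few lines of Python. The lookup table below lists only the five residue types that occur in that example, with the binary values read off the sums in the caption; the full property table from [125] is not reproduced, and the names `PROPS` and `motif_profile` are hypothetical.

```python
# (surface, polar, acyclic) flags, taken from the worked example in Figure 5.5.
PROPS = {
    "VAL": (0, 0, 1),
    "PRO": (1, 0, 0),
    "TYR": (1, 0, 0),
    "GLY": (1, 0, 1),
    "SER": (1, 1, 1),
}

def motif_profile(residues):
    """Fraction of residues carrying each property: (V_surface, V_polar, V_acyclic)."""
    sums = [0, 0, 0]
    for res in residues:
        for i, flag in enumerate(PROPS[res]):
            sums[i] += flag
    return tuple(total / len(residues) for total in sums)

motif = ["VAL", "PRO", "TYR", "GLY", "VAL", "SER"]
print(tuple(round(v, 2) for v in motif_profile(motif)))
```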

For the rest of the motif categories, we utilize the contact windows of the motifs to assign the residue correspondences. The contact window (CW) of a motif is the part of the contact map that covers only the residues forming the motif. The alignment of two contact windows (CW1 and CW2) is found using the MaximumOverlap algorithm below. The contact windows are slid over each other, and each sliding window defines a gapless alignment between the two motifs. The algorithm returns the sliding window that maximizes the number of contacts common to both contact windows as induced by the alignment.

Algorithm 2: MaximumOverlap
Input: contact windows CW1, CW2
Output: bestS: sliding window with maximum overlap of contacts

maxContacts ← 0;
foreach sliding window s aligning CW1 and CW2 do
    count ← 0;
    foreach pair of overlapped cells do
        if both are contact cells then
            count++;
    if count > maxContacts then
        maxContacts ← count;
        bestS ← s;
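A runnable counterpart of the MaximumOverlap pseudocode is sketched below. It assumes that each contact window is a rectangular grid of booleans and that a sliding window is described by a row shift and a column shift of CW2 relative to CW1; this representation and the function name are illustrative choices, not part of the original description.

```python
def maximum_overlap(cw1, cw2):
    """Return ((dr, dc), count) for the shift that maximizes shared contact cells.

    Cell (i, j) of cw1 overlaps cell (i - dr, j - dc) of cw2. When several
    shifts tie, the first one found is kept, as in the pseudocode.
    """
    r1, c1 = len(cw1), len(cw1[0])
    r2, c2 = len(cw2), len(cw2[0])
    best_shift, max_contacts = None, 0
    for dr in range(-(r2 - 1), r1):
        for dc in range(-(c2 - 1), c1):
            count = 0
            for i in range(max(0, dr), min(r1, dr + r2)):
                for j in range(max(0, dc), min(c1, dc + c2)):
                    if cw1[i][j] and cw2[i - dr][j - dc]:
                        count += 1
            if count > max_contacts:
                max_contacts, best_shift = count, (dr, dc)
    return best_shift, max_contacts

# Toy windows: a diagonal run of three contacts against a run of two.
cw1 = [[False] * 4 for _ in range(4)]
for i in (1, 2, 3):
    cw1[i][i] = True
cw2 = [[True, False],
       [False, True]]
print(maximum_overlap(cw1, cw2))
```

The exhaustive scan is affordable here because contact windows of motifs are small compared with the full contact map.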

We consider each motif in a candidate seed as the central motif and calculate the pairwise alignments with each of the other motifs in the candidate seed. If a contact cell from the central motif’s contact window overlaps with a contact cell from every other motif, we note that there is a common correspondence involving a pair of amino acids from each protein. We repeat the alignment procedure, considering each of the motifs as the central motif, and seek the one that gives the maximum number of common correspondences. Based on these correspondences, the Quaternion transformations are calculated to obtain the mRMSD error of the alignment.

Figure 5.3d shows an example candidate seed from the αβ category, which includes 5 Serine Protease proteins represented in color. The largest set of common correspondences of the candidate seed is found to have size 34, which gives a seed alignment with an mRMSD of 0.44 Å.

5.2.4 Extending the seed alignments

Each seed alignment contains a small local geometrical motif common to all protein structures and can be used as a reference to rotate and translate the whole structures. However, an individual candidate seed may be too small to generate high-quality global transformations; e.g., an issue similar to the one shown in Figure 5.2 could occur if a small motif is located at the corner of the structural space. Furthermore, some of the seed alignments may induce the same global alignment, causing redundant computation. To alleviate these problems, we construct more reliable skeleton structures by merging compatible seed alignments.

In the ExtendSeed algorithm outlined below, a seed alignment si is enriched with the compatible correspondences from other seeds that have similar transformations. A correspondence is added to si so long as it does not conflict with a correspondence already present in si and its addition still maintains a structural superposition error below the threshold (mRMSD < ε).

Algorithm 3: ExtendSeed
Input: S: the set of seed alignments
Input: si ∈ S: the seed to be extended
Output: si: the extended seed

foreach sj ∈ S and sj ≠ si do
    if τj ≈ τi then                          // similar transformations
        foreach cp ∈ sj do                   // cp: residue correspondence
            if not Conflicts(cp, si) and mRMSD(si ∪ cp) < ε then
                si ← si ∪ cp

Each extended seed combines multiple motifs from the seed alignments and obtains longer high-quality correspondences which are distributed along the whole structural space. Therefore, a larger extended seed provides a more reliable base for the Quaternion transformation and induces a better global alignment with a larger core. In the sample shown in Figure 5.3e, the seed alignment is extended from 34 (0.44 Å) to 134 (1.0 Å) common correspondences.
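The control flow of ExtendSeed can be sketched in Python as follows. Several pieces are assumptions made only for this illustration: a correspondence is modeled as a tuple holding one residue index per protein, two correspondences conflict when they disagree but reuse a residue of the same protein, and the transformation similarity test and the mRMSD evaluation are passed in as caller-supplied functions (`similar` and `mrmsd`), since their details live elsewhere in the framework.

```python
def conflicts(cp, seed):
    """True when cp reuses a residue already matched by a different correspondence."""
    return any(cp != other and any(a == b for a, b in zip(cp, other))
               for other in seed)

def extend_seed(seeds, i, similar, mrmsd, eps):
    """Enrich seed i with compatible correspondences from seeds with similar transformations.

    similar(i, j) and mrmsd(correspondences) are hypothetical callables
    standing in for the transformation comparison and the error measure.
    """
    si = set(seeds[i])
    for j, sj in enumerate(seeds):
        if j == i or not similar(i, j):
            continue
        for cp in sj:
            if cp in si:
                continue
            if not conflicts(cp, si) and mrmsd(si | {cp}) < eps:
                si.add(cp)
    return si

# Toy run with two proteins; the error stub simply grows with the seed size.
seeds = [
    [(0, 0), (1, 1)],
    [(1, 2), (2, 2), (3, 3), (4, 4)],
]
extended = extend_seed(
    seeds, 0,
    similar=lambda i, j: True,
    mrmsd=lambda corr: 0.1 * len(corr),
    eps=0.45,
)
print(sorted(extended))
```

In the toy run, (1, 2) is rejected because residue 1 of the first protein is already matched, and (4, 4) is rejected because it would push the stub error to the threshold.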

5.2.5 Global alignment by EPO

The extended candidate sets provide correspondences for only certain sections (motifs) of the protein structures, from which pairwise translation and rotation matrices are generated. It still remains to find correspondences for the rest of the structure and optimize the transformations to maximize the alignment core and minimize the global mRMSD. We use the Enhanced Partial Order (EPO) curve comparison algorithm that we introduced in chapter 4 to find common superpositions of the transformed structures and optimize the global rigid-body alignment. The GlobalAlignment algorithm below iteratively performs the EPO process, where each iteration generates new correspondences and transformations, which are then used as input to the next iteration. The process is repeated until no improvement in alignment core size is obtained. Figure 5.3f shows the final alignment of 5 protein structures, where EPO finds a structural superposition of 243 correspondences with mRMSD = 1.15 Å.

Algorithm 4: GlobalAlignment
Input: PS: whole protein set to be aligned
Input: s: a candidate seed
Input: maxSize = 0: previous alignment size
Input: currSize = 0: current alignment size in every iteration
Input: POG = null: a POG object
Output: PSaligned: the aligned protein set

1  PS = τs(PS)                       // translate PS into new position by s
2  POG = EPOAlignment(PS)
3  currSize = POG.getFullNodes()     // get the # of nodes that contain residues from each protein
4  s = POG.getCorrespondences()
5  while currSize > maxSize do
6      maxSize = currSize
7      PS = τs(PS)                   // translate PS into new position by the updated s
8      POG = EPOAlignment(PS)
9      POG = POG.postMerge()
10     currSize = POG.getFullNodes()
11     s = POG.getCorrespondences()

By default, GlobalAlignment detects the optimal alignment on the whole input protein set. However, it can easily be extended to detect a subset of proteins that are more similar to each other than the whole input set. E.g., given the POG created in the last iteration of GlobalAlignment (see line 8), in which every residue from the input set is attached to a partial order node, we build a hierarchical tree with each protein as a leaf and combine the two structurally closest proteins (the two proteins that share the largest number of nodes in the POG) into an upper-level node. In this way, our algorithm gives a clear picture of the structural relationships within a data set.

5.2.6 Flexible alignments

Introducing flexibility to structural alignment becomes useful for two main reasons. Firstly, a protein may be present in multiple conformational states due to phosphorylation, interaction with other proteins, or ligand binding [127]. Secondly, especially divergent protein structures contain twists and bends in their structures which cannot be detected by rigid alignment alone. Because Smolign uses a bottom-up approach starting from local structural motifs, the method introduced thus far can naturally be extended to handle flexibility in alignments. Specifically, we achieve this by building multiple structural cores that cover different areas of the proteins, without requiring that they share the same rigid transformation. The final set of alignments generated in this way not only handles flexibility in the structures, but can also capture sequence order independent alignments.

The CollectFlexibleSeeds algorithm below outlines the process of identifying a complementary set of structural cores from the extended seed alignments produced in Section 5.2.4. In order to avoid testing an exponential number of different combinations of seeds, we use a heuristic cost measure to focus the grouping of the seeds toward combinations that include larger, complementary fragments. For each seed, we quantify the cost of combining it with other seeds by a mergeCost, defined as:

$$\mathrm{mergeCost}_i = \frac{\text{number of seeds conflicting with } \mathrm{seed}_i}{\text{size of } \mathrm{seed}_i} \qquad (5.2.2)$$

We sort the list of seeds by their mergeCost values and, starting with the seed that has the smallest mergeCost, we combine compatible seeds to cover as much of the proteins as possible. A new seed is combined with the collection of compatible seeds S′ only if its inclusion increases the coverage of the correspondence set by at least a minFragment threshold (minFragment = 4 is used as the default value). This ensures that the proteins are not over-fragmented in the final flexible alignment.

Algorithm 5: CollectFlexibleSeeds
Input: S = {si}: the set of extended seeds
Output: S′: collection of compatible extended seeds

Sort S in ascending order of mergeCost;
S′ ← {s0};
for i = 1 ... |S| do
    if mergeCost == 0 then                   // can be added without conflicts
        S′ ← S′ ∪ si
    else
        s′i ← si \ S′                        // residues not already covered
        if |s′i| ≥ minFragment then
            S′ ← S′ ∪ s′i
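The CollectFlexibleSeeds procedure and the mergeCost of equation (5.2.2) can be sketched as below. The sketch assumes that each extended seed is represented by the set of residue positions it covers and that two seeds conflict when these sets overlap; both the representation and the function names are hypothetical.

```python
def merge_cost(seed, seeds):
    """Number of other seeds overlapping this seed, divided by the seed size."""
    n_conflicts = sum(1 for other in seeds if other is not seed and seed & other)
    return n_conflicts / len(seed)

def collect_flexible_seeds(seeds, min_fragment=4):
    """Greedily collect compatible seeds in ascending order of mergeCost."""
    ordered = sorted(seeds, key=lambda s: merge_cost(s, seeds))
    collected = [set(ordered[0])]
    covered = set(ordered[0])
    for seed in ordered[1:]:
        if merge_cost(seed, seeds) == 0:      # can be added without conflicts
            collected.append(set(seed))
            covered |= seed
        else:
            rest = set(seed) - covered        # residues not already covered
            if len(rest) >= min_fragment:
                collected.append(rest)
                covered |= rest
    return collected

# Toy seeds given as sets of covered residue positions.
seeds = [set(range(0, 10)), set(range(8, 20)), set(range(18, 22)), set(range(30, 35))]
cores = collect_flexible_seeds(seeds)
print([sorted(c) for c in cores])
```

In the toy run, the third seed contributes only two uncovered residues, fewer than minFragment, so it is dropped, which is the over-fragmentation guard described in the text.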

After a collection of core alignments is obtained, each core is used to induce an optimized multiple alignment through EPO, as done in Section 5.2.5. Whenever a residue correspondence conflict arises between the assignments of different cores, the assignment of the larger core is kept. In order to spatially combine the transformations of multiple cores, we take the central protein structure from the first core in the collection as the rigid structure. The transformations of the other cores are calculated in reference to this central structure. The residues that do not have any correspondences are transformed using the transformation of the first core.

5.3 Experimental Evaluation Of Smolign

We performed a number of case-based and large-scale experiments to demonstrate the capability of Smolign to handle different challenges of MSTA problems. In section 5.3.1, we report the results on typical multiple alignment datasets from the literature and discuss how well Smolign handles different spatial data. In section 5.3.2, we describe a flexible alignment case in detail. In section 5.3.3, we provide a large-scale comparison with other MSTA methods using the Homstrad benchmark [128]. Additionally, we compare Smolign with a couple of newly developed methods in section 5.3.4. In section 5.3.5, we demonstrate the effectiveness of a few key techniques we introduced in the Smolign framework, including the selection of motif seeds, extended seeds and the EPO implementation for global alignment. Finally, in section 5.3.6, we briefly summarize the technical advantages of Smolign over selected typical MSTA methods. The experiments presented here, along with alignments from the BAliBASE [129] benchmark dataset, are made available on the supplementary website.

5.3.1 Sample Alignments

Five protein structural datasets are used to benchmark the performance of our algorithm (see Table 5.3). These datasets represent different structural folds, span different structural similarity levels, and have previously been used in the analysis of multiple structure alignment algorithms. The multiple alignment results for all 5 datasets are compared with those of other popular MSTA methods. In particular, we compare our results with CE-MC [97], Multiprot [71], MAMMOTH-mult [68], POSA [54], and MASS [75].

Data set                   Members   Average Size   PDB Codes
Set 1  Serine Proteases    5         277            1cseE 1sbnE 1pekE 3prkE 3tecE
Set 2  Calmodulin-like     3         161            1jfjA 1ncx 2sas
Set 3  Tim-barrels         7         391            1btc 1pii 1tml 4enl 5rubA 6xia 7timA
Set 4  Helix-Bundle        10        140            1flx 1aep 1bbhA 1bgeB 1le2 1rcb 256bA 2ccyA 2hmzA 3inkC
Set 5  OB fold             15        176            1afp 1b9nA3 1ckmA2 1esfA1 1fr3A 1jic 1tiiD 2tmp 1b7yB2
                                                    1bovA 1eif02 1fjgQ 1htp 1sro 2sns

Table 5.3: Protein data sets used for comparing structural alignment methods. Average Size is the average number of residues in the proteins in each data set.

We obtained the multiple alignments for each dataset using the online web service provided for these methods. Two key norms are used for comparing the results: NCORE, which is the length of the multiple alignment calculated as the number of amino-acid correspondences, and mRMSD, which is an indicator of the alignment quality.
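To make the two norms concrete, the following sketch computes them for a toy alignment. It rests on assumptions that are not taken from this section: the alignment is given as a list of columns holding one already superimposed (x, y, z) coordinate per protein (or None for a gap), NCORE counts the columns without gaps, and mRMSD is taken to be the average of the pairwise RMSD values over those columns. The definition of mRMSD used in this thesis is given where the measure is introduced and may differ in detail.

```python
import math
from itertools import combinations

def ncore(alignment):
    """Number of columns in which every protein contributes a residue."""
    return sum(1 for column in alignment if all(p is not None for p in column))

def mean_pairwise_rmsd(alignment):
    """Average pairwise RMSD over the fully aligned columns.

    `alignment` is a list of columns; each column holds one (x, y, z) tuple
    per protein (or None for a gap), with coordinates already superimposed.
    """
    core = [c for c in alignment if all(p is not None for p in c)]
    if not core or len(core[0]) < 2:
        raise ValueError("need at least one full column and two proteins")
    rmsds = []
    for i, j in combinations(range(len(core[0])), 2):
        squared = sum(math.dist(c[i], c[j]) ** 2 for c in core)
        rmsds.append(math.sqrt(squared / len(core)))
    return sum(rmsds) / len(rmsds)
```

A longer core and a smaller mRMSD usually pull in opposite directions, which is why both values are reported side by side in the comparisons that follow.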

The results for all methods are summarized in Table 5.4. The POSA algorithm provides two sets of results: flexible and non-flexible alignments. We use the non-flexible alignments for comparison here and use the flexible case in the next subsection. For the results from MAMMOTH, we count the number of “strict cores” as NCORE since the “loose cores” reported by MAMMOTH only align partial structures closely. Since Multiprot allows adjustment of its parameters and returns the most competitive results, we have adjusted its parameters to obtain an accuracy level that matches that of Smolign in order to make the NCORE comparison more meaningful. Specifically, accuracy values of 3.8 Å, 4.4 Å, 3.5 Å, 3.1 Å, and 3.0 Å were used for the Multiprot server for datasets 1-5, respectively.

Note that the main objective of our method is to obtain the longest alignment that satisfies a user-defined structural similarity threshold. In some cases, smaller but more conserved alignments may also be biologically important and of interest to the user. Therefore, in the available implementation we provide the top n final alignments, in decreasing order of the alignment lengths. For comparison with other methods, we report here only the top scoring alignment for each dataset in Table 5.4.

The complete set of alignments obtained by Smolign can be viewed and downloaded from the supplementary website.
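The selection of the reported alignments can be pictured as a filter followed by a sort. The sketch below is an assumption-laden illustration rather than the released code: the `Alignment` record, the function name, and the reading of "satisfies the threshold" as mRMSD not exceeding ε are all hypothetical.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    ncore: int     # alignment length (number of correspondences)
    mrmsd: float   # alignment quality in angstroms

def top_alignments(candidates, epsilon, n):
    """Keep alignments within the threshold and return the n longest.

    Ties in length are broken in favor of the lower mRMSD.
    """
    feasible = [a for a in candidates if a.mrmsd <= epsilon]
    feasible.sort(key=lambda a: (-a.ncore, a.mrmsd))
    return feasible[:n]
```

With n = 1 this returns the single longest feasible alignment, which corresponds to the entry reported per dataset in Table 5.4.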

The 5 proteins in Set 1 belong to the Subtilases family of subtilisin-like serine proteases, which have a common evolutionary origin and share highly similar structures and functional features [130]. All of the compared methods align these proteins reasonably well. Our method provides better alignments than CE-MC, POSA, and Multiprot. POSA has the maximum NCORE but incurs a large mRMSD cost. MAMMOTH and MASS generate more conservative alignments that align tightly but have smaller coverage. If the ε error threshold in Smolign is reduced from 3 Å to 2 Å in order to seek more conservative alignments, it is possible to obtain an alignment with NCORE = 230 and mRMSD = 0.89 Å, which is a longer alignment than that of MAMMOTH, with only a slightly worse mRMSD.

          CE-MC           POSA            MAMMOTH         Multiprot       MASS            Smolign
Data Set  Ncore  mRMSD    Ncore  mRMSD    Ncore  mRMSD    Ncore  mRMSD    Ncore  mRMSD    Ncore  mRMSD
Set 1     244    1.83 Å   252    2.08 Å   223    0.86 Å   237    1.29 Å   228    0.97 Å   245    1.14 Å
Set 2     62     5.80 Å   67     2.92 Å   15     1.64 Å   58     1.92 Å   50     1.4 Å    59     1.95 Å
Set 3     -      -        -      -        -      -        27     2.08 Å   30     2.00 Å   41     2.08 Å
Set 4     -      -        -      -        -      -        22     1.80 Å   15     1.80 Å   34     1.78 Å
Set 5     -      -        -      -        -      -        9      1.27 Å   -      -        13     1.74 Å

Table 5.4: Comparison of multiple structure alignment methods on sample alignment datasets. In order to obtain comparable results with other methods, a similarity threshold of ε = 3 Å was used in Smolign. "-" indicates that the respective server did not return any results.

Set 2 has only 3 proteins (PDB: 1ncx, 1jfjA, and 2sas), but the aligned motifs are very diverse. CATH [57] classifies 1ncx and 2sas as having one alpha-helical domain and 1jfjA as having two alpha-helical domains. The alignments produced by each method are shown in Figure 5.6. CE-MC and POSA return alignments with inferior mRMSD scores, without significant improvement in coverage over other methods. Our method, Multiprot, and MASS align the same domain regions, where our alignment is comparable in both norms to that of Multiprot. MASS gives a smaller core and a better mRMSD. MAMMOTH, as in Set 1, finds a very small conservative core with a worse mRMSD than MASS. We are again able to control the accuracy of our results by seeking more conservative alignments that satisfy a smaller mRMSD threshold, and obtain an alignment with NCORE = 48 and mRMSD = 1.4 Å when ε = 1.7 Å, which is comparable to the output of MASS. The Smolign alignment is shown in Figure 5.6f. The differences in the alignments of this dataset are mainly due to the fact that the progressive pairwise alignment procedure prevents some methods from finding the best alignment. While the proteins 1ncx and 2sas are most similar at the EF-hand calcium binding domain (cd00051 in the Conserved Domain Database [131]), 1jfjA and 2sas are most similar at the long alpha-helical segment that connects the two EF-hand domains. An initial alignment of 1jfjA and 2sas, having better global similarity than the other two pairwise alignments, prevents the EF-hand domains of all three proteins from being aligned properly. The center-star alignment procedure used in Multiprot and the non-progressive alignment methodology of MASS and Smolign avoid this pitfall and give better results. MASS and Smolign capture the common EF-hand domain by using the alignment seeds from the EF-hand region and, considering all of the proteins simultaneously, extending these seeds to obtain the final alignment core.

(a) CE-MC (b) POSA (c) MAMMOTH (d) Multiprot (e) MASS (f) Smolign

Figure 5.6: Multiple structure alignments of Set 2 Calmodulin-like proteins by different methods. Each protein is shown in a different color: 1jfj, yellow; 1ncx, red; and 2sas, green. The thick blue portions of the backbones indicate the aligned residues. The CE-MC alignment provides the superposed structures, but not the residue correspondences.

Set 3, the Tim-barrel proteins, contains 7 complex structures. Each structure has multiple alpha-helices and beta-strands, creating a large number of potential alignment combinations. CE-MC, POSA, and MAMMOTH fail to produce an alignment. Our algorithm not only outperforms both Multiprot and MASS, but also produces an alignment with better spatial continuity. Figure 5.7 shows that Multiprot aligns fewer structural fragments, whereas MASS produces an over-fragmented alignment core, and only Smolign captures the most complete set of structural fragments, including 3 alpha-helical segments and 4 beta-strands. Note that the Tim-barrel proteins usually contain their enzymatic active sites on the loop regions, frequently on the C-terminal end of the sheets. While it is desirable to detect such functional residues, they are not part of the conserved structural core of the proteins and are not detected by multiple structure alignment methods. Methods based on residue conservation [132] are more appropriate for such an analysis.

Set 4 contains helix-bundle proteins selected from 6 superfamilies, whose skeleton includes four closely packed alpha-helices. It presents a challenge for MSTA methods because of the large dataset size and its structural divergence. CE-MC, POSA, and MAMMOTH again fail to report an alignment. The MASS alignment contains a very short helix pair, whereas Multiprot reports either a single long helix or a shorter helix pair depending on the chosen parameters. Smolign outperforms both methods in both norms: it finds a longer alpha-helix pair and a higher quality alignment. The Smolign alignment takes under 8 minutes for this dataset.

(a) Multiprot (b) MASS (c) Smolign (d) Core fragments

Figure 5.7: A closer look into the alignments produced by Multiprot, MASS, and Smolign for data set 3, Tim barrels. We only show the complete structure of PDB:4enl as a blue trace. In (d), a helix or strand is considered to be a fragment if its alignment spans more than 5 amino acids and the gaps within the fragment are less than 2.
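The fragment criterion stated in the caption of Figure 5.7 can be expressed as a short routine. The sketch makes two interpretive assumptions: the aligned positions of one secondary structure element are given as a sorted list of residue numbers, and "gaps less than 2" is read as every run of unaligned residues inside a fragment being shorter than 2 residues. The function name and its parameters are hypothetical.

```python
def split_fragments(aligned, min_span=6, max_gap=1):
    """Group sorted aligned residue numbers into fragments.

    Consecutive residues stay in one fragment while the number of skipped
    residues between them is at most `max_gap`; fragments whose span
    (last - first + 1) is below `min_span` are discarded.
    """
    fragments, current = [], []
    for res in aligned:
        if current and res - current[-1] - 1 > max_gap:
            fragments.append(current)
            current = []
        current.append(res)
    if current:
        fragments.append(current)
    return [f for f in fragments if f[-1] - f[0] + 1 >= min_span]
```

With the defaults, a run such as residues 10-16 with residue 13 unaligned counts as one fragment, while an isolated stretch of three aligned residues is dropped.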

Set 5 is a very large data set of OB-fold proteins, serving as a stress test for the multiple alignment programs, and the similarity among the proteins is extremely low (7% average sequence identity). It is commonly used as a special case to test the sensitivity of MSTA methods. Only our method and Multiprot survive the strain, giving comparable NCORE and mRMSD trade-offs. The common fold of the OB (oligonucleotide/oligosaccharide binding)-fold proteins has a five-stranded beta-barrel, capped by an alpha helix [133]. Multiprot finds an alignment involving only two of these beta-strands. Smolign is able to align three of these beta-strands common among the 15 proteins in the dataset, at an execution time of 40 minutes.

5.3.2 Flexible Alignments

The flexible alignment feature of Smolign is demonstrated here using data set 2, Calmodulin-like proteins. These proteins are composed of two distinct components separated by a long and flexible alpha helix. Due to bending of this alpha-helical segment, it is not possible to simultaneously align the two sub-structures by a rigid alignment (Figure 5.6f). The best rigid alignment of Smolign aligns 59 residues from the C-terminal domain with an mRMSD of 1.95 Å. Using this alignment as the anchor, we aggregate compatible cores as described in Section 5.2.6 to obtain a flexible alignment shown in Figure 5.8b.

The flexible alignments produced by POSA and Smolign show comparable coverage and quality metrics, while Smolign achieves a less fragmented alignment (Figures 5.8a and 5.8b). The main difference between the flexible alignment results comes from the philosophy of applying flexibility. POSA and other MSTA algorithms tend to bend a sequence of fragments multiple times to gain better core size and mRMSD at the cost of losing structural integrity between aligned fragments. Smolign, on the other hand, strictly maintains the spatial consistency of each aligned core while optimizing for core size and mRMSD. The POSA flexible alignment in Figure 5.8a breaks the PDB:1ncx structure at 4 locations and does not preserve the spatial relationship of the fragments. In contrast, the Smolign alignment (Figure 5.8b) consists of only 2 cores whose spatial arrangement is more faithful to the conformation of the structures being aligned, and it readily yields the interpretation that a single flexible alpha-helical segment is responsible for the structural differences among these proteins.

5.3.3 Homstrad Benchmark

The Homstrad [128] benchmark dataset contains manually curated pairwise and multiple alignments of highly homologous proteins. The similarity of the aligned proteins is comparable to that of the family level in the SCOP [134] hierarchical classification database. Following the experiments by [72] and [54], we use the 399 Homstrad alignments that have more than two structures to illustrate the performance of Smolign.

The coverage and accuracy of the rigid alignments obtained by Smolign are found to be comparable to those of other methods (Table 5.5). MATT, POSA, and Smolign give similar overall results, with Smolign giving slightly longer alignments with comparable or better mRMSD. MUSTANG performs worse than others in both mRMSD and core size. Multiprot alignments are more conservative and do not capture the extent of structural fold similarity of the aligned proteins.

(a) POSA flexible (b) Smolign flexible

Figure 5.8: Rigid and flexible alignments of dataset 2, Calmodulin-like proteins. The rigid/seed core is shown in thick blue trace in each subfigure. (a) Each structure in the rigid alignment is shown in a different color. (b and c) Each alignment core in the flexible alignment is shown in a different color. Blue portion is the alignment core without bending, other colors show alignments after bending. Only 1ncx is shown in full to provide a perspective of the whole structure. The residues of 1jfjA and 2sas that are not part of the alignment are omitted for clarity. Bending occurs on the conjunction points of different colors.

While the results for highly similar Homstrad families were consistent among all the methods, Smolign performed comparably to or better than other methods on less similar datasets, such as the seatoxin dataset, whose members do not include distinct secondary structure elements but are composed of many coils and turns.

Method               Avg. mRMSD   Avg. Core Size
MATT                 2.04         172
Multiprot            1.35         142
MUSTANG              2.67         171
POSA (rigid)         2.00         165
POSA (flexible)      2.22         168
Smolign (rigid)      2.05         174
Smolign (flexible)   2.00         177

Table 5.5: Multiple alignment results for the Homstrad benchmark. mRMSD and core size are averages of all Homstrad datasets. The results (except for those of Smolign) are taken from [72].

Furthermore, the Smolign flexible alignments are particularly effective at detecting multiple concurrent structural motifs while maintaining the spatial continuity of the aligned segments. Comparison of flexible and rigid alignments of the HOMSTRAD datasets identifies 57 cases of flexible alignments. The average coverage of the Smolign rigid alignments for these 57 sets was 201 residues (mRMSD = 2.19 Å). The flexible alignments increase the coverage by 10% (Ncore = 221 residues, mRMSD = 2.17 Å), with an average of 2.2 bends introduced in each alignment. The rigid and flexible Homstrad alignment results can be accessed on the supplementary web page.

Running time. The execution of Smolign on the Homstrad families takes from seconds to hours, depending on the number, length, and divergence of the structures being aligned and the number of candidate seeds detected for the specified error threshold. Since a rigorous running-time comparison with other methods is not possible due to the unavailability of their software distributions, we summarize the running time of only Smolign in Figure 5.9. Smolign takes under 1 minute to align 70% of the families and under 10 minutes to align 92% of the families. Of the 8 families that take more than 1 hour to align, 5 families (Homstrad codes: Cyclodex-gly-tran, histone, kunitz, HLH, and RRF) induce a large number of candidate cores to evaluate; 2 families (alpha-amylase and alpha-amylase-NC) include a large number of very long peptide chains; and the remaining rhv family involves isolated secondary structures that could not be captured in the SML stage, which forces EPO to execute more iterations to combine the motifs into an optimized rigid alignment.

Figure 5.9: Running time distribution on 399 Homstrad families. All experiments were performed on an Intel Quad Core 2.66 GHz PC with 4 GB RAM.

5.3.4 Additional Datasets

We have presented above the performance of Smolign on a set of commonly used multiple structure alignments and on the Homstrad database. We have also compared the alignments obtained by Smolign against those of some of the popular multiple structure alignment methods. Additional datasets that have been used to benchmark structural alignment methods include SISYPHUS [135], SABmark [136], and BAliBASE [129]. A comprehensive evaluation of the available methods and datasets is beyond the scope of the current study and is left as a future exercise. In this section, we compare Smolign to two of the more recent multiple structure alignment methods, namely MISTRAL [137] and MAPSCI [138].

The MISTRAL structure alignment method [137] uses a piecewise-linear sigmoidal weight function to reward short separations of pairs of amino acids from proteins. A simulated annealing based search over the relative orientations of the proteins is then performed to obtain the translation and rotation matrices that minimize this energy function. MISTRAL follows a center-star multiple alignment approach by first computing all-pairwise structure alignments and then assigning one of the proteins as the pivot protein to which other proteins are aligned.
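For readers unfamiliar with the center-star scheme, the choice of the pivot can be sketched as below. The selection rule shown (the structure with the highest total pairwise similarity becomes the pivot) is a common convention and is assumed here purely for illustration; the criterion that MISTRAL actually applies is described in [137]. The score matrix and function names are hypothetical.

```python
def choose_pivot(scores):
    """Return the index of the structure with the largest summed pairwise score.

    `scores[i][j]` is a symmetric similarity between structures i and j;
    diagonal entries are ignored.
    """
    n = len(scores)
    totals = [sum(scores[i][j] for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=totals.__getitem__)

def center_star_order(scores):
    """Pivot first, then the other structures by decreasing similarity to it."""
    pivot = choose_pivot(scores)
    others = sorted((i for i in range(len(scores)) if i != pivot),
                    key=lambda i: scores[pivot][i], reverse=True)
    return [pivot] + others
```

Because every other structure is aligned only against the pivot, the quality of a center-star alignment depends strongly on this single choice, which is relevant to the protein-centric versus motif-centric comparison discussed in this section.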

The performance of MISTRAL for multiple structure alignments has been demonstrated for four datasets [137]. The first two datasets contain two sets of globins previously considered in [73], and the last two datasets are two groups of proteins from the Homstrad database. The structural alignments generated by Smolign using the default parameters are compared with those reported for MISTRAL (shown in Table 5.6). MISTRAL has a reported tendency to generate smaller alignments than other methods [137], and this is also observed for datasets 1 and 4 when compared with Smolign. The alignments produced by MISTRAL and Smolign are similar for Set 3, with Smolign giving a slightly longer alignment. Note, however, that Smolign gives a significantly longer alignment with a better mRMSD for Set 2. The residue correspondences reported by MISTRAL are a subset of those reported by Smolign (Figure 5.10). We attribute the insufficient expansion of the MISTRAL alignment to its protein-centric pairwise evaluation strategy, compared to the motif-centric all-inclusive evaluation used in Smolign. The additional alpha helices and turns detected by Smolign and the reduced mRMSD are due to the candidate expansion and alignment optimization stages followed in Smolign.

            MISTRAL           Smolign
Data Set    Ncore  mRMSD      Ncore  mRMSD
Set 1       136    1.4 Å      140    1.51 Å
Set 2       72     2.1 Å      99     1.89 Å
Set 3       100    0.7 Å      103    0.71 Å
Set 4       54     2.0 Å      69     2.84 Å

Table 5.6: Comparison of multiple structure alignments obtained by MISTRAL and Smolign on four datasets considered in [137].

MAPSCI [138] is another recent method employing a center-star approach to construct the multiple alignment. The method is quite similar to that described in [139], with the main difference being that MAPSCI works on the Cα coordinates directly, whereas [139] translates the backbone vectors to the origin. Both of these methods work on a consensus pseudo-structure as the average of the proteins being aligned. The sum of the pairwise distances between this consensus structure and each protein in the set is then iteratively minimized to obtain the final alignment.

MAPSCI is reported to produce alignments that compare favorably with the alignments produced by MAMMOTH [67] and MATT [72]. The measurement of the core RMSD is different in MAPSCI than the mRMSD measurement reported here, making a direct comparison of the alignment quality difficult. On the other hand, Smolign generally produces alignments with greater coverage than MAPSCI. On a set of 232 HOMSTRAD families considered in [138], MAPSCI produces alignments with an average coverage of 71% (expressed in percent of the length of the shortest protein in each HOMSTRAD family), whereas Smolign produces alignments with an average coverage of 85%.

(a) MISTRAL (b) Smolign (c) difference

Figure 5.10: Multiple alignments produced by (a) MISTRAL and (b) Smolign on the dataset of globins from [137]. Residues that are part of the detected alignment are shown in blue. (c) Residues considered part of the alignment by Smolign but not MISTRAL are highlighted in blue.

5.3.5 Effects of a few Key Techniques

In the above sections, we have shown the success of the Smolign framework by comparing it with other popular methods on benchmark datasets. To better understand the performance of Smolign itself, we need to look inside the framework and observe how effectively each component achieves its goal. Below, we demonstrate the effectiveness of a few novel key techniques.

5.3.5.1 Seeds selection

In section 5.2.2, we introduced the concept of the spatial motif library (SML), which collects spatial motifs from the given protein structures and assigns each of them to a bucket according to its motif type. In section 5.2.3, we introduced a heuristic method to generate optimal seed alignments. Both the SML and the seed selection method are fundamental to the Smolign framework, and their quality and efficiency determine the final global alignment result. It is therefore instructive to trace their behavior in detail on a demonstration dataset. Table 5.7 lists the number of spatial motifs in each of the 17 buckets detected from the 5 given protein structures. In the generated SML, 5 buckets contain motifs from every structure (Smolign omits the single-SSE and irregular-shape buckets since enough linked motifs are detected), and the number of potential alignment combinations is huge; e.g., the αα bucket has 17 × 17 × 20 × 8 × 8 = 369,920 possible alignments. However, by dynamically adjusting the pruning thresholds, our seed selection method automatically prunes most of the less similar seed alignments. In particular, we only need to perform seed alignments on 164 candidates in the αα bucket, which is merely about 0.044% of the original possible combinations.

          # of spatial motifs                   Candidates to be checked:                 Seeds after
Bucket    1cseE  1sbnE  3tecE  3prkE  1pekE     # of combinations, pruning thresholds     selection
(SML)                                           (Dλ, sλ, pλ, θλ, δλ)
αα        17     17     20     8      8         164 (6, 28, 0.4, 0.4, 54)                 151
αβ+       51     49     41     26     26        934 (4, 6, 0.15, 0.15, 40)                655
αβ−       8      7      7      12     11        180 (4, 28, 0.4, 0.4, 54)                 103
β+β+1     1      1      1      0      0         -                                         -
β+β+2     4      4      3      3      3         76 (−, 56, 0.6, −, 56)                    74
β+β+3     2      2      1      0      0         -                                         -
β+β−1     0      0      0      0      0         -                                         -
β+β−2     0      0      0      0      0         -                                         -
β+β−3     1      1      1      1      1         1 (−, 56, 0.6, −, 56)                     1
β−β−1     0      0      0      0      0         -                                         -
β−β−2     0      0      0      0      0         -                                         -
β−β−3     1      1      0      1      1         -                                         -
α         8      8      9      6      6         -                                         -
β+        10     10     8      6      6         -                                         -
β−        3      3      2      4      4         -                                         -
L         0      0      0      0      0         -                                         -
F         12     12     13     15     17        -                                         -

Table 5.7: SML data and seeds selection in set1 from section 5.3.1.
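The size of the search space for the αα bucket can be re-derived from the motif counts in Table 5.7. The helper below is a hypothetical convenience function written for this check; only the counts and the number of checked candidates come from the table.

```python
from math import prod

def pruning_ratio(motif_counts, candidates_checked):
    """Return (total combinations, fraction of them kept as candidates)."""
    total = prod(motif_counts)
    return total, candidates_checked / total

# αα bucket of set 1: motif counts per structure and candidates left to check.
total, fraction = pruning_ratio([17, 17, 20, 8, 8], 164)
print(total)              # 369920
print(f"{fraction:.3%}")  # 0.044%
```

The same function applied to the αβ+ row (51, 49, 41, 26, 26 motifs and 934 candidates) shows an even stronger reduction, since the number of combinations grows multiplicatively with the motif counts.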

5.3.5.2 Effect of bio-constraints

When an extreme dataset with a large number of similar motifs is encountered (which may be caused by inaccurate measurement), the bio-constraints pruning described in section 5.2.3.2 automatically engages to extend the pruning process. In our experiments, bio-constraints were engaged in 18 of the 399 Homstrad benchmark tests (see section 5.3.3). We analyze one example in detail below.

Family "grs" (see the detailed data on our website) contains 11 structures, and its seed selection results are shown in Table 5.8. After purely geometric pruning, 6418 potential seed candidates remain, and the costly local alignment on them takes more than 20 minutes in our experiment. However, the bio-constraints pruning engages and removes another 4254 seeds. Taking a closer look at the dataset, we find that each structure is composed of a large number of α-helices and β-sheets, some of which are buried inside the structure. In this case, geometric properties alone cannot distinguish the surface motifs, such as αα, αβ+, and αβ−, from the interior ones; only bio-constraints can.

Bucket    # of combinations after geometric pruning,    # of combinations after
(SML)     with thresholds (Dλ, sλ, pλ, θλ, δλ)          bio-constraints
αα        1025 (10, 20, 0.25, 0.25, 49)                 434
αβ+       1400 (11, 6, 0.15, 0.15, 40)                  355
αβ−       2027 (7, 28, 0.4, 0.4, 54)                    218
β+β+1     134 (−, 46, 0.6, −, 56)                       134
β+β+2     1 (−, 46, 0.6, −, 56)                         1
β−β−2     1831 (8, 28, 0.25, 0.25, 49)                  1022
Total:    6418                                          2164

Table 5.8: GRS family in Homstrad: seeds selection with geometric properties and biological constraints.
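The totals in Table 5.8 and the number of seeds removed by the bio-constraints can be checked with a few lines. The row values are copied from the table; the variable and function names are hypothetical.

```python
# Rows copied from Table 5.8:
# (bucket, combinations after geometric pruning, combinations after bio-constraints)
grs_rows = [
    ("αα", 1025, 434),
    ("αβ+", 1400, 355),
    ("αβ−", 2027, 218),
    ("β+β+1", 134, 134),
    ("β+β+2", 1, 1),
    ("β−β−2", 1831, 1022),
]

def pruning_summary(rows):
    """Return totals before/after bio-constraints and the number of seeds removed."""
    before = sum(geo for _, geo, _ in rows)
    after = sum(bio for _, _, bio in rows)
    return before, after, before - after

before, after, removed = pruning_summary(grs_rows)
print(before, after, removed)  # 6418 2164 4254
```

Roughly two thirds of the geometrically admissible candidates are removed, and the reduction is concentrated in the αα, αβ+, αβ−, and β−β−2 buckets, while the two β+β+ buckets are left unchanged.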

5.3.5.3 Extended seeds

Seed extension (section 5.2.4) improves the input to the quaternion-based transformation and rotation computation by providing a set of corresponding residues that is spread out across the geometric space of the structures. For example, the extended seed of set 1 that later leads to the best global alignment is composed of correspondence chunks coming from 12 seeds (see Figure 5.11), and Figure 5.3e shows the distribution of all correspondences.

5.3.5.4 EPO iterations

In the Smolign framework, the iterative EPO procedure shown in Algorithm 4 is responsible for the global alignment. In Figure 5.12, we plot the EPO iterations on the 5 demonstration datasets used previously. All of the solid lines in the figure rise monotonically from the left side (seed alignment) to the right side (final global alignment), which means that each EPO iteration tries to overcome the local minimum of the previous one by utilizing a more widely spread transformation base. The dotted lines represent the changes in mRMSD during the iterations. In particular, the mRMSD of the last EPO iteration is always less than or equal to that of the previous one because the post-merging step is applied to the last POG.

Figure 5.11: An extended seed in set1. The leftmost column is the base seed with type αα and 32 correspondences. The rest are additional seeds and are marked with their types and the correspondences selected from each.

Figure 5.12: EPO iterations on 5 demo datasets. Each dataset has two plots of the same color: the solid line represents the change in core size during the EPO iterations and the dotted line shows the mRMSD variance after each iteration. We normalize the Nsize by dividing by the length of the shortest protein sequence in the dataset, and the mRMSD by dividing by the ε threshold used in the dataset.
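The normalization described in the caption of Figure 5.12 amounts to two element-wise divisions. The following sketch shows it for one dataset; the function name is hypothetical, and any per-iteration values passed to it would have to come from an actual EPO run.

```python
def normalize_trace(core_sizes, mrmsds, shortest_length, epsilon):
    """Scale per-iteration core sizes and mRMSD values to comparable ranges.

    Core sizes are divided by the length of the shortest protein in the
    dataset, and mRMSD values by the epsilon threshold used for the dataset.
    """
    if shortest_length <= 0 or epsilon <= 0:
        raise ValueError("shortest_length and epsilon must be positive")
    sizes = [n / shortest_length for n in core_sizes]
    errors = [r / epsilon for r in mrmsds]
    return sizes, errors
```

After this scaling, datasets with very different protein lengths and thresholds can be drawn on a common axis, with the normalized core size bounded by 1 because a core cannot be longer than the shortest protein.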

5.3.6 Summary

We compared the multiple alignments generated by Smolign with those produced by other multiple structure alignment methods, namely CE-MC [97], MAMMOTH-mult [68], Multiprot [71], POSA [54], and MASS [75]. All of the above methods have been discussed in detail in chapter 3. In particular, Table 5.9 summarizes the characteristics of each method that cause the differences in the alignment results.

When initially choosing the local unit sub-structure, both the AFP used in CE-MC, Multiprot, and POSA and the unit vector used in MAMMOTH are sequence fragments that contain no spatial information and cause large redundancy in the subsequent local similarity detection step. Smolign differs from these multiple structure alignment methods mainly in its use of contact windows as the main representation of proteins. Contact windows, which are less restrictive than backbone segments of predefined lengths, generate spatial sub-structures. In other words, such a spatial sub-structure is an amino acid interaction core that already delivers a crucial subset of our final alignment result.

To detect local sub-structure similarity, all methods except CE-MC and MAMMOTH-mult, the two progressive approaches, adopt a simultaneous approach, by which they partially overcome the local minimum issue. The impact can be easily observed in Table 5.4, in which the Ncore values of the 4 simultaneous approaches are consistently better than those of the other two. Additionally, the analysis shows that the aligned positions are also more consistent among the 4 simultaneous approaches than between the 2 progressive methods.

One aspect in which Smolign excels over other MSTA methods is core selection in a simultaneous manner. The results in Table 5.4 demonstrate that methods using a simple progressive heuristic approach, such as Multiprot and MASS, still have trouble with the local minimum issue when handling diverged or multi-core protein sets. In contrast, its dynamic and simultaneous approach gives Smolign the ability to minimize the effect of the local minimum issue and to locate the optimal alignment seed in most situations.

The global alignment method employed in Smolign is similar to that of POSA. Both consider all of the protein data together, and further reduce the effect of the local minimum issue caused by the guide-tree based approaches. The refinement step used in Smolign is comparable in its nature to the Partial Order Graph (POG) search used in POSA, except that using the 3D position data of amino acids allows additional opportunities for detecting extra aligned positions and for reducing the final mRMSD through the post-merge step described in Algorithm 4 above. Note that Smolign employs the EPO algorithm to refine and extend a multiple alignment of all of the proteins, whereas POSA employs POG search at each of its pairwise iterations.

Methods/Characteristics         CE-MC       MAMMOTH-mult        Multiprot                POSA                MASS                     Smolign
unit sub-structure              AFP         unit-vector         AFP                      AFP                 SSE                      linked SSE & irregular cores
local similarity detection      pair-wise   pair-wise           simultaneous             simultaneous        simultaneous             simultaneous
core selection                  N/A         N/A                 simple & progressive     N/A                 simple & progressive     dynamic & simultaneous selection
global alignment                guide tree  guide tree          center-star (recursive)  POG                 center-star (recursive)  EPO
post process                    N/A         local extension     globally iterative       local extension     N/A                      iterative EPO & post-merge
                                            at pair-wise stage  alignment                at pair-wise stage
sub-set alignment               N/A         N/A                 Yes                      Yes                 Yes                      Yes
extra properties: flexibility   N/A         N/A                 N/A                      Yes                 N/A                      Yes

Table 5.9: Comparison of 6 methods.

As a by-product of simultaneous local similarity detection, sub-set alignment is automatically available in all methods except CE-MC and MAMMOTH-mult, as described above. In addition to the sub-set alignments produced by the local similarity detection step, POSA and Smolign can also generate sub-set alignments on top of the POG structure. In summary, sub-sets obtained from the local similarity detection step help to find different core alignments arising from multi-core proteins, whereas sub-sets obtained from the POG can enlarge the major core that is common to the whole dataset as well as detect cores shared by a subset of the proteins.

Finally, only POG-based methods such as POSA and Smolign are able to generate flexible structure alignments. However, Smolign maintains the relative spatial positions within an alignment core, which guarantees that the important biological information is preserved after a protein structure is bent. Furthermore, Smolign also allows sequence-independent flexible alignment, according to the analysis in section 5.2.6.

CHAPTER 6

DISCUSSION AND FUTURE RESEARCH

This chapter summarizes the accomplishments described in this thesis and outlines future work to be done on our Smolign framework.

6.1 EPO algorithm

There are two main differences between the MCC problem that we are interested in and the traditional MSTA problem. First, in the case of protein structures, it is usually explicitly or implicitly assumed that the (majority of the) input proteins belong to one family, or at least share some relations (how to classify a set of input structures into different families is a related problem, and many such classifications exist [14, 57, 134]). As such, one can expect that some consensus of the family should exist. However, when comparing protein folding trajectories, the set of curves comes from a set of simulations including both successful and unsuccessful runs, and we wish to classify this diverse set of curves and capture common features within as well as across its sub-families. Second, and more importantly, the level of similarity existing in these folding trajectories is usually much lower than that in a family of related proteins. Hence we aim at an algorithm with high sensitivity, which is able to detect small-scale partial similarity and to handle multi-dimensional trajectories as well.

In chapter 4, we propose and develop a sensitive MCC algorithm, called the EPO algorithm, to compare a set of diverse high dimensional curves. Our algorithm follows a framework similar to that of the POA algorithm [53, 54] to encode the similarities of aligned curves in a partial order graph instead of in a linear structure used by many traditional MSTA algorithms. This has the advantage that, other than similarities among all curves, similarities among a subset of input curves can also be encoded in this graph. See Figure 4.3 for an example, where nodes in both graphs represent a group of aligned points from input curves.

For the more important problem of sensitivity, we observe that, being a greedy approach, the progressive MSTA framework tends to be inherently insensitive to low levels of similarity: if one early local decision is wrong, it may completely miss a small-scale partial similarity, a problem widely recognized as the local minimum issue. To improve this aspect of the progressive framework, we first propose a novel two-level scoring function to measure similarity, which, together with a clustering idea, greatly enhances the quality of the local pairwise alignment produced at each round. We then develop an effective merging step to post-process the obtained alignments. This step helps to reassemble vertices (high dimensional points) from input curves that should be matched together but were scattered across several sub-clusters in the alignments because of earlier non-optimal decisions. Both techniques are general and can be used to improve the performance of many existing MSTA algorithms. Experimental results show that our MCC algorithm is highly sensitive and able to classify input curves. We also demonstrate the power of our tool in mining critical events from protein folding trajectories using a detailed case study of the miniprotein Trp-cage. Additionally, our tool is very efficient, which is essential for processing the massive folding simulation data produced in a high-throughput manner. Hence it could potentially be used in combination with current protein folding simulation methods to discover folding tendencies in real time and to monitor the status of running simulations.
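The idea behind the merging step can be illustrated with a small Python sketch. This is a simplified stand-in, not the procedure of chapter 4: it assumes that each sub-cluster is a mapping from a curve id to one point, that two sub-clusters may be merged only if they contain no common curve, and that closeness is judged by the distance between centroids. The function names and the `max_dist` parameter are hypothetical, and the ordering constraints of the partial order graph are ignored here.

```python
import math


def centroid(group):
    """Mean position of the points in one sub-cluster."""
    pts = list(group.values())
    dim = len(pts[0])
    return tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))


def merge_groups(groups, max_dist):
    """Greedily merge sub-clusters that share no curve and lie close together."""
    groups = [dict(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                if a.keys() & b.keys():
                    continue  # a curve contributes at most one point per group
                if math.dist(centroid(a), centroid(b)) <= max_dist:
                    a.update(b)
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Hypothetical scattered sub-clusters from three curves A, B and C.
scattered = [
    {"A": (0.0, 0.0), "B": (0.2, 0.0)},
    {"C": (0.1, 0.1)},
    {"A": (5.0, 5.0)},
    {"B": (5.1, 5.0), "C": (5.0, 5.2)},
]
result = merge_groups(scattered, max_dist=0.5)
```

With these toy inputs the four scattered sub-clusters collapse into two groups, each containing one point from every curve.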

Although our EPO algorithm was developed with the goal of comparing folding trajectories, the algorithm is general and can be applied to other domains as well, such as protein structures or pedestrian trajectories extracted from surveillance videos [1]. In particular, in chapter 5, we demonstrate this generality by integrating the EPO algorithm into a protein structure alignment framework, namely the Smolign algorithm. In Smolign, the EPO algorithm is used to improve the results of existing multiple protein structure alignment algorithms, especially when the input proteins share low structural similarity.

6.2 Smolign Framework

As the major contribution of this thesis, we have presented Smolign as a complete solution for the multiple protein structure alignment (MSTA) problem. The framework described in chapter 5 is based on a spatial motif library (SML) generated from residue distance matrices. It produces results that are independent of the order in which the input proteins are given, and it can generate flexible as well as rigid structural alignments. The alignments produced are comparable to or better than those of other methods, both in alignment quality and coverage.

In the terminology and formalism introduced in chapter 3, Smolign uses an element based structure representation, as opposed to a geometric vector representation such as dividing a structure into a sequence of linear vectors. It combines characteristics of the property vector representation and the distance matrix representation, and utilizes several element classes, including contact windows, residue coordinates, and secondary structure elements. The clustering of compatible pairs of structure elements is done by use of transformations, where element pairs with similar translation and rotation matrices are merged, similar to the SARF program [140] and to the method introduced in [141].
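As a rough illustration of clustering element pairs by their transformations, the sketch below groups pairs whose rotation matrices and translation vectors are close. It is not the Smolign code: the transformations are assumed to be already computed, the helper `rot_z`, the tolerances `rot_tol` and `trans_tol`, and the single-pass greedy grouping are all assumptions chosen to keep the example short.

```python
import math


def rot_z(angle_deg):
    """Rotation matrix about the z axis, used only to build toy inputs."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def rotation_gap(r1, r2):
    """Frobenius norm of the difference between two 3x3 rotation matrices."""
    return math.sqrt(sum((r1[i][j] - r2[i][j]) ** 2
                         for i in range(3) for j in range(3)))


def cluster_by_transform(pairs, rot_tol=0.2, trans_tol=1.0):
    """Group (label, rotation, translation) triples with similar transforms."""
    clusters = []  # each entry: [representative rotation, translation, labels]
    for label, rot, trans in pairs:
        for cluster in clusters:
            if (rotation_gap(rot, cluster[0]) <= rot_tol
                    and math.dist(trans, cluster[1]) <= trans_tol):
                cluster[2].append(label)
                break
        else:
            clusters.append([rot, trans, [label]])
    return [c[2] for c in clusters]


# Hypothetical element pairs with their superposition transforms.
pairs = [
    ("p1", rot_z(30), (1.0, 0.0, 0.0)),
    ("p2", rot_z(32), (1.2, 0.1, 0.0)),
    ("p3", rot_z(120), (1.0, 0.0, 0.0)),
    ("p4", rot_z(31), (1.1, -0.2, 0.3)),
]
groups = cluster_by_transform(pairs)
```

Pairs p1, p2 and p4 differ by only a degree or two of rotation and a fraction of a unit of translation, so they end up in one group, while p3 is kept apart.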

Smolign differs from previous multiple alignment methods in several major aspects. Fundamentally, Smolign utilizes contact windows as the basic representation of proteins, from which 3D structural similarities can be identified. Contact windows have previously been used in pairwise structural alignment, with DALI [13] being the best-known example, but not in the multiple structural alignment problem. The main bottleneck in using contact windows for structural alignment is the computational cost of identifying and extending common structural conformations. The problem of finding similar contact sub-windows, known as Contact Map Overlap (CMO) [142], can be directly translated to a maximum clique problem [143]. Because this is an NP-complete problem [144], several heuristics have been proposed for the pairwise alignment case [145]. Instead of modeling the problem directly as a maximum clique problem, Smolign exploits the additional information contained in the protein structures, such as secondary structure type, Euclidean distance, and angle between backbone segments, and thereby greatly reduces the search space.
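The quantity that CMO tries to maximize can be sketched in a few lines of Python. The example below only evaluates the overlap for a given residue alignment; it does not search for the best alignment, which is the hard part. The 8.0 distance cutoff, the function names and the toy coordinates are assumptions for illustration.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Set of residue index pairs (i, j), i < j, closer than the cutoff."""
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) <= cutoff}


def contact_overlap(map_a, map_b, alignment):
    """Count contacts of A that are mapped onto contacts of B.

    alignment maps residue indices of A to residue indices of B.
    """
    count = 0
    for i, j in map_a:
        if i in alignment and j in alignment:
            a, b = alignment[i], alignment[j]
            if (min(a, b), max(a, b)) in map_b:
                count += 1
    return count


# Hypothetical C-alpha coordinates of two four-residue fragments.
coords_a = [(0, 0, 0), (5, 0, 0), (10, 0, 0), (15, 0, 0)]
coords_b = [(0, 0, 0), (5, 0, 0), (10, 0, 0), (10, 5, 0)]
map_a = contact_map(coords_a)
map_b = contact_map(coords_b)
identity = {0: 0, 1: 1, 2: 2, 3: 3}
```

Each candidate alignment yields a different overlap count, and the number of candidate alignments grows combinatorially, which is why the secondary structure, distance and angle filters mentioned above matter for pruning.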

Most importantly, Smolign is the first MSTA method that simultaneously selects alignment seeds (sub-structures) from multiple proteins. In chapter 3, we analyzed in detail why MSTA is an NP-complete problem: there is an exponential number of similar sub-structures from which we have to choose in order to form an optimal alignment seed. Among the methods mentioned above, progressive approaches bypass this difficult choice by iteratively performing pairwise alignments over all protein structures, but they give up the ability to recover an optimal alignment that is lost at an iterative step. Other approaches detect a large number of similar sub-structures simultaneously but fall back on a simple progressive solution to pick alignment seeds from multiple protein structures. Facing this NP-completeness issue, Smolign retains the simultaneous treatment by means of a novel dynamic seed selection algorithm that explores the possible alignment candidates in a best-first search. Note that the core of our seed selection algorithm is the set of dynamic pruning thresholds, which ultimately determine the quality of candidate seeds. Unless otherwise noted, the results reported in this thesis were obtained using the default parameters shown in table 5.2. These defaults are available on the job submission web site as advanced options. Even though the default parameters achieve competitive results, we allow interested users to change these parameters to control the quality vs. coverage and the speed vs. accuracy trade-offs. Of particular importance is the ε error threshold, which sets an upper bound on the mRMSD of the alignment that can be obtained. A tight ε error threshold generates fewer candidate seeds but discovers only highly conserved structural motifs, whereas a relaxed ε discovers more divergent motifs, at the computational cost of generating many false candidates that need to be evaluated.
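The interplay between best-first search and an ε threshold can be sketched as follows. This is a toy model under stated assumptions, not Smolign's seed selection: each protein contributes a list of motif descriptors (plain coordinate tuples here), a seed picks one motif per protein, and the cost of a seed is the largest pairwise descriptor distance, which merely stands in for mRMSD. The names `seed_cost`, `select_seed` and `epsilon` are hypothetical.

```python
import heapq
import math


def seed_cost(motifs):
    """Largest pairwise distance among chosen descriptors (stand-in for mRMSD)."""
    return max((math.dist(a, b)
                for i, a in enumerate(motifs) for b in motifs[i + 1:]),
               default=0.0)


def select_seed(candidates, epsilon):
    """Best-first search for one motif per protein with cost at most epsilon.

    candidates holds one list of motif descriptors per protein.
    Returns (cost, picks) or None if every candidate is pruned.
    """
    heap = [(0.0, ())]  # (cost so far, motif index chosen for each protein)
    while heap:
        cost, picks = heapq.heappop(heap)
        if len(picks) == len(candidates):
            return cost, picks
        protein = len(picks)
        for idx in range(len(candidates[protein])):
            new_picks = picks + (idx,)
            chosen = [candidates[p][i] for p, i in enumerate(new_picks)]
            new_cost = seed_cost(chosen)
            if new_cost <= epsilon:  # prune candidates above the threshold
                heapq.heappush(heap, (new_cost, new_picks))
    return None


# Hypothetical descriptors: two motifs for each of three proteins.
candidates = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(0.5, 0.0), (9.0, 9.0)],
    [(0.0, 0.4), (5.0, 5.2)],
]
relaxed = select_seed(candidates, epsilon=1.0)
tight = select_seed(candidates, epsilon=0.5)
```

Because this toy cost never decreases when a seed is extended, the first complete seed popped from the heap is the cheapest one. With the relaxed threshold a seed is found, while the tight threshold prunes every candidate, mirroring the trade-off described above.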

We also recognized that the search for optimal results has to be pursued in every aspect of the MSTA problem, and that considering all input structures together is the key point. Therefore, Smolign also builds the global alignment simultaneously, by iteratively executing a powerful partial order curve comparison algorithm [19]. Furthermore, Smolign provides the ability to generate flexible alignments, which is not supported by many of the other available methods.

We attribute the success of Smolign to the concise yet complete representation of the input structures it uses to construct the motif library. Pairs of interacting contact map sub-windows provide a good balance between the sensitivity of the representation and the corresponding search space. Through its dynamic filtering and efficient candidate evaluation and expansion algorithms, Smolign handles large and complex datasets where other methods fail to produce any results.

6.3 Future Work

Future work on protein trajectory alignment and the Smolign framework could go in three possible directions, as described below.

• First, one limitation of the current implementation of the EPO algorithm on protein trajectory data is that its scalability is not yet well understood. So far, we have only experimented with the EPO algorithm on a mini-protein (Trp-cage). One immediate question is how the EPO algorithm scales to larger proteins or longer trajectories. In particular, a larger protein means a curve of higher dimension. In our current experiments, the EPO algorithm appears to scale linearly with the number of participating trajectories. Furthermore, in practice, it is likely that we only run the algorithm on short simulation periods. For longer trajectories, the algorithm appears to scale quadratically. However, further experiments are necessary to investigate the scalability issue.

Previous works analyze protein folding trajectories by collecting various statistics on measures such as the contact number (i.e., the number of native contacts) of each conformation along a trajectory and the URMS distance between a conformation and the native structure [28]. One way to view this is that a trajectory is mapped into time-series data representing the evolution of the number of native contacts, which can be considered as a one-dimensional curve. In this regard, we can use our EPO algorithm to analyze a collection of such curves induced by one measure. In general, there may be multiple measures, geometric or physico-chemical, that a user may wish to inspect. Hence it is highly desirable to build a framework for analyzing folding trajectories that can incorporate these multiple measures and also enables new properties to be added easily.

• Secondly, we have left the issue of multiple conformation patterns out of our EPO algorithm. A conformation pattern is composed of a group of consecutive conformations that implies a crucial folding/unfolding step, such as nucleation-condensation or diffusion-collision [146, 147, 148], during the development of a simulation trajectory. Currently, EPO only compares a single pattern (usually close to the native state) among the set of trajectories (the data selection is performed outside of our algorithm). In the future, we want not only to find the folding order within one pattern but also to extend EPO to detect the folding order among multiple patterns. One potential method to achieve this goal is to treat each conformation like a residue in a protein structure, and then borrow the ideas of motif collection and alignment seed detection from the Smolign framework. Following this idea, the construction of "motifs" in terms of folding domains could be based on distinguishing and clustering conformation patterns [149, 150]. Furthermore, combining pattern-level and conformation-level comparison methods can help to interpret various folding models between proteins.

• Finally, it is necessary to further improve the computational efficiency of the Smolign framework. Although the major contribution of the Smolign framework is to process the given structures simultaneously so as to avoid the local minimum issue as far as possible, it still contains heuristic steps in which the optimal alignment may be missed in certain cases. For example, in some extreme situations where an input protein set contains a large number of motifs that all involve α-helices (e.g. αα, αβ±) and, even worse, the values of the θ property fall in a narrow range, Smolign may take a long time to select alignment seeds from a large number of candidates and consequently may have trouble pruning false-positive alignment seeds. The underlying reason is the dynamic behavior of protein segments [151] and the inaccuracy of its measurement, for example by the least squares fitting method used in detecting the θ property. This conflict between efficiency and accuracy motivates us to seek new pruning properties with better discrimination. One potential direction is to involve more complex local geometric properties. Previous work such as [152, 153, 154] has explored some clues about mathematically abstracting a core structure at different resolutions. In the future, we will attempt to develop a core structure measurement that describes the detailed variants in a more discrete manner. Another possible improvement lies in bio-constraints. At the moment, we only utilize them as a supplemental tool because of limited expert knowledge. By collecting statistics from scanning a large training dataset, we will gain more confidence to promote bio-constraints to a major pruning factor.
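The first direction above describes mapping a trajectory to a one-dimensional curve of native contact counts. A minimal Python sketch of that mapping is given here, assuming that a trajectory is a list of frames, that each frame is a list of residue coordinates, and that two residues are in contact when they lie within a distance cutoff and are at least `min_sep` positions apart in the sequence. The cutoff value, the separation and the function names are illustrative assumptions, not values taken from the thesis.

```python
import math


def contacts(frame, cutoff=8.0, min_sep=2):
    """Residue pairs of one conformation that lie within the cutoff."""
    n = len(frame)
    return {(i, j) for i in range(n) for j in range(i + min_sep, n)
            if math.dist(frame[i], frame[j]) <= cutoff}


def native_contact_series(trajectory, native, cutoff=8.0, min_sep=2):
    """Number of native contacts present in each frame of a trajectory."""
    native_set = contacts(native, cutoff, min_sep)
    return [len(contacts(frame, cutoff, min_sep) & native_set)
            for frame in trajectory]


# Hypothetical four-residue example: extended, intermediate and native frames.
native = [(0, 0, 0), (4, 0, 0), (4, 4, 0), (0, 4, 0)]
trajectory = [
    [(0, 0, 0), (5, 0, 0), (10, 0, 0), (15, 0, 0)],
    [(0, 0, 0), (5, 0, 0), (5, 5, 0), (10, 5, 0)],
    native,
]
series = native_contact_series(trajectory, native)
```

The resulting list is a one-dimensional curve over simulation time. A collection of such curves, one per trajectory and per measure, is the kind of input that a multi-measure analysis framework would hand to a curve comparison method.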

6.4 Summary

We believe that Smolign, combined with EPO, provides an important step in the advancement of multiple protein structural alignment, but we acknowledge that it may not give the best or most appropriate results in every single case. While Smolign can be utilized for large scale automated analysis, the use of different alignment programs that are developed under varying assumptions and use varying representations of proteins is likely to enrich any given case study. It must also be noted that the currently available multiple structure alignment programs, including Smolign, are geared toward identifying conserved structural cores of proteins, which is an important task in structure classification, fold recognition, and structure prediction problems. On the other hand, they may not be able to identify conservation of individual residue conformations or functional motifs that can be detected by LFMPro [155], gSpan [156] and [157].

Smolign is provided both as a web service for fast and convenient access and as a downloadable binary for more intensive batch tasks. The sample alignments described here and the alignments for the Homstrad and BaliBase benchmark datasets are also provided on the supplementary web-site.

BIBLIOGRAPHY

[1] Caspi Y, Irani M: Spatio–Temporal Alignment. Proc. IEEE Transactions On Pattern Analysis and Machine Intelligence. 2002, :1409–1424.

[2] Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The Protein Data Bank 2000, http://www.pdb.org/.

[3] Larson SM, Snow CD, Shirts M, Pande VS: Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology. Security 2009.

[4] Beberg AL, Ensign D, Jayachandran G, Khaliq S, Pande VS: Folding@home: Lessons from eight years of volunteer distributed computing 2009, :1–8.

[5] Swain M, Ostropytskyy V, Silva CG, Stahl F, Riche O, Brito RM, Dubitzky W: Grid Computing Solutions for Distributed Repositories of Protein Folding and Unfolding Simulations. In Proceedings of the 8th international conference on Computational Science, Part III, ICCS ’08, Berlin, Heidelberg: Springer-Verlag 2008:70–79, http://dx.doi.org/10.1007/978-3-540-69389-5_10.

[6] Berrar D, Stahl F, Silva C, Rodrigues J, Brito R, Dubitzky W: Towards Data Warehousing and Mining of Protein Unfolding Simulation Data. Journal of Clinical Monitoring and Computing 2005, 19:307–317, http://dx.doi.org/10.1007/s10877-005-0676-z.

[7] Alt H, Knauer C, Wenk C: Comparison of Distance Measures for Planar Curves. Algorithmica 2003, 38:45–58.

[8] Biasotti S, Marini S, Mortara M, Patane G, Spagnuolo M, Falcidieno B: 3D shape matching through topological structures. DGCI 2003, LNCS 2886:194–203.

[9] Hilaga M, Shinagawa Y, Kohmura T, Kunii TL: Topology matching for fully automatic similarity estimation of 3D shapes. In SIGGRAPH ’01: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, New York, NY, USA: ACM 2001:203–212.

[10] Akutsu T, Halldorsson MM: On the approximation of largest common point sets and largest common subtrees. Theoretical Computer Science 2000, 233:33–50.

[11] Gerstein M, Levitt M: Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Science 1998, 7:445–456.

[12] Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 1996, 6(3):377–385.

[13] Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 1993, 233:123–138.

[14] Holm L, Sander C: Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res 1997, 25:231–234.

[15] Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. 2004, D60:2256–2268.

[16] Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of optimal path. Protein Engineering 1998, 11(9):739–747.

[17] Taylor W, Orengo C: SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996, 266:617–35.

[18] Sun H, Sacan A, Ferhatosmanoglu H, Wang Y: Smolign: A Spatial Motifs Based Protein Multiple Structural Alignment Method. IEEE/ACM Trans Comput Biol Bioinform 2011, http://www.hubmed.org/display.cgi?uids=21464513.

[19] Sun H, Ferhatosmanoglu H, Ota M, Wang Y: Enhanced partial order curve comparison over multiple protein folding trajectories. Comput Syst Bioinformatics Conf. 2007, :229–310.

[20] Borreguero JM, Ding F, Buldyrev SV, Stanley HE, Dokholyan NV: Multiple Folding Pathways of the SH3 Domain. ArXiv Physics e-prints 2003, 87.

[21] Levinthal C: Are there pathways for protein folding? J.Chim.Phys. 1968, 65:44–45.

[22] Wolynes P, Onuchic J, Thirumalai D: Navigating the folding routes. Science 1995, 267:1619–1620.

[23] Abkevich VI, Gutin AM, Shakhnovich EI: Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry 1994, 33:10026–10036.

[24] Chiti F, Taddei N, White PM, Bucciantini M, Magherini F, Stefani M, Dobson CM: Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struc. Biol. 1999, 6:1005–1009.

[25] Dokholyan NV, Buldyrev SV, Stanley HE, Shakhnovich EI: Molecular dynamics studies of folding of a protein-like model. Fold. Design 1998, 3:577–587.

[26] Lockless SW, Ranganathan R: Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families. Science 1999, 286(5438):295–299.

[27] Du R, Pande VS, Grosberg AY, Tanaka T, Shakhnovich E: On the role of conformational geometry in protein folding. Journal of Chemical Physics 1999, 111:10375–10380.

[28] Kedem K, Chew L, Elber R: Unit-Vector RMS(URMS) as a Tool to Analyze Molecular Dynamics Trajectories. Proteins: Structure, Function and Genetics 1999, 37:554–564.

[29] Ota M, Ikeguchi M, Kidera A: Phylogeny of protein-folding trajectories reveals a unique pathway to native structure. PNAS 2004, 101(51):17658–17663.

[30] Crick F: What Mad Pursuit: A Personal View of Scientific Discovery. New York: Basic Books 1988.

[31] Bottomley S: Interactive Protein Structure Tutorial 2004, http://biomedapps.curtin.edu.au/biochem/tutorials/prottute/hierarchy.htm.

[32] Linderstrom-Lang K: The Lane Medical Lectures. Stanford, California: Stanford University Press 1952.

[33] http://withfriendship.com/user/neeha/protein-folding.php.

[34] Alder BJ, Wainwright TE: Phase Transition for a Hard Sphere System. Journal of Chemical Physics 1957, 27:1028–1029.

[35] Alder BJ, Wainwright TE: Studies in Molecular Dynamics. I. General Method. Journal of Chemical Physics 1959, 31:459–466.

[36] Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48:443–453.

[37] Smith TF, Waterman MS: Identification of common molecular subsequences. J. Mol. Biol. 1981, 147:195–197.

[38] Kahveci T, Singh A: An efficient index structure for string databases. VLDB 2001, :351–360.

[39] Myers E: An O(ND) difference algorithm and its variations. Algorithmica 1986, :251–266.

[40] May AC, Blundell TL: Automated comparative modelling of protein structures. Curr Opin Biotechnol. 1994, 5(4):355–360.

[41] Sali A: Modelling mutations and homologous proteins. Current Opinion in Biotechnology 1995, 6(4):437–451, http://www.sciencedirect.com/science/article/B6VRV-45765WK-3G/2/12e44225642a7d3bf5c030b459fad95b.

[42] Pearson W: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990, 183:63–98.

[43] Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of protein sequence and structure 1978, 5(Suppl 3):345–352.

[44] Bellman RE: Dynamic Programming. Dover Publications, Incorporated 2003.

[45] Gojobori T, Li WH, Graur D: Patterns of nucleotide substitution in pseudogenes and functional genes. J Mol Evol. 1982, 18(5):360–369.

[46] Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22:4673–4680.

[47] Feng D, Doolittle R: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Biomedical and Life Sciences 1987, 25(4):351–360.

[48] Gotoh O: Extraction of conserved or variable regions from a multiple sequence alignment. In Proceedings of Genome Informatics Workshop IV. 1993:109–113.

[49] Altschul SF: Gap costs for multiple sequence alignment. Journal of Theoretical Biology 1989, 138(3):297–309.

[50] Irizarry K, Kustanovich V, Li C, Brown N, Nelson S, Wong W, Lee CJ: Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nature Genetics 2000, 26(2):233–236.

[51] Lee C: Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics 2003, 19(8):999–1008.

[52] Grasso C, Lee C: Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 2004, 20(10):1546–1556.

[53] Lee C, Grasso C, Sharlow M: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452–464.

[54] Ye Y, Godzik A: Multiple flexible structure alignment using partial order graphs. Bioinformatics 2005, 21(10):2362–2369.

[55] Lassmann T, Sonnhammer E, Dialign P: Quality Assessment of Multiple Alignment Programs. FEBS Lett 2002, 529:126–130.

[56] Pemmaraju S, Skiena S: Computational Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. New York, NY, USA: Cambridge University Press 2003.

[57] Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH–A Hierarchic Classification of Protein Domain Structures. Structure 1997, 5(8):1093–1108.

[58] Chothia C, Lesk A: The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5:823–826.

[59] Tang D, Chun ACS, Zhang M, Wang JH: Cyclin-dependent Kinase 5 (Cdk5) Activation Domain of Neuronal Cdk5 Activator. Journal of Biological Chemistry 272(19):12318–12327, http://www.jbc.org/content/272/19/12318.abstract.

[60] Chothia C, Lesk A: The evolution of protein structures. Cold Spring Harb Symp Quant Biol 1987, 52:399–405.

[61] Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9:56–68, http://www.ncbi.nlm.nih.gov/pubmed/2017436.

[62] Rost B: Twilight zone of protein sequence alignments. Protein Engineering 1999, 12(2):85–94, http://peds.oxfordjournals.org/content/12/2/85.abstract.

[63] Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC: A Three-Dimensional Model of the Myoglobin Molecule Obtained by X-Ray Analysis. Nature 1958, 181:662–666.

[64] Keeler J: Understanding NMR Spectroscopy. Irvine, California: University of California, Irvine 2007.

[65] Eidhammer I, Jonassen I, Taylor WR: Structure comparison and structure patterns. J Comput Biol 2000, 7(5):685–716, http://dx.doi.org/10.1089/106652701446152.

[66] Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Current Opinion in Structural Biology 1996, 6(3):377–385.

[67] Ortiz AR, Strauss CE, Olmea O: MAMMOTH (Matching molecular models obtained from theory): An automated method for model comparison. Protein Sci 2002, 11(11):2606–2621.

[68] Lupyan D, Leo-Macias A, Ortiz AR: A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005, :3255–3263.

[69] Guda C, Scheeff ED, Bourne PE, Shindyalov IN: A New Algorithm for the Alignment of Multiple Protein Structures Using Monte Carlo Optimization. Pacific Symposium on Biocomputing 2001, :275–286.

[70] Ye Y, Godzik A: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003, 19:ii246–ii255.

[71] Shatsky M, Nussinov R, Wolfson HJ: MultiProt – A Multiple Protein Structural Alignment Algorithm. WABI ’02: Proceedings of the Second International Workshop on Algorithms in Bioinformatics 2002, :235–250.

[72] Menke M, Berger B, Cowen L: Matt: Local Flexibility Aids Protein Multiple Structure Alignment. PLOS Computational Biology 2008, 4:e10.

[73] Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: A multiple structural alignment algorithm. Proteins: Structure, Function, and Bioinformatics 2006, 64(3):559–574, http://dx.doi.org/10.1002/prot.20921.

[74] Alexandrov N: SARFing the PDB. Protein Engineering 1996.

[75] Dror O, Benyamini H, Nussinov R, Wolfson HJ: Multiple structural alignment by secondary structures: Algorithm and applications. Protein Science 2003, 12:2492–2507.

[76] Richardson J: The anatomy and taxonomy of protein structure. Adv. Protein Chem. 1981, 34:167–339.

[77] Phillips D: The development of crystallographic enzymology. Biochem Soc Symp. 1970, 30:11–28.

[78] Nishikawa K, Ooi T: Comparison of homologous tertiary structures of proteins. Journal of Theoretical Biology 1974, 43(2):351–374.

[79] Liebman M: Quantitative analysis of structural domains in protein. Biophys J. 1980, 32:213–215.

[80] Sippl MJ: On the problem of comparing protein structures: Development and applications of a new method for the assessment of structural similarities of polypeptide conformations. Journal of Molecular Biology 1982, 156(2):359–388.

[81] Havel T, Kuntz I, Crippen G: The theory and practice of distance geometry. Bull. Math. Biol. 1983, 45:665–720.

[82] Vassura M, Margara L, Di Lena P, Medri F, Fariselli P, Casadio R: Reconstruction of 3D Structures From Protein Contact Maps. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 2008, 5(3):357–367.

[83] Holm L, Sander C: Mapping the Protein Universe. Science 1996, 273(5275):595–602.

[84] FT-COMAR: Fault Tolerant Reconstruction of 3D Structure from Protein Contact Maps. http://bioinformatics.cs.unibo.it/FT-COMAR/index.html.

[85] Rossmann MG, Argos P: The taxonomy of binding sites in proteins. Molecular and Cellular Biochemistry 1978, 21:161–182, http://dx.doi.org/10.1007/BF00240135.

[86] Koehl P: Protein Structure similarities. Current Opinion in Structural Biology 2001, 11:348–353.

[87] Sierk M, Kleywegt G: Deja Vu All Over Again: Finding and Analyzing Protein Structure Similarities. Structure 2004, 12(12):2103–2111.

[88] Kabsch W: A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr. 1978, A34:827–828.

[89] Lathrop R: The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng. 1994, :1059–1068.

[90] Holm L, Sander C: 3-D lookup: Fast protein structure searches at 90% reliability. Proc. Ann. Int. Conf. on Intelligent Systems for Molecular 1995, :179–187.

[91] Szustakowski JD, Weng Z: Protein structure alignment using a genetic algorithm. Proteins: Structure, Function, and Bioinformatics 2000, 38(4):428–440.

[92] Jewett AI, Huang CC, Ferrin TE: MINRMS: an efficient algorithm for determining protein structure similarity using root-mean-squared-distance. Bioinformatics 2003, 19(5):625–634.

[93] Can T, Wang YF: CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features. Proc. IEEE Computer Society Conference on Bioinformatics 2003, :169–179.

[94] Kolbeck B, May P, Schmidt-Goenner T, Steinke T, Knapp EW: Connectivity independent protein-structure alignment: a hierarchical approach. BMC Bioinformatics 2006, 7:510–530.

[95] Taylor WR, Flores TP, Orengo CA: Multiple protein structure alignment. Protein Science 1994, 3:1858–1870.

[96] Russell R, Barton G: Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 1992, 14(2):309–323.

[97] Guda C, Lu S, Scheeff ED, Bourne PE, Shindyalov IN: CE-MC: a multiple protein structure alignment server. Nucleic Acids Research 2004, 32:W100–W103.

[98] Sneath PH, Sokal RR: Numerical taxonomy. Nature 1962, 193:855–860.

[99] Barton GJ, Sternberg MJ: A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol 1987, 198(2):327–337.

[100] Gerstein M, Levitt M: Using Iterative Dynamic Programming to Obtain Accurate Pairwise and Multiple Alignments of Protein Structures 1996, :59–67, http://portal.acm.org/citation.cfm?id=645631.757999.

[101] Gerstein M, Levitt M: Using Iterative Dynamic Programming to Obtain Accurate Pairwise and Multiple Alignments of Protein Structures. Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology 1996, :59–67.

[102] Akutsu T, Sim KL: Protein Threading Based on Multiple Protein Structure Alignment. IPSJ SIG Notes 1998, 98(105):25–30.

[103] Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge 1997.

[104] Ochagavía ME, Wodak S: Progressive Combinatorial Algorithm for Multiple Structural Alignments: Application to Distantly Related Proteins. Proteins 2004, 55:436–454.

[105] Ye Y, Godzik A: Database searching by flexible protein structure alignment. Protein Science 2004, 13(7):1841–1850, http://dx.doi.org/10.1110/ps.03602304.

[106] Ye Y, Godzik A: FATCAT: a web server for flexible structure comparison and structure similarity searching. Nucleic Acids Res 2004, 32:582–585.

[107] Schwarzer F, Lotan I: Approximation of protein structure for fast similarity measures. RECOMB ’03: Proceedings of the seventh annual international conference on Research in computational molecular biology 2003, :267–276.

[108] Kolodny R, Linial N: Approximate protein structural alignment in polynomial time. Proc. Natl. Acad. Sci. 2004, 101(33):12201–12206.

[109] Chen Y, Crippen GM: An iterative refinement algorithm for consistency based multiple structural alignment methods. Bioinformatics 2006, 22(17):2087–2093.

[110] Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics/computer Applications in The Biosciences 2000, 16:566–567.

[111] Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research 2005, 33:2302–2309.

[112] Comprehensive Evaluation of Protein Structure Alignment Methods: Scoring by Geometric Measures. Journal of Molecular Biology 2005, 346(4):1173–1188.

[113] Kleywegt GJ, Jones TA: A super position. CCP4/ESF-EACBM Newsletter on Protein Crystallography 1994, 31:9–14.

[114] Subbiah S, Laurents DV, Levitt M: Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Current Biology 1993, 3(3):141–148.

[115] Neidigh J, Fesinmeyer R, Andersen N: PDB ID:1L2Y Mini-proteins Trp the light fantastic. Nat.Struct.Biol. 2002, 9(6):425–430.

[116] Sutcliffe MJ, Haneef I, Carney D, Blundell TL: Knowledge based modelling of homologous proteins, part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering 1987, 1(5):377–384.

[117] Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family (Extended Abstract). citeseer.ist.psu.edu/596999.html.

[118] Lupyan D, Leo-Macias A, Ortiz AR: A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005, 21(15):3255–3263.

[119] Ochagavía ME, Wodak S: Progressive Combinatorial Algorithm for Multiple Structural Alignments: Application to Distantly Related Proteins. Proteins 2004, 55:436–454.

[120] Orengo CA: CORA–Topological fingerprints for protein structural families. Protein Science 1999, 8:699–715.

[121] Sandelin E: Extracting multiple structural alignments from pairwise alignments: a comparison of a rigorous and heuristic approach. Bioinformatics 2005, 21(7):1002–1009.

[122] Jain AK, Murty MN, Flynn PJ: Data Clustering: A Review. ACM Comput. Surv. 1999, 31(3):264–323.

[123] Koike R, Kinoshita K, Kidera A: Ring and Zipper formation is the key to understanding the structural variety in all-β proteins. FEBS Letters 2003, 533:9–13.

[124] Lesk A, Chothia C: How different amino acid sequences determine similar protein structures: I. The structure and evolutionary dynamics of the globins. J. Mol. Biol. 1980, 136:225–270.

[125] Sayle RA, Milner-White EJ: RASMOL: biomolecular graphics for all. Trends in Biochemical Sciences 1995, 20(9):374, http://www.ncbi.nlm.nih.gov/pubmed/7482707.

[126] Hart JC, Francis GK, Kauffman LH: Visualizing quaternion rotation. ACM Trans. Graph. 1994, 13(3):256–276.

[127] Lemmen C, Lengauer T, Klebe G: FlexS: A method for fast flexible ligand superposition. J. Medicinal Chem. 1998, 41:4502–4520.

[128] Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci 1998, 7(11):2469–2471.

[129] Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15:87–88.

[130] Siezen RJ, Leunissen JA: Subtilases: the superfamily of subtilisin-like serine proteases. Protein Sci 1997, 6(3):501–523, http://dx.doi.org/10.1002/pro.5560060301.

[131] Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Lu F, Marchler GH, Mullokandov M, Omelchenko MV, Robertson CL, Song JS, Thanki N, Yamashita RA, Zhang D, Zhang N, Zheng C, Bryant SH: CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 2011, 39(Database issue):D225–D229, http://dx.doi.org/10.1093/nar/gkq1189.

[132] Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307:447–463, http://dx.doi.org/10.1006/jmbi.2000.4474.

[133] Murzin AG: OB(oligonucleotide/oligosaccharide binding)-fold: common structural and functional solution for non-homologous sequences. EMBO J 1993, 12(3):861–867.

[134] Murzin A, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247:536–540.

[135] Andreeva A, Prlić A, Hubbard TJP, Murzin AG: SISYPHUS–structural alignments for proteins with non-trivial relationships. Nucleic Acids Res 2007, 35(Database issue):D253–D259, http://dx.doi.org/10.1093/nar/gkl746.

[136] Van Walle I, Lasters I, Wyns L: SABmark–a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21(7):1267–1268, http://dx.doi.org/10.1093/bioinformatics/bth493.

[137] Micheletti C, Orland H: MISTRAL: a tool for energy-based multiple structural alignment of proteins. Bioinformatics 2009, 25(20):2663–2669, http://dx.doi.org/10.1093/bioinformatics/btp506.

[138] Ilinkin I, Ye J, Janardan R: Multiple structure alignment and consensus identification for proteins. BMC Bioinformatics 2010, 11:71, http://dx.doi.org/10.1186/1471-2105-11-71.

[139] Ye J, Janardan R: Approximate multiple protein structure alignment using the sum-of-pairs distance. J Comput Biol 2004, 11(5):986–1000.

[140] Alexandrov NN, Takahashi K, Go N: Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. J Mol Biol 1992, 225:5–9.

[141] Chew LP, Huttenlocher D, Kedem K, Kleinberg J: Fast detection of common geometric substructure in proteins. J Comput Biol 1999, 6(3-4):313–325, http://dx.doi.org/10.1089/106652799318292.

[142] Godzik A, Skolnick J, Kolinski A: Regularities in interaction patterns of globular proteins. Protein Eng 1993, 6(8):801–810.

[143] Strickland D, Barnes E, Sokol J: Optimal protein structure alignment using maximum cliques. Operations Research 2005, 53:389–402.

[144] Goldman D, Istrail S, Papadimitriou C: Algorithmic aspects of protein structure similarity. In: Proc. 40th Annual IEEE Sympos. Foundations Comput. Sci., IEEE Computer Society, Los Alamitos 1999:512–522.

[145] Pullan W: Protein Structure Alignment Using Maximum Cliques and Local Search. Advances in Artificial Intelligence, LNCS 2007, 4830:776–780.

[146] Ivarsson Y, Travaglini-Allocatelli C, Brunori M, Gianni S: Mechanisms of protein folding. European Biophysics Journal 2008, 37:721–728, http://dx.doi.org/10.1007/s00249-007-0256-x.

[147] Brockwell DJ, Radford SE: Intermediates: ubiquitous species on folding energy landscapes? Current Opinion in Structural Biology 2007, 17:30–37, http://www.sciencedirect.com/science/article/pii/S0959440X07000048. [Folding and binding / Protein-nucleic interactions].

[148] Gianni S, Ivarsson Y, Jemth P, Brunori M, Travaglini-Allocatelli C: Identification and characterization of protein folding intermediates. Biophysical Chemistry 2007, 128(2-3):105–113, http://www.sciencedirect.com/science/article/pii/S0301462207000877.

[149] Zarrine-Afsar A, Larson SM, Davidson AR: The family feud: do proteins with similar structures fold via the same pathway? Current Opinion in Structural Biology 2005, 15:42–49, http://www.sciencedirect.com/science/article/pii/S0959440X05000126. [Folding and binding / Protein-nucleic acid interactions].

[150] Lindberg MO, Oliveberg M: Malleability of protein folding pathways: a simple reason for complex behaviour. Current Opinion in Structural Biology 2007, 17:21–29, http://www.sciencedirect.com/science/article/pii/S0959440X07000097. [Folding and binding / Protein-nucleic interactions].

[151] Ho BK, Agard DA: Probing the Flexibility of Large Conformational Changes in Protein Structures through Local Perturbations. PLoS Comput Biol 2009, 5(4):e1000343, http://dx.doi.org/10.1371%2Fjournal.pcbi.1000343.

[152] Montalvo RW, Smith RE, Lovell SC, Blundell TL: CHORAL: a differential geometry approach to the prediction of the cores of protein structures. Bioinformatics.

[153] Chang P, Rinne A, Dewey TG: Structure alignment based on coding of local geometric measures. BMC Bioinformatics 2006, 7:346, http://www.biomedcentral.com/1471-2105/7/346.

[154] Ku SY, Hu YJ: Protein structure search and local structure characterization. BMC Bioinformatics 2008, 9:349, http://www.biomedcentral.com/1471-2105/9/349.

[155] Sacan A, Ozturk O, Ferhatosmanoglu H, Wang Y: LFM-Pro: A Tool for Detecting Significant Local Structural Sites in Proteins. Bioinformatics 2007, 23(6):709–716.

[156] Yan X, Han J: gSpan: Graph-based substructure pattern mining. In Proc. 2002 Int. Conf. Data Mining (ICDM’02), Maebashi, Japan 2002:721–724.

[157] Bandyopadhyay D, Huan J, Prins J, Snoeyink J, Wang W, Tropsha A: Identification of family-specific residue packing motifs and their use for structure-based protein function prediction: I. Method development. J Comput Aided Mol Des 2009, 23(11):773–784, http://dx.doi.org/10.1007/s10822-009-9273-4.
