
A TWO-PRONGED APPROACH TO IMPROVE DISTANT HOMOLOGY DETECTION

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy in the Graduate School of The Ohio State University

By

Marianne Lee, MBA

* * * * *

The Ohio State University 2009

Dissertation Committee:

Professor Ralf Bundschuh, Adviser

Professor Umit Catalyurek

Professor Charles Daniels

Approved by

________________

Adviser

Biophysics Graduate Program

ABSTRACT

With the tremendous growth in biological information, homology detection has become a powerful approach to aid in assigning the functional role of proteins.

By establishing an ancestral relationship or homology to a well-understood protein, the function of a previously uncharacterized protein can be inferred.

The most common method to detect homology between proteins is to use sequence alignment, of which BLAST and PSI-BLAST are the most popular tools. The challenge is to find as many true positives as possible, and to distinguish these true positives from false positives when sequence similarity falls into the twilight zone (<25%), as is commonly observed for distantly related sequences.

A two-pronged approach is presented to address the challenge in distant homology detection. In the proposed LESTAT algorithm, conserved structural features are incorporated into an iterative profile-based sequence alignment method. This gives LESTAT the ability to find more true positives than PSI-BLAST in seven test case studies. In the proposed SimpleIsBeautiful (SIB) algorithm, a mathematical model and a novel model validation approach are utilized to improve PSI-BLAST's ability to discriminate true and false positives without sacrificing its computational efficiency. These additional features result in improved performance in distinguishing true and false positives when compared to the existing PSI-BLAST approach. A web server that runs the SIB algorithm, SIB-BLAST, was launched in December 2008 under the URL http://sib-blast.osc.edu.

One alternative application of homology prediction is to utilize that information to predict protein-protein interactions. As a first step to explore such questions, an algorithm was developed that attempts to predict interacting partners of a hetero-oligomer from a homo-oligomer using a structure-based sequence alignment strategy in conjunction with correlation analysis of amino acid pairs. The prediction algorithm was applied to the human Rh proteins and the SsoPCNA proteins. The results reveal that interacting residues in a homo-oligomer do undergo , presumably under evolutionary pressure, when trying to complex with another protein molecule to form a hetero-oligomer.

For Michael and my family

ACKNOWLEDGMENTS

My venture into science from finance has been both exciting and eventful. At times it was frustrating, especially when I hit a roadblock. Nevertheless, the end result is rewarding and I would not trade the past years and experience for anything. There are many people that I would like to thank for their support and encouragement - my mom and dad, my sister and brothers, my graduate committee members: Dr. Bundschuh, Dr. Clanton, Dr. Catalyurek, and Dr. Daniels, my lab mates: Xin, Patrick, Manoj and Sridevi, and people that have helped me in time of need. I would like to take this opportunity to thank them all.

I would like to thank specifically a few people who have had a deep influence on my PhD career. First and foremost is my mentor and advisor Dr. Bundschuh. I would not have achieved my scientific accomplishments without his advice and guidance. His immense patience and calming manner in correcting my errors and explaining fundamental theories have kept me motivated. I especially thank him for believing in me, first by taking a chance on a novice in science and computer programming, and then by constantly challenging me to reach my potential. I am fortunate in having an advisor who upholds high scientific standards and integrity.

I must thank Dr. Thomas Clanton, who has been instrumental in my joining the OSU Biophysics program. His lectures on ethics and constant emphasis on being an honest and responsible scientist have greatly influenced my professional and personal development. Last but not least is my husband Michael. He has been my constant companion and cheerleader throughout my PhD career. Even when I doubt myself, his unwavering faith in me keeps me going.

VITA

1989………………………. B.S. Business Administration, California State University, Los Angeles, CA

1993………………………. MBA Finance, University of Southern California, Los Angeles, CA

2002 – Present…………...Graduate Teaching and Research Associate, The Ohio State University

PUBLICATIONS

Research Publication

1. Lee, MM., Chan, MK., Bundschuh, R., "SIB-BLAST: a web server for improved delineation of true and false positives in PSI-BLAST searches." Nucleic Acids Research, 2009, May 8.

2. Lee, MM., Isaza, CE., White, JD., Chen, RY., Liang, G., He, H., Chan, SI., Chan, MK., "Insight into the substrate length restriction of M32 carboxypeptidases: characterization of two distinct subfamilies." Proteins: Structure, Function, and Bioinformatics, 2009, in press.

3. Fekner, T., Li, X., Lee, MM., Chan, MK., "A Pyrrolysine Analog for Protein Click Chemistry." Angew Chem Int Ed Engl 2009, 48:1633-5.

4. Lee, MM., Jiang, R., Krzycki, JA., Chan, MK., "Structure of the Desulfitobacterium hafniense pyrrolysyl-tRNA synthetase." Biochem Biophys Res Commun., 2008, 374:470-4.

5. Lee, MM., Chan, MK., Bundschuh, R., "Simple is beautiful: a straightforward approach to improving the delineation of true and false positives from a PSI-BLAST search." Bioinformatics, 2008, Apr 10.

6. Lee, MM., Bundschuh, R., Chan, MK., "Distant Homology Detection Using a LEngth and STructure-based Sequence Alignment Tool (LESTAT)." Proteins: Structure, Function, and Bioinformatics, 2007, 71:1409-1419.

FIELDS OF STUDY

Major Field: Biophysics Graduate Program

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Vita
List of Tables
List of Figures

Chapters

1. Introduction
   1.1. Protein Structure and Function
   1.2. Homology Detection
   1.3. Sequence Alignment
   1.4. Dynamic Programming Algorithm
   1.5. BLAST - a Heuristic Approach
2. Distant Homology Detection Using a Length and Structure-based Sequence Alignment Tool (LESTAT)
   2.1. Experimental Section
   2.2. Results and Discussion
   2.3. Conclusion
3. Simple Is Beautiful: a Straightforward Approach to Improve the Delineation of True and False Positives in PSI-BLAST Searches
   3.1. Methods
   3.2. Results
   3.3. SIB-BLAST Web Server
   3.4. Discussion
   3.5. Conclusion
4. Predicting Interacting Partners of Hetero-oligomers
   4.1. Methods
   4.2. Results and Discussion
5. Conclusion

LIST OF TABLES

Table 3.1 ROC100 values characterizing retrieval performance for a "gold standard" yeast database

Table 4.1 Top ten predictions of the possible subunit combinations forming the human Rh complex using different multiple sequence alignment models

LIST OF FIGURES

Fig. 1.1 Proteins are comprised of a specific sequence of amino acids

Fig. 1.2 The different levels of protein structure

Fig. 1.3 Venn diagram depicting the distribution of recognized folds in the three superkingdoms of life and the number of recognized folds in each

Fig. 1.4 Which individuals belong to the Simpson family?

Fig. 1.5 Protein homology based on structure

Fig. 1.6 Example of a sequence alignment

Fig. 1.7 Challenge in homology detection

Fig. 1.8 Coverage versus Error plot

Fig. 1.9 Example of a filled dynamic programming matrix

Fig. 2.1 The LESTAT algorithm

Fig. 2.2 Performance Comparison of LESTAT versus PSI-BLAST for the initial test set

Fig. 2.3 Performance Comparison of LESTAT versus PSI-BLAST for the independent validation set assessed at superfamily level

Fig. 3.1 Performance comparison between rounds two and five of PSI-BLAST, and our new method for combining different rounds

Fig. 3.2 Performance comparison between PSI-BLAST, SAM-T2K, and our proposed method

Fig. 3.3 Snapshot of the SIB-BLAST front page

Fig. 3.4 Snapshot of the SIB-BLAST result page

Fig. 4.1 Schematic overview of the prediction algorithm for oligomeric assemblies

Fig. 4.2 Graphical representation of the interacting subunits and the corresponding interfaces of a heterotrimer

Fig. 4.3 Ribbon diagrams of the three interfaces in SsoPCNA

CHAPTER 1

INTRODUCTION

The completion of the genome sequencing of numerous organisms, the most notable being the human genome in 2001 [1], has led to an explosion in the number of protein sequences available in protein databases (8,500,000 sequences versus 57,000 structures as of 4/20/09). This has greatly altered the focus of biological and biomedical sciences and promoted the development of new biological technologies. Structural and functional genomics are among the many disciplines that have arisen from these advances.

Homology detection - the identification of ancestral relationships between proteins - especially in the area of structural and functional annotation, has played an important role in structural genomics. Over the past two decades, tremendous effort has been put into the development of bioinformatics tools for detecting homology relationships. When a new sequence is discovered, using sequence alignment methods to search for its homologs has become a standard procedure. Despite the general utility of sequence-based algorithms, such as BLAST [2, 3] and PSI-BLAST [3, 4], to identify potential homologs, it is still quite common to discover, after laborious experimental effort, that a target protein predicted to have a new fold in fact has a homolog with a known structural motif. This frequent occurrence can be attributed to the difficulty of predicting homology for proteins having low sequence similarity, as is commonly observed for distantly related sequences. Thus, one major challenge in protein homology detection is how to identify the "more" distant homologs (higher sensitivity).

PSI-BLAST [3, 4], an extended version of BLAST [2], is arguably the most widely used sequence alignment tool for distant homology detection. It uses an iterative profile (PSSM)-based search strategy that enhances its sensitivity in detecting weak homologies. Nevertheless, incorporation of false positive sequences into its profile is quite common, leading to model corruption and meaningless results. Finding ways to overcome this model corruption problem remains a challenge in the bioinformatics field.

The major focus of my thesis is developing strategies to overcome the challenges in distant homology detection using sequence alignment methods. The thesis is organized as follows. In the remainder of this chapter, I give a brief overview of the structure of proteins and of sequence alignment methods. This is followed by Chapters 2 and 3, the main components of this thesis, which describe the two-pronged approach I have undertaken to improve the detection of distant homology. Specifically, Chapter 2 presents my first approach to overcoming the challenge of detecting homologs with low sequence similarity. The chapter describes a novel algorithm that I have developed, LESTAT, which incorporates structural information from distant homologs to identify highly conserved regions, and then uses these regions and their separation distances to build a search model. Significantly, LESTAT can identify many of the more distant homologs not detected by PSI-BLAST. Chapter 3 aims at overcoming the model corruption problem observed in PSI-BLAST. It details the development of the SimpleIsBeautiful algorithm, which utilizes a mathematical formula and the output produced by PSI-BLAST to build a validation model that verifies the authenticity of a true homolog. The result is improved PSI-BLAST search accuracy without sacrificing its computational efficiency - a major feat given that better alignment methods than PSI-BLAST are available but at a higher computational cost. Chapter 4 describes my efforts to develop an algorithm that uses a structure-based sequence alignment method and correlation analysis to predict the interacting partners in an oligomer from a homologous homo-oligomer. The algorithm is still a work in progress. Based on the validation result, the potential influence of evolutionary pressure on the coevolution of amino acids in a homo-oligomer and a hetero-oligomer is discussed.

PROTEIN STRUCTURE AND FUNCTION

Proteins are the biological nanomachines that mediate most processes in cells. They are built from a repertoire of twenty amino acids, connected by peptide bonds that give rise to their sequence - the specific arrangement of amino acids that imparts a protein with its particular structure and function (Fig. 1.1).


Fig. 1.1. Proteins are comprised of a specific sequence of amino acids. Each letter in the sequence denotes a different amino acid. Source: http://ffden-2.phys.uaf.edu.

The primary structure of a protein, that is, its amino acid sequence, is folded into a three-dimensional (tertiary) structure, which brings together residues that are far apart in the sequence. Tertiary structure is stabilized by interactions between amino acid side chains, including noncovalent hydrophobic interactions, hydrogen bonding, and electrostatic interactions, as well as covalent disulfide linkages.

When a protein consists of several polypeptide chains, termed subunits, the spatial arrangement of these subunits is known as its quaternary structure. The subunits of oligomeric proteins are usually held together by weak noncovalent interactions, including hydrogen bonding, van der Waals and hydrophobic interactions, and in some cases by covalent interchain disulfide bonds. Hydrophobic interactions are the principal forces, while electrostatic forces help to provide proper interactions between the subunits. Fig. 1.2 gives an overview of the different levels of protein structure [5].

Fig. 1.2. The different levels of protein structure [6].

All proteins consist of one or several domains, defined as the basic units of folding and function, which often have similar chain topologies [7]. When separated from the parent chain, protein domains are able to fold independently to their native configuration. Based on the 1992 estimate by Chothia [8], the number of unique protein folds in nature was no more than one thousand, and the experimentally determined structures at that time, which represented about 400 of the folds, were the most common ones. In 1998, Koonin et al. [9] reported that about one half of the folds in the three superkingdoms of life are common to all three groups of organisms (Fig. 1.3). Thus it seems feasible to use the "relatively small" portion of protein structures that are determined experimentally to derive most of the unsolved structures.

[Venn diagram with three sets: Bacteria, Eukaryotes, and Archaea. Region counts as extracted, in unrecoverable order: 43, 60, 36, 137, 3, 4, 1.]

Fig. 1.3. Venn diagram depicting the distribution of recognized folds in the three superkingdoms of life and the number of recognized folds in each [9].

In the two structure databases, SCOP (Structural Classification of Proteins) [10] and CATH (Class, Architecture, Topology, Homologous superfamily) [11], protein structures are classified into hierarchies based on their evolutionary origin and structural similarity. At the highest level of the hierarchy, proteins are classified into broad categories according to their secondary or tertiary structure characteristics, independent of sequence similarity or provable evolutionary relationship. Further down the hierarchy, among proteins that share similar folding patterns, protein domains sharing enough structural and/or sequence similarity to suggest common ancestry are considered to be related and are grouped into the same superfamilies and/or families. Under the SCOP classification scheme, proteins within the same family share at least 30% residue identity and are of common evolutionary origin. In superfamilies, probable evolutionary relationships are inferred from structural and/or functional features rather than sequence similarity, since proteins in the same superfamily share low sequence similarity (usually less than 25%) [12]. Thus, in general, proteins that are structurally similar are more likely to be evolutionarily related and to perform similar functions.

Different evolutionary processes, such as gene duplication and gene fusion, give rise to new proteins or functions, however. Selection of one or a few sequence changes can cause proteins to differentiate and confer novel functions. It is found that subtle changes in residues at conserved positions confer functional differences across subfamilies [13-16]. For example, a mutation in the active site of an enzyme may enable the protein to bind a different substrate or catalyze a different reaction using the same substrate. Gene duplication produces two copies of the same gene, freeing one copy of functional constraint and allowing it to assume a new biological role through mutation, gene fusion or oligomerisation. Proteins can also achieve functional variety through domain combination [17]. Many dehydrogenases, such as malate dehydrogenase, alcohol dehydrogenase, and glyceraldehyde-3-phosphate dehydrogenase, share homologous NAD- or NADP-binding domains but have different unrelated partner domains that vary with the reaction catalyzed [18]. In domain shuffling between multidomain partners, the structural environment of a domain is changed by a new partner, thereby modifying its function [19]. Over evolutionary time, some positions in the protein sequences have remained conserved, indicating their importance to protein function, while other positions have evolved so extensively that the sequence identity between family members is low [13, 16].

HOMOLOGY DETECTION

Two proteins are said to be homologous if they share common ancestry.

Homology detection refers to the identification of certain characteristics/features that give rise to the inference of ancestral relationship between proteins. A simplistic illustration of the concept of homology detection is depicted in Fig. 1.4.

Fig. 1.4. Which individuals belong to the Simpson family?

The three individuals at the left are matched to the Simpson family at the right. Bart and Krusty are easy to classify as a family member and non-family member, respectively. But the middle character is harder to place. Closer examination reveals, however, that this is Homer – just younger and with a lot more hair! This latter case illustrates the concept of a distant homolog.

Classifying proteins in biology is similar. One way to infer relationships between proteins is based on structure [13, 14, 20, 21]. For example, the proteins enterotoxin and cholera toxin in Fig. 1.5 are structurally similar. Thus, they are more likely to be evolutionarily related.

Fig. 1.5. Protein homology based on structure.

SEQUENCE ALIGNMENT

More commonly, however, protein homologs are identified based on comparison of their sequences. Sequence alignment (Fig. 1.6) involves looking for the overlapping regions (alignment) between amino acid sequences. The longer the overlapped region of matching residues is, the more likely the two proteins are related.

Fig. 1.6. Example of a sequence alignment. "*" means that the aligned residues are identical. ":" means that the aligned residues are similar based on their physicochemical properties.

Different approaches exist for aligning sequences. Most sequence alignment methods rely on the quantification of sequence similarity between sequences to imply functional similarity. The general assumption made by these methods is that proteins sharing greater than 30% sequence identity are descendants of the same common ancestor, and thus belong to the same family and likely have similar structural and functional properties [13]. The most common sequence comparison method is the "all-against-all" single-sequence similarity search approach, which involves alignment of a single probe sequence to every sequence in the database in a pairwise fashion. Programs such as BLAST [2], FASTA [6] and SSearch [22] are among the many programs that use this approach. These methods, however, do not make use of extrinsic information, such as information implicit in the alignments of families of related proteins or structural information embedded in a protein structure, and thus often miss the weak homologies between distantly related proteins. In general, these "all-against-all" single-sequence similarity search approaches fail to find distant homologs, particularly those whose sequence similarity is in the twilight zone (≤25%).
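As a rough illustration of the identity thresholds quoted above, the following Python sketch computes percent identity from a pairwise alignment and flags pairs that fall in the twilight zone. It is a sketch under stated assumptions only: the function names are hypothetical, and the denominator used here (columns in which neither sequence has a gap) is one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator.
    Other definitions (e.g. dividing by alignment length) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def in_twilight_zone(aligned_a: str, aligned_b: str, cutoff: float = 25.0) -> bool:
    """True if the pair's identity is at or below the cutoff (25% in the text)."""
    return percent_identity(aligned_a, aligned_b) <= cutoff
```

For example, `percent_identity("ACDE-F", "ACDQWF")` compares five ungapped columns, four of which are identical.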

Profile-sequence comparison methods such as PSI-BLAST [3, 4] were developed to be more sensitive, extending the range of homologs that could be detected. These methods perform a multiple alignment of homologous sequences to build a profile of a protein family, which discriminates better between conserved and non-conserved residue positions.
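To make the idea of a profile concrete, here is a minimal sketch of turning a gap-free block of aligned sequences into a position-specific scoring matrix of log-odds scores. It is not PSI-BLAST's actual procedure: the uniform background frequencies, the flat pseudocount, and the function names `build_pssm` and `score_window` are simplifying assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds profile (one dict per column) from aligned sequences.

    Assumptions: equal-length sequences, uniform background frequency (1/20),
    and a flat pseudocount added to every residue count.
    """
    length = len(aligned_seqs[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned_seqs:
            if seq[col] in counts:  # ignore gaps and unknown symbols
                counts[seq[col]] += 1.0
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score_window(pssm, segment):
    """Score an ungapped segment of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

With a profile built from `["ACD", "ACE", "ACD"]`, the conserved residues at the first two positions receive positive scores while residues never seen there receive negative ones, which is the discrimination between conserved and non-conserved positions described above.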

Another powerful profile-based method is the profile Hidden Markov Model (HMM), which includes position-specific probabilities for insertions and deletions in its profile in addition to the standard amino acid frequencies. The position-specific gap penalties penalize random hits, whose gaps are variable, much more than true homologs, which tend to have the same insertions and deletions at the same positions as the sequences used to build the HMM. As a consequence, profile HMMs have been shown to perform better than sequence profiles (e.g. PSI-BLAST) in the detection of homologs and in the quality of alignments [23-25]. Programs such as HMMER [26] and SAM [27, 28] epitomize this approach. However, the improved sensitivity comes at the expense of higher computational cost. Thus, although SAM has been shown to be superior to PSI-BLAST in its ability to detect distant homologs [29], it is much less popular than PSI-BLAST.

In homology searches, one starts with a protein of interest and tries to identify all of its family members. This problem is similar to finding Homers (true hits) in a sea of Krusty the Clowns (false hits) (Fig. 1.7). Thus, an optimal sequence alignment tool is one that finds the most true positives (high sensitivity) and best discriminates them from false positives (high specificity), that is, one that finds more Homers without picking up Krustys.

Fig. 1.7. Challenge in homology detection – Finding Homers in a sea of clowns.

A typical way to depict the effectiveness of a sequence alignment algorithm is a coverage versus error plot. Coverage is defined as the number of homologs (true positives) that have scores above a selected error threshold; this reflects the sensitivity of a method. Error is the number of non-homologs (false positives) incurred; this indicates specificity. Thus, a coverage versus error plot is simply a graph showing the number of true positives uncovered against the number of errors incurred at a given threshold. In essence, it is a representation of how a method performs at different levels of accuracy.

For an optimal sequence alignment algorithm, the curve in a coverage versus error plot would lie as far right as possible, denoting higher sensitivity, and would be as flat as possible before the onset of an error, indicating better selectivity (Fig. 1.8).

Fig. 1.8. Coverage versus Error plot: tradeoff between sensitivity and specificity as illustrated with the Aravind data set [30] used in the SimpleIsBeautiful algorithm study [31].
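The construction of a coverage versus error curve can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code used in this work: it assumes each hit carries a score for which higher is better (for E-values the sort order would be reversed) and a known true/false label, and the function names are hypothetical.

```python
def coverage_vs_error(hits):
    """Return (errors, coverage) after each hit, walking from best score down.

    hits: iterable of (score, is_homolog) pairs; higher score = better hit.
    """
    points = []
    coverage = 0  # true positives found so far
    errors = 0    # false positives incurred so far
    for _score, is_homolog in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_homolog:
            coverage += 1
        else:
            errors += 1
        points.append((errors, coverage))
    return points


def coverage_at_error(hits, max_errors):
    """Number of true positives recovered before exceeding max_errors."""
    best = 0
    for errors, coverage in coverage_vs_error(hits):
        if errors > max_errors:
            break
        best = coverage
    return best
```

Reading off `coverage_at_error` at several error levels gives a compact summary of the same tradeoff the plot shows: a better method recovers more true positives at each allowed number of errors.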

DYNAMIC PROGRAMMING ALGORITHMS

The application of dynamic programming algorithms to biological sequences was first introduced by Needleman and Wunsch (1970) [32]. The algorithm strives to find an optimal alignment that covers the entire length of both sequences. This was followed by the Smith-Waterman algorithm (1981) [33], a modification of the Needleman-Wunsch algorithm that looks for the optimal alignment of the subregions that align best between the two sequences.

In dynamic programming, all possible alignments between the two sequences are explored and the alignment with the highest score is defined as the optimal alignment between the two sequences (Fig. 1.9). Thus, algorithms that use dynamic programming are guaranteed to find the optimal scoring alignment. For a large protein with a long amino acid sequence, this can be very expensive in time and memory. In addition, the search database, such as the National Center for Biotechnology Information (NCBI) non-redundant (NR) database, is huge (>8 million protein sequences as of 4/09), so aligning a query against it using dynamic programming cannot yet be done on a practical timescale.
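The two dynamic programming variants can be sketched with a single matrix-filling routine. The sketch below uses the simple scoring scheme of Fig. 1.9 (+5 for a match, -2 for a mismatch, -6 per gap) and returns only the optimal score, without traceback; the function name and the `local` switch are illustrative choices, and real protein alignments would use a substitution matrix and typically affine gap penalties.

```python
def align_score(a, b, match=5, mismatch=-2, gap=-6, local=False):
    """Optimal alignment score by dynamic programming with a linear gap penalty.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style); cells are floored
                 at zero and the best cell anywhere in the matrix is returned.
    """
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for leading gaps.
        for i in range(1, rows):
            f[i][0] = i * gap
        for j in range(1, cols):
            f[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                f[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                f[i - 1][j] + gap,    # gap in b
                f[i][j - 1] + gap,    # gap in a
            )
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            f[i][j] = cell
    return best if local else f[-1][-1]
```

Both variants fill a matrix with one cell per pair of positions, so time and memory grow with the product of the two sequence lengths, which is the cost that motivates the heuristic methods discussed next.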

Fig. 1.9. Example of a filled dynamic programming matrix for two DNA sequences. Scoring scheme used: +5 for a match, -2 for a mismatch, -6 for each gap. Optimum path is shown in red. Arrowheads are 'traceback pointers' [34].

BLAST (Basic Local Alignment Search Tool) - A HEURISTIC APPROACH

In conducting a search against the NR database at NCBI for sequences similar to a query sequence, BLAST [2, 3] takes a matter of minutes whereas the original Smith-Waterman algorithm [33] would require about a day to accomplish the same task. Such computational efficiency is one of the many appealing attributes of BLAST - and probably the major reason for its wide acceptance and huge popularity within the scientific community.

BLAST [2, 3] derives its computational efficiency from a heuristic approach, which substantially reduces computational time and in turn produces biologically meaningful results within a practical timescale. The heuristics used by BLAST are based on knowledge of the sequences and alignment statistics. Instead of searching the sequence space exhaustively, these heuristics allow the program to confine the search to as small a fraction of the dynamic programming matrix as possible, while still looking at all the high scoring alignments. Because a heuristic approach does not guarantee an optimal alignment, the gain in operational efficiency comes at the expense of accuracy and sensitivity.

BLAST employs several heuristics to accelerate its database search. It uses a list of very short segments from the query sequence to first scan for exact matches with each database sequence and retains only the most significant ones, based on a certain threshold, for the subsequent search for a good longer alignment. Since these seed segments (words) are very short, the scan to identify candidate regions can be done very fast. Because the length of the word is the minimum length required to achieve a high enough score to be significant, the choice of the word length as well as the threshold can greatly affect the sensitivity and accuracy of the search. For example, if the word length is too long, the search may miss short but significant patterns. Likewise, a higher threshold allows for faster and more specific searches but produces fewer seeds for the subsequent search for longer alignments. On the other hand, a shorter word length or a less stringent threshold increases sensitivity, but requires additional computation time. The alignment score of the seed segment of the query sequence to the database sequence is calculated based on the scoring matrix used. For protein BLAST, the default word-size is 3 and the default scoring matrix is BLOSUM62. Only pairs with alignment scores above a certain threshold are retained for further alignment.
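The seeding step can be illustrated with a deliberately simplified sketch that indexes the words of the query and scans a database sequence for exact word matches. It is not BLAST's implementation: the score threshold and scoring matrix described above are omitted (they would require a full substitution matrix such as BLOSUM62), and the function names are hypothetical.

```python
from collections import defaultdict


def index_words(seq, w=3):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where length-w words match exactly.

    Simplification: only identical words count as hits; no score threshold
    is applied.
    """
    index = index_words(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds
```

Running `find_seeds("MKVLAAGIV", "TTKVLAAQQ", w=3)` yields three overlapping seeds along one diagonal, whereas `w=6` yields none, a small instance of the point made above that an overly long word can miss a short but real match.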

Another heuristic used by BLAST involves triggering an extension to generate an alignment. BLAST uses a two-hit method that requires two non-overlapping high scoring pairs within a certain distance of each other and on the same diagonal to trigger a longer ungapped alignment between the seed segment and the database sequence. Once a matched region is found and the ungapped alignment score exceeds a chosen threshold, a gapped extension is invoked to produce an alignment using dynamic programming. Because there are far fewer high scoring pairs that also have a second pair within the distance threshold and on the same diagonal, the threshold for seed selection must be lowered to maintain sensitivity. This means allowing matches to lower-scoring words in the database sequence, which potentially affects the quality and accuracy of the alignment result. Also, fewer extensions are invoked, increasing the chance of missing a high scoring pair. The advantage, however, is that the computationally demanding dynamic programming is confined to a smaller region of already identified similarity, substantially reducing the time required to conduct the search. The caveat is that it increases the probability of missing a higher scoring extension if the high scoring pairs happen to lie outside of the pre-identified region of similarity. The choice of threshold parameters becomes even more important when applying heuristics to distantly related sequences with low similarity, in order to attain reasonable sensitivity.
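The two-hit condition described above can be sketched as a filter over word hits. This is an illustrative sketch rather than BLAST's code: hits are given as (query position, subject position) pairs, the diagonal is taken as subject position minus query position, and the name `two_hit_triggers` and the default distance of 40 are assumptions chosen for the example.

```python
from collections import defaultdict


def two_hit_triggers(seeds, w=3, max_dist=40):
    """Find pairs of non-overlapping word hits on the same diagonal.

    seeds: iterable of (query_pos, subject_pos) word hits of length w.
    Returns (diagonal, first_query_pos, second_query_pos) for each pair whose
    start positions are at least w apart (non-overlapping) and at most
    max_dist apart.
    """
    by_diagonal = defaultdict(list)
    for q, s in seeds:
        by_diagonal[s - q].append(q)

    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        last = None
        for q in positions:
            if last is None:
                last = q
                continue
            distance = q - last
            if distance < w:
                continue  # overlaps the previous hit; keep the earlier one
            if distance <= max_dist:
                triggers.append((diagonal, last, q))
            last = q  # the newer hit becomes the reference either way
    return triggers
```

Only the pairs returned here would go on to ungapped extension, which is why the two-hit requirement cuts the number of extensions and why, as noted above, the seed threshold has to be relaxed to compensate.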

Using heuristics, BLAST [2, 3] and PSI-BLAST [3, 4] are able to attain computational efficiency, which is one of the major reasons behind their huge popularity, even though other state-of-the-art methods, such as the HMM-based SAM-T2K [27, 28], have been shown to be superior in remote homology detection, albeit at a higher computational cost [29]. The goal of my research, and hence the main theme of this thesis, is to find ways to improve distant homology detection while keeping computational cost at a minimum.


CHAPTER 2

DISTANT HOMOLOGY DETECTION USING A LENGTH AND STRUCTURE-BASED SEQUENCE ALIGNMENT TOOL (LESTAT)

Homology detection is one of the most widely used tools for structural and functional annotation and is a major driving force behind structural genomics efforts. Towards this end, a number of distinct alignment methods and algorithms have been developed to identify potential homologies. State-of-the-art homology detection tools such as HMMER [35, 36] utilize profile technology and expert-curated libraries. Alternative approaches to improve performance include incorporating structural information such as amino acid environmental preferences, secondary structure elements, and loop lengths into a search profile [37-40]. However, these latter approaches require the preexistence of a manually curated model or multiple alignment of the protein family.

In contrast, automated profile-based methods have the advantage that they avoid the introduction of human bias and can be applied in the absence of a predefined model. The most widely used automated application, PSI-BLAST [3], starts with a single sequence and applies a machine learning algorithm that iteratively incorporates hits found in previous cycles to build and improve a sequence profile (PSSM). Numerous studies have demonstrated that iterative model building enhances remote homolog detection [3, 41]. Despite these advances, the identification of distant homologs having low sequence identity (<25%) remains a challenge. Indeed, the motivation for this study derives from the initial inability of PSI-BLAST to identify the homology of Pyrococcus furiosus M32 carboxypeptidase (PfuCP) with its structural homologs, Rattus norvegicus neurolysin and Homo sapiens angiotensin-converting enzyme (ACE).

Our approach is based on the hypothesis that homology prediction can be dramatically improved by using an initial model built from a set of distant homologs having limited sequence similarity, thus avoiding the inherent bias associated with a single starting sequence. Structural homologs with low sequence identity often have conserved residues, many of which are critical for function [42-44]. Consequently, we have developed an automated machine learning algorithm, LESTAT, that uses the conserved features derived from a small set of distantly related structural homologs to build an initial model comprised of structurally conserved blocks and block separation distances. This model is improved through iterative refinement cycles using information extracted from identified sequence homologs (Fig. 2.1). This approach is unique because it combines the structural information incorporated into the initial model with a complete and automated exploitation of all sequence information available for the protein family in question without the need for expert model building.


Fig. 2.1. The LESTAT algorithm. (a) Flow diagram of the LESTAT algorithm. Initial building of the model (blue), iterative refinement of the model (orange).

(b) Representative Block-containing Position Specific Scoring Matrix (BPSSM). The first 20 columns represent the log-odds scores for the 20 amino acids; columns 21 and 22 indicate the maximum and minimum separation between neighboring residues; and the remaining columns are the costs of having different block distances.

The key determinant that allows LESTAT to overcome the structural and sequence diversity inherent in distant homologs is its use of structurally conserved blocks, as opposed to the entire protein sequence, to build its model. This concept of structurally conserved blocks has already been explored in a non-iterative fashion in the CDD database and the corresponding search tool, SALTO [40, 45, 46]. We have added the separation distances between blocks as another important measure for homology and incorporated this feature into an iterative framework. In addition to these structure-derived properties, two further components, a “lock-in” feature and an enhancement factor, have been incorporated to enhance the ability of LESTAT to detect more remote homologs. Together, these two features mitigate compositional biases resulting from disparate family sizes during the iterative model building cycles. In this chapter, we describe this algorithm and compare its performance to PSI-BLAST, arguably the most popular machine learning sequence alignment algorithm currently in use. The results and implications of these studies are discussed.

EXPERIMENTAL SECTION

Algorithm

LESTAT is a structure-based sequence alignment algorithm that utilizes the information derived from the of proteins from related families to iteratively search a database for homologs. A schematic representation of this algorithm is depicted in Fig. 2.1. In brief, this method comprises the following steps: (1) construction of an initial model comprised of amino acid blocks and block separation distances from three structural homologs with low sequence identity; (2) generation of a block-containing position specific scoring matrix (BPSSM); (3) alignment of query sequences to the BPSSM; and (4) scoring and assessing the statistical significance of the optimal alignment. Sequences with reasonable statistical significance are used to generate a new BPSSM, which is used to identify additional homologs in an iterative fashion, repeating steps (2)-(4), until the refinement reaches convergence.

Construction of the initial homology model

The first step towards building the initial model is to retrieve the coordinates of the structurally characterized protein of interest from the Protein Data Bank (PDB) [47]. These coordinates are then submitted to the structural alignment algorithm DALI [48], which produces a list of probable structural homologs, each with a corresponding Z-score. These DALI-Z scores provide a statistical measure of the significance of similarity between the probe structure and the resultant structural homologs, with a score of 2.0 or higher taken to indicate structural similarity [49]. Two structural homologs from this DALI list are selected and combined with the initial protein of interest to form a three-protein set for use in subsequent structural alignment and model generation. In selecting the DALI homologs to be combined with the protein of interest, only proteins having a DALI-Z score greater than 8.0 are considered, to ensure adequate structural similarity. Among the proteins that fulfill this criterion, we attempt to select maximally divergent members to facilitate the detection of subtle patterns [13]. We have empirically found that a three-protein set is optimal for the initial profile construction, as it produces models with a reasonable balance between the number of conserved positions and diversity of sequence information. Improved models can be obtained by screening several three-protein sets prepared from divergent homologs and selecting the set having the highest number of conserved positions.

Utilizing this strategy, several three-protein sets are selected and structurally aligned, and their conserved positions identified. The set that yields the highest number of structurally conserved positions is used to construct a profile for the initial homology search. Notably, this multiple structural alignment can be performed either manually, taking advantage of expert knowledge, or in an automated fashion using a multiple structural alignment program. For this study, we chose to use the program MAPS [50]. MAPS superimposes the three-dimensional structures and searches for the structurally equivalent residues among all the structures based on their approximate positions on the backbone and side chains of the proteins. The structurally conserved blocks of amino acid residues identified by MAPS are subsequently used to construct the initial block-containing position specific score matrix (BPSSM) as described below.
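The screening of candidate three-protein sets can be sketched as follows. This is a minimal illustration, not the code used in this study: the function name, the input format, and the `count_conserved` callback (standing in for a structural alignment step such as MAPS followed by counting conserved positions) are hypothetical, and the preference for maximally divergent members is not modeled.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple

Z_CUTOFF = 8.0  # DALI-Z score cutoff quoted in the text


def pick_three_protein_set(
    probe: str,
    dali_z: Dict[str, float],
    count_conserved: Callable[[Tuple[str, str, str]], int],
) -> Tuple[str, str, str]:
    """Return the probe plus the two DALI homologs whose three-protein set
    yields the most structurally conserved positions."""
    # Keep only homologs with adequate structural similarity to the probe.
    candidates = sorted(name for name, z in dali_z.items() if z > Z_CUTOFF)
    if len(candidates) < 2:
        raise ValueError("need at least two homologs above the Z-score cutoff")
    # Every candidate set is the probe plus one pair of homologs.
    sets = [(probe, a, b) for a, b in combinations(candidates, 2)]
    # count_conserved is a hypothetical stand-in for the alignment step.
    return max(sets, key=count_conserved)
```

In practice the callback would be expensive, since each call implies a multiple structural alignment, so only a handful of divergent candidates would be screened.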

The Block-containing Position Specific Score Matrix

We describe a protein family by a Block-containing Position Specific Score Matrix (BPSSM) (Fig. 2.1b). Each row in the BPSSM describes the properties of a structurally conserved amino acid position and the possible distances between two consecutive structurally conserved positions.

A key difference from standard local alignment algorithms such as the Smith-Waterman algorithm [51] employed (in a heuristic approximation) by PSI-BLAST is that for regions of contiguous conserved amino acid residues (contigs), no gap is allowed at any position within the contigs. For the non-conserved regions between the contigs, whose lengths vary, upper and lower bounds on the allowable separation distance are established based on the model sequences, and a variable cost is assigned to each separation distance that reflects how typical the observed distance is compared to the distances in the model sequences. We assume that the distribution of these variable lengths is approximately Gaussian. This allows us to characterize each gap between contigs by a mean ḡ and a standard deviation σ of the separation distances. The corresponding log-odds cost function is given as follows:

\[ \mathrm{gapscore}(g) = \min\left\{0,\; -\frac{(g - \bar{g})^{2}}{2\alpha\sigma} + c\right\} \qquad (1) \]

where g is the actual length of the gap, ḡ and σ are the mean and standard deviation of the separation distances in the model sequences, and α as well as c are globally chosen parameters. The parameter α rescales the standard deviation of the distribution. A value larger than one makes the score distribution wider than the distribution of separation distances observed in the model sequences and thus allows new homologs with somewhat larger separation distances than those already included in the model to be found. The parameter c represents the overall normalization of the Gaussian distribution. Since we take the minimum of the Gaussian score and zero, the parameters c and α together set the range of separation distances that remain unpenalized while α alone determines how strongly separation distances are penalized once they are outside this range. In this study, we chose α = 1.5 and c = 0.5 after trying several different values on the neurolysin-like proteins and the TIM barrel proteins as test cases.
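The behavior of the separation-distance cost can be illustrated with a short sketch. It reads equation (1) as min{0, -(g - ḡ)²/(2ασ) + c}; the function and argument names are hypothetical, and the hard upper and lower separation bounds stored in the BPSSM are not modeled here.

```python
def gap_score(g: float, g_mean: float, sigma: float,
              alpha: float = 1.5, c: float = 0.5) -> float:
    """Separation-distance score under the stated reading of equation (1):
    zero inside the unpenalized range, increasingly negative outside it."""
    if sigma <= 0:
        # Assumption: a degenerate spread is rejected rather than special-cased.
        raise ValueError("sigma must be positive")
    return min(0.0, -((g - g_mean) ** 2) / (2.0 * alpha * sigma) + c)
```

With ḡ = 10 and σ = 3, a separation of 12 residues is still unpenalized (score 0), whereas a separation of 16 residues scores -3.5.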

The Weighting scheme and the Modified Weight Function

For constructing the BPSSM, we use an approach analogous to that of PSI-BLAST for pruning high similarity protein sequences, calculating the weighted observed amino acid frequencies, and constructing the pseudocount frequencies. There are key differences, however. In addition to the weighting factors provided by PSI-BLAST’s approach, two features specifically designed to maintain the sequence diversity in the profile are added. First, our algorithm includes a “lock-in” feature that retains, in subsequent alignments, any sequence identified as a homolog in a previous iteration, even if its E-value in the subsequent iterations is above the significance level. The philosophy behind this approach is that if the sequence is found to be a statistically significant homolog in any of the iterations, it is a homolog and therefore should be retained independent of the results in future iterations [20]. In this study, we found that retention of the information from the divergent homologs is essential to the discovery of more distantly related protein sequences. Second, in order to further allow such retained homologs to contribute to the sequence model, each sequence is weighted by an extra enhancement factor designed to prevent previously identified homologs with very few close relatives from being overwhelmed by large families of closely related proteins that would otherwise dominate the BPSSM. Our additional enhancement factor helps to correct this composition bias by increasing the weight of divergent but under-represented homologs that had good scores in a previous alignment. The enhancement factor of each sequence is iteratively trained based on weights resulting from prior iterations.

In developing a weighting scheme for the enhancement factor, we sought a function that would be one for hits with low E-values, but would increase logarithmically with the extent of divergence (as estimated from the E-value) for retained hits "locked in" from previous iterations but having an E-value above threshold in the current iteration. While the choice of a function that smoothly interpolates between these two cases is somewhat arbitrary, we made the specific choice:

\[ \mathrm{enhancementfactor}(e) = w\left[1 + w_{0}\,\log\left(1 + \frac{e}{t}\right)\right] \qquad (2) \]

where w is the weight factor from the previous round, e is the E-value calculated in the previous iteration, t is the E-value threshold used for model sequence inclusion, and w0 is the factor regulating the strength of the reweighting. The value 0.2 for the reweighting parameter w0 was also chosen empirically by observing the performance on the neurolysin-like proteins and the TIM barrel proteins.
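A minimal sketch of the reweighting in equation (2), read here as w[1 + w0 log(1 + e/t)], is given below. The function and argument names are hypothetical, the logarithm is assumed to be the natural logarithm (the text does not state the base), and the default t = 0.05 is the model inclusion threshold quoted elsewhere in this chapter.

```python
import math


def enhancement_factor(w_prev: float, e_value: float,
                       threshold: float = 0.05, w0: float = 0.2) -> float:
    """Weight of a sequence after reweighting, under the stated reading
    of equation (2)."""
    if e_value < 0 or threshold <= 0:
        raise ValueError("e_value must be >= 0 and threshold > 0")
    # log1p(x) = log(1 + x); natural logarithm is an assumption.
    return w_prev * (1.0 + w0 * math.log1p(e_value / threshold))
```

For E-values far below the threshold the multiplier stays close to one, so well-supported hits keep their weight, while a locked-in hit whose E-value has drifted above the threshold is boosted only logarithmically.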

Alignment and Statistical Assessment

Our alignment algorithm uses a dynamic programming approach to find an optimal local alignment between the protein sequence and the BPSSM, which contains both the substitution scores and the length parameter information as described above. To assess the significance of each hit, the background amino acid frequencies of Robinson and Robinson [52] are employed to generate 300,000 random protein sequences. The BPSSM is applied to these random sequences to obtain the scores for sequences of lengths ranging from 100 to 2000 amino acid residues. Numerous studies have shown empirically that local alignment scores approximately follow an extreme value distribution (also known as the Gumbel distribution) [51, 53-55]. The scores in this study also exhibit the exponential tail characteristic of a Gumbel distribution. However, they differ from a Gumbel distribution for the lower scores due to finite sequence length effects. As such, we fit exponential distributions to the tails of the observed distributions and use these fits to derive the parameters (K and λ) required for calculating the E-value. The statistical parameters, K and λ, are recalculated at each iteration as a new BPSSM is constructed and applied. A more detailed explanation of K and λ is given in Chapter 3, equation (1).
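The tail-fitting idea can be sketched as follows. This is an illustration under simplifying assumptions rather than the procedure used in this study: the tail is taken to be the top 10% of random scores, λ is estimated by maximum likelihood from the excesses over the tail cutoff, and the E-value is expressed through the empirical tail fraction instead of an explicit K. All names are hypothetical. In the standard Karlin-Altschul form E = K m n exp(-λS), the prefactor p0·exp(λ·s0) computed here plays the role of K combined with the search-space size.

```python
import math
import random
from typing import List, Tuple


def fit_exponential_tail(scores: List[float],
                         tail_fraction: float = 0.1) -> Tuple[float, float, float]:
    """Fit P(S >= s) ~ p0 * exp(-lam * (s - s0)) to the highest scores.

    Returns (lam, s0, p0): the decay rate, the tail cutoff, and the
    fraction of scores at or above the cutoff.
    """
    ranked = sorted(scores)
    n_tail = max(2, int(len(ranked) * tail_fraction))
    tail = ranked[-n_tail:]
    s0 = tail[0]
    # Excesses of the remaining tail scores over the cutoff s0.
    total_excess = sum(s - s0 for s in tail[1:])
    if total_excess <= 0:
        raise ValueError("tail scores are all identical; cannot fit a rate")
    lam = (n_tail - 1) / total_excess  # maximum-likelihood exponential rate
    return lam, s0, n_tail / len(ranked)


def e_value(score: float, lam: float, s0: float, p0: float,
            n_sequences: int) -> float:
    """Expected number of random sequences scoring at least `score`.
    Only meaningful for score >= s0, where the exponential fit applies."""
    return n_sequences * p0 * math.exp(-lam * (score - s0))


# Usage on synthetic scores with a known exponential rate (illustration only).
random.seed(0)
demo_scores = [random.expovariate(0.3) for _ in range(20000)]
lam, s0, p0 = fit_exponential_tail(demo_scores)
```

Restricting the fit to the tail mirrors the observation above that the low-score region departs from the Gumbel form because of finite sequence length effects.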

A Machine Learning Approach

The structure-based sequence alignment profiles, in the form of block-containing position specific scoring matrices, are built and trained using a machine learning approach. This process involves iterative searches against the non-redundant (NR) protein database for viable homologs using the BPSSM generated from sequences identified during the previous search cycle. At each iteration, a new BPSSM is automatically generated from a multiple alignment of all previously identified sequences together with all newly identified sequences having an E-value less than 0.05. The contribution to the BPSSM from each selected sequence is recalibrated according to its corresponding weight factor, which is calculated from its previous weight and enhancement factor according to equation (2) described above. This process of iterative profile refinement is repeated until convergence or until no new significant sequences are found.
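The control flow of the iterative refinement, including the lock-in of previously accepted sequences, can be sketched as below. The `search` and `build_model` callbacks are hypothetical placeholders for the database search and the BPSSM construction; sequence weighting is omitted, and the cap of ten iterations follows the assessment procedure described later in this chapter.

```python
from typing import Callable, Dict, Set

E_INCLUDE = 0.05  # inclusion threshold quoted in the text


def iterate_profile(
    initial_model: object,
    search: Callable[[object], Dict[str, float]],
    build_model: Callable[[Set[str]], object],
    max_iterations: int = 10,
) -> Set[str]:
    """Repeat search and model rebuilding until no new sequence is added.

    `search` maps a model to {sequence id: E-value}; `build_model` turns the
    set of accepted sequence ids into the next model.
    """
    model = initial_model
    locked_in: Set[str] = set()  # sequences accepted in any earlier round
    for _ in range(max_iterations):
        hits = search(model)
        significant = {sid for sid, e in hits.items() if e < E_INCLUDE}
        new = significant - locked_in
        if not new:
            break  # convergence: nothing new to add
        locked_in |= new  # lock-in: an accepted sequence is never dropped
        model = build_model(locked_in)
    return locked_in
```

Because the accepted set only grows, a sequence whose E-value later rises above the threshold still contributes to every subsequent model.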

Test Cases

The current implementation of LESTAT is computationally expensive, primarily because of the random sequence calculation. While we plan on alleviating this computational cost in the future by incorporating BLAST-like heuristics, for the study of this first implementation of the algorithm we chose, for practical reasons, to focus on a limited number of test cases with the purpose of illustrating proof of principle. We began by testing our approach on proteins that we worked with in our crystallographic laboratory. The initial set of test cases contained proteins from the following three protein families: 1) the neurolysin-like protein family, the model system that stimulated this study; 2) TIM barrel proteins, the second most ubiquitous fold in the PDB database (immunoglobulins being the most abundant) [37, 56]; and 3) the Rossmann-fold protein family, the superfamily having the largest number of sequence families according to the CATH database [57].

To provide additional validation, we sought independent test cases comprised of divergent proteins to assess the scope and applicability of our approach. The database of distant aligned protein structures (DAPS) [58] contains protein pairs that share less than 25% sequence identity and thus are ideal test cases. Since this database is still too large for a complete test, we resorted to an unbiased third-party choice of samples from this database. Specifically, we used the four pairs from this database that Eisenberg and coworkers randomly chose to validate their findings on the human bactericidal/permeability-increasing protein (BPI) [59]. Each of these proteins has a unique fold and belongs to a distinct superfamily. Thus, they are well suited for testing and validation in this study.

Assessment Procedure

The performance of LESTAT and PSI-BLAST was assessed by applying both algorithms to search for sequences in the non-redundant (NR) database (frozen in Oct 2003 when we began this study) and comparing their ability to identify protein homologs having the targeted fold. The assignment of a true or false hit was based on the fold of the closest putative homolog identified within a benchmark database. This allows for a direct comparison in terms of the number of true homologs identified for a given number of errors. Specifically, the Structural Classification of Proteins (SCOP) [60] ASTRAL95 domain structure database (version 1.65) [61-63] was used for the benchmark evaluation. SCOP is a manually constructed hierarchical database of proteins, which have been classified based on their structure and function.

The presumed SCOP assignment was determined by submitting the hits returned from LESTAT and PSI-BLAST to the program BLASTP to search for homologs in the ASTRAL database. Since we are only interested in the SCOP assignment of the sequence region identified by LESTAT or PSI-BLAST to be related to the query sequence, a footprint approach was incorporated in which a sequence footprint of the target domain was created and used for subsequent fold assessment. The footprint of the LESTAT hit sequence was obtained by aligning the hit sequence with the BPSSM and selecting the region between the first aligned amino acid residue and the last aligned amino acid residue. The PSI-BLAST footprint was derived similarly based on the full length of the local alignment reported by PSI-BLAST. Hits that covered 80% of this footprint were retained for SCOP assignment and performance evaluation.

We define a true hit (“yes”) as a protein sequence (a LESTAT or PSI-BLAST hit) that meets the 80% footprint criterion described above and has a homolog with the same (or related) SCOP identifier and an E-value no greater than 0.005. Thus, within this E-value threshold of 0.005, if both related and non-related homologs are found, the query sequence is assigned the “related” SCOP classification, irrespective of the E-value of non-homologous sequences. On the other hand, a protein sequence is defined as a false hit (“no”) or non-related if all of its homologs within the E-value threshold of 0.005 have different SCOP identifiers and therefore are deemed non-related. Protein sequences whose closest homologs exceed the E-value threshold of 0.005 cannot be reliably assigned and thus are marked as ambiguous. Only hits classified as “yes” or “no” are considered for final performance evaluation. For the coverage-error analysis, hits were ordered by their significance based on the E-value resulting from each search method, and the aggregate number of true positives (“yes” hits) was plotted against the number of false positives (“no” hits).
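The labeling rule and the coverage-error curve can be sketched as follows. This is a minimal illustration with hypothetical names and input formats; the footprint filter is assumed to have been applied already, and the cap of 100 errors is the threshold used in the evaluation of the results.

```python
from typing import Iterable, List, Tuple

E_ASSIGN = 0.005  # E-value threshold for the SCOP assignment


def classify_hit(benchmark_matches: Iterable[Tuple[float, bool]]) -> str:
    """Label one hit from its matches in the benchmark database.

    Each match is (E-value, related), where `related` says whether the
    match carries the same (or a related) SCOP identifier as the target.
    """
    trusted = [related for e, related in benchmark_matches if e <= E_ASSIGN]
    if not trusted:
        return "ambiguous"  # no match within the threshold
    return "yes" if any(trusted) else "no"


def coverage_vs_error(hits: Iterable[Tuple[float, str]],
                      max_errors: int = 100) -> List[Tuple[int, int]]:
    """Return (false positives, true positives) after each ranked hit.

    `hits` holds (search E-value, label) pairs; ambiguous hits are skipped.
    """
    curve: List[Tuple[int, int]] = []
    tp = fp = 0
    for _, label in sorted(hits, key=lambda h: h[0]):
        if label == "yes":
            tp += 1
        elif label == "no":
            fp += 1
        else:
            continue
        if fp > max_errors:
            break
        curve.append((fp, tp))
    return curve
```

A single related match within the threshold is enough for a "yes", which reflects the rule that related homologs take precedence over non-related ones.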

Accordingly, for each test case, we conducted a LESTAT search for ten iterations or until convergence, whichever occurred first. Hits from the three iterations that resulted in the largest number of hits with E-values up to 0.05 were then subjected to the SCOP assignment and performance evaluation as described above. The best performing round was selected for method evaluation against the best performing round of PSI-BLAST.

For the PSI-BLAST search, we took advantage of the structural information contained in the initial set of three related proteins by running PSI-BLAST on all three protein sequences used to create the initial model of each test case and merging the resultant hits from the same iteration of each of the three sequences. Among those hits that were found in more than one of the three searches, only the hit with the smallest E-value was kept. This use of multiple structurally homologous (more distantly related) query sequences as opposed to a single sequence allows PSI-BLAST to cover a much larger sequence space. The entire protein sequence was used to perform the PSI-BLAST search for all test cases. However, for the test cases involving the Rossmann-fold and the P-loop containing nucleoside triphosphate hydrolase families, the proteins used to build the initial model are multi-domain and possess partner domains of a different fold. Thus, an additional PSI-BLAST search was performed for these two test cases using only the region relevant to the structural domain in question.
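The merging of the three PSI-BLAST hit lists amounts to keeping the minimum E-value per sequence, as in this short sketch. The function name and the dictionary representation of a hit list are hypothetical.

```python
from typing import Dict, Iterable


def merge_hits(searches: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine hit lists ({sequence id: E-value}) from several queries,
    keeping the smallest E-value for any sequence found more than once."""
    merged: Dict[str, float] = {}
    for hits in searches:
        for seq_id, e_value in hits.items():
            if seq_id not in merged or e_value < merged[seq_id]:
                merged[seq_id] = e_value
    return merged
```

Taking the union of the three searches is what lets the comparison credit PSI-BLAST with hits reachable from any of the three starting sequences.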

Previous tests to evaluate the optimal number of PSI-BLAST rounds have indicated that for large-scale, automated applications, five to six rounds are sufficient to reveal most of the potential matches that would be identified at convergence [4]. As such, the performance of PSI-BLAST rounds one to ten was analyzed using the same BLAST validation strategy as employed for LESTAT, and the best-performing PSI-BLAST iteration was selected for method comparison.

RESULTS AND DISCUSSION

Comparison of LESTAT with PSI-BLAST

We chose to evaluate the performance of LESTAT against PSI-BLAST since neither method relies on a manually curated library; both use an iterative machine learning approach to build and refine their profiles. While both LESTAT and PSI-BLAST incorporate local alignment and iterative model building cycles into their search strategies, LESTAT differs fundamentally from PSI-BLAST in a number of ways. The key difference is that LESTAT uses structural homologs to build an initial profile, while PSI-BLAST uses a profile constructed from a multiple sequence alignment returned from a BLAST search. Since the structures of homologs exhibit much greater conservation than their sequences, LESTAT’s use of structural information for the generation of the initial model enables it to start with an initial profile built from the consensus of three distant homologs having low sequence similarity. Thus, the lack of sequence conservation is overcome in the construction of the position specific scoring matrix by identifying the structurally conserved positions associated with a given fold, and by modeling this information as a set of structurally conserved amino acid blocks and block separation distances (the BPSSM). This approach is in contrast to PSI-BLAST, which utilizes a position specific scoring matrix (PSSM) over the entire fold with a non-position-specific affine gap cost, potentially increasing the chance of focusing on less relevant positions. Furthermore, while LESTAT employs a weighting scheme similar to that of PSI-BLAST, it also includes two additional components: a "lock-in" feature and an enhancement factor. As mentioned previously, these two features work together to compensate for the loss of underrepresented families that are found to be significant in previous alignments and enhance LESTAT’s ability to detect more distantly related sequences by retaining information of the low-abundance homologs in the model.

Evaluation Results

The performance results of LESTAT and PSI-BLAST for all seven test cases are depicted in the coverage versus error plots (Fig. 2.2 and Fig. 2.3). We assessed their performance by plotting the number of true positives returned against the number of errors, up to an error threshold of 100. The same error threshold is also used in the analyses and discussions below with respect to the number of true positives identified by each method.

Fig. 2.2. Performance Comparison of LESTAT versus PSI-BLAST for the initial test set. Coverage versus error plots for (a) neurolysin-like proteins assessed at SCOP superfamily level; (b) TIM barrel proteins assessed at SCOP fold level; (c) Rossmann-like proteins assessed at SCOP superfamily level. LESTAT (solid line), PSI-BLAST full length (dashed line), PSI-BLAST Rossmann-fold region (dotted line) and overlap - hits found by both LESTAT and PSI-BLAST (dotted-and-dashed line).

Fig. 2.3. Performance Comparison of LESTAT versus PSI-BLAST for the independent validation set assessed at superfamily level. Coverage versus error plots for: (a) alpha/beta hydrolase proteins; (b) (Trans)glycosidase proteins; (c) P-loop containing nucleoside triphosphate hydrolase proteins; (d) immunoglobulin proteins. LESTAT (solid line), PSI-BLAST full-length (dashed line), PSI-BLAST P-loop containing nucleoside triphosphate hydrolase region (dotted line) and overlap - hits found by both LESTAT and PSI-BLAST (dotted-and-dashed line).

At the superfamily/fold level, LESTAT uncovers more distant homologs than PSI-BLAST in all test cases, except for the (trans)glycosidase proteins. In addition, detailed analysis of the true positives indicates that for most test cases, LESTAT finds homologs that encompass a broader range of families than PSI-BLAST. Notably, this includes the (trans)glycosidase test case, for which LESTAT finds fewer homologs but covers more families than PSI-BLAST (six families versus four families).

A more important observation, perhaps, is that the homologs uncovered by LESTAT are more evenly distributed among the different families. In contrast, the homologs identified by PSI-BLAST predominantly belong to the same families as the three sequences used to build the initial model, making them less interesting, since members from different families are considered to be more distantly related than those within the same family.

This finding demonstrates that LESTAT is more sensitive than PSI-BLAST in detecting the more remote and therefore more challenging homologs. It also implies that the hits found by LESTAT and PSI-BLAST are quite different. Indeed, when the hits returned from each method are analyzed, there is generally only partial overlap in the hits identified, with a sizable number of hits being unique to each algorithm. The number of hits in common to both methods is represented by the overlap curve in the coverage versus error plots (Fig. 2.2 and Fig. 2.3). For most test cases, the overlap curve lies some distance to the left of the LESTAT and PSI-BLAST curves, supporting the notion that the true positives discovered by the two methods are somewhat orthogonal and suggesting that the two methods themselves are complementary. In the following sections, the individual results for the initial three test cases that laid the groundwork of this study are discussed in detail, and the results of the four independent test sets taken from the literature are summarized.

Neurolysin-like proteins (sccs: d.92.1.5): M32 carboxypeptidase and related homologs

Neurolysin-like proteins are metalloproteases that catalyze the degradation of peptides into shorter fragments. Enzymes in these families share a common sequence motif, His-Glu-Xaa-Xaa-His (HEXXH), that forms part of the metal binding site. Three neurolysin-like proteins, P. furiosus carboxypeptidase (PDB:1KA2), R. norvegicus neurolysin (PDB:1I1I) and H. sapiens angiotensin-converting enzyme (PDB:1O86), which share no significant sequence identity as determined by BL2SEQ [64], were used to build the original model.

The coverage versus error plot reveals that for this test case, at the superfamily level, LESTAT is more sensitive and specific than PSI-BLAST (Fig. 2.2a). LESTAT identifies more distant homologs, distributed over twelve families, than PSI-BLAST, which covers eleven families and retrieves 87% of its hits from the same family as the original model (d.92.1.5). The overlap curve, which lies close to PSI-BLAST’s curve, suggests that most true hits found by PSI-BLAST are also found by LESTAT, indicating that most, if not all, of the sequenced neurolysin-like proteins have likely been identified.

TIM barrel proteins (sccs: c.1)

The TIM barrel structural motif is one of the most frequently observed folds in protein crystal structures. Proteins belonging to this family adopt a characteristic eight-stranded α/β barrel topology; however, they catalyze very different reactions. In addition, members of the TIM barrel proteins share little . Because of their functional and sequence diversity, sequence-based fold prediction of TIM barrel proteins remains a challenge for structural genomics efforts.

The three TIM barrel proteins used to build the initial model are Methanosarcina barkeri monomethylamine methyltransferase (PDB:1L2Q), Moorella thermoacetica methyltetrahydrofolate (PDB:1F6Y) and Salmonella typhi dehydroquinate dehydratase (PDB:1QFE). Because each protein belongs to a different superfamily of the TIM barrel fold, the performance of this particular test case is assessed at the fold level. As illustrated by the coverage versus error plot (Fig. 2.2b), LESTAT detects and discriminates significantly more distant TIM barrel homologs than PSI-BLAST. At a threshold of 100 errors (false positives), our algorithm reports 994 true positives whereas PSI-BLAST reports 838, an increase of about 20% in the number of correct fold assignments. Furthermore, LESTAT identifies 156 more homologs than PSI-BLAST before incurring its first error.

Significantly, the limited overlap of LESTAT and PSI-BLAST (about 55%) indicates that many of the true hits found are unique to each method, reflecting distinct differences in the profiles generated by the two algorithms. Similar to what was observed for the neurolysin-like protein family, about 75% of PSI-BLAST’s homologs originate from the same superfamilies as the initial sequences used in the starting model, while the hits from LESTAT are more widely distributed. Approximately 50% of the homologs identified by LESTAT belong to superfamilies not within the original model, an indication of the enhanced sensitivity of our algorithm in identifying more divergent sequences.

In terms of their superfamily coverage, both PSI-BLAST and LESTAT fail to identify homologs from nine of the twenty-six TIM barrel superfamilies. A closer inspection of these missed superfamily members indicates that the structures of some of these TIM barrel proteins exhibit either a distorted barrel or an incomplete α/β barrel that deviates from the classic (α/β)8-barrel motif. For example, (trans)glycosidase has an additional subdomain and several helical structural elements inserted within its TIM barrel sequence. Considering the various forms of distortion to the classical TIM barrel architecture, and that they are not common structural arrangements found in other superfamilies of barrels, it is likely that accommodating these variations does not result in an optimal alignment to our profile. Because our algorithm penalizes sequences having distances that deviate from the expected lengths between conserved blocks, for large deviations the costs associated with these indels may become prohibitive, thus resulting in a low score.

Rossmann-fold protein family (sccs: c.2.1)

The 3-layered α/β/α sandwich Rossmann fold is another highly populated fold in the protein structural database. While the classical Rossmann fold has a central β-sheet with a typical strand order 32145, variations in the number of helices on either side of the central β-sheet or in the number of strands and strand order that make up the central β-sheet are common. SCOP segregates these variations into different SCOP folds/superfamilies. Most proteins that have a Rossmann domain belong to the c.2 fold class and are multi-domain, being comprised of the Rossmann domain and a distinct partner domain, whose fold classification SCOP uses to define family membership. Three multi-domain proteins having a Rossmann domain were used to build the original model. Two of these proteins, Pseudomonas aeruginosa alcohol dehydrogenase (PDB: 1LLU) and quinone oxidoreductase (PDB: 1QOR), share the same partner domain and belong to the same family (c.2.1.1). A third protein, Rattus norvegicus S-adenosylhomocysteine hydrolase (PDB: 1B3R), belongs to another family (c.2.1.4) because of its flavodoxin-like partner domain. For the LESTAT search, full-length sequences were used to perform the multiple structural alignment, which was then utilized for constructing the initial model. Superposition of the aligned structures was conducted to verify that the Rossmann domain is the only structurally aligned region.

Since each of the three proteins involved has the Rossmann domain and a separate partner domain, two approaches were carried out to conduct the PSI-BLAST search. One search used the full-length sequences (containing both Rossmann domain and partner domain) while another search used only the segment that corresponds to the Rossmann region. According to the coverage versus error plot (Fig. 2.2c), at a threshold of 100 errors, neither of the PSI-BLAST searches (using full length protein or only the relevant Rossmann region) outperforms LESTAT in identifying more distant homologs. In fact, the PSI-BLAST search that uses the entire protein sequence performs significantly worse than the PSI-BLAST search using only the domain sequence. This finding suggests that isolating the relevant domain region to perform the PSI-BLAST search can enhance the sensitivity and specificity of PSI-BLAST in detecting distant homologs.

In addition to finding more distant homologs, LESTAT detects about 2000 more true positives than PSI-BLAST before incurring its first false positive. A close examination of these true positives indicates that 44% of the hits returned by the PSI-BLAST Rossmann-domain-only search belong to the same families used in the original model, and therefore contribute none of the remote homologs from new families that are of greater interest. In contrast, only 28% of LESTAT’s hits come from the same families as the model sequences, further supporting the notion that LESTAT is better suited for distant homology detection.

Given that LESTAT finds twice as many hits as the PSI-BLAST full-length search, one might expect that the PSI-BLAST full-length hits are merely a subset of the LESTAT hits. However, only 75% of the PSI-BLAST full-length hits are also found by LESTAT - the remaining 25% are unique to PSI-BLAST. The overlap between the hits returned from the PSI-BLAST Rossmann-domain-only search and LESTAT is even more limited - less than 55%.

The four independent test cases taken from literature

In addition to the three aforementioned test cases, we have taken from the literature [59] four additional systems to verify that our test results are general and not biased towards LESTAT. The additional validation set comprises proteins from the protein pairs that Eisenberg and coworkers randomly chose from the DAPS database to study how dissimilar sequences (less than 25% sequence similarity) fold into similar structures. The protein families and their corresponding proteins used to build the initial model for each test case are listed in Table 1.

Based on the coverage versus error plots (Fig. 2.3), LESTAT detects more distant homologs than PSI-BLAST for three of the four protein test cases of the validation set, but falls short on the (trans)glycosidases superfamily. The difference comes from the beta-glycanases family (c.1.8.3), to which β-mannanase, a sequence used in the initial model building, belongs. Thus the better performance of PSI-BLAST for the (trans)glycosidase test case can be attributed to the fact that PSI-BLAST is better than LESTAT at finding homologs within the same family, a feature commonly observed in all the test cases we have studied. In the case of (trans)glycosidase proteins, 2145 homologs (out of 2148) identified by PSI-BLAST are from the same families as the model sequences, whereas LESTAT identifies 38 homologs that belong to different families. Thus, even though LESTAT identifies a smaller number of homologs in the (trans)glycosidases test case (Fig. 2.3b), it covers a broader range of families than PSI-BLAST - an observation that holds true for most of the test cases (six of seven). This broader coverage of families by LESTAT is best illustrated in the alpha/beta hydrolases (c.69.1) test case. Here, LESTAT and PSI-BLAST identify a comparable number of homologs (Fig. 2.3a), covering all twenty-six of the families within the alpha/beta hydrolases superfamily. In the later version of ASTRAL (v.1.71), the cutinase-like family (c.23.9.1) was reclassified from the cutinase-like superfamily to the alpha/beta hydrolases superfamily (c.69.1.30). Notably, LESTAT is able to identify cutinase-like homologs with statistically significant E-values whereas PSI-BLAST fails to identify homologs from this new family.

As similarly observed in the initial cases used to develop the algorithm, the four test cases under discussion also yield homologs with a distinct distribution among different families for LESTAT and PSI-BLAST. For the immunoglobulin proteins test case, the main difference in distribution comes from two of the four families: (1) the I set domains family (b.1.1.4), where LESTAT finds 3008 homologs compared to 672 homologs found by PSI-BLAST; and (2) the C1 set domains family (b.1.1.2), where LESTAT finds only 1310 homologs, while PSI-BLAST finds 4461 homologs. Furthermore, between the 50784 homologs identified by LESTAT and 48814 homologs identified by PSI-BLAST, over 7800 homologs are unique to only one method (Fig. 2.3c). These results serve to reaffirm the notion that LESTAT and PSI-BLAST are complementary methods that cover different regions of sequence space.

As observed in the Rossmann-fold test case from the initial set, evaluation of PSI-BLAST performance on the P-loop containing nucleoside triphosphate hydrolases test case reveals that using only the core domain to conduct the PSI-BLAST search yields significantly better results than using the entire protein sequence (Fig. 2.3d). In this instance, using just the conserved domain results in a three-fold increase in the number of true homologs identified. Nevertheless, even with the enhanced result associated with using the sequence region from only the conserved domain, PSI-BLAST still identifies far fewer homologs than LESTAT (21283 versus 32530).

A major challenge for a sequence alignment method is predicting an alignment of related proteins that share very little sequence identity. Since the four examples discussed in this section are taken from the DAPS database of protein pairs, it is an important question whether LESTAT or PSI-BLAST can use one protein of the protein pair to fish out its mate or the family that its mate belongs to. The test is not trivial since none of the four protein pairs share a sequence identity greater than 15% and care has been taken to ensure that none of the structural homologs selected to build the initial models belong to the same family as the other protein (“the mate”) in the protein pair.

For the pair with the highest similarity (still only 14%), bromoperoxidase A2 and triacylglycerol hydrolase, LESTAT is able to find triacylglycerol hydrolase while PSI-BLAST is not. Nevertheless, PSI-BLAST does find 90 homologs that belong to the same family as triacylglycerol hydrolase (c.69.1.17).

In the (trans)glycosidases test case, the other protein in the protein pair is concanavalin B, which shares 10% sequence identity and belongs to the Type II chitinase family (c.1.8.5). LESTAT is able to identify homologs from the Type II chitinase family, though not concanavalin B in particular. PSI-BLAST fails to identify concanavalin B or homologs from the Type II chitinase family. As for the P-loop containing nucleoside triphosphate hydrolases protein family, the α-subunit of the bovine mitochondrial F1-ATPase and Escherichia coli RecA protein share 9% sequence identity over the aligned residues. At this level of sequence identity, neither LESTAT nor PSI-BLAST is able to identify the bovine mitochondrial F1-ATPase. It is likely that the 9% sequence identity over the aligned positions between these two proteins is beyond the detection limit of either method.

Implications to protein folding

A potentially unexpected implication of this work relates to the general folding of proteins. While Anfinsen’s hypothesis states that the sequence of a protein determines its fold, there has been some debate as to whether a non-specific set of native interactions provides the necessary energy to overcome the folding barrier [65-67], or if there is a specific set of interactions (folding nucleus) that are necessary to drive the folding process [65, 68-70]. The fact that LESTAT, which uses only a limited set of structurally conserved core residues and their separation distances, exhibits a better performance than PSI-BLAST, which uses the entire sequence, would appear to support the folding nucleus model. Indeed, the success of LESTAT implies that the separation distances of these core residues may be an important property of the folding nucleus. It is likely that these distances could dictate the entropic cost of folding into a given fold intermediate, giving rise to the distance preferences observed by LESTAT. Thus, like secondary structure, residue hydrophobicity/polarity and environmental factors, these block separation distances would provide yet another structural property that could be used to improve sequence-based fold prediction.

CONCLUSION

LESTAT uses a model that considers not only amino acid preferences, but also the separation distances between the conserved blocks of residues to generate its blocked-position specific scoring matrix. This model highlights the fact that although the amino acids in the regions between these blocks are not conserved, the number of residues in these regions is usually confined to certain lengths based on the model sequences. This conservation in length can be viewed as originating from a constraint imposed onto the sequence due to specific spatial separation distances and secondary structural elements associated with the fold. LESTAT utilizes the separation distances, in the form of gap costs, both to allow flexibility in which structural elements occupy the intervening space and to emphasize the conservation of the geometric constraints. By incorporating these semi-conserved distances and the conserved amino acid blocks into building its profile, and integrating it with an iterative routine, LESTAT generates a profile that is different from that of PSI-BLAST, thus enabling it to uncover homologs that evade PSI-BLAST detection. As such, LESTAT’s ability to discover more distant homologs distinct from PSI-BLAST and its broad superfamily coverage make it a useful tool that can be used to complement PSI-BLAST when searching for distant homologs.


CHAPTER 3

SIMPLE IS BEAUTIFUL: A STRAIGHTFORWARD APPROACH TO IMPROVE THE DELINEATION OF TRUE AND FALSE POSITIVES IN PSI-BLAST SEARCHES

Sequence alignment is one of the most widely used techniques in , particularly for functional annotation of non-characterized sequences. The power of this approach is directly dependent on the ability of sequence alignment tools to detect weak homologies.

The highly popular sequence alignment tool, BLAST [2], is very good at finding close homologs. PSI-BLAST [3], an extended version of BLAST, is more effective and sensitive in detecting distant homologs due to its iterative profile-based search strategy [71, 72]. In its first iteration, PSI-BLAST identifies related sequences that meet a specific inclusion threshold and utilizes these sequences to generate a profile or position specific score matrix (PSSM) to be used for the next iterative search. The PSSM is then used to align and score sequences in the database to search for new statistically significant hits. This process is performed iteratively until the model converges or no new statistically significant sequences are found. Previous studies have shown that iterative approaches, such as that used in PSI-BLAST, are more effective in finding additional homologs. However, a major potential problem of this approach is model corruption. With each subsequent iteration, there exists an increased probability of including a non-homologous sequence that meets the threshold for PSSM inclusion, and whose presence can, in turn, lead to the incorporation of other false positives. Such model corruption is particularly treacherous if the non-homologous sequence belongs to a large family. A naive approach to overcome this problem is to set a stringent inclusion threshold, but the trade-off is a loss of sensitivity, especially for the more remote homologs, thus diminishing the strength and utility of PSI-BLAST. Hence, there is general interest in developing novel strategies to distinguish and eliminate early false positive sequences from true hits in homology detection searches.

Different approaches have been studied and proposed to improve the discrimination of true and false positives. These methods typically incorporate extrinsic information into the search, such as (predicted) structural [38, 40, 41, 73, 74] or literature data [75-77], or use completely different techniques such as SAM [27, 28, 35-37, 78], or a combination of such other techniques [79]. Despite their better performance in remote homology detection, these approaches are computationally more expensive than PSI-BLAST itself, thus hindering their wide acceptance by the user community.

In this chapter, we describe a new methodology to improve the statistical assessment of hits returned from a PSI-BLAST search that yields better delineation of true and false positives. The appeal of our approach is its simplicity: it can be integrated easily without incurring significant computational costs to the PSI-BLAST algorithm, while at the same time improving the accuracy of its searches.

This algorithm has been incorporated into a new web server, SIB-BLAST, that allows users to simply submit a protein sequence for a PSI-BLAST search, then processes the relevant PSI-BLAST outputs using the aforementioned algorithm and returns to the users a list of hits rank-ordered by their "combined E-value", termed the figure of merit.

METHODS

Key Ideas

The goal of our approach is to improve on the high sensitivity of PSI-BLAST in terms of remote homolog detection without sacrificing computational efficiency - one of the main reasons for PSI-BLAST’s popularity. Our approach is based on a well-known property of PSI-BLAST searches: in its early iterations PSI-BLAST is not very sensitive, but very specific. Upon performing more and more iterations, sensitivity increases while specificity suffers from what is called “model corruption”. This behavior can be understood as follows: as an iterative tool, PSI-BLAST attempts to build better and better models of the homologs to the original query sequence. The first round of PSI-BLAST, which is essentially BLAST, finds only relatively close homologs. These close homologs are then used to generate a profile for the next iteration that will be able to find more remote homologs. Iterating this process in theory yields progressively better models and thus finds more and more weakly related homologs to the original query. However, in practice, non-homologous false positives are frequently incorporated into the model at some point during this iterative process. Once this happens, the model is “corrupted” and completely loses its specificity for the true homologs of the original query, resulting in the false identification of large numbers of non-homologous sequences as putative true homologs. For this reason, it is recommended to run at most five to six iterations of PSI-BLAST instead of running PSI-BLAST until no more new sequences are identified [4].

However, even after five or six iterations, it is common to find that some non-homologous sequences are already incorporated. Thus, there is a trade-off between an increased sensitivity and an increased possibility for model corruption as more iterations are performed.

The key idea of our approach is that by combining the results from different PSI-BLAST iterations, the benefits of low corruption from the early rounds, and the benefits of high sensitivity from the later rounds can both be realized. More precisely, we expect that a true homolog identified in a later iteration should have already revealed its identity as a homolog in the second round (the one using the first profile generated by PSI-BLAST and least likely corrupted). This distant homolog may not have had the level of homology that would have been statistically significant in the second round, but it should at least exhibit some homology. On the other hand, a non-homolog reported in a later round due to model corruption should show a much lower degree of homology in the second round than a true homolog. We thus hypothesize that “benchmarking” hits from later iterations against hits from iteration two where the model is least corrupted should improve the accuracy of a PSI-BLAST search.

The beauty of our strategy is that it requires almost no additional computational effort - if PSI-BLAST is iterated to, say, the fifth round, the results for the second round will have been calculated anyway. All that is required is to store these results and combine them with the results of the last round. The obvious question that has to be resolved, though, is precisely how to combine the results from different rounds of PSI-BLAST. The remainder of this section will be devoted to providing an explicit formula for combining the E-values reported by PSI-BLAST in the two rounds into a combined figure of merit. The derivation of this formula will be somewhat heuristic in nature. However, an evaluation of this method on a “gold standard” database presented in the verification protocol section will show that our method is indeed very successful in combining the benefits from the different rounds of PSI-BLAST in spite of the somewhat heuristic nature of its derivation.

Review of BLAST statistics

In order to derive a way to combine the E-values reported by PSI-BLAST for the different rounds into a single figure of merit, it is useful to review how these E-values are actually calculated. If a query sequence with L amino acids is aligned against a random subject sequence of M amino acids using the Smith-Waterman [33] local alignment algorithm, the local alignment score Σ can be considered a random variable. In the absence of gaps, this random variable has been proven [80-82] to follow an extreme value distribution

$$\Pr\{\Sigma < S\} = \exp\left(-K L M e^{-\lambda S}\right) \qquad (1)$$

if the query and subject sequence are sufficiently long. Here, λ and K are two parameters that depend in a known way on the scoring system, the query sequence, and the random ensemble used to choose the random subject sequences.

Even in the presence of gaps, heuristic arguments and numerical studies [4, 51, 54, 83-86] confirm that the local alignment score is still distributed according to the extreme value distribution Eq. (1), albeit with modified values of the two parameters λ and K. Thus, the p-value of an alignment score, or the probability of obtaining an alignment score of magnitude S or better in one alignment against a random subject sequence, is given by

$$p = 1 - \exp\left(-K L M e^{-\lambda S}\right) \qquad (2)$$

BLAST and PSI-BLAST report E-values, or the expected number of random hits of score S or higher, instead of p-values. These two quantities are different since the p-value refers to the probability of having at least one hit at the prescribed significance level, while the E-value quantifies the expected number of these hits. The general relationship between these two quantities for random variables with exponential tails is the Poisson formula

$$p = 1 - e^{-E} \qquad (3)$$

By comparing with Eq. (2) we find

$$E_{\mathrm{seq}} = -\log(1 - p) = K L M e^{-\lambda S} \qquad (4)$$

It is important to note that this is the E-value for one single sequence comparison, which we indicate by the subscript “seq”. The convenient property of E-values is that the Bonferroni correction for repeating the alignment for every sequence in a database of N amino acids, and thus roughly N/M sequences, just manifests itself in a multiplication of the expected number of hits for a single sequence by the number of sequences, i.e.,

$$E_{\mathrm{db}} = \frac{N}{M} E_{\mathrm{seq}} = K L N e^{-\lambda S} \qquad (5)$$

It is this latter E-value that BLAST and PSI-BLAST report as the statistical significance of their hits.
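As an illustration of Eqs. (3) to (5), the following minimal Python sketch evaluates the single-comparison E-value, the corresponding p-value, and the database E-value for one score. The numerical values of K, λ, L, M, N and S are assumptions chosen purely for illustration and are not taken from this work.

```python
import math

def evalue_single(score, K, lam, L, M):
    """Eq. (4): expected number of random hits with score >= S in one comparison."""
    return K * L * M * math.exp(-lam * score)

def pvalue_from_evalue(e):
    """Eq. (3): p = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e)

def evalue_database(score, K, lam, L, N):
    """Eq. (5): E_db = (N/M) * E_seq = K * L * N * exp(-lambda * S)."""
    return K * L * N * math.exp(-lam * score)

# Illustrative (assumed) values, not parameters reported in the text.
K, lam = 0.041, 0.267          # statistical parameters of the scoring system
L, M, N = 250, 300, 1_000_000  # query length, subject length, database size
S = 80                         # alignment score

e_seq = evalue_single(S, K, lam, L, M)
p_seq = pvalue_from_evalue(e_seq)
e_db = evalue_database(S, K, lam, L, N)
print(f"E_seq = {e_seq:.3g}, p = {p_seq:.3g}, E_db = {e_db:.3g}")
```

For small E-values the p-value and the E-value nearly coincide, which is why the distinction only matters for weak hits.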

Combining two E-values

In order to combine results from different rounds, we make the assumption that the different rounds of PSI-BLAST are statistically independent. At first glance, making this assumption seems heuristic since the models used in all PSI-BLAST iterations describe the same query sequence and thus are related. It is worth pointing out, however, that for precisely the scenario we are most worried about, namely model corruption, the assumption of statistical independence between early and late rounds of PSI-BLAST is actually a valid one. As mentioned before, the main justification for this assumption is that it leads to good retrieval performance as shown in the next section using real biological data.

As a means to eliminate false positives resulting from model corruption, we have to accept a hit only if it is significant in the final round of PSI-BLAST as well as in the second round of PSI-BLAST. Then, a false hit occurs only if the same subject sequence is a false hit in the second round and in the final round.

Under the assumption of statistical independence, the total probability for a false hit with score S2 or higher in the second round and Sf or higher in the final round can be calculated using Bailey and Gribskov's equation for the combined p-value of two independent p-values [87]

$$p_{\mathrm{tot}} = p_2\, p_f \left(1 - \log(p_2\, p_f)\right) \qquad (6)$$

where p2 and pf are the p-values calculated according to Eq. (2) for the two score thresholds S2 and Sf. Since PSI-BLAST does not report these p-values we have to use Eq. (3) and the first equality in Eq. (5) to determine them. This yields

$$p_{\mathrm{tot}} = \left[1 - \exp\left(-\frac{M}{N}E_2\right)\right]\left[1 - \exp\left(-\frac{M}{N}E_f\right)\right]\left(1 - \log\left(\left[1 - \exp\left(-\frac{M}{N}E_2\right)\right]\left[1 - \exp\left(-\frac{M}{N}E_f\right)\right]\right)\right) \qquad (7)$$

where E2 and Ef are the E-values reported by PSI-BLAST in the second and final round, respectively. In order to transform this p-value into a quantity that resembles the E-values reported by PSI-BLAST, we use as the final figure of merit (FOM)

$$E_{\mathrm{tot}} = -\frac{N}{M}\log(1 - p_{\mathrm{tot}}) \qquad (8)$$

We refer to this quantity as a figure of merit instead of an E-value since this quantity is strictly speaking only an E-value if the assumption of statistical independence of the two rounds in question is fulfilled. In general, we expect that this number is an underestimate of the true E-value due to correlations between the rounds. However, as we will show in the next section, using this number as a figure of merit yields a superior discrimination between true and false positives.
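The combination described by Eqs. (6) to (8) can be sketched in a few lines of Python. This is an illustrative sketch rather than the distributed awk script: the function names are hypothetical, and the ratio N/M is passed in as a single number `n_over_m`, which we assume here can be approximated by the number of sequences in the searched database (6406 is used only as an example value).

```python
import math

def seq_pvalue(e_db, n_over_m):
    """Single-comparison p-value from a database E-value (Eqs. 3 and 5)."""
    return -math.expm1(-e_db / n_over_m)

def combined_fom(e2, ef, n_over_m):
    """Figure of merit E_tot from the round-two and final-round E-values."""
    q = seq_pvalue(e2, n_over_m) * seq_pvalue(ef, n_over_m)
    if q <= 0.0:
        return 0.0                      # E-values too small to resolve numerically
    p_tot = q * (1.0 - math.log(q))     # Eqs. (6)/(7): combined p-value
    if p_tot >= 1.0:
        return math.inf                 # guard against log(0) from rounding
    return -n_over_m * math.log1p(-p_tot)  # Eq. (8)

# Hypothetical example: two hits with the same final-round E-value.
n_over_m = 6406
print(combined_fom(1e-3, 1e-10, n_over_m))  # weakly supported in round two
print(combined_fom(10.0, 1e-10, n_over_m))  # essentially unsupported in round two
```

The second hit receives the larger (worse) figure of merit even though both hits have the same final-round E-value, which is the behavior the method relies on to demote hits that only appear after the model has drifted.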

Verification protocol

One major consideration in performance evaluation is the accuracy of the assignment of true and false positives to the hits returned from a search. An ideal test set would be one with known (expert-curated) true and false positive relationships of the query sequences to those in the test database, making the assignment unambiguous and straightforward. In light of this consideration and given that our proposed method is an elaboration on PSI-BLAST, it makes sense to use the same dataset used by the PSI-BLAST development team to evaluate their search refinement strategies. Consequently, we have verified our proposed methodology using the “Aravind” dataset [30] employed by Schaffer et al. in their 2001 study of different methods to improve PSI-BLAST performance [4]. This Aravind dataset is comprised of 103 queries of single protein domain sequences of the yeast Saccharomyces cerevisiae and a sequence database containing 6406 yeast proteins. We reasoned that using the same data set should give an unbiased performance comparison between the two approaches. More importantly, the Aravind dataset contains the true positives for each query sequence, which have been expertly curated by Schaffer’s coworkers at the NCBI by conducting PSI-BLAST searches against the yeast database and analyzing the resultant alignments on a case-by-case basis via expert examination. Sequences in the yeast database that were not annotated as true positives were considered false by default.

For our study, a five-round PSI-BLAST search was conducted for each of the 103 queries against the complete non-redundant (NR) protein database with its 1,986,685 sequences (frozen as of 8/23/04). The choice of five rounds is based on previous empirical analysis by the PSI-BLAST developers that suggested most hits that will be found with an E-value less than or equal to 0.005 (the default threshold that PSI-BLAST uses to include sequences for PSSM construction) are usually found by round five or six [3, 4]. The position specific score matrices (PSSM) constructed for iteration two and iteration five were saved as “checkpoints”. The two checkpoint files per query were then used to search the annotated yeast database to produce a list of yeast hits (up to an E-value of 1000) for iteration two and iteration five, respectively. Each hit in iteration two and five was assigned a “yes” (for true positive) or “no” (for false positive) according to the true positives for the Aravind test set described previously. To generate a list of common hits to be used for our proposed approach, hits that were found in one iteration but missing in the other were arbitrarily assigned an E-value of 10,000. This list of hits and their corresponding E-values at iteration two and iteration five were combined to obtain the figure of merit (Etot) using the formula Eq. (8) described above.
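The merging step of this protocol can be sketched as follows. The sketch assumes that the hits of each iteration have already been parsed into a dictionary mapping a subject identifier to its E-value; the parsing itself, the identifiers, and the function names are hypothetical. Any function implementing Eq. (8) can be supplied as `fom`.

```python
MISSING_E = 10_000.0  # E-value assigned to a hit absent from one iteration

def merge_hits(round2, round5, missing=MISSING_E):
    """Return {subject_id: (E2, E5)} over the union of both hit lists."""
    merged = {}
    for sid in set(round2) | set(round5):
        merged[sid] = (round2.get(sid, missing), round5.get(sid, missing))
    return merged

def rank_hits(merged, fom):
    """Sort subject IDs by ascending figure of merit; fom(e2, e5) -> float."""
    return sorted(merged, key=lambda sid: fom(*merged[sid]))

# Hypothetical hit lists for a single query.
round2 = {"seqA": 1e-20, "seqB": 5.0}
round5 = {"seqA": 1e-35, "seqC": 1e-4}
merged = merge_hits(round2, round5)
print(merged["seqB"])  # found only in iteration two, so E5 is set to 10,000
```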

RESULTS

Evaluation and comparison to PSI-BLAST

To compare the performance of our proposed approach with PSI-BLAST's iterations two and five, all the hits generated from the 103 queries were pooled together and rank-ordered by their E-values, or in our case, figures of merit (Etot).

A coverage versus error analysis was conducted to examine the number of true positives identified by varying the error levels up to a threshold of 100. The result of this analysis is shown in Fig. 3.1.
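A coverage versus error curve of this kind can be computed from a ranked list of labeled hits. The sketch below assumes the pooled hits have already been sorted from most to least significant and reduced to booleans (True for a true positive); the function name and the toy input are hypothetical.

```python
def coverage_vs_error(ranked_labels, max_errors=100):
    """Walk down a ranked hit list and record (errors, true_positives) after each hit.

    ranked_labels: iterable of booleans, best hit first; True marks a true positive.
    Stops as soon as the number of false positives would exceed max_errors.
    """
    points = []
    true_pos = false_pos = 0
    for is_true in ranked_labels:
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
            if false_pos > max_errors:
                break
        points.append((false_pos, true_pos))
    return points

# Toy ranking: two true hits, one error, then another true hit.
print(coverage_vs_error([True, True, False, True], max_errors=1))
```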

Fig. 3.1. Performance comparison between rounds two and five of PSI-BLAST, and our new method for combining different rounds. The curves show the number of errors as a function of the number of true positives for the three methods as evaluated on the Aravind “gold standard” dataset. The dotted and dashed lines are for iterations two and five of PSI-BLAST respectively, while the solid line shows the performance of our “combined E-values” method. It can be seen that our method achieves the same sensitivity as round five of PSI-BLAST with (nearly) the same specificity as round two of PSI-BLAST.

As shown clearly by the respective position occupied by each curve, our proposed methodology outperforms both iterations two and five of PSI-BLAST.

The behavior of the two curves for PSI-BLAST’s iterations two and five shows that there is indeed a tradeoff between sensitivity and specificity between different levels of PSI-BLAST searches: higher specificity for iteration two (later occurrence of the first false positive), but higher sensitivity for iteration five (more true positives found). In contrast, the curve corresponding to our method lies furthest right in the plot, encompassing more true positives with minimal increase in false positives.

Comparison to SAM-T2K

Although widely recognized as the most popular sequence alignment tool, PSI-BLAST is found to be less effective in remote homolog detection than the more computationally costly HMM-based methods [29]. Given the improved performance of our approach over the current PSI-BLAST method, it would be interesting to see how our proposed improvement to the PSI-BLAST algorithm performs when compared to state-of-the-art approaches. Similar to PSI-BLAST in its automated iterative search strategy, the HMM-based SAM-T2K [27, 28] has been shown to be superior to PSI-BLAST in distant homology detection in general [29]; thus we have chosen SAM-T2K for additional performance comparison.

We thus applied SAM-T2K to the Aravind test set following the standard procedure as prescribed in the SAM-T2K suite from the University of California, Santa Cruz (http://www.soe.ucsc.edu/research/compbio/sam.html). In brief, for each query sequence in the Aravind test set, we used the script target2k to perform five iterations of the following cycle: (1) search the NR database for homologs, (2) perform a multiple alignment of the homologs to construct an HMM-model, and (3) score the sequences in the NR database using this HMM-model. The multiple alignment generated after the final round was then used to create an HMM-model using the w0.5 script. This HMM-model was used to score the 6406 yeast sequences in the Aravind test set using the script hmmscore with the alignment parameter set to local alignment. All other parameters used to conduct the SAM-T2K search were left at their default values.

As shown in Fig. 3.2, whereas our proposed method performs similarly to SAM-T2K, PSI-BLAST falls short of both methods in specificity and sensitivity - finding fewer true homologs for the same number of false positives. The results here reaffirm the general notion that the HMM-based SAM-T2K is a better remote homolog detection tool than PSI-BLAST. Nevertheless, the more sophisticated and complex SAM-T2K is computationally more expensive, making it less popular than PSI-BLAST. The implication of the performance of our “combined E-values” approach is thus more significant when put into the context of computational efficiency. For minimal additional computation cost (since our method uses outputs already produced by PSI-BLAST), our approach is able to elevate PSI-BLAST’s performance to match that of SAM-T2K.


Fig. 3.2. Performance comparison between PSI-BLAST, SAM-T2K, and our proposed method. The curves show the number of errors as a function of the number of true positives for the three approaches as evaluated on the Aravind “gold standard” dataset. The long dashed lines are for iteration five of PSI-BLAST, the short dashed lines are for iteration five of SAM-T2K, and the solid line represents the performance of our “combined E-values” method. It can be seen that both our method and SAM-T2K achieve higher specificity and sensitivity than PSI-BLAST.

Receiver Operating Characteristic Analysis

In addition to coverage versus error analysis, another common technique to quantify the performance of different approaches is a receiver operating characteristic (ROC) analysis, which measures the rate of true positives against the rate of false positives. The quantity ROC is derived by calculating the area under the curve and adopts a value between 0 (worst) and 1 (best). Although the recommended figure of merit for a comprehensive sequence database search is ROC50 [88], we followed the lead of the PSI-BLAST developers [4] in their choice of ROC100 to compare the performance of PSI-BLAST iterations two and five, our “combined E-values” method using these two PSI-BLAST iterations, and SAM-T2K iteration five. We thus calculated the characteristic ROC100 as

$$\mathrm{ROC}_{100} = \frac{1}{100\,T}\sum_{i=1}^{100} t_i \qquad (9)$$

where T = 1005 is the total number of true positives in the yeast database and ti is the number of true positives ranked ahead of the ith false positive. As indicated in Table 3.1, the ROC100 for our “combined E-values” method (0.836) is comparable to that of SAM-T2K. When compared to either iteration two or iteration five of PSI-BLAST, however, our ROC100 is higher, providing further support that combining results from early and late PSI-BLAST iterations improves the discrimination of true and false positives.

Table 3.1. ROC100 values characterizing retrieval performance for a "gold standard" yeast database with 103 query sequences and 1005 true positives

          PSI-BLAST iteration 2   PSI-BLAST iteration 5   Combined E-values   SAM-T2K
ROC100    0.764                   0.807                   0.836               0.825

Our “combined E-values” method has a higher ROC100, and thus shows better performance than PSI-BLAST’s iterations two or five alone.
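For concreteness, Eq. (9) can be evaluated with the short Python sketch below, generalized to an arbitrary cutoff n. The input is assumed to be the pooled hit list already ranked from most to least significant and reduced to booleans. How to treat a list that contains fewer than n false positives is not specified in the text; the sketch assumes the remaining terms are set to the number of true positives found, which is one common convention.

```python
def roc_n(ranked_labels, total_true, n=100):
    """ROC_n = (1 / (n * T)) * sum over the first n false positives of t_i,
    where t_i is the number of true positives ranked ahead of the i-th false positive.
    """
    true_pos = false_pos = 0
    total = 0
    for is_true in ranked_labels:
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
            total += true_pos
            if false_pos == n:
                break
    # Assumption: if fewer than n false positives occur, pad with the final count.
    total += (n - false_pos) * true_pos
    return total / (n * total_true)

# Toy ranking with two true positives and two false positives.
print(roc_n([True, False, True, False], total_true=2, n=2))  # → 0.75
```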

SIB-BLAST WEB SERVER

We have created the SIB-BLAST web server under the URL (http://sib-blast.osc.edu) to facilitate the use of our algorithm by the outside community. SIB-BLAST implements a combined protocol that integrates a standard PSI-BLAST search with our algorithm and returns a list of hits rank-ordered by their corresponding combined E-values. An awk script that performs the figure of merit calculation from standard PSI-BLAST output is available online at http://bioserv.mps.ohio-state.edu/SimpleIsBeautiful.

Briefly, SIB-BLAST is comprised of the following steps: (1) performing a PSI-BLAST search of a query protein sequence against the non-redundant database; (2) comparing resultant hits in iteration two and the last iteration; (3) calculating a figure of merit (FOM) for each hit by combining its corresponding E-values at iteration two and at the final round; and (4) re-ranking the merged list of hits according to their FOM.
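Steps (2) to (4) amount to merging two hit lists and sorting by a combined score. The Python sketch below illustrates that flow under stated assumptions. The FOM formula is not repeated in this section, so the combining function is passed in by the caller as `combine`. The `missing_evalue` argument, used for a hit that appears in only one of the two lists, is hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def rerank_hits(
    evalues_round2: Dict[str, float],
    evalues_final: Dict[str, float],
    combine: Callable[[float, float], float],
    missing_evalue: float,
) -> List[Tuple[str, float]]:
    """Merge the hits of iteration two and the final iteration, then re-rank.

    combine(e2, e_final) must return the figure of merit for one hit; it is
    supplied by the caller because the formula is defined elsewhere.
    missing_evalue is a hypothetical stand-in E-value for a hit that is
    absent from one of the two lists.
    """
    scored = []
    for hit_id in set(evalues_round2) | set(evalues_final):
        e2 = evalues_round2.get(hit_id, missing_evalue)
        e_final = evalues_final.get(hit_id, missing_evalue)
        scored.append((hit_id, combine(e2, e_final)))
    # Smaller figure of merit ranks first; ties are broken by hit identifier.
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored
```

Keeping `combine` as a parameter means the ranking logic stays the same if the combination rule is later replaced, for example by one that accounts for dependence between iterations.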

Inputs

The SIB-BLAST web server (Fig. 3.3) requests three inputs from the user: a protein sequence in FASTA format, the number of iterations of the PSI-BLAST search, and the maximal number of target sequences reported in the PSI-BLAST search. Users are given the option to either paste a protein sequence or upload a file containing a protein sequence in FASTA format. The number of PSI-BLAST iterations to be performed is limited to between three and ten rounds, though users are advised to choose no more than five to six iterations, as suggested by the PSI-BLAST developers [4]. The number of target sequences reported by PSI-BLAST can be chosen as 1000, 2000, 5000, 10000, or 20000. This number is purposefully restricted to be higher than the PSI-BLAST default value to ensure that even weak hits are reported in the individual rounds, as they might become significant once results from different rounds are combined.

Fig. 3.3. Snapshot of the SIB-BLAST front page. The main section allows the user to submit the query protein sequence either by pasting in the window or by uploading a file. Dropdown menus allow the user to choose the number of rounds and the maximal number of target sequences. A brief explanation of each of these input parameters can be obtained by clicking the HELP link. Links to the Help manual and a downloadable SIB package are included on the front page.

Other than these three user-defined input parameters, the parameters used to conduct the search are preset. The non-redundant (NR) database (updated weekly) is used for querying the sequence. The Expect threshold, which sets the E-value cutoff for reporting sequence matches, is set to 1000 instead of PSI-BLAST's default value of 10. This higher threshold ensures that true but more distant homologs that are identified in the final iteration but have a very large E-value in round two are reported in the hit lists of both iterations, which are subjected to downstream SIB processing. All other algorithm parameters, such as the word size, the scoring matrix, and the PSI-BLAST threshold E-value of 0.002 for inclusion of matches in the profile for the next round, are set to default values.
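The input limits and preset parameters described above can be summarized in a short Python sketch that validates the two numeric inputs and assembles a search command. It assumes the NCBI BLAST+ `psiblast` program and its `-query`, `-db`, `-num_iterations`, `-evalue`, `-inclusion_ethresh`, `-num_descriptions` and `-num_alignments` options. Whether the server used this program or an earlier PSI-BLAST release is not stated here, and the function name is hypothetical.

```python
from typing import List

# Target-sequence limits offered by the web form, as listed in the text.
ALLOWED_MAX_TARGETS = (1000, 2000, 5000, 10000, 20000)


def build_psiblast_command(query_fasta: str, iterations: int, max_targets: int) -> List[str]:
    """Validate the user inputs and return a psiblast argument list."""
    if not 3 <= iterations <= 10:
        raise ValueError("the number of iterations must be between 3 and 10")
    if max_targets not in ALLOWED_MAX_TARGETS:
        raise ValueError("max_targets must be one of %s" % (ALLOWED_MAX_TARGETS,))
    return [
        "psiblast",                          # assumed BLAST+ executable
        "-query", query_fasta,
        "-db", "nr",                         # non-redundant database
        "-num_iterations", str(iterations),
        "-evalue", "1000",                   # Expect threshold raised from 10
        "-inclusion_ethresh", "0.002",       # profile inclusion threshold
        "-num_descriptions", str(max_targets),
        "-num_alignments", str(max_targets),
    ]
```

The returned list can be passed to `subprocess.run` on a machine where BLAST+ and a local copy of the database are installed.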

Users are asked to provide an email address to which their results will be forwarded. Alternatively, users can bookmark the result link to obtain the output interactively when it is available. A status page, which shows the progress of the SIB-BLAST job, is displayed upon submission.

Outputs

SIB-BLAST outputs (Fig. 3.4) a combined list of hits from the second iteration and the last iteration rank-ordered by their corresponding FOM. Based on an analysis of the FOM's coverage versus error curve on the Aravind test set [30], the number of errors appears to increase dramatically at a FOM of approximately 10^-8. This empirical FOM threshold of 10^-8 is mentioned in the Manual page as a "point of reference" for users to "gauge" the statistical significance of each hit. Users are cautioned about relying on this threshold, as its value is expected to depend on the size of the database.


Fig. 3.4. Snapshot of the SIB-BLAST result page. The hits are displayed rank-ordered by their corresponding figure of merit, along with their E-values at round two and at the final round. Users can access the corresponding annotation of each hit by clicking the hit's GI number.

Each hit on the results page is shown with its GI number, which has a link to the protein annotation page at the NCBI's website that allows the user to obtain details of the hit to ascertain whether the hit is indeed a true homolog or not. Along with the GI number and FOM, the E-values from the second and the final iterations are provided for each hit. Users can view the PSI-BLAST pairwise alignment between the query and hit sequence by clicking on the E-value link.

Different output options sorted by E-value at iteration two or E-value at the last iteration and in HTML or text format are also available to the users. The results pages for these different options are organized identically to the default results page.

Documentation and runtime

The SIB-BLAST manual page, linked from the SIB-BLAST front page, provides users with a brief overview of our algorithm, a detailed description of each input parameter, and an explanation of the result page. A sample sequence is made available for users to test a trial run of the SIB-BLAST server.

As reflected in the algorithm's name, the beauty of our approach is in its simplicity: it requires minimal changes to the existing PSI-BLAST algorithm since it uses information already output by PSI-BLAST, and the processing time needed to calculate the FOM for individual hits and to re-sort them is negligible. Thus, SIB-BLAST improves the search accuracy of PSI-BLAST without compromising its computational efficiency.

DISCUSSION

We have established a novel strategy to improve the accuracy of PSI-BLAST searches by validating the resultant hits from the final iterations against those obtained in earlier rounds. The verification results are significant given the improved performance shown by our method over PSI-BLAST at minimal computational cost. They demonstrate that our strategy enhances the discriminative capability of PSI-BLAST by maintaining the specificity conferred by the PSSM of iteration two while, at the same time, capitalizing on the advantage gained from iteratively improving the model to identify more true positives in iteration five. One may argue that further enhancement could potentially be achieved by including the E-values from the interim rounds of a PSI-BLAST search, but such an approach would raise further concerns about statistical independence. While the assumption of statistical independence between iterations two and five is reasonable, at least for corrupted models, it becomes invalid if E-values from immediately consecutive iterations are combined. Thus, incorporating E-values from intermediate rounds would eliminate the beauty of the mathematical framework used in this study, and was therefore not pursued. Nevertheless, it may be worthwhile to explore this strategy in the future using machine learning techniques to replace the parameter-free combination of E-values put forward here.

CONCLUSION

The problem of model corruption is a major concern to both the developers and users of PSI-BLAST. The reliability of PSI-BLAST's assessment of whether two sequences are related depends heavily on its ability to discriminate true hits from false positives produced from corrupted queries. Many different strategies have been and are being explored to overcome this problem. Our proposed strategy is noteworthy in its simplicity: it utilizes information already output by PSI-BLAST and thus requires minimal changes to the existing algorithm. The web server, SIB-BLAST, was implemented for public usage of this novel approach.

Future development

The current FOM provides a relative measure of the statistical significance of the resultant hits against one another. It is not an E-value, however, because the calculation of the combined p-value in SIB implicitly assumes that the hits identified in the second and last rounds of PSI-BLAST are independent. It would be more meaningful if the combined p-value could be calculated under the condition that the hits are dependent. The FOM would then be a true E-value and would provide an accurate measure of the error probability. Bailey and Grundy's POP (product of p-values) algorithm, which calculates the product of the p-values of correlated variables, may suggest how to achieve this in a future study [89]. To improve the functionality of the SIB-BLAST web server, it would also be useful to provide a multiple sequence alignment of the hits (as is currently done in PSI-BLAST) in future versions.


CHAPTER 4

PREDICTING INTERACTING PARTNERS OF HETERO-OLIGOMERS

Many proteins are known to form oligomers in vivo under physiological conditions, either with themselves as homo-oligomers or with other proteins as hetero-oligomers. It is generally believed that the assembly of protein molecules into oligomers offers many evolutionary advantages over a single protein molecule, such as stability and regulation [90]. For some proteins, the formation of a complex is required in order to attain their biological functions.

Different algorithms have been developed to predict interacting protein partners. Most of these algorithms use structure-based approaches. For example, FTDOCK [91] and GRAMM [92] use docking methods to predict the structure of a complex from monomeric protein structures, and MULTIPROSPECTOR uses threading [93] and a homology modeling template method to predict protein-protein interactions. Nearly all of these methods focus on interacting proteins that are distinct from one another, as in the case of ClpX and ClpP, which are non-homologous but associate into the ClpXP complex that is involved in protein degradation.

Often, oligomerisation of a protein occurs with itself or with its homologs. A case in point is hemoglobin, a tetramer that is composed of two α subunits and two β subunits. The α polypeptide chain and the β polypeptide chain are paralogous, arising from a gene duplication event, and share over 60% sequence similarity.

Many organisms, especially eukaryotes, contain paralogous proteins or proteins that are homologous. For example, the eukaryotic Leishmania major carries two copies of M32 carboxypeptidase that are not identical but share high sequence similarity, while the archaeal Sulfolobus solfataricus carries three proliferating cell nuclear antigen (PCNA) proteins sharing ~22% sequence similarity [94]. It is commonly observed that these proteins associate to form either homo-oligomers or hetero-oligomers.

Our interest in developing a strategy to predict the interacting partners of a multimer comprised of homologous sequences stems from our crystallographic study of the Nitrosomonas europaea rhesus protein (NeRh).

Rhesus protein (Rh) is an integral membrane protein found in nearly all animals but seldom in prokaryotes. It is related to the well-characterized ammonium transporter protein (Amt) commonly found in bacteria, plants and fungi [95]. Rh protein has long been an important research subject because of its human homologs, which form one of the major blood group systems in man.

Recently, the structure of a rare bacterial homolog from N. europaea has been determined [96, 97], providing valuable insights into the oligomeric state of the human Rh complex.

In the bacterium N. europaea, there is one copy of Rh protein, while in humans there are five different Rh proteins, classified into two categories based on their localization in the cell: RhD, RhCE and RhAG are found in red blood cells (RBC), whereas RhBG and RhCG are expressed in tissues and organs associated with ammonium transport, such as kidney, liver, central nervous tissues and testes. Experimental evidence suggests that the human Rh proteins associate into a core complex comprised of different Rh homologs. Before the elucidation of the NeRh structure [96, 97], it was widely assumed that the human Rh core complex was a heterotetramer. The trimeric structure of the NeRh protein calls this assumption into question and implies that the human Rh complex is most likely a heterotrimer rather than a heterotetramer. The exact identities of the interacting partners composing the trimer, however, remain unclear.

Studies have shown that using both structural and sequence information improves discrimination between true and false positives, resulting in better prediction accuracy. With the structure of NeRh in hand, and the general belief that oligomeric interfaces are conserved [98, 99], we hypothesize that the interacting subunits of the NeRh homotrimer could be used to predict the constituent subunits of the human Rh complex.

For proteins to form a complex, protein molecules have to interact. Protein-protein interaction is mediated by the amino acids in the interfaces of the interacting proteins. Thus, there is a tendency for interacting amino acids to coevolve in order to preserve proper interactions and hence, the integrity of the protein complex. One common method to predict protein-protein interaction is to analyze the correlation between interacting residues.

To predict the interacting subunits of the trimeric human Rh protein complex, we utilized the structural information of the interacting subunits of NeRh and applied correlation analysis to identify interacting residues, which were then used to build a covariance position specific scoring matrix. This matrix was used to score the different combinations of subunits forming the trimer. A homotrimer composed of RhD was predicted to be the most probable.

We were not able to validate our prediction result because the structure of the human Rh protein complex has not yet been determined. Although one can rely on the general assumption of the oligomeric state of the human Rh complex based on biochemical studies, we sought a biological system that could provide unambiguous verification of our approach. We therefore chose SsoPCNA as our validation set.

The prediction of SsoPCNA as a homotrimer by our methodology is not consistent with the heterotrimer revealed by its crystal structure [94]. Based on this verification result, we conclude that our approach of using protein-protein interactions in a homotrimer to infer interactions in a heterotrimer requires further refinement.

In the following method sections, we provide a general overview of our prediction strategy, followed by a detailed description of our approach as it was applied to the human Rh protein and to the validation set, SsoPCNA. In the results and discussion section, we report the prediction results on the human Rh proteins and SsoPCNA, and examine the possible reasons why our approach fell short.

METHOD

Overview of the prediction algorithm

Fig. 4.1 illustrates the basic steps of our method for predicting interacting subunits of an uncharacterized complex from a characterized homologous homo-oligomer.

Fig. 4.1. Schematic overview of the prediction algorithm for oligomeric assemblies.

The algorithm is comprised of six major steps described as follows:

1. Identification of a homo-oligomeric structural homolog

This can be achieved by searching the Protein Data Bank database (PDB) [47] using the program BLASTP [2]. For this study, the human PCNA structure was obtained through a literature search, while the structure of NeRh was determined by our laboratory.

2. Construction of an initial model of interface positions and interacting amino acids from a family of homo-oligomeric protein homologs

The interacting amino acids and interface positions of the selected structural homolog are identified using an automated method, the Accessible Surface Area (ASA) algorithm in CCP4 [100], and manual inspection of the structure. The identified interface residues and positions are used to build a covariance model of putative interacting amino acid pairs and interface positions from the sequence homologs resulting from a BLAST search against the non-redundant (NR) database.

3. Statistical assessment of putative interacting amino acids and positions

Random simulation of 10000 residue pairs for each pair of interface positions was performed to assess the statistical significance of the covariance of each interacting residue pair at each pair of interface positions.

4. Generation of a covariance position specific scoring matrix

Log odds covariance scores are calculated for those residue pairs and positions deemed statistically significant. These log odds scores are used to build a covariance position specific scoring matrix, which is used in subsequent prediction of the constituent subunits forming the oligomer.

5. Multiple sequence alignment of the proteins of interest

Multiple sequence alignment is performed on the query sequence used in the initial model building and the sequences of the proteins of interest using various multiple sequence alignment (MSA) programs. Each resultant model is used in the following scoring step and the prediction outcome of each MSA model is evaluated.

6. Scoring subunit combinations forming the oligomer.

An interface score is calculated for each pair of interfaces between two interacting proteins. The score of the oligomer is obtained by summing the scores of the interfaces forming the oligomer.

Selection criteria of validation set

Since we are interested in using homo-oligomeric proteins that are distinct in one cell to predict the interacting partners of other homologous proteins that have three or more homologs, we looked for proteins that are known to form only homo-oligomers in a particular biological system, such as in bacteria, but form hetero-oligomers or a mixture of homo- and hetero-oligomers in another biological system. In addition, the selected homo-oligomer has to be experimentally well characterized to allow a straightforward verification of the prediction accuracy of our approach. Under these considerations, the proliferating cell nuclear antigen protein (PCNA) is well suited for our verification purpose. PCNA has been extensively studied, and numerous PCNA structures from different organisms from all three domains of life have been determined [101-103]. It is known that PCNAs form homotrimers in eukaryotes and bacteria, but within the archaeal domain, the PCNAs of the crenarchaeon Sulfolobus solfataricus are found to form heterotrimers [94, 104]. As such, we have chosen the Sulfolobus solfataricus PCNA as our test case.

Identification of the amino acids composing the subunit interface

Most computational approaches developed for predicting protein-protein interactions rely on protein structures to obtain interfacial information. Many protein structures determined by X-ray crystallography have three or more crystal contacts that would not occur in the physiological state, although some of these contacts may be biologically relevant [99]. Thus, it is necessary to distinguish between non-biological and biological contacts to ensure that accurate structural information is used in subsequent model building.

According to the investigative study of protein subunits and domain interfaces by Patrick Argos [105], there is, on average, a 20% loss in water accessible surface when subunits associate. This loss increases proportionally as the oligomeric state increases from dimeric to tetrameric. Accordingly, we have chosen to use the Accessible Surface Area algorithm in CCP4 [100] to automatically screen for the putative interacting residues.

Solvent accessible area is defined as the area of the protein that is in contact with a solvent. The Accessible Surface Area algorithm (ASA) in CCP4 accepts a PDB file containing the atomic coordinates of a protein and calculates the surface area for all atoms in the protein, first as an oligomer, then as a monomer; the difference in surface area is attributed to protein oligomerisation.

Because the PDB files obtained from the Protein Data Bank contain the entire complex, while the PDB file submitted to ASA should contain only the coordinates of a monomer (a single protein molecule), the original PDB file from the Protein Data Bank was processed to extract the coordinates of one individual subunit, which was used for the accessible surface area calculation.

Residues that experienced a change in accessible surface area were considered to be situated within the biological interface of the interacting molecules. The predicted residues and their corresponding positions were checked against the original crystal structure to ensure that the putative interacting residues are indeed in contact with each other.
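The screening rule described above, in which a residue is assigned to the interface if its accessible surface area changes between the monomer and the oligomer, can be sketched in Python as follows. The per-residue areas are assumed to have been computed beforehand by an external program. The function name, the dictionary layout, and the optional `min_loss` tolerance are hypothetical and are not part of the CCP4 program.

```python
from typing import Dict, List, Tuple

ResidueKey = Tuple[str, int]  # (chain identifier, residue number)


def interface_residues(
    asa_monomer: Dict[ResidueKey, float],
    asa_oligomer: Dict[ResidueKey, float],
    min_loss: float = 0.0,
) -> List[ResidueKey]:
    """Return residues whose accessible surface area drops upon assembly.

    asa_monomer holds the area of each residue in the isolated subunit and
    asa_oligomer the area of the same residue within the complex. The
    min_loss tolerance (hypothetical) filters out numerical noise.
    """
    buried = []
    for residue, free_area in asa_monomer.items():
        # A residue missing from the oligomer table is treated as unchanged.
        bound_area = asa_oligomer.get(residue, free_area)
        if free_area - bound_area > min_loss:
            buried.append(residue)
    return sorted(buried)
```

Residues returned by such a screen would still need to be checked against the crystal structure, as described above, before being used for model building.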

For the human Rh proteins, we omitted the ASA step and used instead the expert-curated interfacial information detailed in the crystallographic study of NeRh [97].

Construction of a covariance model of putative interacting amino acids and positions

To obtain a family of protein homologs known to form homo-oligomers, the query sequence was submitted to BLAST to search the non-redundant database, with a constraint filter that limits the search to relevant genomes or superkingdoms. For the Rh proteins, the constraint filter excluded the human genome; for PCNA, the constraint filter limited the search to only eukaryotes and eubacteria, in which PCNA is known to form homo-oligomers. Hits with E-value ≤ 0.005 were selected for subsequent construction of the covariance scoring matrix.

Based on the residues and residue positions identified as described above, the frequency of each amino acid pair for each pair of putative interface positions was tabulated from the selected hits returned from the BLAST search.

The covariance of each amino acid pair for each pair of interface positions was calculated as follows:

\[
\mathrm{Covariance}(x_a, y_b) = P_{x_a y_b} - P_{x_a} P_{y_b} \tag{1}
\]

where a and b are the putative interface positions and x and y are the amino acids at those positions, respectively. Here $P_{x_a y_b}$ is the tabulated frequency of the pair and $P_{x_a}$ and $P_{y_b}$ are the individual residue frequencies at each position. These covariances were then used to construct a covariance model of interacting amino acid pairs and interface positions.
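As an illustration of Eq. (1), the following Python sketch tabulates the pair and single-residue frequencies at two alignment columns and returns the covariance of every observed pair. It assumes that the selected hits are available as equal-length aligned strings. The function name and the absence of any gap handling are simplifications made for illustration.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def pair_covariances(aligned: Sequence[str], a: int, b: int) -> Dict[Tuple[str, str], float]:
    """Covariance(x_a, y_b) = P(x_a, y_b) - P(x_a) * P(y_b) for observed pairs.

    aligned holds the aligned homolog sequences; a and b are zero-based
    column indices of two putative interface positions.
    """
    n = len(aligned)
    if n == 0:
        raise ValueError("at least one aligned sequence is required")
    joint = Counter((seq[a], seq[b]) for seq in aligned)   # pair counts
    count_a = Counter(seq[a] for seq in aligned)           # counts at column a
    count_b = Counter(seq[b] for seq in aligned)           # counts at column b
    return {
        (x, y): count / n - (count_a[x] / n) * (count_b[y] / n)
        for (x, y), count in joint.items()
    }
```

For four sequences with the column pairs AK, AK, DE, DE, the pair (A, K) has a joint frequency of 0.5 and marginal frequencies of 0.5 each, giving a covariance of 0.25.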

Statistical assessment of amino acid pairs and interface positions

To assess the statistical significance of each pair of amino acids at each pair of interface positions, we generated 10000 random pairs of amino acids for each possible pair of interface positions based on the individual residue frequency at each interface position tabulated from the selected BLAST [2] hits described above. We considered the covariance of a residue pair statistically significant only if it lay outside the 1st to 99th percentile range of the sample distribution derived from the covariance of the randomly simulated residue pairs. These statistically significant residue pairs and their corresponding interacting positions are used to build the covariance position specific scoring matrix described below.
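The simulation described above leaves some implementation details open, so the Python sketch below is one possible reading and not a record of the procedure actually used. It draws 10000 pairs independently from the residue frequencies at the two positions, computes the covariance of every residue pair within that random sample, and takes the 1st and 99th percentiles of those values as significance bounds. The function names, the fixed random seed, and the nearest-rank percentile rule are assumptions.

```python
import random
from collections import Counter
from typing import Dict, Tuple


def null_covariance_bounds(
    freq_a: Dict[str, float],
    freq_b: Dict[str, float],
    n_pairs: int = 10000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (1st percentile, 99th percentile) of simulated covariances."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    residues_a, weights_a = zip(*sorted(freq_a.items()))
    residues_b, weights_b = zip(*sorted(freq_b.items()))
    # Draw the two positions independently, so any covariance is due to chance.
    xs = rng.choices(residues_a, weights=weights_a, k=n_pairs)
    ys = rng.choices(residues_b, weights=weights_b, k=n_pairs)
    joint = Counter(zip(xs, ys))
    count_a = Counter(xs)
    count_b = Counter(ys)
    covariances = sorted(
        joint[(x, y)] / n_pairs - (count_a[x] / n_pairs) * (count_b[y] / n_pairs)
        for x in count_a
        for y in count_b
    )
    last = len(covariances) - 1
    # Nearest-rank percentiles (assumption): round down for the lower bound
    # and to the nearest index for the upper bound.
    return covariances[int(0.01 * last)], covariances[int(round(0.99 * last))]


def is_significant(covariance: float, bounds: Tuple[float, float]) -> bool:
    """True if the observed covariance lies outside the simulated bounds."""
    lower, upper = bounds
    return covariance < lower or covariance > upper
```

An alternative reading would repeat the sampling many times per residue pair to obtain a separate null distribution for each pair; the percentile test itself would be unchanged.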

Generation of the covariance position specific scoring matrix and the position specific scoring matrix

To generate the covariance position specific scoring matrix, a log odds score for the covariance of the amino acid pairs at each pair of interface positions that was deemed statistically significant was derived using the following formulation:

\[
\text{Log odds}\,(\mathrm{COV}_{x_a y_b}) = \log(P_{x_a y_b}) - \log(P_{x_a}) - \log(P_{y_b}) \tag{2}
\]

where x and y denote one of the 20 amino acids and a and b denote the corresponding positions, respectively.
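Eq. (2) is the logarithm of the ratio between the observed pair frequency and the frequency expected if the two positions were independent. A minimal Python sketch follows. The function name is hypothetical and the natural logarithm is assumed, since the base is not specified in the text.

```python
import math


def log_odds_covariance(p_joint: float, p_x: float, p_y: float) -> float:
    """Return log(P(x_a, y_b)) - log(P(x_a)) - log(P(y_b)).

    All three frequencies must be strictly positive. The natural logarithm
    is an assumption; another base only rescales the scores.
    """
    if min(p_joint, p_x, p_y) <= 0.0:
        raise ValueError("frequencies must be strictly positive")
    return math.log(p_joint) - math.log(p_x) - math.log(p_y)
```

A pair that occurs exactly as often as expected under independence scores zero, a pair that is over-represented scores positive, and an under-represented pair scores negative.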

For the position specific scoring matrix, the log odds score of each corresponding amino acid at each corresponding position was calculated using a pseudo count of 1 and the background amino acid frequencies of Robinson and Robinson [52].

The two scoring matrices were used in subsequent scoring of all possible interacting subunits composing the oligomer in question as described below.

Multiple sequence alignment of homologs forming the oligomer

An accurate multiple sequence alignment (MSA) model is critical for accurate prediction of the interacting subunits forming the complex in question because the alignment is used to identify the positions and amino acids that are involved in the interaction between two subunits.

The query protein used in the initial BLAST search and the homologs of the oligomer in question were submitted to a multiple sequence alignment program. We used ClustalW [106, 107], T-Coffee [108], and Muscle [109, 110] to construct three different MSAs to test whether we could obtain a consensus model. We were not able to obtain a consensus alignment model because the three programs produced somewhat different alignments. For the human Rh proteins, in addition to using these three programs, we also used our LESTAT algorithm (see Chapter 2) to build an MSA model. The pairwise alignment between the query protein and each homolog was extracted from the MSA for subsequent scoring of the different subunit combinations forming the oligomer. If a multiple sequence alignment model was available from the literature, it was used to verify the computer-generated multiple sequence alignments, as is the case for the human Rh proteins.

Scoring the putative interacting subunits

The score of every possible subunit combination of the oligomer was derived in two steps. First, we constructed a matrix of interface scores for every interface between two subunits, both of a subunit with itself and with other homologs. The interface score was the sum of all the log odds covariances and the log odds position specific scores for each statistically significant amino acid pair and interface position pair described above. We then calculated the total score for each possible combination by summing over all the scores of interfaces between constituent subunits.

Fig. 4.2 illustrates one possible subunit combination and the corresponding three interfaces of a heterotrimer. There are three subunits, A, B and C, forming the heterotrimer ABC. Each subunit has two interacting fronts: face 1 and face 2. The three interfaces shown in Fig. 4.2 are comprised of face 2 of A: face 1 of B, face 2 of B: face 1 of C, and face 2 of C: face 1 of A (":" indicates interaction). An alternate arrangement (not shown) is face 1 of A: face 2 of B, face 1 of B: face 2 of C, and face 1 of C: face 2 of A. Because the sets of interacting residues and positions in the two faces of a subunit are usually distinct, the interaction of face 1 of A with face 2 of B, for example, would be different from that of face 2 of A with face 1 of B. Thus, we need to consider all possible interfaces formed between two subunits when scoring the subunit combination of an oligomer.
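The two-step scoring of trimers can be sketched in Python as follows. The sketch assumes that a table of directed interface scores is already available, where the entry for (p, q) is the score of face 2 of subunit p against face 1 of subunit q. How the two ring arrangements of a combination were reconciled is not stated in the text, so reporting the better of the two is an assumption, as are the function names.

```python
from itertools import combinations_with_replacement, permutations
from typing import Dict, List, Sequence, Tuple

Trimer = Tuple[str, str, str]


def ring_score(order: Trimer, interface_score: Dict[Tuple[str, str], float]) -> float:
    """Sum the three interfaces of the ring a -> b -> c -> a."""
    a, b, c = order
    return interface_score[(a, b)] + interface_score[(b, c)] + interface_score[(c, a)]


def rank_trimers(
    subunits: Sequence[str],
    interface_score: Dict[Tuple[str, str], float],
) -> List[Tuple[Trimer, float]]:
    """Score every trimer composition and sort from best to worst.

    interface_score[(p, q)] is the score of face 2 of p against face 1 of q.
    Assumption: each composition is represented by its best-scoring ring
    arrangement.
    """
    ranked = []
    for combo in combinations_with_replacement(sorted(subunits), 3):
        best = max(ring_score(order, interface_score) for order in set(permutations(combo)))
        ranked.append((combo, best))
    # Highest total score first; ties are broken alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

With five candidate subunits this enumerates 35 compositions, so an exhaustive search is inexpensive.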

Fig. 4.2. Graphical representation of the interacting subunits and the corresponding interfaces of a heterotrimer. Subunits A, B, C are non-identical protein chains. Each subunit has two interacting faces: face 1 and face 2. Here, chain A's face 1 interacts with chain B's face 2; chain A's face 2 interacts with chain C's face 1; chain C's face 2 interacts with chain B's face 1. Ribbon diagrams depicting the interfaces between three different subunits are shown.

RESULTS AND DISCUSSION

Human Rh proteins

Although NeRh and all the Amt proteins characterized to date are homotrimers, it is also possible that the human Rh proteins exist as a heterotrimer. Support for this latter possibility stems in part from the apparent ratio of 2:1:1 of RhAG, RhD and RhCE, and the fact that display of RhD requires RhAG expression. As such, we excluded the human genome from our initial search database. Homologs with E-value ≤ 0.005 were used to build the covariance matrix using the interacting residues and the interface positions detailed in the crystallographic study of NeRh [97], as mentioned previously.

Once the statistically significant residue pairs and interface positions were established, the two position specific scoring matrices described above were constructed and the results from different multiple sequence alignments (MSAs) of NeRh to the five human Rh proteins were used to score each possible combination of human Rh trimer.

Table 4.1 shows the top 10 predictions for the trimeric architecture of the human Rh complex based on different MSA models. The predictions are highly dependent on the alignment. While the top two predictions generated using the ClustalW, T-Coffee and Winkler et al. MSAs are the same, with a homotrimer of RhD being the most probable form, followed by a heterotrimer of (RhD)2RhAG, the predictions obtained using the LESTAT and Muscle MSAs were distinct: a homotrimer of RhAG was predicted as the most likely for LESTAT, while a heterotrimer of RhB-RhAG-RhD was predicted for Muscle.

Table 4.1. Top ten predictions of the possible subunit combinations forming the human Rh complex using different multiple sequence alignment models.

                              Multiple Sequence Alignment

Rank  ClustalW        T-Coffee        Muscle          LESTAT          Winkler et al.

1     RhD-RhD-RhD     RhD-RhD-RhD     RhD-RhB-RhAG    RhAG-RhAG-RhAG  RhD-RhD-RhD
2     RhD-RhD-RhAG    RhD-RhD-RhAG    RhB-RhB-RhB     RhAG-RhAG-RhCE  RhD-RhD-RhAG
3     RhB-RhB-RhB     RhB-RhD-RhAG    RhB-RhB-RhD     RhAG-RhC-RhCE   RhD-RhB-RhAG
4     RhB-RhD-RhAG    RhD-RhD-RhC     RhD-RhD-RhB     RhCE-RhAG-RhC   RhD-RhD-RhCE
5     RhD-RhD-RhCE    RhD-RhD-RhCE    RhB-RhC-RhD     RhB-RhAG-RhAG   RhD-RhCE-RhAG
6     RhD-RhD-RhB     RhD-RhCE-RhAG   RhD-RhD-RhAG    RhC-RhAG-RhAG   RhD-RhAG-RhAG
7     RhD-RhCE-RhAG   RhD-RhAG-RhAG   RhB-RhCE-RhD    RhC-RhD-RhCE    RhD-RhD-RhC
8     RhB-RhB-RhD     RhB-RhB-RhD     RhB-RhCE-RhAG   RhB-RhC-RhCE    RhD-RhC-RhAG
9     RhD-RhD-RhC     RhB-RhB-RhB     RhD-RhD-RhC     RhB-RhAG-RhC    RhB-RhCE-RhAG
10    RhD-RhCE-RhCE   RhD-RhD-RhB     RhD-RhD-RhD     RhD-RhD-RhAG    RhD-RhCE-RhCE

Unfortunately, there is no conclusive result one can draw from these data since no two sets produced similar results, nor is there a bias towards either homotrimers of (RhD)3, (RhAG)3, or (RhCE)3, or a group of heterotrimers from this set. We were also faced with the dilemma that the oligomeric state of the human Rh complex is unclear; therefore, we could not even pinpoint which MSA gave the best alignment and use that model for further testing and refinement.

Nevertheless, extensive biochemical studies on the human Rh proteins have provided enough clues to the possible oligomeric architectures of the Rh complex, thus allowing us to distinguish those predictions that were false positives. For example, the results obtained using the Muscle MSA appeared unlikely given that RhAG and RhD are expressed in RBC, while RhB is found in tissues and organs. Also, oligomers containing no RhAG are unlikely since RhAG is strictly required for displaying the Rh proteins on the RBC cell surface.

SsoPCNA validation set

Given the ambiguous results from our study of the Rh protein, we sought to apply our method to a known heterotrimer of homologous subunits whose structure was known and whose homologs form similarly well characterized homotrimers. With this in mind, we chose to apply our algorithm to the SsoPCNA system in order to assess the prediction accuracy of our strategy. A covariance matrix was calculated using the eukaryotic and bacterial homologs of this protein, which have been shown to form homotrimers.

The SsoPCNA heterotrimer is comprised of SsoPCNA1, SsoPCNA2 and SsoPCNA3, as confirmed by its crystal structure [104]. Using the MSAs obtained from ClustalW, T-Coffee, and Muscle, a homotrimer of SsoPCNA1 was the most probable oligomer predicted for all three MSAs. The remaining prediction results differed depending on the MSA used. The best result came from using the ClustalW MSA, which was the only one to predict the correct heterotrimer as the next most likely oligomer. The scores obtained from using the T-Coffee and Muscle MSAs predicted dimerization of SsoPCNA1 in association with a monomer of either SsoPCNA2 or SsoPCNA3.

Biological interface and residues at interface position

To investigate where our algorithm failed, we first checked whether the interacting residues and positions in the human PCNA homotrimer [59] were indeed preserved in the SsoPCNA heterotrimer. We examined the interfacial regions between the three hetero-subunits and observed that the interacting amino acids were similar, but their positions within the sequence differed from the positions of the correlated amino acid pairs used to score the different subunit associations. In the heterotrimer, the positions of interaction have shifted slightly and do not correspond well to the "corresponding" positions in the homotrimer. An example of this observation is illustrated in Fig. 4.3, which provides a close-up view of the interfaces formed by each of the subunits of SsoPCNA: PCNA1, PCNA2, and PCNA3. The residues of PCNA2 that interact with PCNA1 (Fig. 4.3a) are located on one side of the interfacial helix (top of the yellow-colored helix in Fig. 4.3a), whereas the residues of PCNA2 that interact with PCNA3 (Fig. 4.3b) are located on the other side of the helix (bottom of the yellow-colored helix in Fig. 4.3b).


Fig. 4.3. Ribbon diagrams of the three interfaces in SsoPCNA. SsoPCNA1 is colored in green, SsoPCNA2 in yellow, and SsoPCNA3 in magenta. (a) Interface between SsoPCNA1 and SsoPCNA2; (b) interface between SsoPCNA2 and SsoPCNA3; and (c) interface between SsoPCNA3 and SsoPCNA1.

Since our approach is based on the assumption that the characteristics of the interface of a homotrimer are similar to those of a heterotrimer, the lack of a conserved common interface makes this approach invalid. Because the protein chains in a homotrimer are identical, the correlations derived from residue pairs and interface positions are biased towards those sets of interactions that are predominantly found in a homotrimer. However, as we have discovered from this study, the positions of the interfacial residues and surfaces are not highly conserved. Rather, they exhibit different degrees of conservation. In retrospect, this seems highly reasonable. Presumably, to promote the specific interactions needed between the subunits of heterotrimers such as SsoPCNA, shifting the interface residues and positions is easier than simply altering the residues at key positions.


CHAPTER 5

CONCLUSION

In summary, I have presented two strategies to improve distant homology detection and my preliminary efforts to couple homology detection to the prediction of protein-protein interactions. In Chapter 2, a new homology detection algorithm is described with improved sensitivity for the detection of distant homologs. This algorithm, LESTAT, incorporates structural features in the form of conserved residues and conserved separation distances to develop a model comprised of sequence blocks and gap lengths that is better at detecting distant homologs than PSI-BLAST, arguably the most popular sequence alignment tool. LESTAT performed better than or comparably to PSI-BLAST in six out of seven test systems. In addition, many of the true hits identified by LESTAT were different from those found by PSI-BLAST. These differences presumably stem from the distinct profiles of LESTAT and PSI-BLAST, and suggest that LESTAT could be a useful complement to PSI-BLAST in distant homology detection.

Chapter 3 presents another strategy to improve distant homolog detection by enhancing the discrimination of true and false positives. One major challenge inherent to the iterative search strategy utilized by PSI-BLAST is the incorporation of non-homologous sequences into its profile. As one way to overcome this problem, I have developed a new algorithm, SimpleIsBeautiful (SIB), that takes the two outputs (from round two and from the last round) produced by PSI-BLAST and calculates a figure of merit (FOM) for each overlapped hit. When tested on the same test set used by the PSI-BLAST developers, SIB exhibits improved accuracy over PSI-BLAST at minimal additional computational cost. A web server, SIB-BLAST, that runs the SIB algorithm has been developed and can be found under the URL: http://sib-blast.osc.edu.

To extend the application of homology detection to the study of protein-protein interactions, I have explored its utility for predicting the oligomerization of homologous proteins. Specifically, I was interested in predicting the constituent proteins forming the human Rh complex. Chapter 4 presents my attempt to use a structure-based sequence alignment strategy and correlation analysis of amino acid pairs to predict the interacting homologous protein partners from a homo-oligomer. My verification result using the structurally characterized heterotrimer SsoPCNA indicates that my approach requires further refinement. An analysis of the predicted interacting amino acid residues and the corresponding interfacial positions reveals that, for hetero-oligomers, the purported interacting residues are not strictly conserved. Presumably, in trying to complex with another protein molecule, it is evolutionarily more advantageous to shift the interfacial positions than to rely strictly on amino acid mutations to dictate the specificity.


BIBLIOGRAPHY

1. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921.

2. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.

3. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402.

4. Schaffer, A.A., et al., Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res, 2001. 29(14): p. 2994-3005.

5. Chothia, C., Principles that determine the structure of proteins. Annu Rev Biochem, 1984. 53: p. 537-72.

6. Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol, 1990. 183: p. 63-98.

7. Holm, L. and C. Sander, The FSSP database of structurally aligned protein fold families. Nucleic Acids Res, 1994. 22(17): p. 3600-9.

8. Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992. 357(6379): p. 543-4.

9. Wolf, Y.I., et al., Distribution of protein folds in the three superkingdoms of life. Genome Res, 1999. 9(1): p. 17-26.

10. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.

11. Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997. 5(8): p. 1093-108.

12. Lesk, A., Introduction to protein architecture. 2001.

13. Casari, G., C. Sander, and A. Valencia, A method to predict functional residues in proteins. Nat Struct Biol., 1995. 2(2): p. 171-8.

14. Kinch, L.N. and N.V. Grishin, Evolution of protein structures and functions. Curr Opin Struct Biol, 2002. 12(3): p. 400-8.

15. Chothia, C., et al., Evolution of the protein repertoire. Science, 2003. 300(5626): p. 1701-3.

16. Chothia, C. and J. Gough, Genomic and structural aspects of protein evolution. Biochem J, 2009. 419(1): p. 15-28.

17. Apic, G., W. Huber, and S.A. Teichmann, Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics, 2003. 4(2-3): p. 67-78.

18. Banaszak, L., Foundations of Structural Biology. 2000, San Diego: Academic.

19. Ochman, H., J.G. Lawrence, and E.A. Groisman, Lateral gene transfer and the nature of bacterial innovation. Nature, 2000. 405(6784): p. 299-304.

20. Yang, A.S. and B. Honig, An integrated approach to the analysis and modeling of protein sequences and structures. III. A comparative study of sequence conservation in protein structural families using multiple structural alignments. J Mol Biol, 2000. 301(3): p. 691-711.

21. Yang, A.S. and B. Honig, An integrated approach to the analysis and modeling of protein sequences and structures. II. On the relationship between sequence and structural similarity for proteins that are not obviously related in sequence. J Mol Biol, 2000. 301(3): p. 679-89.

22. Huang, X.Q., R.C. Hardison, and W. Miller, A space-efficient algorithm for local similarities. Comput Appl Biosci, 1990. 6(4): p. 373-81.

23. Krogh, A., et al., Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol, 1994. 235(5): p. 1501-31.

24. Eddy, S.R., Profile hidden Markov models. Bioinformatics, 1998. 14(9): p. 755-63.

25. Karplus, K. and B. Hu, Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics, 2001. 17(8): p. 713-20.

26. Eddy, S.R., G. Mitchison, and R. Durbin, Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol, 1995. 2(1): p. 9-23.

27. Hughey, R. and A. Krogh, Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci, 1996. 12(2): p. 95-107.

28. Karplus, K., C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 1998. 14(10): p. 846-56.

29. Park, J., et al., Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol, 1998. 284(4): p. 1201-10.

30. Aravind, L. and E.V. Koonin, Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol, 1999. 287(5): p. 1023-40.

31. Lee, M.M., M.K. Chan, and R. Bundschuh, Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches. Bioinformatics, 2008. 24(11): p. 1339-43.

32. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.

33. Smith, T.F. and M.S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.

34. Eddy, S.R., What is dynamic programming? Nat Biotechnol, 2004. 22(7): p. 909-10.

35. Eddy, S.R., Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol, 1995. 3: p. 114-20.

36. Eddy, S.R., Hidden Markov models. Curr Opin Struct Biol, 1996. 6(3): p. 361-5.

37. Gribskov, M., A.D. McLachlan, and D. Eisenberg, Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A, 1987. 84(13): p. 4355-8.

38. Bowie, J.U., R. Luthy, and D. Eisenberg, A method to identify protein sequences that fold into a known three-dimensional structure. Science, 1991. 253: p. 164-253.

39. Fischer, D. and D. Eisenberg, Protein fold recognition using sequence-derived predictions. Protein Sci, 1996. 5(5): p. 947-55.

40. Kann, M.G., et al., A structure-based method for protein sequence alignment. Bioinformatics, 2005. 21(8): p. 1451-6.

41. Gough, J., et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 2001. 313(4): p. 903-19.

42. Perutz, M.F., J.C. Kendrew, and H.C. Watson, Structure and function of haemoglobin II. Some relations between polypeptide chain configuration and amino acid sequence. J Mol Biol, 1965. 13: p. 669-678.

43. Rice, D.W. and D. Eisenberg, A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence. J Mol Biol, 1997. 267: p. 1026-1038.

44. Brenner, S.E., C. Chothia, and T.J.P. Hubbard, Assessing the sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A, 1998. 95: p. 6073-6078.

45. Marchler-Bauer, A., et al., CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res, 2003. 31(1): p. 383-7.

46. Marchler-Bauer, A., et al., CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res, 2005. 33(Database issue): p. D192-6.

47. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Res, 2000. 28(1): p. 235-42.

48. Holm, L. and C. Sander, Mapping the protein universe. Science, 1996. 273(5275): p. 595-603.

49. Osborne, M., et al., The solution structure of ChaB, a putative membrane ion antiporter regulator from Escherichia coli. BMC Struc. Biol., 2004. 4: p. 4-9.

50. Lu, G., A new method for protein structure comparisons and similarity searches. J Appl. Cryst, 2000. 33: p. 176-183.

51. Smith, T.F., M.S. Waterman, and C. Burks, The statistical distribution of nucleic acid similarities. Nucleic Acids Res, 1985. 13(2): p. 645-56.

52. Robinson, A.B. and L.R. Robinson, Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc Natl Acad Sci U S A, 1991. 88(20): p. 8880-4.

53. Gumbel, E.J., Statistics of Extremes. 1958, New York: Columbia University Press.

54. Collins, J.F., A.F. Coulson, and A. Lyall, The significance of protein sequence similarities. Comput Appl Biosci, 1988. 4(1): p. 67-71.

55. Olsen, R., R. Bundschuh, and T. Hwa, Rapid assessment of extremal statistics for gapped local alignment. Proc Int Conf Intell Syst Mol Biol, 1999: p. 211-22.

56. Hadley, C. and D.T. Jones, A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure Fold Des, 1999. 7(9): p. 1099-112.

57. Pearl, F.M.G., et al., The CATH extended protein-family database: Providing structural annotations for genome sequences. Protein Science, 2002(11): p. 233-244.

58. Mallick, P., D. Rice, and D. Eisenberg, DAPS: database of distant aligned protein structures. 2001; Available from: http://www.doe-mbi.ucla.edu/~DAPS/.

59. Kleiger, G., et al., The 1.7Å Crystal Structure of BPI: A Study of How Two Dissimilar Amino Acid Sequences can Adopt the Same Fold. J. Mol. Biol., 2000. 299: p. 1019-1034.

60. Lo Conte, L., et al., SCOP: a structural classification of proteins database. Nucleic Acids Res, 2000. 28(1): p. 257-9.

61. Brenner, S.E., P. Koehl, and M. Levitt, The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res, 2000. 28(1): p. 254-6.

62. Chandonia, J.M., et al., ASTRAL compendium enhancements. Nucleic Acids Res, 2002. 30(1): p. 260-3.

63. Chandonia, J.M., et al., The ASTRAL Compendium in 2004. Nucleic Acids Res, 2004. 32(Database issue): p. D189-92.

64. Tatusova, T.A. and T.L. Madden, Blast 2 sequences - a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett, 1999. 174: p. 247-250.

65. Shakhnovich, E., V. Abkevich, and O. Ptitsyn, Conserved residues and the mechanism of protein folding. Nature, 1996. 379(6560): p. 96-8.

66. Sali, A., E. Shakhnovich, and M. Karplus, How does a protein fold? Nature, 1994. 369(6477): p. 248-51.

67. Wolynes, P.G., J.N. Onuchic, and D. Thirumalai, Navigating the folding routes. Science, 1995. 267(5204): p. 1619-20.

68. Abkevich, V.I., A.M. Gutin, and E.I. Shakhnovich, Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry, 1994. 33(33): p. 10026-36.

69. Fersht, A.R., Optimization of rates of protein folding: the nucleation-condensation mechanism and its implications. Proc Natl Acad Sci U S A, 1995. 92(24): p. 10869-73.

70. Itzhaki, L.S., D.E. Otzen, and A.R. Fersht, The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. J Mol Biol, 1995. 254(2): p. 260-88.

71. Gotoh, O., Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol, 1996. 264(4): p. 823-38.

72. Thompson, J.D., F. Plewniak, and O. Poch, A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 1999. 27(13): p. 2682-90.

73. Bowie, J.U., et al., Three-dimensional profiles for measuring compatibility of amino acid sequence with three-dimensional structure. Methods Enzymol, 1996. 266: p. 598-616.

74. Lee, M.M., R. Bundschuh, and M.K. Chan, Distant homology detection using a LEngth and STructure-based sequence Alignment Tool (LESTAT). Proteins, 2007.

75. Becker, K.G., et al., PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics, 2003. 4: p. 61.

76. Kaplan, N., A. Vaaknin, and M. Linial, PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res, 2003. 31(19): p. 5617-26.

77. Shatkay, H., et al., SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics, 2007. 23(11): p. 1410-7.

78. Grundy, W.N., Homology detection via family pairwise search. J Comput Biol, 1998. 5(3): p. 479-91.

79. Alam, I., et al., Comparative homology agreement search: an effective combination of homology-search methods. Proc Natl Acad Sci U S A, 2004. 101(38): p. 13814-9.

80. Karlin, S. and S.F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A, 1990. 87(6): p. 2264-8.

81. Karlin, S. and A. Dembo, Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Prob., 1992. 24: p. 113-140.

82. Karlin, S. and S.F. Altschul, Applications and statistics for multiple high-scoring segments in molecular sequences. Proc Natl Acad Sci U S A, 1993. 90(12): p. 5873-7.

83. Mott, R., Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol., 1992. 54: p. 59-75.

84. Vingron, M. and M.S. Waterman, Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol, 1994. 235(1): p. 1-12.

85. Waterman, M.S. and M. Vingron, Rapid and accurate estimates of statistical significance for sequence data base searches. Proc Natl Acad Sci U S A, 1994. 91(11): p. 4625-8.

86. Altschul, S.F. and W. Gish, Local alignment statistics. Methods Enzymol, 1996. 266: p. 460-80.

87. Bailey, T.L. and M. Gribskov, Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 1998. 14(1): p. 48-54.

88. Gribskov, M. and N.L. Robinson, Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 1996. 20: p. 25-33.

89. Bailey, T.L. and W.M. Grundy, Classifying proteins by family using the product of correlated p-values. Proceedings of the Third Annual International Conference on Computational Molecular Biology, 1999: p. 10-14.

90. Goodsell, D.S. and A.J. Olson, Structural symmetry and protein function. Annu Rev Biophys Biomol Struct, 2000. 29: p. 105-53.

91. Jackson, R.M., H.A. Gabb, and M.J. Sternberg, Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J Mol Biol, 1998. 276(1): p. 265-85.

92. Vakser, I.A., Evaluation of GRAMM low-resolution docking methodology on the hemagglutinin-antibody complex. Proteins, 1997. Suppl 1: p. 226-30.

93. Lu, L., H. Lu, and J. Skolnick, MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins, 2002. 49(3): p. 350-64.

94. Dionne, I., et al., A heterotrimeric PCNA in the hyperthermophilic archaeon Sulfolobus solfataricus. Mol Cell, 2003. 11(1): p. 275-82.

95. Javelle, A., et al., Ammonium sensing in Escherichia coli. Role of the ammonium transporter AmtB and AmtB-GlnK complex formation. J Biol Chem, 2004. 279(10): p. 8530-8.

96. Li, X., et al., Structure of the Nitrosomonas europaea Rh protein. Proc Natl Acad Sci U S A, 2007. 104(49): p. 19279-84.

97. Lupo, D., et al., The 1.3-A resolution structure of Nitrosomonas europaea Rh50 and mechanistic implications for NH3 transport by Rhesus family proteins. Proc Natl Acad Sci U S A, 2007. 104(49): p. 19303-8.

98. Valdar, W.S. and J.M. Thornton, Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins, 2001. 42(1): p. 108-24.

99. Valdar, W.S. and J.M. Thornton, Conservation helps to identify biologically relevant crystal contacts. J Mol Biol, 2001. 313(2): p. 399-416.

100. The CCP4 suite: programs for protein crystallography. Acta Crystallogr D Biol Crystallogr, 1994. 50(Pt 5): p. 760-3.

101. Gulbis, J.M., et al., Structure of the C-terminal region of p21(WAF1/CIP1) complexed with human PCNA. Cell, 1996. 87(2): p. 297-306.

102. Matsumiya, S., Y. Ishino, and K. Morikawa, Crystal structure of an archaeal DNA sliding clamp: proliferating cell nuclear antigen from Pyrococcus furiosus. Protein Sci, 2001. 10(1): p. 17-23.

103. Imamura, K., et al., Specific interactions of three proliferating cell nuclear antigens with replication-related proteins in Aeropyrum pernix. Mol Microbiol, 2007. 64(2): p. 308-18.

104. Williams, G.J., et al., Structure of the heterotrimeric PCNA from Sulfolobus solfataricus. Acta Crystallogr Sect F Struct Biol Cryst Commun, 2006. 62(Pt 10): p. 944-8.

105. Argos, P., An investigation of protein subunit and domain interfaces. Protein Eng, 1988. 2(2): p. 101-13.

106. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.

107. Larkin, M.A., et al., Clustal W and Clustal X version 2.0. Bioinformatics, 2007. 23(21): p. 2947-8.

108. Notredame, C., D.G. Higgins, and J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol, 2000. 302(1): p. 205-17.

109. Edgar, R.C., MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004. 5: p. 113.

110. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.
