Methods for Global Organization of all

Known Protein Sequences

Thesis submitted for the degree

"Doctor of Philosophy"

Golan Yona

Submitted to the Senate of the Hebrew University in the year 1999


This work was carried out under the supervision of

Prof. Nathan Linial, Prof. Naftali Tishby and Dr. Michal Linial.


Acknowledgments

First, I would like to thank my advisors: Nati Linial and Tali Tishby of the computer science
department and Michal Linial of the life science department at the Hebrew University. Within the
last five years we have come a long way together, and I'm grateful for their guidance, support,
warmth and generous help all these years.

I would like to thank my colleagues and friends in the machine learning lab, with whom I spent
most of the last five years: Itay Gat, Lidror Troyansky, Elad Schneidman, Shlomo Dubnov, Shai
Fine, Noam Slonim, Ofer Neiman and Ran El-Yaniv, for all their help and for being great company.
A special thanks to Ran El-Yaniv, a dear friend who encouraged me to keep believing in myself. I
very much enjoyed working with Gill Bejerano, with whom I am still working on fascinating new
directions.

I would next like to thank all of my colleagues and friends in the molecular genetics and
biotechnology lab headed by : Ora Schueler, Eyal Nadir, Iddo Friedberg, Yael
Mandel and Yael Altuvia, for the great time I had every Thursday in our journal club group
meetings. Besides the many papers I had in my bag after each such meeting, it was also great fun.
A very special thanks to Yael Altuvia for her continuous and wonderful help and support, and to
Hanah Margalit, who was like an advisor for me, always happy to answer questions and give advice,
as well as give me encouragement in hard times.

I thank Amnon Barak for making the MOSIX parallel system available, and the people in
Compugen Ltd. who gave me access to their Bioccelerator and to their software.

It was a pleasure to work with Alex Kremer, Avi Kavas, Yoav Etsion and Daniel Avrahami,

the great team of students with whom I created the ProtoMap web site.

I thank , Steven Brenner, Patrice Koehl, Boaz Shaanan, Yoram Gdalyahu and
Ran El-Yaniv for critically reading parts of this manuscript and for making many helpful comments,
and Avner Magen for valuable suggestions. A special thanks to Nati Linial, my advisor, who with
much patience read most of this thesis, commenting, correcting my English, and improving my
writing style.

To my family, especially my mother and father for their tremendous love and encouragement,
and for continually reminding me that they will be happy with whatever I choose to do (always
leaving me the option of becoming a carpenter).

And last, to my best friends, Yoram Gdalyahu and Rami Doron, for their enormous help and
for their invaluable friendship during these intensive years.

Contents

1 Introduction
1.1 What are proteins?
1.2 Functional analysis of protein sequences
1.3 The explosion of biological information
1.4 Towards global organization

2 Comparison of protein sequences
2.1 Alignment of sequences
2.1.1 Global similarity of protein sequences
2.1.2 Penalties for gaps
2.1.3 Global distance alignment
2.1.4 Position dependent scores
2.1.5 Local alignments
2.2 Other algorithms for sequence comparison
2.2.1 BLAST (Basic Local Alignment Search Tool)
2.2.2 FASTA
2.3 Probability and statistics of sequence alignments
2.3.1 Basic random model
2.3.2 Statistics of global alignment
2.3.3 Statistics of local alignment
2.4 Scoring matrices and gap penalties
2.4.1 The PAM family of scoring matrices
2.4.2 The BLOSUM family of scoring matrices
2.4.3 Information content of scoring matrix
2.4.4 Gap penalties

3 Large scale analyses of protein sequences
3.1 Motif and domain based analyses
3.1.1 The PROSITE dictionary [Bairoch 1991]
3.1.2 The BLOCKS database [Henikoff & Henikoff 1991]
3.1.3 The ProDom database [Sonnhammer & Kahn 1994]
3.1.4 The Pfam database [Sonnhammer et al. 1997]
3.1.5 The DOMO database [Gracy & Argos 1998]
3.2 Protein based analysis
3.2.1 The study by [Gonnet et al. 1992]
3.2.2 The study by [Harris et al. 1992]
3.2.3 The study by [Watanabe & Otsuka 1995]
3.2.4 The PIR classification [Barker et al. 1996]
3.2.5 The COGS database [Tatusov et al. 1997]
3.2.6 The SYSTERS database [Krause & Vingron 1998]
3.3 Alternative representations of protein sequences
3.3.1 The study by [van Heel et al. 1991]
3.3.2 The study by [Ferran et al. 1994]
3.3.3 The study by [Han & Baker 1995]
3.3.4 The study by [Casari et al. 1995]
3.4 Structural classifications of proteins
3.5 Summary

4 Part I - The Euclidean embedding approach
4.1 Constructing a metric space on the protein sequence space
4.2 Why embed?
4.3 Embedding strategies
4.3.1 Classical approach
4.3.2 Current approach
4.4 Application of embedding techniques for protein sequences

5 Data clustering
5.1 Introduction
5.1.1 Basic definitions
5.1.2 Clustering methods
5.1.3 Partitional clustering algorithms
5.1.4 The basic clustering procedure
5.1.5 Hierarchical clustering algorithms
5.1.6 Cluster validation
5.2 Our clustering algorithm
5.2.1 Algorithm outline
5.2.2 Cross validation
5.2.3 Algorithm sketch
5.2.4 Conditions for agreement

6 Global organization of protein segments
6.1 Results
6.1.1 Clusters of protein sequences
6.1.2 "Fingerprints" of biological families based on cluster membership
6.1.3 Higher-level measures of similarity between sequences
6.2 Discussion

7 Part II - The graph-based approach
7.1 Preface
7.2 Pairwise clustering algorithms
7.2.1 The single linkage algorithm
7.2.2 The complete linkage algorithm
7.2.3 The average linkage algorithm
7.2.4 Other pairwise clustering algorithms

8 The graph based approach - representation and algorithms
8.1 Introduction
8.1.1 Related works
8.2 Methods
8.2.1 Defining the graph
8.2.2 Placing all methods on a common numerical scale
8.2.3 Neighbors' lists
8.2.4 Exploring the connectivity

9 ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space
9.1 General information on clusters
9.2 Performance evaluation
9.2.1 The evaluation methodology
9.2.2 The reference databases
9.2.3 Evaluation results
9.2.4 Critique of the evaluation methodology
9.2.5 New clusters
9.3 ProtoMap as a tool for analysis
9.3.1 Tracing the formation of clusters
9.3.2 Hierarchical organization within protein families and superfamilies
9.3.3 Possibly related clusters and local maps
9.4 Discussion
9.5 Appendix A - Correlation of ProtoMap releases

10 The ProtoMap website
10.1 How to use ProtoMap
10.1.1 Search
10.1.2 Browsing a cluster
10.1.3 Analyzing new sequences

11 Concluding remarks
11.1 The whole picture
11.2 Future directions


Chapter 1

Introduction

This study focuses on methods for global organization of all known protein sequences. The main
objective of this research is to explore high order organization in the protein sequence space, and to
identify clusters of related proteins, by utilizing information about pairwise similarities. By these
means we wish to achieve a classification of all protein sequences into groups of shared biological
function (protein families), and gain a better understanding of the geometry of the sequence space.

The category of this work is "computational ". This is an essentially new
discipline which is rooted in two different disciplines: computer science and molecular biology.
Being on the border line between the two disciplines, this study is related to fields of intensive
research in both. It incorporates several recent developments in the theory of metric embedding
and unsupervised learning, efficient graph algorithms, algorithms for the comparison of protein
sequences, and the statistical theory of sequence comparison. While a full survey of each field is
beyond the scope of this thesis, I have provided sufficient background to evaluate this work.

This chapter begins with a very brief introduction to protein sequences, then defines the basic
problem on which this study is focused, and introduces the methods that are employed. A more
detailed introduction and discussion of each subject appears in the following chapters.

1.1 What are proteins?

Proteins are the molecules which form much of the functional and structural machinery of each
cell in every organism. They carry out the transport and storage of small molecules, and are
involved in the transmission of biological signals. Proteins include the enzymes which catalyze
and regulate a variety of biochemical processes in the cell, as well as antibodies which defend the
body against infections. Depending on the tissue, each type of cell has several thousand kinds of
proteins which play a primary role in determining the characteristics of the cell and how it functions.

Proteins are complex molecules. They are assembled from 20 different amino acids, with
diverse chemical properties. The amino acids form the vocabulary that allows proteins to exhibit a
great variety of structures and properties. In the cell, the code for proteins is stored in the DNA
sequence, the cell's repository of genetic information, whose building blocks are four different small
molecules, called nucleotides. Each protein sequence is encoded in a DNA sequence called a
gene, in which every block of three nucleotides (codon) corresponds to an individual amino acid
(the set of rules that specify which amino acid is encoded in each codon is called the genetic
code). Special machinery in the cell is responsible for the decoding process, i.e., the translation
and execution of a gene's "instructions" resulting in the formation of a protein sequence.
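The codon-to-amino-acid mapping described above can be sketched in a few lines of Python. This is only an illustration: the table below is a small, deliberately partial subset of the 64-codon genetic code, and the function name `translate` and the input string are hypothetical, chosen for this example.

```python
# Partial codon table, for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCT": "A",  # alanine
    "GCC": "A",  # alanine again: several codons may encode the same amino acid
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate a coding DNA string codon by codon, stopping at a stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    for i in range(0, usable_length, 3):
        # Raises KeyError for codons that are missing from the partial table.
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTGCCAAATGGTAA"))
```

The two alanine codons in the table illustrate the redundancy of the genetic code: distinct DNA sequences can yield the same protein sequence.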


Each protein is created in the cell first as a well-defined chain of amino acids, whose sequence is
called the primary structure of a protein. The average length of a protein sequence is 350 amino
acids, but it can be as short as a few amino acids, and as long as a few thousand (the longest protein
known is over 5000 amino acids long). Schematically, each amino acid may be represented by one
letter, and proteins can be viewed as long "words" over an alphabet of 20 letters.

According to the central dogma of protein folding, the protein sequence (the primary structure)
dictates how the protein folds in three dimensions. It is the specific three dimensional structure that
enables the protein to function in its particular biological role. This folding prescribes the function
of the protein, and the way it interacts with other molecules. Different terms are used to describe
the structural hierarchy in proteins: Secondary structures are local sequence elements (3-40
amino acids long) that have a well determined regular shape, such as an α (alpha) helix or a β (beta)
strand. Other local sequence elements, which are neither helices nor strands, are usually called loops
or coils, and they may adopt a large variety of shapes. Secondary structures are packed into what
is called the tertiary structure - the three dimensional (3D) structure of the whole sequence. In
general, helices and strands form the core of the protein structure, whereas loops are more likely
to be found on the surface of the molecule. The shape (the architecture) of the 3D structure is
called the fold. Each protein sequence has a unique 3D structure, but several different proteins
may adopt the same fold, i.e., the shape of their 3D structure is similar. Finally, several protein
sequences may arrange themselves together to form a complex, which is called the quaternary
structure. The terms "motifs" and "domains" are also very common when describing protein
structure/function. A motif is a simple combination of a few consecutive secondary structure
elements with a specific geometric arrangement (e.g., helix-loop-helix). Some, but not all, motifs
are associated with a specific biological function. A domain is the fundamental unit of structure.
It combines several secondary elements and motifs, not necessarily contiguous, which are packed in
a compact globular structure. A domain can fold independently into a stable 3D structure, and it
has a specific function. A protein may be composed of a single domain, several different domains,
or several copies of the same domain.

Not every possible combination of amino acids can form a stable protein sequence that folds
and functions properly. Evolution has "selected" only (some of) those sequences which could fold
into a stable functional structure. The term sequence space is used to describe the set of all
currently known protein sequences, and those expected to exist. The term structure space refers
to the set of all folds adopted by protein sequences. Again, not every shape is possible, and not
every possible shape is expected to exist.

1.2 Functional analysis of protein sequences

One of the main goals of molecular biology is to understand the structure and function of proteins.
Knowledge of the biological function of a protein sequence is essential for the understanding of
fundamental biochemical processes, for drug design and for genetic engineering, to name a few. The
three dimensional structure of a protein gives the most information about its biological function.
However, determining the three dimensional structure of a protein (either using X-ray diffraction
or nuclear magnetic resonance) is difficult, and not always feasible. At the moment, there are only
several thousand known structures. If the rules for folding were fully understood, it would be
possible, in principle, to deduce the three dimensional structure of a protein from its sequence
alone. However, this cannot yet be done since we still lack a full understanding of the molecular
folding process. As a result, structure prediction methods have so far achieved only moderate
success.

In the absence of structural data, sequence analysis remains the main source of information for
most new proteins. Therefore, a considerable effort has been made to predict, based only on their
sequence, the biochemical properties of proteins and their functional role in living cells. Shared
evolutionary ancestry is at the basis of such studies. Evolution has preserved sequences which have
proved to perform correctly, and proteins with similar sequences are found in different organisms.
Similar proteins can also be found within the same organism, due to duplication and shuffling of
coding segments in the DNA during evolution. In most cases, sequence similarity entails similar
or related functions. Therefore, detecting similarities between protein sequences can help to reveal
the biological function of new protein sequences, and to assess their biological relevance.

Sequence similarity is not always easily detectable. During evolution sequences have changed
by insertions, deletions and mutations. These evolutionary events may be traced today by
applying algorithms for sequence comparison. Over the last 30 years many different algorithms and
approaches have been tested. Today there are a number of generally accepted methods for comparing
pairs of sequences and assessing their similarity/distance (see chapter 2). These algorithms are
quite successful in exposing sequence similarity, and have become a standard tool for biologists.

Several different "levels" of similarity are defined. When sequences share a significant sequence
similarity they are usually assumed to have a common evolutionary ancestry, and are called
homologous proteins. Proteins that are evolutionarily related usually have the same core structure,
but they may have very different loops. A protein family is a group of homologous proteins,
and proteins which belong to the same family have the same or very similar biological function. A
group of a few protein families which are distantly related through evolution make up a protein
superfamily. Proteins which belong to the same superfamily have close or related biological
functions, but sequence similarity is not always detectable between members of the same superfamily.
A few protein superfamilies may adopt the same fold. Proteins which have the same fold may have
related biological functions. This is not necessarily due to sequence similarity or evolutionary
relationship, and therefore this kind of similarity is not always detectable using the standard methods
for sequence comparison.

1.3 The explosion of biological information

In recent years we have witnessed a massive flow of new biological data. Large-scale sequencing
projects throughout the world turn out new sequences, and create new challenges for investigators.
These ongoing sequencing efforts have already uncovered the sequences of over 100,000 proteins.
Such projects continue to yield the sequences of many new proteins whose function is not known.
Currently, 15 complete genomes (yeast, C. elegans, E. coli, other eubacteria, and several archaea)
are known. Between 35% and 50% of their proteins have not yet been assigned a function [Pennisi 1997,
Doolittle 1998], and await analysis.

In the absence of structural data, analysis necessarily starts with the protein sequence. Sequence
analysis considers the basic properties of individual amino acids, as well as their combination.
Given a new protein sequence, the current approach to predicting its function and analyzing its
properties hinges on pairwise comparisons with the sequences of other proteins whose properties
are already known. A new sequence is analyzed by extrapolating the properties of its "neighbors".
Such methods were applied during the last three decades with considerable success and helped to
identify the biological function of many protein sequences, as well as to reveal many distant and
interesting relationships between protein families. Today, the routine procedure for analysis of a
new protein sequence almost always starts with a comparison of the sequence at hand with the
sequences in one or more of the main sequence databases. However, in many cases sequences have
diverged to the extent that their common origin is untraceable by a direct sequence comparison.
In such cases this method fails to provide clues about the functionality of the protein in question.
Consequently, many protein sequences in the protein databases still have unidentified functions.

1.4 Towards global organization

Over a decade ago, biologists began to realize that the amount of biological data accumulated in
the databases can no longer be analyzed only by means of pairwise sequence comparison. Thus
powerful tools to analyze and organize this data became a necessity. New methods for utilizing
the information in these databases have been applied, such as multiple alignment techniques
(comparisons of three or more sequences), profile analysis (a position-specific scoring matrix which
provides a representation of a domain or a protein family), and pattern recognition methods. These
techniques enhanced the ability to detect relationships between distantly related proteins.

Several large-scale analyses were carried out, most of which focused on finding common motifs,
domains and patterns of biological significance in protein sequences (see chapter 3 for details). The
basic approach in many of these studies is based on string analysis at the local level, by applying
a semi-supervised procedure in which a pre-defined group of related sequences serves as the basis
for building a model (e.g., consensus pattern, profile, HMM). This is usually achieved by applying
multiple sequence alignment techniques. These analyses yielded excellent databases of motifs and
domains which can be used to search for significant patterns in new sequences.

However, the prerequisite of having pre-defined groups as input prevented most of these analyses
from being applied to all protein sequences. Moreover, these studies did not yield a mathematical
representation of protein sequences, and did not provide us with a global view of the sequence
space. Such a view can lead to the discovery of high-level features of the protein space. This is
extremely important in view of the fact that the common methods for protein sequence analysis
fail to assign a clear biological function to more than 40% of the sequences in the databases.

From a global point of view, each sequence is a point in the sequence space, and a high order
global organization of the protein sequence space is sought. The global point of view may help in
exposing the geometry of the sequence space, and in detecting ordered structures (clusters) within
the space. Moreover, it is expected to add insight about the function of a protein from its relative
position in the space, and from its overall connections with its close neighborhood. Thus, when
currently available means of comparison fail to give satisfactory information about the biological
function of a protein, a global view may add the missing clues.

The task of finding high order organization entails the solution of several fundamental problems,
some of which are inherent to the analysis of protein sequences. Specifically,

• How to define a distance measure among protein sequences? A distance measure is important
  for the exploration of the geometry of the sequence space. Though a number of methods for
  sequence comparison have been developed, none of them produces a measure which is a
  distance measure on one hand and sensitive to weak similarities on the other (see chapters 2,
  4).

  Moreover, existing similarity/distance measures cannot distinguish the far from the very
  far, and for sequences that have diverged significantly, the numerical distance measure is
  essentially meaningless. This poses serious problems in defining the geometry of the sequence space.

  Apparently no effective distance measure exists for protein sequences, by means of the
  sequence proper. However, if one could represent protein sequences as feature vectors in real
  space of some dimension, then a natural measure of distance would emerge. This leads to the
  second difficulty.

• What is the "correct" coding of protein sequences? Several works studied the mathematical
  representation of protein sequences, based on sequence properties, such as amino acid
  composition or chemical features (see section 3.3). Indeed, some information is encoded in
  the composition of amino acids in the sequence. However, this representation cannot capture
  the essence of proteins as ordered sequences of amino acids. Statistics of longer substrings
  in the sequence may encode short term local order within the sequence. But practically,
  distributions of k-tuples are limited to dipeptides or tripeptides at the most, since already
  for k=2 the number of possible dipeptides is 400, whereas the length of an average protein
  is around 350 amino acids. Addition of chemical features may improve the representation
  but still misses the order of amino acids within the sequence. Consequently, the information
  encoded in the order of amino acids within a protein sequence cannot be fully recovered by
  these representations.

• The organization involves problems of unsupervised learning (how to obtain the "correct"
  clustering). In general, clustering (pattern recognition) is largely an unsolved problem.
  Even the concept of a "correct" clustering is not well defined.

  There are many heuristics for the clustering problem (see chapter 5). The choice of the
  clustering algorithm may have a great effect on the outcome, but none of these algorithms
  is guaranteed to find the "best" or the "correct" clustering. Moreover, structures within the
  space may be subject to noise and bias in the training data, which makes their detection
  even more difficult.

  Finally, it is not clear how to compare clusterings which are obtained by different approaches,
  in the absence of an apparent general quality measure. Known quality measures are usually
  application specific, but no obvious measure exists for protein sequences. Therefore, given
  a model, it is hard to evaluate its validity and its ability to generalize to new samples (see
  chapter 5).
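The sparsity argument in the second item (400 possible dipeptides against an average length of about 350) can be made concrete with a short Python sketch. The function names and the toy sequence are hypothetical, introduced only for this illustration; they are not part of the methods used in this work.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def ktuple_counts(sequence, k):
    """Count the overlapping k-tuples (substrings of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def num_possible_ktuples(k):
    """Number of distinct k-tuples over the 20-letter amino acid alphabet."""
    return len(AMINO_ACIDS) ** k


toy_sequence = "MKVLAAGIVKMKVLA"  # hypothetical sequence, for illustration only
counts = ktuple_counts(toy_sequence, 2)

print(num_possible_ktuples(2))        # number of possible dipeptides
print(num_possible_ktuples(3))        # number of possible tripeptides
print(sum(counts.values()))           # observed dipeptides: len(sequence) - 1
print(counts.most_common(3))
```

A sequence of length 350 contains only 349 overlapping dipeptides, so its dipeptide distribution over 400 possible values is necessarily sparse, and the situation is far worse for tripeptides.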

Starting from the novel concept of global organization, the work presented here focuses on
methods for global organization of all protein sequences. Specifically, two different approaches were
tested: (i) a Euclidean embedding approach, and (ii) a graph-based approach.

The Euclidean embedding approach studies segments of protein sequences and consists of three
main steps: (i) constructing a metric space on the protein sequence space, (ii) low-distortion
embedding of this metric space in Euclidean space, and (iii) hierarchical cross-validated clustering.
The graph-based approach studies complete protein sequences. An exhaustive comparison of all
protein sequences in the sequence database yields a list of significant similarities between protein
sequences. Based on this list, the sequence space is represented as a directed weighted graph, and
graph algorithms are applied to identify clusters of related proteins in this space.

In the next chapters I will present these two approaches in detail. Chapter 2 is a preliminary
chapter which describes algorithms for sequence comparison, and the statistics of sequence
alignment. Chapter 3 is a survey of large-scale analyses of protein sequences. Chapter 4 describes the
first two steps of the Euclidean embedding approach, and chapter 5 describes the third step
(cross-validated clustering). The results of this approach when applied to protein sequences are given in
chapter 6. Chapters 7 and 8 introduce the graph-based approach. The results of this approach
as well as a performance evaluation and a few biological examples are given in chapter 9. A tour
through a web site which was constructed to present the results of this study follows in chapter 10.
Chapter 11 closes this thesis with concluding remarks and suggestions regarding the direction of
future research.

Chapter 2

Comparison of protein sequences

During the last three decades a considerable effort has been made to develop algorithms that
compare sequences of macromolecules (proteins, DNA). The purpose of such algorithms is to detect
evolutionary, structural and functional relations among the sequences. Successful sequence
comparison would allow one to infer the biological properties of new sequences from data accumulated on
related genes. For example, a similarity between a translated nucleotide sequence and a known
protein sequence suggests a homologous coding region in the corresponding nucleotide sequence.
Significant sequence similarity among proteins may imply that the proteins share the same
secondary and tertiary structure, and have close biological functions. The prediction of unknown
protein structures is often based on the study of known structures of homologous proteins.

This chapter is a survey of sequence comparison, scoring schemes, and the statistics of sequence
alignments which are essential for the purpose of distinguishing true relations among proteins from
chance similarities. The chapter is based on books by M. Waterman [1995] and by Setubal and
Meidanis [1996], as well as on various papers in this field.

2.1 Alignment of sequences

As is well known, the DNA molecule serves as a blueprint for the genetic information of every
organism. Therefore, the evolution of organisms must be related to changes in the DNA. The
simplest events of molecular evolution are the substitution of one base by another (point mutations)
and the insertion or deletion of a base pair. Suppose that the sequence b is obtained from the
sequence a by substitutions, insertions and deletions. It is customary and useful to represent the
transformation by an alignment where a is written above b with the common (conserved) bases
aligned appropriately. For example, say that a = ACTTGA and b is obtained by substituting the
second letter from C to G, inserting an A between the second and the third letters, and by deleting
the fifth base (G). The corresponding alignment will be:

a = A C - T T G A

b = A G A T T - A

We usually do not actually know which sequence evolved from the other. Therefore the events are

not directional and insertion of A in b might have b een a deletion of A in a.

In a typical application we are given two related sequences and we wish to recover the
evolutionary events that transformed one to the other. The goal of sequence alignment is to find the
correct alignment that encodes the true series of evolutionary events which have occurred. The
alignment can be assigned a score which accounts for the number of identities (a match of two
identical letters), the number of substitutions (a match of two different letters), and the number
of gaps (insertions/deletions). With high scores for identities, and low scores for substitutions and
gaps, the basic strategy towards tracing the correct alignment seeks the alignment which scores
best (see next section for details).
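As a minimal sketch of such a scoring scheme, the following Python function scores a given alignment column by column. The score values (+1 for an identity, -1 for a substitution, -2 for a gap) are arbitrary assumptions made for this illustration, not values used in this work, and the function name `alignment_score` is hypothetical. The input is the example alignment of a = ACTTGA and b = AGATTA from above, with `-` marking an insertion/deletion.

```python
def alignment_score(row_a, row_b, identity=1, substitution=-1, gap=-2):
    """Score an alignment given as two equal-length rows, '-' denoting a gap.

    The default score values are illustrative assumptions only.
    """
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap            # insertion or deletion
        elif x == y:
            score += identity       # match of two identical letters
        else:
            score += substitution   # match of two different letters
    return score


# The alignment that encodes the three events of the example.
print(alignment_score("AC-TTGA", "AGATT-A"))
# The same two sequences aligned without any gaps.
print(alignment_score("ACTTGA", "AGATTA"))
```

Which of the two alignments scores higher depends entirely on the chosen score values, which is why the choice of scoring matrices and gap penalties receives so much attention.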

The algorithms described below may be applied to the comparison of protein sequences as well
as to DNA sequences (coding or non-coding regions). Though the evolutionary events occur at the
DNA level, the main genetic pressure is on the protein sequence. Consequently, the comparison of
protein sequences has proven to be a much more effective tool [Pearson 1996]. Mutations at the
DNA level do not necessarily change the encoded amino acid due to the redundancy of the genetic
code. Mutations often result in conservative substitutions at the protein level, namely, replacement
of an amino acid by another amino acid with similar biochemical properties. Such changes tend to
have only a minor effect on the protein's functionality. Within the scope of this work, and in view
of the last paragraph, this chapter focuses on the comparison of protein sequences.

2.1.1 Global similarity of protein sequences

Let $a = a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_m$, where $a_i, b_j \in \mathcal{A}$, the alphabet of amino acids, be two given
protein sequences. A global alignment of these sequences is an alignment where all letters of $a$
and $b$ are accounted for.

Let $s(a_i, b_j)$ be the similarity of $a_i, b_j$ and let $\alpha > 0$ be the penalty for deleting/inserting one
amino acid. The score of an alignment with $N_{ij}$ matches of $a_i$ and $b_j$ and $N_{gap}$ insertions/deletions
is defined as

\[ \sum_{i,j} N_{ij} \cdot s(a_i, b_j) - N_{gap} \cdot \alpha \]

The global similarity of sequences a and b is de ned as the largest score of any alignment of

sequences a and b, i.e.

X

S (a; b) = max f N  s(a ; b ) N  g

ij i j g ap

al ig nments

i;j

Usually, pairwise scores are the logarithm of likelihood ratios (log-odds), as explained in section 2.4. Therefore, the definition of the alignment score as the sum of the pairwise scores may be interpreted as a (log-)likelihood. The corresponding similarity score is thus essentially a maximum likelihood measure.

How can the best alignment be found? The exponentially large number of possible alignments makes it impossible to perform a direct search. For example, the number of possible alignments of two sequences of length 1000 exceeds $10^{600}$ [Waterman 1995]. However, a dynamic programming algorithm makes it possible to find the optimal alignment without checking all possible alignments, but only a very small portion of the search space.

Calculating the global similarity score $S(a, b)$

Denote by $S_{i,j}$ the score of the best alignment of the substring $a_1 a_2 \ldots a_i$ with the substring $b_1 b_2 \ldots b_j$, i.e.

\[ S_{i,j} = S(a_1 a_2 \ldots a_i,\; b_1 b_2 \ldots b_j) \]


The optimal alignment of $a_1 a_2 \ldots a_i$ with the substring $b_1 b_2 \ldots b_j$ can end in only one of the following three ways:

\[ \begin{array}{c} a_i \\ b_j \end{array} \qquad \begin{array}{c} a_i \\ - \end{array} \qquad \begin{array}{c} - \\ b_j \end{array} \]

Also, each sub-alignment must be optimal as well. Therefore, the score $S(a, b)$ can be calculated recursively:

Initialize

\[ S_{0,0} = 0, \qquad S_{i,0} = -i \cdot \alpha \ \text{ for } i = 1..n, \qquad S_{0,j} = -j \cdot \alpha \ \text{ for } j = 1..m \]

Recursively define

\[ S_{i,j} = \max\{ S_{i-1,j-1} + s(a_i, b_j),\; S_{i,j-1} - \alpha,\; S_{i-1,j} - \alpha \} \]

In practice, the scores are stored in a two-dimensional array of size $(n + 1) \times (m + 1)$. The initialization sets the values at row zero and column zero. The computation proceeds row by row so that the value of each matrix cell is calculated from entries which were already calculated. Specifically, we need the three matrix cells on the west, the south and the southwest of the current cell. The time complexity of this algorithm is $\Theta(n \cdot m)$.
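The recursion above translates almost line by line into code. The following is a minimal Python sketch, not the implementation used in this work: the names `global_similarity` and `toy_score` and the penalty value in the usage note are illustrative assumptions, and `alpha` stands for the penalty for inserting/deleting a single amino acid.

```python
from typing import Callable


def global_similarity(a: str, b: str,
                      s: Callable[[str, str], float],
                      alpha: float) -> float:
    """Global similarity S(a, b) with a fixed penalty alpha per inserted/deleted residue."""
    n, m = len(a), len(b)
    # S[i][j] holds the best score for aligning a[:i] with b[:j].
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # column zero: a[:i] aligned against gaps only
        S[i][0] = -i * alpha
    for j in range(1, m + 1):          # row zero: b[:j] aligned against gaps only
        S[0][j] = -j * alpha
    for i in range(1, n + 1):          # row by row, so the three needed cells exist
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # a_i matched with b_j
                S[i][j - 1] - alpha,                      # b_j against a gap
                S[i - 1][j] - alpha,                      # a_i against a gap
            )
    return S[n][m]


def toy_score(x: str, y: str) -> float:
    """Hypothetical scoring function: +1 for an identity, -1 for a substitution."""
    return 1.0 if x == y else -1.0
```

For example, `global_similarity("ACD", "AD", toy_score, 2.0)` aligns A with A and D with D and pays for one gap, so the score is 1 - 2 + 1 = 0. The sketch keeps the full matrix for clarity; two rows suffice if only the score is needed.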

Dynamic programming algorithms were already introduced in the late 1950s [Bellman 1957]. However, the first to propose a dynamic programming algorithm for global comparison of macromolecules were Needleman and Wunsch [Needleman & Wunsch 1970].

2.1.2 Penalties for gaps

In sequence evolution, an insertion or deletion of a segment (several adjacent amino acids) usually occurs as a single event. That is, the opening of the gap is the significant event. Therefore, most computational models assign a penalty for a gap of length $k$ that is smaller than the sum of $k$ independent gaps of length 1. If the penalty for a gap of length $k$ is $\alpha(k)$, we are thus interested in sub-additive functions and assume that $\alpha(x + y) \le \alpha(x) + \alpha(y)$.

Denote by $N_{k\text{-}gap}$ the number of gaps of length $k$ in a given alignment. Then the score of this alignment is defined in this case as

\[ \sum_{i,j} N_{ij} \cdot s(a_i, b_j) - \sum_{k} N_{k\text{-}gap} \cdot \alpha(k) \]

This definition allows the alignment to end in a gap of arbitrary length, and therefore $S_{i,j}$ is defined as follows:

Set

\[ S_{0,0} = 0, \qquad S_{i,0} = -\alpha(i), \qquad S_{0,j} = -\alpha(j) \]

then

\[ S_{i,j} = \max\Bigl\{ S_{i-1,j-1} + s(a_i, b_j),\; \max_{1 \le k \le j} \{ S_{i,j-k} - \alpha(k) \},\; \max_{1 \le l \le i} \{ S_{i-l,j} - \alpha(l) \} \Bigr\} \]

In practice, to compute each matrix cell, we now need to check the cell on the diagonal, all the cells in the same row and all the cells in the same column. For an arbitrary function $\alpha(k)$, the time complexity is $\sum_{i,j} (i + j + 1) = \Theta(n^2 \cdot m + n \cdot m^2 + n \cdot m)$ plus $n + m$ calculations of the function $\alpha(k)$. For $n = m$ we get $\Theta(n^3)$.


Optimization

A better time complexity can be obtained for a linear gap function $\alpha(k) = \alpha_0 + \alpha_1 \cdot (k - 1)$, where $\alpha_0$ is the penalty for opening a gap, and $\alpha_1$ is the penalty for extending the gap.

Set $E_{i,j} = \max_{1 \le k \le j} \{ S_{i,j-k} - \alpha(k) \}$. This is the maximum value over all the matrix cells in the same row, where a gap is tested with respect to each one of them. Note that

\begin{align*}
E_{i,j} &= \max\bigl\{ S_{i,j-1} - \alpha(1),\; \max_{2 \le k \le j} \{ S_{i,j-k} - \alpha(k) \} \bigr\} \\
        &= \max\bigl\{ S_{i,j-1} - \alpha_0,\; \max_{1 \le k \le j-1} \{ S_{i,j-(k+1)} - \alpha(k + 1) \} \bigr\} \\
        &= \max\bigl\{ S_{i,j-1} - \alpha_0,\; \max_{1 \le k \le j-1} \{ S_{i,(j-1)-k} - \alpha(k) \} - \alpha_1 \bigr\} \\
        &= \max\{ S_{i,j-1} - \alpha_0,\; E_{i,j-1} - \alpha_1 \}
\end{align*}

Therefore, only two matrix cells from the same row need to be checked. These cells correspond to the two options we need to check, i.e. opening a new gap at this position (first term) or extending a previously opened gap (second term). The sub-additivity of $\alpha$ (which holds for a linear gap function whenever $\alpha_1 \le \alpha_0$) implies that it is never beneficial to concatenate gaps.

Similarly, for a column define $F_{i,j} = \max_{1 \le l \le i} \{ S_{i-l,j} - \alpha(l) \}$ and in the same way obtain

\[ F_{i,j} = \max\{ S_{i-1,j} - \alpha_0,\; F_{i-1,j} - \alpha_1 \} \]

Initialize

\[ E_{0,0} = F_{0,0} = S_{0,0} = 0 \]

\[ E_{i,0} = S_{i,0} = -\alpha(i), \qquad F_{0,j} = S_{0,j} = -\alpha(j) \]

then

\[ S_{i,j} = \max\{ S_{i-1,j-1} + s(a_i, b_j),\; E_{i,j},\; F_{i,j} \} \]

and the time complexity is $\Theta(m \cdot n)$.
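The two-term recursions for E and F can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in this work: the names `global_similarity_affine`, `alpha0`, `alpha1` and `toy_score` are hypothetical, and the sketch leaves the boundary entries of E and F at minus infinity (the maximum over an empty set in their definitions) instead of the boundary values listed above, so that a gap opened next to the first row or column is still charged the opening penalty.

```python
import math
from typing import Callable


def global_similarity_affine(a: str, b: str,
                             s: Callable[[str, str], float],
                             alpha0: float, alpha1: float) -> float:
    """Global similarity with gap penalty alpha(k) = alpha0 + alpha1 * (k - 1)."""
    n, m = len(a), len(b)
    neg = -math.inf
    S = [[neg] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # best score ending in a gap along the row
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # best score ending in a gap along the column
    S[0][0] = 0.0
    for i in range(1, n + 1):
        S[i][0] = -(alpha0 + alpha1 * (i - 1))   # a single gap of length i
    for j in range(1, m + 1):
        S[0][j] = -(alpha0 + alpha1 * (j - 1))   # a single gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap (first term) or extend an existing one (second term).
            E[i][j] = max(S[i][j - 1] - alpha0, E[i][j - 1] - alpha1)
            F[i][j] = max(S[i - 1][j] - alpha0, F[i - 1][j] - alpha1)
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]), E[i][j], F[i][j])
    return S[n][m]


def toy_score(x: str, y: str) -> float:
    """Hypothetical scoring function: +2 for an identity, -1 for a substitution."""
    return 2.0 if x == y else -1.0
```

With `alpha0 = 3` and `alpha1 = 1`, aligning "ACGT" with "AT" pays 3 + 1 for one gap of length two and scores 2 - 4 + 2 = 0. Setting `alpha1 = alpha0` recovers the per-residue penalty model, which gives -2 for the same pair.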

For arbitrary convex gap functions it is possible to obtain an $O(n \cdot m)$ algorithm [Waterman 1995]. However, such functions seem to be inappropriate for the context of comparison of macromolecules. For an arbitrary concave function (and in particular, for sub-additive functions) it is possible to obtain a time complexity of $O(n^2 \cdot \log(n))$ with a more complex algorithm [Waterman 1995].

Gonnet et al. [Gonnet et al. 1992] have proposed a model for gaps that is based on gaps occurring in pairwise alignments of related proteins. The model suggests an exponentially decreasing gap penalty function (see section 2.4.4). However, a linear penalty function has the advantage of better time complexity, and in most cases the results are satisfactory. Therefore the use of linear gap functions is very common.

2.1.3 Global distance alignment

In some cases it is interesting to define a distance among sequences $D(a, b)$, for example for the construction of evolutionary trees, or when investigating the geometry of the sequence space. The advantage of defining a distance among sequences is the construction of a metric space on the space of sequences.

In general a metric is a function $D : X \times X \to \mathbb{R}$ which is:

• Nonnegative: $D(a, b) \ge 0$ for all $a, b \in X$, with equality if and only if $a = b$

• Symmetric: $D(a, b) = D(b, a)$

• Satisfies the triangle inequality: $D(a, b) \le D(a, c) + D(c, b)$ $\forall a, b, c \in X$


In contrast to similarity, where we are interested in the best alignment and its score gives a measure of how much the strings are alike, the distance approach assigns a cost to elementary edit operations (evolutionary events) and seeks a series of operations that transforms one string to another with the minimal cost.

The definition of distance resembles the definition of similarity, except that $s(a_i, b_j)$ is replaced with $d(a_i, b_j)$, which reflects the distance between amino acids $a_i$ and $b_j$, and $\alpha(k)$, the gap penalty, now adds to the total distance (instead of decreasing the similarity). The minimum distance is obtained by minimizing the sum of match/mismatch costs and the penalties for gaps:

\[ D(a, b) = \min_{alignments} \Bigl\{ \sum_{i,j} N_{ij} \cdot d(a_i, b_j) + \sum_{k} N_{k\text{-}gap} \cdot \alpha(k) \Bigr\} \]

One restriction is that $d(\cdot, \cdot)$ is a metric on $\mathcal{A}$, the alphabet of amino acids (and so satisfies the three requirements above). Also $\alpha(k)$ has to be positive. These restrictions imply that the total global distance $D(\cdot, \cdot)$ is zero only if the two sequences are identical.

In some cases, distance and similarity measures are related by a simple formula. Let $s(a_i, b_j)$ be a similarity measure over $\mathcal{A}$ and $\alpha(k)$ the penalty for a gap of length $k$. Let $d(a_i, b_j)$ be a metric on $\mathcal{A}$ and $\hat{\alpha}(k)$ a corresponding cost for gaps of length $k$. If there is a constant $c$ such that

\[ d(a_i, b_j) = c - s(a_i, b_j) \qquad \forall a_i, b_j \in \mathcal{A} \]

and

\[ \hat{\alpha}(k) = \alpha(k) + \frac{k c}{2} \]

then each alignment $A_l$ satisfies:

\[ D(A_l) = \frac{c (n + m)}{2} - S(A_l) \]

where $n$ ($m$) is the length of the sequence $a$ ($b$). In particular,

\begin{align*}
D(a, b) &= \min_{alignments\ A_l} D(A_l) \\
        &= \frac{c (n + m)}{2} - \max_{alignments\ A_l} S(A_l) \\
        &= \frac{c (n + m)}{2} - S(a, b)
\end{align*}

i.e. an alignment is similarity optimal if and only if it is distance optimal (the proof of this claim is based on the observation that $n + m = 2 \cdot \sum_{i,j} N_{ij} + \sum_{k} k \cdot N_{k\text{-}gap}$).
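The claim can be checked by direct substitution. The derivation below is a sketch that uses only the definitions of this section: $\alpha$ and $\hat{\alpha}$ denote the two gap cost functions, and each aligned pair is taken to consume two letters while a gap of length $k$ consumes $k$ letters, so that $n + m = 2 \sum_{i,j} N_{ij} + \sum_k k \, N_{k\text{-}gap}$.

```latex
\begin{align*}
D(A_l) &= \sum_{i,j} N_{ij}\, d(a_i,b_j) + \sum_k N_{k\text{-}gap}\, \hat{\alpha}(k) \\
       &= \sum_{i,j} N_{ij}\,\bigl(c - s(a_i,b_j)\bigr)
          + \sum_k N_{k\text{-}gap}\Bigl(\alpha(k) + \frac{kc}{2}\Bigr) \\
       &= \frac{c}{2}\Bigl(2\sum_{i,j} N_{ij} + \sum_k k\,N_{k\text{-}gap}\Bigr)
          - \Bigl(\sum_{i,j} N_{ij}\, s(a_i,b_j) - \sum_k N_{k\text{-}gap}\,\alpha(k)\Bigr) \\
       &= \frac{c\,(n+m)}{2} - S(A_l)
\end{align*}
```

Since the term $c(n+m)/2$ is the same for every alignment of $a$ and $b$, minimizing $D(A_l)$ and maximizing $S(A_l)$ select the same alignments.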

Though the formula suggests a simple transformation from a similarity measure to a distance measure, it should be noted that the transformation from $s(a_i, b_j)$ to $d(a_i, b_j)$ does not yield a metric when applied to the common scoring schemes, since the value $s(a_i, a_i)$ varies among different amino acids (see section 2.4). Consequently, there is no $c$ such that $d(a_i, a_i) = c - s(a_i, a_i) = 0$ for all $a_i \in \mathcal{A}$, as needed.


2.1.4 Position dependent scores

In many proteins, mutations are not equally probable along the sequence. Some regions are functionally/structurally important and consequently, the effect of mutation in these regions can be drastic. They may create a nonfunctional protein or even prevent the molecule from folding into its native structure. Such mutations are unlikely to survive, and therefore these regions tend to be more evolutionarily conserved than other, less constrained regions (e.g. loops) which can significantly diverge.

Accordingly, it may be appropriate to use position-dependent scores for mismatches and gaps. The incorporation of information about structural preferences can lead to alignments that are more accurate biologically. If a protein's structure is known, the secondary structure should be taken into account. In the absence of such data, general structural criteria, such as the propensities of amino acids for occurring in secondary structures versus loops, can be taken into account. For example, the probability of opening a gap in existing secondary structure can be decreased, while the probability for opening/inserting a gap in loop regions can be increased.

The dynamic programming algorithm can be adapted to account for position-specific scores as follows. Let $s_{i,j}(a, b)$ be the score for aligning $a$ and $b$ in the $i$-th position of $a = a_1 a_2 \ldots a_n$ and $j$-th position of $b = b_1 b_2 \ldots b_m$, respectively, and let $\alpha_i$ ($\alpha_j$) be the penalty for opening a gap at position $i$ ($j$) and $\beta_i$ ($\beta_j$) for extending it. Recursion is used as before to define the similarity of $a$ and $b$:

\[ E_{i,j} = \max\{ S_{i,j-1} - \alpha_j,\; E_{i,j-1} - \beta_j \} \]

\[ F_{i,j} = \max\{ S_{i-1,j} - \alpha_i,\; F_{i-1,j} - \beta_i \} \]

and

\[ S_{i,j} = \max\{ S_{i-1,j-1} + s_{i,j}(a_i, b_j),\; E_{i,j},\; F_{i,j} \} \]

Usually position-specific scoring matrices, or profiles, are not tailored to a specific sequence. Rather, they are built to utilize the information in a group of related sequences, and provide representations of protein families and domains. These representations are capable of detecting subtle similarities between distantly related proteins. Without going into detail, profiles are usually obtained by applying algorithms for multiple alignment (i.e., a combined alignment of several proteins) to align a group of related sequences. The frequency of each amino acid at each position along the multiple alignment is then calculated. These counts are normalized and transformed to probabilities, so that a probability distribution over amino acids is associated with each position. Finally, the scoring matrix is defined based on these probability distributions as well as on the similarities of pairs of amino acids (taken from a standard scoring matrix). For example, the score for aligning the amino acid $a$ at position $i$ of the profile is given by $s_i(a) = \sum_{b \in \mathcal{A}} prob(b \text{ at position } i) \cdot s(a, b)$ (a few minor modifications are needed to formulate the above algorithm for comparison of a sequence with a profile). For a review of algorithms for multiple alignment and profile techniques see [Waterman 1995, Setubal & Meidanis 1996, Gribskov & Veretnik 1996, Taylor 1996].
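The profile score described above can be illustrated with a short Python sketch. It assumes a gap-free multiple alignment given as equal-length strings and uses raw column frequencies as the probabilities; real profile methods typically add sequence weighting and pseudocounts, which are omitted here. The names `column_distributions`, `profile_score` and `toy_score` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List


def column_distributions(alignment: List[str]) -> List[Dict[str, float]]:
    """Amino acid frequencies at each column of a gap-free multiple alignment."""
    length = len(alignment[0])
    dists = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        dists.append({aa: c / total for aa, c in counts.items()})
    return dists


def profile_score(a: str, dist: Dict[str, float],
                  s: Callable[[str, str], float]) -> float:
    """s_i(a): the expected pairwise score of a against the column distribution."""
    return sum(p * s(a, b) for b, p in dist.items())


def toy_score(x: str, y: str) -> float:
    """Hypothetical pairwise scoring: +1 for an identity, -1 otherwise."""
    return 1.0 if x == y else -1.0
```

For the toy alignment `["AC", "AD", "GC", "AC"]` the first column holds A with probability 0.75 and G with probability 0.25, so aligning A to that column scores 0.75 - 0.25 = 0.5.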


2.1.5 Local alignments

In many cases the similarity of two sequences is limited to a specific motif or domain, the detection of which may yield valuable structural and functional insights, while outside of this motif/domain the sequences may be essentially unrelated. In such cases global alignment may not be the appropriate tool. In the search for an optimal global alignment, local similarities may be masked by long unrelated regions. Consequently, the score of such an alignment can be as low as for totally unrelated sequences. Moreover, the algorithm may even misalign the common region. A minor modification of the previous algorithm solves this problem.

First we define a local alignment of $a$ and $b$ as an alignment between a substring of $a$ and a substring of $b$. The local similarity of sequences $a$ and $b$ is defined as the maximal score over all possible local alignments.

Calculating the local similarity score

Define $H_{i,j}$ to be the maximum similarity of two segments ending at $a_i$ and $b_j$:

\[ H_{i,j} = \max\{ 0;\; S(a_x a_{x+1} \ldots a_i,\; b_y b_{y+1} \ldots b_j) : 1 \le x \le i,\; 1 \le y \le j \} \]

This quantity is calculated recursively as before. Initialize

\[ H_{i,0} = H_{0,j} = 0, \qquad 1 \le i \le n,\; 1 \le j \le m \]

then

\[ H_{i,j} = \max\Bigl\{ 0,\; H_{i-1,j-1} + s(a_i, b_j),\; \max_{1 \le k \le i} \{ H_{i-k,j} - \alpha(k) \},\; \max_{1 \le l \le j} \{ H_{i,j-l} - \alpha(l) \} \Bigr\} \]

and the local similarity of sequences $a$ and $b$ is obtained by maximizing over all possible segments

\[ H(a, b) = \max\{ H_{k,l} : 1 \le k \le n,\; 1 \le l \le m \} \]

For linear gap functions (that improve computation time) the formulation is done as before, with initialization $H_{i,j} = E_{i,j} = F_{i,j} = 0$ for $i \cdot j = 0$.

In the literature, this algorithm is often called the Smith-Waterman (SW) algorithm, after those who introduced this modification [Smith & Waterman 1981].
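A compact Python sketch of the local similarity computation with a linear gap function is given below. It follows the initialization stated above (all boundary entries set to zero); the names `local_similarity`, `alpha0`, `alpha1` and `toy_score`, as well as the penalty values used in the usage note, are illustrative assumptions and not taken from any particular implementation.

```python
from typing import Callable


def local_similarity(a: str, b: str,
                     s: Callable[[str, str], float],
                     alpha0: float, alpha1: float) -> float:
    """Local similarity H(a, b) with gap penalty alpha(k) = alpha0 + alpha1 * (k - 1)."""
    n, m = len(a), len(b)
    # H, E and F are all zero wherever i * j = 0.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - alpha0, E[i][j - 1] - alpha1)
            F[i][j] = max(H[i - 1][j] - alpha0, F[i - 1][j] - alpha1)
            # The extra 0 lets a local alignment start afresh at any cell.
            H[i][j] = max(0.0, H[i - 1][j - 1] + s(a[i - 1], b[j - 1]), E[i][j], F[i][j])
            best = max(best, H[i][j])  # maximize over all end points (k, l)
    return best


def toy_score(x: str, y: str) -> float:
    """Hypothetical scoring function: +2 for an identity, -1 for a substitution."""
    return 2.0 if x == y else -1.0
```

With `alpha0 = 3` and `alpha1 = 1`, the sequences "KKKACDEFGGG" and "TTACDEFWW" share only the segment ACDEF, so the local similarity is 10 regardless of the unrelated flanks, whereas two sequences with no common letters score 0.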

It should be noted that whereas global similarity measures can sometimes be transformed into global distance measures (as in section 2.1.3), no such transformation is known for local similarity measures. A case which seems to rule out the possibility of defining a local distance measure is depicted in Fig. 2.1. However, it is possible to define pseudo-metrics on protein sequences based on local similarity measures. For a brief discussion see chapter 4.

2.2 Other algorithms for sequence comparison

During the last two decades sequencing techniques have greatly improved. Many large scale sequencing projects of various organisms are carried out throughout the world, and as a result the number of new sequences which are stored in the databases is rapidly increasing. In a typical application a new protein sequence is compared with all sequences in the database, in search of related proteins.

[Figure 2.1: panel (a) shows sequences a and b; panel (b) shows sequences a, b and c.]

Figure 2.1: Local similarities make it difficult to define a distance between protein sequences. Two cases in mind are: (a) If b is a subsequence of a, the sequences are obviously related. However, a distance measure should account for those parts of the sequence a which are not matched with b. Consequently, the distance $D(a, b)$ may be as high as for totally unrelated sequences. (b) Multi-domain proteins. If b and c are unrelated sequences (i.e., $D(b, c) \gg 1$), then assigning a low distance for $D(a, b)$ and $D(a, c)$ will violate the triangle inequality.

The dynamic programming algorithms described above, which seek an optimal pairwise alignment of sequences (with insertions/deletions and substitutions), may not be suitable for this purpose. The complexity of these algorithms is quadratic (with a linear gap function), and the comparison of a sequence of average length 350 amino acids against a typical database (like SWISSPROT [Bairoch & Boeckman 1992], with more than 70,000 sequences) may take a few CPU hours on a standard present-day PC (Pentium-II 400 MHz).

Several algorithms have been developed to speed up the search. The two main algorithms are FASTA [Pearson & Lipman 1988] and BLAST [Altschul et al. 1990]. These are heuristic algorithms which are not guaranteed to find the optimal alignment. However, they proved to be very effective for sequence comparison, and they are significantly faster than the rigorous dynamic programming algorithm.

In the last few years, biotechnology companies such as Compugen and Paracel have developed special-purpose hardware that accelerates the dynamic programming algorithm [Compugen 1998]. This special-purpose hardware has again made the dynamic programming algorithm competitive with FASTA and BLAST, both in speed and in simplicity of use. However, meanwhile, FASTA and BLAST have become standard in this field and are being used extensively by biologists all over the world. Both algorithms are fast, effective, and do not require the purchase of additional hardware. BLAST has an additional advantage, as it may reveal similarities which are missed by the dynamic programming algorithm, for example when two similar regions are separated by a long dissimilar region.

2.2.1 BLAST (Basic Local Alignment Search Tool)

BLAST compares two sequences and seeks all pairs of similar segments whose similarity score exceeds a certain threshold. These pairs of segments are called "high scoring segment pairs" (HSPs). A segment is always a contiguous subsequence of one of the two sequences. A segment pair is a pair of segments of the same length, one from each sequence. Hence the alignment of the segments is without gaps. The score of the match is simply the sum of the scores of the matched amino acids (defined by a scoring matrix) along the segment pair. The segment pair with the highest score is called the "maximum segment pair" (MSP) (see footnote 1).

The algorithm is an outgrowth of the statistical theory for local alignments without gaps (see section 2.3). This theory gives a framework for assessing the probability that a given similarity between two protein sequences (i.e. the MSP) could have emerged by chance. If the probability is very low, then the similarity is statistically significant and the algorithm reports the similarity along with its statistical significance.

It should be kept in mind that BLAST's similarity score is only an approximate measure for the similarity of the two sequences, since gaps are ruled out. The algorithm may indeed miss complex similarities which include gaps. However, the statistical theory of alignments without gaps provides a reliable and efficient way of distinguishing true homologies from chance similarities, thus making this algorithm an important tool for molecular biologists.

Implementing BLAST

The algorithm locates "seeds" of similarity among the query sequence and the database sequence, and then extends them. The algorithm operates in the following main steps:

• Compile a list of words of length w (usually w = 3 or 4 for protein sequences) that score at least T with some substring of length w of the query sequence.

• Scan database sequences in search for hits with words in the list from the first step. Each hit is a seed. To allow a fast scanning of the database two approaches are used to store the list. The first is a hash table, and the second employs a deterministic finite automaton. Both methods scan each library sequence only once.

• Extend seeds: each seed is extended in both directions, without gaps, until the maximum possible score for the extension is reached. The resulting HSP is recorded. If the score of the extension falls below a certain threshold then the process stops. Therefore, there is a chance (usually small, depending on the threshold) for the algorithm to miss a possible good extension.

• Attempt to combine multiple MSP regions. For each consistent combination, calculate the probability of this combination using the Poisson or sum statistics [Altschul et al. 1994] and report the most significant one (lowest probability).

In the latest version of BLAST the criterion for extending seeds has been modified, to save processing time [Altschul et al. 1997]. The new version requires the existence of two non-overlapping seeds on the same diagonal (i.e. the seeds are at the same distance apart in both sequences), and within a certain distance (typically 40) of one another, before an extension is invoked. To achieve comparable sensitivity, the threshold T is lowered, yielding more hits than previously. However, only a small fraction of these hits are extended, and the overall speed increases.

Footnote 1: It is possible to find the MSP with the dynamic programming algorithm, by setting the gap penalty to $\infty$. However, BLAST finds it much faster because of its efficient implementation.

Changes in the threshold T permit a tradeoff between speed and sensitivity. A higher value of T yields greater speed, but also an increased probability of missing weak similarities. Though there are no analytic bounds on the time complexity of this algorithm, in practice the run time is proportional to the product of the length of the query sequence and the database size.
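The first three steps (word list, scan, ungapped extension) can be sketched in Python as follows. This is a toy illustration and not the BLAST implementation: the function names, the `drop` parameter (how far the running score may fall below the best score seen before the extension stops) and the `min_score` cutoff are assumptions of the sketch, the word list is built by brute-force enumeration, which is feasible only for small w, and a plain dictionary plays the role of the hash table.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Score = Callable[[str, str], int]


def neighborhood(query: str, w: int, T: int, alphabet: str, s: Score) -> Dict[str, List[int]]:
    """Words of length w scoring at least T with some query word, mapped to query positions."""
    table: Dict[str, List[int]] = {}
    for i in range(len(query) - w + 1):
        q_word = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            if sum(s(x, y) for x, y in zip(q_word, cand)) >= T:
                table.setdefault("".join(cand), []).append(i)
    return table


def extend(query: str, subject: str, i: int, j: int, w: int,
           s: Score, drop: int) -> Tuple[int, int, int, int]:
    """Extend a seed without gaps; returns (score, query start, subject start, length)."""
    best = sum(s(query[i + k], subject[j + k]) for k in range(w))
    run, right, k = best, 0, w
    while i + k < len(query) and j + k < len(subject):      # extend to the right
        run += s(query[i + k], subject[j + k])
        if run > best:
            best, right = run, k - w + 1
        elif best - run > drop:
            break
        k += 1
    run, left, k = best, 0, 1
    while i - k >= 0 and j - k >= 0:                        # extend to the left
        run += s(query[i - k], subject[j - k])
        if run > best:
            best, left = run, k
        elif best - run > drop:
            break
        k += 1
    return best, i - left, j - left, w + left + right


def find_hsps(query: str, subject: str, w: int, T: int, alphabet: str,
              s: Score, drop: int, min_score: int) -> List[Tuple[int, int, int, int]]:
    """Scan the subject once, extend every seed, and keep segment pairs scoring >= min_score."""
    table = neighborhood(query, w, T, alphabet, s)
    hsps = set()
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hsp = extend(query, subject, i, j, w, s, drop)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


def toy_score(x: str, y: str) -> int:
    """Hypothetical scoring function: +2 for an identity, -1 for a substitution."""
    return 2 if x == y else -1
```

Lowering `T` enlarges the word list and hence the number of seeds, which is the speed versus sensitivity tradeoff described above. In the toy example `find_hsps("CCADEC", "EEADEAA", 2, 4, "ACDE", toy_score, 2, 5)`, two overlapping seeds (AD and DE) extend to the same segment pair ADE/ADE with score 6.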

Current improvements of BLAST allow gapped alignments, by using dynamic programming to extend a central seed in both directions [Altschul et al. 1997]. This is complemented by PSI-BLAST, an iterative version of BLAST, with a position-specific score matrix (see section 2.1.4) that is generated from significant alignments found in round i and used in round i + 1. The latter may better detect weak similarities that are missed in database searches with a simple sequence query.

2.2.2 FASTA

FASTA is another heuristic that performs a fast sequence comparison. The algorithm starts by creating a hash table of all k-tuples in the query sequence (usually, k = 1 or 2 for protein sequences, where k = 1 gives higher sensitivity). For each such k-tuple there is an index vector with all positions of the k-tuple in the query sequence.

Then, when scanning a library sequence, a vector indexed with offsets (see below) is initialized with zeros. Each k-tuple of the library sequence is looked up in the hash table. If the k-tuple appears in the hash table then for each appearance of this k-tuple in the query sequence, the offset (the relative displacement of the k-tuple) is calculated (if the k-tuple appears in position i in the query sequence and position j in the library sequence then the offset is i - j), and the offset vector is incremented at the index corresponding to the offset value. After the library sequence has been scanned, the diagonal which corresponds to the offset with the maximal number of occurrences (highest density of identities) can serve as a seed for the dynamic programming algorithm. In practice the algorithm proceeds as follows.

At a second stage, the ten regions with the highest density of identities are rescanned. Common k-tuples which are on the same diagonal (same offset) and not very far apart are joined (the exact parameters are set heuristically) to form a region (a gapless local alignment, or HSP in BLAST terminology). The regions are scored to account for the matches as well as the mismatches, and the best region is reported (its score is termed "initial score" or "init1"). Then, the algorithm tries to join nearby high scoring regions, even if they are not on the same diagonal (the corresponding score being termed "initn score"). Finally, a bounded dynamic programming is run in a band around the best region, to obtain the "optimized score". If the sequences are related then the optimized score is usually much higher than the initial score.
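The first stage of FASTA (the k-tuple hash table and the offset vector) can be sketched in Python as follows. The sketch covers only the diagonal counting; the rescanning, joining and banded dynamic programming stages are not shown. The function names are hypothetical, and a `Counter` keyed by the offset i - j stands in for the offset vector.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def ktuple_index(query: str, k: int) -> Dict[str, List[int]]:
    """Hash table from each k-tuple of the query to all of its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def diagonal_counts(query: str, library_seq: str, k: int) -> Counter:
    """Number of shared k-tuples on each diagonal, indexed by the offset i - j."""
    index = ktuple_index(query, k)
    counts: Counter = Counter()
    for j in range(len(library_seq) - k + 1):
        # Every occurrence of this k-tuple in the query votes for one offset.
        for i in index.get(library_seq[j:j + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonal(query: str, library_seq: str, k: int) -> Optional[Tuple[int, int]]:
    """The (offset, count) pair with the highest density of identities, if any."""
    counts = diagonal_counts(query, library_seq, k)
    return counts.most_common(1)[0] if counts else None
```

For the query "ACDEFG" and the library sequence "WWACDEFG" with k = 2, the five shared 2-tuples all fall on the diagonal with offset -2, which would be the natural seed for the later stages.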


2.3 Probability and statistics of sequence alignments

In the evolution of protein sequences, not all regions mutate at the same rate. Regions which are essential for the structure and function of proteins are more conserved. Therefore, significant sequence similarity of two proteins may reflect a close biological function or a common evolutionary origin.

The algorithms that were described in the previous sections can be used to identify such similarities. However, on any two input protein sequences, even if totally unrelated, the algorithms always find some similarity. For unrelated sequences this similarity is essentially random. As the length of the sequences compared increases, this random similarity may increase as well. Therefore, in order to assess the significance of a similarity score it is important to know what score to expect simply by chance.

Naturally, we would like to identify those similarities which are genuine and biologically meaningful. In view of the last paragraph, the raw similarity score may not be appropriate for this purpose. However, when the sequence similarity is statistically significant we can deduce, with a high confidence level, that the sequences are related (see footnote 2). The reverse implication is not always true. We encounter many examples of low sequence similarity despite functional and structural similarity.

Though statistically significant similarity is neither necessary nor sufficient for a biological relationship, it may give us a good indication of such a relationship. When comparing a new sequence against the database, in search of close relatives, this is extremely important, as we are interested in reporting only significant hits, and sorting the results according to statistical significance seems reasonable.

To estimate the statistical significance of similarity scores, a statistical theory should be developed. A great effort was made in the last two decades to establish such a statistical theory. Currently, there is no complete theory, though some important results were obtained. These results have very practical implications and are very useful for estimating the statistical significance of similarity scores.

The statistical significance of similarity scores for "real" sequences is defined by the probability that the same score would have been obtained for random sequences. The statistical results concern the similarity scores of random sequences, when the similarity scores are defined by ungapped alignments. However, these results have created a framework for assessing the statistical significance of various similarity scores, including gapped sequence alignments, and recently, even structural alignments [Levitt & Gerstein 1998].

2.3.1 Basic random model

In the basic random model, the sequences are random sequences of characters where the characters are drawn independently and identically (i.i.d.) from a certain distribution over the alphabet $\mathcal{A}$.

Footnote 2: Two exceptions are segments with unusual amino acid composition, and similarity that is due to convergent evolution.

Each sequence is thus viewed as a sequence of i.i.d. random variables drawn from the distribution $P$ over the alphabet $\mathcal{A}$. In what follows, upper case letters ($A_i$) denote random variables and lower case letters ($a_i$, $a_i \in \mathcal{A}$) indicate a specific value of the random variable. Upper case bold letters denote sequences of random variables, and lower case bold letters denote sequences of amino acids.

For two random sequences $\mathbf{A}$ and $\mathbf{B}$, the scores $S(\mathbf{A}, \mathbf{B})$ (the global similarity score) and $H(\mathbf{A}, \mathbf{B})$ (the local similarity score) are random variables. The distributions of these scores for randomly drawn sequences differ, and the next two sections summarize the main properties known about these distributions.

2.3.2 Statistics of global alignment

Fixed alignment - global alignment without gaps

This is a very simplistic case, but the results are straightforward. Let $\mathbf{A} = A_1 A_2 \ldots A_n$ and $\mathbf{B} = B_1 B_2 \ldots B_n$ where $A_i, B_j$ are i.i.d. random variables as defined above. For the alignment with no gaps,

\[ \begin{array}{l} \mathbf{A} = A_1 A_2 \ldots A_n \\ \mathbf{B} = B_1 B_2 \ldots B_n \end{array} \]

the score is simply defined as $S = \sum_{i=1}^{n} s(A_i, B_i)$.

$S$ is the sum of i.i.d. random variables and therefore for large $n$ it is distributed approximately normally with expectation $E(S) = n \cdot E(s(A, B)) = n\mu$ and $Var(S) = n \cdot Var(s(A, B)) = n\sigma^2$, where $\mu$ and $\sigma$ are the mean and standard deviation of the scoring matrix $s(a, b)$.

The main characteristic of this distribution is the linear growth (or decline, depending on the mean of the scoring matrix) with the sequence length. Surprisingly, perhaps, this characteristic holds for the general case as well.
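The mean and variance of the fixed-alignment score follow directly from the letter distribution and the scoring matrix. The Python sketch below computes them for an arbitrary distribution; the names `score_moments` and `ungapped_score_moments`, the two-letter alphabet and the scoring function in the usage note are hypothetical and chosen only to keep the arithmetic transparent.

```python
from typing import Callable, Dict, Tuple


def score_moments(p: Dict[str, float],
                  s: Callable[[str, str], float]) -> Tuple[float, float]:
    """Mean and variance of s(A, B) when A and B are drawn independently from p."""
    mu = sum(p[x] * p[y] * s(x, y) for x in p for y in p)
    second = sum(p[x] * p[y] * s(x, y) ** 2 for x in p for y in p)
    return mu, second - mu * mu


def ungapped_score_moments(n: int, p: Dict[str, float],
                           s: Callable[[str, str], float]) -> Tuple[float, float]:
    """E(S) = n * mu and Var(S) = n * sigma^2 for the fixed alignment of length n."""
    mu, var = score_moments(p, s)
    return n * mu, n * var
```

For a uniform two-letter alphabet with score +1 for a match and -1 for a mismatch, the mean per position is 0 and the variance is 1, so for n = 100 the score is approximately normal with mean 0 and standard deviation 10.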

Unknown alignment

In the general case, gaps are allowed in the alignment, and the similarity score is defined as the maximum over all possible alignments,

\[ S = S(\mathbf{A}, \mathbf{B}) = \max_{alignments} \Bigl\{ \sum_{i,j} N_{ij} \cdot s(a_i, b_j) - \sum_{k} N_{k\text{-}gap} \cdot \alpha(k) \Bigr\} \qquad (2.1) \]

where $\alpha(k)$ (the penalty for a gap of length $k$) is a general non-negative sub-additive function, i.e. $\alpha(x + y) \le \alpha(x) + \alpha(y)$.

The normal distribution limit law no longer holds because of the optimization over all possible alignments. However, it is possible to partly characterize the (limit at large $n$) distribution of $S$ based on Kingman's theory and the Azuma-Hoeffding lemma [Waterman 1995]. Kingman's theory is applied to show that the global similarity score grows linearly with the sequence length.

Theorem 1: Let $\mathbf{A} = A_1 A_2 \ldots A_n$ and $\mathbf{B} = B_1 B_2 \ldots B_n$ where $A_i, B_j$ are i.i.d. random variables drawn from the distribution $P$ over the alphabet $\mathcal{A}$. Define $S = S(\mathbf{A}, \mathbf{B})$ as in equation 2.1, under the same restrictions on the function $\alpha(k)$. Then there is a constant $\rho \ge E(s(A, B))$ such that

\[ \lim_{n \to \infty} \frac{S}{n} = \rho \]

with probability 1 and in the mean.

The last result is obtained for an infinitely long sequence. However, since it holds almost surely, the average over a large ensemble of finite sequences satisfies

\[ \frac{E(S)}{n} \to \rho \]

Theorem 1 defines the expected global similarity score for random sequences. The statistical significance of a similarity score obtained for "real" sequences, which exceeds the expected score by a certain amount, is estimated by the probability that the similarity score of random sequences would exceed the expected mean by the same amount. The Azuma-Hoeffding lemma gives a bound on the probability that such a random variable exceeds its mean by a certain amount (it provides a concentration of measure result for a broad class of random variables which includes this case).

Theorem 2: Let $\mathbf{A} = A_1 A_2 \ldots A_n$ and $\mathbf{B} = B_1 B_2 \ldots B_n$ where $A_i, B_j$ are i.i.d. and define $S = S(\mathbf{A}, \mathbf{B})$ as in equation 2.1, then

\[ Prob(S - E(S) \ge \gamma \cdot n) \le e^{-\frac{\gamma^2 n}{2 c^2}} \]

where $c$ is a constant that depends only on the scoring matrix and the penalty for a gap.

Though the linear growth of the global similarity score is an important feature, both results are theoretical and have little practical use. In spite of much effort, $\rho$ has not yet been determined. Therefore, theorem 1 remains only a qualitative result. Moreover, the bound obtained by the Azuma-Hoeffding lemma is not very useful for a typical protein where $n$ is of the order of 300. For example, for a typical scoring matrix such as the BLOSUM 62 matrix (see section 2.4.2), and a gap opening penalty of 12, the constant $c$ in theorem 2 equals 30. If a global similarity score exceeds the expected mean by $3 \cdot n$ (which, for a global similarity score, is usually significant), then the corresponding bound would be $Prob(S - E(S) \ge 3 \cdot n) \le 0.223$, which is not very significant (suggesting, perhaps, that the bound is not tight). The variance of the global similarity score has not been determined either, and the best results give only an upper bound.
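The quoted value of the bound can be reproduced by substituting the numbers of the example into theorem 2. The worked arithmetic below assumes the exponent of the bound is $-\gamma^2 n / (2c^2)$, with $\gamma = 3$, $n = 300$ and $c = 30$:

```latex
\[
Prob\bigl(S - E(S) \ge \gamma n\bigr)
  \le \exp\!\left(-\frac{\gamma^2 n}{2 c^2}\right)
  = \exp\!\left(-\frac{3^2 \cdot 300}{2 \cdot 30^2}\right)
  = \exp(-1.5) \approx 0.223
\]
```

Because the exponent is linear in $n$, the same excess of $3 \cdot n$ over the mean would be bounded by about $e^{-15}$ for $n = 3000$, so the bound becomes informative only for sequences much longer than a typical protein.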

In practice, it is possible to empirically approximate the distribution by shuffling the sequences and comparing the shuffled sequences. By repeating this procedure many times it is possible to estimate the mean and the variance of the distribution, and a reasonable measure of statistical significance (e.g. by means of the z-score) can be obtained (see footnote 3).

3 2

Denote by S the global similarity score. Let  and  b e the mean and the variance of the distribution of scores.

S 

. This score measures how many units of standard Then, the z-score asso ciated with the score S is de ned as



deviation apart the score S is from the mean of the distribution. The larger it is, the more signi cant is the score S .
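The shuffling procedure can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: the scoring scheme (match +1, mismatch -1, linear gap penalty -2), the number of trials and all function names are hypothetical choices made to keep the example short.

```python
import random
import statistics

def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) similarity score with a linear gap penalty.
    The scoring values are assumptions chosen for illustration."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffle_z_score(a, b, trials=200, seed=0):
    """z-score of the observed score against scores of shuffled sequences."""
    rng = random.Random(seed)
    observed = global_score(a, b)
    scores = []
    for _ in range(trials):
        sa, sb = list(a), list(b)
        rng.shuffle(sa)  # shuffling preserves the amino acid composition
        rng.shuffle(sb)
        scores.append(global_score(sa, sb))
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return (observed - mu) / sigma

seq = "ACDEFGHIKLMNPQRSTVWY" * 2
z = shuffle_z_score(seq, seq, trials=100)
print(z)
```

Comparing a sequence with itself is the extreme case; the z-score is then far above the values obtained for shuffled pairs.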


2.3.3 Statistics of local alignment

Exact matching

As for global alignments, interesting results for local alignment are obtained already under a very simplistic model. Specifically, it is interesting to study the asymptotic behavior of the longest perfect match between two random sequences of length n, when the alignment is given and fixed. Let $\xi = \mathrm{Prob}(\mathrm{Match}) = \sum_{a \in \mathcal{A}} p_a^2$, where $p_a$ is the probability of the letter a. The problem can be rephrased in terms of the length $R_n$ of the longest run of heads in n coin tosses, when the probability for head is $\xi$.

According to Erdős and Rényi [Erdos and Renyi 1970], $R_n \to \log_{1/\xi} n$. The intuition is that the probability of a run of m heads is $\xi^m$, and for $m \ll n$ there are approximately n possible runs (one for each possible starting point). Therefore,

$$E(\text{number of runs of length } m) \simeq n \cdot \xi^m$$

If the longest run is unique then $R_n$ should satisfy

$$1 = n \cdot \xi^{R_n} \;\Rightarrow\; R_n = \log_{1/\xi} n$$

The proof is now concluded using the Borel-Cantelli lemma. It is given in detail in [Waterman 1995].
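The heuristic argument can be illustrated with a short simulation. The sketch below compares the longest run of heads in a simulated sequence of coin tosses with the predicted value, the logarithm of n to base 1/Prob(head). The match probability of 0.25 and the function names are assumptions made for the example.

```python
import math
import random

def longest_run(flips):
    """Length of the longest run of consecutive heads (truthy values)."""
    best = current = 0
    for heads in flips:
        current = current + 1 if heads else 0
        best = max(best, current)
    return best

def predicted_longest_run(n, p_match):
    """Asymptotic prediction: log of n to the base 1 / p_match."""
    return math.log(n) / math.log(1.0 / p_match)

rng = random.Random(0)
n, p_match = 100_000, 0.25  # p_match is an illustrative value
flips = [rng.random() < p_match for _ in range(n)]
print(longest_run(flips), round(predicted_longest_run(n, p_match), 2))
```

For a single finite sample the observed value only fluctuates around the prediction; the theorem is a statement about the limit.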

This result holds for exact matches which start at the same position in both sequences. Allowing shifts makes the problem more interesting to molecular biology. The length of the longest match in this case is actually the score of the best local alignment $S = H(A, B)$, given that the score for a match is 1, it is $-\infty$ for a mismatch, and $-\infty$ for opening a gap.

When shifts are allowed, there are $n^2$ possible starting positions $(i, j)$ for the match (the number of possible starting positions of a match defines the size of the search space). Therefore, following the intuition of the previous claim we expect that

$$S \to \log_{1/\xi} n^2 = 2 \cdot \log_{1/\xi} n$$

where $\xi = \mathrm{Prob}(\mathrm{Match})$ as before, and the logarithmic characteristic is preserved (for a proof see [Waterman 1995]).

The same result is obtained when postulating that the sequences are generated by first-order Markov chains, and even when the two sequences are drawn from different distributions (which makes more sense biologically). However, these results hold for perfect matches, while the original problem allows mismatches as well. Surprisingly, the result holds even when mismatches are allowed (see next section).

Matching with score - local alignment without gaps

So far, the score for a match was 1. This section proceeds to the case of a general scoring scheme $s(a, b)$ for $a, b \in \mathcal{A}$, with the constraints (i) $E(s(a, b)) < 0$ and (ii) $s^* = \max\{s(a, b)\} > 0$. The first requirement implies that the average score of a random match will be negative (otherwise, extending a match would tend to increase the score, and this contradicts the idea of local similarity). The second condition implies that a match with a positive score is possible (otherwise a match would always consist of a single pair of residues).

The following theorem concerns local matches (as defined in section 2.1.5) with a general scoring scheme but without gaps (i.e., the penalty for opening a gap is $\infty$). It characterizes the maximal score of local alignment $S = H(A, B)$ of two random sequences, and the distribution of amino acids in the maximal scoring segments.

Theorem 3: Let $A = A_1 A_2 \ldots A_n$ and $B = B_1 B_2 \ldots B_n$ where $A_i, B_j$ are i.i.d. and sampled from the same distribution P over an alphabet $\mathcal{A}$. Assume that $s(a, b)$ satisfies $E(s(A, B)) < 0$ and $s^* = \max\{s(a, b) : a, b \in \mathcal{A}\} > 0$. Let $\lambda > 0$ be the largest real root of the equation $E(e^{\lambda s(A, B)}) = 1$. Then

$$\lim_{n \to \infty} \frac{S}{\ln n^2 / \lambda} = 1 \qquad (2.2)$$

and the proportion of letter a aligned with the letter b in the best scoring match converges to

$$q_{a,b} = p_a p_b e^{\lambda s(a, b)}$$

The direct implication of this theorem is that for two random sequences of length n and m, the score of the best local alignment (the MSP score in BLAST jargon) is centered around $\ln(nm)/\lambda$. That is, the local similarity score grows logarithmically with the length of the sequences, and with the size of the search space $n \cdot m$. Practically, given the distribution P (for example the overall background distribution of amino acids in the database), $\lambda$ is obtained by solving the equation

$$\sum_{a,b} p_a p_b e^{\lambda s(a, b)} = 1 \qquad (2.3)$$

using a method such as Newton's method.
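Solving equation 2.3 numerically can be sketched as below. The example is a toy case and an assumption throughout: a uniform four-letter alphabet with score +1 for a match and -1 for a mismatch, for which the positive root is known in closed form (ln 3). Since the left-hand side is convex in the unknown and 0 is always a trivial root, the sketch first moves to the right of the positive root and then applies Newton's method from there.

```python
import math

def solve_lambda(p, s, tol=1e-12, max_iter=100):
    """Positive root of sum_{a,b} p_a p_b exp(lam * s(a,b)) = 1 (equation 2.3).
    Assumes E(s) < 0 and max s > 0, so that a positive root exists."""
    letters = list(p)

    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s[a][b])
                   for a in letters for b in letters) - 1.0

    def df(lam):
        return sum(p[a] * p[b] * s[a][b] * math.exp(lam * s[a][b])
                   for a in letters for b in letters)

    lam = 1.0
    while f(lam) <= 0.0:  # move to the right of the positive root
        lam *= 2.0
    for _ in range(max_iter):  # Newton iterations decrease toward the root
        step = f(lam) / df(lam)
        lam -= step
        if abs(step) < tol:
            break
    return lam

# Toy scoring scheme (an assumption, not a real substitution matrix).
letters = "ACGT"
p = {a: 0.25 for a in letters}
s = {a: {b: (1 if a == b else -1) for b in letters} for a in letters}
lam = solve_lambda(p, s)
print(round(lam, 4))  # close to ln 3 = 1.0986
```

For a real substitution matrix the same function would be called with the background frequencies and the matrix entries in place of the toy values.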

This result in itself is still not enough to obtain a measure of statistical significance for local similarity scores. This can be done only once a concentration of measure result is obtained or the distribution of similarity scores is defined. Indeed, one of the most important results in this field is the characterization of the distribution of local similarity scores without gaps. This distribution was shown to follow the extreme value distribution [Karlin & Altschul 1990, Dembo & Karlin 1991, Dembo et al. 1994b].

Just as the sum of many i.i.d. random variables is distributed normally, the maximum of many i.i.d. random variables is distributed as an extreme value distribution [Gumbel 1958]. This distribution is characterized by two parameters: the index value u and the decay constant $\lambda$ (for u = 0 and $\lambda$ = 1, the distribution is plotted in Fig. 2.2). The distribution is not symmetric. It is positive definite and unimodal with one peak at u. Practically, the score of the best local alignment (the MSP score) is the maximum of the scores of many independent alignments, which explains the observed distribution. This is summarized in the next theorem, which concerns the distribution of local similarity scores for random sequences.

Theorem 4: Let S be a random variable whose value is the local similarity score for two random sequences of length n and m, as defined above. S is distributed as an extreme value distribution and

$$\mathrm{Prob}(S \geq x) \simeq 1 - \exp(-e^{-\lambda(x - u)}) \qquad (2.4)$$

where $\lambda$ is the root of equation 2.3 and $u = \ln(Kmn)/\lambda$, where K is a constant that can be estimated from first-order statistics of the sequences (i.e., the background distribution) and the scoring matrix s (K is given by a geometrically convergent series which depends only on the $p_a$ and $s_{ab}$ [Karlin & Altschul 1990]).

Figure 2.2: Probability density function for the extreme value distribution with u = 0 and $\lambda$ = 1.

Theorem 4 holds, subject to the restriction that the amino acid compositions of the two sequences that are compared are not too dissimilar [Karlin & Altschul 1990]. Assuming that both sequences are drawn from the background distribution, the amino acid composition of both should resemble the background distribution. Without this restriction theorem 4 overestimates the probability of similarity scores, and indeed, this is observed in protein sequences with unusual compositions (see also chapter 8).

For a large x we can use the approximation $1 - \exp(-e^{-x}) \simeq e^{-x}$. Therefore, for a large x,

$$\mathrm{Prob}(S \geq x) \simeq e^{-\lambda(x - u)} = e^{-\lambda x} e^{\lambda u} = Kmn\,e^{-\lambda x} \qquad (2.5)$$

This result helps to calculate the probability that a given MSP score could have been obtained by chance. The score will be statistically significant at the 1% level if $S \geq x_0$, where $x_0$ is determined by the equation $Kmn\,e^{-\lambda x_0} = 0.01$. In general, a pairwise alignment with score S has a p-value of p where $p = Kmn\,e^{-\lambda S}$. I.e., there is a probability p that this score could have happened by chance.
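Equations 2.4 and 2.5 translate into a few lines of code. The sketch below computes the extreme value p-value, its tail approximation, and the threshold score for a chosen significance level. The parameter values (K = 0.1, decay constant 0.3, sequence lengths of 300) are hypothetical and serve only to make the example runnable; in practice they depend on the scoring matrix and the background distribution.

```python
import math

def evd_p_value(score, m, n, K, lam):
    """Prob(S >= score) according to equation 2.4, with u = ln(K m n) / lam."""
    u = math.log(K * m * n) / lam
    # -expm1(-y) evaluates 1 - exp(-y) accurately for small y.
    return -math.expm1(-math.exp(-lam * (score - u)))

def tail_approximation(score, m, n, K, lam):
    """Large-score approximation K m n exp(-lam * score) of equation 2.5."""
    return K * m * n * math.exp(-lam * score)

def threshold_score(alpha, m, n, K, lam):
    """Score x0 at which the tail approximation equals alpha."""
    return math.log(K * m * n / alpha) / lam

# Hypothetical parameters, for illustration only.
K, lam, m, n = 0.1, 0.3, 300, 300
x0 = threshold_score(0.01, m, n, K, lam)
print(x0, evd_p_value(x0, m, n, K, lam), tail_approximation(x0, m, n, K, lam))
```

At the 1% threshold the exact expression and the approximation already agree to within about half a percent of each other.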

The probability p, that a similarity score S could have been obtained simply by chance from the comparison of two random sequences, should be adjusted when multiple comparisons are performed. One example of this is when a sequence is compared with each of the sequences in a database of size D. Denote by p-match a match between two sequences that has a p-value $\leq p$ (i.e., its score $\geq S$). The number of p-matches (i.e., the number of "successes") in a database search is distributed as a Poisson distribution (the distribution is in fact binomial, but in practice the Poisson approximation with $\nu = Dp$ is applied for the Binomial(D, p) distribution, for D large and $\nu = Dp$ small). The probability of m successes is $Pr(m) \simeq e^{-\nu} \nu^m / m!$. Therefore, the probability P of obtaining by chance at least one p-match, when searching in a database containing D sequences, is

$$P = \mathrm{Prob}(m \geq 1) = 1 - \mathrm{Prob}(m = 0) = 1 - e^{-\nu} = 1 - e^{-Dp} \qquad (2.6)$$

and for P < 0.1 this is well approximated by Dp. Thus,

$$P \simeq 1 - e^{-Dp} \simeq Dp = KDmn\,e^{-\lambda S} \qquad (2.7)$$

This discussion assumes that all sequences in the database (library sequences) have the same probability of sharing a similar region with the query sequence. However, it is more appropriate to assume that all equal-length protein segments in the database have an equal a priori probability that they are related to the query sequence (since similarity usually follows a domain). Therefore, if the query sequence is of length n, and the (pairwise) alignment of interest involves a library segment of length m, and the database has a total of N amino acids, then D should be replaced with N/m. Hence, equation 2.7 is corrected to obtain

$$P \simeq KNn\,e^{-\lambda S} \qquad (2.8)$$

where the term Nn can be viewed as the "effective size" of the search space.

It is very common to use the expectation value (e-value) as a measure of statistical significance. It is well known that the parameter $\nu$ of the Poisson distribution is also the expectation value of this distribution (as well as of the original binomial distribution which is approximated by the Poisson distribution). Therefore,

$$E = E(\text{number of p-matches}) = \nu = Dp \qquad (2.9)$$

and as discussed above, D should be replaced with N/m. Hence

$$E = KNn\,e^{-\lambda S} \qquad (2.10)$$

This is the expected number of distinct matches (segment pairs) that would obtain a score $\geq S$ by chance in a database search, with a database of size N (amino acids) and composition P (the background distribution of amino acids). If E = 0.01, then the expected number of random hits with a score $\geq S$ is 0.01. In other words, we may expect a random hit once in 100 independent searches. If E = 10, then we should expect 10 hits with a score $\geq S$ by chance in a single database search. This means that such a hit is not significant. Note that $E \simeq P$ for P < 0.1 (equations 2.8 and 2.10), and consequently, in many publications in this field there is no clear distinction between E and P. However, for P > 0.1 they differ, where $P = 1 - \exp(-E)$ (equations 2.6 and 2.9).
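The relation between E and P is easy to explore numerically. The following sketch implements equation 2.10 and the conversion P = 1 - exp(-E); the function names are illustrative, and any values of K and the decay constant passed to `e_value` would be assumptions that depend on the scoring system.

```python
import math

def e_value(score, N, n, K, lam):
    """Expected number of chance hits with at least this score (equation 2.10)."""
    return K * N * n * math.exp(-lam * score)

def p_from_e(E):
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-E)

for E in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(E, p_from_e(E))
```

For small E the two measures are nearly identical, while for E = 10 the probability P is already practically 1.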

Finally, by setting a value for E and solving the equation above for S, it is possible to define a threshold score, above which hits are reported. This is the score above which the number of hits that are expected to occur at random is < E. Therefore, we can deduce that a match with this score or above reflects a true biological relationship, but we should expect up to E errors per search. The specific value of E affects both the sensitivity of a search (the number of true relationships detected) and its selectivity (the number of errors). A lower value of E would decrease the error rate. However, it would decrease the sensitivity as well. A reasonable choice for E is between 0.1 and 0.001 (see chapter 8 for more details).

In general, the same two sequences may have more than one high-scoring pair of segments (HSP). This may happen whenever insertions/deletions should be introduced to align the sequences properly. The combined assessment of scores from several ungapped alignments can be evaluated by applying Poisson statistics with the parameter $p = e^{-\lambda(x - u)}$. Given the k highest scoring HSPs, among which the lowest HSP score is x, the Poisson distribution can be used to calculate the probability that at least k segment pairs, all with a score of at least x, would appear by chance in one pairwise comparison. This approach has the disadvantage of being dependent on the lowest score among the k highest scores. Another alternative is to calculate the sum $S_k$ of the highest k scores. The distribution of such sums has been derived [Karlin & Altschul 1993] and the probability of a given sum is calculated (numerically) by a double integral on the tail of the distribution. In either case, the HSPs should first satisfy a consistency test before the joint assessment is made.

Local alignment with gaps

Though local alignments without gaps may detect most similarities between related proteins, and give a good estimation of the similarity of the two sequences, it is clear that gaps in local alignments are crucial in order to obtain the correct alignment, and for a more accurate measure of similarity. However, no precise model has been proposed yet to explain gaps in alignments. Moreover, introducing gaps in alignments greatly complicates their mathematical tractability. Rigorous results have been obtained only for local alignments without gaps.

Recent studies suggest that the score of local gapped alignments can be characterized in the same manner as the score of local ungapped alignments. According to theorem 3 the local ungapped similarity score grows logarithmically with the sequence length and the size of the search space. Arratia & Waterman [1994] have shown that for a range of substitution matrices and gap penalties, local gapped similarity scores have the same asymptotic characteristic. Furthermore, empirical studies [Smith et al. 1985, Waterman & Vingron 1994] strongly suggest that local gapped similarity scores are distributed according to the extreme value distribution, though some correction factors may apply [Altschul & Gish 1996].

Based on empirical observations, Pearson [1998] has derived statistical estimates for local alignment with gaps, using the extreme value distribution for scores obtained from a database search. A database search provides tens of thousands of scores from sequences which are unrelated to the query sequence, and therefore are effectively random. As discussed above, these scores are thus expected to follow the extreme value distribution. This is true as long as the gap penalties are not too low. Otherwise the alignments shift from local to global and the extreme value distribution no longer applies.

Since the logarithmic growth in the sequence length holds in this case, scores are corrected first for the expected effect of sequence length. The correction is done by calculating the regression line $S = a + b \cdot \ln n$ for the scores obtained in a database search, after removing very high scoring sequences (probably related sequences). The process is repeated as many as five times. The regression line and the average variance of the normalized scores are used to define the z-score:

$$z\text{-score} = \frac{S - (a + b \cdot \ln n)}{\sqrt{\mathrm{var}}}$$

and the distribution of z-scores is approximated by the extreme value distribution

$$p = \mathrm{Prob}(z\text{-score} > x) = 1 - \exp(-e^{-c_1 x - c_2})$$

where $c_1$ and $c_2$ are constants, and the expectation value is defined as before by $E(z\text{-score} > x) = N \cdot p$, where N is the number of sequences in the database (the number of tests).

This empirical approach has the advantage of internal calibration of the accuracy of the estimates, and has proved to be very accurate in estimating the statistical significance of gapped similarity scores [Pearson 1998] (see also chapter 8 and [Brenner et al. 1998]).
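The length correction described in this subsection can be sketched as an iterated least-squares fit. The code below is an illustration under several assumptions that are not taken from [Pearson 1998]: the cutoff used to discard high-scoring (probably related) sequences, the use of the plain residual variance, the division by the standard deviation, and the synthetic data are all hypothetical choices.

```python
import math

def fit_regression(lengths, scores):
    """Least-squares fit of S = a + b * ln(n); returns a, b and residual variance."""
    xs = [math.log(n) for n in lengths]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    var = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores)) / len(xs)
    return a, b, var

def length_corrected_z_scores(lengths, scores, rounds=5, cutoff=5.0):
    """Fit, drop sequences whose z-score exceeds the cutoff, and refit
    (at most `rounds` times). The cutoff value is an assumption."""
    keep = list(range(len(scores)))
    z = []
    for _ in range(rounds):
        a, b, var = fit_regression([lengths[i] for i in keep],
                                   [scores[i] for i in keep])
        sd = math.sqrt(var)
        z = [(s - (a + b * math.log(n))) / sd for n, s in zip(lengths, scores)]
        new_keep = [i for i in range(len(scores)) if z[i] <= cutoff]
        if new_keep == keep:
            break
        keep = new_keep
    return z

# Synthetic data: 50 unrelated sequences plus one high-scoring outlier.
lengths = [100 + 10 * i for i in range(50)] + [300]
scores = [10 + 5 * math.log(100 + 10 * i) + (1 if i % 2 == 0 else -1)
          for i in range(50)] + [100.0]
z = length_corrected_z_scores(lengths, scores)
print(round(z[-1], 1))
```

On the synthetic data the outlier is removed after the first fit, so the second fit reflects only the background scores and the outlier receives a very large z-score.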

2.4 Scoring matrices and gap penalties

Protein sequences are compared using scoring matrices that are not just +1 for a match and 0 for a mismatch. Since the late 60s, several different approaches have been taken to derive reliable and effective scoring matrices. I'll briefly discuss some of this work in this section. For more details and a comparison of different scoring methods see [Feng et al. 1985, Johnson & Overington 1993].

The genetic code matrix [Fitch & Margoliash 1967] measures amino acid similarity by the number of common nucleotide bases in their codons. This quantity is maximized by considering the closest matching representative codons. Identical amino acids share 3 bases, while non-identical amino acids share 2 or fewer. Despite this appealing rationale, which goes back to the very basic evolutionary events at the DNA level, most of the genetic pressure is on the protein sequence. Although there is some correlation between the codons of different amino acids and their biochemical properties (see footnote 4), this correlation is not strong enough to detect weak homologies.

Some matrices are based directly on similarities between physico-chemical properties of amino acids [Grantham 1974, Miyata et al. 1979]. However, these matrices do not perform very well. The reason is that there is no single property of amino acids which can account for preserving the structure and function of a protein.

The most effective matrices are those that are based on actual frequencies of mutations that are observed in closely related proteins. Exchanges occur more frequently among amino acids that share certain properties. Indeed, these matrices reflect the biochemical properties of the amino acids, which influence the probability of mutual substitution, and amino acids with similar properties have a high pairwise score. Matrices which are based on sequence alignments include the family of PAM matrices [Dayhoff et al. 1978] (and their improvement by [Jones et al. 1992]), the BLOSUM matrices [Henikoff & Henikoff 1992], the Gonnet matrix [Gonnet et al. 1992] and more (e.g. [McLachlan 1971]).

Footnote 4: [Gonnet et al. 1992] have shown that for low divergence, the structure of the code influences the distribution of accepted point mutations.

Several matrices were extracted from secondary structure propensities of the amino acids [Levin et al. 1986, Mohana Rao 1987]. Other matrices, which proved to be very effective for protein sequence comparison, are those that are based on structural principles [Johnson & Overington 1993, Risler et al. 1988]. These matrices reflect the statistics of pairwise substitutions that are observed at structurally equivalent positions in aligned structures of proteins from the same family. These matrices may increase the accuracy of sequence alignment and perform well in detecting homologous proteins. They might be a good choice in aligning relatively mutable regions (sequence-derived matrices are based on conserved regions, and therefore cannot accurately model more mutable regions). However, in such regions gaps are much more frequent and it is necessary to include gaps in the alignments. No theoretical model has been established for gaps (see 2.4.4), and when modeled improperly, gaps can significantly reduce the performance and the alignment accuracy. Moreover, it is not clear how to select the parameters for gap penalties relative to the substitution matrix.

The two most extensively used families of scoring matrices are the PAM matrices and the

BLOSUM matrices. A detailed description of these matrices is given in the next two sections.

2.4.1 The PAM family of scoring matrices

PAM matrices were proposed by Dayhoff et al. in 1978 based on observations of hundreds of alignments of closely related proteins. The frequencies of substitution of each pair of amino acids were extracted from alignments of proteins of small evolutionary distance, below 1% divergence, i.e. at most one mutation per 100 amino acids, on average. These frequencies, normalized to account for the frequencies of single amino acids, resulted in the PAM-1 matrix. The PAM-1 matrix reflects an amount of evolutionary change that yields on average one mutation per 100 amino acids. Accordingly, it is suitable for comparison of proteins which have diverged by 1% or less. The PAM-1 matrix is then extrapolated to yield the family of PAM-k matrices. Each PAM-k matrix is obtained from PAM-1 by k consecutive multiplications, and is suitable for comparison of sequences which have diverged k%, or are k evolutionary units apart. For example, PAM-250 = $(\text{PAM-1})^{250}$ reflects the frequencies of mutations for proteins which have diverged 250% (250 mutations per 100 amino acids; see footnote 5). These matrices were later refined by [Jones et al. 1992] based on a much larger data set. The significant differences were detected for substitutions that were hardly observed in the original data set of [Dayhoff et al. 1978].

The acronym PAM stands for Percent of Accepted Mutations (and hence the distance is in percentages) or for Point Accepted Mutations (and hence the distance is in number of mutations per 100 amino acids).

Footnote 5: Though the definition of PAM-250 seems odd, it still makes sense, as is subsequently explained.


Computing PAM matrices (following the scheme described in [Setubal & Meidanis 1996])

Each evolutionary distance defines a probability transition matrix M. The scoring matrix S is then obtained from M. The initial matrix, PAM-1, is derived from:
(1) A list of accepted mutations.
(2) The probability of occurrence $p_a$ of each amino acid ($\sum_a p_a = 1$).

The list of accepted mutations is created by observing mutations that occur in alignments of closely related homologous proteins. Such pairs of proteins have diverged through only a small number of mutations, so that the correct alignment is obvious. The term "accepted" is used to emphasize that these mutations didn't destroy the protein's functionality, and the major properties of the protein were preserved. Selecting closely related proteins also minimizes the probability of mediated mutations ($a \to b \to c$), which are inappropriate for the computation of PAM-1.

The probability for each amino acid can be estimated from the relative frequencies of amino acids in a large, heterogeneous set of proteins.

The frequencies $f_{ab}$ of mutations ($a \leftrightarrow b$) are calculated from the list of accepted mutations. The mutations are undirected events, that is, given a pair of aligned amino acids, we do not know which amino acid mutated into the other. Therefore, $f_{ab} = f_{ba}$. Let $f_a = \sum_{b \neq a} f_{ab}$, i.e. the total number of mutations in which a was involved. Let $f = \sum_a f_a$ be the total number of amino acids involved in mutations (twice the number of mutations).

In the PAM-1 matrix the element $M_{ab}$ denotes the probability that amino acid a mutates into the amino acid b. The element $M_{aa}$ denotes the probability that the amino acid remains unchanged during the corresponding evolutionary interval, and it is derived from the relative mutability $m_a$ of the amino acid a (the tendency/probability of amino acid a to change). It is defined as

$$m_a = \frac{f_a}{100 \cdot f \cdot p_a} \qquad (2.11)$$

Intuitively, $f_a / f$ gives the relative frequency of the mutations in which amino acid a is involved. The division by $p_a$ normalizes this relative frequency by the frequency of occurrence of the amino acid a, to avoid bias due to different frequencies of the amino acids. The division by 100 normalizes this quantity to correspond to the specific evolutionary time interval. This way the total amount of possible change sums to 1 mutation per 100 amino acids, as will be shown soon (in terms of probability, the probability of a mutation is 0.01).

The probability that amino acid a remains unchanged is the complementary probability $M_{aa} = 1 - m_a$. The probability that a mutates into b is given by

$$M_{ab} = P(a \to b) = P(a \to b \mid a \text{ changed}) \cdot P(a \text{ changed}) = \frac{f_{ab}}{f_a} \cdot m_a$$

These calculations are based on a simple Markov-type model of protein evolution, according to which the probability that an amino acid changes does not depend on the past, nor does it depend on the other amino acids in the sequence. These two assumptions are clearly oversimplified.

The matrix M has the following properties: the probabilities that amino acid a mutates into any amino acid b, including itself (the probability that it remains unchanged), sum to 1, as required:

$$\sum_b M_{ab} = \sum_{b \neq a} \frac{f_{ab}}{f_a} \cdot m_a + M_{aa} = \frac{f_a}{f_a} m_a + M_{aa} = m_a + 1 - m_a = 1$$

Also, the probability that the sequence remains unchanged is

$$\sum_a p_a M_{aa} = \sum_a p_a (1 - m_a) = \sum_a p_a - \sum_a p_a \frac{f_a}{100 \cdot f \cdot p_a} = 1 - \frac{1}{100} = 0.99 \qquad (2.12)$$

Therefore, the expected rate of mutations is 1 per 100 amino acids, and the PAM-1 matrix corresponds to one unit of evolution, as required (this explains the normalization in equation 2.11). It should be noted that one unit of evolution does not necessarily equal one unit of time, as different proteins from various protein families have evolved at different rates.
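The construction of PAM-1 can be followed on a toy example. The sketch below uses a hypothetical three-letter alphabet with made-up mutation counts and background frequencies (none of these numbers come from real data) and checks the two properties derived above: every row of M sums to 1, and the probability that a position remains unchanged is 0.99.

```python
def pam1(counts, p):
    """Build the PAM-1 transition matrix from symmetric mutation counts
    counts[a][b] = f_ab and background frequencies p[a]."""
    letters = list(p)
    f_a = {a: sum(counts[a][b] for b in letters if b != a) for a in letters}
    f = sum(f_a.values())
    # Relative mutability, equation 2.11.
    m = {a: f_a[a] / (100.0 * f * p[a]) for a in letters}
    M = {a: {} for a in letters}
    for a in letters:
        for b in letters:
            if a == b:
                M[a][b] = 1.0 - m[a]
            else:
                M[a][b] = counts[a][b] / f_a[a] * m[a]
    return M

# Toy data (assumptions for illustration only).
p = {"A": 0.5, "B": 0.3, "C": 0.2}
counts = {
    "A": {"A": 0, "B": 30, "C": 10},
    "B": {"A": 30, "B": 0, "C": 20},
    "C": {"A": 10, "B": 20, "C": 0},
}
M = pam1(counts, p)
row_sums = {a: sum(M[a].values()) for a in p}
unchanged = sum(p[a] * M[a][a] for a in p)
print(row_sums, unchanged)
```

The second check reproduces equation 2.12 regardless of the particular counts, since the normalization in equation 2.11 fixes the overall mutation rate.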


Given the basic matrix M, it is now possible to get the transition probabilities for larger evolutionary distances. For example, the probability that a mutates to b in two PAM units of evolution is given by $\sum_c M_{ac} M_{cb}$. Therefore, the matrix that corresponds to two units of evolution is given by PAM-2 = $M \cdot M$, and for k units of evolution PAM-k = $M^k$. For large k (on the order of a thousand) $M^k$ converges to a matrix with identical rows, where each row equals the distribution of amino acids. Therefore, for large evolutionary distances, no matter what amino acid is chosen, the probability that it mutates into b is simply $p_b$.

The scoring matrix S is obtained from the transition probability matrix M. The entries are defined by the likelihood ratio of two events: a pair is a mutation versus a random occurrence. The probability that a changes into b is $M_{ab}$, but there is a probability $p_b$ that we encounter b in the second sequence just by chance. The likelihood ratio is $M_{ab} / p_b$. The score $S_{ab}$ is defined as 10 times the logarithm of this ratio. The logarithm is taken because the alignment score is the sum of pairwise scores, and this way it corresponds to the logarithm of the product of the ratios (log-likelihood). The factor 10 is used because the similarity scores are rounded to integers to speed up the calculations, and this reduces the numerical error. For a distance of k evolutionary units

$$S^k_{ab} = 10 \cdot \log_{10} \frac{(M^k)_{ab}}{p_b}$$

and the matrix is symmetric, $S^k_{ab} = S^k_{ba}$ (recall that the accepted mutations used for defining the scores are undirected events).
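The extrapolation to PAM-k and the resulting log-odds scores can be illustrated with a two-letter toy model. The transition matrix and background frequencies below are hypothetical values chosen so that the matrix has the PAM-1 properties (rows sum to 1, and the unchanged probability is 0.99). The scores are left unrounded so that the symmetry can be checked exactly; in practice they would be rounded to integers.

```python
import math

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, k):
    """M raised to the k-th power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result

def pam_log_odds(M, p, k):
    """Unrounded scores 10 * log10((M^k)_ab / p_b)."""
    Mk = mat_pow(M, k)
    n = len(p)
    return [[10.0 * math.log10(Mk[a][b] / p[b]) for b in range(n)]
            for a in range(n)]

# Toy two-letter model (assumed values).
p = [0.6, 0.4]
M = [[1 - 1 / 120, 1 / 120],
     [0.0125, 0.9875]]
S250 = pam_log_odds(M, p, 250)
limit = mat_pow(M, 5000)
print(S250, limit)
```

For very large k both rows of the power approach the background distribution, so all log-odds scores tend to zero.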

The PAM-250 matrix

The PAM-250 matrix is one of the most extensively used matrices in this field. This matrix corresponds to a divergence of 250 mutations per 100 amino acids. Naturally one may ask whether it makes sense to compare sequences which have diverged this much. Surprising as it may seem, when calculating the probability that a sequence remains unchanged after 250 PAMs (applying equation 2.12 for PAM-250), the outcome is that such sequences are expected to share about 20% of their amino acids. For reference, note that the expected percentage of identity in a random match is $100 \cdot \sum_a p_a^2$. Therefore, for a typical distribution of amino acids (in a large ensemble of protein sequences), we should expect less than 6% identities.

2.4.2 The BLOSUM family of scoring matrices

Unlike PAM matrices, which are extrapolated from a single matrix PAM-1, the BLOSUM series of matrices was constructed by direct observation of sequence alignments of related proteins, at different levels of sequence divergence. The matrices are based on "blocks", a collection of multiple alignments of similar segments without gaps [Henikoff & Henikoff 1991], each block representing a conserved region of a protein family. These blocks provide a list of (accepted) substitutions, and a log-odds scoring matrix can be defined:

$$s_{ab} = \log \frac{q_{ab}}{e_{ab}}$$

where $q_{ab}$ is the observed relative frequency of the pair $a \leftrightarrow b$, given by $q_{ab} = f_{ab} / \sum_a \sum_{b \leq a} f_{ab}$ (each unordered pair is counted once), and $e_{ab}$ is the expected probability of the pair, estimated from the population of all observed pairs as follows: the probability of the occurrence of amino acid a in a pair is $p_a = q_{aa} + \sum_{b \neq a} q_{ab}/2$, hence the expected probability of the pair $a \leftrightarrow b$ is $p_a p_a$ for $a = b$ and $p_a p_b + p_b p_a = 2 p_a p_b$ for $a \neq b$.

To reduce the bias in the amino acid pair frequencies caused by multiple counts from closely related sequences, segments in a block with at least x% identity are clustered and pairs are counted between clusters, i.e., pairs are counted only between segments less than x% identical. When counting pair frequencies between clusters, the contributions of all segments within a cluster are averaged, so that each cluster is weighted as a single sequence. Varying the percentage of identity x within clusters results in a family of matrices BLOSUM-x, where x ranges from 30 to 100. For example, BLOSUM-62 is based on pairs that were counted only between segments less than 62% identical.
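The log-odds computation for BLOSUM-type matrices can be sketched on a toy example. The pair counts below are invented for a two-letter alphabet, pairs are treated as unordered (each pair counted once), and base 2 is assumed for the logarithm; clustering of segments is not modeled here.

```python
import math

def blosum_log_odds(pair_counts, letters):
    """Log-odds scores from counts of unordered aligned pairs.
    pair_counts maps (a, b) to the number of observed a<->b pairs."""
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    def q_of(a, b):
        return q.get((a, b), q.get((b, a), 0.0))

    # Probability of occurrence of each letter in a pair.
    p = {a: q_of(a, a) + sum(q_of(a, b) for b in letters if b != a) / 2.0
         for a in letters}
    scores = {}
    for i, a in enumerate(letters):
        for b in letters[i:]:
            expected = p[a] * p[a] if a == b else 2.0 * p[a] * p[b]
            scores[(a, b)] = math.log2(q_of(a, b) / expected)
    return p, scores

# Toy counts (assumptions for illustration only).
letters = ["A", "B"]
pair_counts = {("A", "A"): 50, ("A", "B"): 20, ("B", "B"): 30}
p, scores = blosum_log_odds(pair_counts, letters)
print(p, scores)
```

Pairs that are observed more often than expected under independence receive positive scores, and pairs observed less often receive negative scores. A pair with a zero count would need special handling, which this sketch omits.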

2.4.3 Information content of scoring matrix

Theorem 3 has a direct bearing on the question of how to choose the appropriate substitution matrix. It states that the frequency of a letter a aligned with the letter b in the best scoring match of two random sequences converges to

$$q_{ab} = p_a p_b \exp(\lambda s(a, b)) \qquad (2.13)$$

as the length of the compared sequences grows without bound. These frequencies are called target frequencies. Hence, any substitution matrix has an implicit target distribution for aligned pairs of amino acids, which can be easily calculated from the scores s(a, b). According to equation 2.3 these frequencies sum to 1. The implicit target frequencies of a matrix characterize the best scoring alignments, i.e. the alignments this matrix is optimized to find. In other words, only if the frequencies of aligned pairs in a match resemble the target frequencies will the corresponding match have a high score. Therefore, it is claimed [Karlin & Altschul 1990, Altschul 1991] that a matrix is optimal for distinguishing true distant homologies from chance similarities if the matrix's target frequencies correspond to the real frequencies of paired amino acids in the alignment of distantly related proteins.

Equation 2.13 can be restated as:

$$s_{ab} = \frac{1}{\lambda} \ln \left( \frac{q_{ab}}{p_a p_b} \right) \qquad (2.14)$$

I.e. the score for an amino acid pair can be written as the logarithm (to some base) of the pair's target frequency divided by the product of their probabilities of occurrence. This denominator is the probability of the occurrence of this pair under independent selection. This ratio thus compares the probability of an event under two alternative hypotheses, and is termed a likelihood or odds ratio. Therefore, each scoring matrix is implicitly a "log-odds" matrix, even if the underlying model is not based on observed substitutions. The PAM and the BLOSUM matrices are explicitly of this form.

By equation 2.3 it follows that multiplying a substitution matrix by a constant factor $\alpha$ is equivalent to dividing $\lambda$ by $\alpha$, but does not alter the implicit target frequencies, nor the implicit form of log-odds matrix (see footnote 6). Such scaling merely corresponds to a different base for the logarithm in equation 2.14. If $\lambda$ is chosen to be $\lambda = \ln 2$, the base for the logarithm is 2, and scores can be viewed as bits of information.

How many bits of information are needed to deduce a reliable relationship? If the expected number of MSPs with a score of at least S (equation 2.10) is set to E and the equation is solved for S, then

$$ S = \log_2 \frac{K}{E} + \log_2 N n $$

Recall that this is the score above which the number of hits that are expected to occur at random is < E. If a very low value is set for E, then the match is very significant and probably biologically meaningful. Therefore, this score, expressed in bits, can be viewed as the (minimal) number of bits needed to distinguish an MSP from chance (with error rate < E). For a typical substitution matrix K is of the order of 0.1, and for an alignment to be considered significant, E should be 0.05 or less [Altschul 1991, Altschul & Gish 1996]. Therefore the dominant term is $\log_2 N n$. In other words, to distinguish an MSP from chance, the number of bits needed is roughly the number of bits needed to specify where the MSP starts among N n possible positions [Altschul 1991]. For example, for an average protein length of 350, and the SWISSPROT database with over 20,000,000 amino acids, at least 33 bits are needed.
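
The arithmetic behind this example can be checked with a few lines of code. This is only a sketch: the function name `bits_needed` is hypothetical, and the values K = 0.1 and E = 0.05 are the typical values quoted in the text rather than parameters of a specific matrix.

```python
import math

def bits_needed(K, E, N, n):
    """Score S (in bits) solving E = K * N * n * 2**(-S)."""
    return math.log2(K / E) + math.log2(N * n)

# Query of length 350 against a database of 20,000,000 amino acids.
bits = bits_needed(K=0.1, E=0.05, N=20_000_000, n=350)
print(f"{bits:.1f} bits")
```

The first term contributes exactly one bit here, so the result (between 33 and 34 bits) is indeed dominated by the search-space term.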

With this interpretation, it is possible to get a rough measure of which matrix is appropriate for a search. Matrices can be evaluated by their information content, i.e., the average information per position,

$$ H = \sum_{a,b} q_{ab} s_{ab} = \sum_{a,b} q_{ab} \log_2 \frac{q_{ab}}{p_a p_b} $$

which is the relative entropy of the target and background distributions. The higher the value, the better the distributions are distinguished, and the shorter is the length of an alignment with the target distribution that can be distinguished from chance (i.e., the minimum significant length is shorter).
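
A minimal sketch of this quantity, assuming the target frequencies are given as a dictionary keyed by residue pairs (the function name and the toy two-letter distributions are illustrative assumptions, not real data):

```python
import math

def relative_entropy_bits(q, p):
    """H = sum over pairs of q_ab * log2(q_ab / (p_a * p_b)), in bits per position."""
    return sum(
        q_ab * math.log2(q_ab / (p[a] * p[b]))
        for (a, b), q_ab in q.items()
        if q_ab > 0  # terms with q_ab = 0 contribute nothing
    )

p = {"A": 0.5, "B": 0.5}
q_target = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
q_chance = {pair: p[pair[0]] * p[pair[1]] for pair in q_target}

H_target = relative_entropy_bits(q_target, p)  # positive: distributions differ
H_chance = relative_entropy_bits(q_chance, p)  # zero: target equals background
```

When the target distribution coincides with the background distribution, every position carries zero information and no alignment length suffices to distinguish a match from chance.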

Footnote 6: Theorem 3 and its implications for scoring matrices hold for local alignments. For global alignments, multiplying all scores by a constant has no effect on the relative scores of different alignments (as for local alignments). However, the same is true when adding a constant a to the score of each aligned pair and a/2 to a gap. Such a transformation entails that no unique log-odds interpretation of global substitution matrices is possible, and probably no theorem about target frequencies can be proved [Altschul et al. 1990].

For PAM matrices the information content decreases as the PAM distance increases. For example, PAM-120 has an information content of 0.98 bits per position. Assuming that at least 33 bits are needed to distinguish a true relationship from chance similarity (see above), the minimum significant length is 34. PAM-250 has an information content of 0.36 bits per position and the minimum significant length is 92. This is much longer than many regions of possible similarity such as domains or motifs. Therefore, short motifs can be detected by PAM matrices only if they have diverged a small PAM distance. For BLOSUM matrices the information content is higher when the index of the matrix is higher (e.g., BLOSUM-100 has an information content of 1.45 bits per position, while BLOSUM-45 has an information content of 0.38 bits per position). It is important to note that higher information content does not signify a better performance in terms of detecting distant homologies. It is the target distribution of a matrix that determines if the matrix is optimal for a specific search. Therefore, a matrix like the BLOSUM-100, which reflects substitutions between closely related proteins, is not appropriate for the purpose of detecting distant homologies, despite its high information content. A commonly used matrix is the BLOSUM-62, which has an information content of 0.7 bits per position (the same as PAM-160). This matrix is appropriate for comparison of moderately diverged sequences, and it is considered to be one of the best matrices for database searches due to its overall performance, which is superior to all PAM matrices (see next section).
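
The minimum significant lengths quoted above for PAM-120 and PAM-250 follow from dividing the required number of bits by the information content per position and rounding up. A short sketch (the function name is hypothetical; the 33-bit requirement and the per-position values are those given in the text):

```python
import math

def min_significant_length(bits_required, bits_per_position):
    """Smallest number of aligned positions that accumulates the required bits."""
    return math.ceil(bits_required / bits_per_position)

pam120_length = min_significant_length(33, 0.98)
pam250_length = min_significant_length(33, 0.36)
```

This reproduces the lengths 34 and 92 given in the paragraph above.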

Choosing the scoring matrix

When comparing two sequences, the most effective matrix to use is the one which corresponds to the evolutionary distance between them (see previous section). However, we usually do not know this distance. Therefore, it is recommended to use several scoring matrices which cover a range of evolutionary distances, for example PAM-40, PAM-120 and PAM-250. In general, low PAM matrices are well suited to finding short but strong similarities, while high PAM matrices are best for finding long regions of weak similarity.

Exhaustive evaluations have been carried out to compare the performance of different scoring matrices [Henikoff & Henikoff 1993, Pearson 1995]. The evaluation procedures are performed as follows: a few hundred protein families are used as a benchmark, and the quality of a matrix is estimated by the portion of these families it is able to detect in a database search. Specifically, a query sequence is chosen from each family, and a database search is performed, each time with a different scoring matrix. All family members with a score above a certain threshold are considered to have been detected. The threshold can be chosen, for example, as the score at 1% error rate, or the score above which the number of related sequences equals the number of unrelated sequences [Pearson 1995] (see also chapter 6). The matrix which detects the maximum number of members from the family is considered optimal for this family. The best matrix may vary among different families; therefore the quality is averaged over all families in the reference set.

These studies show that log-odds matrices derived directly from alignments of highly conserved regions of proteins (such as BLOSUM matrices or the Overington matrix, which is based on structural alignment [Johnson & Overington 1993]) outperform extrapolated log-odds matrices based on an evolutionary model, such as PAM matrices. Moreover, the accuracy of alignments based on extrapolated matrices decreases as the evolutionary distance increases. This suggests that extrapolation cannot accurately model distant relationships, and that the PAM evolutionary model is inadequate. BLOSUM matrices were shown to be more effective in detecting homologous proteins. Specifically, BLOSUM-62 and BLOSUM-50 gave superior performance in detecting weak homologies. These matrices offer good overall performance in searching the databases. The best hybrid of matrices for searching in different evolutionary ranges is either BLOSUM 45/62/100 or BLOSUM 45/100 plus the Overington matrix.


2.4.4 Gap penalties

There is no mathematical model to explain the evolution of gaps. Practical considerations (the need for a simple mathematical model, time complexity) have led to the broad use of linear gap functions. However, there is evidence that the real behavior is quite different. By observing alignments of related proteins, Gonnet et al. [Gonnet et al. 1992] have empirically shown that the probability of (opening) a gap increases linearly with the PAM distance of the two sequences. However, the probability of a gap of length k decreases as $k^{-3/2}$, independent of the PAM distance of the two sequences. This offers further evidence for the hypothesis that gaps are not created by consecutive events of insertions/deletions. Here is a possible simplified explanation for this functional dependency: Given an accepted gap (inserted/deleted chain), it is reasonable to assume that the two ends of the extracted/inserted chain lie structurally close to each other, so that the chain's insertion/deletion does not much affect the global 3D structure of the protein, and hence its functionality. The probability that the two ends of a randomly coiled chain are placed spatially close is inversely proportional to the mean volume it occupies. This volume increases as $k^{3/2}$ for random chains of length k, which explains the dependency of $k^{-3/2}$.
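
The scaling argument above can be written out explicitly. The following is a sketch that assumes an ideal (Gaussian) random coil, for which the root-mean-square end-to-end distance grows as the square root of the chain length. This assumption, and the symbols R, V and P, are introduced here for illustration and are not taken from [Gonnet et al. 1992].

```latex
R(k) \propto k^{1/2}
\quad\Longrightarrow\quad
V(k) \propto R(k)^{3} \propto k^{3/2}
\quad\Longrightarrow\quad
P(\text{ends close} \mid k) \propto \frac{1}{V(k)} \propto k^{-3/2}
```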

Chapter 3

Large scale analyses of protein sequences

To properly evaluate the present study, it is important to place it in the context of other large-scale analyses of protein sequences. The purpose of this chapter is to provide a brief survey of large-scale studies which considered all or many of the known protein sequences. This has been an active research field since the early 90's. Several different approaches have been tested. These studies are mainly divided into two categories: those focused on finding significant motifs, patterns and domains within protein sequences, and those which apply to complete proteins. Another class of studies which use alternative representations of protein sequences is also discussed.

3.1 Motif and domain based analyses

Most of these studies yielded databases of protein motifs and domains. Such databases have become an important tool for the analysis of newly discovered protein sequences. Among these are PROSITE [Bairoch 1991], Blocks [Henikoff & Henikoff 1991], PRINTS [Attwood & Beck 1994], ProDom [Sonnhammer & Kahn 1994], Pfam [Sonnhammer et al. 1997], and Domo [Gracy & Argos 1998]. The manually defined patterns in PROSITE have served as an excellent seed for several such studies.

These studies differ from each other in several aspects. Some are based on manual or semi-manual procedures (e.g. PROSITE, PRINTS), others are generated semi-automatically (Pfam) and the rest fully automatically (e.g. ProDom, Blocks, Domo). Some focus on short motifs (PROSITE, PRINTS, Blocks) while others seek whole domains and try to define domain boundaries (Pfam, ProDom, Domo). Most databases also give the domain/motif structure of proteins.

3.1.1 The PROSITE dictionary [Bairoch 1991]

The PROSITE dictionary contains information about functional and structural motifs, domains and protein families. Each motif, domain or protein family is represented either as a signature pattern (in the form of a regular expression) or a profile matrix (see section 2.1.4) derived from a multiple alignment of related proteins. Some families are characterized by more than one pattern. Altogether, this database contains more than 1000 patterns. The length of signature patterns ranges from 3 to 400 amino acids. Many of the short patterns match short motifs (such as glycosylation sites, subcellular localization signals, attachment sites, etc.) which are common to many different protein families. Consequently, the number of functionally or structurally defined families in PROSITE is smaller. For many families no consensus or signature pattern can be defined since the sequences have significantly diverged. Some of these families are described in PROSITE by means of a profile. The PROSITE database is linked and cross referenced to the entries in the SWISSPROT database [Bairoch & Boeckman 1992]. As such it serves as a reference classification for many other studies in this field.

3.1.2 The BLOCKS database [Henikoff & Henikoff 1991]

The BLOCKS database is a collection of blocks (short ungapped multiple alignments); each block represents a conserved region of a protein family. Unlike the patterns in PROSITE, the motifs in BLOCKS are not necessarily associated with a known biological function. The blocks are generated automatically from multiple alignments of groups of related proteins. The groups are derived from the protein families in PROSITE, PRINTS, Pfam and DOMO. Given a set of related sequences, a two-step procedure is applied: (1) detection of candidate block alignments, (2) assembly of blocks into a "path".

Candidate blocks are detected by first identifying short motifs using the algorithm described in [Smith et al. 1990]. This algorithm does not require prealigned sequences, but is limited to short patterns with limited flexibility (i.e. some constraints are imposed on the pattern). These motifs are then extended in both directions until the similarity score drops (for more details see [Henikoff & Henikoff 1991]). The resulting blocks are at most 60 amino acids long.

The assembly step creates an ordered set of non-overlapping blocks (a "path") that occur in a "critical number" of sequences. Each individual block is given a score based on its length, the level of similarity, the number of sequences it contains, etc., and the score of a path is the sum of its block scores, weighted by the number of sequences in the path. The set of blocks defining the path which scores best is chosen to characterize the corresponding group of related proteins.

3.1.3 The ProDom database [Sonnhammer & Kahn 1994]

Sonnhammer et al. analyzed the modular organization of all sequences in the SWISSPROT database, and created a database of protein domains. Their analysis is fully automatic, and the database is built in three steps: (1) a BLAST all-versus-all comparison of the sequences to identify (possibly gapped) high scoring segment pairs (HSPs), (2) construction of homologous segment sets by transitive closure of HSPs, when the common segments overlap above a minimum overlap parameter (the non-overlapping regions are put into different sets), (3) detection and breakage of domain boundaries and generation of a multiple alignment and a consensus sequence for each domain family (for fast homology searches).


3.1.4 The Pfam database [Sonnhammer et al. 1997]

Pfam is a database of Hidden Markov Models (HMMs) for protein families. For each family, the process starts from a seed alignment (either a published multiple alignment or an alignment from other databases such as PROSITE or ProDom) of a non-redundant representative set of known members. Pfam alignments represent complete domains. The alignment is checked manually (to verify that the conserved features are correctly aligned, and the alignment has enough information content to distinguish chance similarities from true relationships), and an HMM is built from the seed alignment. The HMM is then used to scan SWISSPROT, in search of all the other members of the family. If a true member is missed then it is added to the seed alignment, and the process is repeated.

Finally, a full alignment is constructed for the family by aligning all members to the HMM. This full alignment is checked again (manually) and if it is not correct, the alignment method is modified or the whole process starts with a new improved seed.

3.1.5 The DOMO database [Gracy & Argos 1998]

DOMO is a classification of the SWISSPROT and PIR sequence databases, obtained by applying compositional and local similarity searches, followed by multiple sequence alignments. The procedure starts by detecting global similarities from the comparison of amino acid and dipeptide composition of each protein, and grouping the most closely related sequences into clusters. Each cluster is represented by one sequence and the representatives are compiled into a suffix tree. This tree is self-compared to detect local sequence similarities. The local similarities are clustered and multiply aligned (without gaps) to detect domain boundaries and the sequences are split into domains accordingly. Finally, a multiple alignment is generated for each set of proteins which share similar segments.

3.2 Protein based analysis

The studies in this category are applied to whole protein sequences. Most of them draw directly on pairwise comparison [Gonnet et al. 1992, Harris et al. 1992, Watanabe & Otsuka 1995, Koonin et al. 1996, Barker et al. 1996, Tatusov et al. 1997, Krause & Vingron 1998]. All these works cluster the input database, using single linkage clustering (see chapter 7). Selected studies are described here in brief.
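
Single linkage clustering over thresholded pairwise similarities amounts to taking the connected components of the similarity graph. The sketch below illustrates the idea with a union-find structure; it is a generic illustration, not the implementation used by any of the cited studies, and the protein identifiers and the list of similar pairs are made up.

```python
def single_linkage_clusters(items, similar_pairs):
    """Group items into connected components of the 'similar' relation."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Pairs whose similarity score passed some threshold (hypothetical data).
proteins = ["P1", "P2", "P3", "P4", "P5"]
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = single_linkage_clusters(proteins, pairs)
```

Note that P1 and P3 end up in the same cluster although they are not directly similar. This chaining through intermediate sequences is the characteristic behavior of single linkage and of transitive closure.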

3.2.1 The study by [Gonnet et al. 1992]

Gonnet et al. performed an exhaustive matching of the protein sequence database. Their analysis starts with organization of the data by indexing on a patricia tree, so that similar sequences lie near each other in the tree. Subtrees are compared to identify potentially similar sequences, and these sequences are aligned globally. The matching of subtrees is aborted when the score falls below a predefined threshold. Therefore, their matching process involves fewer than $n^2$ comparisons. In the first phase classical matrices and gap penalties were used for the matching. The results were then used to construct new mutation matrices and a model for scoring gaps, and the matching process was repeated with these new parameters. Finally, the sequences are organized as a set of evolutionarily connected components.

3.2.2 The study by [Harris et al. 1992]

Harris et al. performed a systematic classification of protein sequences. Their analysis starts from pairwise comparisons using BLAST. Since BLAST may break a hit into fragments due to the existence of gaps, their first step is the assembly of multiple hits which represent pieces of a domain into a continuous region of local similarity (unless a large gap is introduced). At the second stage assembled hits are grouped into equivalence classes, using the transitive closure of pairwise similarities. Clusters are initialized by assigning each assembled hit to one group, and then merged if they share a single region with sufficient overlap.

They also tested a variation of this algorithm, in which groups are required to share at least k overlapping regions. However, they concluded that k = 1 is the best choice for maximum accuracy. Multi-domain proteins are considered in this classification by allowing each assembled hit to be classified to more than a single class. For example, an assembled hit which contains two domains, A and B, is classified to group AB, while a hit which contains domain A will be classified to group A as well as to group AB.

3.2.3 The study by [Watanabe & Otsuka 1995]

In their study, Watanabe and Otsuka use a single-linkage algorithm to cluster approx. 2000 E. coli proteins into groups. Their analysis starts with the calculation of all-vs-all pairwise similarities using FASTA. A threshold value is set, and an adjacency matrix is created with values of 0/1. By repeatedly multiplying the adjacency matrix by itself they get similarity paths between proteins, which reflect direct and indirect relatedness between proteins. The number of edges on the shortest path is defined as the distance. After each matrix multiplication the resulting matrix is transformed to a binary matrix to avoid exponential growth. When this matrix converges, subgroups are defined as sets of proteins corresponding to identical vectors.
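
The matrix procedure described above can be sketched as repeated Boolean squaring until a fixed point is reached. This is an illustrative reconstruction under stated assumptions, not the authors' code: self-loops are added so that each protein is related to itself (a detail the text does not specify), and the matrix is squared rather than multiplied by the original adjacency matrix, which reaches the same fixed point.

```python
def reachability_groups(adjacency):
    """Group indices whose rows coincide in the converged Boolean matrix."""
    n = len(adjacency)
    # Binary matrix with self-loops: reach[i][j] is True if i and j are linked.
    reach = [[bool(adjacency[i][j]) or i == j for j in range(n)] for i in range(n)]
    while True:
        # Boolean matrix product of reach with itself (binarized by construction).
        squared = [
            [any(reach[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        if squared == reach:  # converged: no new indirect relations were found
            break
        reach = squared
    groups = {}
    for i, row in enumerate(reach):
        groups.setdefault(tuple(row), []).append(i)
    return list(groups.values())

# Hypothetical 0/1 adjacency matrix: 0-1 and 1-2 are similar, 3 is isolated.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
groups = reachability_groups(adjacency)
```

For a symmetric adjacency matrix the identical rows correspond to the connected components, so the result agrees with single linkage clustering at the chosen threshold.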

3.2.4 The PIR classification [Barker et al. 1996]

This study classifies all proteins in the PIR database [George et al. 1996] into families based on global similarities, and into homology domains based on local similarities. In this study sequences are classified into families and superfamilies based on similar overall architecture (same domains in the same order if the sequences belong to the same family; more flexibility is allowed if the sequences belong to different families within the same superfamily). Homology domains are defined using multiple alignments of homologous segments (identified based on local similarities). Both classifications depend on semi-automatic procedures and careful manual inspection.

3.2.5 The COGS database [Tatusov et al. 1997]

Comparison of proteins from seven complete genomes (five phylogenetic lineages), and a single-linkage clustering algorithm resulted in 720 clusters of orthologous groups (COGs). Each COG consists of orthologous proteins (genes in different species that evolved from a common ancestral gene) or orthologous sets of paralogs (genes from the same genome, which are related by duplication) from at least three species, which typically have the same function. The COGs are created by starting from triangles of orthologous proteins from different species (the minimal COG), and then merging triangles which share a side. Consequently, the final COGs may contain paralogs as well. An additional step is carried out to split COGs which were incorrectly merged due to the existence of multi-domain proteins. The total number of proteins in the COGs is 37% of the total number of genes in the seven complete genomes. Finally, COGs are merged to form superfamilies, using PSI-BLAST [Altschul et al. 1997] with the criterion that at least two proteins from the first COG hit members of the second COG.

3.2.6 The SYSTERS database [Krause & Vingron 1998]

Krause and Vingron apply an iterative method for database searching to cluster proteins in the SWISSPROT database and in the PIR database [George et al. 1996]. For each seed protein, all blastp hits in a database search with a p-value of at most $10^{-30}$ are retained and the lowest scoring sequence is used as a query for the next search to explore the sequence space below the cut-off for sequences possibly related to the seed. The process repeats until no new sequences above the cut-off are found, or until the search has no sequence in common with the set of accepted hits from the first search. The resulting clusters correspond to groups of sequences that do not share domains with other families. If a group of proteins share a domain with other families (like homeobox) then each family is mapped to a different cluster.

They validate their clustering procedure by applying an internal consistency test. For each cluster, all sequences are tested as a seed for the above procedure. If all sequences in a cluster produce the same cluster, then the cluster is considered consistent. Those clusters which are not consistent are sorted into two types: (1) separate maximal clusters (clusters containing smaller clusters, which are not a subset of a bigger cluster and which do not overlap with other clusters), (2) overlapping maximal clusters (these clusters ambiguously assign sequences).


3.3 Alternative representations of protein sequences

Several studies employed alternative representations of protein sequences for the analysis of protein sequences and families. Some representations were based on dipeptide composition [van Heel et al. 1991, Ferran et al. 1994] or a combination of compositional properties and other physical/chemical properties [Hobohm & Sander 1995]. In some cases these representations induced measures of similarity/dissimilarity between complete protein sequences, which were used to classify the sequences into a fixed number of clusters [van Heel et al. 1991], or to search for close relatives [Hobohm & Sander 1995]. In other cases, neural nets were applied to cluster the sequences based on the new representation [Wu et al. 1992, Ferran et al. 1994]. Other studies utilized information from multiple alignments to derive a new representation and study significant patterns in protein families [Han & Baker 1995, Casari et al. 1995].

3.3.1 The study by [van Heel et al. 1991]

In his work from 1991, M. van Heel represents sequences by their content of dipeptides. Using this representation, he compared sequences by means of the correlation of their representations. This way sequences can be compared even if they are unrelated. This is an advantage of this approach, since sequence alignment does not produce a meaningful measure of similarity/dissimilarity in such cases. An eigenvector projection of the corresponding 400 dimensional sequence space was followed by a clustering of 10,000 sequences into 600 families using a variant of the K-means clustering procedure with the squared error criterion function (see chapter 5).
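
The dipeptide representation can be sketched as follows. This is a minimal illustration rather than van Heel's implementation: the normalization to frequencies and the use of the Pearson correlation coefficient are assumptions, since the text only states that the correlation of the representations was used, and the two example sequences are made up.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries

def dipeptide_composition(sequence):
    """Return the 400-dimensional vector of dipeptide frequencies of a sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values()) or 1
    return [counts[d] / total for d in DIPEPTIDES]

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = math.sqrt(
        sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v)
    )
    return cov / norm if norm else 0.0

vec_a = dipeptide_composition("ACDEFGHIKL")
vec_b = dipeptide_composition("ACDEFGHIKV")
similarity = correlation(vec_a, vec_b)
```

Because the representation has a fixed dimension regardless of sequence length, any two sequences can be compared without computing an alignment.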

3.3.2 The study by [Ferran et al. 1994]

Ferran et al. applied a two-layer (input, output) neural network to cluster protein sequences into families. They trained the neural net with Kohonen's (topographic) unsupervised learning algorithm (inspired by the K-means algorithm), using as inputs the dipeptide composition matrix (20x20) of each sequence. The neural net was trained to classify the sequences into 225 classes (15x15 output neurons), where the synaptic connections between each output neuron and the input neurons can be viewed as the centroid composition matrix of the corresponding cluster.

Another application of neural networks to classify proteins appeared in the study by [Wu et al. 1992]. Based on k-tuple encoding of proteins, proteins were classified into 620 pre-defined families with 90% accuracy.

3.3.3 The study by [Han & Baker 1995]

Han and Baker [1995] studied frequent patterns in multiple alignments. In their study, multiple alignments are represented as sequences of frequency distributions (profiles). The profiles are split into individual columns (20-dimensional vectors). Then, the K-means algorithm is applied to cluster these vectors, using the variational distance ($l_1$ norm) between vectors. They applied their method to 20,000 columns derived from multiple alignments of 154 protein families with known 3D structure. Each one of the resulting clusters is represented by its most frequent amino acids, and the observed patterns are mostly of hydrophobic or charged amino acids. They also generalized their method to segments, 3-15 positions long, of contiguous positions along multiple alignments. The distance between segments is defined as the sum of distances between the individual columns.
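
The two distances described above are simple to state in code. The sketch below only illustrates the definitions; the function names and the toy columns are assumptions and do not come from [Han & Baker 1995].

```python
def l1_distance(col_a, col_b):
    """Variational (l1) distance between two amino acid frequency columns."""
    return sum(abs(a - b) for a, b in zip(col_a, col_b))

def segment_l1_distance(seg_a, seg_b):
    """Sum of column-wise l1 distances between two equal-length profile segments."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same number of columns")
    return sum(l1_distance(a, b) for a, b in zip(seg_a, seg_b))

# Toy 20-dimensional profile columns (illustrative values only).
conserved_first = [1.0] + [0.0] * 19         # all weight on the first amino acid
conserved_second = [0.0, 1.0] + [0.0] * 18   # all weight on the second amino acid
uniform = [0.05] * 20

d_col = l1_distance(conserved_first, conserved_second)
d_seg = segment_l1_distance(
    [conserved_first, uniform], [conserved_second, uniform]
)
```

Two columns that are each fully conserved on a different amino acid are at the maximal distance of 2, and identical columns contribute nothing to the segment distance.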

3.3.4 The study by [Casari et al. 1995]

Another study of multiple alignments is described in [Casari et al. 1995]. Their goal was to predict functional residues in proteins. The input for their analysis is multiple sequence alignments. Given a multiple alignment of length L, each participating sequence is represented as a vector in a 20L-dimensional sequence space, with residue positions and residue types as the basic dimensions. Then principal component analysis (PCA) is applied to define the most populated regions in the sequence space. These regions are expected to correspond to protein families and subfamilies. The principal components point out individual residues and positions that are characteristic of the different subfamilies. The direction of the main principal component corresponds to the consensus pattern of the entire family. Preferred directions in this space reveal which residue types at which positions best distinguish a subfamily.

Another study which used multiple alignments to induce a pairwise distance measure between sequences is described in [Agrafiotis 1997]. Sammon's non-linear mapping algorithm (footnote 1) is then applied to the pairwise distances in order to obtain a representation of the corresponding family that captures the essential features of the distance matrix.

3.4 Structural classifications of proteins

Sequence based analyses differ markedly from structure based analyses. The essential difference between the representation of a protein as a sequence of amino acids, and its representation as a 3D structure dictates different methodologies, different similarity/distance measures and different comparison algorithms. Clearly, classification of protein structures based on structural similarities is extremely important, for example for the understanding of the relation between structure and function, and for more accurate predictions of the functional role of proteins. Structure based analyses are not discussed in this thesis. For studies which are focused on classification of known protein structures into structural classes see [Orengo 1994, Murzin et al. 1995, Orengo et al. 1997].

It should be noted that structure comparison algorithms are mostly heuristics, and are usually very computationally intensive. Moreover, expert knowledge is usually needed to obtain accurate classifications. Furthermore, for most known proteins only the sequence is available (currently, only several thousand structures have been determined, while the number of known sequences is over 100,000). In this view, and considering the difficulty of determining the 3D structure of a protein, sequence based studies play a major role in genome analysis.

Footnote 1: This method projects the sequence space onto a two-dimensional Euclidean space, so that the Euclidean distances approximate the original distances (for details on this method, also called multi-dimensional scaling, see chapter 4).

3.5 Summary

In view of the many studies that are described in this chapter, one may ask which study produces the "best" or the "true" classification, or provides the best means by which protein sequences should be analyzed. This is a difficult question, which probably does not have a single answer, since no "true classification" clearly exists. One may suggest an external criterion and a reference set, to test which classification is the best according to this criterion and with respect to that reference set. However, this approach has its own pitfalls (see section 9.2 in chapter 9).

The aim of the work presented here is to obtain a global view of the sequence space. We believe that such a view can help to reveal higher order structures within the sequence space. Though a few attempts were made to provide means by which a bird's eye view can be achieved (see section 3.3), they were either limited to small sets of homologous sequences, or limited by the sensitivity of the representation. Our work applies different approaches and new algorithms for this task. The results are compared with the generally accepted classifications described in sections 3.1 and 3.2.

Chapter 4

Part I - The Euclidean embedding approach

This chapter focuses on techniques for embedding general metric spaces into low-dimensional Euclidean spaces, and the application of these techniques to protein segments.

The basic concept is simple: By embedding a set of protein sequences into Euclidean space, each protein is mapped to a specific point in Euclidean space, where similar proteins are mapped to close points. Thus, distinct families of proteins are mapped to separate clusters, allowing us to predict the biological function of a protein by observing the other members in its cluster. This methodology thus converts the problem of finding a protein's function to a problem of pattern recognition.

This approach must address three problems: (1) Constructing a metric space on the protein sequence space, (2) Finding the optimal embedding, and (3) Discovering statistical regularities in Euclidean space (clustering). The first two stages are described in this chapter.

4.1 Constructing a metric space on the protein sequence space

In order to analyze the structure of the protein sequence space, a distance function that measures the dissimilarity between protein sequences is essential. Such a function should satisfy the three requirements listed in section 2.1.3.

The straightforward approach is to use the dynamic programming algorithm to obtain a global distance measure (see section 2.1.3). However, a global distance measure is not sensitive enough to reflect local similarities among sequences (section 2.1.5). A global alignment is likely to miss significant local similarities, and when the two sequences share only a short segment, the resulting alignment may be of poor quality. Moreover, global measures are strongly affected by the length differences among sequences.

On the other hand, the local similarity measures are not suitable for defining distance measures as is (see section 2.1.5). Moreover, they may fail to detect complex similarities. In principle, local similarities, such as those observed in multi-domain proteins, make the problem of defining distances among protein sequences ill-posed (see Fig. 2.1 a and b).

In view of this, it seems hard or even impossible to define a reliable and sensitive distance function for complete proteins. However, it is possible to define a distance function for protein segments of fixed length (details below). Therefore, we chose segments of 50 amino acids, and not complete proteins, as our basic building blocks. In a preliminary phase, all protein sequences in the SWISSPROT database release 30 [Bairoch & Boeckman 1992] with fewer than 50 amino acids are eliminated. Each of the 38,106 remaining sequences is divided into segments of 50 amino acids with a 50% or higher overlap between consecutive segments, yielding a total of 543,627 segments. This procedure maps each protein sequence to 14 segments on average. Since most protein motifs and domains are a few tens of amino acids long, most of them are mapped to one or two segments at most.
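The segmentation step can be illustrated with a short Python sketch. The text specifies only the segment length (50) and an overlap of 50% or higher; the step of 25 residues and the rule of shifting the final window back so that it ends at the last residue are assumptions made here for illustration, and the function name is hypothetical.

```python
def split_into_segments(sequence, length=50, step=25):
    """Split a sequence into fixed-length windows with at least 50% overlap.

    Assumption (not stated in the text): windows start every `step` residues,
    and a final window is added so that the end of the sequence is covered.
    """
    if len(sequence) < length:
        return []  # sequences shorter than one segment are eliminated
    last_start = len(sequence) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # last window overlaps its predecessor by more than 50%
    return [sequence[s:s + length] for s in starts]


# A sequence of 120 residues yields windows starting at 0, 25, 50 and 70.
segments = split_into_segments("A" * 120)
```

Under these assumptions every pair of consecutive segments shares at least 25 residues, which matches the "50% or higher" description.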

To define the distance between two segments, the Smith-Waterman dynamic programming algorithm is applied to find the best local similarity of the two segments (footnote 1), and a simple transformation is invoked to obtain a distance measure. If the similarity score of segments $x$ and $y$ is $s(x, y)$, then their distance is defined as:

$$d(x, y) = s(x, x) + s(y, y) - 2 \cdot s(x, y) \qquad (4.1)$$

This transformation obviously defines a non-negative, symmetric function. This function is not guaranteed to satisfy the triangle inequality, but for all practical purposes the triangle inequality is almost never violated (footnote 2). Thus, this metric turns the space of protein sequences into a finite metric space.
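As a minimal sketch of this transformation, the following Python code converts a symmetric matrix of similarity scores into distances and counts triangle inequality violations by brute force. The function names and the toy scores are illustrative assumptions, not values from this work.

```python
import itertools

import numpy as np


def distances_from_similarities(S):
    """Apply d(x, y) = s(x, x) + s(y, y) - 2 s(x, y) to a similarity matrix."""
    S = np.asarray(S, dtype=float)
    self_scores = np.diag(S)
    return self_scores[:, None] + self_scores[None, :] - 2.0 * S


def count_triangle_violations(D):
    """Count ordered triples (i, j, k) with d(i, j) > d(i, k) + d(k, j)."""
    n = len(D)
    return sum(
        1
        for i, j, k in itertools.permutations(range(n), 3)
        if D[i, j] > D[i, k] + D[k, j]
    )


# Hypothetical similarity scores for three segments (self scores on the diagonal).
S = [[260, 18, 40],
     [18, 255, 17],
     [40, 17, 262]]
D = distances_from_similarities(S)
violations = count_triangle_violations(D)
```

The brute-force check is cubic in the number of segments, so for a large collection it would only be practical on a random sample, as in footnote 2.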

It is interesting to compare the distribution of similarity scores and the distribution of their accompanying distances. These two distributions clearly differ, as can be seen in Fig. 4.1a and Fig. 4.1b. As expected, local similarity scores closely follow the extreme value distribution (see section 2.3.3), with parameters $\lambda = 0.31$ and $u = 17.8$. These parameters are in perfect agreement with the theoretical estimates: solving equation 2.3 in section 2.3.3 for $\lambda$, given the overall amino acid distribution in the SWISSPROT database release 30 and the BLOSUM-62 scoring matrix, yields $\lambda = 0.31$. The parameter $u$ is estimated by the formula $u = \frac{\ln Kmn}{\lambda}$ (see theorem 4 in section 2.3.3). For $m = n = 50$, the BLOSUM-62 matrix (with $K \simeq 0.1$, see [Altschul & Gish 1996]; footnote 3) and $\lambda = 0.31$ as estimated above, we obtain $u = 17.8$.
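The estimate of $u$ can be reproduced with a few lines of Python. The sketch below evaluates $u = \ln(Kmn)/\lambda$ for the values quoted in the text and also defines the extreme value density used in the approximation; the function names are illustrative.

```python
import math


def evd_location(K, m, n, lam):
    """Location parameter u = ln(K * m * n) / lambda of the extreme value distribution."""
    return math.log(K * m * n) / lam


def evd_density(x, lam, u):
    """Extreme value density: lambda * exp(-lambda (x - u) - exp(-lambda (x - u)))."""
    z = lam * (x - u)
    return lam * math.exp(-z - math.exp(-z))


# Values quoted in the text: K ~ 0.1, m = n = 50, lambda = 0.31.
u = evd_location(K=0.1, m=50, n=50, lam=0.31)
peak = evd_density(u, lam=0.31, u=u)  # the density is maximal at x = u
```

With these inputs `u` evaluates to about 17.8, in agreement with the value quoted above.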

Footnote 1: The algorithm was run with the BLOSUM-62 scoring matrix and gap penalties of 10 for opening a gap and 0.5 for each extension. See section 2.1.5 for a description of the algorithm.

Footnote 2: Based on random sampling, failures occur with frequency below $10^{-7}$, and hardly affect our results.

Footnote 3: It should be noted that $K \simeq 0.1$ only for ungapped alignments. For gapped alignments using the BLOSUM-62 matrix, the value of $K$ is actually smaller (based on empirical estimations: see [Altschul & Gish 1996]). In practice, gaps occur in more than 20% of the pairwise alignments that we have calculated. However, the value of $u$ for the observed distribution suggests that the effect of gaps on the distribution is minor, probably because the segments compared are relatively short, and consequently the gaps are short as well.

Figure 4.1: (a) Distribution of SW similarity scores (pairwise similarity score vs. relative frequency, with extreme value approximation). (b) Distribution of DP distances (distance vs. relative frequency, with normal approximation). The distributions were obtained based on all pairwise similarities among 5000 randomly chosen segments. Note that the distribution of similarity scores is approximated by the extreme value distribution $\lambda \exp\left(-\lambda(x-u) - e^{-\lambda(x-u)}\right)$ where $\lambda = 0.31$ and $u = 17.8$, while the distribution of distances is approximated by the normal distribution $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ where $\mu = 486$ and $\sigma = 19$.

Interestingly, the distribution of DP distances can be approximated by a normal distribution with a mean of 486 and a standard deviation of 19. This distribution is explained by the normal distribution of self similarity scores. Note that DP distances are defined (equation 4.1) from the self similarity scores $s(x, x)$ and $s(y, y)$ and from the local similarity score $s(x, y)$. Practically, self similarity scores can be viewed as the sum of $n$ ($n = 50$) i.i.d. random variables (the self similarity of a random amino acid). Let $a$ be a random amino acid drawn from the background distribution $P$ over the alphabet $\mathcal{A}$. If $\mu = E(s(a, a))$ is the expectation value of the self similarity of a random amino acid, and $\sigma^2 = E((s(a, a) - \mu)^2)$ is the variance, then the self similarity scores of segments of length 50 are distributed normally with mean $\mu_{self} = 50 \cdot \mu$ and standard deviation $\sigma_{self} = \sqrt{50 \cdot \sigma^2}$. For the background distribution of amino acids in the SWISSPROT database, and the BLOSUM-62 scoring matrix, $\mu_{self} = 260$ and $\sigma_{self} = 12.5$. Combining this distribution with the distribution of local similarity scores characterized above (with mean $\mu_{local} \simeq 17.8$ and standard deviation $\sigma_{local} = 4.7$) preserves the normal characteristics. The extreme value distribution of local similarity scores can be described roughly as a normal distribution with a long tail; therefore the normal distribution dominates. Following equation 4.1, the combined distribution has a mean of $2\mu_{self} - 2\mu_{local} = 2 \cdot 260 - 2 \cdot 17.8 = 484.4$ and standard deviation $\sqrt{2(\sigma_{self})^2 + 2(\sigma_{local})^2} = 18.9$. These numbers are in excellent agreement with the estimated parameters (486, 19) of the normal approximation of the DP distances.
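The arithmetic behind the combined mean and standard deviation can be checked directly. The short Python sketch below plugs the quoted parameters into the expressions given in the text; the variable names are illustrative, and the standard deviation formula is taken as stated in the text rather than re-derived.

```python
import math

# Parameters quoted in the text.
mu_self, sigma_self = 260.0, 12.5     # self similarity scores of 50-residue segments
mu_local, sigma_local = 17.8, 4.7     # local similarity scores

# Mean of d(x, y) = s(x, x) + s(y, y) - 2 s(x, y).
mean_distance = 2 * mu_self - 2 * mu_local

# Standard deviation, using the expression given in the text.
std_distance = math.sqrt(2 * sigma_self ** 2 + 2 * sigma_local ** 2)
```

The two values come out at 484.4 and about 18.9, close to the fitted parameters (486, 19).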

4.2 Why embed?

In search of high-order organization within the protein space, our work is essentially concerned with data clustering. Clustering of statistical data calls for grouping sample points into well-defined groups, so that the distances between points from the same cluster are significantly smaller than distances between points from distinct clusters. Traditionally, clustering procedures (see chapter 5) were used to identify classes and subclasses within labeled and unlabeled data.


While it is possible to construct a metric on the protein sequence space (as shown in section 4.1) and thus define pairwise distances between protein sequences, most clustering algorithms are not applicable to general metric spaces. Many of the traditional classification methods require that the data be presented as objects with a list of attributes or features, and require an appropriate distance function (see chapter 5). Usually, samples are assumed to reside in a real normed space (e.g., Euclidean). Moreover, the curse of dimensionality plagues most pattern recognition procedures with unmanageable computation time and problems of convergence and reliability [Duda & Hart 1973, Jain & Dubes 1988]. Thus most clustering algorithms may be applied only when the data points reside in a low-dimensional space. When sample points are in a high-dimensional space, or when distances between points do not conform with any norm (as for protein sequences), clustering becomes a difficult problem.

It should be emphasized that not all clustering algorithms are confined to vector data. Pairwise clustering algorithms are another class of algorithms, which are applicable to proximity data (pairwise distances/similarities). These algorithms are discussed in chapter 7. However, the application of pairwise clustering algorithms to the organization of all protein sequences is far from trivial. The space and time complexity of these algorithms usually rule out a straightforward application to more than a few thousand proteins. Moreover, classical pairwise algorithms are very sensitive to noise. Furthermore, solutions have no established properties and it is not clear how to generalize the solutions to new samples. Specifically, it is much more difficult to assess the validity of the results of such clustering algorithms (see the next chapter, section 5.1.6), and usually the validity is assessed by means of a given reference classification. We have tested this direction as well, and our approach is described in detail in chapters 7 to 9.

In view of the last paragraphs, it follows that in exploring metric spaces it is a great advantage if the metric space under consideration is Euclidean. Euclidean metrics have additional advantages. For example, if we want to select a typical representative of a point set, in Euclidean space the centroid of the set is a natural choice, while otherwise there is no satisfactory general solution. In addition, the Euclidean metric is geometrically intuitive, and concepts such as direction are of great importance and can be very informative. Therefore, we start by embedding the metric space under consideration into Euclidean space.

4.3 Embedding strategies

4.3.1 Classical approach

The classical approach to dimensionality reduction is the multi-dimensional scaling method (see [Duda & Hart 1973]). This method represents the data points as points in some lower-dimensional space in such a way that the distances between the points in the lower-dimensional space correspond to the dissimilarities between points in the original space. The method was traditionally applied to visualize high-dimensional data in two or three dimensions.

The method projects the given set of samples into a subspace of desired dimension and norm. A random projection (or projection by PCA, see footnote 4 below) serves as the initial embedding. A stress function which compares proximity values with Euclidean distances of the corresponding points (usually a sum-of-squared-errors function) is used to measure the quality of the embedding, and a gradient descent procedure is applied to improve the embedding until a local minimum of the error function is reached.

Formally, given a set of $n$ samples $x_1, x_2, \ldots, x_n$, denote by $d_{i,j}$ the distance between $x_i$ and $x_j$. Let $y_i$ be the lower-dimensional image of $x_i$, and $\delta_{i,j}$ the distance between $y_i$ and $y_j$. The goal is to find a configuration of image points $y_1, y_2, \ldots, y_n$ such that the $n(n-1)/2$ distances $\delta_{i,j}$ between image points are as close as possible to the corresponding original distances. This problem can be formulated as a continuous optimization problem where the criterion function can be a sum-of-squared-error function such as:

$$J_{ee} = \frac{1}{\sum_{i<j} d_{i,j}^2} \sum_{i<j} (\delta_{i,j} - d_{i,j})^2$$

which emphasizes the largest errors, or

$$J_{ff} = \sum_{i<j} \left( \frac{\delta_{i,j} - d_{i,j}}{d_{i,j}} \right)^2$$

which emphasizes the largest fractional errors, or

$$J_{ef} = \frac{1}{\sum_{i<j} d_{i,j}} \sum_{i<j} \frac{(\delta_{i,j} - d_{i,j})^2}{d_{i,j}}$$

which is a compromise and emphasizes the largest product of error and fractional error. Since the criterion functions involve only distances between points, these transformations are invariant under rotations and translations of configurations. Moreover, it should be noted that this embedding technique is non-linear, and contrary to linear embedding techniques (footnote 4), it can be applied to proximity data as well as to high-dimensional feature vector data.
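The three criterion functions can be sketched in Python with NumPy as follows. The function names and the toy configuration are illustrative assumptions; the sketch only evaluates the criteria for a given configuration and does not perform the gradient descent.

```python
import numpy as np


def pairwise_euclidean(Y):
    """Matrix of Euclidean distances between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress_functions(D, Y):
    """Return (J_ee, J_ff, J_ef) for original distances D and image points Y."""
    D = np.asarray(D, dtype=float)
    Y = np.asarray(Y, dtype=float)
    upper = np.triu_indices(len(D), k=1)      # each pair i < j once
    d = D[upper]                              # original distances (assumed > 0)
    delta = pairwise_euclidean(Y)[upper]      # distances between images
    err = delta - d
    j_ee = (err ** 2).sum() / (d ** 2).sum()
    j_ff = ((err / d) ** 2).sum()
    j_ef = (err ** 2 / d).sum() / d.sum()
    return j_ee, j_ff, j_ef


# Toy example: original distances of a 3-4-5 triangle, imperfect image points.
D = pairwise_euclidean(np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]))
Y = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
stress = stress_functions(D, Y)

# Rotating and translating the configuration leaves the criteria unchanged.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
stress_moved = stress_functions(D, Y @ R.T + np.array([5.0, -2.0]))
```

The last two lines illustrate the invariance under rotations and translations noted above.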

The disadvantages of this approach are: (i) there is no bound on the expected distortion of the embedding, and (ii) the dimension has to be set by the user. Moreover, the deterministic algorithms (gradient descent) which are usually applied to solve this problem tend to get trapped in a local minimum which can be far from optimal. It is possible to use stochastic techniques, like simulated annealing, to reduce the probability of being trapped in a local minimum, but at a cost in computation time. Recently, Klock and Buhmann [Klock & Buhmann 1997] have demonstrated the use of deterministic annealing (footnote 5) to solve this optimization problem while combining the merits of both approaches. Practically, such procedures are not effective for more than a few thousand sample points. This restriction rules out the use of these procedures in our case.

Footnote 4: Classical linear embedding techniques, such as principal component analysis (PCA), reduce dimensionality by rotating the axes to coincide with the eigenvectors of the sample covariance matrix, and keeping only those eigenvectors with high eigenvalues (large variance). However, these techniques require the data to come from a real normed space. Since the space of protein sequences is represented by proximities, linear embedding techniques are inapplicable. These techniques can be applied at a later stage, after the protein space is embedded in a real normed space.

Footnote 5: See next chapter for details on the deterministic annealing approach.


4.3.2 Current approach

The approach is based on finding faithful low-dimensional representations of statistical data in cases where there is a well-defined notion of distance among data points. The core of this approach consists of two algorithms developed by N. Linial, E. London and Yu. Rabinovich [Linial et al. 1995], which are based on earlier work by Bourgain [Bourgain 1985].

Given a finite metric space, these algorithms embed it in real normed spaces so as to minimize

- The dimension of the host space.
- The distortion, i.e. the extent to which the distances between the original points disagree with the distances between their representatives.

This section contains a brief survey of the field of metric embedding as presented in [Linial et al. 1995]. One of these algorithms was applied to the space of protein sequences.

Definitions

A metric space is a pair $(X, d)$, where $X$ is a finite or infinite set, and $d : X \times X \to R^{+}$ is a distance function (a metric) over the set $X$.

A norm $\|\cdot\|$ associates a nonnegative number $\|\vec{x}\|$ with every point $\vec{x}$ in the d-dimensional real space $R^d$, where

- $\|\vec{x}\| \ge 0$ for all $\vec{x}$, and equality holds iff $\vec{x} = \vec{0}$
- $\|\lambda \vec{x}\| = |\lambda| \cdot \|\vec{x}\|$ for every $\vec{x} \in R^d$ and every $\lambda \in R$
- $\|\vec{x} + \vec{y}\| \le \|\vec{x}\| + \|\vec{y}\|$ for every $\vec{x}, \vec{y} \in R^d$

The metric associated with the norm $\|\cdot\|$ is $d(x, y) = \|\vec{x} - \vec{y}\|$.

The norm $l_p$ applied on $R^d$ is defined by

$$\|\vec{x}\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p}$$

Specifically, three norms of main interest are $l_1$:

$$\|\vec{x}\|_1 = \sum_{i=1}^{d} |x_i|$$

the Euclidean norm $l_2$:

$$\|\vec{x}\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2}$$

and the max or $l_\infty$ norm:

$$\|\vec{x}\|_\infty = \max_i |x_i|$$

The space $R^d$ equipped with the norm $l_p$ is denoted by $l_p^d$.

Low-distortion low-dimensional embeddings

When mapping one metric space into another, the ideal mapping would be one which preserves distances. An isometry is a mapping $\varphi$ from a metric space $(X, d)$ to a metric space $(Y, \delta)$ which preserves distances, i.e. $\delta(\varphi(x), \varphi(y)) = d(x, y)$ for all $x, y \in X$.

Any metric space $(X, d)$ with $n$ points $x_1, x_2, \ldots, x_n$ can be embedded isometrically in $l_\infty^n$: associate with each point $x_i \in X$ a vector $\tilde{x}_i$ of dimension $n$ whose components are the distances from all the $n$ points. The $k$-th component $(\tilde{x}_i)_k$ is the distance from the $k$-th point, $d_{i,k} = d(x_i, x_k)$. I.e. $\tilde{x}_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,n})$. From the definition of the max norm it follows that $\|\tilde{x}_i - \tilde{x}_j\|_\infty = \max_k |(\tilde{x}_i)_k - (\tilde{x}_j)_k| = \max_k |d_{i,k} - d_{j,k}| \ge |d_{i,j} - d_{j,j}| = d_{i,j}$. On the other hand, from the triangle inequality it follows that $\|\tilde{x}_i - \tilde{x}_j\|_\infty = \max_k |(\tilde{x}_i)_k - (\tilde{x}_j)_k| = \max_k |d_{i,k} - d_{j,k}| \le d_{i,j}$. Therefore, $\|\tilde{x}_i - \tilde{x}_j\|_\infty = d_{i,j}$ for all $x_i, x_j \in X$.
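The isometric embedding into $l_\infty^n$ is easy to verify numerically. In the Python sketch below each point is mapped to its row of the distance matrix, and the max-norm distances between the images are compared with the original distances. The function names and the example metric (shortest-path distances on a 4-cycle) are illustrative choices.

```python
import numpy as np


def distance_row_embedding(D):
    """Map point i to the vector (d(i, 1), ..., d(i, n)): row i of the distance matrix."""
    return np.asarray(D, dtype=float).copy()


def max_norm_distances(V):
    """Pairwise l_infinity distances between the rows of V."""
    return np.abs(V[:, None, :] - V[None, :, :]).max(axis=-1)


# Shortest-path metric of a cycle on four vertices.
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)

images = distance_row_embedding(D)
recovered = max_norm_distances(images)
is_isometry = bool(np.allclose(recovered, D))
```

For any input that satisfies the triangle inequality, `recovered` equals `D`, at the price of a host space whose dimension equals the number of points.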

This mapping has the disadvantage that the resulting space is of high dimension. In general, isometries may be too restrictive, and this ideal situation cannot be attained. Therefore, we seek to approximate it. Allowing the embedding to distort the metric may add some flexibility. This leads to the following definition.

An embedding $\varphi$ of a metric space $(X, d)$ in a metric space $(Y, \delta)$ is said to have a distortion $\le c$ if every two points $x, y \in X$ satisfy

$$d(x, y) \ge \delta(\varphi(x), \varphi(y)) \ge \frac{1}{c} \cdot d(x, y)$$

The minimal distortion with which a metric space $(X, d)$ can be embedded in $l_p$ (of any dimension) is denoted by $C_p(X)$.
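The definition of distortion can be made concrete with a small Python sketch. It assumes that the image distances may first be rescaled by a constant so that no distance is expanded, and then reports the smallest $c$ satisfying the inequality above; this rescaling convention, the function name and the example are assumptions for illustration.

```python
import numpy as np


def distortion(D, R):
    """Smallest c with d >= s * r >= d / c for the best uniform scale s.

    D holds the original distances and R the distances between the images;
    all off-diagonal entries are assumed to be positive.
    """
    D = np.asarray(D, dtype=float)
    R = np.asarray(R, dtype=float)
    upper = np.triu_indices(len(D), k=1)
    ratios = R[upper] / D[upper]
    # After scaling so that the largest ratio is 1 (no expansion),
    # the worst contraction is min(ratios) / max(ratios) = 1 / c.
    return ratios.max() / ratios.min()


# Shortest-path metric of a 4-cycle, embedded as the corners of a unit square.
D = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)
P = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
R = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
c = distortion(D, R)
```

In this example adjacent vertices keep their distance while opposite vertices shrink from 2 to $\sqrt{2}$, so `c` is $\sqrt{2}$.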

The next theorem states a result with important practical implications for metric embedding. The proposed algorithm was applied to embed the space of all protein sequences into a Euclidean space.

Theorem 1: There exists a randomized polynomial-time algorithm that embeds every n-point metric space $(X, d)$ into $l_p^{O(\log^2 n)}$, for every $p \ge 1$, with distortion $O(\log n)$.

Note: $\log n$ is to base 2.


The algorithm: Given a metric space $(X, d)$ with $n$ points, pick random subsets as follows: for each $k < n$ which is a power of 2, pick $O(\log n)$ sets $A \subseteq X$ of cardinality $k$ ($|A| = k$). Specifically,

- $\log n$ sets of size 1
- $\log n$ sets of size 2
- $\log n$ sets of size 4
- $\log n$ sets of size 8
- ...
- $\log n$ sets of size $2^{\lfloor \log n \rfloor - 1}$

This gives a total of $O(\log^2 n)$ sets. Denote them by $A_1, A_2, \ldots, A_{O(\log^2 n)}$. Then, map every point $x$ to the vector of dimension $O(\log^2 n)$

$$\left( d(x, A_1), d(x, A_2), \ldots, d(x, A_{O(\log^2 n)}) \right)$$

i.e. one coordinate is associated with each subset selected. Its value is the distance of the point $x$ from the subset, where $d(x, A) = \min\{d(x, y) \mid y \in A\}$.

This mapping from $X$ to $l_p^{O(\log^2 n)}$ almost surely has a distortion of at most $O(\log n)$ (footnote 6), for almost all choices of subsets $A \subseteq X$.
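The random-subset construction described above can be sketched in Python as follows. This is a minimal illustration that works from a precomputed distance matrix, not the implementation used in this work; the function name, the parameter names and the option to cap the largest set size are assumptions. Coordinates are divided by $Q^{1/p}$, the normalization mentioned in the proof.

```python
import math

import numpy as np


def random_subset_embedding(D, p=2, max_log_size=None, rng=None):
    """Embed n points, given by their distance matrix D, using random subsets.

    For each size 2^l (l = 0, ..., floor(log2 n) - 1, or up to max_log_size),
    floor(log2 n) random subsets are drawn; each contributes one coordinate
    d(x, A) = min over y in A of d(x, y).
    """
    D = np.asarray(D, dtype=float)
    n = len(D)
    rng = np.random.default_rng() if rng is None else rng
    log_n = math.floor(math.log2(n))
    top = log_n - 1 if max_log_size is None else max_log_size
    coords = []
    for l in range(top + 1):
        for _ in range(log_n):
            A = rng.choice(n, size=2 ** l, replace=False)   # random subset of size 2^l
            coords.append(D[:, A].min(axis=1))              # d(x, A) for every point x
    V = np.stack(coords, axis=1)
    return V / V.shape[1] ** (1.0 / p)                      # normalization by Q^(1/p)


# Toy metric: Euclidean distances among 64 random points in the plane.
rng = np.random.default_rng(0)
P = rng.random((64, 2))
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1))
V = random_subset_embedding(D, p=2, rng=rng)
```

With 64 points the sketch draws 6 sets for each of the 6 sizes 1, 2, ..., 32, giving 36 coordinates. Because every coordinate changes by at most $d(x, y)$ between two points, the normalized images are never further apart than the original points.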

Proof (taken from [Linial et al. 1995]):

Let $B(x, \rho) = \{y \in X \mid d(x, y) \le \rho\}$ and $\mathring{B}(x, \rho) = \{y \in X \mid d(x, y) < \rho\}$ denote the closed and open balls of radius $\rho$ centered at $x$. Consider two points $x \ne y \in X$:

Let $\rho_0 = 0$, and let $\rho_t$ be the least radius $\rho$ for which both $|B(x, \rho)| \ge 2^t$ and $|B(y, \rho)| \ge 2^t$ (i.e. the cardinality of each ball is at least $2^t$). Define $\rho_t$ as long as $\rho_t < \frac{1}{4} d(x, y)$, and let $\hat{t}$ be the largest such index. Also let $\rho_{\hat{t}+1} = \frac{d(x, y)}{4}$. By this construction $B(y, \rho_j)$ and $B(x, \rho_i)$ are always disjoint, as the largest balls centered at $x$ and $y$ are at least $\frac{d(x, y)}{2}$ apart.

Note that for each subset $A \subseteq X$:

$$A \cap \mathring{B}(x, \rho_t) = \emptyset \iff d(x, A) \ge \rho_t \qquad (4.2)$$

and

$$A \cap B(y, \rho_{t-1}) \ne \emptyset \iff d(y, A) \le \rho_{t-1} \qquad (4.3)$$

Therefore if both conditions hold, then $|d(x, A) - d(y, A)| \ge \rho_t - \rho_{t-1}$, and we can say that $A$ testifies to the distance between $x$ and $y$. What is the probability that both conditions hold for a random set?

Assume that $|\mathring{B}(x, \rho_t)| < 2^t$ (otherwise the argument holds for $y$). Therefore the probability that a random set of size $\frac{n}{2^{t+1}}$ misses $\mathring{B}(x, \rho_t)$ is $\simeq 0.6$ (see footnote 7). On the other hand $|B(y, \rho_{t-1})| \ge 2^{t-1}$, and therefore the probability of the same set to intersect $B(y, \rho_{t-1})$ is $\simeq 0.22$ (this is the complement of the probability that the two sets are disjoint, which according to the previous footnote is $\le e^{-1/4} \simeq 0.78$). It follows that a random set of size $\frac{n}{2^{t+1}}$ has a constant probability to both intersect $B(y, \rho_{t-1})$ and miss $\mathring{B}(x, \rho_t)$.

By randomly selecting $q = O(\log n)$ sets of cardinality $2^l$, such that $2^l \approx \frac{n}{2^{t+1}}$, then, with high probability, at least $\frac{q}{10}$ of the sets chosen will intersect $B(y, \rho_{t-1})$ and miss $\mathring{B}(x, \rho_t)$. This is true for each pair $x, y$, and the same applies to $\rho_{\hat{t}}$ and $\rho_{\hat{t}+1}$. Therefore for almost every choice of $A_1, \ldots, A_q$:

$$\sum_{i=1}^{q} |d(x, A_i) - d(y, A_i)| \ge \frac{\log n}{10} (\rho_t - \rho_{t-1})$$

This step is repeated for every $l = 1, 2, \ldots, \lfloor \log n \rfloor - 1$ (practically, the lowest value of $l$ is defined by the value of $\hat{t}$, but adding more sets can only improve the mapping by decreasing the distortion). This gives a total of $Q = O(\log^2 n)$ sets for which:

$$\sum_{i=1}^{Q} |d(x, A_i) - d(y, A_i)| \ge \frac{\log n}{10} \sum_{i=1}^{\hat{t}+1} (\rho_i - \rho_{i-1}) = \frac{\log n}{10} \cdot \rho_{\hat{t}+1} = \log n \cdot \frac{d(x, y)}{40}.$$

The reverse inequality is obtained by observing that $|d(x, A_i) - d(y, A_i)| \le d(x, y)$ for every $A_i$: w.l.o.g. assume that $d(x, A_i) \ge d(y, A_i)$. Let $d(y, A_i) = d(y, z)$ for some $z \in A_i$. By definition, $d(x, A_i) \le d(x, z)$, hence

$$d(x, A_i) - d(y, A_i) \le d(x, z) - d(y, z) \le d(x, y)$$

Therefore,

$$\sum_{i=1}^{Q} |d(x, A_i) - d(y, A_i)| \le C \cdot \log^2 n \cdot d(x, y).$$

Thus, the mapping which maps every point $x$ to the vector $\left( \frac{d(x, A_i)}{Q} \mid i = 1, 2, \ldots, Q \right)$ embeds $(X, d)$ in an $O(\log^2 n)$-dimensional space with the $l_1$-norm and has distortion $O(\log n)$.

The same statement is satisfied for every norm $l_p$ where $p \ge 1$, by a proper normalization. Specifically, each $x$ is mapped to $\left( \frac{d(x, A_i)}{Q^{1/p}} \mid i = 1, 2, \ldots, Q \right)$. The correctness follows from the monotonicity of p-th moment averages.

Footnote 6: With a proper normalization (see proof).

Another theoretical result concerning effective embeddings is given in the next theorem. The theorem addresses the problem: given $n$ points in Euclidean space, what is the smallest $k = k(n)$ so that these points can be mapped into a k-dimensional Euclidean space such that all pairwise distances are contracted or expanded by a factor of at most $1 + \epsilon$?

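Footnote 7: The probability that a set of size $a$ and a set of size $b$ are disjoint is

$$\frac{\binom{n-a}{b}}{\binom{n}{b}} = \frac{(n-a)(n-a-1) \cdots (n-a-b+1)}{n(n-1) \cdots (n-b+1)} \le \left( \frac{n-a}{n} \right)^b \approx e^{-\frac{ab}{n}}$$

For sets of size $\frac{n}{2^{t+1}}$ and $2^t$ the corresponding probability is $\approx e^{-\frac{1}{2}} \simeq 0.6$.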


Theorem 2 - Johnson-Lindenstrauss Theorem [Johnson & Lindenstrauss 1984]: Any set of $n$ points in a Euclidean space can be mapped to $l_2^t$ where $t = O\left(\frac{\log n}{\epsilon^2}\right)$ with distortion $\le 1 + \epsilon$ for every $0 < \epsilon < 1$. Such a mapping may be found in random polynomial time.

Proof (sketch): Assume w.l.o.g. that the original space is n-dimensional. Then an orthogonal projection of the space on a random t-dimensional subspace almost surely produces the desired mapping: let $\{\vec{u}_i\}$ be the basis of the t-dimensional subspace. If $t = \Omega(\log n)$ then the length of the projected vector is strongly concentrated around $\sqrt{t/n}$ (see [Johnson & Lindenstrauss 1984, Frankl & Maehara 1988] for a proof; footnote 8). Therefore the desired embedding is obtained by mapping each sample point $\vec{x}$ to $\vec{y} = \sqrt{n/t} \, (\langle \vec{x}, \vec{u}_1 \rangle, \langle \vec{x}, \vec{u}_2 \rangle, \ldots, \langle \vec{x}, \vec{u}_t \rangle)$.
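The projection in the proof sketch can be illustrated in Python with NumPy. In this sketch a random t-dimensional subspace is obtained by orthonormalizing a Gaussian matrix with a QR decomposition, which is one common way to draw such a subspace and is an assumption here; the function name and the test dimensions are illustrative, and the scaling uses the ambient dimension of the input points.

```python
import numpy as np


def random_orthogonal_projection(X, t, rng=None):
    """Project the rows of X onto a random t-dimensional subspace, scaled by sqrt(dim / t)."""
    X = np.asarray(X, dtype=float)
    dim = X.shape[1]
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((dim, t))
    U, _ = np.linalg.qr(G)                  # columns: orthonormal basis u_1, ..., u_t
    return np.sqrt(dim / t) * (X @ U)       # coordinates <x, u_i>, rescaled


def pairwise_distances(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1))


rng = np.random.default_rng(1)
X = rng.standard_normal((50, 1000))         # 50 sample points in 1000 dimensions
Y = random_orthogonal_projection(X, 200, rng=rng)

upper = np.triu_indices(len(X), k=1)
ratios = pairwise_distances(Y)[upper] / pairwise_distances(X)[upper]
```

The entries of `ratios` cluster around 1, which is the concentration effect the proof relies on; how tightly they cluster depends on `t`.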

Several restrictions encountered in the proof should be taken into account in applying the embedding algorithm. To obtain an embedding with low distortion, the proof requires that $t$ should be at least $50 \cdot \log n$ and $n$ should satisfy $n > t^2$ (see [Frankl & Maehara 1988]). In numbers, it follows that $n$ should be at least 400,000, and the dimension of the subspace exceeds 600. Otherwise, the distortion may be higher than guaranteed.

From theorem 1 and the Johnson-Lindenstrauss theorem, the following corollary is obtained:

Theorem 3 (Bourgain): Every n-point metric space $(X, d)$ can be embedded in an $O(\log n)$-dimensional Euclidean space with an $O(\log n)$ distortion.

By this theorem it follows that the optimal distortion $C_2(X) = O(\log n)$ (examples given in [Linial et al. 1995] show that this bound is tight).

Another result from [Linial et al. 1995] suggests an alternative approach to obtaining an embedding with the same order of distortion. The embedding is carried out in two steps. The first is deterministic, and embeds the given set of $n$ points in Euclidean space of unknown dimension (assumed to be $n$), with distortion close to the optimal, $C_2(X)$ (the problem is rephrased as a positive definite programming problem so that the ellipsoid algorithm [Khachian 1979] can be invoked to find an $\epsilon$-approximation of $C_2$ in polynomial time). In the second step, the dimension is further reduced to $O(\log n)$ by applying the Johnson-Lindenstrauss theorem. The time complexity of this algorithm is polynomial, but of high order, which makes this approach unsuitable for large data sets.

Other results stated in [Linial et al. 1995] concern embeddings into $l_p$ for $p > 2$ and $2 \ge p \ge 1$.

Footnote 8: An equivalent claim is: let $\vec{x} \in R^n$ be a random unit vector. Then the square of its first coordinate, $x_1^2$, is concentrated around $\frac{1}{n}$. This claim can be verified by writing an explicit formula for the distribution and then estimating the relevant integral.


4.4 Application of embedding techniques for protein sequences

The application of Theorem 1 to "real world" data has a practical limitation. The difficulty stems from the need to compute coordinates associated with large A's. In practice, when applied to the space of all protein sequences, the large sets should contain tens of thousands of sequences. Calculating the distance from these sets involves more than $10^{10}$ pairwise comparisons, a heavy computational load, even with special-purpose hardware for pairwise comparison. Therefore, a straightforward implementation of the algorithm is not feasible.

Basically, given the matrix of all-against-all distances, one can extract the distances from the various groups simply by scanning this matrix. However, the size of the matrix is unmanageable when all known protein sequences are considered. One possible approach to resolving this problem is to go through a preliminary phase in which each group of very similar sequences (identified by either a compositional similarity search or a fast sequence similarity search) is represented by a single sequence. Then the matrix is calculated only for those representative sequences. One could also use probabilistic methods to obtain a reliable sampling of the sets, and thus reduce computation time.

The approach we have taken is to discard the coordinates associated with large A's. We assume that this can be done without a significant distortion of the structure of the original space. As the proof of theorem 1 shows, the effect of discarding these sets is the contraction of distances between the images of the original points. Note that for each two sample points, the value of $\hat{t}$ defines the cardinality of the sets that have a high probability of satisfying conditions 4.2 and 4.3 and thus are important for reconstructing the original distance during the embedding process. The smaller $\hat{t}$ is, the larger are the sets expected to satisfy the conditions.

If $x$ and $y$ are close then $\hat{t}$ is small and consequently the large sets are the major contributors to the distance between the images of $x$ and $y$. By discarding the large sets the distance between the images decreases, but since the original distance between the points is small, this may be a welcome effect when clustering is sought (footnote 9).

A possible problem may arise when $x$ and $y$ are far apart. However, for such points $\hat{t}$ tends to be bigger, and the small sets are important as well. Therefore, the distance shrinks as a result of discarding the large sets, but evidence for the original distance is kept by summing the distances from the small sets.

When applied to the space of $n = 543{,}627$ protein segments of length 50 in the SWISSPROT database (see section 4.1 for the definition of this metric space), the algorithm defines the dimension of the host space to be $\lfloor \log n \rfloor^2 = \lfloor \log 543{,}627 \rfloor^2 = 19^2 = 361$. That is, the embedding is obtained by selecting at random 19 sets of size $2^0 = 1$, 19 sets of size $2^1$, and so on, for every power of 2, up to 19 sets of size $2^{18}$. Practical considerations (computation time) have limited the size of the largest set accounted for to $2^{10}$, and therefore the dimension of the host space was set to $19 \cdot 11 = 209$.
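The dimension count can be verified with a short Python sketch; the variable names are illustrative, and the logarithm is taken to base 2 as noted after theorem 1.

```python
import math

n = 543_627                                   # number of protein segments
sets_per_size = math.floor(math.log2(n))      # 19 random sets for every set size

# Full construction: sizes 2^0, 2^1, ..., 2^18 (19 different sizes).
full_sizes = [2 ** l for l in range(sets_per_size)]
full_dimension = sets_per_size * len(full_sizes)

# Truncated construction used in practice: largest set size limited to 2^10.
kept_sizes = [size for size in full_sizes if size <= 2 ** 10]
reduced_dimension = sets_per_size * len(kept_sizes)
```

The eleven retained sizes (1 through 1024) account for the 209 coordinates quoted above.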

Further dimensionality reduction, by discarding even smaller sets, would increase the average distortion. This is shown in Fig. 4.2, where the average distortion of all pairwise distances between 5000 randomly chosen segments is plotted vs. the logarithm of the size of the largest set accounted for in the embedding process.

Footnote 9: Though it is not perfectly clear how the decrease affects the original structure.

Figure 4.2: Average distortion vs. the size of the largest set selected by the embedding algorithm. (Axes: log(A) vs. distortion.)

Figure 4.3: (a) Distribution of dynamic programming (DP) distances, with normal approximation. (b) Distribution of Euclidean distances. (Axes: distance vs. relative frequency.)

It is interesting to compare the distributions of distances before and after the embedding (Fig. 4.3). Surprisingly, the distribution of Euclidean distances (after embedding) is very different from the smooth distribution of the dynamic programming (DP) distances (before embedding) that was characterized in section 4.1.

The distribution of Euclidean distances sharpens as larger sets are included in the embedding, i.e. as the dimension of the host space increases (see Fig. 4.4). The distribution starts as a unimodal distribution but changes into a multi-modal distribution. The sharpening of the distribution, which becomes more and more distinct from the original distribution of DP distances, may seem surprising in view of the decrease in the total distortion (Fig. 4.2), since these two trends seem to be contradictory. However, note that the distribution is centered further to the right (closer to the distribution of DP distances) as the dimension increases, which explains the decrease in the total distortion.

Figure 4.4: Distribution of Euclidean distances for different dimensions of the host space. (Panels: dimensions 19, 38, 57, 76, 95, 114, 133, 152, 171, 190 and 209; axes: distance vs. relative frequency.)

The differences between the distributions before and after the embedding process are even more evident with single sequences (Fig. 4.5). While the distributions of the original DP distances are almost identical, the distributions of the distances between the images of the protein sequences clearly differ from each other: each distribution has a peak centered at a different location and has a different spread.

Currently, we have no explanation for the differences between the distributions (Fig. 4.3 and Fig. 4.5). The differences may suggest that the Euclidean distances encode a geometric structure which differs from the structure that is encoded by the original DP pairwise distances. However, in general this should not surprise us, since even with optimal embedding the embedded space is not guaranteed to have zero distortion. Therefore, the Euclidean geometry of the embedded space is not expected to be an exact reflection of the original structure. Actually, for some metric spaces it is possible to prove that the optimal embedding in Euclidean space has a finite non-zero distortion. For example, take the graph metric of the graph $K_{3,1}$, that is, 4 points $y, x_1, x_2, x_3$, so


[Figure 4.5 (plots not reproduced): two panels, (a) "Distribution of DP distances for selected proteins" and (b) "Distribution of Euclidean distances for selected proteins". Curves are shown for the segments 1433_horvu (213-262), 143f_bovin (26-75), a1i3_rat (501-550), 120k_ricri (76-125) and 5h2a_human (76-125); x-axis: distance, y-axis: relative frequency.]

Figure 4.5: Distribution of DP distances (a) and distribution of Euclidean distances (b) for 5 selected proteins. For each protein the distribution of its distances from 5000 randomly chosen segments is shown, both before and after the embedding process.

that $d(y, x_i) = 1$ and $d(x_i, x_j) = 2$ for $i \neq j$. It is easy to see that this metric cannot be embedded isometrically in Euclidean space (regardless of the dimension). Since $d(x_1, y) + d(x_2, y) = d(x_1, x_2)$, in any Euclidean realization of this metric $y$ must lie on the line segment between $x_1$ and $x_2$. The same argument shows that $y$ must lie on the segment between $x_1$ and $x_3$, thus all four points must lie on a line. But then we have no choice but to map $x_2$ and $x_3$ to the same point, hence the distance between this pair is not preserved. Obviously this argument is independent of the dimension of the Euclidean space. In this view, for a complex space such as the protein sequence space, we should probably expect some non-zero distortion even with optimal embedding. Theorem 3 provides an upper bound of $C_2(X) = O(\log n)$ on the distortion of the optimal embedding (footnote 10), but the actual distortion differs from one metric space to another.
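The non-embeddability of this four-point metric can also be checked numerically. The following minimal sketch (not part of the original analysis; the function and variable names are illustrative assumptions) uses the standard fact that if the distances were realized by points in a Euclidean space with $y$ placed at the origin, then the matrix $G_{ij} = \frac{1}{2}\left(d(y,x_i)^2 + d(y,x_j)^2 - d(x_i,x_j)^2\right)$ would be a matrix of inner products, so that $v^t G v \geq 0$ for every vector $v$. For the metric above the quadratic form is negative for $v = (1,1,1)$, so no Euclidean realization exists in any dimension.

```python
def gram_from_distances(d_to_base, d_pairs):
    """Inner products <x_i - y, x_j - y> implied by the distances,
    if a Euclidean realization existed (law of cosines)."""
    m = len(d_to_base)
    return [[(d_to_base[i] ** 2 + d_to_base[j] ** 2 - d_pairs[i][j] ** 2) / 2.0
             for j in range(m)] for i in range(m)]


def quadratic_form(gram, v):
    """Compute v^t G v."""
    m = len(v)
    return sum(v[i] * gram[i][j] * v[j] for i in range(m) for j in range(m))


# Star metric: d(y, x_i) = 1 and d(x_i, x_j) = 2 for i != j.
d_to_y = [1.0, 1.0, 1.0]
d_xx = [[0.0, 2.0, 2.0],
        [2.0, 0.0, 2.0],
        [2.0, 2.0, 0.0]]

G = gram_from_distances(d_to_y, d_xx)
value = quadratic_form(G, [1.0, 1.0, 1.0])
print(value)  # negative, so G cannot be a matrix of inner products
```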

It is intriguing to ask whether the geometry of the Euclidean space (as reflected by the distance distributions) reveals a genuine high order global structure. No such structure is obvious from the distributions of the original DP pairwise distances (footnote 11). In other words, does the embedding process detect regularities which were not evident before the embedding? The intuition is simple: In Euclidean space the position of the image of a specific protein segment should reflect its relative distance from all the other protein segments, and the total number of constraints grows as the square of the number of points. The original distance measure may not be very informative and cannot distinguish far from very far. Even so, when all constraints are met together, the collective may reveal hidden regularities. Therefore, in principle, it is possible that this procedure recovers geometric information from the proximity data. We found some examples (see chapter 6) which suggest that such regularities (which were not obvious from the original distance/similarity measure) are detected, but the results are not conclusive, and further work is needed to verify this.

Footnote 10: The embedding algorithm that we applied is not guaranteed to produce an optimal embedding, but it almost surely produces "good embeddings" with distortion that is bounded by $O(\log n)$.

Footnote 11: The distributions alone, however, cannot rule out the existence of high order structure.

Chapter 5

Data clustering

After the protein sequences are embedded in Euclidean space, we further analyze the sequence space and look for statistical regularities in the Euclidean space. Specifically, a statistical clustering model of the sequences is constructed.

This chapter focuses on methods for clustering statistical data. It starts with a very brief introduction to the notation used in cluster analysis, classical clustering algorithms, and validation schemes. Then it describes our clustering algorithm. By incorporating a cross-validation scheme, this algorithm is expected to construct a reliable model of the data with high probability. Cluster validity is measured by means of a few different statistical tests, which are described next.

5.1 Introduction

The term clustering is used to describe techniques for organizing patterns into groups (clusters) so that patterns within a cluster are more similar to each other than to patterns in other clusters. By grouping data points into clusters or by organizing data as a hierarchy of groups, cluster analysis attempts to recover the inherent structure within a given data set. Clusters are expected to detect the existence of different types or features in the data. Data clustering is one of the fundamental problems in learning, and self organization of data is a key step in any learning task.

As opposed to supervised learning, where the samples are labeled, and the goal is to design a "good" classifier (e.g., a discriminant function that assigns each sample to its correct class), in cluster analysis the samples are unlabeled and the goal is to explore the structure of the data without any assumptions regarding the type and the structure of the model (e.g., the functional form of the classifier or the type of the underlying distributions). In the literature, cluster analysis is often termed unsupervised learning.

A clear distinction should be made between the two different roles of clustering. When a concise representation of data is sought (i.e., compression), data should be clustered so as to minimize certain global distortion measures, regardless of the actual meaning and significance of the cluster centroids (this procedure is often called Vector Quantization). A different requirement is met when the data set is a sample of a larger sample space, and clusters are to serve as a reliable model for generalization, i.e., new samples are expected to be assigned correctly to these clusters. In this case, great care should be taken not to overfit the model to the training data. One of the major goals of computational learning theory is to provide conditions under which good generalization can be derived from a small number of samples (see [Kearns & Vazirani 1994]).

5.1.1 Basic definitions

There is no single definition of "cluster" which is universally accepted, and any one of the following common definitions [Jain & Dubes 1988] may apply: (i) a set of entities which resemble one another more than they resemble other entities, (ii) a set of patterns such that the maximum intra-pattern distance is smaller than the minimum inter-pattern distance, for patterns not in the cluster, (iii) a region of relatively high density in the pattern space separated from other patterns by regions of low density, (iv) a compact well-separated set of points.

We are interested in clustering algorithms that partition a finite set of objects into clusters, based on proximities between pairs of objects. An object is usually assumed to be represented as a point in a d-dimensional metric space $R^d$ (measurement space), and a measure of similarity/dissimilarity can be the (Euclidean) distance between the points. In the course of this discussion, we assume that sample points reside in a Euclidean space (footnote 1), and from a geometric point of view, clusters can be viewed as "clouds" of points in a d-dimensional space, with different shapes and orientations.

A formal definition of clustering was suggested in [Wang et al. 1990]: Given n samples $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in R^d$, find an integer $k$, a set of representatives $Y = \{y_1, y_2, \ldots, y_k\}$, $y_j \in R^d$, and a function $M : X \to Y$ which assigns a representative to each sample point, so that $k$, $Y$ and $M$ are optimal.

This definition, however, is very general and avoids the main difficulty by requiring optimality without specifying when a clustering is considered optimal, and without addressing the means by which a clustering should be evaluated.

5.1.2 Clustering methods

Clustering methods can differ in several ways. For example, they can differ by the characteristics of the input data. As mentioned above, in some cases only the pairwise distances/similarities are given (proximity matrix), while in other cases the samples are points in d-dimensional real space (pattern matrix). This chapter focuses on algorithms of the second type. Algorithms for clustering data of the first type are often called pairwise clustering and are briefly discussed in chapter 7. Clustering algorithms also differ by the availability of the data, i.e., on-line (the samples are introduced one by one) vs. batch (all samples are stored in memory and are accessible at any stage of the algorithm). On-line algorithms are important when the size of the input data becomes unmanageable by batch algorithms, in terms of memory and CPU resources.

Footnote 1: In some cases the objects are represented only by their proximity data. However, as was shown in [Linial et al. 1995], any finite metric space can be faithfully embedded into a Euclidean space of low dimension, and with small distortion (see chapter 4). This embedding into a Euclidean space enables us to utilize some of the features of the Euclidean metric (see discussion in section 4.2). Therefore, large parts of this discussion can be modified to fit any metric space.

Another distinction is made between exclusive (hard) and non-exclusive (soft) clustering. In hard clustering, each point belongs to exactly one cluster (as defined in the previous section), while in soft (fuzzy) clustering each point belongs to all clusters with finite probabilities. For example, we can model the clusters by a mixture of c normal distributions, one for each class, so that the probability of a sample $\vec{x}$ is

$$p(\vec{x}) = \sum_{j=1}^{c} p(\vec{x} \mid w_j)\, p(w_j)$$

where $p(w_j)$ is the prior probability for cluster $j$ and $p(\vec{x} \mid w_j)$ is estimated by $N(\vec{\mu}_j, \Sigma_j)$, the normal distribution associated with the j-th cluster, where $\vec{\mu}_j$ is the mean vector and $\Sigma_j$ is the covariance matrix (footnote 2). In the following sections, the discussion assumes a hard clustering model, but it can be easily generalized to the fuzzy clustering model.
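As a small illustration of the soft model, the sketch below evaluates the mixture density $p(x)$ and the resulting cluster membership probabilities $p(w_j \mid x)$ (by Bayes' rule) for one-dimensional normal components. The one-dimensional restriction, the function names and the parameter values are assumptions made for brevity; they are not taken from the text.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(x, priors, means, variances):
    """p(x) = sum_j p(x | w_j) p(w_j)."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances))


def memberships(x, priors, means, variances):
    """Soft assignment p(w_j | x) = p(x | w_j) p(w_j) / p(x)."""
    joint = [p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
    total = sum(joint)
    return [value / total for value in joint]


# Two hypothetical clusters with equal priors.
priors = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

soft = memberships(1.0, priors, means, variances)
hard = soft.index(max(soft))  # hard clustering keeps only the most probable cluster
print(soft, hard)
```

A hard clustering is recovered from the soft one by keeping, for each sample, only the cluster with the largest membership probability.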

Finally, clustering algorithms are also distinguished by the algorithm "scheme", i.e., hierarchical vs. partitional clustering. Details will follow.

5.1.3 Partitional clustering algorithms

In the classical clustering schemes one usually selects a clustering criterion or an objective function (e.g., global distortion) and sets the desired number of clusters $k$. Then one tries to optimize the objective function by searching for the best configuration (the best partition of sample points into clusters).

For example, given n samples $x_1, x_2, \ldots, x_n$, where $x_i \in R^d$, and an initial partition of $x_1, x_2, \ldots, x_n$ into sets $C_1, C_2, \ldots, C_k$, the squared error clustering defines the objective function as follows: Define the center of the j-th cluster as the average of the samples that were classified to the j-th set, i.e.,

$$m_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i$$

where $n_j$ is the number of samples that were classified to the j-th set. The center is the representative of the cluster. The squared error for cluster $j$ is defined by:

$$e_j^2 = \sum_{x_i \in C_j} (x_i - m_j)^t (x_i - m_j)$$

and the total squared error for the clustering is given by

$$E_k^2 = \sum_{j=1}^{k} e_j^2 \qquad (5.1)$$

Footnote 2: A priori there is no reason to assume that the underlying distributions are normal. However, by assuming a mixture of normal distributions it is possible to approximate quite complicated sample distributions. Actually, by increasing c, the number of component densities, and by allowing it to be unlimited, it is possible to approximate any density function [Duda & Hart 1973].


This error function favors spherical distributions of points around the centers (in other words, the samples are equally distributed in all directions), because the error depends only on the distance from the center. General ellipsoid distributions can be approximated as well, by weighting the coordinates, for example, when a priori information about the distribution of points within clusters is available, or by assuming a model for the underlying distribution (e.g., normal) and estimating the parameters of the distribution empirically.
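The quantities defined above (the cluster centers $m_j$, the per-cluster errors $e_j^2$ and the total squared error $E_k^2$ of equation 5.1) can be computed directly from a partition. The sketch below is a plain-Python illustration; the representation of a partition as a list of clusters, each a list of coordinate tuples, and all function names are assumptions made here.

```python
def center(cluster):
    """m_j: coordinate-wise average of the samples in one cluster."""
    n_j = len(cluster)
    dim = len(cluster[0])
    return [sum(x[a] for x in cluster) / n_j for a in range(dim)]


def cluster_squared_error(cluster):
    """e_j^2: sum of squared Euclidean distances to the cluster center."""
    m = center(cluster)
    return sum(sum((x[a] - m[a]) ** 2 for a in range(len(m))) for x in cluster)


def total_squared_error(partition):
    """E_k^2: sum of e_j^2 over all clusters of the partition."""
    return sum(cluster_squared_error(c) for c in partition)


# Hypothetical partition of five 2-dimensional samples into k = 2 clusters.
partition = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0), (10.0, 14.0)],
]
print(total_squared_error(partition))
```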

The best partition into clusters is the one which optimizes the objective function (e.g., minimal squared error). Theoretically this optimization problem can be solved by evaluating the objective function for all possible partitions. Practically, the size of the search space (number of possible configurations) for n sample points and k clusters is (at least) exponentially large. Therefore, an exhaustive enumeration is impossible, and the search techniques that are usually employed are either iterative, "hill-climbing" techniques or are based on sampling the search space.
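To make the size of the search space concrete: the number of ways to partition n labeled samples into exactly k non-empty clusters is the Stirling number of the second kind, $S(n,k)$, which satisfies the recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$ and grows roughly like $k^n/k!$. The short sketch below (an illustration added here, with a hypothetical function name) evaluates it.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of n labeled items into k non-empty clusters."""
    if n == k:
        return 1  # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


for n in (10, 20, 40):
    print(n, stirling2(n, 3))
```

Already for n = 40 samples and k = 3 clusters the count is on the order of $3^{40}/3! \approx 2 \times 10^{18}$, which rules out exhaustive enumeration.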

5.1.4 The basic clustering procedure

An example of a simple iterative partitional clustering scheme is given in the following algorithm:

• Select an initial partition into k clusters.

• Loop:

– (I) Compute the center of each cluster as the average of all samples assigned to it.

– (II) Assign each sample to the closest cluster (the one whose center is closest). In this way a new partition is generated.

– (III) If cluster memberships change, go to Loop; otherwise stop.

For the squared error function, the optimal set of representatives is the cluster centers, and this scheme ensures that the global distortion decreases or remains unchanged with every iteration. However, the centers may not be the best choice in general, and the selection of the optimal set of representatives given the assignments (step I) depends on the error function.
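The iterative scheme above can be sketched in a few lines of plain Python. This is a minimal illustration of steps (I)-(III) for the squared error criterion, not the implementation used in this work. The function names, the use of initial centers to induce the initial partition, the handling of empty clusters (the old center is kept) and the example data are all assumptions.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(samples, centers):
    """Step (II): index of the closest center for every sample."""
    return [min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
            for x in samples]


def recompute_centers(samples, labels, old_centers):
    """Step (I): average of the samples assigned to each cluster."""
    centers = []
    for j, old in enumerate(old_centers):
        members = [x for x, lab in zip(samples, labels) if lab == j]
        if not members:  # assumption: an empty cluster keeps its old center
            centers.append(list(old))
            continue
        dim = len(old)
        centers.append([sum(x[a] for x in members) / len(members) for a in range(dim)])
    return centers


def basic_clustering(samples, initial_centers, max_iter=100):
    centers = [list(c) for c in initial_centers]
    labels = assign(samples, centers)  # initial partition induced by the centers
    for _ in range(max_iter):
        centers = recompute_centers(samples, labels, centers)
        new_labels = assign(samples, centers)
        if new_labels == labels:  # Step (III): memberships unchanged, stop
            break
        labels = new_labels
    return centers, labels


samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = basic_clustering(samples, initial_centers=[(0.0, 0.0), (1.0, 0.0)])
print(centers, labels)
```

Running the same function from a different pair of initial centers is a simple way to observe the dependence on the starting point.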

Over the years many clustering algorithms have been proposed (see [Duda & Hart 1973] and [Jain & Dubes 1988] for a survey of this field). The basic clustering algorithms such as K-means or Isodata classify the sample set into a fixed number of clusters, following the classical scheme described above. However, a major problem with these algorithms is that they are very likely to become trapped in a local minimum. Also, their output may vary a great deal, depending on the starting point. The objective function may also significantly affect the output. Obviously there is no single best objective function for obtaining a partition. Each criterion imposes a certain structure on the data, but the true clusters are not necessarily recovered. This is especially true when the real number of clusters k is unknown, and one has to choose it arbitrarily.


5.1.5 Hierarchical clustering algorithms

Hierarchical clustering schemes can minimize the sensitivity to different initial parameters. Some authors (e.g., [Duda & Hart 1973]) use the term "hierarchical clustering" to describe techniques for obtaining a sequence of nested groupings (footnote 3). However, many clustering algorithms apply hierarchical schemes which do not necessarily result in nested groupings. For example, a hierarchical scheme is employed by the LBG algorithm [Gray et al. 1980], which starts from a single cluster, and then, iteratively, doubles the number of clusters by splitting the clusters from the last round. At each iteration a hill-climbing procedure, as described above, is applied to optimize the partition. In some applications the hierarchical structure itself is correlated with features of the data.

There are many hierarchical clustering algorithms (see [Duda & Hart 1973] and [Jain & Dubes 1988]), among which we chose to describe in detail the statistical mechanics approach.

The statistical mechanics approach

A hierarchical clustering algorithm based on an analogy with physical systems was proposed by [Rose et al. 1990]. Their deterministic annealing algorithm applies concepts from statistical mechanics, such as free energy and entropy, to the clustering problem. The data set is viewed as a physical system, where a specific (fuzzy) clustering partition, with a given set of representatives and a total distortion, is equivalent to a configuration of the system whose total free energy $F = E - TS$ is defined in terms of the distortion $E$ (energy), and the entropy $S$ of the distribution of partitions (configurations) with average distortion $E$. $T$ is the temperature. It is well known that physical systems are likely to adopt the configuration with the minimal free energy. The optimal configuration may change when the temperature of the system changes, and the degree of order within the system increases as the temperature decreases (in other words, as the entropy decreases): At zero temperature the (unique) configuration with the minimal energy $E$ is also the configuration with the minimal free energy $F$. At temperature $T > 0$ there is no unique configuration. Rather, the "state" of the system is described in terms of a distribution over all possible configurations, a distribution that depends on the temperature. Consequently, in the original clustering problem the samples have different probabilities of being assigned to the different clusters (hence the fuzziness). One of the main advantages of this approach is that no assumption is made about the distribution of sample points. Rather, the distribution is determined by the maximum entropy principle.

By an annealing process this approach aims to avoid local minima and to obtain an optimal configuration. The process starts from a configuration at a very high temperature, where all samples form what is essentially a single large cluster. A cooling schedule is followed, in which the temperature is lowered gradually and a solution is found at each temperature. The process traces the system through a series of phase transitions, each of which corresponds to a solution at a different temperature (meaning, by analogy, a different scale or resolution of the analysis). For more details see [Rose et al. 1990].

Footnote 3: Most pairwise clustering algorithms follow this hierarchical scheme (see chapter 7 for details).
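As a rough illustration of how the temperature controls the fuzziness of the assignments, the sketch below computes Gibbs-type association probabilities $p(j \mid x) \propto \exp(-\|x - y_j\|^2 / T)$ for fixed representatives $y_j$. This functional form, with the squared Euclidean distortion, is the one commonly associated with deterministic annealing and is stated here as an assumption; see [Rose et al. 1990] for the actual formulation. The function name and the data are hypothetical.

```python
import math


def association_probabilities(x, representatives, temperature):
    """Probability of assigning sample x to each representative at temperature T."""
    energies = [sum((xi - yi) ** 2 for xi, yi in zip(x, y)) for y in representatives]
    lowest = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


representatives = [(0.0, 0.0), (4.0, 0.0)]
x = (1.0, 0.0)

for T in (1000.0, 10.0, 1.0, 0.1):
    print(T, association_probabilities(x, representatives, T))
```

At high temperature the probabilities are nearly uniform (all samples effectively share a single fuzzy cluster); as T decreases they approach the hard nearest-representative assignment.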

5.1.6 Cluster validation

Most clustering methods do not suggest the "correct" number of clusters (classes) within the data set. Obviously, this number depends on the scale of the analysis, and multiple scale analyses (as in the methods of the previous paragraphs) may help in revealing the structure. However, these methods do not have a natural criterion by which to stop the process of splitting data points, and usually an external, user-specified parameter is used (e.g., when the number of clusters reaches a certain threshold). Consequently, the recovered structure does not necessarily fit the "real" structure of the data, and its validity is questionable. The problem of cluster validity is particularly important when high generalization power is desired, i.e., when the cluster structure is expected to be capable of "explaining" new samples.

Indices of validity

Many indices of validity have been proposed in the literature (see [Jain & Dubes 1988] for a discussion). There are three main types of validation: (i) External validation compares the recovered structure to an a priori structure, and tries to quantify the match between the two. For example, the following index can be used to assess the match between two clusterings of n patterns: There are $n(n-1)/2$ pairs of samples. Denote by $n_1$ the number of pairs that are in the same cluster in both clusterings, and denote by $n_2$ the number of pairs that are in different clusters in both clusterings. Then the Rand index [Rand 1971] measures the degree of match:

$$R = (n_1 + n_2) \Big/ \binom{n}{2}$$

To determine if the calculated index for the recovered structure is "unusual", or surprising, one must obtain (by theory or simulation) the baseline distribution of the index for random data. A clustering is termed valid if the index is unusually high, as measured with respect to the corresponding baseline distribution. However, establishing the baseline distribution requires purely random data that match the characteristics of the real data, or extensive statistical sampling. Moreover, this test can be applied only when a prior structure is known. (ii) An internal validation test seeks to determine if the structure is intrinsically appropriate for the data. This may be done, for example, by measuring the compactness and isolation of clusters (footnote 4) and determining if the result is extremely unusual. This is achieved by comparing the recovered structure to a structure that is obtained by applying the same algorithm to a set of random data of the same size. As in the case of external tests, this approach requires knowledge of the baseline distribution of the index for random data. However, deriving the baseline distribution for internal tests is more difficult than for external tests, because the clusters depend on the clustering method itself. Furthermore, in order

Footnote 4: Such measures can be easily defined when the data set is represented as a graph. See chapter 7 for more details.


to determine if the recovered structure is surprising, it should be compared to the best clustering of random data. Therefore, only if the same result could not be obtained for any set of random data could the structure reasonably be considered valid. Practically, these requirements complicate the application of internal validation tests. (iii) A relative test compares two different structures (e.g., clustering structures for k=4 clusters and for k=5 clusters) and measures their relative merit. Such tests can also be applied in order to compare results of alternative methods. For example, the Calinski-Harabasz index, $CH(k)$, was the best of 30 indices tested on synthetic data by Milligan and Cooper [1985]. For clustering n samples into k clusters, with squared error $E_k^2$ (as in equation 5.1), the index is defined by:

$$CH(k) = \frac{n - k}{k - 1} \left( \frac{E_1^2}{E_k^2} - 1 \right)$$

and the value of k which maximizes $CH(k)$ is chosen as the real number of clusters. This index performed well for synthetic data. However, there is no guarantee that this index or others will be optimal for real data, and the characteristics of the data can affect performance in unknown ways.
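Both indices quoted above are simple to compute. The sketch below is an illustration with hypothetical function names; clusterings are represented as lists of labels, and the squared errors passed to `calinski_harabasz` are assumed to have been computed as in equation 5.1.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """R = (n1 + n2) / C(n, 2) for two clusterings of the same n samples."""
    n = len(labels_a)
    agreeing = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:  # together in both (n1) or apart in both (n2)
            agreeing += 1
    return agreeing / (n * (n - 1) / 2)


def calinski_harabasz(n, k, e1_squared, ek_squared):
    """CH(k) = (n - k) / (k - 1) * (E_1^2 / E_k^2 - 1), for k >= 2."""
    return (n - k) / (k - 1) * (e1_squared / ek_squared - 1.0)


a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 2, 2]
print(rand_index(a, b))
print(calinski_harabasz(n=6, k=2, e1_squared=100.0, ek_squared=10.0))
```

Note that the value of either index says little on its own; as discussed above, it has to be judged against a baseline distribution or against the values obtained for alternative structures.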

In general, there is no natural definition according to which the quality and validity of the clustering profile can be assessed, and external or relative parameters are used to adjust the number of clusters. Another approach was proposed by [Rissanen 1989], using the minimum description length (MDL) criterion, where the total clustering cost takes into account the distortion as well as the model's complexity (which is a function of the number of parameters needed to describe the model). However, dependency on external parameters makes these approaches too sensitive to perturbations in the data or in the parameters. Consequently, these indices do not provide a useful framework for minimizing the generalization error.

Cross-validation

A robust approach for estimating and minimizing the generalization error is based on cross-validation, i.e., testing the parameters of clusters against independent validation data [Hjorth 1994]. Statistical tests can be applied to verify that the "match" between the parameters of the two sets is statistically significant. It is well known that overfitting the model to the training data can be avoided via cross-validation. Such a validation scheme provides an objective assessment of a clustering structure. It can help to determine whether a structure is meaningful, and cannot reasonably be assumed either to have occurred by chance, or to be an artifact of the clustering algorithm. The hierarchical clustering algorithm presented in the next section is based on cross-validation.

5.2 Our clustering algorithm

Our algorithm addresses the major problems common to many clustering algorithms: determining the right number of clusters, the danger of local minima, and overfitting. The clustering algorithm is hierarchical, which makes it robust, as it is insensitive to the starting point. At each level of the hierarchy a basic clustering procedure is applied to partition the data into a fixed number of clusters, and the initial points for the centroids are defined based on the results of the prior stage.

The validity of our clustering is monitored by splitting the data into two random subsets, and applying the algorithm independently to each set, with the requirement that the clusters in the two sets "agree" at every level of the clustering hierarchy. Thus the model's generalization power is closely monitored, so as to avoid the common pitfall of overfitting the model to the training data.

The underlying assumption is that patterns that are common to two large independently chosen subsets of the same original set capture true features of the original space. A statistically significant correspondence between two independent samples implies a tight upper bound on the probability that these two independent cluster sets disagree on the classification of new independent points, and likewise, a bound can be obtained on the generalization error of the model (see [Vapnik 1982]).

5.2.1 Algorithm outline

Our approach to the clustering problem resembles the familiar hierarchical vector quantization (VQ) algorithm [Gray et al. 1980, Gray 1984]: Each data point is associated with the nearest centroid, then the centroids are re-estimated to minimize the distortion within each cluster. This process is repeated until convergence to a (possibly local) minimum of distortion. To reduce the dependency on initial conditions, the process begins with a single cluster. Subsequently, at each iteration the cluster with the highest aspect ratio is split. This step is driven by the intuition that underlies the deterministic annealing approach, where clusters split at a temperature which depends on the distribution of points within the cluster. Namely, the split temperature is proportional to the reciprocal of the largest eigenvalue of the covariance matrix (footnote 5). The split happens along the principal direction (by replicating the centroid and allowing the two copies to be positioned on this axis).

5.2.2 Cross validation

The model's generalization power is monitored by running the algorithm independently on two randomly chosen subsets of the data. At each level of the hierarchy we cross-validate the clusters in one part with the clusters in the other, and abort every split on which the two processes "disagree".

Agreement entails a one-to-one correspondence between clusters of the first set and those of the second set, where corresponding clusters have

(i) nearly equal sizes

(ii) nearly identical centroids

(iii) similar distributions of points

A split that fails these criteria gets aborted (i.e. the previous configuration is recovered), and a split is performed on the cluster with the next-longest principal axis. The process terminates when

Footnote 5: Though this relation was shown in [Rose et al. 1990] only for a single cluster (i.e., no long-range interactions), the qualitative characteristics remain when more than one cluster is present [Rose et al. 1990].


all attempted splits get aborted. This criterion is clearly very strict, and more relaxed criteria for the matching can be examined as well. The algorithm sketch and the conditions for agreement are defined explicitly below.

5.2.3 Algorithm sketch

Initialization

• Randomly split the data into two disjoint subsets of equal size.

• Calculate the centroid and covariance of each of the subsets.

Loop Given $k \geq 1$ clusters in each subset which are in mutual agreement as defined above. For each subset do

• Step 1: choose a cluster to split (the criterion: the cluster with the largest SVD principal component) and split the cluster along the corresponding direction (replace its centroid with two copies of it which are almost the same, except for low random noise in that direction). Now there are k+1 initial centroids.

• Step 2: estimate the parameters of the k+1 clusters (apply the K-means algorithm with the k+1 initial parameters).

• Calculate the covariance matrix, and define the principal directions for each cluster, based on singular value decomposition (SVD) of the covariance matrix.

• Check whether there is an agreement between the clusters of the first set and the clusters of the second set, according to the three conditions that are stated above. If there is an agreement, continue to step 1; otherwise, return to the previous configuration (validated k clusters), split the cluster with the next longest principal component, and continue to step 2. If there is no cluster to split, abort.
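Step 1 of the sketch above can be illustrated as follows. The code estimates the principal direction of a cluster by power iteration on its covariance matrix (used here in place of a full SVD purely to keep the example dependency-free) and replaces the centroid by two copies displaced slightly along that direction. The function names, the fixed offset `eps` (standing in for the "low random noise") and the iteration count are assumptions of this sketch, not details taken from the text.

```python
import math


def mean(points):
    dim = len(points[0])
    return [sum(p[a] for p in points) / len(points) for a in range(dim)]


def covariance(points):
    m = mean(points)
    dim = len(m)
    n = len(points)
    return [[sum((p[a] - m[a]) * (p[b] - m[b]) for p in points) / n
             for b in range(dim)] for a in range(dim)]


def principal_direction(cov, iterations=200):
    """Leading eigenvector and eigenvalue of a covariance matrix (power iteration).

    Assumption: the start vector is not orthogonal to the leading eigenvector.
    """
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # degenerate cluster with no spread
            return v, 0.0
        v = [c / norm for c in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(dim) for b in range(dim))
    return v, eigenvalue


def split_centroid(points, eps=1e-3):
    """Return two initial centroids displaced along the principal direction."""
    m = mean(points)
    direction, _ = principal_direction(covariance(points))
    plus = [mi + eps * di for mi, di in zip(m, direction)]
    minus = [mi - eps * di for mi, di in zip(m, direction)]
    return plus, minus


# Hypothetical elongated cluster, spread mostly along the first coordinate.
cluster = [(-4.0, 0.1), (-2.0, -0.1), (0.0, 0.0), (2.0, 0.1), (4.0, -0.1)]
direction, variance = principal_direction(covariance(cluster))
print(direction, variance)
print(split_centroid(cluster))
```

The two returned centroids, together with the unchanged centroids of the other clusters, would then serve as the k+1 initial centroids for Step 2.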

5.2.4 Conditions for agreement

Bounding the deviation in clusters' size

Suppose that we have a cluster of k data points. When we split the data into two random subsets, this cluster would split into two clusters of approximately equal size, $S_{big} = \frac{k}{2} + \epsilon$ and $S_{small} = \frac{k}{2} - \epsilon$. What deviation $\epsilon$ should we expect simply by chance?

If N is the total number of data points, then there are at most $N/k$ clusters of size k. Based on the law of large numbers, we posit that the probability of a cluster of size k satisfying condition A (i.e. a deviation of $\epsilon$) should be approximately of magnitude $k/N$.

The probability that a cluster of size k splits randomly into two sets of size $S_{big} = \frac{k}{2} + \epsilon$ and $S_{small} = \frac{k}{2} - \epsilon$ is given by

$$Prob\left(S_{big} = \frac{k}{2} + \epsilon\right) = \frac{1}{2^k} \binom{k}{\frac{k}{2} + \epsilon}$$

and following the discussion above, we consider the condition

$$Prob\left(S_{big} > \frac{k}{2} + \epsilon_{max}\right) < \frac{k}{N}$$


Applying the normal approximation for the binomial distribution we get

$$Prob\left(S_{big} > \frac{k}{2} + \epsilon_{max}\right) \approx \exp\left( -\frac{2 \epsilon_{max}^2}{k} \right)$$

Hence we obtain a bound on the size of the "acceptable" deviation

$$\epsilon_{max} < \sqrt{\frac{k}{2} \log \frac{N}{k}}$$

i.e. two matching clusters are considered to be of nearly equal sizes, and are therefore suspected to sample the same prototype cluster, if the deviation in their size satisfies the above inequality.
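The size criterion is easy to apply in practice. The sketch below computes the bound $\sqrt{\frac{k}{2}\log\frac{N}{k}}$ and tests whether two matched clusters are of "nearly equal" size. The natural logarithm, the use of the combined size of the two matched clusters as k, and the function names are assumptions made for this illustration.

```python
import math


def max_size_deviation(k, total_points):
    """Bound sqrt((k / 2) * log(N / k)) on the deviation from k / 2."""
    return math.sqrt(0.5 * k * math.log(total_points / k))


def sizes_agree(size_a, size_b, total_points):
    """True if the two matched clusters deviate from k / 2 by less than the bound."""
    k = size_a + size_b
    deviation = abs(size_a - size_b) / 2.0  # S_big = k/2 + deviation
    return deviation < max_size_deviation(k, total_points)


print(max_size_deviation(100, 10000))
print(sizes_agree(55, 45, 10000))
print(sizes_agree(80, 20, 10000))
```

For example, with N = 10000 points and a prototype cluster of k = 100 points, sizes 55 and 45 pass the test while sizes 80 and 20 do not.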

Bounding the distances between the centroids of matching clusters

To verify that the centroids of two matching clusters are "nearly identical", two tests are invoked:

(i) small Euclidean distance

(ii) small distance as measured in terms of the distribution of data points in the clusters (i.e. small $\sigma$-distance).

Euclidean distance

Given two centroids $\vec{\mu}^1$ and $\vec{\mu}^2$, their Euclidean distance is simply defined as

\[ d(\vec{\mu}^1, \vec{\mu}^2) = \|\vec{\mu}^1 - \vec{\mu}^2\| = \sqrt{\sum_{i=1}^{d} \left(\mu^1_i - \mu^2_i\right)^2} \]

where $d$ is the dimension of the space. Two clusters are considered indistinguishable if the Euclidean distance between their centroids is small.

There is no general definition of what should be considered a small Euclidean distance. The bound should depend on the scale of the coordinates of the data points, and on the resolution we are aiming for. However, the distribution of distances between sample points may suggest a natural threshold. Consider for example the distribution in Fig. 5.1. Setting the threshold at the value where the smallest distances are effectively observed is a reasonable choice (though perhaps too strict).

$\sigma$-distance

In some cases the Euclidean distance is not informative enough to determine whether two centroids are indeed nearly identical. As noted above, the Euclidean distance may be too sensitive to the scale of the problem. Moreover, it does not reflect the distributions within clusters, and may therefore be misleading.

Figure 5.1: Setting the minimal distinguishable Euclidean distance from distance distribution. Plotted is the distribution of distances between protein sequences, after embedding in Euclidean space. The threshold is denoted by an arrow. [Plot not reproduced; x-axis: distance, y-axis: relative frequency.]

Consider for example the configuration in Fig. 5.2. It is obvious that although the Euclidean distance between the two centroids is small (compared with the average distance between members of the clusters), the two centroids stand for two different clusters. However, by observing the distribution of data points in these clusters we can immediately distinguish between the two. In such cases it is useful to use the $\sigma$-distance as another test to decide if the two centroids are nearly identical.

Figure 5.2: The Euclidean distance between the centroids of the two clusters is small with respect to the average distance within the clusters. However, the $\sigma$-distance can be used to distinguish between the two clusters. The projection of the distribution of points in cluster 1 on the axis that connects the two centroids defines $\sigma_1$ (shown with arrow). The $\sigma$-distance from cluster 1 to cluster 2 is 2.6, therefore the clusters are distinguishable. [Scatter plot of cluster 1 and cluster 2 not reproduced.]

We define the $\sigma$-distance from $\vec{\mu}^1$ to $\vec{\mu}^2$ (the notation implies the asymmetric characteristic of this measure) by first taking the projection of the distribution of cluster 1 on the axis which connects the two centroids, $\vec{\mu}^2 - \vec{\mu}^1$, and calculating the variance of this projection, denoted by $\sigma_1^2$:

\[ \sigma_1^2 = \frac{1}{n_1} \sum_{i=1}^{n_1} \left[ (\vec{x}_i - \vec{\mu}^1) \cdot \frac{(\vec{\mu}^2 - \vec{\mu}^1)}{\|\vec{\mu}^2 - \vec{\mu}^1\|} \right]^2 \]

Then, the $\sigma$-distance is defined as

\[ d_\sigma(\vec{\mu}^1, \vec{\mu}^2) = \frac{\|\vec{\mu}^2 - \vec{\mu}^1\|}{\sigma_1} \]

When the $\sigma$-distance between two centroids is small (say, less than 1), the distributions are highly superimposed, and the centroid of the second cluster falls well within the cloud of points surrounding the first centroid. Therefore we consider these centroids to be nearly identical.
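The definition above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function name, the representation of points as tuples of coordinates, and the handling of the degenerate cases (coinciding centroids, zero variance) are all assumptions.

```python
import math


def sigma_distance(points_1, mu_1, mu_2):
    """Sigma-distance from centroid mu_1 to centroid mu_2.

    points_1 holds the members of cluster 1, whose spread along the
    axis connecting the two centroids sets the scale of the distance.
    """
    axis = [b - a for a, b in zip(mu_1, mu_2)]
    gap = math.sqrt(sum(c * c for c in axis))
    if gap == 0.0:
        return 0.0  # assumption: coinciding centroids are at distance zero
    unit = [c / gap for c in axis]
    # Projection of each point of cluster 1 (relative to mu_1) on the axis.
    projections = [
        sum((x_j - m_j) * u_j for x_j, m_j, u_j in zip(x, mu_1, unit))
        for x in points_1
    ]
    variance = sum(p * p for p in projections) / len(projections)
    if variance == 0.0:
        return math.inf  # assumption: no spread along the axis
    return gap / math.sqrt(variance)
```

The asymmetry is visible in the signature: only the points of cluster 1 enter the computation, so swapping the roles of the two clusters generally gives a different value.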

Checking for similar distributions

The distribution of a sample set can be characterized by the principal components [Press et al. 1988] of the sample covariance matrix. This analysis points out the principal directions of the distribution, where the largest variances are observed. Formally, the principal components are the eigenvectors of the covariance matrix, and the corresponding eigenvalues denote the variance along these eigenvectors. Practically, the term is used to denote the principal directions (those with the largest variance). These directions give the geometric orientation of the distribution.

If the distributions of two clusters are similar, then their principal directions are expected to be nearly identical. Therefore, the fourth test is invoked to determine if the distributions are similar or not, by comparing their principal directions.

We test the suggested matching against the situation when the matching is random. The baseline distribution is defined as the distribution of scalar products of random vectors in $R^d$, drawn uniformly over the unit sphere in $R^d$. The validation is done by calculating the scalar product of the principal directions, and assessing the statistical significance from the distribution of scalar products of random vectors.

The scalar product of two vectors $\vec{x}, \vec{y}$ in $R^d$ is given by $\sum_{i=1}^{d} x_i \cdot y_i$. For random vectors, each coordinate is chosen independently at random. The product of two random variables is a random variable, and the sum of $d$ i.i.d. random variables is distributed normally for large $d$. Therefore, the scalar product of random vectors in $R^d$ is distributed normally. However, for unit vectors, the variables are not independent since the vectors are normalized to have unit length. The absolute value of the scalar product of two random unit vectors in $R^d$ was shown in [Frankl & Maehara 1988, Johnson & Lindenstrauss 1984] to be strongly concentrated around $\frac{1}{\sqrt{d}}$ (see also Johnson-Lindenstrauss theorem in chapter 4).
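The concentration around $\frac{1}{\sqrt{d}}$ can be checked empirically with a small Monte Carlo sketch. The code below is an illustration only: the function names, the seed and the number of trials are assumptions, and uniform unit vectors are generated by normalizing vectors of independent Gaussian coordinates, which is a standard way to sample the unit sphere.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a vector uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def rms_scalar_product(d, trials, seed=0):
    """Root mean square of the scalar product of random unit vector pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        u = random_unit_vector(d, rng)
        v = random_unit_vector(d, rng)
        dot = sum(a * b for a, b in zip(u, v))
        total += dot * dot
    return math.sqrt(total / trials)
```

For a moderate dimension such as d = 50, the value returned by `rms_scalar_product(50, 2000)` should lie close to $1/\sqrt{50} \approx 0.14$.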

To obtain an explicit measure of statistical significance, the distribution of scalar products of random unit vectors in $R^d$ was derived (see Fig. 5.3). This distribution changes drastically as the dimension increases. For random vectors in 2D the distribution is mostly flat besides two peaks at the boundaries⁶, whereas for higher dimensions (4, 5) the distribution is a broad unimodal concave function that narrows as the dimension increases. For even higher dimensions (> 100) the distribution is clearly normal, centered at zero, and strongly concentrated (with a standard deviation which depends on the dimension, $\sigma = \frac{1}{\sqrt{d}}$). Therefore, for high dimensional Euclidean spaces it is possible to obtain statistical assessment by means of the standard deviation $\sigma$ of the distribution⁷.

⁶ For vectors in 2D the scalar product is determined by the cosine of the angle between the vectors. Therefore, the distribution simply follows from the characteristic of the cosine function (which is denser in the boundaries).

⁷ Since the distribution is quite broad for lower dimensions, the test is effective only for high dimensions.

Figure 5.3: (a) Distribution of scalar product of random vectors in $R^d$. Distributions are plotted for d=2, d=3, d=4, d=5, d=10, d=100. (b) Distribution of scalar product of random vectors in $R^{200}$ (empirical distribution and normal approximation). Note that the distribution follows the normal distribution. [Plots not reproduced; x-axis: scalar product, y-axis: relative frequency.]

If the absolute value of the scalar product of two principal components from two different clusters exceeds $3\sigma$ (significance level of > 99%) then the clusters are said to agree on their principal directions.
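The $3\sigma$ criterion can be written as a short check. The following is a minimal sketch, assuming $\sigma = \frac{1}{\sqrt{d}}$ as stated above for high dimensions; the function name and the `n_sigma` parameter are hypothetical, and the input vectors are normalized inside the function so that they need not be unit vectors.

```python
import math


def directions_agree(u, v, n_sigma=3.0):
    """True if |u . v| for the normalized directions exceeds n_sigma / sqrt(d)."""
    d = len(u)
    norm_u = math.sqrt(sum(c * c for c in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return abs(cosine) > n_sigma / math.sqrt(d)
```

In 100 dimensions the threshold is 0.3, so two principal directions whose scalar product is 0.2 in absolute value would not be considered in agreement. As noted in the text, the criterion is only meaningful for high dimensions.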

In principal component analysis the covariance matrix has to be nonsingular, since the inverse matrix is required. However, for small data sets the empirical covariance matrix may be singular. For singular matrices it is possible to apply instead the singular value decomposition (SVD) technique. The technique is based on a theorem from linear algebra: any $M \times N$ matrix $A$ where $M \geq N$ can be written as the product of an $M \times N$ column-orthogonal matrix $U$, an $N \times N$ diagonal matrix $W$ with positive or zero elements (the singular values), and the transpose of an $N \times N$ orthogonal matrix $V$, i.e.

\[ A = U \cdot W \cdot V^T \]

If $A$ is a square, symmetric and positive definite matrix (as is the case for a nonsingular covariance matrix) then $U = V$ is simply the matrix whose columns are the eigenvectors of the matrix $A$ (the principal components of the matrix $A$), and $W$ is a diagonal matrix with the eigenvalues of $A$ in the diagonal. The inverse of $A$ is then

\[ A^{-1} = V \cdot [diag(1/w_j)] \cdot U^T \]

(the inverse matrix can also be computed by LU decomposition or Gaussian elimination). For a singular matrix $A$, one (or more) of the $w_j$'s is zero (in practice, similar problems arise also if the entries in $W$ are smaller than the machine's floating point precision), and $A$ doesn't have an inverse. However, those columns of the matrix $U$ which correspond to nonzero $w_j$ components of $W$ span the range of $A$ and therefore denote the main principal directions of the corresponding subspace (for more details see [Press et al. 1988]).
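As an illustration of how the principal directions can be read off the SVD even when the covariance matrix is singular, here is a short sketch that uses NumPy's `numpy.linalg.svd`. The function name and the relative tolerance used to decide which singular values count as nonzero are assumptions chosen for this example, not values taken from the original work.

```python
import numpy as np


def principal_directions(points, rel_tol=1e-10):
    """Principal directions of a point set via SVD of its covariance matrix.

    Returns (directions, variances): the columns of `directions` are the
    columns of U that correspond to non-negligible singular values.
    """
    x = np.asarray(points, dtype=float)
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / len(x)
    u, w, _vt = np.linalg.svd(cov)   # cov = u @ diag(w) @ vt
    keep = w > rel_tol * w[0]        # drop (numerically) zero singular values
    return u[:, keep], w[keep]
```

For points that all lie on a single line in three dimensions, the covariance matrix has rank one, and only one direction is returned, namely the direction of that line (up to sign).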


Chapter 6

Global organization of protein segments

The methods of the previous chapters were applied in order to cluster all protein segments of length 50 in the SWISSPROT database. To summarize the computational procedure, we start by partitioning all protein sequences longer than 50 amino acids in the SWISSPROT database (release 30) into overlapping segments of 50 amino acids. This partitioning yields a total of 543,627 segments. A metric derived from the Smith-Waterman dynamic programming measure of similarity turns the space of protein segments into a finite metric space (see section 4.1). This space of segments is initially embedded into Euclidean space of low dimension and small distortion (see chapter 4). A hierarchical clustering algorithm is then applied to the embedded space with Euclidean distances. A key aspect of this stage is that the validity of our clustering is closely monitored by cross validation (see chapter 5).
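The partitioning step can be sketched as follows. This is a hedged illustration only: the text above states the segment length (50) but not the amount of overlap, so the `step` of 25 residues and the treatment of the tail of the sequence (a final segment anchored at the end) are assumptions, as is the function name.

```python
def overlapping_segments(sequence, length=50, step=25):
    """Split a sequence into overlapping windows of a fixed length.

    Returns a list of (start, segment) pairs. `step` and the handling of
    the last window are illustrative assumptions, not documented choices.
    """
    if len(sequence) < length:
        return []
    starts = list(range(0, len(sequence) - length + 1, step))
    if starts[-1] + length < len(sequence):
        # Anchor a final window at the end so the tail is covered.
        starts.append(len(sequence) - length)
    return [(s, sequence[s:s + length]) for s in starts]
```

Under these assumptions a protein of 120 residues would yield four segments, starting at positions 0, 25, 50 and 70.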

The processing of such large quantities of data requires extensive computer resources. The embedding phase requires massive pairwise calculations, and was made possible by using Compugen's Bioccelerator [Compugen 1998] that performs the SW dynamic programming algorithm. The clustering phase is based on a computationally intensive protocol that was made feasible by programming a parallel application to run on the MOSIX distributed system [Barak et al. 1995].

The whole computational process was fully automatic, without any human intervention or biological consideration. On termination, when the cross validation criteria allowed no further splitting, the process yielded a tree of 106 clusters. This chapter discusses the results of this analysis.

6.1 Results

In order to evaluate the quality of this clustering we made various comparisons with a known partial classification of proteins, namely the protein families in PROSITE (release 12.1, October 1994 [Bairoch 1991]). This list of about 700 groups of related proteins comprises 46% of the proteins in the databank¹. These families vary in size from a few proteins to several hundred (see Fig. 6.1). Henceforth, in this chapter a "family" of proteins refers to a class of proteins on this list and the nomenclature of PROSITE is adopted.

Figure 6.1: Distribution of PROSITE families according to their size. [Histogram not reproduced; x-axis: family size (number of proteins), y-axis: families (%). The protein_kinase, g_protein_receptor and globin families are marked on the plot.]

The hierarchical tree is evaluated first in terms of the family composition within the clusters. We focus on several clusters that match interesting motifs, or suggest the existence of biological features common to different families. In the second part, we incorporate the data obtained from the distribution of protein segments among the 106 clusters to create a new representation of families (referred to as fingerprints), which induces quantitative indices of similarity between protein families. In the last part, a new method for representing full-length proteins is introduced, based on the order of segments within a protein and the clusters into which these segments were classified. Thus, by incorporating the detailed information from our clustering, a natural measure of similarity emerges for complete proteins as well. Moreover, this new representation is very effective in visualization of domains shared by a group of related proteins. The three levels of analysis rely on the initial tree created by our algorithms, using properties of the tree such as the relative position of a cluster in the tree, the size of clusters, the geometry of a cluster, and the Euclidean distance between the centroids of the clusters.

6.1.1 Clusters of protein sequences

The tree of 106 clusters, generated by the hierarchical clustering algorithm, is shown in Fig. 6.2. Inspection of the tree shows that while most clusters were generated by a series of splits, corresponding to a deeper level in the hierarchical tree, a substantial number of clusters were created and stabilized already after a few splits. The most extreme example is the case of globins, which comprise clusters 8-13, 81-84 and 104-105. All evolved very early during the clustering process.

Figure 6.2: Hierarchical clustering of protein segments. Only major splits are shown, with the appropriate cluster numbers 1 to 106. Certain clusters, e.g., 8-13, are created already early in the process, but most clusters correspond to deeper, more involved series of splits. Some of the conserved families which split from the rest at early stages are shown. Subclassification within family (e.g. hemoglobin alpha chain, hemoglobin beta chain, myoglobin, etc.) is not indicated. [Tree diagram not reproduced; labeled branches include globins, rubisco_large, actins, ig_mhc and tubulins.]

Fig. 6.3 offers a general view of the clusters' complexity and the distribution of data among the clusters. A cluster's complexity is measured by the number of PROSITE families which contribute at least one segment to it. About 56% of our clusters correspond to a single family and another 12% of the clusters are still of low complexity, with up to 20 families per cluster. At the high-complexity end, in 22% of the clusters over 200 families appear. On the other hand, the vast majority (90%) of the segments belong to highly complex clusters (over 200 families/cluster). Therefore, while most clusters are small and have low complexity, they comprise only a small fraction of the data. Several large clusters may need further splitting (see discussion). However, despite their complex nature, some of the large clusters are very informative (see section 6.1.2).

¹ As noted in section 3.1.1, the number of protein families in PROSITE is actually smaller, since some PROSITE patterns match short motifs which are common to many families.
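The complexity measure used above (the number of PROSITE families that contribute at least one segment to a cluster) can be sketched as follows. The input format, a list of (cluster, family) pairs with `None` for segments of proteins that have no PROSITE label, is an assumption made for this illustration, and the sample data are invented.

```python
from collections import defaultdict


def cluster_complexity(assignments):
    """Number of distinct families contributing segments to each cluster.

    `assignments` is an iterable of (cluster_id, family) pairs, one per
    segment; family is None when the source protein is unlabeled.
    """
    families = defaultdict(set)
    for cluster_id, family in assignments:
        families[cluster_id]  # register the cluster even if unlabeled
        if family is not None:
            families[cluster_id].add(family)
    return {cluster_id: len(fams) for cluster_id, fams in families.items()}


# Invented example: two segments of one family in cluster 35,
# two families plus one unlabeled segment in cluster 39.
example = [
    (35, "zinc_finger_c2h2"),
    (35, "zinc_finger_c2h2"),
    (39, "cytochrome_c"),
    (39, "globin"),
    (39, None),
]
complexity = cluster_complexity(example)
```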

Table 6.1 provides a closer view of the clusters, including their size and family composition. Many conserved families get classified into a few distinct, low complexity clusters. Such families include globins (clusters 8-13, 81-84, 104-105), ribulose bisphosphate carboxylase large chains (rubisco_large, clusters 14-15, 24-34, 76-80), immunoglobulins (ig_mhc, clusters 20-23, 65-74), actins (clusters 54-57, 62-64, 75, 99) and tubulins (clusters 16-19, 48-51). Certain families have almost all their segments classified to low complexity clusters. For example, 98% of the actin segments and 96% of the rubisco_large segments fell into such clusters. Other families, such as metallothionein, kazal serine protease inhibitor (kazal), and phospholipase a2 (pa2_asp) have all, or almost all, their segments in only one or two clusters.

Some of the families are composed of different subfamilies. In some cases our clustering


Cluster    No. of     No. of     No. of proteins      No. of     Main families
no.        segments   proteins   with PROSITE label   families   (PROSITE)

1          5206       3419       1993                 249
2          67538      21897      9430                 637
3          55403      19560      8398                 613
4          38109      15286      6535                 558
5          5935       2827       1217                 202
6          19364      9256       4046                 438
7          74082      23147      9999                 650
8-13       915        464        464                  1          Globin
14-15      380        198        198                  1          Rubisco Large
16-19      609        154        150                  1          Tubulin
20-23      600        155        155                  1          Ig Mhc
24-34      2375       223        222                  1          Rubisco Large
35         895        167        159                  1          Zinc Finger C2h2
36         129        67         67                   1          Kazal
37         189        97         96                   1          Cytochrome C
38         337        209        203                  3          Snake Toxin, Tubulin, Metallothionein
39         369        151        150                  2          Cytochrome C, Globin
40         269        121        119                  1          Cytochrome B Heme
41         780        361        194                  2          Homeobox
42         3047       2006       1299                 134
43         4119       2869       1555                 243
44         3838       2508       1288                 193
45         4824       3289       1953                 261
46         4104       2769       1746                 167
47         4415       3179       1734                 252
48         538        291        208                  4          Tubulin, Insulin, Hsp20, Sasp 1
49         466        242        211                  3          Gapdh, Cyto B Heme, Copper Blue
50         1202       611        356                  10         Tubulin, Globin, Cyto B Heme, Rnase Pancreatic
51         1170       617        345                  8          Tubulin, 2fe2s Ferredoxin, Histones, Globin
52         2380       1344       936                  30
53         10830      5336       2529                 345
54-56      500        103        103                  1          Actin
57         230        126        125                  2          Actin, Ribosomal S12
58         4739       3192       1592                 238
59         3278       2629       1197                 216
60         3364       2434       1393                 186
61         3285       2415       1498                 167
62         172        101        100                  1          Actin
63         155        128        127                  2          Actin, Histones H3 2
64         208        105        103                  1          Actin
65-74      1479       156        156                  1          Ig Mhc
75         201        102        102                  1          Actin
76-80      1104       247        222                  1          Rubisco Large
81-84      866        445        445                  1          Globin
82         205        205        205                  1
83         228        228        228                  1
84         217        217        217                  1
85         57893      20694      9231                 637
86         34771      14752      6685                 581
87         497        106        16                   7          Collagens
88         1363       103        16                   3          Collagens, C Type Lectin
89         4908       3497       2048                 243
90         201        105        105                  1          Pa2 Asp
91         2325       1373       899                  64
92         4124       2814       1502                 189
93         3592       2486       1451                 185
94         4517       3048       1691                 238
95         3498       2328       1260                 203
96         3908       2735       1621                 205
97         205        101        85                   1          Cytochrome B Heme
98         140        61         53                   1          Lactalbumin Lysozyme
99         513        423        278                  9          Actin, Cox2, Cyto B Heme, Chaperonins Cpn10
100        3984       2758       1541                 216
101        3305       2287       1236                 199
102        2357       1537       985                  109
103        4530       2963       1779                 244
104-105    453        230        230                  1          Globin
106        72461      23270      10250                652

Table 6.1: Detailed description of clusters. Each cluster is specified by its number (first column), the number of segments within it (2nd column) and the number of distinct proteins from which these segments originate (3rd column). The other three columns (partially) characterize clusters in terms of the PROSITE classification of the proteins. The 4th column gives the number of proteins that have a PROSITE label. The complexity of the cluster, i.e., the number of families which contribute these proteins, and major representative families are in columns 5 and 6 respectively. Notes: 1) A protein that contributes a segment to some cluster is considered a "member" in this cluster. 2) A "family" of proteins is always one of the classes in the PROSITE list, and the PROSITE nomenclature is adhered to. Only 46% of the proteins are classified in PROSITE. Multi-trait behaviors of proteins are not accounted for. For family definition and biological significance refer to the PROSITE dictionary. 3) Where consecutive clusters represent only one and the same family, these are presented in a single record. 4) The number of segments in a cluster may differ from the number of proteins from which they are derived. A high ratio between these two parameters reflects the existence of repeats, or redundancy, in these proteins (see clusters 35 and 88). 5) Some families have almost all their segments in well characterized, low complexity clusters. Families with over 50% of their segments found in low complexity clusters are underlined. 6) Subdivision within families was resolved in some cases but is not indicated.

Figure 6.3: Distribution of clusters according to their complexity and distribution of data. Black bars show the distribution of clusters by the number of families represented in them ("cluster complexity"). In white, the distribution of segments according to the complexity of the clusters containing them. For example, the left white bar indicates that about 2.5% of the segments are in clusters which represent only one family. [Bar chart not reproduced; x-axis: number of families in a cluster (bins 1, 2-5, 6-20, 21-100, 101-200, 201-300, >300), left y-axis: clusters (%), right y-axis: data (% of segments).]

method succeeds in distinguishing between these subfamilies. For example, hemoglobin alpha chains were clustered to clusters 9, 10, 11, 81, 82, 83, while hemoglobin beta chains were clustered to clusters 8, 13, 84, 104, 105, and myoglobins to clusters 39, 50. For clarity we refer to all of these as globins.

Note that the number of segments in a cluster may differ from the number of proteins from which they are derived. A high ratio between these two parameters reflects the existence of repeats, or redundancy, in these proteins. For example, in cluster 35 this redundancy ratio is about 5.5. All the proteins that have segments in this cluster are classified as `zinc finger proteins'. This high ratio results from 4-8 repetitions of the signature specific to zinc finger domains, all of which were clustered into cluster 35 (see below). Another example is cluster 88, where the segments to proteins ratio is even higher, about 13. Though only 15% of the proteins within these clusters have a PROSITE classification, all of those segments are repeated domains of structural proteins (mostly collagens).
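The redundancy ratio is simply the number of segments divided by the number of distinct proteins. The short sketch below recomputes it from the segment and protein counts that Table 6.1 lists for clusters 35 and 88; the function and variable names are assumptions made for illustration.

```python
def redundancy_ratio(n_segments, n_proteins):
    """Segments per protein in a cluster; high values suggest repeats."""
    return n_segments / n_proteins


# (number of segments, number of proteins) as listed in Table 6.1
table_rows = {35: (895, 167), 88: (1363, 103)}
ratios = {cluster: redundancy_ratio(*row) for cluster, row in table_rows.items()}
```

With the table values this gives roughly 5.4 for cluster 35 and roughly 13.2 for cluster 88, in line with the approximate figures quoted in the text.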

Amino acid comp osition

The amino acid distribution was calculated for each of the 106 clusters. In certain clusters, the amino acid distribution hardly differs from the distribution over the whole data bank, while other clusters show marked variations. Both cases are observed in large as well as in small clusters (Fig. 6.4a,b respectively). Certain pairs of clusters turn out to have similar amino acid distributions, although they represent distinct protein families (not shown). Likewise, differences in the distribution of amino acids account for certain clusters, but certainly not for all of them. Consequently, this distribution alone does not necessarily determine biological properties. Only a few clusters exhibit degenerate amino acid distributions. For example, in clusters 87 and 88 glycine and proline are relatively prevalent while all other amino acids are underrepresented (Fig. 6.4c), reflecting the degeneracy of the proteins from which the relevant segments are derived².

² Approximately 15% of the residues occur in segments of extreme compositional bias [Wootton 1994]. Low-complexity segments are implicated in crucial biological functions, but relatively little is known about their structure, dynamics and interactions.

Figure 6.4: Amino-acid distribution in selected clusters. Amino acids are marked by the single letter code, and are grouped into biochemically related groups, separated by dashed vertical lines. From left to right: amino acids which are basic (K,R,H), acidic (E,D), polar and uncharged (Q,N), small (S,T,A,G), proline (P), nonpolar hydrophobic (M,I,L,V), aromatic (F,W,Y), and cysteine (C). The variance in frequency is quantified as the logarithm of the frequency of a given amino acid within a cluster divided by its frequency throughout the entire data bank (no difference between the expected value and the observed yields zero on this logarithmic scale). (a) Many of the clusters display a unique amino acid distribution. One example is cluster 53 with 10,830 segments. It is relatively rich in glutamine (Q), glutamic acid (E) and alanine (A) and is underrepresented in all aromatic residues (F,W,Y), histidine (H), proline (P) and cysteine (C). (b) Some other clusters show a smooth distribution close to the overall amino acid distribution. This is, for example, the case with cluster 51, even though it has only 1170 segments and represents only a small number of families. Note that the cluster size does not dictate the amino acid distribution profile (compare panels a and b). (c) A few of the clusters (e.g. cluster 87) consist of segments with very low compositional complexity (predominant G and P). Most segments in this cluster are part of proteins which play a structural role (see Table 6.1) and have numerous repetitions. [Plots not reproduced; panels: (a) cluster 53, (b) cluster 51, (c) cluster 87; x-axis: amino acid, in the order KRHEDQNSTAGPMILVFWYC; y-axis: log2(F_cluster/F_total), from -4 to 4.]
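The quantity plotted in Fig. 6.4, log2(F_cluster/F_total), can be computed along the following lines. This is an illustrative sketch, not the code used to produce the figure: the function name and input format are assumptions, and amino acids that do not occur in the cluster or in the reference set are simply left out of the result, which is a choice made here to avoid taking the logarithm of zero.

```python
import math
from collections import Counter

# Order of amino acids used on the x-axis of Fig. 6.4
AMINO_ACIDS = "KRHEDQNSTAGPMILVFWYC"


def log2_enrichment(cluster_segments, all_segments):
    """log2 of the per-residue frequency in a cluster over the overall frequency."""
    cluster_counts = Counter("".join(cluster_segments))
    total_counts = Counter("".join(all_segments))
    n_cluster = sum(cluster_counts[a] for a in AMINO_ACIDS)
    n_total = sum(total_counts[a] for a in AMINO_ACIDS)
    scores = {}
    for a in AMINO_ACIDS:
        f_cluster = cluster_counts[a] / n_cluster
        f_total = total_counts[a] / n_total
        if f_cluster > 0 and f_total > 0:
            scores[a] = math.log2(f_cluster / f_total)
    return scores
```

A value of zero means that the residue is exactly as frequent in the cluster as in the whole data set; positive values indicate enrichment and negative values depletion.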

Motifs and domains

Some clusters exclusively match well defined motifs within proteins. That is, segments that correspond to a specific biological pattern were grouped together to form a well separated and distinct cluster. Two specific examples are the zinc finger motif and the homeobox domain.

The zinc finger motif

The zinc finger motif is found in many DNA binding proteins (like transcription factors) in which the zinc finger is the DNA binding domain, but also in certain proteins in which the role of the zinc finger is unknown (see [Cukierman et al. 1995]). These proteins are characterized by 2-30 finger-like sub-structures, each centered around a zinc ion. Each finger is about 30 residues long, with only a few highly conserved amino acids within it.

Cluster 35 corresponds to this motif. All the segments classified to it belong to proteins from the zinc_finger_c2h2 family (one of the two major zinc finger families). Moreover, these segments are exactly the segments that contain the zinc finger pattern (as defined by PROSITE), thus corresponding to the zinc finger motifs in each protein (Fig. 6.5a).

Figure 6.5: (a) The zinc finger motif. Cluster 35 matches the zinc finger domain. All segments in this cluster are part of proteins which are classified as zinc_finger_c2h2 according to PROSITE. Moreover, these segments match exactly the zinc finger domain in each such protein. Out of 241 proteins in this family, 159 contribute at least one segment to this cluster. One such example is sw:zkr1_chick (509 amino acids). In the schematic representation, the zinc finger domains are denoted by black boxes. The segment boundaries are indicated below. The number near each segment is the number of the cluster to which this segment was classified. The 10 zinc fingers in this protein are divided roughly into two blocks. Note that only the zinc finger domains are classified to cluster 35 (circled). Part of the second block was classified to cluster 91, which is also rich with zinc finger proteins. (b) The homeobox domain. Cluster 41 matches the homeobox domain. Most segments classified to this cluster correspond to the homeobox domain in 193 different proteins. One of them is sw:hxd3_human (416 amino acids). Only the segments matching the homeobox signature (marked in black) are classified to cluster 41. [Schematic not reproduced; panels: (a) SW:ZKR1_CHICK (509 aa), zinc fingers; (b) SW:HXD3_HUMAN (416 aa), homeobox domain.]

The homeobox domain

The homeobox domain is a polypeptide sequence of 60 amino acids, found in nuclear, DNA-binding proteins. This domain binds DNA through a helix-turn-helix structure. Proteins that contain the homeobox domain are likely to act as regulators of transcription. Cluster 41 in our classification matches this domain. Out of 304 proteins in the homeobox family, 194 are represented in this cluster. Segments in these proteins which are classified to this cluster are exactly those that contain the homeobox signature (Fig. 6.5b). The motif was extracted from the complete proteins without any a priori information.

Clusters that match heterogeneous biological signatures

Some clusters represent biological signatures that are more heterogeneous but still very distinctive. These clusters are of medium size (1000 - 6000 segments) and medium complexity (about 10-200 families represented in each cluster). Some of them suggest a possible relation between the contributing families. In other instances, a finer resolution might be attained through further splitting. Two such examples are cluster 52 and cluster 5, each of which predominantly represents one known family.

Cluster 52 - the EF hand motif and its relatives

There are 2380 segments in cluster 52, originating from 30 different families. Despite this relatively large number of families, the amino acid composition in this cluster deviates from the overall values (Fig. 6.6a). Predominant in this cluster is the family of proteins containing the EF hand motif. The EF hand is a short domain of about 30 amino acids that coordinates Ca$^{2+}$ ions. The motif is present in parvalbumin, calmodulin, troponin-c and others, all of which are involved in Ca$^{2+}$ signaling, thus regulating cellular activities. These proteins often carry several EF domains. Segments that correspond to these domains are classified to cluster 52 (see Fig. 6.6b).

Figure 6.6: (a) Amino acid distribution in cluster 52. Acidic amino acids (E and D) are more frequent than their average frequency over all the database, while N, P, W and C are underrepresented. For details on the representation see Fig. 6.4. (b) Cluster 52 - the EF hand domain. The predominant family in cluster 52 is the family of proteins containing the EF hand motif. The motif is present in parvalbumin, calmodulin and troponin-c (one protein is shown from each subfamily). These proteins often carry several EF domains (denoted by black boxes). Note that all segments that correspond to these domains are classified to this cluster. For details on the representation see Fig. 6.5. [Plots not reproduced; panel (a): log2(F_cluster/F_total) per amino acid; panel (b): calcium binding sites in SW:CAL3_PETHY (183 aa, calmodulin related protein), SW:TPCC_HUMAN (161 aa, troponin c, muscle protein) and SW:PRVA_ESOLU (108 aa, parvalbumin alpha).]

In EF domains, glutamic and aspartic acids prevail (they participate in coordinating the Ca$^{2+}$ ions) and indeed these amino acids occur in cluster 52 above the average (Fig. 6.6a). In addition, the frequency of amino acids which are absent from classical EF hands, such as proline and cysteine, is low.

Cluster 5 - the EGF domain

Cluster 5 has about 6000 segments, which come from over 200 families. It has a distinct amino acid distribution (Fig. 6.7a), where the representation of hydrophobic amino acids is low and histidine, proline, tryptophan and cysteine abound. Of 1217 proteins that are classified by PROSITE and contribute to this cluster, 8% are EGF-like proteins. However, these proteins contribute more than 30% of the corresponding segments, and are thus the predominant family. The EGF (epidermal growth factor) domain is a small polypeptide chain of 53 amino acids. The EGF domain includes six cysteine residues which have been shown (in EGF) to be involved in disulfide bonds. This amino acid is highly abundant in this cluster (over 3 times the overall frequency, see Fig. 6.7a). In proteins containing an EGF-like domain, only the EGF-like domains are classified to cluster 5, yet each protein contributes about 10 segments on average, in accord with the repeated nature of this domain (see Fig. 6.7b).

[Figure 6.7. Panel (a): bar plot of log2(F_cluster/F_total) for each amino acid, in the order KRHEDQNSTAGPMILVFWYC. Panel (b): cluster assignments of the segments of SW:LMG1_MOUSE (1607 aa, laminin gamma-1 chain precursor), with the EGF repeats marked.]

Figure 6.7: (a) Amino acid distribution in cluster 5. Despite the large size of this cluster (5935 segments) and its high complexity (over 200 families), it shows a unique amino acid distribution, with prevalent cysteine, histidine, tryptophan, proline and glycine. All these residues impose major structural features. All hydrophobic amino acids are underrepresented. For details on the representation see Fig. 6.4. (b) Cluster 5 - the EGF motif and related patterns. Cluster 5 is rich with proteins containing the EGF-like domain. One such example is sw:lmg1_mouse - laminin gamma-1 chain precursor (laminin b2 chain). Laminin is a complex glycoprotein, consisting of three different polypeptide chains, which are bound to each other by disulfide bonds into a cross-shaped molecule. The protein contains EGF-like domains, denoted by the black boxes. Each box matches several tandem repeats. Note that all the segments classified to cluster 5 correspond to the EGF domain. Four segments were classified to cluster 6 (which is the closest cluster to cluster 5). This is due to the fact that some of the EGF repeats are less conserved, and may exhibit a slightly different composition (see PROSITE documentation on the EGF-like motif). For details on the representation see Fig. 6.5.

Other families which are represented in this cluster are protein kinase, c-type lectin, homeobox, chitin binding, kringle, wnt1, and more. In most of them cysteine abounds and is involved in disulfide bonds. On the other hand, families of proteins rich in cysteine are not necessarily classified to this cluster, e.g. kazal proteases, snake toxins, etc. (for details see Table 6.1).


Kinship of protein families as inferred from the clustering tree

Further information can be extracted from the tree of 106 clusters, by examining the splits as they occur along the tree. As clusters that split from the same root cluster may be biologically related, unknown relations between families may be revealed by examining the "evolutionary" process in the final tree. Likewise, clusters that represent a small number of families may hint at a connection between the families that they represent (e.g. clusters 48-51, 99 and more). We focus on only two cluster groups. Other possible connections extracted from the junctions in the tree are open to interpretation, and may require further experimental data.

Cytochromes and Globins

Clusters 39-40 (together with cluster 41) split from their ancestor cluster quite late in our clustering process (Fig. 6.2). Cluster 40 totally matches the cytochrome b/b6 family (cytochrome b heme), while cluster 39 is composed of cytochrome c (45% of the segments) and globins (54.5% of the segments), mostly myoglobins. The cytochromes of the two types are functionally related, but the connection between globins and cytochromes is more interesting and suggests an intrinsic link. Indeed, an evolutionary relation between globins and cytochromes was recently proposed [Hardison 1996].

[Figure 6.8. Panels (a) cluster 35, (b) cluster 36, (c) cluster 37, (d) cluster 38: bar plots of log2(F_cluster/F_total) for each amino acid, in the order KRHEDQNSTAGPMILVFWYC.]

Figure 6.8: Clusters 35-38. These four clusters have a common ancestor in the hierarchical tree (Fig. 6.2). The four clusters are closely related despite the large differences in their amino acid distribution. For details on the representation see Fig. 6.4.


Metal and DNA binding proteins

Clusters 39-40 are only part of a more complex structure. Fig. 6.2 and Table 6.1 suggest an interesting and complex relation that ties clusters 35-41. The most common feature is that almost all families represented in those clusters bind metal ions (zinc finger, cytochrome c, metallothionein, cytochrome b/b6 and globins), or heme (cytochrome b/b6, globins), or DNA (homeobox, zinc finger). These families differ in their biological role (enzymes, transcription factors, etc.). Some of them use cysteines to stabilize their 3-D structure, e.g. zinc finger, snake toxins and kazal proteases. The high frequency of cysteine in those families is reflected in the amino acid composition of these clusters, but does not account for all of them (compare Fig. 6.8a,b,d to Fig. 6.8c). Moreover, other clusters that are rich in cysteine (e.g. cluster 5, Fig. 6.7a) are not part of this super-structure. Thus, no simple relation of amino acid composition ties all these clusters together. Rather, the connection is complex and leaves some open questions.

6.1.2 "Fingerprints" of biological families based on cluster membership

It is not easy to characterize biological families, say by a single consensus sequence or pattern: most families are very diverse and populate many of our clusters. Therefore, the nature of a family cannot be deduced by inspection of a specific cluster. However, the distribution of segments from proteins in a family among the various clusters is more revealing. This broader view leads to an interesting novel representation of families, one that distinguishes well between different families. For example, families such as globin and gapdh (glyceraldehyde 3-phosphate dehydrogenase) exhibit a complex, yet well-defined distribution over clusters (Fig. 6.9a,b). The distribution of segments from a family among clusters can be viewed as a fingerprint of the family. The statistical significance of this representation is guaranteed, again, by the cross validation in the clustering procedure. Thus, not only membership in small clusters is informative. Membership in large and complex clusters may play a significant role in characterizing biological families.

Fingerprints of protein families allow quantitative comparisons among families: pick any distance measure among probability distributions, e.g., KL-divergence [Cover & Thomas 1991] or variational distance. The similarity between two protein families is quantified by the distance of their fingerprints. In this way we can find, for each family, its proximal families.
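The fingerprint comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in this work: the function names are hypothetical, clusters are assumed to be numbered 1 to 106, the variational distance is taken as the L1 distance between the two distributions, and the small smoothing constant in the KL-divergence (needed because most families leave many clusters empty) is a choice made for the example.

```python
import math
from collections import Counter

N_CLUSTERS = 106  # number of clusters in the final tree

def fingerprint(cluster_ids, n_clusters=N_CLUSTERS):
    """Relative frequency of each cluster among a family's segments.

    cluster_ids holds one cluster number (1..n_clusters) per segment.
    """
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return [counts.get(k, 0) / total for k in range(1, n_clusters + 1)]

def variational_distance(p, q):
    """L1 distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def _smooth(p, eps):
    # Add eps to every entry and renormalize (assumption of this sketch).
    total = sum(x + eps for x in p)
    return [(x + eps) / total for x in p]

def kl_divergence(p, q, eps=1e-6):
    """KL-divergence D(p || q) in bits, computed on smoothed distributions."""
    ps, qs = _smooth(p, eps), _smooth(q, eps)
    return sum(a * math.log2(a / b) for a, b in zip(ps, qs))

def nearest_families(name, fingerprints, distance=variational_distance, k=3):
    """Names of the k families whose fingerprints are closest to `name`."""
    ranked = sorted(
        (distance(fingerprints[name], fp), other)
        for other, fp in fingerprints.items()
        if other != name
    )
    return [other for _, other in ranked[:k]]
```

Note that the KL-divergence is not symmetric, so a symmetrized variant may be preferable when ranking proximal families.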

It should be noted that the kinship of protein families, which is directly inferred from the tree structure (as was suggested for the globins and the cytochromes in the last section), is based on a local common motif, while the new representation reflects the global nature of all domains within a family, and suggests a more thorough kind of similarity, which projects to the biological function of the family.

The power of this method is demonstrated on several families of membrane proteins and transporters, whose mutual distances turn out to be the smallest (Fig. 6.10). The four families (three transporter subfamilies, and a family of membrane proteins) share almost the same fingerprint,


[Figure 6.9. Panels (a) globin and (b) gapdh: relative frequency (0 to 0.3) versus cluster number (0 to 100).]

Figure 6.9: The representation of several biological families by their fingerprints. Every biological family is fitted with the distribution of segments of proteins from the family among the clusters, thus obtaining a new representation of the family (fingerprints). The relative frequency of a cluster is defined as the number of segments within the family classified to this cluster, divided by the total number of segments within the family. (a)-(b) globins and gapdh (glyceraldehyde 3-phosphate dehydrogenase) are two well studied families with 666 (3434) and 103 (1332) proteins (segments), respectively. The characteristic and complex fingerprint of each of these families is shown in (a) and (b), respectively.

evidence of the close biological function they all serve. Other transporters (e.g. antiporters [footnote 3]) and ion channels (e.g. neurotransmitter-gated ion-channels) have fingerprints that resemble those of the four families mentioned above to varying degrees. Thus, a connection is established among superfamilies within many of the membranous proteins. Fingerprints can be further analyzed by considering sub-families and their fingerprints, as well as by inspecting super-families.

6.1.3 Higher-level measures of similarity between sequences

Fingerprints capture the distribution of segments in a family among the different clusters, but fail to account for the order of segments within proteins. Significant information can be extracted for full-length proteins as well, by mapping each protein to the sequence of clusters into which its segments fall. In other words, every protein is encoded by a "word" over an alphabet of 106 "characters" (the clusters). A natural similarity measure on full-length proteins emerges. Namely, apply dynamic programming, where the similarity score between "characters" depends only on the

Footnote 3: This family is not part of the PROSITE list. 30 proteins in SWISSPROT which are defined as antiporters were grouped, and the corresponding fingerprint was created, based on the distribution of the proteins' segments among the clusters.


[Figure 6.10. Panels (a) abc_transporter, (b) prokar_lipoprotein, (c) bpd_transp_inn_membr, (d) sugar_transport: relative frequency (0 to 0.3) versus cluster number (0 to 100).]

Figure 6.10: Families with similar clustering profiles (fingerprints). ABC transporter (179 proteins, 3898 segments), prokaryotic-lipoprotein (131 proteins, 1311 segments), binding protein-dependent transporter of the inner membrane (60 proteins, 665 segments) and the bacterial sugar transporter (57 proteins, 1132 segments) all have similar (but nonidentical) clustering profiles, shown in (a)-(d) respectively. Their mutual distances turn out to be very small. All the transporters are proteins with multiple membrane-spanning domains. Prokaryotic-lipoprotein consists of proteins with a membrane attached domain. Note that many of the clusters which prevail in the distributions of these four families (clusters 2, 7, 85, 86, 106) are very large (see Table 6.1). So, while membership in individual clusters is not very informative in this case, the complete fingerprint does provide a very useful characterization common to these families, one that distinguishes them from the rest. Additional families of membranous proteins, including neurotransmitter-gated ion-channels and G-protein receptors, have fingerprints that resemble, to varying degrees, the fingerprints shown in (a)-(d).

distance among the clusters' centroids and the clusters' sizes [footnote 4]. Penalties for gaps are high, since even a single omission entails a gap of 25 amino acids. We denote this new way of comparing complete proteins by BMR - "best matching route".
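The comparison of two cluster "words" by dynamic programming can be sketched as follows. The text does not specify whether the alignment is global or local, nor the form of the gap penalty, so this sketch assumes a global alignment with a fixed penalty per gap. The function name `bmr_score` is hypothetical, and the similarity between two clusters is passed in by the caller as `score(i, j)` rather than fixed here.

```python
def bmr_score(word_a, word_b, score, gap_penalty):
    """Best global alignment score between two sequences of cluster numbers.

    word_a, word_b : lists of cluster numbers, one per protein segment
    score(i, j)    : similarity between clusters i and j (caller-supplied)
    gap_penalty    : positive cost charged for each skipped segment
    Global alignment and a linear gap cost are assumptions of this sketch.
    """
    n, m = len(word_a), len(word_b)
    # prev[j] = best score aligning the first i-1 clusters of word_a
    #           with the first j clusters of word_b
    prev = [-gap_penalty * j for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [-gap_penalty * i] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + score(word_a[i - 1], word_b[j - 1]),  # match clusters
                prev[j] - gap_penalty,                               # skip in word_b
                cur[j - 1] - gap_penalty,                            # skip in word_a
            )
        prev = cur
    return prev[m]
```

For instance, with a toy score of 2 for identical clusters and -1 otherwise, and a gap penalty of 3, the words [45, 86, 106] and [45, 106] align with one gap for a total of 2 - 3 + 2 = 1.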

A multiple alignment of the different members of the acetylcholine receptor (AChR) family using this representation is shown in Fig. 6.11. An examination of the string of clusters' numbers already shows a subdivision within the acetylcholine receptors, dividing alpha, beta, delta, epsilon and gamma subunits into small related groups. Few segments are common to all subunits,

Footnote 4: The charge $s(i,j)$ for switching from cluster $i$ to cluster $j$ is taken as

$$s(i,j) = c \cdot \log\big(p(i)\,p(j)\big) + d(i,j) \qquad (6.1)$$

Here $p(i)$ and $p(j)$ are the clusters' relative sizes, and $d(i,j)$ is their Euclidean distance. The (negative) constant $c$ is optimized for best performance. This measure accounts for the difference in clusters' sizes and the fact that the presence of a small cluster in the clusters' sequence is more significant than the presence of a large cluster. The term $d(i,j)$ can be rewritten as $c \cdot \log p(i,j)$, where $p(i,j)$ can be interpreted as the probability that cluster $i$ is replaced with cluster $j$. Equation 6.1 can then be rewritten as $c \cdot \log \frac{p(i,j)}{p(i)\,p(j)}$. Therefore, this score is implicitly a log odds score. It is positive when cluster $i$ is more likely to replace cluster $j$ than what is expected simply by chance.


but most of them are common to subsets of these different subunits. A visualization of the multiple alignment of the complete proteins is not practical in this case (the average length of proteins in this family is 500 amino acids), and may lack the clarity needed to understand the complicated connections between the different subunits. In contrast, this new representation of complete proteins significantly reduces the details, while maintaining the important information. The BMR algorithm can be applied to generate a quantitative measure of similarity among the AChR subfamilies.

Access code | Length | (No. of segments) | Clusters in alignment columns 1-21 ("." denotes a gap)

acha_bovin. 457 (18) 45 86 106 . 60 89 4 3 6 6 1 53 53 94 94 2 2 106 103
acha_chick. 456 (18) 45 86 106 . 60 89 4 3 5 5 1 53 53 94 94 7 106 85 102
acha_human. 482 (19) 45 86 85 4 60 89 4 3 6 4 1 53 53 59 94 3 2 85 103
acha_mouse. 457 (18) 45 86 106 . 60 89 4 3 6 6 1 53 53 94 94 2 2 103 103
acha_najna. 104 (4) 4 6 6 6
acha_natte. 104 (4) 3 6 6 6
acha_rat. 457 (18) 45 86 106 . 60 89 4 3 6 6 1 53 53 94 94 2 2 103 103
acha_torca. 461 (18) 92 93 106 . 60 89 4 3 5 5 1 53 53 59 94 106 85 106 103
acha_torma. 461 (18) 92 93 7 . 60 89 4 3 5 5 1 53 53 59 94 106 85 106 103

achb_bovin. 505 (20) 85 85 7 . 60 7 4 3 2 3 106 86 86 53 94 4 6 4 106 58 58
achb_human. 501 (20) 85 86 106 . 60 106 3 3 3 4 106 86 86 53 94 6 6 4 85 58 3
achb_mouse. 501 (20) 86 86 106 . 60 7 4 3 2 3 106 86 86 53 94 6 6 4 58 58 58
achb_rat. 501 (20) 86 86 106 . 60 106 4 3 3 3 106 86 86 53 94 5 6 6 58 58 58
achb_torca. 493 (19) 86 86 106 . 60 43 4 2 2 3 . 85 53 53 94 4 3 3 85 58 3

achd_bovin. 516 (20) 106 86 42 . 60 43 4 2 103 3 1 86 53 46 44 106 85 106 3 3 4
achd_chick. 513 (20) 106 85 43 . 60 4 6 3 4 4 1 53 53 46 44 3 7 106 2 2 4
achd_human. 517 (20) 7 86 43 . 60 4 6 3 103 3 1 53 53 46 106 106 85 85 2 3 4
achd_mouse. 520 (20) 7 86 43 . 60 3 4 3 103 2 1 86 53 46 44 7 106 106 2 3 3
achd_rat. 517 (20) 7 86 42 . 60 89 4 3 103 3 1 86 53 46 44 106 85 106 2 3 4
achd_torca. 522 (20) 3 86 43 . 6 4 6 4 103 4 1 53 53 46 3 7 106 7 7 106 3
achd_xenla. 521 (20) 3 86 43 . 60 3 4 3 103 2 85 53 53 46 2 7 85 106 3 3 3

ache_bovin. 491 (19) 2 103 106 . 60 7 3 2 3 3 85 86 53 46 106 . 86 106 3 89 92
ache_human. 493 (19) 7 103 106 . 60 43 3 3 4 3 85 86 53 46 106 . 85 106 2 92 92
ache_mouse. 493 (19) 2 103 106 . 60 43 4 3 3 3 85 86 53 46 85 . 86 106 2 89 92
ache_rat. 494 (19) 2 103 85 . 60 43 4 2 3 3 85 86 53 46 85 . 86 106 2 89 92

achg_bovin. 519 (20) 2 102 43 . 60 60 6 3 103 7 85 86 53 53 106 2 106 7 4 2 2
achg_chick. 514 (20) 7 102 43 . 60 43 6 3 2 2 85 86 53 53 2 3 85 106 2 89 3
achg_human. 517 (20) 106 102 43 . 60 60 6 2 103 7 85 86 53 53 85 106 7 106 4 3 3
achg_mouse. 519 (20) 103 102 43 . 60 60 6 2 103 106 85 86 53 53 106 7 7 106 2 7 3
achg_rat. 519 (20) 103 102 43 . 60 3 6 2 103 7 85 86 53 53 106 2 7 106 2 2 3
achg_torca. 506 (20) 86 86 43 . 60 43 6 4 4 4 85 53 86 46 106 7 7 7 89 89 92
achg_xenla. 510 (20) 85 102 43 . 60 2 3 2 2 3 85 86 53 53 106 7 85 106 3 3 3

Figure 6.11: Multiple alignment of the acetylcholine receptor family. 32 members of the acetylcholine receptor family are shown, divided into alpha, beta, delta, epsilon and gamma subunits. The proteins are represented by the sequence of clusters into which their segments fall, and aligned using the BMR algorithm for pairwise comparison (see text for details). In columns where there are only a few predominant clusters, these clusters are colored. The color intensity is proportionate to the degree of conservation of a cluster within an aligned column in these 32 proteins. Only a few columns are colored for illustration. This representation emphasizes the regions that are shared by all subunits (see 5th column), and those that are shared by some of the subunits. For example, in the 3rd column, alpha, beta and epsilon share almost the same segment, while delta and gamma subunits share a slightly different one. However, clusters 43 and 106 are geometrically close, evidence of the strong relation between these segments. Likewise, in the 14th column, beta and gamma share a common segment, while delta and epsilon share a different one. Still, these segments are close (the distance between clusters 46 and 53 is small), thus indicating the common nature they all share as members of the same "mother" family.

As Fig. 6.11 demonstrates, this representation maintains the original nature of full-length proteins, and may be used towards a more refined classification of families. It is intriguing to find out whether it reveals other interesting features of proteins. We tested this new method on full-length proteins in comparison with the SW algorithm (Table 6.2). We translated all proteins in the database into sequences of characters in the alphabet of 106 characters and compared each protein against the database, using the BMR algorithm, in search of related proteins. The quality of performance was estimated by taking a single member from each family in PROSITE, comparing it


against all the database, and identifying its related proteins in the family. Identification was based on the following "equivalence number" identification criterion [Pearson 1995]: Define the cutoff score as the similarity score that balances the number of related sequences below it and the number of unrelated sequences with score above it (i.e. the score where the number of false positives equals the number of false negatives). Only proteins with score at or above the cutoff score are considered as identified. The results were compared against the SW algorithm with the blosum62 scoring matrix and values -10,-1 for gap penalties, currently considered the best method known [Pearson 1995].
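The equivalence number criterion can be sketched as follows. Walking down the ranked list of hits, each step either adds one false positive or removes one false negative, so the two counts are equal exactly when the number of hits taken equals the number of related sequences. The function name `equivalence_number` and the input format are hypothetical, and ties in score are broken arbitrarily by input order, which is a simplification of this sketch.

```python
def equivalence_number(hits):
    """Apply the equivalence number criterion to one query's hit list.

    hits : iterable of (score, is_related) pairs, one per database sequence,
           where is_related marks members of the query's family.
    Returns (cutoff_score, identified, false_positives); at the cutoff the
    number of false positives equals the number of false negatives.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    n_related = sum(1 for _, related in ranked if related)
    # Taking exactly n_related top hits balances the two error counts.
    top = ranked[:n_related]
    identified = sum(1 for _, related in top if related)
    false_positives = n_related - identified  # also the false negatives
    cutoff = top[-1][0] if top else None
    return cutoff, identified, false_positives
```

The second returned value corresponds to the number of family members counted as identified for the query.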

Family | No. of Proteins | BMR | SW | Query
Rubisco Large | 224 | 222 | 212 | P35214
Tubulin | 164 | 160 | 140 | P02568
Egf | 119 | 54 | 53 | P07246
Actins 1 | 106 | 106 | 94 | P25160
Gapdh | 102 | 97 | 95 | P17336
Hsp70 1 | 101 | 86 | 85 | P22879
Histone H2a | 59 | 59 | 54 | P19140
Chaperonins Cpn60 | 55 | 52 | 43 | P19866
Lactalbumin Lysozyme | 54 | 51 | 43 | P08992
Histone H2b | 53 | 52 | 51 | P16868
Metallothionein Cl1 | 46 | 43 | 43 | P02303
Reca | 42 | 36 | 29 | P29843
Pglycerate Kinase | 39 | 31 | 31 | P18564
Tropomyosin | 36 | 33 | 26 | P05697
Chalcone Synth | 36 | 36 | 35 | P00705
Ribosomal S12 | 34 | 25 | 24 | P25336
Histone H3 2 | 31 | 27 | 27 | P09862
Catalase 1 | 30 | 28 | 25 | P14717
C2 Domain | 27 | 18 | 18 | P27362
Pal Histidase | 25 | 17 | 13 | P14714
1433 1 | 25 | 21 | 18 | P29254
Photosystem I Psaab | 24 | 23 | 22 | P11383
Enolase | 24 | 22 | 17 | P26348
Arf | 24 | 19 | 18 | P10851
Trp Synthase Beta | 19 | 19 | 19 | P33421
Phytochrome | 19 | 17 | 16 | P08562
Tfiid | 15 | 15 | 14 | P13393
Biopterin Hydroxyl | 15 | 14 | 14 | P20077

Table 6.2: Performance of BMR (best matching route) compared with SW. Some of the families on which BMR did at least as well as SW are shown. For each such family, we show the number of proteins in the family longer than 50 amino acids (2nd column), and the number of proteins from the family which were identified using the "equivalence number" identification criterion [Pearson 1995], in either method (3rd and 4th columns). Accession numbers for the queries are given in the last column. See text for further details.

Even with its simplistic approach, BMR competed successfully with SW on about 80 families of varying sizes (see Table 6.2). The performance of the BMR method is superior for families that are well-characterized in terms of structure or function, since these families fall into small clusters which receive a high score (see footnote no. 4).


6.2 Discussion

The Euclidean embedding approach suggests a novel method for self organizing complex data such as protein sequences. This procedure allowed us to perform a full scale comparison of nearly 40,000 proteins. The only biological information utilized by our method came in the form of a reliable and computable local metric. Specifically, we employed the SW dynamic programming algorithm with the blosum62 similarity matrix and -10,-0.5 for penalizing sequence gaps, considered the current best sequence comparison algorithm [Pearson 1995]. Given the pairwise distances among protein segments, all segments are carefully clustered into statistically significant families.

So far, our work has yielded a classification into only 106 classes (Fig. 6.2, Table 6.1). Yet, even with this small number of clusters, we found many significant biological signatures. Some known families of proteins were clustered into well distinguished clusters, and other clusters match well known motifs and domains within proteins. Kinship of protein families could be inferred from the clustering tree: different families which were clustered into the same cluster, or split from the same ancestor cluster, may share some biological features. A similarity measure emerges for full-length proteins as well. Proteins can be characterized by their sequence of clusters. This representation leads to a quantitative comparison measure between full-length proteins, based on the best matching route (BMR) of clusters, and in some instances our comparison method (BMR) outperformed the currently best sequence comparison method (SW).

However, in view of the 700 PROSITE families, a more refined classification seems desirable. We have tested versions of the clustering process, where cross-validation is applied only once in a number of splitting phases. More permissive cross-validation procedures may still yield meaningful, more refined classifications. The outcome of one such procedure is shown in Fig. 6.12. Starting from the above 106 clusters, clusters with high aspect-ratios were split and cross-validation was performed only subsequently, when the number of clusters reached 150. Four clusters failed the cross-validation test, and their segments were returned to the general pool. Thus 146 clusters were obtained, all satisfying the same cross-validation criteria. 16 of the original clusters that underwent further splitting are shown, resulting in 41 subclusters. Clearly, both small and large clusters were affected (compare with Table 6.1). This procedure also verifies the stability of relations between protein families that can be suggested based on the tree of 106 clusters. It indicates a weak relation in cases where the participating families were set apart, and a strong relation when they remained together. When applied to the 146 clusters, BMR did better than SW on 11 additional families, indicating further potential for this method.

Our standard yardstick here is the PROSITE classification. While this is a major reference against which results such as ours ought to be checked, certain shortcomings of the PROSITE classification should be kept in mind: Only 46% of the proteins are currently classified in PROSITE. Moreover, the classification is often determined on the basis of very short subsequences, less than 10 residues in some cases, which often represent various signals or very local, small sites, and not necessarily structural or functional domains. Besides, most of the families are small, containing only a few members each (over 80% of the families have fewer than 30 members, see Fig. 6.1).


[Figure 6.12: tree diagram. The phase I clusters 42-43, 45-53 and 58-61 are shown in a rectangular box, with their phase II subclusters below. Legend: number of families per cluster (1, 2-5, 6-20).]

Figure 6.12: Second phase splitting with delayed cross-validation. Further splitting is performed, under the same strict criteria for stable splits (see text). However, at this phase, these criteria are not verified at every step, so cross validation is carried out only after all splittings are performed. The original (phase I) 106 clusters yield 146 clusters. This figure shows the more refined tree structure for the clusters numbered 42-53 and 58-61 in the first phase (total of 55,599 segments). Only 6 clusters (numbers 43, 44, 49, 53, 59, 61) remained intact. The other 10 split into 35 subclusters. The rectangular box shows these clusters at the end of phase I, and the resulting subclusters are below. The leaves of the tree show 35 of the 146 clusters in phase II. Clusters that represent only a few families are denoted by a small filled box at the leaves. Note that for the clusters of the second phase, this splitting resulted in clusters of only one family (13 cases), and 2-5 families (7 cases). The other clusters are still very big. Clearly, both small and large clusters were affected. For instance, the subclusters originating from cluster 48 each represent a single PROSITE family. Some of the highly complex clusters are also affected, e.g., clusters 45-47 and 58-61, each with 150-250 families.

Our strict criteria for validity stop the clustering process short of complete resolution, thus many small families are "lost" in bigger clusters. We can expect further progress when more proteins from small families get sequenced.

Besides the immediate information about biological patterns that can be derived from the clusters, they yield additional insight into the classification of protein families. Protein families have characteristic distributions among the clusters, which we call "fingerprints". While most of the 106 obtained clusters correspond to a single functional protein family, most segments belong to very few large, non-specific clusters. Still, the fingerprints of families that do not correspond to a single cluster are characteristic enough to distinguish important functional protein families. Comparisons between fingerprints of distinct PROSITE families yield similarity indices of both statistical and biological significance, where families of similar fingerprints turn out to have similar biological roles. Such indices can be helpful in defining families and super-families.

Our segment clustering approach provides an elegant, higher level representation of protein sequences. The choice of segment length is made according to the length of patterns in the PROSITE classification, most of which are between 5 and 40 amino acids long. The choice of 50-mer fragments is also consistent with structural features in proteins, since many folds consist of 20-80 amino acids. Still, performing our procedure at other segment lengths may yield a different granularity and eventually new interesting insights on other classes of proteins.


Chapter 7

Part II - The graph-based approach

7.1 Preface

The Euclidean embedding approach has added much to our understanding of the protein sequence space. This methodology enabled us to perform a full-scale analysis of all protein sequences in the SWISSPROT database, without incorporating any prior biological information. A few aspects made this approach particularly attractive: (i) In Euclidean space we can apply efficient self-validated clustering algorithms. (ii) The clusters' centroids offer a natural choice of clusters' representatives, whereas otherwise there is no satisfactory general solution. (iii) Geometrical aspects such as direction and shape are of great importance, and indeed we made use of these aspects in designing the clustering algorithm. For example, the hierarchy is obtained by splitting at each iteration the cluster with the largest aspect ratio. Furthermore, the geometry of the clusters is taken into account when we compare the clusters of the first set with the clusters of the second set, in order to determine agreement.

Indeed, the results so far seem promising. The global organization reveals many clusters that correspond to significant biological patterns, and it induces new measures of similarity for protein families and for protein sequences. However, this approach has some drawbacks that should be mentioned:

• The dimensionality of the problem is probably inherently high. Therefore, even with optimal embedding, mapping this very complex space into a low-dimensional Euclidean space will result in a certain amount of distortion.

• The embedding problem is a hard problem. The optimal algorithm which embeds the metric space into a Euclidean space with minimal distortion is computationally intensive, and is not applicable to more than a few dozen sequences. Therefore, we resorted to using efficient but suboptimal algorithms. These algorithms may suffer from additional distortion [footnote 1].

Footnote 1: It is not clear how and to what extent the distortion affects the global geometry. As mentioned in section 4.4, the distortion may be correlated with the recovery of a global structure. Though the position of the image of a

96 Chapter 7. The graph based approach

• Perhaps the most distressing issue is that a metric space is a must. However, no effective distance measure for complete protein sequences is known (see section 4.1). In particular, distance measures among sequences that are based on dynamic programming are overly sensitive to differences in lengths among proteins. Therefore, we chose segments of 50 amino acids, and not complete proteins, as our basic building blocks. However, this solution raised other problems, such as: What is the "correct" way to segment complete proteins into domains? Since protein domains are of various lengths, fixing the segment length may not be the optimal solution. Moreover, in general one may wish to cluster protein sequences based on other similarity measures which may not be translated into distances. In such cases, the embedding algorithm of the previous section can no longer be applied.

The graph-based approach presents a different methodology for solving the problem of global organization of protein sequences. This approach explores order within the sequence space without embedding the data in Euclidean space. It focuses on complete sequences, and clusters the sequences based on their pairwise dissimilarities. The core of this approach is based on a faithful graph representation of the sequence space, and a two-phase clustering algorithm. This approach is described in detail in the next chapter. However, a short introduction to pairwise clustering algorithms is presented first.

7.2 Pairwise clustering algorithms

Pairwise clustering algorithms are of great interest when the data is given as a proximity matrix

(pairwise similarities/dissimilarities). However, they are not con ned to proximity data, and obvi-

ously can b e applied to vector data as well (simply by rst calculating the distances b etween the

vectors).

Pairwise clustering algorithms are often called graph-based clustering algorithms. In terms of

graph theory the sample set is represented as a weighted directed graph G = (V;E ), where the

no des of the graph (the set V ) are the sample p oints and an edge is formed b etween every pair

of no des (the set of all no des is denoted by E ). The weight of an edge w (i; j ) is a function of

the similarity/dissimilarity b etween the no des i and j . As describ ed b elow, the steps of a pairwise

clustering algorithms are equivalent to op erations on the graph (e.g., deleting/adding edges) and

the resulting clusters are equivalent to sets of no des in the graph with a sp eci ed prop erty.

Almost all pairwise clustering algorithms share the same basic hierarchical scheme, which par-

titions the data into nested groupings. Traditionally these algorithms were called hierarchical

2

clustering . These hierarchical pro cedures can b e divided into two classes: agglomerative and

sp eci c protein segment in the Euclidean space re ects its relative distance from all the other protein segments, no

constraints are set on (unknown) higher order structures within the space. It would b e interesting to check how

incorp oration of distributional constraints such as entropy (e.g., as demonstrated for multi-dimensional scaling by

[Klo ck & Buhmann 1997 ]) will a ect the emb edding.

2

As was mentioned in chapter 5, the term \hierarchical clustering" was extended to also describ e hierarchical

schemes pro ducing non-nested groupings (e.g., the LBG and the VQ algorithms).

Chapter 7. The graph based approach 97

divisive. Agglomerative pro cedures start from singleton clusters and successively merge clusters.

Divisive pro cedures start with all the samples in one cluster and successively split clusters. The

agglomerative scheme requires fewer computations to pro ceed from one level to another, and there-

fore it is used more frequently. Here I will fo cus on this scheme (a few divisive algorithms were

already describ ed in chapter 5).

Given n samples with pairwise dissimilarities d(i, j) (1 ≤ i, j ≤ n), the basic agglomerative clustering is described in the following procedure:

Denote by k the number of clusters, and let k_0 be the desired number of clusters.

• Let k = n and initiate the clusters C_i = {i}, i = 1..n

• Loop:

  – If k ≤ k_0, stop.

  – Find the nearest (least dissimilar) pair of distinct clusters, say C_i and C_j.

  – Merge C_i and C_j and decrement k by one.

  – Go to loop.

Letting k_0 = 1 terminates the process when all samples are classified to one cluster. But usually a threshold is set and the process terminates at the level where the distance between the nearest clusters exceeds the threshold. At any level the distance between the nearest clusters can provide the dissimilarity value for this level.
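The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names, the list-of-lists dissimilarity matrix and the toy data are assumptions made for the example, and the naive search for the nearest pair is far too slow for large data sets. The cluster-distance function is passed as a parameter, so the same loop serves for the single, complete and average linkage measures.

```python
from itertools import combinations


def single_link(d, a, b):
    """Smallest pairwise dissimilarity between clusters a and b."""
    return min(d[i][j] for i in a for j in b)


def complete_link(d, a, b):
    """Largest pairwise dissimilarity between clusters a and b."""
    return max(d[i][j] for i in a for j in b)


def average_link(d, a, b):
    """Mean pairwise dissimilarity between clusters a and b."""
    return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerate(d, k0=1, threshold=None, linkage=single_link):
    """Merge the nearest pair of clusters until only k0 clusters remain or,
    if a threshold is given, until the nearest pair is farther apart than
    the threshold. d is a symmetric n x n list of lists (hypothetical input
    format). Returns the clusters and the dissimilarity level of each merge.
    """
    clusters = [frozenset([i]) for i in range(len(d))]  # C_i = {i}
    levels = []
    while len(clusters) > k0:
        # Find the nearest (least dissimilar) pair of distinct clusters.
        dist, a, b = min(
            ((linkage(d, a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        if threshold is not None and dist > threshold:
            break
        # Merge the pair; the number of clusters drops by one.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a | b)
        levels.append(dist)
    return clusters, levels


# Toy data (an assumption for illustration): five points on a line.
positions = [0, 1, 2.5, 7, 8]
d = [[abs(a - b) for b in positions] for a in positions]
clusters, levels = agglomerate(d, threshold=2)
# clusters now holds two groups: {0, 1, 2} and {3, 4}
```

Passing `linkage=complete_link` or `linkage=average_link` changes only the way the distance between two clusters is measured; the merging loop itself is unchanged.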

There are different ways to measure the distance between clusters. Each method can result in an algorithm with different properties. I will focus on three main distance functions and their corresponding algorithms.

7.2.1 The single linkage algorithm

This algorithm (also called the nearest neighbor algorithm) uses the following distance measure between clusters:

    d_{min}(C_i, C_j) = \min_{i \in C_i, j \in C_j} d(i, j)

i.e., the distance is defined as the least of all of the possible pairwise distances between the clusters. If data points are represented as nodes of a graph, merging is equivalent to forming an edge between the nearest pair of nodes i and j (in C_i and C_j respectively), whose length (or weight) w(i, j) equals the dissimilarity of the corresponding pair of points. Since the clusters are distinct, the resulting graph never has circuits, i.e., each of its connected components is a tree. If k_0 = 1 (i.e., the merging continues until all samples belong to one cluster), then the tree is a spanning tree, which contains a path from any node to any other node (since at each stage the closest clusters are merged, it is easy to see that the generated tree is a minimal spanning tree, i.e., the sum of the edge lengths is the minimum over all possible spanning trees for this set of samples). If a threshold is set, and the clustering stops when the minimum distance between two clusters exceeds this threshold, the resulting graph splits into connected components.

This procedure is effective when the clusters are compact and well separated. However, the algorithm is very sensitive to noise or to slight changes in the distances between points. For example, in the presence of a few outliers the algorithm may produce a bridge between unrelated clusters. Thus "elongated" unnatural clusters can be formed (the "chaining effect").
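The chaining effect is easy to reproduce on toy data. The sketch below relies on the observation made above that single linkage stopped at a threshold yields the connected components of the graph that keeps every edge no longer than the threshold. The function name, the union-find bookkeeping and the point sets are assumptions made for illustration only.

```python
def threshold_components(d, threshold):
    """Connected components of the graph that keeps every edge (i, j) with
    d[i][j] <= threshold. These are the clusters single linkage produces
    when it stops at that threshold."""
    n = len(d)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= threshold:
                parent[find(i)] = find(j)  # join the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def line_distances(positions):
    """Pairwise distances between points on a line (toy data)."""
    return [[abs(a - b) for b in positions] for a in positions]


# Two compact, well separated groups are recovered as two clusters.
separated = threshold_components(line_distances([0, 1, 2, 10, 11, 12]), 2.5)

# Three outliers between the groups form a bridge: everything is chained
# into one elongated cluster.
bridged = threshold_components(
    line_distances([0, 1, 2, 4, 6, 8, 10, 11, 12]), 2.5
)
```

Here `separated` contains two clusters of three points each, whereas `bridged` contains a single cluster of all nine points, although the two original groups are exactly as far apart as before.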

7.2.2 The complete linkage algorithm

This algorithm (also called the furthest neighbor algorithm) uses the following distance measure between clusters:

    d_{max}(C_i, C_j) = \max_{i \in C_i, j \in C_j} d(i, j)

In this case the distance between two clusters is determined by the most distant pair of points. Hence, when two clusters are merged, the distances between all other pairs of points in the resulting cluster are no larger than the distance between this pair. In terms of graph theory, this procedure will produce a graph in which each cluster is a complete subgraph, i.e., all nodes within the cluster are connected by edges. As for the single linkage algorithm, this algorithm is effective for compact clusters. The growth of elongated clusters is discouraged, which is a welcome effect on the one hand. On the other hand, if the true clusters are not very compact, the resulting clusters may be meaningless.

7.2.3 The average linkage algorithm

Both the single-linkage and the complete-linkage algorithms are very sensitive to noise and to outliers. The average linkage algorithm uses the average distance between points from the two clusters to define their distance,

    d_{ave}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{i \in C_i} \sum_{j \in C_j} d(i, j)

(where n_i and n_j are the numbers of points in C_i and C_j), which is a compromise between the other two measures d_min and d_max. Consequently, this algorithm tends to generate more stable clusterings. When sample points are vectors in a real normed space, another measure of distance which produces similar results, with many fewer computations, is

    d_{mean}(C_i, C_j) = \| \vec{m}_i - \vec{m}_j \|

where \vec{m}_i (\vec{m}_j) is the center of cluster C_i (C_j).

It should be noted that the above procedures do not necessarily produce partitions which optimize any particular criterion function (recall the discussion in chapter 5), though in some cases (for some criterion functions) it is possible to obtain a stepwise-optimal procedure by merging, at each level, the pair of clusters that would increase (or decrease) the criterion function as little as possible (see [Duda & Hart 1973]).


7.2.4 Other pairwise clustering algorithms

More sophisticated clustering schemes have been developed over the years. Some of them are based on graph theory. For example, the minimal cut clustering algorithm [Wu & Leahy] uses the notion of cuts in graphs. A cut (A, B) in a graph G(V, E) is a partition of V into two disjoint sets of vertices A and B. The capacity of a cut is the sum of the weights of all edges that cross that cut, i.e.,

    cut(A, B) = \sum_{i \in A, j \in B} w(i, j)

There may be more than one possible cut between two given vertices. The minimal cut is the one which has the minimal capacity. The minimal cut clustering algorithm starts with a preliminary step in which weights (the pairwise similarities) are transformed to the maximal flow between the corresponding vertices^3. The new weights reflect the overall possible intermediate connections between vertices (a high flow value indicates strong interaction). Based on these new weights, the algorithm divides the graph into subgraphs of connected components, by removing edges of low capacity, in order of increasing capacity, until an external threshold value is met. Thus, the largest inter-subgraph maximum flow is minimized^4.

Other transformations of distances/similarities have been proposed in the literature. Some transformations are local and utilize near neighbor information. For example, the mutual neighborhood clustering algorithm [Smith 1993] defines the weight of the edge between sample i and sample j as the sum of their mutual neighborhood ranks, i.e., if j is the n-th neighbor of i and i is the m-th neighbor of j, then the weight is defined as m + n. Other transformations take into account global properties of the graph (as in the case of the minimal cut algorithm). Clearly, different transformations can lead to different clustering results.
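As an illustration of such a local transformation, the mutual neighborhood weights described above can be computed from a dissimilarity matrix as in the following sketch. The function name and the tie-breaking rule (ties are resolved by sample index) are assumptions of this example and are not taken from [Smith 1993].

```python
def mutual_neighborhood_weights(d):
    """Return the matrix of mutual neighborhood ranks m + n, where j is the
    n-th nearest neighbor of i and i is the m-th nearest neighbor of j.
    d is a symmetric list-of-lists dissimilarity matrix."""
    size = len(d)
    rank = [[0] * size for _ in range(size)]
    for i in range(size):
        # Neighbors of i ordered from nearest to farthest (i itself excluded).
        order = sorted((j for j in range(size) if j != i), key=lambda j: d[i][j])
        for r, j in enumerate(order, start=1):
            rank[i][j] = r  # j is the r-th neighbor of i
    return [
        [rank[i][j] + rank[j][i] if i != j else 0 for j in range(size)]
        for i in range(size)
    ]


# Toy data (an assumption): four points on a line at 0, 1, 3 and 10.
positions = [0, 1, 3, 10]
d = [[abs(a - b) for b in positions] for a in positions]
weights = mutual_neighborhood_weights(d)
# weights[0][1] is 2: the points at 0 and 1 are each other's nearest neighbor.
```

Note that the transformed weights depend only on the ordering of the neighbors and not on the actual dissimilarity values, which is what makes this transformation local.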

^3 According to one of the famous theorems in graph theory, the maximal flow between two vertices equals the capacity of the minimal cut between them. Finding the maximal flow is a well studied problem and there exist efficient algorithms for solving it [Cormen et al. 1990].

^4 In practice, the maximal flow does not need to be calculated for all possible edges. Rather, a maximal spanning tree is constructed (by successively solving n - 1 maximum flow problems, n being the number of vertices), and the connected components are defined from this tree.

The minimum cut algorithm favors cutting small sets of isolated vertices in the graph (since the cut capacity increases with the number of edges crossing the cut). To avoid this bias, the normalized cut algorithm [Shi & Malik 1997] normalizes the capacity of the cut (A, B) by the total association of the set A (which is the sum of weights incident on A), and the total association of B. This disassociation measure is called the normalized cut:

    Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}

where assoc(A, V) = \sum_{i \in A, k \in V} w_{i,k}. The algorithm looks for partitions with a small Ncut value. By this normalization, small sets of isolated points are no longer preferred, since the cut capacity for such sets is a large percentage of the total connection from the small set to all the other nodes in the graph, resulting in a large Ncut value. The graph is partitioned recursively so as to minimize the normalized cut criterion at each level until Ncut exceeds a pre-defined threshold^5. The criterion function is minimized, approximately, by reformulating the problem and solving it as a generalized eigenvalue problem.
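The difference between the two criteria can be checked numerically on a small graph. The following sketch evaluates the cut capacity and the normalized cut directly from their definitions; it does not implement the eigenvalue-based optimization. The function names, the list-of-lists weight matrix and the example graph are assumptions made for illustration.

```python
def cut(w, A, B):
    """Capacity of the cut: total weight of edges crossing from A to B."""
    return sum(w[i][j] for i in A for j in B)


def assoc(w, A, V):
    """Total connection from the nodes in A to the nodes in V."""
    return sum(w[i][k] for i in A for k in V)


def ncut(w, A, B):
    """Normalized cut of the partition (A, B) of the vertex set."""
    V = list(A) + list(B)
    c = cut(w, A, B)
    return c / assoc(w, A, V) + c / assoc(w, B, V)


def nassoc(w, A, B):
    """Normalized association within the two groups."""
    V = list(A) + list(B)
    return assoc(w, A, A) / assoc(w, A, V) + assoc(w, B, B) / assoc(w, B, V)


# Toy similarity graph (an assumption): two tight pairs {0, 1} and {2, 3}
# joined by a moderate edge, and a weakly attached outlier, node 4.
n = 5
w = [[0] * n for _ in range(n)]
for i, j, weight in [(0, 1, 5), (2, 3, 5), (1, 2, 2), (3, 4, 1)]:
    w[i][j] = w[j][i] = weight

isolate_outlier = ([4], [0, 1, 2, 3])
balanced = ([0, 1], [2, 3, 4])

# The plain cut capacity prefers to split off the outlier ...
assert cut(w, *isolate_outlier) < cut(w, *balanced)
# ... whereas the normalized cut prefers the balanced partition.
assert ncut(w, *balanced) < ncut(w, *isolate_outlier)
```

For any partition of a symmetric graph, `ncut(w, A, B) + nassoc(w, A, B)` equals 2 up to rounding, which is the identity Ncut = 2 - Nassoc noted by Shi and Malik.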

A different approach, based on an analogy with physical magnetic systems (inhomogeneous ferromagnets), is proposed in [Blatt et al. 1997]. Each data point is associated with a "state" (out of a finite number of possible states), and short-range interactions between data points are introduced. The interactions decrease with the distance. In this model, points are able to "interact" with each other and change states, until they reach a steady state. Points may influence each other directly or through other mediating points, hence the mutual influence between points is a collective effect^6. The system has minimal energy when all the samples are in the same state (a single cluster). However, the statistical properties (e.g., the total energy) of this thermodynamic system are affected by the temperature, and such a perfect configuration can exist only at zero temperature. At every finite temperature there is some level of noise, and the state variable s associated with each sample will fluctuate. As a result, the sample points are distributed in different states. The level of the noise decreases as the temperature decreases. Consequently, the fluctuations decrease as well, and clusters of samples of different states may merge (a phase transition). Therefore, as the temperature is decreased the system undergoes a series of phase transitions, and by tracing this process, it is possible to obtain a hierarchical classification of the samples^7.

^5 Shi and Malik [1997] also define a normalized measure of association within groups, namely

    Nassoc(A, B) = \frac{assoc(A, A)}{assoc(A, V)} + \frac{assoc(B, B)}{assoc(B, V)}

Note that

    Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}
               = \frac{assoc(A, V) - assoc(A, A)}{assoc(A, V)} + \frac{assoc(B, V) - assoc(B, B)}{assoc(B, V)}
               = 2 - Nassoc(A, B)

Therefore, minimizing the disassociation Ncut between different groups is equivalent to maximizing the association within groups.

^6 For a given configuration of states, the energy of the system is defined as -\sum J_{ij} \delta_{s_i, s_j}, where the summation is over all interacting pairs of points, s_i (s_j) is the state of the sample i (j), and J_ij is a positive interaction strength between sample i and sample j, which decays with the dissimilarity between the points. The Kronecker delta \delta_{s_i, s_j} equals one if s_i = s_j and zero otherwise. Hence, only interacting samples in the same state are counted.

^7 Recently, an alternative approach was proposed by [Gdalyahu et al. 1998]. Their algorithm is a randomized version of an agglomerative clustering which generalizes the minimal cut algorithm. Their work resembles the physically motivated granular magnet model, but is derived from different concepts and draws on different distributions.

An attractive aspect of the algorithms described in this section is their appeal to intuition. They all share an important feature: Global properties of the graph are taken into account in the clustering process. This feature provides the model with robustness against spurious noise. However, most of these algorithms are computationally intensive and cannot be applied to the space of all protein sequences. Our clustering algorithm tries to achieve the same properties without the need for massive computations. It is also designed to address the specific problems that are inherent in the analysis of protein sequences, and the merging process is closely monitored in an attempt to prevent the transitive chaining that is observed for multi-domain proteins. The algorithm is described in detail in the next chapter.


Chapter 8

The graph based approach - representation and algorithms

We investigate the space of all protein sequences in search of clusters of related proteins. The goal is to automatically detect these sets and thus obtain a classification of all protein sequences.

The graph-based approach uses a graph representation of the sequence space. Sequences are mapped to vertices, and edges between vertices are weighted in a manner that reflects the degree of similarity between the corresponding proteins. To assign weights to edges we start from an all-vs-all comparison of SWISSPROT using the standard measures for sequence similarity (SW, FASTA and BLAST). In a preprocessing step, edges are screened to maintain only those similarities which we believe to reflect true relationships. An effort was made to include as many distant relationships (weak similarities) as possible in the graph, while avoiding meaningless, random similarities.

Clusters of related proteins correspond to strongly connected components of this graph. To identify these sets, we start from a very conservative initial classification based on the highest scoring pairs. The many classes in this classification correspond to protein subfamilies. Subsequently we merge the subclasses using the weaker pairs in a two-phase clustering algorithm. The algorithm makes use of transitivity to identify homologous proteins; however, transitivity is applied restrictively in order to prevent unrelated proteins from clustering together. This process is repeated at varying levels of statistical significance. Consequently, a hierarchical organization of all proteins is obtained.

These methods are discussed in detail in this chapter. The chapter starts with the main guidelines of this approach and a short review of related works, followed by a description of the process of building the graph (especially the process of assigning weights to edges), and the two-phase clustering algorithm.

8.1 Introduction

Analysis of new protein sequences almost always starts with a comparison of the sequence under study with all known sequences, in search of close relatives that may have been assigned a function. In this way properties of a new protein sequence are extrapolated from those of its neighbors. Since the early 70's, algorithms have been developed for comparing protein sequences efficiently and reliably (see chapter 2). But even with the best alignment of two protein sequences, the basic question remains: Do they or do they not share the same biological function? It is generally accepted that two sequences with over 30% identity along much of the sequences are likely to have the same three-dimensional structure or fold [Sander & Schneider 1991, Flores et al. 1993, Hilbert et al. 1993, Brenner et al. 1998]. Proteins of the same fold often have similar biological functions^1. Nevertheless, one encounters many cases of high similarity both in fold and function that is not reflected in sequence similarity [Murzin 1993, Pearson 1997, Brenner et al. 1998]. Such cases are missed by current search methods that only compare sequences.

Detecting homology may often help in determining the function of new proteins. By definition, homologous proteins have evolved from the same ancestor protein. The degree of sequence conservation varies among protein families. However, homologous proteins almost always have the same fold [Pearson 1996]. Homology is, by definition, a transitive relation: If A is homologous to B, and B is homologous to C, then A is homologous to C. This simple observation can be very effective in discovering homology. However, when applied simple-mindedly, this observation leads to many pitfalls.

The common evolutionary origin of two proteins is almost never directly observed. Nevertheless, we can deduce homology, with a high statistical confidence, given that the sequence similarity is significant. This is particularly useful in the so-called "twilight zone" [Doolittle 1992], where the sequence identity is, say, 10-25%. Transitivity can be used to detect related proteins beyond the power of a direct search.

The potential value of transitivity has been observed before. In [Watanabe & Otsuka 1995] transitivity and single linkage clustering are employed to extract similarities among 2000 E. coli protein sequences. In [Koonin et al. 1996] a similar analysis is performed on 75% of the proteins encoded in the E. coli genome. The power of transitivity in inferring homology among distantly related proteins is demonstrated in [Pearson 1997] and [Park et al. 1997]. In [Neuwald et al. 1997] transitivity was combined with a search through the database, for the purpose of modeling a protein family, starting from a single sequence. Transitivity was also used in [Gonnet et al. 1992, Harris et al. 1992, Sonnhammer & Kahn 1994, Sonnhammer et al. 1998].

^1 With the exception of convergent evolution, in which the same fold is shared by non-homologous proteins.

Though transitivity is an attractive concept, it has its perils: Similarity is not transitive, and similarity does not necessarily imply homology^2. Therefore, similarity should be used carefully in attempting to deduce homology. Multidomain proteins make the deduction of homology particularly difficult: If protein 1 contains domains A and B, protein 2 contains domains B and C, and protein 3 contains domains C and D, then should proteins 1 and 3 be considered homologous? This simple example indicates the inadequacy of single-linkage clustering for the purpose of identifying protein families within the sequence space. Some of the works mentioned above have also addressed the perils of transitivity [Harris et al. 1992, Sonnhammer & Kahn 1994, Neuwald et al. 1997].

^2 Similarity may be quantified whereas homology is a relation that either holds or does not hold. Significant similarities can be used to infer homology, with a level of confidence that depends on the statistical significance (see Pearson 1996).
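The multidomain example can be made concrete with a toy sketch. Here two proteins are taken to be "similar" whenever they share a domain, and clusters are formed by the plain transitive closure of this relation (i.e., single linkage). The dictionary of domain contents and the function names are hypothetical and serve only to illustrate the pitfall; real similarities are, of course, derived from sequence comparison.

```python
from collections import deque

# Hypothetical domain content of the three proteins in the example.
domains = {
    "protein 1": {"A", "B"},
    "protein 2": {"B", "C"},
    "protein 3": {"C", "D"},
}


def share_domain(p, q):
    """Toy similarity relation: the two proteins have a domain in common."""
    return bool(domains[p] & domains[q])


def transitive_closure(items, related):
    """Group items into clusters by following the relation transitively
    (breadth-first search over the similarity graph)."""
    clusters, seen = [], set()
    for start in items:
        if start in seen:
            continue
        seen.add(start)
        cluster, queue = [], deque([start])
        while queue:
            p = queue.popleft()
            cluster.append(p)
            for q in items:
                if q not in seen and related(p, q):
                    seen.add(q)
                    queue.append(q)
        clusters.append(sorted(cluster))
    return clusters


clusters = transitive_closure(sorted(domains), share_domain)
# Proteins 1 and 3 have no domain in common ...
assert not share_domain("protein 1", "protein 3")
# ... yet the transitive closure places all three in a single cluster.
assert clusters == [["protein 1", "protein 2", "protein 3"]]
```

Any restricted use of transitivity has to break such chains, for example by requiring more than a single connecting protein before two groups are merged.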

Expert biologists can distinguish significant from insignificant similarities. However, the sheer size of current databases rules out an exhaustive manual examination of all potential homologies. Our goal here is to develop an automatic method for classification of protein sequences based on sequence similarity, through the detection of groups of homologous proteins (clusters), and high level structures (groups of related clusters) within the sequence space. Such organization would reveal relationships among protein families and yield deeper insights into the nature of newly discovered sequences.

8.1.1 Related works

Several large-scale studies of protein sequences were carried out during the last few years. As discussed in chapter 3, these studies can be generally divided into two classes: the motif-based studies and the protein-based studies. The relevant studies are briefly summarized here.

Motif and domain based analyses

These studies are focused on finding significant patterns within protein sequences. Such analyses yielded important databases of motifs and domains, e.g. PROSITE, Blocks, PRINTS, ProDom, Pfam and Domo (see section 3.1 for a brief description of each study). Most of them require groups of prealigned sequences or predefined groups of related proteins (PROSITE, Blocks, PRINTS, Pfam). Some involve manual calibration (PROSITE, PRINTS, Pfam) while others are fully automatic. Two of these studies make use of transitivity to enhance sensitivity. ProDom applies the transitive closure of high-scoring segment pairs obtained by BLAST (when the common segments overlap above a minimum overlap parameter). In the Pfam database, the construction of new families starts from an HMM model derived from a multiple alignment of related proteins, which is then improved iteratively by searching for further related sequences in the database. These sequences are iteratively incorporated into the model, until the process converges. After each iteration the alignment is checked manually to avoid misalignments.

Protein based analyses

These are analyses which apply to whole protein sequences. Among these we include the studies by [Gonnet et al. 1992, Harris et al. 1992, Watanabe & Otsuka 1995, Koonin et al. 1996, Barker et al. 1996, Tatusov et al. 1997, Krause & Vingron 1998]. All these works cluster the input database using transitive closure (i.e. single linkage clustering) of pairwise similarity scores (see section 3.2 for details). Among these studies, three [Harris et al. 1992, Tatusov et al. 1997, Barker et al. 1996] have addressed the problem of multi-domain proteins. Harris et al. allow groups to merge only if they share k overlapping regions. However, they concluded that k=1 is the best choice for highest accuracy. Thus their clustering procedure essentially remains a single-linkage clustering (in multi-domain proteins, regions are classified to multiple classes). In the second study [Tatusov et al. 1997], clusters are created starting from triangles formed by three homologous proteins from different species. Triangles which share an edge are merged (this requirement reduces the probability that unrelated clusters merge). An additional (manual) step is carried out to split clusters which are incorrectly merged due to multi-domain proteins. The third study [Barker et al. 1996] distinguishes between families and superfamilies, based on global similarities, and homology domains, based on local similarities. Both classifications involve careful manual inspection.

Our approach

The important role of motifs and domains in defining a protein's function is unquestionable: detecting a known motif within a new protein sequence can help reveal its function and lead to the correct assignment of the new sequence to an existing protein family. Indeed, domain-based studies have added much to our knowledge. The domain-based databases usually offer a lot of biologically valuable information about domains and the domain structure of proteins, through multiple alignments and schematic representations of proteins. However, in many cases, characterizing a new protein only by its domain content is insufficient. This happens, for example, when no known domains are apparent in the new protein. In some instances, only a few related sequences are available, too few to define a reliable prototype signature or a profile of the common domains. Therefore, a proper analysis of a new protein sequence should incorporate comparisons against domain-based databases, as well as sequence databases. In this view, an analysis which identifies groups of related proteins in databases of protein sequences is invaluable. It may amplify the outcomes of a database search. When close hits are already grouped together based on mutual similarity, this may highlight a similarity with a group which could otherwise be missed by a simple manual scanning. Moreover, if several groups are found related to the query sequence, this may indicate the existence of several distinct functional/structural domains. If all groups share the same region of similarity, this adds insights about the relations between the different groups. In some cases, this may suggest that the groups belong to the same family or superfamily. Some of this information, but not all of it, can also be inferred from comparisons with domain databases. Moreover, unlike most domain-based methods, protein-based methods can be applied to all protein sequences, and are more easily automated.

Our graph-based approach attempts to define a classification of whole protein sequences. It draws on pairwise similarities and seeks clusters of related proteins. It applies a moderate version of transitive closure, in an attempt to eliminate chance similarities and avoid indirect multiple-domain-based connections. The result of our study is a hierarchical organization of all known protein sequences. The classification uses only standard similarity scores and does not depend on further biological information.

8.2 Methods

This section contains a description of our computational procedure. The procedure was carried out on the SWISSPROT database [Bairoch & Boeckman 1992] release 33, with a total of 52205 proteins and 18,531,385 amino acids^3.

8.2.1 Defining the graph

We represent the space of protein sequences as a directed graph, whose vertices are the protein sequences. An edge between two vertices is weighted to reflect the dissimilarity between the corresponding pair of sequences, i.e. a high similarity translates to a small weight. To compute the weight of the directed edge from A to B, one compares sequence A against all sequences in the SWISSPROT database, including sequence B, and obtains a distribution of scores. The weight is taken as the expectation value of the similarity score found for A and B, based on this distribution (see chapter 2). This is an estimate of the number of times the corresponding score could have been obtained by chance, i.e. when comparing with random sequences drawn from the same background distribution (usually defined as the distribution of amino acids over the whole database). A low expectation value reflects a significant, strong connection, whereas a high expectation value reflects an insignificant, weak connection. Not all edges are retained in the graph, as edges of statistically insignificant similarity scores are discarded (details below). In other words, in the final graph, an edge between sequences A and B indicates that the corresponding proteins are likely to be related^4.

^3 Recently, the procedure was applied to a newer release of SWISSPROT (release 35 with updates up to May 6th 1998, with a total of 72623 proteins). Since this was not one of the main releases, it complicated the assessment procedure (described in the next chapter), especially in comparisons with other classifications. The results of the analysis of both releases are available at the ProtoMap web site. For the mutual correspondence/correlation of both releases see section 9.5.

^4 The self-similarity of short proteins is less significant than the self-similarity of long proteins, since the expectation value increases as the length of the similar region decreases. Therefore, when a distance function between protein sequences is sought, the expectation values may need to be corrected for the query length (W. Pearson, personal communication). Within the scope of this work the expectation values are used as they are, and weights reflect statistical significance rather than distance (intuitively, the term distance can be used instead of expectation value, to indicate that two proteins are either close or far, but practically, no metric is defined).

This graph is constructed using the common algorithms for protein sequence comparison: the Smith-Waterman dynamic programming method (SW) [Smith & Waterman 1981], FASTA [Lipman & Pearson 1985] and BLAST [Altschul et al. 1990]. The SW algorithm was run with the BLOSUM62 matrix [Henikoff & Henikoff 1992] and gap penalties of -12,-2 using either the Bioccelerator hardware [Compugen 1998] or the ssearch program, which is part of W. Pearson's FASTA 2 package. FASTA was run using the fasta program with the BLOSUM50 matrix and gap penalties -12,-2 (the default setting). Both ssearch and fasta calculate expectation values based on an empirically derived distribution of scores [Pearson 1998] (the Bioccelerator applies the same procedure for assessing the significance of results as in ssearch). The BLAST algorithm was also run with the BLOSUM62 matrix using the blastp 1.4.9 program available from the NCBI ftp site^5. The program reports similarity scores along with the probability (pvalue) that the scores could have occurred by chance. Blastp probabilities are transformed to expectation values by the formula

    evalue = \log \frac{1}{1 - pvalue}

(see manual). All these methods are in daily use by biologists for comparing sequences against the databases. Though SW tends to give the best results on average, it is not uncommon that FASTA or BLAST are more informative, especially when combined with different scoring matrices [Pearson 1995]. Therefore we chose to incorporate all three methods into the graph to achieve maximum sensitivity^6.

The following sections contain a detailed description of the procedure of assigning weights to edges. The procedure starts by creating a list of neighbors for each sequence, based on all three methods. In order to place the expectation values for all three methods on comparable scales, a numerical normalization is determined and applied. Then, only statistically significant similarities are maintained in these lists. Finally, the weight of an edge is defined as the minimum expectation value associated with it by any of the three methods, so as to capture the strongest relationship (recall that edge weights represent expectation values, so small values indicate high similarity).
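The final step of this procedure, taking the minimum expectation value over the three methods, might look as follows. The function name and the use of `None` for "no hit reported by this method" are illustrative assumptions; the e-values are assumed to be already normalized and filtered as described in the following sections.

```python
from typing import Optional


def edge_weight(sw: Optional[float],
                fasta: Optional[float],
                blast: Optional[float]) -> Optional[float]:
    """Return the smallest (most significant) e-value among the methods.

    None marks a method that reported no similarity (an assumption of
    this sketch). Returns None when no method reported anything.
    """
    reported = [e for e in (sw, fasta, blast) if e is not None]
    return min(reported) if reported else None
```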

8.2.2 Placing all methods on a common numerical scale

It is relatively easy to compare scores that a particular method assigns to different comparisons. However, how does one compare scores that are assigned by different methods? We performed the following calculation: Pick any protein, carry out an exhaustive comparison against the whole database and consider the highest scores in each of the methods. Now plot these values against one another for two methods at a time. These scores show a remarkably strong linear relation on a log-log scale (Fig. 8.1); therefore, by introducing a (usually small) correction factor, per protein and per method, the three methods get scaled to a single reference line.
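One way to picture the correction is as a least-squares line fitted to the log e-values of two methods for a single query, which is then used to map one method onto the scale of the other. The sketch below is an assumption about how such a fit could be done, not the procedure actually used in this work; `fit_loglog` and `rescale` are hypothetical names.

```python
import math


def fit_loglog(reference_evalues, other_evalues):
    """Fit log10(other) = slope * log10(reference) + intercept.

    Both lists hold e-values of the same neighbors of one query protein,
    as reported by two different methods.
    """
    xs = [math.log10(e) for e in reference_evalues]
    ys = [math.log10(e) for e in other_evalues]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rescale(other_evalue, slope, intercept):
    """Map an e-value of the other method onto the reference scale."""
    return 10.0 ** ((math.log10(other_evalue) - intercept) / slope)
```

With a fitted slope of 1.16, for instance, a value of 1e-58 in the second method would map back to roughly 1e-50 on the reference scale.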

The differences between FASTA and SW are mostly due to the different scoring matrices that are being used, and can be corrected by multiplying the original score by the relative entropy of the two matrices (see section 2.4.3). The differences between SW and BLAST may be due to approximations in estimating the parameters $\lambda$ and $K$. The underlying assumption in calculating these parameters is that the amino acid composition of the query sequence is close to the overall

[5] The new version of BLAST also accounts for gaps and gives a very good approximation to the SW algorithm, while being much faster [Altschul et al. 1997]. However, the old version of BLAST occasionally detects similarities that are missed by SW (e.g. for the Glucagon precursor family, and the H+-transporting ATP synthase [Pearson 1995]). We have not yet tested whether the new version of BLAST preserves this merit.

[6] FASTA is based on a restricted application of the rigorous SW algorithm and is usually viewed as an approximation of SW, whose main advantage is its speed. However, combining FASTA with the BLOSUM50 matrix (whereas SW and BLAST were used with the BLOSUM62 scoring matrix) resulted in better identification of remote homologies. Indeed, many similarities were reported exclusively by one of the three methods, in some cases as many as tens of hits per sequence that were not detected by the other methods. Incorporating several matrices helps to cover a broader range of evolutionary distances, and in the next releases of ProtoMap we intend to enlarge this range.


[Figure 8.1: two scatter plots, "BLAST e-values vs. Smith-Waterman e-values" for (a) 1431_lyces and (b) 1a03_human; x-axis: Smith-Waterman log(e-value), y-axis: BLAST log(e-value), both ranging from -300 to 0; fitted slopes 1.16 in (a) and 1.3 in (b).]

Figure 8.1: Correlation of BLAST e-values and Smith-Waterman e-values. (a) BLAST e-values of neighboring sequences of 1431_lyces (P42651) vs. the SW e-values of the same neighbors. The graph is plotted in log-log scale. Note the strong linear correlation between the scores assigned by the two methods, where the slope is 1.16, i.e. $evalue_{BLAST} = (evalue_{SW})^{1.16}$. (b) BLAST e-values of neighboring sequences of 1a03_human (P04439) vs. the SW e-values of the same neighbors. The slope is 1.3 in this case.

distribution (see chapter 2 section 2.3.3). This assumption often fails, e.g., for low complexity segments. Moreover, these parameters are based on first order statistics of the sequence, the scoring matrix and the database. The corrections that are required to match SW and BLAST may be due to inaccurate approximations of the estimated parameters, or to higher order statistics of the sequence.

8.2.3 Neighbors' lists

It is, of course, very difficult to set a clear dividing line between true homology and chance similarity. An expectation value below $10^{-5}$ indicates that a false match would occur once in 100,000 searches and can be safely considered significant. On the other hand, an expectation value above 10 reflects mostly pure chance similarities. However, the midrange is more difficult to characterize, and homologous proteins can have expectation values around 1. An overly strict threshold will miss important similarities within this "twilight zone", whereas an excessively liberal criterion will create many false connections. The exact threshold for each pairwise comparison method was set to best discriminate between related and unrelated proteins. Our choice is based on the overall distribution of expectation values over the entire protein space, as given by each of the three methods (Fig. 8.2). The distribution shown may be thought of as the average distribution of expectation values for a 'typical' protein sequence as a query. The graphs in Fig. 8.2 naturally suggest a threshold for each method. The distribution drawn on a log-log scale is nearly linear at low expectation values (where pairs of related sequences dominate), but starts to rise rapidly at a certain value. The steep slope at high expectation values indicates a rapid growth in the number of sequences that are unrelated to the query sequence. Although the distribution may differ from one sequence to another, there is not enough data to deduce a reliable threshold for each sequence. Only when the distributions


[Figure 8.2: three log-log plots, "Distribution of e-values" for (a) SW, (b) FASTA and (c) BLAST; x-axis: e-value (1e-10 to 1), y-axis: frequency (1e-05 to 0.001).]

Figure 8.2: Overall distribution of e-values according to the three main algorithms for sequence comparison. (a) SW (b) FASTA (c) BLAST. The distributions are based on the neighbors lists of all protein sequences in the SWISSPROT database and are plotted in a log-log scale. Frequency is the relative number of pairwise similarity scores with e-value that is equal to the value shown on the x-axis. Note that the deviation from a straight line starts earlier in BLAST, around $10^{-3}$, whereas in FASTA and SW it starts only around $10^{-1}$.

are averaged is the derived threshold reliable [7].

In this view, the thresholds for SW, FASTA and BLAST are set at the value where the slope rapidly changes (0.1, 0.1 and $10^{-3}$ respectively). An edge from vertex A to vertex B is maintained only if a significant score is obtained by any of the three methods used to compare the corresponding sequences, namely, if either SW or FASTA yields an expectation value $\le 0.1$ or BLAST's expectation value is $\le 10^{-3}$.

While the self-normalized statistical estimates of FASTA and SW [Pearson 1998] are quite reliable (see also [Brenner et al. 1998]), the statistical estimates of BLAST may be affected by the amino acid composition of the query sequence, and an unusual composition (e.g. low complexity segments within the sequence) may bias the results of a search [Altschul et al. 1994]. Therefore, we also generated the results using BLAST following a filtering of the query sequence, to exclude low complexity segments (filtering was carried out with the SEG program [Wootton & Federhen 1993]). On the other hand, many relations of biological significance can be missed if only sequences that pass the filter are considered. Therefore, this was handled with a more stringent BLAST threshold of $10^{-6}$.

[7] The best threshold (that minimizes the number of false positives and the number of false negatives) may change from one sequence to another, as noted above. However, the derived threshold is the best on average, and can be used as a guideline.


A major difference between BLAST and SW/FASTA is that BLAST does not include gaps in the alignments. BLAST detects similarities based on one or more high-scoring segment pairs (ungapped local alignments). Significance is assessed by applying Poisson or sum statistics [Altschul et al. 1994]. Consequently, since gaps are ignored, BLAST tends to overestimate the statistical significance of fragmented alignments. We counter this behavior of BLAST by the above asymmetry in selecting the thresholds (Fig. 8.2). While this property may help BLAST reveal significant similarities that the other methods miss (e.g. [Pearson 1995]), we have to beware of highly fragmented alignments that cannot be considered biologically meaningful. Therefore, we ignore those BLAST scores that come from a large number of HSPs (high-scoring segment pairs), when the MSP (maximal segment pair) is insignificant [8].

Finally, even if the comparisons between proteins A and B fail to satisfy the previous criteria, the edge from A to B is maintained when all three methods yield an expectation value $\le 1$.

This procedure is designed to screen most of the chance similarities in the neighbors list of each protein sequence. Unfortunately, chance similarities may occasionally pass these criteria. A major goal of the algorithm that is described next is to detect such similarities and eliminate them.
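The edge-retention criteria of this section can be condensed into a small predicate. This is a simplified sketch: the names are hypothetical, `None` stands for "no hit reported" (an assumption), and the SEG filtering and the rule for highly fragmented BLAST alignments are deliberately left out.

```python
from typing import Optional

# Thresholds quoted in the text (expectation values).
SW_THRESHOLD = 0.1
FASTA_THRESHOLD = 0.1
BLAST_THRESHOLD = 1e-3
JOINT_THRESHOLD = 1.0  # used only when all three methods agree


def keep_edge(sw: Optional[float],
              fasta: Optional[float],
              blast: Optional[float]) -> bool:
    """Decide whether the edge A -> B is maintained."""
    # Any single method passing its own threshold is sufficient.
    if sw is not None and sw <= SW_THRESHOLD:
        return True
    if fasta is not None and fasta <= FASTA_THRESHOLD:
        return True
    if blast is not None and blast <= BLAST_THRESHOLD:
        return True
    # Otherwise all three methods must report an e-value of at most 1.
    return all(e is not None and e <= JOINT_THRESHOLD
               for e in (sw, fasta, blast))
```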

8.2.4 Exploring the connectivity

We now turn to explore this graph, in search of clusters of related sequences which hopefully have a characteristic biological function. There are two major obstacles which should be considered: i) Multidomain proteins can create undesired connections among unrelated groups; ii) Overestimates of the statistical significance of similarity scores may bias our decisions; chance similarities become more abundant as significance levels decrease. Therefore, transitivity should be applied restrictively. By analogy, if transitivity is to be viewed as a force that attracts sequences, then it should be countered by some "repulsive force" to keep unrelated clusters apart and prevent collapse of the protein space [9].

Our approach

The clustering procedure starts by eliminating all edges whose significance is below a certain, very high, significance threshold (i.e., whose expectation value is above a very low cutoff). This operation splits the graph into many small components. In biological terms, we split the set of all proteins into numerous small groups of

[8] Specifically, denote the number of HSPs and the MSP score by $N_{HSP}$ and $S_{MSP}$ respectively. The average and the standard deviation of $N_{HSP}$ and $S_{MSP}$ are calculated for high scoring sequences ($\mu_{HSP}$, $\sigma_{HSP}$, $\mu_{MSP}$ and $\sigma_{MSP}$ respectively). Those hits that are based on $N_{HSP}$ which satisfy $N_{HSP} > \mu_{HSP} + \sigma_{HSP}$, with MSP score $S_{MSP} < \mu_{MSP} - \sigma_{MSP}$, and are not significant according to SW and FASTA, are ignored.

MSP MSP MSP

[9] Our analysis does not aim to overcome incorrect gene identification. This is an important problem, and indeed it should be addressed to some extent in the creation and management of sequence databases.


closely related proteins, which correspond to highly conserved subfamilies. To proceed from this basic, highly restrictive classification, we lower the threshold in a stepwise manner, and gradually take into account similarities of lower statistical significance. In so doing, several clusters of a given threshold may merge. The process is closely monitored and a merge is allowed only when strong statistical evidence is found for a true connection among the proteins in the resulting set. A detailed description of these two main steps follows.

Note that the graph is directed, and hence is not necessarily symmetric. Specifically, it may and does happen that there is an edge from protein A to protein B, but none in the reverse direction. Furthermore, even if both edges exist, their weights usually differ. Therefore, in a preliminary step, this graph is transformed into an undirected graph, by replacing the directed edges from A to B (with weight $w_1$) and from B to A (with weight $w_2$) with one undirected edge whose weight is defined as the maximum of $w_1$ and $w_2$. If there is only one directed edge, then it is discarded in the new graph.
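A minimal sketch of this preliminary step, assuming the directed graph is stored as a dictionary keyed by (source, target) pairs with e-values as weights (the representation and the function name are illustrative choices):

```python
def symmetrize(directed):
    """Turn a directed weighted graph into an undirected one.

    directed: dict mapping (a, b) -> e-value of the edge a -> b.
    An undirected edge is kept only when both directions exist; its
    weight is the maximum (least significant) of the two e-values.
    """
    undirected = {}
    for (a, b), w_ab in directed.items():
        w_ba = directed.get((b, a))
        if a == b or w_ba is None:
            continue  # self loop, or one-directional edge: discard
        key = tuple(sorted((a, b)))
        undirected[key] = max(w_ab, w_ba)
    return undirected
```

Taking the maximum is the conservative choice: the undirected edge is only as significant as the weaker of the two directed similarities.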

Basic classification

If all edges whose significance is below a certain threshold are eliminated, the transitive closure of the similarity relation among proteins splits the space of all protein sequences into connected components or clusters. The transitive closure is equivalent to a single-linkage clustering, and the resulting clusters are proper subsets of the whole database wherein every two members are either directly or transitively related. These sets are maximal in this respect and cannot be expanded. Thus they offer a self-organized classification of all protein sequences in the database. We initially set the threshold at the very stringent significance level of $10^{-100}$. Similarities at this threshold correspond to highly conserved regions that involve at least 150 amino acids. Thus, neither chance similarities, nor connections based on a chain of distinct common domains in multi-domain proteins occur at this level. The resulting connected components can be safely expected to correlate with known highly conserved biological subfamilies.
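The basic classification amounts to computing connected components after dropping all edges whose e-value exceeds the threshold. The following sketch uses a union-find structure; the data layout and the function name are assumptions made for illustration.

```python
def basic_classification(proteins, edges, threshold=1e-100):
    """Connected components of the graph restricted to significant edges.

    proteins: iterable of protein identifiers.
    edges: dict mapping (a, b) -> e-value of the undirected edge.
    Only edges with e-value <= threshold are kept.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), evalue in edges.items():
        if evalue <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())
```

Proteins with no sufficiently significant neighbor end up as singleton clusters.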

The clustering algorithm

Our procedure is recursive. That is, given the classification at threshold $T$, the algorithm derives the classification at the next more permissive level, that is $10^{5} \cdot T$. We start from the basic classification at $10^{-100}$.

The algorithm runs in two phases. First it identifies and marks groups ("pools") of clusters that are considered as candidates for merging (see Fig. 8.3). A local test is performed where each candidate cluster is tested with respect to the cluster which "dragged" it to the pool, to check their degree of similarity.

[Figure 8.3: schematic of the two phases of the clustering algorithm; rejected candidates are marked "X" and rejected merges are drawn as dashed lines.]

Figure 8.3: The clustering algorithm. Phase I) Identify pairs of clusters that are considered as candidates for merging. Decisions are made based on the geometric mean of the pairwise scores of the connections between the two clusters. If this mean exceeds a specific threshold then the cluster is accepted as a candidate, and enters a pool of candidates. Otherwise it is rejected (denoted in the figure by "X"). Phase II) Pairwise clustering is applied to identify groups of clusters which are strongly connected. At each step the two closest groups are chosen from the pool and merged provided that the quality of the connection exceeds the threshold (see text). Otherwise they stay apart (denoted by dashed line).

To quantify the similarity of two clusters P and Q, we calculate the geometric mean of all pairwise scores between sequences in P and Q. Unrelated pairs are assigned the default (insignificant) e-value of 1. The geometric mean considers the distribution of all pairwise connections between the two clusters, so that random or unusual connections have little effect. The geometric averaging may equivalently be considered as (arithmetically) averaging the logarithms of the similarity scores. When the geometric mean of the e-values is below $\sqrt{T}$ (more significant), our interpretation is that P and Q are indeed related and that their connection reflects a genuine similarity. This threshold was chosen to obtain a sublinear decrease (since pairwise similarities are more frequent as the confidence level decreases). It gave better results than other schemes and prevented a quick collapse of the whole space into a few huge clusters (as would happen if the single-linkage algorithm were to be applied). The level of confidence in the reliability of the connection clearly decreases as $T$ increases. We define the Quality of the P to Q connection as minus the log of the geometric mean (see footnote 13). This quantity ranges between 0 and 100, and is higher for more significant connections. This term is used later on.
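The quality measure and the candidate test can be sketched as follows. The base-10 logarithm is an assumption (it is consistent with the stated 0 to 100 range for thresholds between 1 and 1e-100), as are the function names and the dictionary layout for the pairwise e-values.

```python
import math


def quality(P, Q, evalues, default=1.0):
    """Minus log10 of the geometric mean of all pairwise e-values.

    evalues: dict mapping frozenset({p, q}) -> e-value. Pairs without
    an entry are unrelated and get the default e-value of 1.
    """
    logs = [math.log10(evalues.get(frozenset((p, q)), default))
            for p in P for q in Q]
    # The arithmetic mean of the logs is the log of the geometric mean.
    return -sum(logs) / len(logs)


def is_candidate(P, Q, evalues, T):
    """True when the geometric mean of the e-values is below sqrt(T)."""
    return quality(P, Q, evalues) > -0.5 * math.log10(T)
```

For example, with two pairs between P and Q, one at 1e-40 and one unrelated, the quality is 20; the pair of clusters passes the test at T = 1e-30 (required quality 15) but not at T = 1e-50 (required quality 25).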

At the end of the first phase we are left with a group of clusters which are candidates for merging. This group is the input for the second phase.

In the second phase we carry out a variant of a pairwise clustering algorithm. This algorithm successively merges only pairs of clusters that pass the above test (and are thus not suspected of representing chance similarities). At each step the two closest clusters are chosen on the basis of the quality of their connection and merged if their similarity (as quantified above, and based on all pairwise similarities between sequences within the newly formed clusters) is more significant than the threshold. The process stops and the final clusters are defined when the similarity of the next closest clusters does not pass the threshold (Fig. 8.3).
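A simplified sketch of the second phase is given below. It greedily merges the closest pair of groups in a pool as long as their quality exceeds the threshold. It makes several assumptions that go beyond the text: quality is computed with base-10 logarithms, the required quality is that of a geometric mean equal to sqrt(T), and all names are hypothetical.

```python
import math
from itertools import combinations


def quality(P, Q, evalues):
    """Minus log10 of the geometric mean of pairwise e-values (default 1)."""
    logs = [math.log10(evalues.get(frozenset((p, q)), 1.0))
            for p in P for q in Q]
    return -sum(logs) / len(logs)


def merge_pool(pool, evalues, T):
    """Greedy pairwise clustering of the candidate clusters in a pool."""
    groups = [set(c) for c in pool]
    required = -0.5 * math.log10(T)  # quality of a geometric mean of sqrt(T)
    while len(groups) > 1:
        # Pick the two closest groups, i.e. the pair of highest quality.
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: quality(groups[ij[0]], groups[ij[1]], evalues))
        if quality(groups[i], groups[j], evalues) <= required:
            break  # the next closest pair fails the test: stop
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups
```

Because the quality of a merged group is recomputed from all pairwise e-values of its members, a single strong link between two otherwise unrelated groups is diluted and does not force a merge.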

Any rejected merge is marked for further biological analysis. We refer to these rejected merges

as possibly related clusters, and comment on them in the next chapter.

This analysis is performed at different thresholds, or confidence levels, to obtain a hierarchical organization. The analysis starts at the $10^{-100}$ threshold. Subsequent runs are carried out at levels $10^{-95}$, $10^{-90}$, $10^{-85}$, $10^{-80}$ ... The process terminates at the threshold of $10^{0} = 1$. Above the threshold of 1 almost all similarities are in fact chance similarities (see previous section) [10].

[10] The expectation values that are assigned to edges are taken from the output of fasta, ssearch and blastp, and are defined in the context of a single search against the database. These statistical estimates reflect, for each score, the number of hits with the same score that could have occurred by chance in a single search against a database of the same size and composition as the SWISSPROT database. Since this procedure was repeated 52,205 times (the number of proteins in the SWISSPROT 33 database), we may encounter a similarity with an expectation value of 1/52205, simply by chance, at least once. Therefore, in the context of our work, one may suggest correcting the expectation values by multiplying them by 52,205 (e.g. an expectation value of 1e-5 transforms to an expectation value of 0.5, which is not very significant). However, since hits were already screened to exclude most chance similarities (see section 8.2.3) and thresholds were defined based on the original e-values, and in order to fit the output of search programs such as fasta, ssearch and blastp, the original e-values were kept.

Chapter 9

ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space.

The methods of the previous chapter were applied to cluster all the protein sequences in the SWISSPROT database. The resulting classification splits the protein space into well defined groups of proteins, which are closely correlated with natural biological families and superfamilies. Nearly all of the clusters we found seem biologically meaningful, some corresponding to well known families, and many others representing less studied families. Some clusters consist exclusively of unknown proteins or hypothetical proteins. Our analysis concerns complete proteins, and it is not limited only to those subsequences which are identified as functionally or structurally important motifs and domains. Indeed, not all the emerging clusters are correlated with a specific domain, although some clusters are characterized by a domain that is common to many or all member proteins.

Needless to say, this overwhelming body of information cannot be properly surveyed here. An interactive web site that contains the results of our analysis has been constructed (http://www.protomap.cs.huji.ac.il), where users can get acquainted with this "map" of the protein space. The following sections give the results of the overall assessment of our classification in comparison with several well known domain-based databases, and offer a glimpse of our classification.

9.1 General information on clusters

Table 9.1 shows the distribution of cluster sizes at various confidence levels. At each level, the set of all proteins splits into clusters, which merge to form larger and coarser clusters as the confidence level decreases. In particular, the number of isolated proteins (clusters of size 1) diminishes as well. At the lowest significance level ($10^{0} = 1$) there are 10,602 clusters, of which 4435 contain at least 2 members. 1006 clusters have size 10 and above.

Confidence                        Cluster size                          Total number
level         over 100  51-100  21-50  11-20  6-10   2-5      1        of clusters
$10^{-100}$       8       18      90    234    528   3727   29870         34475
$10^{-95}$        8       19     100    240    537   3806   29086         33796
$10^{-90}$        8       20     111    256    545   3871   28224         33035
$10^{-85}$        8       23     119    262    563   4004   27189         32168
$10^{-80}$        8       25     133    264    594   4071   26140         31235
$10^{-75}$        9       31     132    275    623   4131   25051         30252
$10^{-70}$       10       34     138    293    653   4136   23943         29207
$10^{-65}$       11       32     156    309    660   4180   22911         28259
$10^{-60}$       13       34     171    319    677   4170   21772         27156
$10^{-55}$       15       40     178    334    676   4194   20646         26083
$10^{-50}$       15       51     184    350    676   4188   19463         24927
$10^{-45}$       17       53     197    362    696   4181   18282         23788
$10^{-40}$       21       54     203    383    714   4109   17129         22613
$10^{-35}$       23       53     213    393    760   4101   15801         21344
$10^{-30}$       26       53     232    415    774   4014   14428         19942
$10^{-25}$       29       57     252    421    788   3897   13191         18635
$10^{-20}$       32       64     263    436    779   3775   11839         17188
$10^{-15}$       35       64     270    464    808   3645   10620         15906
$10^{-10}$       38       76     293    457    802   3231    9112         14009
$10^{-5}$        51       92     315    431    684   2655    7169         11397
$10^{0}$         51       94     315    456    703   2816    6167         10602

Table 9.1: Distribution of clusters by their size at each confidence level.

The number of clusters (of size bigger than 1) at each level of confidence ranges between 4228 and 5543. Naturally, one may ask if these numbers are in agreement with the expected number of clusters in the "ultimate" classification of all proteins. However, this number is unknown and we can only provide an estimate. A lower bound for this number is provided by the number of different folds, since we expect members of the same cluster to have similar folds. The current estimates place the total number of folds (known and unknown) between several hundred and a few thousand [Chothia 1992, Gonnet et al. 1992, Green et al. 1993, Wang 1996]. However, the same fold is usually adopted by a few different superfamilies which share little or no significant sequence similarity. Therefore, a sequence based classification would probably place these superfamilies in different (though possibly related) clusters. Moreover, superfamilies may consist of several families, sometimes with only a few percent sequence identity. Consequently, these families may be classified into different clusters, depending on the sensitivity of the method. On average, each fold is adopted by two to three protein families [Brenner et al. 1997]. Thus the total number of clusters is expected to exceed the number of folds, but it is likely to be less than an order of magnitude bigger. A number of clusters in the 4000-5000 range seems consistent with this estimation.

Most of these clusters are biologically meaningful. Many of the smaller clusters suggest a definition of new biological families, and often reveal a characteristic feature. Table 9.2 shows only the 50 largest clusters at the lowest confidence level ($10^{0}$). The description attached in Table 9.2 to each cluster is based mainly on SWISSPROT annotations [We do not have a fully automatic method for annotating all the clusters. A proper biological interpretation requires a substantial degree of biological sophistication and insight. However, a simple census of proteins based on SWISSPROT definition usually gives a good indication of the cluster's nature]. The table should be viewed only as a sample. For further information, the reader is referred to the ProtoMap web site. Henceforth, unless otherwise stated, cluster numbers refer to significance level $10^{0}$.

Cluster   Size   Order of        Family
number           transitivity
   1      718        2           protein kinases
   2      593        2           globins
   3      514        2           G-protein coupled receptors
   4      330        2           immunoglobulin V region
   5      326        2           immunoglobulins and major histocompatibility complex
   6      318        2           homeobox
   7      315        2           ribulose bisphosphate carboxylase large chain
   8      284        2           ABC transporters
   9      260        1           zinc-finger C2H2 type
  10      256        2           calcium-binding proteins
  11      252        2           serine proteases, trypsin family
  12      229        2           GTP-binding proteins - ras/ras-like family
  13      221        2           myosin heavy chain, tropomyosin, kinesins
  14      208        3           collagens, structural proteins
  15      206        2           cytochrome p450
  16      198        2           GTP-binding elongation factors
  17      196        2           tubulins
  18      190        1           cytochrome b/b6
  19      187        2           ATP synthases
  20      172        2           heat shock proteins
  21      171        2           alcohol dehydrogenases (short-chain)
  22      171        2           snake toxins
  23      152        2           NADH-ubiquinone oxidoreductase
  24      142        2           bacterial regulatory components of signal transduction
  25      141        3           DNA-binding proteins of HMG
  26      140        1           nuclear hormone receptors
  27      139        1           actins
  28      139        1           intermediate filaments
  29      138        2           GTP-binding, ADP-ribosylation factors family
  30      136        1           neurotransmitter-gated ion-channels
  31      133        2           zinc-containing alcohol dehydrogenases
  32      133        2           cellular receptors, EGF-family
  33      130        3           amylases
  34      130        1           hemagglutinin
  35      129        2           RNA-directed DNA polymerase
  36      125        1           chaperones, chaperonins
  37      122        2           phospholipase A2
  38      120        2           insulins
  39      115        1           cytochrome c
  40      115        3           ketoacyl synthase
  41      114        2           growth hormones (somatotropin, prolactin and related hormones)
  42      113        1           glyceraldehyde 3-phosphate dehydrogenase
  43      113        3           nuclear proteins, hn-RNP and sn-RNP, RNA processing proteins
  44      110        1           viral nucleoprotein
  45      109        1           cytochrome c oxidase subunit II
  46      108        3           kazal serine protease inhibitors, secreted SPARC proteins
  47      102        3           2Fe-2S ferredoxins, flavohemoproteins
  48      102        2           viral genome polyproteins
  49      102        1           developmental regulators - WNT family
  50      101        1           cation transport ATPases

Table 9.2: Largest clusters at the lowest confidence level ($10^{0}$). (i) Clusters are ordered in decreasing order of size. (ii) The order of transitivity within each cluster is defined as follows: select the protein with the maximum number of neighbors and define it as the cluster's seed. The seed's order of transitivity is 0. Its neighbors are of order 1. Additional proteins that are neighbors of 1st order proteins are of order 2, etc. (iii) The family description states the feature common to most of the member proteins.
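The order of transitivity defined in note (ii) is a breadth-first distance from the cluster's seed. The sketch below assumes the cluster is given as an adjacency dictionary, and it takes the value reported per cluster to be the largest order found in it, which is an interpretation rather than something the caption states; the function name is hypothetical.

```python
from collections import deque


def transitivity_orders(neighbors):
    """Breadth-first orders of transitivity within one cluster.

    neighbors: dict mapping each protein to the set of its neighbors
    inside the cluster. The seed is the protein with the most neighbors.
    Returns (seed, dict of protein -> order).
    """
    seed = max(neighbors, key=lambda p: len(neighbors[p]))
    order = {seed: 0}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for other in neighbors[current]:
            if other not in order:
                order[other] = order[current] + 1
                queue.append(other)
    return seed, order
```

A cluster whose seed is directly connected to every other member would thus have order 1, as for the zinc-finger C2H2 cluster in the table.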

9.2 Performance evaluation

It is very hard to evaluate the validity of classifications that emerge from a large-scale study of protein sequences, in that no generally accepted standards have been set yet in this field. Thus new classifications are traditionally compared with what is considered a state of the art characterization of protein sequences, namely, the PROSITE dictionary of signature patterns, motifs and domains [Bairoch et al. 1997]. For domain based studies, a comparison with the manually derived PROSITE dictionary is inevitable, and essential for testing the biological significance of the results. However, when the analysis extends beyond regions which are known or suspected to be domains, no standard benchmark exists to assess the quality of the results. One may then resort to comparisons against domain-based databases. Obviously, this may bias the assessment, a fact that should be kept in mind when evaluating the results.

To estimate the quality of our classification, we compared it with two well established domain-based databases: PROSITE and Pfam.

9.2.1 The evaluation methodology

Given a reference classification A and a new classification B of the same set X, we evaluate the quality of the classification B in terms of the reference classification A, as reflected by their mutual agreement [1]. We consider four such indices of quality.

Gracy and Argos [Gracy & Argos 1998] have proposed a procedure for such a performance evaluation: Each class $a \in A$ is associated with the group $b \in B$ which maximizes the quantity $tp - fp - fn$. Here the number of true positives ($tp$) is given by $|a \cap b| - 1$, that of false positives ($fp$) is given by $|b \setminus a|$ and that of false negatives ($fn$) is given by $|a \setminus b|$. Quality is defined by the percentage of the true positives, $100 \cdot \frac{tp}{tp + fp + fn}$. In the same way, the percentages of false positives and false negatives are calculated. We call this index $Q_{single}$.
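As a concrete reading of this definition, the following sketch computes the index for one reference class, with groups represented as Python sets. The function name is hypothetical, and returning 0 when tp + fp + fn is zero is an assumption of this sketch, since the text does not cover that corner case.

```python
def q_single(a, B):
    """Q_single for one reference class a against the groups in B.

    a: set of proteins in the reference class.
    B: list of sets (the groups of the new classification).
    """
    def counts(b):
        tp = len(a & b) - 1   # true positives, as defined in the text
        fp = len(b - a)       # false positives
        fn = len(a - b)       # false negatives
        return tp, fp, fn

    # Associate a with the group that maximizes tp - fp - fn.
    tp, fp, fn = max((counts(b) for b in B),
                     key=lambda c: c[0] - c[1] - c[2])
    total = tp + fp + fn
    return 100.0 * tp / total if total else 0.0
```

For a = {1, 2, 3, 4} and groups {1, 2, 3} and {4, 5}, the first group is selected (tp = 2, fp = 0, fn = 1), giving a quality of about 66.7.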

What if B is a further refinement of classification A (i.e., each group in A splits perfectly into several groups in B)? The $Q_{single}$ parameter will be very small, since for each group in A only one group in B will be counted. In order to counter this, we introduce the following modification. Now each group $a \in A$ is associated with all those groups b from B for which $tp > fp$. Specifically, we say that a group $b \in B$ is a relative of a group $a \in A$ if more than 50% of b's members are also members of a (see Fig. 9.1). For each group $a \in A$ we identify all its relative groups b in B. The union of all the relatives of a is denoted by $b_a$. A protein is misclassified by classification B if it is a member of a missed by $b_a$ (false negative), or is a member of $b_a$ but not a member of a itself (false positive). The intersection of $b_a$ and a defines the group of correctly classified proteins.

[Figure 9.1: schematic of a reference group a (ellipse) overlapping groups B1-B6 of classification B, with the regions of true positives, false negatives and false positives marked.]

Figure 9.1: Association of groups in classification B with groups in the reference classification A. Groups B1-B4 are relatives of the red ellipse (group a of A), while groups B5 and B6 are not, as the overlap is too small. The set $b_a$ is defined as $b_a = B1 \cup B2 \cup B3 \cup B4$, and the quality $Q_{set}$ is given by $Q_{set}^{(a)} = 100 \cdot \frac{|a \cap b_a|}{|a \cup b_a|}$. For comparison, $Q_{single}$ is given by $Q_{single}^{(a)} = 100 \cdot \frac{|a \cap B2|}{|a \cup B2|}$.

We define the quality $Q_{set}$ of the classification for the group a by the percentage of the true positives in the union $a \cup b_a$. Namely,

$$Q_{set}^{(a)} = 100 \cdot \frac{|a \cap b_a|}{|a \cup b_a|}$$

[1] The two classifications can be either "hard", i.e. each protein is classified to exactly one group, or "soft", where each protein can be classified to more than one group. In our case, Pfam and PROSITE are soft, while (the current version of) ProtoMap is a hard classification.


which accounts for both false positive and false negative errors. This procedure is repeated for every group $a \in A$, and the total percentage of true positives is given by the average over all groups $a \in A$.

As observed above, $Q_{set}$ gives more favorable evaluations when B is a refinement of the partition A. This, however, may lead to another problem, since dividing each group $a \in A$ into many small clusters (and in the extreme, into singleton clusters) is not desirable. Therefore, we also define another quality index that accounts for the number of clusters which are relatives of $a$, and penalizes an excess in this number. This is done by first subtracting the number of relatives of $a$ from the number of true positives, then calculating the percentage out of the total number of proteins in the union $a \cup b_a$, i.e.

$$Q^{(a)}_{set-relatives} = 100 \cdot \frac{|a \cap b_a| - (\text{number of relatives of } a) + 1}{|a \cup b_a|}$$

This way, a class with N elements which has a single identical relative cluster in classification B has a quality of 100%. On the other hand, if the relatives are N singletons in classification B, the corresponding quality is close to zero.
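The two extreme cases mentioned above can be checked numerically. This is a small sketch under the assumption that the index is computed as 100 * (|a ∩ b_a| - r + 1) / |a ∪ b_a|, with r the number of relatives of a; the function name and the integer identifiers are illustrative only.

```python
from typing import List, Set


def q_set_relatives(a: Set[int], relatives: List[Set[int]]) -> float:
    """100 * (|a & b_a| - number_of_relatives + 1) / |a | b_a|."""
    b_a = set().union(*relatives)
    true_positives = len(a & b_a)
    return 100.0 * (true_positives - len(relatives) + 1) / len(a | b_a)


n = 50
a = set(range(n))

# One relative cluster identical to a: no penalty is applied.
print(q_set_relatives(a, [set(a)]))           # 100.0

# n singleton relatives: (n - n + 1) / n, i.e. 100/n percent.
print(q_set_relatives(a, [{i} for i in a]))   # 2.0
```

The penalty grows linearly with the number of relatives, so a family split into a handful of large subfamily clusters loses little, while a family shattered into singletons scores 100/N percent.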

The fourth index of quality is given by the quantity tp/(tp + fn), which is the fraction of the reference family $a$ that is in the set $b_a$. This measure does not take into account the false positives; the reference classifications may not have detected all homologies, and therefore not all seemingly false positives are indeed such (see section 9.2.4). This index is based on the (overly generous) assumption that all false positives are in fact potentially related sequences. Obviously, this is not true, and we should be extra cautious in using it in automatic assignments of protein sequences. This estimate still provides a useful upper bound on the quality of the classification, since typically some of the false positives are indeed related sequences. We denote this quantity by $Q_{upper bound}$.
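To make the role of the three counts concrete, the sketch below computes tp, fp and fn for one reference family and then the upper-bound index. It assumes set-valued groups and reports the ratio tp/(tp + fn) as a percentage, to match the other indices; the names `confusion_counts` and `q_upper_bound` are hypothetical.

```python
from typing import Set, Tuple


def confusion_counts(a: Set[str], b_a: Set[str]) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for reference family a and its cover b_a."""
    tp = len(a & b_a)   # correctly classified members of a
    fp = len(b_a - a)   # members of b_a that are not in a
    fn = len(a - b_a)   # members of a missed by b_a
    return tp, fp, fn


def q_upper_bound(a: Set[str], b_a: Set[str]) -> float:
    """tp / (tp + fn) as a percentage; false positives are ignored."""
    tp, _fp, fn = confusion_counts(a, b_a)
    return 100.0 * tp / (tp + fn)


a = {"p1", "p2", "p3", "p4"}
b_a = {"p1", "p2", "p3", "x1", "x2"}

print(confusion_counts(a, b_a))   # (3, 2, 1)
print(q_upper_bound(a, b_a))      # 75.0
```

Adding more false positives to `b_a` leaves this value unchanged, which is why it can only serve as an upper bound on the quality.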

The consistency test proposed by Krause and Vingron [Krause & Vingron 1998] can be used for rough self-validation, but it is less useful for assessing the quality of a new classification with respect to a reference classification (see section 3.2.6).

9.2.2 The reference databases

We compared our classification with two domain-based classifications: PROSITE and Pfam. Our classification contains 52,205 proteins of the SWISSPROT database, release 33, classified to 10,602 clusters, of which 1006 have size 10 and above (see table 9.1). The PROSITE database, release 13 (released with the SWISSPROT 33 database), contains 24156 proteins, characterized by 1151 different signature patterns. PROSITE often associates the same family with two or more signature patterns. Therefore, sequences with different PROSITE patterns documented as the same PROSITE family are considered to belong to the same family. For example, all proteins that have either the ACTINS_1, the ACTINS_2 or the ACTINS_ACT_LIKE signature belong to the actins family. The exceptions are those patterns which never appear together in the same protein though they are documented the same (e.g. ANTENNA_COMP_ALPHA and ANTENNA_COMP_BETA). Patterns that are always associated with other patterns (e.g. INTEGRIN_BETA with EGF, POU_1 and POU_2 with HOMEOBOX) are ignored. Overall, within these terms, the 1151 signatures characterize 874 protein families and domains, of which 600 are of size 10 and above. The Pfam database (release 1.0, associated with SWISSPROT 33) contains 15604 proteins, classified to 175 families, of which 172 are of size 10 and above^2.

9.2.3 Evaluation results

The results of the evaluation procedure for these reference databases are given in table 9.3. The evaluation is based on all families with at least 10 members (the same analysis with all families with at least 5 members gave the same results up to within 1.5%).

Reference   Q_single                                        Q_set                                           Q_set-relatives   Q_upper bound
database    True positives  False positives  False negatives  True positives  False positives  False negatives                    True positives
PROSITE     68.4%           9.5%             22.1%            77.8%           11.1%            11.1%            75%               88.5%
Pfam        64.8%           7.3%             27.9%            76.7%           6.9%             16.4%            75%               83.1%

Table 9.3: Performance evaluation
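Since true positives, false positives and false negatives partition the union $a \cup b_a$, the three percentages reported for each index should add up to 100. The short check below verifies this for the values in table 9.3; it is only a sanity check of the reported numbers, written under the assumption that all three percentages in a block were averaged over the same set of families.

```python
# (true positives, false positives, false negatives), in percent, from table 9.3
table_9_3 = {
    ("PROSITE", "Q_single"): (68.4, 9.5, 22.1),
    ("PROSITE", "Q_set"): (77.8, 11.1, 11.1),
    ("Pfam", "Q_single"): (64.8, 7.3, 27.9),
    ("Pfam", "Q_set"): (76.7, 6.9, 16.4),
}

# Round to one decimal to absorb floating-point noise in the sums.
totals = {key: round(sum(values), 1) for key, values in table_9_3.items()}

for (database, index), total in totals.items():
    print(f"{database:8s} {index:9s} total = {total}")
```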

Gracy and Argos [Gracy & Argos 1998] used a similar procedure (using the $Q_{single}$ parameter) with respect to a reference classification that was a combination of PROSITE and PIR [George et al. 1996]. The assessment resulted in 96.6% true positives (1.8% false positives) for PROSITE, 93.2% true positives for DOMO (0.3% false positives) and 65.1% true positives for ProDom (0.9% false positives). All three are domain-based classifications. The high percentage of true positives in PROSITE with respect to this reference classification indicates that the combined database is not much different from the PROSITE database. The first three columns in table 9.3 assess the performance of ProtoMap in the same way as [Gracy & Argos 1998] with respect to PROSITE. With 68.4% true positives, our work compares favorably with ProDom, though we find many more false positives. However, as noted in the next section, not all these false positives should be counted as such.

Since ours is a hard classification, we find the second measure $Q_{set}$ more appropriate than $Q_{single}$. That is, a domain family is not associated with one single cluster, but with the set of all its relative clusters. Consider for example PROSITE's AA_TRNA_LIGASE_I family. Seven different clusters form the cover of this family, corresponding to the subfamilies: leucyl/isoleucyl/valyl/methionyl-tRNA synthetase (cluster 152), glutamyl/prolyl-tRNA synthetase (cluster 429), tryptophanyl-tRNA synthetase (cluster 758), arginyl-tRNA synthetase (cluster 988), tyrosyl-tRNA synthetase (cluster 904), cysteinyl-tRNA synthetase (cluster 1216), and a singleton (cluster 6647) arginyl-tRNA synthetase (a very short fragment).

Though most PROSITE families have only one relative in our classification (see table 9.4), many families are associated with more than one cluster, where different clusters may correspond to different subfamilies (belonging to the same family is still detectable through connections between clusters, as discussed in section 9.3.3 on possibly related clusters). Consequently, with the $Q_{set}$ index, the quality of performance reaches 77.8%.

Number of relative clusters in ProtoMap   0     1     2     3     4     5     6-10   >10   Total
Number of PROSITE families                73    525   131   62    24    18    30     11    874
Number of Pfam families                   15    76    33    18    10    1     15     7     175

Table 9.4: Distribution of the number of relative clusters.

^2 The latest releases of PROSITE and Pfam are associated with the SWISSPROT 35 database. This database includes 69113 sequences, 43053 (62%) of which are included in the 1390 families of the Pfam database release 3.3, and 35340 (51%) of which are included in the PROSITE 14 database.

9.2.4 Critique of the evaluation methodology

The above procedure may result in an over-strict measure. False positives may be overcounted, since often supposedly false positives with respect to the reference database are actually true positives. For example, short fragments which surely belong to a specific family may be considered false positives simply because they are too short to completely include the domain which is common to all other members in the family. Similarly, hypothetical proteins with a significant sequence similarity to a family are not necessarily false positives. Even proteins which are documented as members of a family may be counted as false positives simply because they do not have the exact family signature pattern, but rather a slightly modified one. For example, cluster 27 has 139 proteins, of which 132 have the actin and actin-like signature. Five of the other 7 proteins are documented as actins (and indeed show a remarkable similarity with other actins) and two are hypothetical proteins (again, with a strong similarity to actins). However, these 7 proteins do not have the actin and actin-like PROSITE signature and therefore are counted, unjustifiedly perhaps, as false positives. Similarly, cluster 7 has 46 proteins which do not have the rubisco large PROSITE signature, and are thus counted as false positives. They are all, however, annotated in SWISSPROT as ribulose bisphosphate carboxylase large chain (some are fragments). Similarly, cluster 15 has 13 proteins which do not have the cytochrome P450 signature: 10 of these are variants of cytochrome P450, two proteins are thromboxane-A synthase, and one is trans-cinnamate 4-monooxygenase. All these are documented to be part of the cytochrome P450 family, and indeed show a strong similarity with cytochromes P450.

Such "false positives" are very common in our clusters, but should they count as false positives? Obviously, sometimes they should, but we have no automatic way to discern those that are indeed biologically meaningful. At present we can only say with confidence that some false positives should not count as such. Consequently, the true value of performance quality lies somewhere between $Q_{set}$ = 77.8% and $Q_{upper bound}$ = 88.5% of true positives.

Another factor limits the agreement: some families have subfamilies, the members of which share a well-defined domain which other members in the family do not have. In the evaluation procedure these other members will be considered as false positives, and if they are a majority, then no cluster will be matched with the subfamily. A case in point is the ran family, a subfamily of the small G-proteins. The ran proteins are classified to the same cluster as the ras/ras-like/rab proteins (cluster 12). However, since the ras/ras-like/rab proteins are not characterized by the same signature pattern (nor by any other signature pattern in release 13.0 of PROSITE), they count as false positives. The functional relationship of the proteins in this cluster points to the problem of assessing a new classification by means of another, human-made, classification.

Families for which our analysis performed worst, with less than 50% true positives, are dominated by short/local domains (e.g. PH domain, EGF, ER_TARGET, C1Q, KRINGLE, C2 domain, SH2, SH3) or domains that are paired with other, more abundant domains (e.g. opsin paired with G-protein-receptor). This is to be expected, since our analysis is not domain-based.

The results of the evaluation procedure with Pfam as the reference database lead to similar conclusions: good performance for protein families, but short motifs are not detected well. The quality of the classification increased as the Pfam coverage (the total portion of the sequences which was included in the multiple alignment used to define the domain or the family in the Pfam database) increased: for coverage > 0.3 (134 families) the quality rose to $Q_{single}$ = 76.9% (6.1% false positives), $Q_{set}$ = 84.9% (6.2% false positives) and $Q_{upper bound}$ = 90.9%; for coverage > 0.5 (109 families) the quality increased to $Q_{single}$ = 80.6% (5% false positives), $Q_{set}$ = 88.3% (5% false positives) and $Q_{upper bound}$ = 93.3%.

9.2.5 New clusters

The above evaluation procedure is oblivious to the many new clusters in our classification that have no counterpart in PROSITE or in Pfam. Our definition of a counterpart is very strict: a cluster has no counterpart family in the reference database if NONE of its members is associated with ANY reference family. Of the 1006 clusters with 10 members or more (a total of 33682 proteins), 308 clusters (6989 proteins, which comprise 20.8%) have no counterpart family in PROSITE. 734 clusters (15586 proteins, 46.3%) do not have a counterpart in Pfam A. 281 clusters are missing from both.
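The strict counterpart criterion amounts to a disjointness test between each cluster and the set of all annotated proteins. Below is a minimal sketch, assuming clusters are given as a dictionary from cluster number to a set of protein identifiers and that `annotated` holds every protein associated with at least one reference family; both data structures and the function name are illustrative.

```python
from typing import Dict, List, Set


def clusters_without_counterpart(
    clusters: Dict[int, Set[str]],
    annotated: Set[str],
) -> List[int]:
    """Return the ids of clusters in which NONE of the members is
    associated with ANY reference family."""
    return [
        cluster_id
        for cluster_id, members in clusters.items()
        if members.isdisjoint(annotated)
    ]


# Toy data (hypothetical identifiers).
clusters = {1: {"a", "b"}, 2: {"c", "d"}, 3: {"e"}}
annotated = {"a", "e"}

print(clusters_without_counterpart(clusters, annotated))   # [2]
```

A single annotated member is enough to give a cluster a counterpart, so clusters reported by such a test are entirely uncharacterized in the reference database.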

The 20 largest clusters left unannotated by both PROSITE and Pfam are listed in table 9.5. They are documented based on their SWISSPROT definition. The purity of these clusters (in terms of definition consensus) is very high. Still, proteins in these clusters are not characterized in PROSITE 13 nor in Pfam 1.0. It should be noted that both databases have been extended since then (as has ProtoMap). Yet, a large portion of the updated SWISSPROT database is not in the latest releases of these databases (see footnote 2).


Cluster no.   Size   Family
23            152    NADH-Ubiquinone reductase (chains 2,4,5)
34            130    Hemagglutinin (virus)
44            110    Nucleoprotein (virus)
60            92     Phycocyanin (algae)
67            86     Histone H1
79            77     Envelope protein (virus)
85            74     Chlorophyll A-B binding protein (plants)
102           69     NADH-Ubiquinone oxidoreductase (chain 6)
110           64     60S ribosomal protein (P0,P1,P2)
114           61     Envelope protein (virus)
120           58     E6 (viral protein)
129           56     Neurotoxins (insect)
132           55     NADH-Ubiquinone oxidoreductase (chain 3)
133           55     RNA polymerase B
135           54     Probable L1 protein (virus)
137           53     E1, helicase (virus)
138           53     E2, transactivator (virus)
139           53     Probable L2 protein (virus)
142           52     TAT protein, transactivator (virus)
145           51     E7, transforming (virus)

Table 9.5: Largest clusters with no corresponding family in PROSITE 13 or Pfam 1.0.

9.3 ProtoMap as a tool for analysis

Aside from the direct use of ProtoMap as an automatic classification of protein sequences in the SWISSPROT database, ProtoMap offers additional information, which is available interactively on the website.

9.3.1 Tracing the formation of clusters

A major aspect of the hierarchical organization is that separate clusters at a given threshold may merge at a more permissive threshold. This reflects the existence of subfamilies within a family, or families within a superfamily.

By moving from one level to a more restrictive one, we obtain a subdivision of clusters into smaller subsets. These subsets suggest a natural division of the corresponding family, as illustrated in the following example for the transport system permease proteins.

Cluster 170 at level $10^{0}$, with 46 proteins, consists of transport system permease proteins. These proteins participate in multicomponent transport systems in bacteria. Specifically, they are the integral inner-membrane proteins which translocate the substrate across the membrane.

[Figure 9.2 diagram: four circles labeled Lactose/Maltose transport (LACF/MALF), Lactose/Maltose transport (LACG/MALG), Phosphate transport and Sulfate/Molybdenum/Spermidine/Putrescine transport, connected by edges.]

Figure 9.2: Tracing the formation of cluster 170 (at level $10^{0}$) - the transport system permease proteins. As we move on to level $10^{-5}$ and further to level $10^{-10}$, the cluster splits into several subclusters. Each circle stands for a cluster at threshold $10^{-10}$. Circles' radii are proportional to the clusters' sizes (numbers indicate clusters' sizes). The drawn edges appear upon changing the threshold from $10^{-10}$ to the more permissive $10^{-5}$. Edge widths are proportional to the number of connections between the corresponding clusters.

The cluster decomposes into four subclusters at level $10^{-10}$, which form a clique (Fig. 9.2). These smaller subclusters correspond to the lactose/maltose transport system lacG/malG, the lactose/maltose transport system lacF/malF, the phosphate transport system, and other transport systems (of sulfate, molybdenum, spermidine and putrescine). The subgroups of lacG/malG and lacF/malF form already at level $10^{-25}$. Some proteins that combine features from the F and G subtypes are denoted in SWISSPROT as malGF proteins. However, based on this subclassification and the fact that the malG group and the malF group form at such high levels of significance, these proteins may be classified either to malG or to malF.
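Tracing a cluster across levels can be sketched as intersecting it with the clusters of a more restrictive level. The example below is a simplified illustration, assuming that the hierarchy is nested (every cluster at the stricter level lies inside one cluster of the more permissive level); the function name and the protein identifiers are hypothetical, and only the subfamily names echo the permease example.

```python
from typing import List, Set


def subdivide(parent: Set[str], finer_clusters: List[Set[str]]) -> List[Set[str]]:
    """Return the non-empty parts of `parent` induced by the clusters
    found at a more restrictive level."""
    parts = [parent & cluster for cluster in finer_clusters]
    return [part for part in parts if part]


# A cluster at a permissive level (toy identifiers).
parent = {"lacG_1", "lacG_2", "lacF_1", "lacF_2", "pst_1", "sulf_1"}

# Clusters at a stricter level; the last one is unrelated to `parent`.
finer = [
    {"lacG_1", "lacG_2"},
    {"lacF_1", "lacF_2"},
    {"pst_1"},
    {"sulf_1"},
    {"unrelated_1", "unrelated_2"},
]

subclusters = subdivide(parent, finer)
print(len(subclusters))                      # 4
print(set().union(*subclusters) == parent)   # True: the parts cover the parent
```

Applying the same function repeatedly at successively stricter levels yields the kind of tree of subfamilies that the text describes.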

9.3.2 Hierarchical organization within protein families and superfamilies

The hierarchical organization also suggests a classification within known families. This classification is suggested by scanning the hierarchy over all levels, as illustrated for the small G-protein/Ras superfamily (Fig. 9.3).

The ras gene is a member of a family of genes that have been found in tumor virus genomes and are responsible for the viruses' carcinogenic effect. In most cases this viral oncogene is closely related to a cellular counterpart, called a proto-oncogene. Infection by a retrovirus that carries a mutant form of the ras gene (ras oncogene), or mutations of the gene, can cause cell transformation. Indeed, mutations in the ras gene are linked to many human cancers.

The cellular ras protein binds guanine nucleotide and exhibits GTPase activity. It participates in the regulation of cellular metabolism, survival and differentiation. In the last decade many additional proteins that are related to ras were discovered, all of which share the guanine nucleotide binding site. They are referred to as the small G-protein superfamily [Nuoffer & Balch 1994]. This family of proteins has several subfamilies: ras, rab, ran, rho, ral, and smaller subfamilies. Like ras, these proteins participate in regulatory processes, such as vesicle trafficking (rab) and cytoskeleton organization (rho).

Fig. 9.3 depicts the relations within this family, based on our hierarchical organization. A total of 229 proteins, all from the small G-protein superfamily, are presented. All were clustered into cluster 12 at the lowest level of significance, $10^{0}$. Small clusters, which correspond to subfamilies, are formed at higher confidence levels, and fuse into larger clusters when the threshold is lowered. The four main branches coincide with (I) rab subtypes, (II) ras, ral and rap, (III) rac, rho and cdc (cell division control proteins), and (IV) ran. Interestingly, the linkage of rac to cdc and rho seems stronger than that between ran and rho or rab and rho. This proposed subdivision suggests a common root for all the subtypes, but splits them in a way that resembles the evolutionary tree of the small G-protein superfamily [Downward 1990].


[Figure 9.3 diagram: tree of cluster 12 (229 proteins) with branches labeled rab1, rab2, rab3, rab4, rab5, rab7, rab8, rab10, rab11, ara, ran, rac, cdc, rho, ral, rap and p21 ras proteins; related clusters shown are cluster 29 (138 proteins: ARF(1-6), sar1, G-alpha GTP-binding proteins), cluster 646 (14 proteins: thiophene and furan oxidation proteins, GTP-binding protein ERA), cluster 461 (20 proteins: GTP-binding proteins OBG) and cluster 1400 (7 proteins: hypothetical).]

Figure 9.3: The small G-protein family. This family is composed of several subfamilies. A total of 229 proteins, combined in cluster 12 (level $10^{0}$), were grouped together into isolated sets at different levels of confidence, to form a natural subclassification of the family. This hierarchical organization is much enriched by combining possibly related clusters (see text). The related clusters are connected by dashed lines.

As we proceed to include weaker similarities, we identify other families which are related to the small G-protein family. According to our map, the clusters which are detected as related include cluster 29 (138 proteins), cluster 646 (14 proteins), cluster 461 (20 proteins) and cluster 1400 (7 proteins). All these clusters are possibly related clusters (rejected merges) of cluster 12 (see the next section for a discussion of possibly related clusters). Cluster 29 consists of the ADP-ribosylation factors family (ARF), which are involved in vesicle budding, and of guanine nucleotide-binding proteins from the sar subfamily, whose members participate in a different type of vesicle budding. Most interestingly, the G-alpha proteins of heterotrimeric G proteins belong to the same cluster. The connection between this cluster and cluster 12 is based on the similarity of the ARFs and the rab subfamily, as shown in Fig. 9.3. Cluster 646 consists of the GTP-binding protein ERA and of thiophene/furan oxidation proteins (both being groups of GTP-binding proteins). This cluster and cluster 12 are related through the similarity of the thiophene/furan oxidation proteins and the ras subfamily. Cluster 461 (GTP-binding proteins of the OBG family) and cluster 1400 (hypothetical small G-proteins) are not directly related to cluster 12. However, these clusters are related to clusters 29 and 646, as well as to each other (see Fig. 9.3).

9.3.3 Possibly related clusters and local maps

Recall that our clustering algorithm differs from a single linkage algorithm. First, the algorithm identifies groups of possibly related clusters using "local considerations" (strong connections between pairs of clusters). Then, a "global test" is applied to identify nuclei of strong relationships within these groups of clusters, and clusters are merged accordingly. During this process, the algorithm automatically rejects many possible connections among clusters. This happens whenever the quality associated with a connection falls below a certain threshold (see Fig. 8.3). Many of these rejected connections are nevertheless meaningful and reflect genuine though distant homologies. We refer to the rejected mergers as possibly related clusters.

In examining a given cluster, much insight can be gained by observing those clusters which are possibly related to it. Even though some of these connections are justifiably rejected, in particular at the lowest level of confidence we consider ($10^{0}$), many others do reflect structural/functional similarities, despite a weak sequence similarity. At this stage it is hard to give exact rules for evaluating these relations, and one's judgment must be used. Such judgment can also take into account pairwise alignments of protein pairs, one from the cluster under study and one from a possibly related cluster (the alignments are shown on the web site).

Based on the connections with possibly related clusters we can plot local maps (at this stage, mostly schematic) for the neighborhoods of protein families^3. These schematic maps can expose interesting relationships between protein families. An example of a local map is given for the immunoglobulin superfamily. Two of the big clusters in Table 9.2 belong to this superfamily. These are cluster 4 (immunoglobulin V region) and cluster 5 (immunoglobulins C region).

Table 9.6 shows those clusters which are possibly related to cluster 4, ordered by their quality value. These clusters include proteins which are involved in aspects of recognition in the immune system via the variable regions.

Cluster number   Size   Connection quality   Number of edges   Family
1643             5      0.29                 219               B-cell antigen receptor complex associated protein
927              10     0.11                 193               T-cell surface glycoprotein CD4
2613             3      0.03                 20                polymeric-immunoglobulin receptor
5                326    0.01                 226               immunoglobulins and major histocompatibility complex
1137             8      0.01                 18                T-cell-specific surface glycoprotein CD28, cytotoxic T-lymphocyte protein
1189             8      0.01                 9                 myelin P0 protein precursor
1796             5      0.01                 9                 poliovirus receptor precursor

Table 9.6: Clusters possibly related to cluster 4 (Level: 1e-0). Clusters are sorted by quality (i.e. minus the log of the geometric mean of similarity scores). Note that all clusters belong to the superfamily of immunoglobulins.

Likewise, Table 9.7 shows the clusters which are possibly related to cluster 5. These clusters, unlike the clusters related to cluster 4, consist mostly of proteins that adopted the immunoglobulin fold of the Ig constant region. Clusters which are suspected to be unrelated appear in italics (one can validate the significance of "possibly related clusters" using the quality of their relation and the alignments, and insignificant connections can be easily traced and ignored by a manual examination of the alignments on the website).

Cluster number   Size   Connection quality   Number of edges   Family
1831             5      0.38                 248               T-cell receptor gamma chain C region
4                330    0.01                 226               immunoglobulin V region
104              66     0.01                 64                cell adhesion molecules, myelin-associated glycoprotein precursor, axonin-1 precursor, B-cell receptor CD22-beta precursor, and more
578              16     0.01                 28                high/low affinity immunoglobulin epsilon/gamma FC receptor
596              16     0.01                 33                recombination activating proteins, zinc finger, c3hc4 type
856              11     0.01                 11                cornifin (small proline-rich protein)
1262             7      0.01                 21                T lymphocyte activation antigen CD80/CD86 precursor
1636             5      0.01                 8                 basigin precursor
1796             5      0.01                 7                 poliovirus receptor precursor

Table 9.7: Clusters possibly related to cluster 5 (Level: 1e-0). Only clusters with two members or more are shown, and are sorted by quality. Almost all clusters belong to the superfamily of immunoglobulins. Clusters 596 and 856 are probably unrelated (in italic).

The two sets of clusters are mostly disjoint, with the exception of cluster 1796. Members of this cluster contain both regions, whence the cluster is related to both clusters 4 and 5. The different parts of the proteins account for the appropriate relationships.

There is also a direct connection between cluster 4 and cluster 5. The connection is based on 226 pairwise similarities between clusters 4 and 5. However, all the similarities are due to a single protein in cluster 5, a T-cell receptor beta chain (P11364). This protein has one V region in addition to a C region, and the similarity with the 226 proteins in cluster 4 is limited to the V region. No other protein in cluster 5 has a V region. Note that despite these similarities, clusters 4 and 5 did not merge, and the connection was automatically rejected by our clustering algorithm.

^3 Recently, [Holm & Sander 1996] proposed a map of the protein universe. However, their work focuses on the shape (structure) space. By a multivariate scaling method applied to the similarity matrix of all structures, the points (structures) of the fold space are plotted in the 2D space of the two dominant eigenvectors. The distribution of folds is dominated by five densely populated regions (attractors). These attractors are identified as parallel beta, beta-meander, alpha helical, beta-zigzag, and alpha/beta meander.

This information depicts the "geometry" of the protein space in the vicinity of the immunoglobulin superfamily, as in Fig. 9.4. This map also includes indirectly related clusters, i.e. possibly related clusters of order 2 and above (related clusters of related clusters, etc.). The map includes almost all protein families which belong or are related to the immunoglobulin superfamily defined by the SWISSPROT database (see Table 9.8 for details), except for three clusters: cluster 373 (isolated cluster of periplasmic pilus chaperones), cluster 2172 (isolated cluster of THY-1 membrane glycoproteins) and cluster 2363 (B-lymphocyte antigen cd19). This "local map" of relationships is plotted to distinguish between three main groups (this map may change upon the accumulation of new data). The left one corresponds to the variable regions (V domains) of the immunoglobulins, the right one to the constant regions (C domains). In the middle appear many clusters that are a mixture of the two types. The many alternative connections between clusters whose proteins resemble V-domains and those whose proteins resemble C-domains indicate that adhesion molecules, fasciclin II and vascular CAM, are positioned between the classical V and C regions. Indeed, studies of the evolutionary pathway between the structural classes of Ig and Ig-like domains have pointed out the important role of these non-Ig molecules, and they are considered as the I-set according to their intermediate nature [Harpaz & Chothia 1994, Smith & Xue 1997].
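The notion of "order" of a possibly related cluster can be read as a shortest-path distance in the graph of rejected merges. The sketch below computes it with a breadth-first search. The edges encode our reading of the small G-protein example of section 9.3.2 (clusters 29 and 646 related to cluster 12; clusters 461 and 1400 related to 29, to 646 and to each other); the graph representation and the function name are assumptions made for illustration, not the ProtoMap implementation.

```python
from collections import deque
from typing import Dict, Set

# Possibly related clusters around cluster 12 (edges as read from section 9.3.2).
related: Dict[int, Set[int]] = {
    12: {29, 646},
    29: {12, 461, 1400},
    646: {12, 461, 1400},
    461: {29, 646, 1400},
    1400: {29, 646, 461},
}


def relation_order(graph: Dict[int, Set[int]], start: int) -> Dict[int, int]:
    """Breadth-first search: order 1 = directly related clusters,
    order 2 = related clusters of related clusters, and so on."""
    order = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in order:
                order[neighbor] = order[current] + 1
                queue.append(neighbor)
    return order


orders = relation_order(related, 12)
print(sorted(orders.items()))
# [(12, 0), (29, 1), (461, 2), (646, 1), (1400, 2)]
```

A local map restricted to clusters up to a chosen order can then be obtained by filtering `orders` on its values.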

[Figure 9.4 diagram: schematic map with one circle per cluster, showing clusters 4, 5, 104, 578, 621, 854, 927, 1075, 1137, 1189, 1262, 1301, 1468, 1636, 1637, 1643, 1727, 1796, 1831, 1938, 2294, 2613, 4622, 4763, 5186, 5583 and 5847.]

Figure 9.4: The immunoglobulin superfamily. All clusters related to clusters 4 and 5 are shown (these are referred to as directly related clusters), as well as other clusters which are indirectly related (clusters that are possibly related to directly related clusters). In this schematic map, each cluster is represented as a circle whose radius is proportional to the cluster's size. Only clusters' numbers are given. See table 9.8 for more information on these clusters. Clusters on the left (colored group) contain an Ig V-region. The group on the right has an Ig C-region. The clusters in between share both regions (except for cluster 1468, which consists exclusively of V-region).

9.4 Discussion

The graph-based approach suggests a new way for charting the space of all protein sequences. A complete charting of the protein space is a daunting task and many difficulties are encountered. One must begin from well-established statistical measures, in order to identify significant similarities. Great caution and biological expertise are needed to exclude connections which are unacceptable or misleading. The main difficulties stem from chance similarities among sequences and multi-domain-based connections. However, the sheer volume of data makes it necessary to develop automatic methods for such identification.

Cluster number   Size   Family
4                330    immunoglobulin V region
5                326    immunoglobulins and major histocompatibility complex (MHC)
104              66     cell adhesion molecules (CAM, n-CAM, ng-CAM, v-CAM, contactin, fasciclin II), myelin-associated glycoprotein, axonin-1 precursor, B-cell receptor CD22-beta precursor
373 *            25     periplasmic pilus chaperones
578              16     high/low affinity immunoglobulin epsilon/gamma fc receptor
621              15     interleukin-1 receptor, interleukin-1 binding protein, surface antigen
854              11     T-cell surface glycoprotein CD3 delta/epsilon/gamma chain precursor
927              10     T-cell surface glycoprotein CD4
1075             9      MHC class I NK cell receptor precursor
1137             8      T-cell-specific surface glycoprotein CD28, cytotoxic T-lymphocyte protein 4
1189             8      myelin P0 protein
1262             7      T lymphocyte activation antigen CD80/CD86 precursor
1301             7      intercellular adhesion molecule precursor (ICAM)
1468             6      hemagglutinin precursor
1636             5      basigin precursor
1637             5      probable cell adhesion molecule involved in regulating T-cell activation
1643             5      B-cell antigen receptor complex associated protein
1727             5      interleukin-12 beta chain precursor
1796             5      poliovirus receptor, ox-2 membrane glycoprotein
1831             5      T-cell receptor gamma chain C region
1938             4      T-cell surface antigen CD2 precursor
2172 *           4      THY-1 membrane glycoproteins
2363 *           3      B-lymphocyte antigen cd19
2294             3      alpha-1b-glycoprotein
2613             3      polymeric-immunoglobulin receptor
4622             1      beta-2-microglobulin
4763             1      T-cell surface glycoprotein CD4
5186             1      fasciclin III precursor
5583             1      lymphocyte activation gene
5847             1      immunoglobulin mu chain C region membrane-bound

Table 9.8: Clusters belonging to the immunoglobulin superfamily. All clusters (besides those marked with *) appear in the local map of the immunoglobulins (Fig. 9.4).

The work we present here addresses these major problems. We start by creating, for each protein sequence, an exhaustive list of neighboring sequences. These lists take into account the scores of the three major methods for pairwise sequence comparison. These three types of scores are jointly normalized and the lists are filtered. A link is maintained in the lists only when a significant relationship seems to exist. A two-phase clustering algorithm is then applied to identify groups of related sequences. There are two pitfalls to avoid here. On the one hand, it is very easy to follow a very strict rule and generate many small clusters, within each of which the proteins are very closely related. This approach safely avoids the creation of false connections, but adds little to our understanding of the protein space, since it includes only fairly obvious and well-known connections. The other extreme strategy is not useful either: a procedure that declares a connection without sufficient scrutiny does not miss interesting connections. Instead, it generates so many false connections that it becomes impossible to recognize significant relations. In other words, such a permissive method quickly collapses the whole space into a small number of gigantic but biologically meaningless clusters. The problem is to find a golden path where non-obvious relationships are discovered without "littering" the classification with too many false connections.

Our clustering algorithm addresses these perils. The algorithm begins with a very strict,
high-resolution classification that employs only connections of very high statistical significance.
The resulting clusters are then merged to form bigger and more diverse clusters. The algorithm
operates hierarchically: each step adds new, weaker connections to the previously considered
connections. A statistical test is applied in order to identify and eliminate problematic
connections as well as possibly false connections between unrelated proteins.

Chapter 9. ProtoMap: Automatic classification of protein sequences 131

Finally, a hierarchical organization of all protein sequences is obtained, strongly correlated with
a functional partitioning of all proteins. This data structure reveals interesting relations between
and within protein families, and provides a global view ("map") of the space of all proteins.

The classification consists of several thousand clusters, the largest of which contain several
hundred members each. It is interesting to assess the effect of slowing down the transitive closure
algorithm. Indeed, a straightforward application of the transitive closure algorithm (a.k.a.
single-linkage clustering) leads to an avalanche, as discussed above (details not shown). Already
at a confidence level of 10^-20, most of the space is made up of a small number of very large
clusters. This avalanche is caused by chance similarities and chains of domain-based connections
that cause unrelated families to merge into a few giant clusters. A major ingredient of our new
algorithm is the choice of rules for avoiding such undesirable connections.
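The avalanche effect can be illustrated with a minimal sketch of plain single-linkage clustering. This is not the ProtoMap code: the function name `single_linkage`, the toy proteins, and their e-values are all hypothetical, chosen only to show how one weak connection fuses two otherwise well separated families once the threshold is relaxed.

```python
def single_linkage(n_items, edges, threshold):
    """Transitive closure: connected components over edges with e-value <= threshold."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent pointers to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in edges:
        if evalue <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical toy data: two families of three proteins each, tightly connected
# internally, plus a single weak (chance or domain-based) link between them.
edges = [
    (0, 1, 1e-80), (1, 2, 1e-75), (0, 2, 1e-60),  # family A
    (3, 4, 1e-90), (4, 5, 1e-70), (3, 5, 1e-65),  # family B
    (2, 3, 1e-21),                                # weak cross-family link
]

strict = single_linkage(6, edges, 1e-50)  # the two families stay apart
loose = single_linkage(6, edges, 1e-20)   # one weak link merges everything
```

In this toy case a single edge is enough to merge both families, which is the behavior the merging rules of the algorithm are meant to prevent.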

We should note that some connections found in our analysis are still questionable. Some
domain-based connections did escape our filters, and caused unrelated clusters to merge. Also, high
scoring low-complexity segments, which may be biologically meaningless, can lead to false
connections and to the formation of nonhomogeneous clusters. Though we did take into account the
effect of these segments, not all of them were filtered out. Since our goal was to detect many
remote homologies, we used weak filters in this case, and considered many similarities at very low
levels of confidence. We are currently testing more stringent filtration criteria, and improving
our algorithm to handle domain-based connections better. At the moment, such connections can be
easily traced manually, by observing the alignments (at the website). We would like to note that
this project is still in progress. At this stage, we are testing different strategies to infer
multi-domain structure within clusters and refine clusters accordingly. Multiple alignments will be
available as well in the next release.

Our algorithm has produced well defined groups which are strongly correlated with protein
families and subfamilies. When compared with the well accepted databases PROSITE and Pfam,
our classification performed well for most of the families, though many of them were domain based
rather than protein families. Many new clusters in our classification had no match in either
PROSITE or Pfam. The hierarchical organization can indicate the existence of subfamilies
within families, and the concept of possibly related clusters exposes distant relationships which
reflect functional similarities. These relations offer a basis for sketching local maps near protein
families. These connections can be of biological interest, and should be taken into account in the
study of protein families. For example, an overall view of Fig. 9.3 or Fig. 9.4 implies that tracing
the hierarchical organization of a cluster and/or its rejected merges can provide new information
about the corresponding family and reveal relationships among protein families.

Another aspect of the possibly related clusters is that this list can be viewed as a "soft
clustering", where the same protein can participate in several different clusters. Proteins that are
composed of several domains, some of which are shared by proteins from different families, are
naturally associated with more than one group. Therefore, though in this version of ProtoMap a
multi-domain protein is classified to a single cluster, its multi-trait nature is revealed when we
examine its relations with the other clusters as well.

The concept of soft clustering will be integrated in the next releases of ProtoMap, and is
already applied for the online classification of new protein sequences (submitted by the user), in
the newest version of ProtoMap. New sequences are classified to the existing clusters based on their
distributions of connections with the existing clusters. The sequence can be classified to more than
one cluster, with different qualities, to help in predicting the nature of new sequences. For more
information check the ProtoMap website (http://www.protomap.cs.huji.ac.il). See also chapter 10.

9.5 Appendix A - Correlation of ProtoMap releases

Since the first release of ProtoMap (releases 1.0-1.2) we have run our algorithms on a newer version
of SWISSPROT, namely SWISSPROT 35 with updates (dated May 6th 1998), with a total of 72623
proteins (a 39% increase). The corresponding release is numbered 2.0. As was mentioned in section
8.2, we ran the evaluation procedure on SWISSPROT 33, since the updated version is not one of
the major releases, and therefore no corresponding PROSITE or Pfam databases were available.
However, for reference, we checked the correlation of the old release and the new release.
Specifically, we checked how many clusters remained unchanged, how many clusters grew larger due to
the addition of new protein sequences, how many clusters split, and how many clusters merged. This
procedure is used to distinguish clusters which seem "stable" from "unstable" ones. A cluster
in the first release of ProtoMap is considered stable if one of the following conditions holds:

• The cluster remains the same in the new release.

• Only new protein sequences are added to the cluster in the new release.

• Only new protein sequences and old singletons join the cluster.

• The cluster perfectly splits into several smaller subclusters.

• The cluster splits into several clusters, to which only new sequences are added.

• The cluster splits into several clusters, each of which is augmented only by new sequences and
  old singletons.

Such clusters are well correlated with protein families. All other clusters are considered unstable.
Unstable clusters are not necessarily "false" clusters. Some merges and splits of clusters are proper
responses to the new information carried by the addition of new protein sequences. The new clusters
may in fact be better correlated with protein families or subfamilies. Some are unfortunately
improper.
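The six stability conditions listed above share one pattern: every new cluster that inherits members of the old cluster may contain, besides those members, only new sequences and old singletons. A minimal sketch of that check follows. The function name `is_stable` and the set-based data layout are assumptions made for illustration, not the procedure actually used to produce the statistics, and the sketch assumes no protein was removed between releases.

```python
def is_stable(old_cluster, new_clusters, old_proteins, old_singletons):
    """Return True if old_cluster meets one of the six stability conditions.

    old_cluster    -- set of protein IDs forming a cluster in the old release
    new_clusters   -- list of sets, the clusters of the new release
    old_proteins   -- set of all protein IDs present in the old release
    old_singletons -- set of protein IDs that were unclustered in the old release
    """
    # New clusters that inherited at least one member of the old cluster.
    inheritors = [c for c in new_clusters if c & old_cluster]
    for cluster in inheritors:
        for protein in cluster - old_cluster:
            is_new = protein not in old_proteins
            if not is_new and protein not in old_singletons:
                # A member of a different old cluster joined: a genuine merge.
                return False
    return True


# Hypothetical example: the old release has clusters {a, b, c} and {d, e},
# and the singleton s. The protein n1 is new in the second release.
old_proteins = {"a", "b", "c", "d", "e", "s"}
old_singletons = {"s"}

split_and_grown = [{"a", "b", "n1"}, {"c", "s"}, {"d", "e"}]
merged = [{"a", "b", "c", "d", "e"}, {"s"}]

stable_case = is_stable({"a", "b", "c"}, split_and_grown, old_proteins, old_singletons)
unstable_case = is_stable({"a", "b", "c"}, merged, old_proteins, old_singletons)
```

Here the first case is a split augmented by a new sequence and an old singleton (stable), while the second merges two old clusters (unstable).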

The ratio between the two types of clusters can help in assessing the stability of the ProtoMap
system. The results are summarized in table 9.9, based on the statistics of 1892 clusters with more
than 5 members in ProtoMap 1.2. This analysis shows that 89% of the clusters are stable, in terms
of the conditions defined above.

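Mapped to                 Status       Category                              Clusters
------------------------------------------------------------------------------------
A single new cluster      "Stable"     The same                                   399
(1514 clusters)           "Stable"     Only new sequences added                   947
                          "Stable"     Merged with singletons                     102
                          "Unstable"   Others                                      66
Several new clusters      "Stable"     Perfect split                               29
(378 clusters)            "Stable"     Split, only new sequences added            162
                          "Stable"     Split and merged with singletons            39
                          "Unstable"   Others                                     148
------------------------------------------------------------------------------------
Total stable                                                               1678 (89%)

Table 9.9: Correlation of ProtoMap releases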
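As a quick consistency check of Table 9.9, the counts can be summed directly. The short sketch below uses only the numbers reported in the table; the dictionary keys are shorthand labels introduced here.

```python
# Counts taken from Table 9.9; the key names are shorthand for the table's categories.
single = {"same": 399, "new_only": 947, "with_singletons": 102, "others": 66}
several = {"perfect_split": 29, "split_new_only": 162,
           "split_with_singletons": 39, "others": 148}

total = sum(single.values()) + sum(several.values())
stable = total - single["others"] - several["others"]
stable_percent = round(100 * stable / total)

print(sum(single.values()), sum(several.values()), total, stable, stable_percent)
```

The sums reproduce the 1514 and 378 clusters of the two groups, the 1892 clusters analyzed, and the 1678 stable clusters (89%).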


Chapter 10

The ProtoMap website

The ProtoMap website (http://www.protomap.cs.huji.ac.il) offers an exhaustive classification of
all proteins in the SwissProt database into groups of related proteins, based on the results of the
analysis described in chapters 8 and 9.

The resulting classification splits the protein space into well defined groups of proteins, most
of which are closely correlated with natural biological families and superfamilies. This hierarchical
organization may help in detecting finer subfamilies that make up known families of proteins. In
addition it brings forth interesting relations between protein families. The website was designed so
as to enable easy access to our results. The analysis of the SwissProt database can be accessed by
a variety of methods, and user friendly graphic tools were developed to facilitate the presentation
of this huge body of information.

10.1 How to use ProtoMap

The analysis was carried out at varying thresholds, or levels of confidence, in the range 1e-100
(very high confidence level) to 1e-0 = 1 (almost pure chance similarities). At each level, the
universe of all proteins splits into clusters, which get larger and coarser as the confidence level
decreases. Each cluster is given a number. Clusters are numbered in decreasing order of size.

Selecting the entrance level

The default entrance level is 1e-0, but many interesting relations are revealed at higher levels as
well. In any case, your selection of the entrance level does not limit the search as you advance in
your analysis. It is possible to move among levels, thus tracing the fusion of subfamilies into larger
families, and to identify links between protein families.

10.1.1 Search

Search is always performed at a specific level. Five different access modes are available. For
example, it is possible to search by a general keyword (e.g. toxin, coiled coil). If one is
interested in a specific protein, one can search by SwissProt accession number (or ID). If the
keyword option is too general and the SwissProt ID is too specific, it is possible to search by
protein name (e.g. synaptotagmin, brevinin). Other search modes are by cluster number and by
cluster size.

A keyword search may use a simple query such as "ATP-binding". More complex queries
are made possible by the logical connectives AND, OR, NOT and XOR, with parentheses to define
precedence. For example: (DNA-binding OR RNA-binding) AND 3D-structure.
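To make the query semantics concrete, here is a minimal sketch of how such a boolean keyword query could be evaluated against the keyword set of one cluster. It is not the website's implementation: the function `matches` is hypothetical, keywords are assumed to be single tokens without spaces, and the default precedence without parentheses (NOT, then AND, then XOR, then OR) is an assumption, since the text only states that parentheses define precedence.

```python
import re


def matches(query, keywords):
    """Evaluate a boolean keyword query against the keywords of one cluster."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    present = {k.lower() for k in keywords}
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_xor()
        while peek() == "OR":
            take()
            rhs = parse_xor()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_xor():
        value = parse_and()
        while peek() == "XOR":
            take()
            rhs = parse_and()
            value = value != rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token == ")":
            raise ValueError("unexpected closing parenthesis")
        return token.lower() in present  # a plain keyword, compared case-insensitively

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


query = "(DNA-binding OR RNA-binding) AND 3D-structure"
hit = matches(query, {"DNA-binding", "3D-structure", "Zinc-finger"})
miss = matches(query, {"RNA-binding"})
```

A search would then report every cluster whose keyword set satisfies the query.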

The search outputs a table with all clusters matching your query (see Fig. 10.1). Each cluster
is given with its number (clusters are numbered by size), its size, and the keywords associated with
the cluster. If you click on one of the keywords, a new search is activated with the clicked word as
the new keyword.

Figure 10.1: Results of the search on keyword "electron transport".

Search by SwissProt ID (or AC) yields a card with information about the protein and the
specific cluster which contains the selected protein (see Fig. 10.2), as well as the full list of
neighbors, combined from all three methods for sequence comparison (SW, BLAST, and FASTA), and all
the pairwise alignments with these neighboring sequences.


Figure 10.2: Results of the search on SwissProt ID "acha_human".

10.1.2 Browsing a cluster

When your search (by any of the methods described above) yields one or more clusters of interest,
proceed by clicking the cluster number(s). You then receive the list of all proteins in the
cluster. In this list, proteins are ordered first by their order of transitivity [1], and then by
the number of new members they added to the cluster. Each protein is accompanied by a short
description (the full information is available as well), and a list of known domains and motifs
(from the PROSITE database) which appear in the protein.

Two Java applets can help visualize different aspects of this cluster. The tree-like presentation
of the cluster (see Fig. 10.3) is very suggestive for biologists. It places each protein at a leaf
of a tree (on the left side). The root of the tree is the rightmost vertex. The x-coordinate of an
internal node in the tree indicates the similarity among the proteins at the leaves of the
corresponding subtree: the higher the similarity, the further to the left the node is located. This
tool can help in detecting the existence of subgroups within the cluster, and in obtaining a
hierarchical organization within the cluster.

[1] The first protein is the seed, whose order of transitivity is 0. Its neighbors are of order 1.
Additional proteins that are neighbors of 1st order proteins are of order 2, etc.
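The order of transitivity defined above is the breadth-first distance from the seed in the graph of neighbor relations. The sketch below computes it and produces a listing sorted as described for the cluster page. The function `browse_order` and the toy neighbor lists are hypothetical, and two points are assumptions: a protein is credited with the members first reached through it, and ties in order are broken by larger counts first.

```python
from collections import deque


def browse_order(seed, neighbors):
    """Breadth-first search from the seed over the neighbor lists of a cluster."""
    order = {seed: 0}   # order of transitivity of each protein
    added = {seed: 0}   # number of members first reached through each protein
    queue = deque([seed])
    while queue:
        protein = queue.popleft()
        for other in neighbors.get(protein, ()):
            if other not in order:
                order[other] = order[protein] + 1
                added[other] = 0
                added[protein] += 1
                queue.append(other)
    # Sort by order of transitivity, then by members added (descending), then by name.
    listing = sorted(order, key=lambda p: (order[p], -added[p], p))
    return order, listing


# Hypothetical cluster: A is the seed, B and C are its neighbors, D and E are
# reached only through first-order proteins.
neighbors = {"A": ["B", "C"], "B": ["A", "D", "E"], "C": ["A", "D"]}
order, listing = browse_order("A", neighbors)
```

In this example B precedes C in the listing because both are of order 1 but B brought two new members into the cluster.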

Figure 10.3: Tree-like presentation of the heat shock 70 family (cluster 23 at level 1e-0). You can zoom
in on any part of the tree, and successively higher zooming is possible. By hovering over a vertex (without pressing) you
get a summary line of all proteins descended from this vertex. When you press on a vertex, you get a detailed list
of all proteins descended from the vertex.

The applet called "higher level constituents" allows you to move between levels and trace the
formation of clusters. Isolated clusters at a given level may unite when the threshold is lowered.
This tool graphically shows the clusters which were isolated at higher confidence levels and form
the cluster under observation at the current lower threshold, as well as the connections between
these clusters (see Fig. 10.4). This way it is possible to trace the formation of families out of
subfamilies.

By clicking on a vertex, a new window appears with detailed information about the corresponding
cluster. Besides the list of members and the summary, you can get the tree-like presentation
of the specific cluster, as well as its own higher-level constituents. By checking the 'higher-order
constituents' button you will move up to the next building level, at which the cluster decomposes
into even smaller components (if the cluster does not change up to the level of 1e-100, or if you
are already at the level of 1e-100, then this button is disabled). The new graph will replace the
original graph. You may continue in the same manner and move even higher, or you can check other
components in the original graph or in the subsequent graphs.

Figure 10.4: Higher level constituents of the extended 'short-chain' alcohol dehydrogenases family
(cluster 19 at level 1e-0). Each circle stands for a cluster at the higher threshold level. Edges represent new
connections between the clusters which were formed upon lowering the threshold. The size of the circles and the
width of the edges are proportional to the number of proteins and the number of connections, respectively.

By clicking on an edge, a new window appears with the list of pairs connecting the corresponding
vertices (clusters). Select any pair and press 'Align' to get the pairwise alignment of
the pair. The result is a schematic representation that illustrates the alignment and the sequences
(see Fig. 10.5). Each protein is presented as a gray bar, and the shared/similar regions composing
the alignment are presented as yellow segments within the gray bars. All PROSITE motifs and
domains are presented schematically, as blue segments, to indicate their position. Clicking on a
blue segment in a protein will give a short description of the corresponding pattern/family. This
allows you to immediately recognize proteins that are composed of several domains and to check
whether the alignment extends throughout the two sequences or is limited to the motif region, etc.
Also available are the detailed alignment in text format (by pressing the 'Alignment as text'
button), and the SwissProt information about the corresponding proteins (the 'Info' buttons).

Figure 10.5: Connections between a subcluster of alcohol dehydrogenases and a subcluster of various oxidoreductases
and dehydrogenases, all belonging to the extended family of 'short chain' alcohol dehydrogenases.

Another interesting feature of ProtoMap is the list of possibly related clusters. These
are clusters whose connection with the cluster under study was automatically rejected by the
clustering algorithm. This happens whenever the quality associated with a connection falls below
a certain threshold. Even though some of these connections are justifiably rejected, many others
are nevertheless meaningful and reflect genuine though distant homologies. In examining a given
cluster, much insight can be gained by observing its possibly related clusters. For each cluster
we provide the list of its possibly related clusters. In this list clusters are ordered by quality
of relatedness and by the number of connections. To evaluate the relations, and distinguish true
relationships from insignificant connections, the pairwise alignments of protein pairs, one from the
cluster under study and one from a possibly related cluster, are given.

10.1.3 Analyzing new sequences

Finally, suppose that you have a new protein sequence which is not yet in the database. A recent
update of this site offers online classification of protein sequences submitted by the user. The
sequence is compared with the sequences in the database, and the protein is classified to the
existing clusters based on its distribution of connections with each cluster. The output is a list
of clusters, sorted by the quality of the connection with the new sequence. Again, the list of
pairwise connections and the pairwise alignments are available, presented graphically (as in Fig.
10.5) and as text.
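The following is a rough sketch of how such a ranked list could be produced from the database hits of a submitted sequence. The quality measure is not specified in this section, so the score used here (the mean of -log10 of the e-values of the connections to each cluster) is purely an assumption for illustration, as are the function name `rank_clusters`, the protein identifiers, and the cluster numbers.

```python
import math


def rank_clusters(hits, cluster_of):
    """Group a new sequence's hits by cluster and rank the clusters.

    hits       -- list of (protein_id, evalue) pairs for the new sequence
    cluster_of -- mapping from protein ID to the number of its cluster
    """
    scores = {}
    for protein, evalue in hits:
        # Assumed per-connection score: larger means a more significant hit.
        scores.setdefault(cluster_of[protein], []).append(-math.log10(evalue))

    ranked = [(cluster, sum(s) / len(s), len(s)) for cluster, s in scores.items()]
    # Best mean score first; ties broken by number of connections, then cluster number.
    ranked.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return ranked


# Hypothetical hits of a submitted sequence against proteins in two clusters.
cluster_of = {"P1": 23, "P2": 23, "P3": 19}
hits = [("P1", 1e-40), ("P2", 1e-30), ("P3", 1e-5)]
ranking = rank_clusters(hits, cluster_of)
```

Each row of the result holds the cluster number, its assumed quality score, and the number of connections, so the same sequence can be reported against several clusters with different qualities.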

For a comprehensive view of this project the reader is again encouraged to visit the ProtoMap
website (http://www.protomap.cs.huji.ac.il).


Chapter 11

Concluding remarks

The study presented in this thesis addresses the problem of identifying high order features and
organization within the space of all protein sequences (sequence space). Our aim is to exhaustively
"chart" all proteins and to automatically classify them into families, based on pairwise similarities.
This high order organization is a key step towards a global view (a "map") of the sequence space.
The creation of such a global map is a major target for protein and genome science. It would
add much to our understanding of the relations between different classes of proteins, as well as
of fundamental evolutionary processes. In particular, an automatic tool for the classification of
protein sequences and their characterization is crucial in the search for the function and structure
of proteins. The information acquired may serve investigators and can help in setting research plans
and in designing experiments.

Our study presents two approaches for the global self-organization of the sequence space: the
Euclidean embedding approach and the graph based approach.

The first approach converts the problem of identifying high order structures within the sequence
space into a problem of pattern recognition in a Euclidean space. This approach has to
overcome two major obstacles: (i) the data is inherently high dimensional; (ii) it is hard to
organize it according to a provably valid model. Consequently, it is hard to predict how well new
samples are integrated into this model (the problem of generalization). Furthermore, generalization
in high dimensional spaces is a notoriously problematic task. A major difficulty is that representing
an n-dimensional object to a desired precision may require a sample set of size exponential in n
("the curse of dimensionality").
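A standard back-of-the-envelope illustration of this exponential growth (a textbook covering argument, not a result of this thesis) takes the unit cube as the object and a regular grid of spacing ε as the sample set. The symbols N and ε are introduced here for the example only.

```latex
\[
  N(\varepsilon, n) \;=\; \left\lceil \frac{1}{\varepsilon} \right\rceil^{n},
  \qquad
  N(0.1,\, 2) = 10^{2},
  \qquad
  N(0.1,\, 20) = 10^{20}.
\]
```

At a fixed precision of 0.1, moving from 2 to 20 dimensions raises the required number of grid points from one hundred to 10^20.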

We deal with the first issue by using a novel geometric embedding of the segments in a lower
dimensional Euclidean space, with small distortion. However, it is important to understand that
the drastic reduction in dimensions achieved by the embedding algorithm does not automatically
guarantee the ability to properly generalize: while our embedding algorithm maintains, with small
distortion, the distances among data points, nothing is proven so far about other points from
the original, high dimensional distribution. In other words, there is no guarantee that the whole
distribution is smoothly embedded in the lower dimensional space. This would follow only from
stronger assumptions on the sample set. Rather, the validity of our clustering and its generalization
power are monitored through a careful cross-validated hierarchical clustering of the segments in this
lower dimensional space. This clustering is hierarchical, and thus offers additional insight into the
large-scale organization of the space of all protein sequences.
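To make the notion of "small distortion" concrete, the sketch below measures the distortion of a given embedding as the largest expansion of any pairwise distance divided by the largest contraction (the usual definition for metric embeddings). It does not implement the embedding algorithm itself. The function name `distortion` and the four-point example are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def distortion(original, embedded):
    """Ratio of the largest to the smallest stretch over all pairs of points.

    original -- matrix of pairwise distances in the source metric space
    embedded -- list of coordinate tuples, one per point, in Euclidean space
    """
    ratios = [
        math.dist(embedded[i], embedded[j]) / original[i][j]
        for i, j in combinations(range(len(embedded)), 2)
    ]
    return max(ratios) / min(ratios)


# Hypothetical example: the shortest-path metric of a 4-cycle, embedded as the
# corners of a unit square. Adjacent pairs keep distance 1, while opposite
# pairs shrink from 2 to sqrt(2).
cycle_metric = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
d = distortion(cycle_metric, square)
```

A distortion of 1 would mean that all sampled distances are preserved up to a common scale. As the paragraph above stresses, such a measurement says nothing about points outside the sample.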

The resulting hierarchical tree of clusters offers a new representation of protein sequences and
families, which compares favorably with classifications based on functional and structural data
about proteins. Some of the known families were clustered into well separated clusters. Motifs and
domains such as the Zinc Finger, EF hand, Homeobox, EGF-like and others were automatically
identified. This tree of clusters also provides some insights into relations between protein
families, relations which are suggested by examining the splits along the tree. At a higher level
of analysis we introduced new tools for representing and analyzing protein families and their
relations, as well as a new concise representation for protein sequences which can be very effective
in presenting multiple alignments for complex protein families, and can also be used in searching
for close relatives in the database.

The second approach uses a graph representation of the sequence space and applies a hierarchical
two-phase clustering algorithm. Our algorithm can be described as a moderate version of
the transitive closure algorithm. At each round of this process we gain statistical information on
the relationships among current clusters. This information is then used to merge certain clusters
(those that pass a certain statistical test), thus forming the next round of larger, coarser clusters.
The algorithm starts from a very conservative classification, and is repeatedly applied at varying
levels of confidence, the input for each stage being the classification output of the previous stage.
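The round structure described above can be sketched as follows. This is only an illustration of the idea of moderated merging, not the ProtoMap algorithm: the merge test used here (the fraction of cross-cluster pairs that are connected at the current threshold must reach `min_fraction`) is a stand-in assumption for the statistical test, and the function name, parameters, and toy data are hypothetical.

```python
def moderated_clustering(n_items, edges, thresholds, min_fraction=0.5):
    """Merge clusters level by level, only where a (stand-in) merge test passes."""
    label = list(range(n_items))  # start from the most conservative partition
    for threshold in sorted(thresholds):  # strictest confidence level first
        merged = True
        while merged:
            merged = False
            size = {}
            for lab in label:
                size[lab] = size.get(lab, 0) + 1
            # Count connections between each pair of current clusters.
            links = {}
            for a, b, evalue in edges:
                la, lb = label[a], label[b]
                if evalue <= threshold and la != lb:
                    key = (min(la, lb), max(la, lb))
                    links[key] = links.get(key, 0) + 1
            for (la, lb), count in sorted(links.items()):
                if count / (size[la] * size[lb]) >= min_fraction:
                    label = [la if lab == lb else lab for lab in label]
                    merged = True
                    break  # statistics are recomputed after every merge
    groups = {}
    for item, lab in enumerate(label):
        groups.setdefault(lab, []).append(item)
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Hypothetical toy data: two tight families joined by a single weak connection.
edges = [
    (0, 1, 1e-80), (1, 2, 1e-75), (0, 2, 1e-60),
    (3, 4, 1e-90), (4, 5, 1e-70), (3, 5, 1e-65),
    (2, 3, 1e-21),
]
clusters = moderated_clustering(6, edges, thresholds=[1e-50, 1e-20])
```

With these toy values the single weak link connects only one of the nine cross-family pairs, so the merge is rejected and the two families remain separate even at the permissive level. Such a rejected connection is the kind of relation that would be reported as "possibly related".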

The resulting classification splits the protein space into well defined groups of proteins, which
are closely correlated with natural biological families and superfamilies. Different indices of validity
were applied to assess the quality of our classification and compare it with the protein families in the
PROSITE and Pfam databases. Our classification agrees with these domain-based classifications for
between 64.8% and 88.5% of the proteins. It also finds many new clusters of protein sequences which
were not classified by these databases. The hierarchical organization suggested by our analysis
reveals finer subfamilies in families of known proteins, as well as many interesting relations between
protein families. The connections with possibly related clusters (connections which were rejected
during the clustering process) usually expose distant relationships and can be used to sketch local
maps near protein families.

A considerable effort has been made to make these results accessible to other researchers, and the
ProtoMap website has been constructed for this purpose (http://www.protomap.cs.huji.ac.il). This
is an interactive platform for analyzing the sequence space that presents the results of the graph
based approach. Graphical tools were developed and are used for "navigating" through this protein
map. The amount of information which one can extract from this map is overwhelming, and many
interesting relations (potentially evolutionary) between sequences whose sequence similarity is
weak are exposed by this map.


11.1 The whole picture

It is very difficult or even impossible to properly classify all proteins. The space of proteins has
many different facets, all of which should be considered in a future, more thorough classification:
3D structure/fold, biological function, domain content, cellular location, tissue specificity,
organism (source), metabolic pathways, etc. This work differs from previous large scale analyses in
several ways: (i) No pre-defined groups or other classifications are employed in our analysis.
Moreover, no multiple alignments of the proteins are needed. (ii) We chart the space of all protein
sequences in SWISSPROT, not just particular families. (iii) We offer a global organization of all
protein sequences.

Our segment clustering approach (the Euclidean embedding approach) provides an elegant,
higher level representation of protein sequences. At the moment, the prerequisite of a metric on
the sequence space has limited the extent to which these tools can be used for practical analysis of
protein sequences. However, this approach relies on a solid theoretical foundation, and we believe
that these tools can be refined and extended to provide interesting predictions on the relationships
among protein families and the nature of new sequences.

Our graph based approach provides a practical method for self-organization of all complete
protein sequences. It is applied to the set of all SWISSPROT sequences, and yields an exhaustive
classification. This leads to the definition of a new pseudo-metric on the space of all protein
sequences. In most cases, this measure turns out to be more sensitive than the existing measures
from which it was derived. Moreover, this work has created an infrastructure for analysis of
the sequence space and the structure space, in which other similarity measures and sophisticated
clustering algorithms can be integrated.

These two different approaches are complementary in several aspects. The first is based on
geometrical considerations but is restricted to the analysis of protein segments. The second lacks
the geometrical aspects but can be applied to complete protein sequences, and is based on graph
algorithms. The first starts with a single cluster and successively splits clusters. The second
starts from many small clusters and successively merges clusters. Both studies are fully automatic
classifications of all protein sequences in SWISSPROT, without incorporation of any biological
considerations other than the use of standard pairwise sequence comparisons. Both studies yield
a hierarchical organization of all protein sequences, and reveal interesting relations between and
within protein families.

11.2 Future directions

In this study we have taken a long step towards a global map of the protein space. Several further
developments are planned for the very near future. Specifically:

(A) Tools for classifying new sequences were recently developed. These tools will be complemented
by tools for assessing the effect of newly introduced sequences on the entire space.

(B) Creating domain-based charts. This is a significant goal, since proteins often consist of
several functional domains. These domains constitute natural biological entities which deserve to
be individually investigated.

(C) Allowing an overlap among families. Currently our clusters are mutually exclusive. In view
of the ubiquity of multi-domain proteins, it makes sense to allow the same protein to participate
in several clusters. In other words, each protein can be assigned to more than one cluster, with
different probabilities (soft clustering). This concept is already integrated in ProtoMap for the
online classification of new protein sequences.

(D) Incorporating structural considerations into the global organization of proteins, by combining
sequence based metrics and structure based data.

These developments, which are already in progress, are all expected to add much to our
understanding of the properties of the protein space, to protein function analysis, and to studies on
evolution and sequence-structure relationships.

Bibliography

[Agrafiotis 1997] Agrafiotis, D. K. (1997). A new method for analyzing protein sequence
relationships based on Sammon maps. Protein Sci. 6, 287-293.

[Altschul 1991] Altschul, S. F. (1991). Amino acid substitution matrices from an information
theoretic perspective. J. Mol. Biol. 219, 555-565.

[Altschul et al. 1990] Altschul, S. F., Carrol, R. J. & Lipman, D. J. (1990). Basic local alignment
search tool. J. Mol. Biol. 215, 403-410.

[Altschul et al. 1994] Altschul, S. F., Boguski, M. S., Gish, W. G. & Wootton, J. C. (1994). Issues
in searching molecular sequence databases. Nature Genetics 6, 119-129.

[Altschul & Gish 1996] Altschul, S. F. & Gish, W. (1996). Local alignment statistics. Methods
Enzymol. 266, 460-480.

[Altschul et al. 1997] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucl. Acids Res. 25, 3389-3402.

[Attwood & Beck 1994] Attwood, T. K. & Beck, M. E. (1994). PRINTS-a protein motif fingerprint
database. Protein Eng. 7, 841-848.

[Bairoch & Boeckman 1992] Bairoch, A. & Boeckman, B. (1992). The SWISS-PROT protein
sequence data bank. Nucl. Acids Res. 20, 2019-2022.

[Bairoch 1991] Bairoch, A. (1991). PROSITE: a dictionary of sites and patterns in proteins. Nucl.
Acids Res. 19, 2241-2245.

[Bairoch et al. 1997] Bairoch, A., Bucher, P. & Hofmann, K. (1997). The PROSITE database, its
status in 1997. Nucl. Acids Res. 25, 217-221.

[Barak et al. 1995] Barak, A., Laden, O. & Yarom, Y. (1995). The NOW MOSIX and its
preemptive process migration scheme. Bulletin of the IEEE Technical Committee on Operating
Systems and Application Environments 7:2, 5-11.

[Barker et al. 1996] Barker, W. C., Pfeiffer, F. & George, D. G. (1996). Superfamily classification
in PIR-international protein sequence database. Methods Enzymol. 266, 59-71.

[Bellman 1957] Bellman, R. (1957). "Dynamic programming". Princeton University Press.

[Blatt et al. 1997] Blatt, M., Wiseman, S. & Domani, E. (1997). Data clustering using a model
granular magnet. Neural Computation 9, 1805-1842.

[Bourgain 1985] Bourgain, J. (1985). On Lipschitz embedding of finite metric spaces in Hilbert
space. Israel J. Math. 52, 46-52.

[Brenner et al. 1997] Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997). Population statistics
of protein structures: lessons from structural classifications. Curr. Opin. Struct. Biol. 7, 369-376.

[Brenner et al. 1998] Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Assessing sequence
comparison methods with reliable structurally identified distant evolutionary relationships.
Proc. Natl. Acad. Sci. USA 95, 6073-6078.

[Casari et al. 1995] Casari, G., Sander, C. & Valencia, A. (1995). A method to predict functional
residues in proteins. Nat. Struct. Biol. 2, 171-178.

[Chothia 1992] Chothia, C. (1992). One thousand families for the molecular biologist. Nature 357,
543-544.

[Compugen 1998] Compugen LTD. BIOCCELERATOR Manual. http://www.compugen.co.il

[Cormen et al. 1990] Cormen, T. H., Leiserson, C. E. & Rivest, R. L. (1990). "Introduction to
algorithms". MIT Press/McGraw-Hill Book Company.

[Cover & Thomas 1991] Cover, T. M. & Thomas, J. A. (1991). "Elements of Information Theory".
John Wiley and Sons, New York. pp. 12-33.

[Cukierman et al. 1995] Cukierman, E., Hub er, I., Rotman, M. & Cassel, D. (1995). The ARF1

GTPase-activating protein: zinc nger motif and Golgi complex lo calization. Science 270,

1999-2002.

[Dayhoff et al. 1978] Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978). A model of evolutionary

change in proteins. In Atlas of protein sequence and structure, ed. Dayhoff, M. vol. 5, Suppl.

3, pp 345-352. National biomedical research foundation, Silver Spring, MD.

[Dembo & Karlin 1991] Dembo, A. & Karlin, S. (1991). Strong limit theorems of empirical functionals

for large exceedances of partial sums of i.i.d. variables. Ann. Prob. 19, 1737-1755.

[Dembo et al. 1994b] Dembo, A., Karlin, S. & Zeitouni, O. (1994). Limit distribution of maximal

non-aligned two-sequence segmental score. Ann. Prob. 22, 2022-2039.

[Doolittle 1992] Doolittle, R. F. (1992). Reconstructing history with amino acid sequences. Protein

Sci. 1, 191-200.

[Doolittle 1998] Doolittle, R. F. (1998). Microbial genomes opened up. Nature 392, 339-342.

[Downward 1990] Downward, J. (1990). The ras superfamily of small GTP-binding proteins. Trends

Biochem. Sci. 15, 469-472.

[Duda & Hart 1973] Duda, R. O. & Hart, P. E. (1973). "Pattern Classification and Scene Analysis".

John Wiley and Sons, New York. pp. 189-252.

[Erdos and Renyi 1970] Erdos, P. & Renyi, A. (1970). On a new law of large numbers. J. Anal.

Math. 22, 103-111.

[Feng et al. 1985] Feng, D. F., Johnson, M. S. & Doolittle, R. F. (1985). Aligning amino acid

sequences: comparison of commonly used methods. J. Mol. Evol. 21, 112-125.

[Ferran et al. 1994] Ferran, E. A., Pflugfelder, B. & Ferrara, P. (1994). Self-Organized Neural Maps

of Human Protein Sequences. Protein Sci. 3, 507-521.

[Fitch & Margoliash 1967] Fitch, W. M. & Margoliash, E. (1967). Construction of phylogenetic

trees. Science 155, 279-284.

[Flores et al. 1993] Flores, T. P., Orengo, C. A., Moss, D. & Thornton, J. M. (1993). Comparison

of conformational characteristics in structurally similar protein pairs. Protein Sci. 2, 1811-1826.

[Frankl & Maehara 1988] Frankl, P. & Maehara, H. (1988). The Johnson-Lindenstrauss lemma and

the sphericity of some graphs. J. Comb. Th. B 44, 355-362.

[Gdalyahu et al. 1998] Gdalyahu, Y., Weinshall, D. & Werman, M. (1998). A randomized algorithm

for pairwise clustering. To appear in NIPS 98. ???

[George et al. 1996] George, D. G., Barker, W. C., Mewes, H. W., Pfeiffer, F. & Tsugita, A. (1996).

The PIR-International protein sequence database. Nucl. Acids. Res. 24, 17-20.

[Gonnet et al. 1992] Gonnet, G. H., Cohen, M. A. & Benner, S. A. (1992). Exhaustive matching

of the entire protein sequence database. Science 256, 1443-1445.

[Gracy & Argos 1998] Gracy, J. & Argos, P. (1998). Automated protein sequence database classification.

I. Integration of compositional similarity search, local similarity search and multiple

sequence alignment. II. Delineation of domain boundaries from sequence similarity. Bioinformatics

14:2, 164-187.

[Grantham 1974] Grantham, R. (1974). Amino acid difference formula to help explain protein

evolution. Science 185, 862-864.

[Gray 1984] Gray, R. M. (1984). Vector Quantization. IEEE ASSP Mag. 4-29.

[Gray et al. 1980] Gray, R. M., Kieffer, J. C., & Linde, Y. (1980). Locally optimal block quantizer

design. Information and Control 45, 178-198.

[Green et al. 1993] Green, P., Lipman, D., Hillier, L., Waterston, R., States, D. & Claverie, J. M.

(1993). Ancient conserved regions in new gene sequences and the protein databases. Science,

259, 1711-1716.

[Gribskov & Veretnik 1996] Gribskov, M. & Veretnik, S. (1996). Identification of sequence patterns

with profile analysis. Methods Enzymol 266, 198-211.

[Gumbel 1958] Gumbel, E. J. (1958). "Statistics of Extremes". Columbia University Press, New

York.

[Han & Baker 1995] Han, K. F. & Baker, D. (1995). Recurring local sequence motifs in proteins.

J. Mol. Biol. 251, 176-187.

[Hardison 1996] Hardison, R. C. (1996). A brief history of hemoglobins: plant, animal, protist, and

bacteria. Proc. Natl. Acad. Sci. USA 93, 5675-5679.

[Harpaz & Chothia 1994] Harpaz, Y. & Chothia, C. (1994). Many of the immunoglobulin superfamily

domains in cell adhesion molecules and surface receptors belong to a new structural set

which is close to that containing variable domains. J. Mol. Biol. 238, 528-539.

[Harris et al. 1992] Harris, N. L., Hunter, L. & States, D.J. (1992). Mega-classification: Discovering

motifs in massive datastreams. In Proc. of the 10th national conf. on AI, 837-842, AAAI

press/The MIT Press, Menlo park/Cambridge.

[Henikoff & Henikoff 1991] Henikoff, S. & Henikoff, J. G. (1991). Automated assembly of protein

blocks for database searching. Nucl. Acids Res. 19, 6565-6572.

[Henikoff & Henikoff 1992] Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices

from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915-10919.

[Henikoff & Henikoff 1993] Henikoff, S. & Henikoff, J. G. (1993). Performance evaluation of amino

acid substitution matrices. Proteins 17, 49-61.

[Hilbert et al. 1993] Hilbert, M., Bohm, G. & Jaenicke, R. (1993). Structural relationships of

homologous proteins as a fundamental principle in homology modeling. Proteins 17, 138-151.

[Hjorth 1994] Hjorth, J. S. U. (1994). "Computer intensive statistical methods: validation, model

selection, and bootstrap". Chapman & Hall, London.

[Hobohm & Sander 1995] Hobohm, U. & Sander, C. (1995). A sequence property approach to

searching protein database. J. Mol. Biol. 251, 390-399.

[Holm & Sander 1996] Holm, L. & Sander, C. (1996). Mapping the protein universe. Science 273,

595-602.

[Jain & Dubes 1988] Jain, A.K. & Dubes, R.C. (1988). "Algorithms for clustering data". Prentice

Hall, New Jersey.

[Johnson & Lindenstrauss 1984] Johnson, W. B. & Lindenstrauss, J. (1984). Extensions of Lipschitz

mappings into a Hilbert space. Contemporary Mathematics 26, 189-206.

[Johnson & Overington 1993] Johnson, M. S. & Overington, J. P. (1993). A structural basis for

sequence comparisons. An evaluation of scoring methodologies. J. Mol. Biol. 233, 716-738.

[Jones et al. 1992] Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). The rapid generation of

mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:3, 275-282

[Karlin & Altschul 1990] Karlin, S. & Altschul, S. F. (1990). Methods for assessing the statistical

significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad.

Sci. USA 87, 2264-2268.

[Karlin & Altschul 1993] Karlin, S. & Altschul, S. F. (1993). Applications and statistics for multiple

high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA 90, 5873-5877.

[Kearns & Vazirani 1994] Kearns, M.J. & Vazirani, U.V. (1994). "An Introduction to Computational

Learning Theory". MIT Press, Cambridge, Mass. pp. 1-48, chp. 1-2.

[Khachian 1979] Khachian, L. G. (1979). A polynomial algorithm in linear programming. Doklady

Akademii Nauk SSSR 244, 1093-1096.

[Klock & Buhmann 1997] Klock, H. & Buhmann, J. M. (1997). Multidimensional scaling by

deterministic annealing. In Proceedings of the International Workshop on Energy Minimization

Methods in Computer Vision and Pattern Recognition pp. 245-260.

[Koonin et al. 1996] Koonin, E. V., Tatusov, R. L. & Rudd, K. E. (1996). Protein sequence

comparison at genome scale. Methods Enzymol 266, 295-321.

[Krause & Vingron 1998] Krause, A. & Vingron, M. (1998). A set-theoretic approach to database

searching and clustering. 14:5, 430-438.

[Levin et al. 1986] Levin, J. M., Robson, B. & Garnier, J. (1986). An algorithm for secondary

structure determination in proteins based on sequence similarity. FEBS Letters 205, 303-308.

[Levitt & Gerstein 1998] Levitt, M. & Gerstein, M. (1998). A Unified Statistical Framework for

Sequence Comparison and Structure Comparison. Proc. Natl. Acad. Sci. USA 95, 5913-5920.

[Linial et al. 1995] Linial, N., London, E. & Rabinovich, Yu. (1995). The geometry of graphs and

some of its algorithmic applications. Combinatorica 15, 215-245.

[Lipman & Pearson 1985] Lipman, D. J. & Pearson, W. R. (1985). Rapid and sensitive protein

similarity. Science 227, 1435-1441.

[McLachlan 1971] McLachlan, A. D. (1971). Tests for comparing related amino-acid sequences.

Cytochrome c and cytochrome c551. J. Mol. Biol. 61, 409-424.

[Miyata et al. 1979] Miyata, T., Miyazawa, S. & Yasunaga, T. (1979). Two types of amino acid

substitutions in protein evolution. J. Mol. Evol. 12, 219-236.

[Mohana Rao 1987] Mohana Rao, J. K. (1987). New scoring matrix for amino acid residue ex-

changes based on residue characteristic physical parameters. Int. J. Pept. Protein. Res. 29,

276-281.

[Murzin 1993] Murzin, A. G. (1993). OB(oligonucleotide/oligosaccharide binding)-fold: common

structural and functional solution for non-homologous sequences. EMBO J. 12:3, 861-867.

[Murzin et al. 1995] Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). scop: a

structural classification of proteins database for the investigation of sequences and structures.

J. Mol. Biol. 247, 536-540.

[Needleman & Wunsch 1970] Needleman, S. B. & Wunsch, C. D. (1970). A general method

applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol.

48, 443-453.

[Neuwald et al. 1997] Neuwald, A. F., Liu, J. S., Lipman, D. J. & Lawrence, C. E. (1997). Extracting

protein alignment models from the sequence database. Nucl. Acids Res. 25:9, 1665-1677.

[Nuoffer & Balch 1994] Nuoffer, C. & Balch, W. (1994). GTPase: multifunctional molecular

switches regulating vesicular traffic. Ann. Rev. Biochem. 63, 949-990.

[Orengo 1994] Orengo, C. (1994). Classification of protein folds. Curr. Opin. Struct. Biol. 4, 429-

440.

[Orengo et al. 1997] Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. &

Thornton, J. M. (1997). CATH-a hierarchic classification of protein domain structures.

Structure 5, 1093-1108.

[Park et al. 1997] Park, J., Teichmann, S. A., Hubbard, T. & Chothia, C. (1997). Intermediate

sequences increase the detection of homology between sequences. J. Mol. Biol. 273, 349-354.

[Pearson & Lipman 1988] Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological

sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

[Pearson 1995] Pearson, W. R. (1995). Comparison of methods for searching protein sequence

databases. Protein Sci. 4, 1145-1160.

[Pearson 1996] Pearson, W. R. (1996). Effective protein sequence comparison. Methods Enzymol

266, 227-258.

[Pearson 1997] Pearson, W. R. (1997). Identifying distantly related protein sequences. Comp. App.

Biosci. 13:4, 325-332.

[Pearson 1998] Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity

searches. J. Mol. Biol. 276, 71-84.

[Pennisi 1997] Pennisi, E. (1997). Microbial genomes come tumbling in. Science 277, 1433.

[Press et al. 1988] Press, W. H., Teukolsky, S.A., Vetterling, W. T. & Flannery, B. P. (1988).

"Numerical Recipes in C". Cambridge University Press, Cambridge. pp. 60-72.

[Rand 1971] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. J.

Am. Stat. Assoc. 66, 846-850.

[Risler et al. 1988] Risler, J. L., Delorme, M. O., Delacroix, H. & Henaut, A. (1988). Amino acid

substitutions in structurally related proteins. A pattern recognition approach. Determination

of a new and efficient scoring matrix. J. Mol. Biol. 204, 1019-1029.

[Rissanen 1989] Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific.

[Rose et al. 1990] Rose, K., Gurewitz, E. & Fox, G. (1990). A deterministic annealing approach to

clustering. Patt. Rec. Letters 11:4, 589-594.

[Sander & Schneider 1991] Sander, C. & Schneider, R. (1991). Database of homology-derived pro-

tein structures and the structural meaning of sequence alignment. Proteins 9, 56-68.

[Setubal & Meidanis 1996] Setubal, J. C. & Meidanis, J. (1996). "Introduction to computational

molecular biology". PWS Publishing Co., Boston.

[Shi & Malik 1997] Shi, J. & Malik, J. (1997). Normalized cuts and image segmentation. Proc.

CVPR 731-737.

[Smith & Waterman 1981] Smith, T. F. & Waterman, M. S. (1981). Comparison of Biosequences.

Adv. App. Math. 2, 482-489.

[Smith et al. 1985] Smith, T. F., Waterman, M. S. & Burks, C. (1985). The statistical distribution

of nucleic acid similarities. Nucl. Acids. Res. 13, 645-656.

[Smith et al. 1990] Smith, H. O., Annau, T. M. & Chandrasegaran, S. (1990). Finding sequence

motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA 87, 826-830.

[Smith 1993] Smith, P. S. (1993). Threshold validity for mutual neighborhood clustering. PAMI

15, 89-92.

[Smith & Xue 1997] Smith, D. K. & Xue, H. (1997). Sequence profiles of immunoglobulin and

immunoglobulin-like domains. J. Mol. Biol. 274, 530-545.

[Sonnhammer & Kahn 1994] Sonnhammer, E. L. L. & Kahn, D. (1994). Modular arrangement of

proteins as inferred from analysis of homology. Protein Sci. 3, 482-492.

[Sonnhammer et al. 1997] Sonnhammer, E. L., Eddy, S. R. & Durbin, R. (1997). Pfam: a comprehensive

database of protein domain families based on seed alignments. Proteins 28, 405-420.

[Sonnhammer et al. 1998] Sonnhammer, E. L., Eddy, S.R., Birney, E., Bateman, A., & Durbin,

R. (1998). Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucl.

Acids Res. 26, 320-322.

[Tatusov et al. 1997] Tatusov, R. L., Koonin, E. V. & Lipman, D. J. (1997). A genomic perspective

on protein families. Science 278, 631-637.

[Taylor 1996] Taylor, W. R. (1996). Multiple protein sequence alignment: algorithms and gap

insertion. Methods Enzymol 266, 343-367.

[van Heel et al. 1991] van Heel, M. (1991). A new family of powerful multivariate statistical sequence

analysis techniques. J. Mol. Biol. 220, 877-887.

[Vapnik 1982] Vapnik, V.N. (1982). "Estimation of Dependences Based on Empirical Data".

Springer-Verlag, New York. pp. 162-176.

[Wang 1996] Wang, Z. (1996). How many fold types of protein are there in nature? Proteins 26,

186-191.

[Wang et al. 1990] Wang, D., Buhmann, J. & von der Malsburg, C. (1990). Pattern segmentation

in associative memory. Neural Computation 2, 94-106.

[Watanabe & Otsuka 1995] Watanabe, H. & Otsuka, J. (1995). A comprehensive representation

of extensive similarity linkage between large numbers of proteins. Comp. App. Biosci. 11:2,

159-166.

[Waterman & Vingron 1994] Waterman, M. S. & Vingron, M. (1994). Rapid and accurate estimates

of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. USA 91, 4625-

4628.

[Waterman 1995] Waterman, M. S. (1995). "Introduction to computational biology". Chapman &

Hall, London.

[Wootton 1994] Wootton, J. C. (1994). Sequences with 'unusual' amino acid compositions. Curr.

Opin. Struct. Biol. 4, 413-421.

[Wootton & Federhen 1993] Wootton, J. C. & Federhen, S. (1993). Statistics of local complexity

in amino acid sequences and sequence databases. Comp. Chem. 17, 149-163.

[Wu et al. 1992] Wu, C., Whitson, G., Mclarty, J., Ermongkonchai, A. & Chang, T. (1992). Protein

classification artificial neural system. Protein Sci. 1, 667-677.

[Wu & Leahy] Wu, Z. & Leahy, R. (1993). An optimal graph theoretic approach to data clustering:

theory and its application to image segmentation. PAMI 15, 1101-1113.