Methods for Global Organization of all
Known Protein Sequences
Thesis submitted for the degree
"Doctor of Philosophy"
Golan Yona
Submitted to the Senate of the Hebrew University in the year 1999
This work was carried out under the supervision of
Prof. Nathan Linial, Prof. Naftali Tishby and Dr. Michal Linial.
Acknowledgments
First, I would like to thank my advisors: Nati Linial and Tali Tishby of the computer science
department and Michal Linial of the life science department at the Hebrew University. Within the
last five years we have come a long way together, and I'm grateful for their guidance, support,
warmth and generous help all these years.
I would like to thank my colleagues and friends in the machine learning lab, with whom I spent
most of the last five years: Itay Gat, Lidror Troyansky, Elad Schneidman, Shlomo Dubnov, Shai
Fine, Noam Slonim, Ofer Neiman and Ran El-Yaniv, for all their help and for being great company.
A special thanks to Ran El-Yaniv, a dear friend who encouraged me to keep believing in myself. I
very much enjoyed working with Gill Bejerano, with whom I am still working on fascinating new
directions.
I would next like to thank all of my colleagues and friends in the molecular genetics and
biotechnology lab headed by Hanah Margalit: Ora Schueler, Eyal Nadir, Iddo Friedberg, Yael
Mandel and Yael Altuvia, for the great time I had every Thursday in our journal club group
meetings. Besides the many papers I had in my bag after each such meeting, it was also great fun.
A very special thanks to Yael Altuvia for her continuous and wonderful help and support, and to
Hanah Margalit, who was like an advisor for me, always happy to answer questions and give advice,
as well as give me encouragement in hard times.
I thank Amnon Barak for making the MOSIX parallel system available, and the people in
Compugen Ltd. who gave me access to their Bioccelerator and to their software.
It was a pleasure to work with Alex Kremer, Avi Kavas, Yoav Etsion and Daniel Avrahami,
the great team of students with whom I created the ProtoMap web site.
I thank Michael Levitt, Steven Brenner, Patrice Koehl, Boaz Shaanan, Yoram Gdalyahu and
Ran El-Yaniv for critically reading parts of this manuscript and for making many helpful comments,
and Avner Magen for valuable suggestions. A special thanks to Nati Linial, my advisor, who with
much patience read most of this thesis, commenting, correcting my English, and improving my
writing style.
To my family, especially my mother and father for their tremendous love and encouragement,
and for continually reminding me that they will be happy with whatever I choose to do (always
leaving me the option of becoming a carpenter).
And last, to my best friends, Yoram Gdalyahu and Rami Doron, for their enormous help and
for their invaluable friendship during these intensive years.
Contents
1 Introduction
1.1 What are proteins?
1.2 Functional analysis of protein sequences
1.3 The explosion of biological information
1.4 Towards global organization
2 Comparison of protein sequences
2.1 Alignment of sequences
2.1.1 Global similarity of protein sequences
2.1.2 Penalties for gaps
2.1.3 Global distance alignment
2.1.4 Position dependent scores
2.1.5 Local alignments
2.2 Other algorithms for sequence comparison
2.2.1 BLAST (Basic Local Alignment Search Tool)
2.2.2 FASTA
2.3 Probability and statistics of sequence alignments
2.3.1 Basic random model
2.3.2 Statistics of global alignment
2.3.3 Statistics of local alignment
2.4 Scoring matrices and gap penalties
2.4.1 The PAM family of scoring matrices
2.4.2 The BLOSUM family of scoring matrices
2.4.3 Information content of scoring matrix
2.4.4 Gap penalties
3 Large scale analyses of protein sequences
3.1 Motif and domain based analyses
3.1.1 The PROSITE dictionary [Bairoch 1991]
3.1.2 The BLOCKS database [Henikoff & Henikoff 1991]
3.1.3 The ProDom database [Sonnhammer & Kahn 1994]
3.1.4 The Pfam database [Sonnhammer et al. 1997]
3.1.5 The DOMO database [Gracy & Argos 1998]
3.2 Protein based analysis
3.2.1 The study by [Gonnet et al. 1992]
3.2.2 The study by [Harris et al. 1992]
3.2.3 The study by [Watanabe & Otsuka 1995]
3.2.4 The PIR classification [Barker et al. 1996]
3.2.5 The COGS database [Tatusov et al. 1997]
3.2.6 The SYSTERS database [Krause & Vingron 1998]
3.3 Alternative representations of protein sequences
3.3.1 The study by [van Heel et al. 1991]
3.3.2 The study by [Ferran et al. 1994]
3.3.3 The study by [Han & Baker 1995]
3.3.4 The study by [Casari et al. 1995]
3.4 Structural classifications of proteins
3.5 Summary
4 Part I - The Euclidean embedding approach
4.1 Constructing a metric space on the protein sequence space
4.2 Why embed?
4.3 Embedding strategies
4.3.1 Classical approach
4.3.2 Current approach
4.4 Application of embedding techniques for protein sequences
5 Data clustering
5.1 Introduction
5.1.1 Basic definitions
5.1.2 Clustering methods
5.1.3 Partitional clustering algorithms
5.1.4 The basic clustering procedure
5.1.5 Hierarchical clustering algorithms
5.1.6 Cluster validation
5.2 Our clustering algorithm
5.2.1 Algorithm outline
5.2.2 Cross validation
5.2.3 Algorithm sketch
5.2.4 Conditions for agreement
6 Global organization of protein segments
6.1 Results
6.1.1 Clusters of protein sequences
6.1.2 "Fingerprints" of biological families based on cluster membership
6.1.3 Higher-level measures of similarity between sequences
6.2 Discussion
7 Part II - The graph-based approach
7.1 Preface
7.2 Pairwise clustering algorithms
7.2.1 The single linkage algorithm
7.2.2 The complete linkage algorithm
7.2.3 The average linkage algorithm
7.2.4 Other pairwise clustering algorithms
8 The graph based approach - representation and algorithms
8.1 Introduction
8.1.1 Related works
8.2 Methods
8.2.1 Defining the graph
8.2.2 Placing all methods on a common numerical scale
8.2.3 Neighbors' lists
8.2.4 Exploring the connectivity
9 ProtoMap: Automatic classification of protein sequences, a hierarchy of protein
families, and local maps of the protein space.
9.1 General information on clusters
9.2 Performance evaluation
9.2.1 The evaluation methodology
9.2.2 The reference databases
9.2.3 Evaluation results
9.2.4 Critique of the evaluation methodology
9.2.5 New clusters
9.3 ProtoMap as a tool for analysis
9.3.1 Tracing the formation of clusters
9.3.2 Hierarchical organization within protein families and superfamilies
9.3.3 Possibly related clusters and local maps
9.4 Discussion
9.5 Appendix A - Correlation of ProtoMap releases
10 The ProtoMap website
10.1 How to use ProtoMap
10.1.1 Search
10.1.2 Browsing a cluster
10.1.3 Analyzing new sequences
11 Concluding remarks
11.1 The whole picture
11.2 Future directions
Chapter 1
Introduction
This study focuses on methods for global organization of all known protein sequences. The main
objective of this research is to explore high-order organization in the protein sequence space, and to
identify clusters of related proteins, by utilizing information about pairwise similarities. By these
means we wish to achieve a classification of all protein sequences into groups of shared biological
function (protein families), and gain a better understanding of the geometry of the sequence space.
The category of this work is "computational molecular biology". This is an essentially new
discipline which is rooted in two different disciplines: computer science and molecular biology.
Being on the borderline between the two disciplines, this study is related to fields of intensive
research in both. It incorporates several recent developments in the theory of metric embedding
and unsupervised learning, efficient graph algorithms, algorithms for the comparison of protein
sequences, and the statistical theory of sequence comparison. While a full survey of each field is
beyond the scope of this thesis, I have provided sufficient background to evaluate this work.
This chapter begins with a very brief introduction to protein sequences, then defines the basic
problem on which this study is focused, and introduces the methods that are employed. A more
detailed introduction and discussion of each subject appears in the following chapters.
1.1 What are proteins?
Proteins are the molecules which form much of the functional and structural machinery of each
cell in every organism. They carry out the transport and storage of small molecules, and are
involved in the transmission of biological signals. Proteins include the enzymes which catalyze
and regulate a variety of biochemical processes in the cell, as well as antibodies which defend the
body against infections. Depending on the tissue, each type of cell has several thousand kinds of
proteins which play a primary role in determining the characteristics of the cell and how it functions.
Proteins are complex molecules. They are assembled from 20 different amino acids, with
diverse chemical properties. The amino acids form the vocabulary that allows proteins to exhibit a
great variety of structures and properties. In the cell, the code for proteins is stored in the DNA
sequence, the cell's repository of genetic information, whose building blocks are four different small
molecules, called nucleotides. Each protein sequence is encoded in a DNA sequence called a
gene, in which every block of three nucleotides (codon) corresponds to an individual amino acid
(the set of rules that specify which amino acid is encoded by each codon is called the genetic
code). Special machinery in the cell is responsible for the decoding process, i.e., the translation
and execution of a gene's "instructions" resulting in the formation of a protein sequence.
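The codon-to-amino-acid mapping described above can be illustrated with a short Python sketch. It assumes the standard genetic code and a coding DNA strand that is read from its first base; the names `CODON_TABLE` and `translate` and the example gene are hypothetical, chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in the conventional TCAG order
# ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # stop codon: the protein chain ends here
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCAAGTAA"))  # MAK
```

Because 64 codons encode only 20 amino acids (plus stop signals), several codons map to the same amino acid; this redundancy is why some DNA mutations leave the protein unchanged.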
Each protein is created in the cell first as a well-defined chain of amino acids, whose sequence is
called the primary structure of a protein. The average length of a protein sequence is 350 amino
acids, but it can be as short as a few amino acids, and as long as a few thousand (the longest protein
known is over 5000 amino acids long). Schematically, each amino acid may be represented by one
letter, and proteins can be viewed as long "words" over an alphabet of 20 letters.
According to the central dogma of protein folding, the protein sequence (the primary structure)
dictates how the protein folds in three dimensions. It is the specific three dimensional structure that
enables the protein to function in its particular biological role. This folding prescribes the function
of the protein, and the way it interacts with other molecules. Different terms are used to describe
the structural hierarchy in proteins: Secondary structures are local sequence elements (3-40
amino acids long) that have a well determined regular shape, such as an α (alpha) helix, or a β (beta)
strand. Other local sequence elements, which are neither helices nor strands, are usually called loops
or coils, and they may adopt a large variety of shapes. Secondary structures are packed into what
is called the tertiary structure - the three dimensional (3D) structure of the whole sequence. In
general, helices and strands form the core of the protein structure, whereas loops are more likely
to be found on the surface of the molecule. The shape (the architecture) of the 3D structure is
called the fold. Each protein sequence has a unique 3D structure, but several different proteins
may adopt the same fold, i.e. the shape of their 3D structure is similar. Finally, several protein
sequences may arrange themselves together to form a complex, which is called the quaternary
structure. The terms "motifs" and "domains" are also very common when describing protein
structure/function. A motif is a simple combination of a few consecutive secondary structure
elements with a specific geometric arrangement (e.g., helix-loop-helix). Some, but not all, motifs
are associated with a specific biological function. A domain is the fundamental unit of structure.
It combines several secondary elements and motifs, not necessarily contiguous, which are packed in
a compact globular structure. A domain can fold independently into a stable 3D structure, and it
has a specific function. A protein may be composed of a single domain, several different domains,
or several copies of the same domain.
Not every possible combination of amino acids can form a stable protein sequence that folds
and functions properly. Evolution has "selected" only (some of) those sequences which could fold
into a stable functional structure. The term sequence space is used to describe the set of all
currently known protein sequences, and those expected to exist. The term structure space refers
to the set of all folds adopted by protein sequences. Again, not every shape is possible, and not
every possible shape is expected to exist.
1.2 Functional analysis of protein sequences
One of the main goals of molecular biology is to understand the structure and function of proteins.
Knowledge of the biological function of a protein sequence is essential for the understanding of
fundamental biochemical processes, for drug design and for genetic engineering, to name a few. The
three dimensional structure of a protein gives the most information about its biological function.
However, determining the three dimensional structure of a protein (either using x-ray diffraction
or nuclear magnetic resonance) is difficult, and not always feasible. At the moment, there are only
several thousand known structures. If the rules for folding were fully understood, it would be
possible, in principle, to deduce the three dimensional structure of a protein from its sequence
alone. However, this cannot yet be done since we still lack a full understanding of the molecular
folding process. As a result, structure prediction methods have so far achieved only moderate
success.
In the absence of structural data, sequence analysis remains the main source of information for
most new proteins. Therefore, a considerable effort has been made to predict, based only on their
sequence, the biochemical properties of proteins and their functional role in living cells. Shared
evolutionary ancestry is at the basis of such studies. Evolution has preserved sequences which have
proved to perform correctly, and proteins with similar sequences are found in different organisms.
Similar proteins can also be found within the same organism, due to duplication and shuffling of
coding segments in the DNA during evolution. In most cases, sequence similarity entails similar
or related functions. Therefore, detecting similarities between protein sequences can help to reveal
the biological function of new protein sequences, and to assess their biological relevance.
Sequence similarity is not always easily detectable. During evolution sequences have changed
by insertions, deletions and mutations. These evolutionary events may be traced today by applying
algorithms for sequence comparison. Over the last 30 years many different algorithms and
approaches were tested. Today there are a number of generally accepted methods for comparing
pairs of sequences and assessing their similarity/distance (see chapter 2). These algorithms are
quite successful in exposing sequence similarity, and have become a standard tool for biologists.
A few different "levels" of similarity are defined. When sequences share a significant sequence
similarity they are usually assumed to have a common evolutionary ancestry, and are called
homologous proteins. Proteins that are evolutionarily related usually have the same core structure,
but they may have very different loops. A protein family is a group of homologous proteins,
and proteins which belong to the same family have the same or very similar biological function. A
group of a few protein families which are distantly related through evolution make up a protein
superfamily. Proteins which belong to the same superfamily have close or related biological
functions, but sequence similarity is not always detectable between members of the same superfamily.
A few protein superfamilies may adopt the same fold. Proteins which have the same fold may have
related biological functions. This is not necessarily due to sequence similarity or evolutionary
relationship, and therefore this kind of similarity is not always detectable using the standard methods
for sequence comparison.
1.3 The explosion of biological information
In recent years we have witnessed a massive flow of new biological data. Large-scale sequencing
projects throughout the world turn out new sequences, and create new challenges for investigators.
These ongoing sequencing efforts have already uncovered the sequences of over 100,000 proteins.
Such projects continue to yield the sequences of many new proteins whose function is not known.
Currently, 15 complete genomes (yeast, C. elegans, E. coli, other eubacteria, and several archaea)
are known. Between 35% and 50% of their proteins have not been assigned a function yet [Pennisi 1997,
Doolittle 1998], and await analysis.
In the absence of structural data, analysis necessarily starts with the protein sequence. Sequence
analysis considers the basic properties of individual amino acids, as well as their combination.
Given a new protein sequence, the current approach to predicting its function and analyzing its
properties hinges on pairwise comparisons with the sequences of other proteins whose properties
are already known. A new sequence is analyzed by extrapolating the properties of its "neighbors".
Such methods were applied during the last three decades with considerable success and helped to
identify the biological function of many protein sequences, as well as to reveal many distant and
interesting relationships between protein families. Today, the routine procedure for analysis of a
new protein sequence almost always starts with a comparison of the sequence at hand with the
sequences in one or more of the main sequence databases. However, in many cases sequences have
diverged to the extent that their common origin is untraceable by a direct sequence comparison.
In such cases this method fails to provide clues about the functionality of the protein in question.
Consequently, many protein sequences in the protein databases still have unidentified functions.
1.4 Towards global organization
Over a decade ago, biologists began to realize that the amount of biological data accumulated in
the databases can no longer be analyzed only by means of pairwise sequence comparison. Thus
powerful tools to analyze and organize this data became a necessity. New methods for utilizing
the information in these databases have been applied, such as multiple alignment techniques
(comparisons of three or more sequences), profile analysis (a position-specific scoring matrix which
provides a representation of a domain or a protein family), and pattern recognition methods. These
techniques enhanced the ability to detect relationships between distantly related proteins.
Several large-scale analyses were carried out, most of which focused on finding common motifs,
domains and patterns of biological significance in protein sequences (see chapter 3 for details). The
basic approach in many of these studies is string analysis at the local level, applying
a semi-supervised procedure in which a pre-defined group of related sequences serves as the basis
for building a model (e.g., consensus pattern, profile, HMM). This is usually achieved by applying
multiple sequence alignment techniques. These analyses yielded excellent databases of motifs and
domains which can be used to search for significant patterns in new sequences.
However, the prerequisite of having pre-defined groups as input prevented most of these analyses
from being applied to all protein sequences. Moreover, these studies did not yield a mathematical
representation of protein sequences, and did not provide us with a global view of the sequence
space. Such a view can lead to the discovery of high-level features of the protein space. This is
extremely important in view of the fact that the common methods for protein sequence analysis
fail to assign a clear biological function to more than 40% of the sequences in the databases.
From a global point of view, each sequence is a point in the sequence space, and a high-order
global organization of the protein sequence space is sought. The global point of view may help in
exposing the geometry of the sequence space, and in detecting ordered structures (clusters) within
the space. Moreover, it is expected to add insight about the function of a protein from its relative
position in the space, and from its overall connections with its close neighborhood. Thus, when
currently available means of comparison fail to give satisfactory information about the biological
function of a protein, a global view may add the missing clues.
The task of finding high-order organization entails the solution of several fundamental problems,
some of which are inherent to the analysis of protein sequences. Specifically,
How to define a distance measure among protein sequences? A distance measure is important
for the exploration of the geometry of the sequence space. Though a number of methods for
sequence comparison have been developed, none of them produces a measure which is a
distance measure on one hand and sensitive to weak similarities on the other (see chapters 2,
4).
Moreover, existing similarity/distance measures cannot distinguish the far from the very
far, and for sequences that have diverged significantly, the numerical distance measure is
essentially meaningless. This poses serious problems in defining the geometry of the sequence space.
Apparently no effective distance measure exists for protein sequences, by means of the
sequence proper. However, if one could represent protein sequences as feature vectors in real
space of some dimension, then a natural measure of distance would emerge. This leads to the
second difficulty.
What is the "correct" coding of protein sequences? Several works studied the mathematical
representation of protein sequences, based on sequence properties, such as amino acid
composition or chemical features (see section 3.3). Indeed, some information is encoded in
the composition of amino acids in the sequence. However, this representation cannot capture
the essence of proteins as ordered sequences of amino acids. Statistics of longer substrings
in the sequence may encode short term local order within the sequence. But practically,
distributions of k-tuples are limited to dipeptides or tripeptides at the most, since already
for k=2 the number of possible dipeptides is 400, whereas the length of an average protein
is around 350 amino acids. Addition of chemical features may improve the representation
but still misses the order of amino acids within the sequence. Consequently, the information
encoded in the order of amino acids within a protein sequence cannot be fully recovered by
these representations.
The organization involves problems of unsupervised learning (how to obtain the "correct"
clustering). In general, clustering (pattern recognition) is largely an unsolved problem.
Even the concept of a "correct" clustering is not well defined.
There are many heuristics for the clustering problem (see chapter 5). The choice of the
clustering algorithm may have a great effect on the outcome, but none of these algorithms
is guaranteed to find the "best" or the "correct" clustering. Moreover, structures within the
space may be subjected to noise and bias in the training data, which makes their detection
even more difficult.
Finally, it is not clear how to compare clusterings which are obtained by different approaches,
in the absence of an apparent general quality measure. Known quality measures are usually
application specific, but no obvious measure exists for protein sequences. Therefore, given
a model, it is hard to evaluate its validity and its ability to generalize to new samples (see
chapter 5).
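The limitation of k-tuple composition noted above can be made concrete with a small Python sketch. It is only an illustration: the function name `ktuple_composition` and the toy sequences are assumptions, not part of this work. For k=2 the feature vector has 400 coordinates, more than the number of dipeptides in an average protein, and for k=1 two sequences with the same letters in a different order receive identical vectors.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def ktuple_composition(sequence, k=2):
    """Relative frequency of each of the 20**k possible k-tuples in `sequence`."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = max(len(sequence) - k + 1, 1)  # number of overlapping k-tuples
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] / windows for kmer in kmers]

toy = "MKVLAAGIVGLLLAQ"  # made-up sequence, for illustration only
vec = ktuple_composition(toy, k=2)
print(len(vec))  # 400 coordinates, however short the sequence is

# Amino acid composition (k=1) ignores order: these two vectors are equal.
print(ktuple_composition("AACC", k=1) == ktuple_composition("CACA", k=1))  # True
```

A sequence of length n contains only n-k+1 overlapping k-tuples, so most coordinates of such a vector are zero for typical proteins, and the vector becomes sparser as k grows.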
Starting from the novel concept of global organization, the work presented here focuses on
methods for global organization of all protein sequences. Specifically, two different approaches were
tested: (i) a Euclidean embedding approach, and (ii) a graph-based approach.
The Euclidean embedding approach studies segments of protein sequences and consists of three
main steps: (i) constructing a metric space on the protein sequence space, (ii) low-distortion
embedding of this metric space in Euclidean space, and (iii) hierarchical cross-validated clustering.
The graph-based approach studies complete protein sequences. An exhaustive comparison of all
protein sequences in the sequence database yields a list of significant similarities between protein
sequences. Based on this list, the sequence space is represented as a directed weighted graph, and
graph algorithms are applied to identify clusters of related proteins in this space.
In the next chapters I will present these two approaches in detail. Chapter 2 is a preliminary
chapter which describes algorithms for sequence comparison, and the statistics of sequence
alignment. Chapter 3 is a survey of large-scale analyses of protein sequences. Chapter 4 describes the
first two steps of the Euclidean embedding approach, and chapter 5 describes the third step
(cross-validated clustering). The results of this approach when applied to protein sequences are given in
chapter 6. Chapters 7 and 8 introduce the graph-based approach. The results of this approach
as well as a performance evaluation and a few biological examples are given in chapter 9. A tour
through a web site which was constructed to present the results of this study follows in chapter 10.
Chapter 11 closes this thesis with concluding remarks and suggestions regarding the direction of
future research.
Chapter 2
Comparison of protein sequences
During the last three decades a considerable effort has been made to develop algorithms that
compare sequences of macromolecules (proteins, DNA). The purpose of such algorithms is to detect
evolutionary, structural and functional relations among the sequences. Successful sequence
comparison would allow one to infer the biological properties of new sequences from data accumulated on
related genes. For example, a similarity between a translated nucleotide sequence and a known
protein sequence suggests a homologous coding region in the corresponding nucleotide sequence.
Significant sequence similarity among proteins may imply that the proteins share the same
secondary and tertiary structure, and have close biological functions. The prediction of unknown
protein structures is often based on the study of known structures of homologous proteins.
This chapter is a survey of sequence comparison, scoring schemes, and the statistics of sequence
alignments which are essential for the purpose of distinguishing true relations among proteins from
chance similarities. The chapter is based on books by M. Waterman [1995] and by Setubal and
Meidanis [1996], as well as on various papers in this field.
2.1 Alignment of sequences
As is well known, the DNA molecule serves as a blueprint for the genetic information of every
organism. Therefore, the evolution of organisms must be related to changes in the DNA. The
simplest events of molecular evolution are the substitution of one base by another (point mutations)
and the insertion or deletion of a base pair. Suppose that the sequence b is obtained from the
sequence a by substitutions, insertions and deletions. It is customary and useful to represent the
transformation by an alignment where a is written above b with the common (conserved) bases
aligned appropriately. For example, say that a = ACTTGA and b is obtained by substituting the
second letter from C to G, inserting an A between the second and the third letters, and by deleting
the fifth base (G). The corresponding alignment will be:
a = A C - T T G A
b = A G A T T - A
We usually do not actually know which sequence evolved from the other. Therefore the events are
not directional and insertion of A in b might have been a deletion of A in a.
In a typical application we are given two related sequences and we wish to recover the
evolutionary events that transformed one to the other. The goal of sequence alignment is to find the
correct alignment that encodes the true series of evolutionary events which have occurred. The
alignment can be assigned a score which accounts for the number of identities (a match of two
identical letters), the number of substitutions (a match of two different letters), and the number
of gaps (insertions/deletions). With high scores for identities, and low scores for substitutions and
gaps, the basic strategy towards tracing the correct alignment seeks the alignment which scores
best (see next section for details).
The algorithms described below may be applied to the comparison of protein sequences as well
as to DNA sequences (coding or non-coding regions). Though the evolutionary events occur at the
DNA level, the main genetic pressure is on the protein sequence. Consequently, the comparison of
protein sequences has proven to be a much more effective tool [Pearson 1996]. Mutations at the
DNA level do not necessarily change the encoded amino acid due to the redundancy of the genetic
code. Mutations often result in conservative substitutions at the protein level, namely, replacement
of an amino acid by another amino acid with similar biochemical properties. Such changes tend to
have only a minor effect on the protein's functionality. Within the scope of this work, and in view
of the last paragraph, this chapter focuses on the comparison of protein sequences.
2.1.1 Global similarity of protein sequences
Let $a = a_1 a_2 \ldots a_n$ and $b = b_1 b_2 \ldots b_m$, where $a_i, b_j \in \mathcal{A}$, the alphabet of amino acids, be two given
protein sequences. A global alignment of these sequences is an alignment where all letters of a
and b are accounted for.
Let $s(a_i, b_j)$ be the similarity of $a_i, b_j$ and let $\alpha > 0$ be the penalty for deleting/inserting one
amino acid. The score of an alignment with $N_{ij}$ matches of $a_i$ and $b_j$ and $N_{gap}$ insertions/deletions
is defined as
$$\sum_{i,j} N_{ij} \cdot s(a_i, b_j) - N_{gap} \cdot \alpha$$
The global similarity of sequences a and b is defined as the largest score of any alignment of
sequences a and b, i.e.
$$S(a, b) = \max_{alignments} \left\{ \sum_{i,j} N_{ij} \cdot s(a_i, b_j) - N_{gap} \cdot \alpha \right\}$$
Usually, pairwise scores are the logarithm of likelihood ratios (log-odds), as explained in section
2.4. Therefore, the definition of the alignment score as the sum of the pairwise scores may be
interpreted as a log-likelihood. The corresponding similarity score is thus essentially a maximum likelihood
measure.
How can the best alignment be found? The exponentially large number of possible alignments makes
it impossible to perform a direct search. For example, the number of possible alignments of two
sequences of length 1000 exceeds $10^{600}$ [Waterman 1995]. However, a dynamic programming
algorithm makes it possible to find the optimal alignment without checking all possible alignments, but
only a very small portion of the search space.
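To give a feeling for the size of this search space, the following is a minimal sketch that counts alignments exactly. It assumes one particular counting convention (every alignment is a sequence of columns, each of which is a match/substitution, a gap in $a$, or a gap in $b$, and all column orderings are counted as distinct); the count cited from [Waterman 1995] may rest on a different convention. The function name `count_alignments` is hypothetical.

```python
def count_alignments(n: int, m: int) -> int:
    """Count global alignments of two sequences of lengths n and m.

    Assumption (not taken from the text): an alignment is a sequence of
    columns, each one a match/substitution, a gap in a, or a gap in b,
    and different orderings of adjacent gaps count as different alignments.
    """
    # Aligning an empty prefix with any prefix can be done in exactly one way.
    row = [1] * (m + 1)
    for _ in range(n):
        new_row = [1] * (m + 1)
        for j in range(1, m + 1):
            # last column is a pair, a gap in b, or a gap in a
            new_row[j] = row[j - 1] + row[j] + new_row[j - 1]
        row = new_row
    return row[m]


if __name__ == "__main__":
    # Python integers are unbounded, so the comparison below is exact.
    print(count_alignments(1000, 1000) > 10 ** 600)
```

Under this convention the count for two sequences of length 1000 is far above $10^{600}$, which is consistent with the bound quoted above.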
Calculating the global similarity score S (a; b)
Denote by $S_{i,j}$ the score of the best alignment of the substring $a_1 a_2 \ldots a_i$ with the substring $b_1 b_2 \ldots b_j$,
i.e.
\[ S_{i,j} = S(a_1 a_2 \ldots a_i, \; b_1 b_2 \ldots b_j) \]
Chapter 2. Comparison of protein sequences 15
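The optimal alignment of $a_1 a_2 \ldots a_i$ with the substring $b_1 b_2 \ldots b_j$ can end only in one of the following
three ways:
\[ \begin{array}{c} a_i \\ b_j \end{array} \qquad \begin{array}{c} a_i \\ - \end{array} \qquad \begin{array}{c} - \\ b_j \end{array} \]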
Also, each sub-alignment must be optimal as well. Therefore, the score $S(a, b)$ can be calculated
recursively:
Initialize
\[ S_{0,0} = 0 \qquad S_{i,0} = -i \cdot \alpha \;\; \text{for } i = 1..n \qquad S_{0,j} = -j \cdot \alpha \;\; \text{for } j = 1..m \]
Recursively define
\[ S_{i,j} = \max\{ S_{i-1,j-1} + s(a_i, b_j), \; S_{i,j-1} - \alpha, \; S_{i-1,j} - \alpha \} \]
In practice, the scores are stored in a two-dimensional array of size $(n + 1) \times (m + 1)$. The
initialization sets the values at row zero and column zero. The computation proceeds row by row
so that the value of each matrix cell is calculated from entries which were already calculated.
Specifically, we need the three matrix cells on the west, the south and the southwest of the current
cell ($S_{i,j-1}$, $S_{i-1,j}$ and $S_{i-1,j-1}$, respectively). The time complexity of this algorithm is $\Theta(n \cdot m)$.
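The recursion above can be sketched in a few lines of Python. This is an illustration only: the function names and the toy scoring function (+1 for identity, -1 for substitution) are assumptions introduced here, whereas in practice $s(a_i, b_j)$ would come from a log-odds scoring matrix (section 2.4).

```python
def toy_score(x: str, y: str) -> int:
    """Hypothetical pairwise score: +1 for identity, -1 for substitution."""
    return 1 if x == y else -1


def global_similarity(a: str, b: str, s, alpha):
    """Global similarity S(a, b) with a fixed penalty alpha per gapped residue."""
    n, m = len(a), len(b)
    # S[i][j] holds the best score for aligning a[:i] with b[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * alpha          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        S[0][j] = -j * alpha          # b[:j] aligned against gaps only
    for i in range(1, n + 1):         # row by row
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),  # a_i paired with b_j
                S[i][j - 1] - alpha,                      # b_j opposite a gap
                S[i - 1][j] - alpha,                      # a_i opposite a gap
            )
    return S[n][m]


if __name__ == "__main__":
    print(global_similarity("ACD", "AD", toy_score, 2))
```

Only the previous row is needed to fill the current one, so the memory can be reduced to two rows when the score alone (and not the alignment itself) is required.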
Dynamic programming algorithms were already introduced in the late 1950s [Bellman 1957].
However, the first to propose a dynamic programming algorithm for global comparison of
macromolecules were Needleman and Wunsch [Needleman & Wunsch 1970].
2.1.2 Penalties for gaps
In sequence evolution, an insertion or deletion of a segment (several adjacent amino acids) usually
occurs as a single event. That is, the opening of the gap is the significant event. Therefore, most
computational models assign a penalty for a gap of length $k$ that is smaller than the sum of $k$
independent gaps of length 1. If the penalty for a gap of length $k$ is $\alpha(k)$, we are thus interested in
sub-additive functions and assume that $\alpha(x + y) \leq \alpha(x) + \alpha(y)$.
Denote by $N_{k\text{-}gap}$ the number of gaps of length $k$ in a given alignment. Then the score of this
alignment is defined in this case as
\[ \sum_{i,j} N_{ij} \cdot s(a_i, b_j) - \sum_{k} N_{k\text{-}gap} \cdot \alpha(k) \]
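This discussion allows the alignment to end in a gap of arbitrary length, and therefore $S_{i,j}$ is defined
as follows:
Set
\[ S_{0,0} = 0 \qquad S_{i,0} = -\alpha(i) \qquad S_{0,j} = -\alpha(j) \]
then
\[ S_{i,j} = \max\Big\{ S_{i-1,j-1} + s(a_i, b_j), \; \max_{1 \leq k \leq j} \{ S_{i,j-k} - \alpha(k) \}, \; \max_{1 \leq l \leq i} \{ S_{i-l,j} - \alpha(l) \} \Big\} \]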
In practice, to compute each matrix cell, we now need to check the cell in the diagonal, all the
cells in the same row and all the cells in the same column. For an arbitrary function $\alpha(k)$, the time
complexity is $\sum_{i,j} (i + j + 1) = \Theta(n^2 m + n m^2 + n m)$ plus $n + m$ calculations of the function
$\alpha(k)$. For $n = m$ we get $\Theta(n^3)$.
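A direct transcription of this recursion, for an arbitrary gap function supplied by the caller, might look as follows. It is a sketch only: the function names, the toy scoring function and the example gap function are assumptions made for illustration. The two inner loops over $k$ and $l$ are what make the running time cubic for $n = m$.

```python
def toy_score(x: str, y: str) -> float:
    """Hypothetical pairwise score: +1 for identity, -1 for substitution."""
    return 1.0 if x == y else -1.0


def global_similarity_general_gap(a: str, b: str, s, gap) -> float:
    """Global similarity where gap(k) > 0 is the penalty for a gap of length k."""
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap(i)
    for j in range(1, m + 1):
        S[0][j] = -gap(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1] + s(a[i - 1], b[j - 1])  # diagonal cell
            for k in range(1, j + 1):                       # all cells in the same row
                best = max(best, S[i][j - k] - gap(k))
            for l in range(1, i + 1):                       # all cells in the same column
                best = max(best, S[i - l][j] - gap(l))
            S[i][j] = best
    return S[n][m]


def example_gap(k: int) -> float:
    """Hypothetical sub-additive penalty: 2 to open, 0.5 per additional residue."""
    return 2.0 + 0.5 * (k - 1)


if __name__ == "__main__":
    # One gap of length 3 (penalty 3.0) between two identities.
    print(global_similarity_general_gap("ACDEF", "AF", toy_score, example_gap))
```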
Optimization
A better time complexity can be obtained for a linear gap function $\alpha(k) = \alpha_0 + \alpha_1 \cdot (k - 1)$, where $\alpha_0$ is the penalty
for opening a gap, and $\alpha_1$ is the penalty for extending the gap.
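Set $E_{i,j} = \max_{1 \leq k \leq j} \{ S_{i,j-k} - \alpha(k) \}$. This is the maximum value over all the matrix cells in the same row, where
a gap is tested with respect to each one of them. Note that
\begin{align*}
E_{i,j} &= \max\Big\{ S_{i,j-1} - \alpha(1), \; \max_{2 \leq k \leq j} \{ S_{i,j-k} - \alpha(k) \} \Big\} \\
        &= \max\Big\{ S_{i,j-1} - \alpha_0, \; \max_{1 \leq k \leq j-1} \{ S_{i,j-(k+1)} - \alpha(k+1) \} \Big\} \\
        &= \max\Big\{ S_{i,j-1} - \alpha_0, \; \max_{1 \leq k \leq j-1} \{ S_{i,(j-1)-k} - \alpha(k) \} - \alpha_1 \Big\} \\
        &= \max\{ S_{i,j-1} - \alpha_0, \; E_{i,j-1} - \alpha_1 \}
\end{align*}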
Therefore, only two matrix cells from the same row need to be checked. These cells correspond to the two options
we need to check, i.e. opening a new gap at this position (first term) or extending a previously opened gap (second
term). The sub-linearity of $\alpha$ implies that it is never beneficial to concatenate gaps.
Similarly, for a column define $F_{i,j} = \max_{1 \leq l \leq i} \{ S_{i-l,j} - \alpha(l) \}$ and in the same way obtain
\[ F_{i,j} = \max\{ S_{i-1,j} - \alpha_0, \; F_{i-1,j} - \alpha_1 \} \]
Initialize
\[ E_{0,0} = F_{0,0} = S_{0,0} = 0 \]
\[ E_{i,0} = S_{i,0} = -\alpha(i) \qquad F_{0,j} = S_{0,j} = -\alpha(j) \]
then
\[ S_{i,j} = \max\{ S_{i-1,j-1} + s(a_i, b_j), \; E_{i,j}, \; F_{i,j} \} \]
and the time complexity is $\Theta(m \cdot n)$.
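The optimized recursion can be sketched as follows. Function and parameter names, as well as the toy scoring function, are assumptions introduced for illustration. One deliberate deviation from the initialization above is labeled in the code: the boundary entries $E_{i,0}$ and $F_{0,j}$ are set to $-\infty$, so that $E$ and $F$ coincide exactly with their definitions as maxima over a row or a column and the result matches the general recursion of section 2.1.2.

```python
import math


def toy_score(x: str, y: str) -> float:
    """Hypothetical pairwise score: +1 for identity, -1 for substitution."""
    return 1.0 if x == y else -1.0


def global_similarity_affine(a: str, b: str, s, alpha0: float, alpha1: float) -> float:
    """Global similarity with gap penalty alpha(k) = alpha0 + alpha1 * (k - 1)."""
    n, m = len(a), len(b)

    def gap(k: int) -> float:
        return alpha0 + alpha1 * (k - 1)

    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumption of this sketch: E[i][0] and F[0][j] start at -infinity
    # (no gap can be "extended" from the boundary), unlike the text above.
    E = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    F = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -gap(i)
    for j in range(1, m + 1):
        S[0][j] = -gap(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # open a new gap (first term) or extend an open one (second term)
            E[i][j] = max(S[i][j - 1] - alpha0, E[i][j - 1] - alpha1)
            F[i][j] = max(S[i - 1][j] - alpha0, F[i - 1][j] - alpha1)
            S[i][j] = max(S[i - 1][j - 1] + s(a[i - 1], b[j - 1]), E[i][j], F[i][j])
    return S[n][m]


if __name__ == "__main__":
    print(global_similarity_affine("ACDEF", "AF", toy_score, 2.0, 0.5))
```

Each cell now costs a constant number of operations, at the price of two additional arrays of the same size as $S$.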
For arbitrary convex gap functions it is possible to obtain an $O(n \cdot m)$ algorithm [Waterman 1995]. However,
such functions seem to be inappropriate for the context of comparison of macromolecules. For an arbitrary concave
function (and in particular, for sub-additive functions) it is possible to obtain a time complexity of $O(n^2 \log(n))$ with
a more complex algorithm [Waterman 1995].
Gonnet et al. [Gonnet et al. 1992] have proposed a model for gaps that is based on gaps occurring in pairwise
alignments of related proteins. The model suggests an exponentially decreasing gap penalty function (see section
2.4.4). However, a linear penalty function has the advantage of better time complexity, and in most cases the results
are satisfactory. Therefore the use of linear gap functions is very common.
2.1.3 Global distance alignment
In some cases it is interesting to define a distance among sequences $D(a, b)$, for example for the
construction of evolutionary trees, or when investigating the geometry of the sequence space. The
advantage of defining a distance among sequences is the construction of a metric space on the
space of sequences.
In general, a metric is a function $D : X \times X \to \mathbb{R}$ which is:
Nonnegative: $D(a, b) \geq 0$ for all $a, b \in X$, with equality if and only if $a = b$
Symmetric: $D(a, b) = D(b, a)$
Satisfies the triangle inequality: $D(a, b) \leq D(a, c) + D(c, b)$ $\forall a, b, c \in X$
In contrast to similarity, where we are interested in the best alignment and its score gives a
measure of how much the strings are alike, the distance approach assigns a cost to elementary edit
operations (evolutionary events) and seeks a series of operations that transforms one string into
another with minimal cost.
The definition of distance resembles the definition of similarity, except that $s(a_i, b_j)$ is replaced
with $d(a_i, b_j)$, which reflects the distance between amino acids $a_i$ and $b_j$, and $\alpha(k)$, the gap penalty,
now adds to the total distance (instead of decreasing the similarity). The minimum distance is
obtained by minimizing the sum of match/mismatch costs and the penalties for gaps.
\[ D(a, b) = \min_{alignments} \Big\{ \sum_{i,j} N_{ij} \cdot d(a_i, b_j) + \sum_{k} N_{k\text{-}gap} \cdot \alpha(k) \Big\} \]
One restriction is that $d(\cdot, \cdot)$ is a metric on $\mathcal{A}$, the alphabet of amino acids (and so satisfies the
three requirements above). Also, $\alpha(k)$ has to be positive. These restrictions imply that the total
global distance $D(\cdot, \cdot)$ is zero only if the two sequences are identical.
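As an illustration of the distance formulation, the sketch below minimizes the total cost with the same dynamic programming scheme used for similarity, replacing max by min and subtracting penalties by adding costs. All names are hypothetical. The example uses the discrete metric on the alphabet (0 for identical letters, 1 otherwise) and the gap cost $\alpha(k) = k$, in which case the global distance reduces to the familiar edit (Levenshtein) distance; these choices are assumptions made for the example, not the costs used in this work.

```python
def discrete_metric(x: str, y: str) -> int:
    """A simple metric on the alphabet: 0 for identical letters, 1 otherwise."""
    return 0 if x == y else 1


def unit_gap_cost(k: int) -> int:
    """Hypothetical gap cost: each gapped residue costs 1."""
    return k


def global_distance(a: str, b: str, d, gap_cost):
    """Minimum total cost over all global alignments of a and b."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = gap_cost(i)
    for j in range(1, m + 1):
        D[0][j] = gap_cost(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + d(a[i - 1], b[j - 1])  # a_i paired with b_j
            for k in range(1, j + 1):                       # gap of length k ends the row
                best = min(best, D[i][j - k] + gap_cost(k))
            for l in range(1, i + 1):                       # gap of length l ends the column
                best = min(best, D[i - l][j] + gap_cost(l))
            D[i][j] = best
    return D[n][m]


if __name__ == "__main__":
    print(global_distance("KITTEN", "SITTING", discrete_metric, unit_gap_cost))
```

With these costs the distance is zero exactly when the two sequences are identical, and it is symmetric in its two arguments, in line with the requirements listed above.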
In some cases, distance and similarity measures are related by a simple formula: Let $s(a_i, b_j)$
be a similarity measure over $\mathcal{A}$ and $\alpha(k)$ the penalty for a gap of length $k$. Let $d(a_i, b_j)$ be a metric on
$\mathcal{A}$ and $\hat{\alpha}(k)$ a corresponding cost for gaps of length $k$. If there is a constant $c$ such that
\[ d(a_i, b_j) = c - s(a_i, b_j) \qquad \forall a_i, b_j \in \mathcal{A} \]
and
\[ \hat{\alpha}(k) = \alpha(k) + \frac{k \cdot c}{2} \]
then each alignment $A_l$ satisfies:
\[ D(A_l) = \frac{c (n + m)}{2} - S(A_l) \]
where $n$ ($m$) is the length of the sequence $a$ ($b$). In particular,
\[ D(a, b) = \min_{alignments\ A_l} D(A_l) = \frac{c (n + m)}{2} - \max_{alignments\ A_l} S(A_l) \]
c(n + m)