JOURNAL OF Volume 5, Number 2,1998 Mary Ann Liebert, Inc. Pp. 173-196

Sequence Alignment in Molecular Biology

ALBERTO APOSTÓLICO1 and RAFFAELE GIANCARLO2

ABSTRACT

Molecular biology is becoming a computationally intense realm of contemporary science and faces some of the current grand scientific challenges. In its context, tools that identify, store, compare and analyze effectively large and growing numbers of bio-sequences are found of increasingly crucial importance. Biosequences are routinely compared or aligned, in a variety of ways, to infer common ancestry, to detect functional equivalence, or simply while search- ing for similar entries in a database. A considerable body of knowledge has accumulated on during the past few decades. Without pretending to be exhaustive, this paper attempts a survey of some criteria of wide use in sequence alignment and comparison problems, and of the corresponding solutions. The paper is based on presentations and lit- erature at the on held at in November

given Workshop Sequence Alignment Princeton, N.J., 1994, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology.

Key words: computational molecular biology, design and analysis of algorithms, sequence align- ment, sequence data banks, edit distances, , philogeny, evolutionary tree.

1. INTRODUCTION

taxonomy is based ON THE assumption that conspicuous morphological and functional Classicalsimilarities in species denote close common ancestry. Likewise, modern molecular taxonomy pursues phylogeny and classification of living species based on the conformation and structure of their respective ge- netic codes. It is assumed that DNA code presides over the reproduction, development and susteinance of living organisms, part of it (RNA) being employed directly in various biological functions, part serving as a template or blueprint for . For proteins, following somewhat the dual principle of modern architecture, function follows form. The apparent functional, structural and sequence resemblance of proteins found in organisms of completely different morphology has further stimulated interest in a taxonomy based on sequence homology. Two biomolecules are said to be homologous if their sequences are likely to be offsprings of a common an- cestor sequence. However, ancestor sequences are not available, and homology is deduced from the similarity of existing sequences: the more two such sequences are similar, the more likely their homology is assumed tobe. Early studies on methods and tools for the automated analysis and comparisons of biosequences were stimulated by the identification of the aminoacid sequences of a number of proteins in the 1960s. Activity in this area has been growing ever since and culminates with the considerable outburst of computational biology

Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, and Dipartimento di Elettronica e Informática, Università di Padova, Via Gradenigo 6/A, 35131 Padova, Italy. 2Dipartimento di Matemática, Università di Palermo, Via Archirafi 34, 90123 Palermo, Italy.

173 174 APOSTÓLICO AND GIANCARLO

(Waterman, 1995) in recent years, as more and more sequences accumulate from the genome sequencing of humans and other species. Despite some key conceptual acquisitions and substantial practical progress, we still do not have consolidated methods for every sequence alignment task. This should not come as a surprise, since the complexity of biosequence alignment is rooted in some of the most subtle and most delicate aspects of scientific inference. In an attempt to capture genetically significant relationships among sequences, several criteria have been proposed as measures of sequence similarity. While they all need to relate in one way or another to the basic mechanisms of evolution, there is no simple way to test or validate any one of them. At the empirical level, such a validation is made impossible by the uniqueness of the evolutionary experiment. At the epistemological level, the notion of a class of similar objects does not rest on very firm grounds. For instance, a claim known as the "Theorem of the Ugly Duckling" (Watanabe, 1969) states that as long as all of the predicates characterizing the objects to be classified are given the same importance or "weight," then a swan will be found to be just as similar to a duck as to another swan. The reason for this apparent paradox is in the fact that classification as we experience it on an empirical basis is only possible to the extent that the various predicates characterizing objects are given nonuniform weights. Applied to the study of biomolecules, this means that we shall be able to build better and better alignment algorithms as we learn more and more about precisely those homologies that our algorithms seek to detect. DNA molecules appear naturally to be repositories of "information" that is propagated through evolu- tionary history and used in replication and transcription mechanisms which are central to the life-cycle of cells and organisms. Quantitative definitions and measures of such information seem to be at the heart of sequence comparison methods and have been attempted in various ways over the years. On one hand, this information seems to be relatable to Shannon's notion of uncertainty and may be measured as such [see, e.g., Gatlin (1966)]. On the other, this information is used at the cellular level for the synthesis of the spatial conformation that presides over molecular structure and, ultimately, function, hence for departure from chaos. While Shannon's classical measure applies to the information in transmission, structural information possibly stored in biosequences seems related instead to the notion of redundancy. Measuring structure in finite objects, however, presupposes the of the "intuitive of randomness as far as it is accomplishment program" "analyzing possible within the region of finite sequences" (Popper, 1959), a goal cultivated already by Von Mises (1939) and other statisticians at the turn of the century. One of the deepest acquisitions in this domain is linked to Kolmogorov's (1965) innovative approach to the definition of information [see also Martin-Löf (1966)]. In this approach—which seems reminiscent in a seductive way of the very mechanism of molecular evolution— information (alternatively, conditional information) is measured by the length of the recorded sequence of zeroes and ones that constitute a program by which a universal machine produces one string from scratch (alt., from another string). The sobering conclusion, however, is that under this scheme the proportion of random strings in the set of all strings of a given length increases with the length: while a great many sequences of sufficiently large length tend to behave randomly in the limit, many short sequence exhibits some kind of regularity. It appears thus that we can measure and study randomness (whence also structure) in finite objects [see, in particular, Lesk et al. (1986)] only to the extent that we can legitimately privilege (i.e., assign a high weight to) certain regularities and neglect others, a principle that seems to be pendant, in syntactic pattern recognition, to the statistical paradox of the ugly duckling. In conclusion, the difficulties inherent to sequence alignment lie in the process of forming the necessary educated biases, in the dynamical and interactive process of extracting the feature weights. Along the way, the practice of sequence alignment shall constantly oscilate between the risk of overlooking important structure and that of discovering any arbitrarily defined kind of structure everywhere. There is a number of biological motivations and contexts for computer alignment of molecular sequences, which result in a variety of methods and tools. There is global alignment of pairs of sequences that are globally related by common ancestry, local alignments of related sequence segments, multiple alignments of members of families. Some special techniques of self-alignment have also emerged recently as useful measure of the "autocorrelation" inherent to molecular sequences. Finally, and not entirely disjointly from the above, ancillary alignments are performed in data base searches, typically, in order to detect homology of proteins. The corresponding available computational methods have developed considerably since the pioneering work by M. O. Dayhoff (Dayhoff and Eck, 1968) and others. Since 1982, Nucleic Acids Research has routinely devoted special issues of 600 pages or more to the latest developments in this area. By 1983, enough original material had accumulated on sequence comparison techniques to warrant dedication of a full volume (Sankoff and Kruskal, 1983). Part of these developments occurred in parallel with algorithmics on strings, thus contributing some remarkable challenges to algorithmic design. In general, however, the two endeavors conserved rather SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 175 distinctive individual flavors. On one hand, the computational biologist would be typically driven by the need to obtain some significant speed-up in the processing of a limited class of sequences on some specific machine or class of (commercial) machines. On the other, the designer of algorithms would be concerned with unveiling the combinatorial structure of a rigidly defined problem, in order to achieve a higher asymptotic efficiency in computations performed by some abstract paradigm of physical computers. Historically, efficient algorithmics on strings and efficient computational methods for bio-sequence manipulations have found separate places in the literature, and in fact mostly addressed separate audiences. In general, molecular biologists are dismayed to find that formalization and computation involves almost invariably a number of abstractions and simplifications that denature the original problem, at least in part. On the other hand, mathematicians and theoretical computer scientists understandably incline towards an attitude that a problem that does not lend itself to a crisp formulation, or that manifests itself as either trivial or intractable, is not a problem. While some of the mutual isolation of the past is successfully being removed, the challenges that lay ahead in sequence alignment are still numerous and staggering. Some of the focal contemporary issues are addressed in this paper, the basic purpose of which is a somewhat deeper understanding of what the biologists expect from sequence alignment, how much of such an expectation has been fulfilled and how one could go about in defining future tasks. Along the way, the paper implicitly reviews some past history, and tries to pin-point the main biological findings and formal results that led designers of sequence alignment in their pursuit. Next, some of the bio-mathematical theory of sequence alignment is addressed, as various notions and models of similarity between sequences are recalled and compared, and the mechanisms that guide the choices of parameters (e.g., weighting functions) are discussed. This leads in turn to the study of the statistical tools that are used to establish the significance of alignments and how well they model the underlying biological knowledge. As already mentioned, when models are translated into computational tools, some of their subtleties are inevitably lost. We shall thus try to understand what the trade-offs are between the sophistication of the theoretical models and the difficulties associated with their implementation. This is of particular interest in the case of multiple sequence alignment, perhaps the single most prominent instance of a computationally intensive alignment task where approximations, computational tradeoffs and significance are all subtly intertwined.

2. SOME BACKGROUND AND A SYNOPSIS

An early reference to a problem involving sequence alignment dates back to 1879. It consists of a puzzle due to Lewis Carrol (1879) that works as follows: two English words of equal length are given, and one is asked to transform one into the other by going through a series of intermediate (English) words where each word differs from the next in only one letter. Insertion and deletion of letters are not allowed. Thus, for example, "head" can be transformed into "tail" using the intermediate words: "heal," "teal," "tell" and "tall." Intuitively, one can say that the two given words are at a "distance" of at most 5, since we found a way to go from one to the other through that number of substitutions. Decades later, in the development of coding theory, this simple notion of distance would be more formally introduced by Hamming (1950) and used extensively for the design of good codes for data transmission (Gallager, 1968). A richer notion of distance between sequences was defined by Levenshtein (1966). It can be described as follows. Given two strings, transform one into the other by using a sequence of the following operations: insertion of a symbol, deletion, substitution of a symbol with another one. Each such operation carries a cost or "weight," and one would like to perform the transformation at the cheapest possible overall cost. The most commonly used weights for DNA are the ones by Fitch and Margoliash ( 1967). For amino acids, many weight systems have been proposed [see for instance Feng etal. (1985), McLachlan (1971), Rao (1987), Rister et al. (1988), Smith and Smith (1990)]. However, for a long time the ones by Dayhoff et al. (1978) have been the standard and, very recently, the ones by Henikoff and Henikoff (1992) seem to be superior for the alignment of distantly related proteins (Henikoff and Henikoff, 1993). A very important aspect of sequence alignment is to establish how "meaningful" a given alignment is. This basic question applies both to global and to local alignments. Informally, it can be stated as follows: is a given alignment of two sequences due to chance? Answers to this question involve the investigation of many statistical aspects of sequence alignment, such as the expected length of an alignment or the expected distribution of scores of alignments. The mathematical tractability of those statistical questions seems to depend on two main factors: ( 1 ) the particular type of alignment being considered, and (2) the weights that we assign to the edit operations. Early results in this area are reported in Chvatal and Sankoff (1975), Karlin et al. 176 APOSTÓLICO AND GIANCARLO

(1983), Smith et al (1985). A closely related issue is a quantification of the "sensitivity" and "selectivity" of sequence alignment methods. In loose terms, sensitivity is the ability to detect distant evolutionary relationships between two strings, while selectivity is the ability to avoid the assignment of high scores to unrelated strings. In order to evaluate an alignment method with respect to those two parameters, one needs criteria to establish whether a meaningful alignment has been found or missed. Obviously, the time required to compute an alignment is also very important. As for the statistical signifi- cance of an alignment, the efficiency seems to depend on the particular alignment method and on the properties satisfied by the weights. Over the years, as strings to be aligned got longer, the objective of alignment shifted from that of the basic Needleman and Wunch method (1970) (which computes an optimal global alignment) to those of methods that trade the optimality of the computed alignment for speed. Examples are the algo- rithms reported in Altschul (1990), Lipman and Pearson (1985), Wilbur and Lipman (1984), and Wilbur and Lipman (1983). For the interested reader, an overview of some algorithmic issues related to such a speed- optimality trade-off, as well as to the impact of weights on the computational speed can be found in Giancarlo (1997). As mentioned, alignment methods are also routinely used in data base searches. That is, we are given a data base of proteins and we want to establish whether a query sequence is biologically related to strings in the data base. Usually, that involves the computation of a local similarity between the query sequence and each sequence in the data base. The strings having a similarity score exceeding some threshold are then reported and further screened to assess the biological meaningfulness of the similarity criterion used. This stage is based typically on statistical knowledge about score distributions or empirical knowledge about biological function. As knowledge about biological sequences grows, more complex questions are found to revolve around some notion of similarity among sequences. One such type of question concerns the problem of finding approximate repeated patterns within each string. Those approximate repeated patterns seem to be associated with several important properties of DNA and proteins. For instance, parts of chromosomes are made of repeated patterns. A fixed number of those repeated patterns is lost when the cell divides. So, it would appear that those some kind of clock the number of divisions that the cell can

repeated patterns represent "ticking" still undergo. Repeating patterns are also useful laboratory tools. Indeed, the number of times a particular pattern repeats in a DNA sequence varies from individual to individual. Such regions are useful in localizing genes in a chromosome. The first algorithmic studies in this area are due to Miller (1992), Kannan and Myers (1993) and Landau and Schmidt (1993). Additional algorithms based on heuristics are provided in Benson and Waterman (1994) and Milosavljevic and Jurka (1993). We will briefly review this kind of algorithm in Section 8. One more important and very active area of Sequence Alignment is the one that deals with the alignment of multiple sequences. There are many ways in which this problem can be stated, but at this point we will only highlight a few of them. For instance, one obvious multiple sequence alignment problem is the one that asks for the longest common subsequence of a set of strings [see, e.g., Hsu and Du (1984)]. Perhaps more significant from a biological perspective is the following problem originally stated by Sankoff, and now referred to as multiple alignment under an evolutionary tree (Sankoff, 1975). We are given a set of strings (typically, DNA) labeling the leaves of a given tree. We want to find ancestral strings to label the internal nodes of the tree and such that a given score function is minimized. Intuitively, one is trying to establish whether a set of strings have common ancestors and what those ancestors look like. Sankoff also provided algorithms for this problem (Sankoff, 1975). In multiple sequence alignment problems one faces the same difficulties and issues already mentioned for the alignment of two sequences, i.e., choice of weights, statistical significance, etc., except that here computations become overwhelmingly demanding. In fact, many multiple sequence alignment problems are NP-Hard [see for instance Wang and Jiang (1994)], or too time consuming [see Itoga(1981)], or both. Also in the case of multiple sequence alignment, there is a trend to develop fast algorithms that trade-off optimality of the required solution for speed [see for instance Bafna et al. (1994)]. However, in some cases, even the computation of approximate solutions is hard [see for instance Wang and Jiang (1994)]. The notion of distance between strings that uses symbol-based edit operations models evolution in terms of very local changes to the genomes. Although such a notion of distance is appropriate to describe evolution of small portions of a genome (for instance, a segment of a sequence describing a chromosome), it shows limitations in quantifying the evolutionary distance between two entire genomes or two entire chromosomes within two genomes. Indeed, evolution at the genomic level seems to involve also nonlocal, large scale operations, which can rearrange a whole segment of a chromosome in one evolutionary event. Such nonlocal SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 177 changes are hard to detect using a distance based on local edit operations [see, for instance, discussions in Griffin and Boursnell (1990), Karlin etal. (1994), Palmerand Herbon (1988), Sankoff (1992), Sankoff et al. (1992), Watterson et al. (1982)], and require taking into account more global structure and information. One approach is as follows. Assume that we can write two genomes we want to compare as two strings of genes. Now, molecular biology suggests that if the two genomes are related, one should be obtainable from the other by a suitable rearrangement of its genes. This approach (gene rearrangement) has been pioneered by Palmer and Herbon (1988) even though the earliest application of gene rearrangement to genome comparison seems to date back to 1938 (Dobzhanski and Sturtvant, 1938). From the combinatorial point of view, gene rearrangement requires the definition of new distances between the DNA sequences that encode genomes. Moreover, it gives rise to a set of very interesting and challenging combinatorial problems. In this context, the first formulation of a new distance and of an algorithm computing it seems to be due to Watterson et al. (1982). Very recently, there has been quite a bit of work, initiated by Kececiouglu and Sankoff (1995), for the formal investigation of the computational complexity of various distances between genomes. In the last part of this paper, we will briefly review the state of the art in this domain.

3. PAIRWISE ALIGNMENTS

In this section, we discuss two basic families of methods for detecting similarities between two bio- molecular sequences, regarded as two strings x[\ : n] and y[l : m] over some suitable alphabet of symbols. (For instance, proteins are strings over the alphabet of aminoacids.) The families we are about to discuss are reminiscent, respectively, of the Hamming and Levenshtein distances introduced earlier, so that the first class of methods is in fact a restricted version of the second. In the practice of computational molecular biology, global resemblance between two bio-sequences is significant in a variety of cases, but most often the interest of the biologist concentrates on unveiling segments bearing high mutual resemblance within otherwise feebly related sequences. In particular, Hamming-like methods are applied almost exclusively as a filter to detect local similarities.

MSP. Maximal segment pair (MSP) computes an extremely simple notion of alignment in which no insertions and deletions of symbols are allowed. We now define it. Consider two substrings x[i : i + k] and ylj ' j +k]ofx and y, respectively. We can align those two substrings by pairing x[i] with y[j], x[i + 1] with y[j + 1 ] and so on. We can define a score for that alignment. Indeed, let S(a, b) be the weight of "pairing" a with b, where a and b are letters over the input alphabet Y.. S can be thought of as a matrix, which we refer to as the substitution matrix. The score of aligning two substrings (of equal length) of* and y is given by the sum of the weights of the letters paired together. The underlying alignment is locally optimal if and only if its score cannot be improved by either extending both substrings or by shortening both of them. A maximal segment pair is a best locally optimal alignment and it is taken as the (local) alignment of the two strings. In some cases, one may be interested in all locally optimal alignments. This simple method is the base for the BLAST program, one of the fastest programs for local sequence alignment (Altschul et al, 1990). Other methods of this kind have also been proposed [see, for instance, Arratia et al. (1986), Arratia and Waterman (1989), Karlin and Ost (1988), Sellers (1984)]. Beginning in the late sixties, measures of similarity based on Levenshtein distances were independently proposed in such diverse fields as speech processing, text editing and, of course, molecular biology. Solutions to the corresponding global alignment problem were also supplied at that time, based mostly on dynamic programming. In molecular biology, distances and related computations were proposed by Ulam, Needleman and Wunsch, Sellers, Sankoff and others (Goad, 1989; Ulam, 1972a, 1972b; Sellers, 1974; Needleman and Wunsch, 1970; Sankoff, 1972). A recollection of how the idea of dynamic programming became a standard tool in many problems of computational biology is provided by Sankoff (1996). In computer science, the problem was dubbed "the string to string correction problem" by Wagner and Fischer (1974), who gave it the following formalization. Given two strings x and y, the problem is to transform x into y by performing a series of weighted edit operations on x of overall minimum cost. An edit operation on x can be the deletion of a symbol from x, the insertion of a symbol in x or the substitution of a symbol of x with another symbol. Let D(a) denote the cost of deleting from x an occurrence of symbol a, 1(a) the cost of inserting some symbol a between two consecutive positions of x, and S(a, b) the cost of substituting some occurrence of a in x with an occurrence of b. The corresponding computation by dynamic programming is then readily stated. For this, let C(i, j), (0 < i < |jc|, 0 < / < \y\) be the minimum cost of transforming the prefix of 178 APOSTÓLICO AND GIANCARLO x of length i into the prefix of y of length j. Let sk denote the ¿th symbol of string s. Then C(0, 0) = 0, C(i, 0) = C(i -1,0) + D(xi) (i = 1,2,..., m), C(0, j) = C(0, j 1) + /(>>,) 0 = 1,2,...,«), and -

C(i, j) = min{C(/ -1,7-1)+ Six,-, C(/ 1, j) + D(x¡), C(i, j 1) + (1) y,), - - /(j,)} for all /, j, (1 < / £ |jc|; 1 < j < \y\). Observe that, of all entries of the C-matrix, only the three entries C(i 1, j 1), C(i 1, j), and C(i, j 1) are involved in the computation of the final value of C(i, j). — — — — Hence C(i, j) can be evaluated row-by-row or column-by-column in 0(|x||y|) = &(mn) time. An optimal edit script can be retrieved at the end by backtracking through the local decisions that were made by the algorithm. A few special cases of the recurrence above are of interest in various applications. Among them, the longest (or heaviest) common subsequences appear also among the earliest attempts at defining similarity among bio-sequences (Needleman and Wunsch, 1970). A subsequence of x is any string w that can be obtained from x by deleting zero or more (not necessarily consecutive) symbols. The longest common subsequence (LCS) problem for input strings x = x\x2 xm and y = y\y2- • • yn(m < n) consists of finding a third string w = w\W2 wi such that w is a subsequence of x and also a subsequence of y, and w is of maximum possible length. The relation of this problem to string editing is understood by observing that the effect of a given substitution can be always achieved, alternatively, through an appropriate sequence consisting of one deletion and one insertion. When the cost of substituting a symbol with a different one is higher than the global cost of one deletion followed by one insertion, then an optimum edit script will always avoid substitutions and produce instead y from x solely by insertions and deletions of overall minimum cost. If insertions and deletions have unit cost, and a substitution has a cost higher than 2, then the pairs of matching symbols preserved in an optimal edit script constitute a longest common subsequence of x and y. It is not difficult to see that the cost e of such an optimal edit script, the length / of an LCS and the lengths of the input strings obey the simple relationship: e = n +m 21. Several LCS algorithms that do not require quadratic time on all — inputs have been proposed over the years. Most fall in one of two basic paradigms due to Hirschberg (1977) and Hunt and Szymanski (1977), respectively, and improve on them in various ways [see, e.g., Apostólico and Guerra (1985)]. Other approaches, e.g., in Myers (1986) and Kambayashi et al (1982), are especially suitable for similar strings. Space saving techniques are often crucial in molecular sequence alignment. For some time, the early &(nm) algorithm by Hirschberg (1977) was the only one to require linear space. Solutions where linear space was not necessarily accompanied by &(nm) time began with Apostólico and Guerra (1985). Further work in this area is found in Apostólico et al. (1992), Chao (1994), Myers (1986), Myers and Miller (1988) and others. As mentioned, not all of the genetic material seems equally crucial to the functionality of organisms, so that the process of evolution tends to preserve some segments of DNA while accepting more liberally changes in others. Notable among the most preserved regions of DNA are those special segments that translate into sequences that fold up as proteins. Often in the practice of biological sequence comparison, sequences are compared that are known to have poor global similarity, but are also suspected to contain similar segments. In these cases, one resorts to methods that detect such local resemblances, such as the one below, formulated by Smith and Waterman.

SW-Local [Kanehisa and Goad, 1982; Sellers (1980); Smith and Waterman, 1981]. It computes local alignments of two substrings of x and y, but, unlike in the MSP, this time insertion and deletion of symbols are allowed. For each pair of prefixes x[l : i] and y[l : j] we are interested in finding the best suffixes of those two prefixes that can be aligned with each other using the full set of edit operations. That can be done by means of the following dynamic programming recurrence (Smith and Waterman, 1981):

LA(i, j) = max{LA(i - 1, j - 1)+ S(x¡, y¡), LA(i -1J) + S, LA(i, j 1) + S. 0} (2) - where S is the weight of deleting or inserting a symbol, 1 < i < n and 1 < j < m, and is chosen to be a negative number. Also the cost of a substitution is assumed to be negative on average. The initial conditions of the recurrence are given by LA(i, 0) 0 and LA(Q, j) 0. This set of conditions, and the 0 appearing — — under the max operator, is where Recurrence 2 departs from Recurrence 1. Clearly, the scores assigned in this way to M reflect the best local alignments ending at each point of that matrix. Among these locally optimal scores, we can choose a best one, i.e., one corresponding to max{LA(z, j), 1 < i < n and 1 < / < m\, SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 179 and correspondingly identify two substrings of the input that achieve the best pairwise alignment. We can change (2) and its initial conditions getting back a method for the computation of the global similarity between two strings. SW-local [with modifications by Gotoh (1982)] is the base of the SSEARCH program for local alignment (Pearson, 1991).

4. THE STRUCTURE AND CHOICE OF WEIGHTS

In the two alignment methods we have outlined in the previous section, the values assigned to the weights or costs of insertions, deletions and substitutions are extremely important in obtaining alignments that are biologically meaningful. There are two natural questions that one can ask about weights. What is the structure of weights, i.e., is there any mathematical methodology that can help us to obtain numeric values for weights that turn out to be useful for molecular biology? How do we actually compute those weights? The current state of the art provides partial answers to those questions and many of the results depend on whether we want to compare two DNA or protein sequences, or whether we want to use local or global alignment and on which method we are using. As for alignment of DNA sequences, S in MSP or equations ( 1 ) is equal to a fixed constant c¡ if the two characters are equal (a match) else it is equal to another constant c2 (a mismatch). We will briefly discuss weights for DNA alignment in Section 4.4. As for alignment of proteins, S is a substitution matrix and it has been the object of many studies. In Sections 4.1-4.3 we will present results concerning substitution matrices for the MSP alignment method. That is, we will discuss those matrices mainly in the local alignment context and, unless otherwise stated, an alignment of two strings is the one obtained by MSP. Then, in Section 4.4 we will consider the problem of how to choose gap and match-mismatch weights (for DNA sequences) or substitution matrices (for proteins) for the SW-local alignment method, since it is a good representative of the class of methods allowing gaps.

4.1. The structure ofsubstitution matrices

When one looks for optimal local alignments of two strings, one is trying to discover whether parts of those two strings are related to each other through evolution. The scores (and therefore the weights of substitutions) give a quantitative measure ofthat relationship. So, the weights should be designed to discriminate biologically meaningful alignments from ones due to chance. In very loose and intuitive terms, two strings can always be obtained one from the other through some amount of "evolutionary change." So, when we align those two strings, it seems reasonable to use the weights that best capture that amount of "evolutionary change." Unfortunately, that quantity is not known to us. However, in order to investigate the "structure of weights," let us assume that it is known. Moreover, to simplify notation, we assume that all letters of the alphabet are integers in [1, |E|]. Let pi be the probability that symbol i appears in a randomly chosen string from E*. We refer to the p¡ 's as the background probability distribution of the symbols of the alphabet. Let q,j be the probability that, for the amount of evolutionary change we have fixed, symbol i is substituted by symbol j. We refer to q¡,fs as the target probability distribution of two symbols of the alphabet substituting each other. In Karlin and Altschul (1990), it is argued that the best weights one can use to capture the similarity of two sequences at a given evolutionary distance are of the form

su, j) = l2É3hilMjl (3) À where A. is a normalization constant. A few remarks are in order. As already stated in Section 2, many substitution matrices have been proposed, even before (3) was discovered. Most of them are log-likelihood matrices and therefore have the form prescribed by (3). This is appealing to intuition, since S(i, j) gives a "measure" of how much the probability of the event "symbol i substitutes symbol j" differs from chance. Equation (3) is important because it reinforces the view that, at least for MSP alignments, that intuition is indeed correct. There are other requirements which those substitution weights must satisfy. We will state and justify them. There must be at least two letters in £ for which the weight is positive, since we would like to assign positive scores to locally optimal alignments. Consider the substitution weight of two randomly chosen symbols, then its expected value must be negative. If the expected weight of two randomly chosen letters would be positive, 180 APOSTÓLICO AND GIANCARLO extending two substrings in an alignment as far as possible would tend to increase the score of that alignment (so the definition of locally optimal alignment would be meaningless).

4.2. How to compute substitution matrices We now turn to the problem of how to actually determine the numeric values for substitution matrices. The hard part is the computation of the target probabilities in (3). We will briefly discuss two methods, one due to Dayhoff et al (1978) and the other one due to S. and J. Henikoff (1992). The PAM Matrices. Dayhoff et al (1978) use a stochastic process to model the evolution of proteins: at each discrete time instant, a symbol (an amino acid) has a certain fixed probability of substituting another symbol. They also define a unitary measure of evolutionary change as a Point Accepted Mutation (PAM for short): one PAM unit is the amount of evolutionary change required for a sequence of length n to have endured n/100 point mutations. Note that the same position might mutate more than once, or even mutate back to its original character, so that two sequences that are one PAM diverged will be expected to differ in less than 1 % of their positions. Likewise, experiments confirm that two sequences that are 250 PAM apart are still expected to agree on about 20% of their symbols. Let Mi be the stochastic matrix giving the target probability distribution at one PAM. That is, Mi [i, j] = q¡j at one PAM. M\ is a matrix that has been experimentally determined by studying a group of closely related proteins [Dayhoff er al. (1978)]. M„ = (Mi)" is the target probability distribution at n PAM's. Now, for a given evolutionary distance (expressed in PAM's), we can use the corresponding M matrix to compute the substitution matrix as given by (3). Actually, one has a family of matrices from which to obtain substitution matrices. Those target distribution matrices are simply referred to by the PAM distance corresponding to them. So, PAM 250 is the matrix corresponding to an evolutionary distance of 250 PAMs. One drawback of the method just described to compute weights is that, for distantly related strings, one infers target probabilities from the target probabilities of closely related strings. That may lead to inaccuracies since we are not estimating the target probabilities directly from samples of distantly related strings. Henikoff and Henikoff ( 1992) have recently proposed a different method to obtain target probabilities for distantly related strings. The idea is to infer those probabilities from a large set of representative families of proteins. Here we outline the main ideas of the method.

The BLOSUM Matrices. Since | E | 20 for the alphabet of amino acids, there are 210 distinct pairs of symbols (i, j), each of which corresponds—to a substitution of symbol i with j. (Here (j, i) is considered to be same as (/, j)). The target probabilities are a normalized estimate of the frequencies with which one can find a substitution (i, j). One estimates those frequencies and therefore the corresponding probabilities using a data base of proteins. That is done as follows. The proteins in the database are divided into blocks, following the criteria specified in Henikoff and Henikoff (1992). A block is a set of s strings, each of length w. So, we can think of a block as ans x w matrix. For each column ofthat matrix, we count how many times symbols i and j appear in that column and we increase the frequency estimate of that pair of symbols accordingly. (Each such occurrence corresponds to a substitution of a symbol of a string into another symbol in a different string in the block.) The process is repeated for all blocks. An important feature of this method is that one can cluster closely related proteins that are in the same block. Such a clustering avoids that closely related proteins give a large contribution to the target frequencies (we want to obtain target frequencies that "capture" similarity between distantly related proteins). The idea is the following. Let us assume that we want to reduce the contributions to the target frequencies of proteins that are at least 80% similar. Then, for each block, we cluster together all proteins that are at least 80% similar and each of those clusters is now one string in the block. We use those new blocks to estimate target frequencies. We remark that the similarity of proteins within blocks is obtained by other methods [Henikoff and Henikoff (1992)]. The above approach gives rise to a family of target probability distribution matrices, the BLOSUM family. Each BLOSUM matrix has also a number attached to it that refers to the percentage that one has used for clustering. For instance, BLOSUM 62 is the matrix obtained by clustering together, in each block, proteins that are at least 62% similar. From each BLOSUM matrix, we can obtain substitution matrices again using (3).

4.3. How to choose substitution matrices

When we want to align two strings using MSP, we need to choose a substitution matrix. From what we have said above, we should use the substitution matrix that best captures the evolutionary change transforming one string into the other. Unfortunately, that distance is not known to us, even though we may have some idea SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 181 about it. Up until recently, one would resort to use the PAM 250 matrix since experimentally it seemed the best suited to align distantly related strings (Dayhoff and Eck, 1968). Recently, Altschul (1991) has proposed an interpretation of substitution matrices in information theoretic terms. That interpretation is useful both in choosing a substitution matrix from a family of matrices (like the PAM or BLOSUM families) and in comparing the performance of matrices from different families, i.e., the ability of substitution matrices to yield biologically meaningful alignments. Let H = ¿Zjjqjj ]og2(q¡j/piPj) be the relative entropy of the target and background distributions. We associate the quantity expressed by H to a substitution matrix with target probabilities q¡j's. Now, fix two strings for which we want to compute an MSP alignment and assume that q'¡ is the target probability distribution for the alignment of those two strings. Let H' be the same as H but with the new target probability distribution. Notice that H' is the average score per symbol substitution in that alignment. Since H' is also a measure of information, we can interpret it as the average information (in bits) per position that is needed to distinguish that alignment from chance. In other words, it gives, for each position of the alignment, the amount of information that we need to discriminate between the target and the background probability distributions. Now, to align the two chosen strings, we should choose a substitution matrix with relative entropy close to //'. Since we do not know H', we have to resort to approximation. That is done as follows. Altschul (1991) has shown that, in order for an MSP alignment to be meaningful, its score must be of at least log N bits, where N is the product of the lengths of the two strings being aligned. Now assume that, for the two strings we want to align, we expect the length of the best local alignment to be / (this value is usually determined by resorting to heuristic knowledge). So, the amount of bits each position must contribute to the score of that alignment is \og(N)/f. Therefore, we choose a substitution matrix with relative entropy close to \og(N)/f. Based on this criterion, Altschul (1991) provides a set of guidelines on which PAM matrices to use in which contexts. Further studies are reported in Altschul (1993) and States et al. (1991). The relative entropy of a substitution matrix is also very important to establish a criterion on which to base the comparison of the performance (in terms of parameters such as sensitivity and selectivity, to be discussed below) of two different substitution matrices. For instance, the PAM of matrices has been obtained family in a totally different way from the BLOSUM family and there seems to be no clear relationship between members of the two families. The question is how to compare matrices in those two families (e.g., should the performance of the PAM 250 matrix be compared with BLOSUM 62 or with another matrix?). Using relative entropy, it seems reasonable to compare the performance of two matrices that have the same entropy. Using this criterion, S. and J. Henikoff (1992) have compared those two families of matrices and they showed that the BLOSUM matrices seem to perform much better than the PAM matrices. Additional extensive studies are reported in Henikoff and Henikoff (1993). For completeness, we point out that those experimental studies have been carried out for local alignment methods that allow gaps. The problem of choosing weights for global or local alignment methods when both gaps and substitutions are allowed is considered next.

4.4. Weights for alignments with gaps In pairwise alignment of DNA sequences, the weight c\ of a match is usually set to 1, while the weight c2 of a mismatch and the gap penalty S are determined by trial and error. Typically, the first symbol in a gap (the first insertion/deletion in a series) gets a higher penalty weight (e.g., —12) than the one (e.g., —4) assigned to the remaining insertions/deletions in the series. A discussion about which weights for DNA sequence alignment are best suited in order not to miss biologically meaningful alignments is reported in Fitch and Smith (1983). In the alignment of proteins sequences, weights are given in form of a substitution matrix and a gap penalty 8. The policy used in the choice of S for DNA sequences, i.e., giving the first insertion/deletion in a series a weight higher than the rest, is followed also in this case. However, numerical values may be different than the ones used for DNA [see, for instance, the values suggested by Pearson (1991)]. We also point out the interplay between a good choice of gap penalties and the substitution matrix used. Vingron and Waterman ( 1994) discuss criteria on which to base these choices. In general, it is an open problem to determine a provably optimal set of values for gap penalties. The substitution matrix is usually chosen to be a log-likelihood matrix, i.e., a matrix with entries in the form prescribed by (3). Therefore, any of the matrices described in Section 4.2 qualifies in principle. Unfortunately, such a choice is based on heuristics only, since no theory was developed so far stating that a substitution matrix for SW-local should take that form. Moreover, it has been argued in Altschul (1991) that a result analogous to equation (3) may not be possible to prove for the case of alignments with gaps. 182 APOSTÓLICO AND GIANCARLO

For the time being, one resorts to extensive experiments to establish how well a set of weights performs. In this context, it is important to come up with good methodologies on which to base the design of those experiments. Some steps in that direction have been taken [see, for instance, Henikoff and Henikoff (1992), Pascarella and Argos (1992)]. We will see one of those methodologies in Section 7. In conclusion, quite some progress has been made in the quest for sound criteria on which to base the choice of weights for sequence alignment. This is still a challenging task, and it is reasonable to expect that no unique set of weights will do. Some other types of investigations can be used to circumvent the choice of weights (e.g., parametric alignment methods—suboptimal alignments), or to assess the biological significance of the alignment found (e.g., sensitivity and selectivity of alignment methods) for a given set of weights. In the next three sections we will review the state of the art on those topics. However, it should be kept in mind that results in those areas may have a fall-out on the issues treated in the present section [see for instance discussions in Fitch and Smith (1983), Gusfield and Stelling (1996), Vingron and Waterman (1994)].

5. PARAMETRIC SEQUENCE ALIGNMENT

In Parametric Sequence Alignment (PS alignment for short), the two input sequences are assumed to be fixed and the problem is to compute optimal alignments as a function of varying edit costs. The idea is to partition the parameter space into (necessarily convex) maximal regions with the property that a same alignment is optimal throughout each region. This approach was pioneered by Fitch and Smith (1983) for the alignment of two hemoglobin genes, but they gave no general algorithm for PS alignment. At about the same time, in the realm of Computer Science, Gusfield (1983) considered algorithms for the closely related problem of computing two-dimensional parametric decompositions. It was only later, however, that both mathematical formulations and algorithms for PS alignment were obtained by Gusfield et al Additional work is found in Zimmer and Lengaver (1997), Vingron and Waterman (1994), and Waterman (1984) and software has been developed specifically for this kind of alignment (Gusfield et al 1994; Gusfield and Stelling, 1996; Waterman etal, 1992).

Before proceeding further, we need to make some preliminary observations. Till noted differently, we refer to the "global" sequence alignment. Implicit in the discussion of Section 3 is the following interpretation of sequence alignment. Given two sequences x and y of respective lengthts m and n > m, a (not necessarily optimal) alignment is obtained by introducing spaces into the two sequences in such a way as to build sequences x' and y' of equal length. If at this point x' is juxtaposed to y', we have that any pair of identical (resp., different) characters facing each other constitutes a match (resp., a mismatch). A character of x' facing a space of y', or vice versa, represents an indel. A maximal run of one or more consecutive indels in one of the sequences is a gap. Consider now the case of two DNA sequences and let, like in Section 4, c\ denote the weight of a match, c2 that of a mismatch, and S be the weight of an indel. Consider now a gap in one sequence. According to the policy mentioned in Section 4.4, all the indels in the gap except the first one are assigned the same weight. This may be regarded as assigning the same weight to all indels in the gap and then a global weight to the gap itself, irrespective of indels it contains. Let y be this global weight. Throughout the rest of this discussion, we assume, without loss of generality, that c\ = 1 and that c2, S and y are nonnegative constant (this latter assumption is in contrast with the choices in Sections 3-4, but all material presented there can be adjusted accordingly). The weight of an alignment A between two sequences x and y, consisting of MIT matches, MiS mismatches, Iff indels and QV gaps, can be written as:

WA = MT c2MS SIM yQV. (4)

- - - Notice that, with the appropriate adjustments of weight signs and having fixed their numeric magnitude, the dynamic programming recurrence (which can be easily obtained from (2)) computing the global alignment of two sequences is the one that actually computes the alignment maximizing (4). Here we are interested in studying how the optimal alignment changes as a function of the parameters (c2, S, y). We can now give a formal definition of the problem. Consider the positive hyperplane of the space (c2, S, y) (we refer to it simply as the 3-parameter space). Observe that for any particular choice of (c2, 8, y), there is an optimal alignment of the two sequences. Conversely, given an optimal alignment, there is a region of maximal area in the 3-parameter space within which optimality of the alignment is preserved. Moreover, one can show that such (convex) region partition the 3-parameter space. Given the two sequences, the problem consists in finding such a partition. SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 183

Consider the special case in which gaps are ignored, i.e., y = 0. The parameter space reduces then to two dimensions. Gusfield et al. have shown that the number of regions is bounded by n2^ for global alignment and n for local alignment. There are two families of efficient algorithms available here (Gusfield et al., 1994; Gusfield and Stelling, 1996; Waterman et al., 1992). They all use sequence alignment algorithms as subroutines but differ on how the boundaries of the regions are found. In one family (Waterman et al., 1992), one uses the novel idea of infinitesimal sequence alignment in which the computation of equation (2) is carried out as a symbolic computation. In another family (Gusfield et al., 1994; Gusfield and Stelling, 1996), several alternative techniques are used in a nonobvious way to find regional boundaries. The techniques are related to algorithms for record segmentation in shared databases (Eisner and Severance, 1976), Combinatorial Optimization (Gusfield, 1983; Meggiddo, 1979) and Newton's zero finding method (Gusfield and Stelling, 1996). Letting R be the number of regions in a particular partition, the time complexity of those families of algorithms ranges from O(Rnm) to 0(nm2 log n) time (Gusfield et al, 1994; Gusfield and Stelling, 1996), depending on which technique is used to find region boundaries. Consider now the problem in 3-parameter space. For global alignment, Gusfield et al. (1994) have shown that the number of regions is bounded by n2 and, for local alignment, this number increases to n3. The algorithms for the 2-parameter generalize to 3-parameter space. The time complexity of the algorithms increases accordingly. Finally, we consider parametric alignment of protein sequences. In this case, the weight of a match or mismatch depends on the specific pair of symbols involved, as reported in substitution matrices. Obviously, the dimension of the parameter space grows accordingly, and the added complexity makes PS alignment methods unpractical. However, one can resort to an approximate version of PS, which we now describe. Again, consider an alignment A of two sequences. Let MT denote the total weight given to the matches in A and let MS be the total weight given to mismatches. Then, one can define the value of alignment A as (Gusfield et al, 1994; Gusfield and Stelling, 1996):

WA = aMT + ßMS SIN yGV. (5) - -

The term MS is added rather than substracted because, in substitution matrices, the entries corresponding to mismatches already have negative values. Now, it is possible to study (5) as a function of the parameters a, ß, y, and S. However, no nontrivial bound is known for the number of regions, even when the parameter space is reduced to three (for discussion and related results, see Gusfield et al., 1994). Usually, (5) is studied in 2-parameter space, i.e., only two of the four parameters are retained while two are fixed. Indeed, under these conditions the algorithms for parametric alignment in 2-parameter space work within guaranteed complexity bounds.

6. SUBOPTIMAL ALIGNMENTS

Even with the best choice of weights, the alignment of two sequences that follows from implementation of one of equations ( 1 ) or (2) represents only one possible solution to an abstract optimization problem, with little guarantee of capturing the underlying biological substance. For this reason, researchers have considered the problem of extracting suboptimal alignments that are acceptably close to the optimal one (Chao, 1994; Naor and Brutlag, 1993; Vingron and Argos, 1990; Waterman and Byers, 1985; Waterman and Eggert, 1987; Zuker, 1991). Along these lines, one gets to expose: (a) alignments that are close to the optimum but substantially different from it; (b) regions that are common to many suboptimal alignments (an indication of the fact that those regions might be biologically significant); (c) alignments that optimize alternative biological constraints. Moreover, based on suboptimal alignment, one can define and compute robustness measures indicating how confident one can be about the biological correctness of two symbols (one from each sequence) being aligned in an optimal alignment (Chao, 1983; Mevissen and Vingron, 1997). It is well known that the computation of equation (2) can be regarded as the identification of an optimum path on a special type of directed graph (called grid graph) on nm vertices having a single source s, a single sink t and 3nm edges. In this representation, the problem of finding suboptimal alignments then becomes one of enumerating the K shortest paths in a weighted graph, for some fixed K. This latter problem is a very well known combinatorial problem for which many algorithms already exist [for references, see Ahuja etal. (1993) and Lawler (1976)]. Unfortunately, none of those algorithms is customized to exploit the special topology of the graphs we are interested in. 184 APOSTÓLICO AND GIANCARLO

Recently, Naor and Brutlag (1993) devised an algorithm such that, given an alignment (a path in the grid graph) of some cost, it outputs the next best alignment in 0(m) time (assuming that m > n). Their algorithm is based on an interesting combinatorial study of the graph in which it is shown that all suboptimal s-t paths can be characterized in terms of some special paths that the authors introduce as canonical paths. A path P in the grid graph is canonical with respect to an edge e = (u, v) if it consists of the best path from s to u, followed by e, and then followed by the best path from vtot. Another key ingredient to this algorithm is the definition of a transformation of edge weights of the grid graph achieving that no edge has a negative weight. It is interesting to point out that this powerful transformation is related to the one devised by Edmonds and Karp ( 1972) for general weighted graphs and that is at the basis of fast and simple shortest path algorithms. Chao (1994) provides algorithms that compute suboptimal alignments in linear space.

7. SENSITIVITY AND SELECTIVITY OF ALIGNMENT METHODS

Good algorithms for sequence alignment must show a good balance between sensitivity and selectivity. As already mentioned, sensitivity is the ability of an algorithm to detect distant evolutionary relationships between two strings, while selectivity is the ability to avoid the assignment of high scores to unrelated sequences. In order to evaluate an alignment method with respect to these two parameters, one needs criteria to establish whether a meaningful alignment has been found or missed. One possible criterion is to establish the statistical significance for the score of a given alignment. For MSP, one can phrase the problem as follows. Consider two protein strings that are random in the sense that they satisfy the probabilistic assumptions stated in Section 4.1. How many distinct locally optimal alignments with score at least s are expected to occur simply by chance? Karlin and Althschul (1990) have shown that such a number is well approximated by the formula KNexp'Xs (6) where X is the same normalization constant of (3), A' is the of the of the we equation product lengths strings are aligning and K is a parameter that can be explicity computed (Karlin and Altschul, 1990; Karlin et al, 1990). Once we know the score s of a locally optimal alignment, we can use (6) to establish how meaningful it is (the smaller the number given by (6), the more meaningful the alignment is). For the more general case of global or local alignment between two strings, where both gaps and substitutions are allowed (e.g., alignments computed by SW-local), it would be interesting to develop some statistical theory about the expected distribution of optimal scores. For the time being, the available results are limited to some important families of "random" strings [see Lander et al ( 1989), Waterman et al. ( 1987), Arratia and Waterman (1994), Waterman (1994), Waterman (1995), Waterman and Vingron (1994)]. Lacking a general statistical theory, Monte Carlo techniques are used sometimes to establish how likely it is for a given alignment score to be due to chance (Doolittle, 1981; Lipman and Pearson, 1985; Pearson and Lipman, 1998). We now briefly mention some results from the available statistical theory of local sequence alignment. Assume that the alphabet symbols are i id and consider two sequences x and v randomly drawn from this "information source." Consider the local alignment score of x and y according to Recurrence (2). We are interested in the statistical properties of such a score. The problem is not trivial and one of the major findings in this area is that those statistical properties depend heavily on the choice of indel and mismatch costs. Indeed, Arratia and Waterman (Arratia and Waterman, 1989; Waterman, 1995) have studied the asymptotic growth of the random variable "local alignment score of two sequences," and they have discovered that, as a function of indel and mismatch costs, it first grows logarithmically and then linearly with the length of the sequences, depending on gap penalties. Those two regions of growth are referred to as the "log region" and as the "linear region." Arratia and Waterman (1994) provide some statistical tools to establish how meaningful a given alignment is for choices of alignment parameters in the linear region. As for choices of parameters in the "log region," the probability that a score shall exceed some constant can be estimated through a Poisson approximation (Waterman and Vingron, 1994), provided that some heuristic considerations are applied. Based on that approximation, one can derive estimates on the number of alignments with scores exceeding some bound. This is in analogy to, although more general than, what Karlin and Althschul (1990) have obtained for "mismatches only" [see equation (4) above]. The famous Chen-Stein Poisson approximation is key to all this. Another approach that has been recently proposed is to come up with a methodology for the empirical assessment of the selectivity and sensitivity of alignment methods. Pearson (1991) has proposed one such SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 185 methodology and he has applied it to the comparison of some alignment programs for data base searches. We will briefly outline the main points of that methodology. The sensitivity and selectivity of a method is quantified by two parameters. One is the number of strings in the data base that are biologically related to the query string and that were "missed" by the method during a search (that measures sensitivity). The other is the number of strings in the data base that are biologically unrelated to the query string and that were "found" by the method during a search (that measures selectivity). As for establishing when a string in the data base has been found or missed, Pearson (1991) has suggested several statistical criteria. In principle, those criteria would need the a priori knowledge of the statistical distribution of scores. That knowledge is experimentally precomputed as follows. Using previously available knowledge, the data base is conceptually divided into two sets of strings: the ones related to a sample query string and the ones that are not related. Using SSEARCH and a set of weights for the edit operations, the scores between the sample string and each unrelated string is computed. That gives a distribution of scores for unrelated sequences. The same is done for the set of related sequences. Now a method is evaluated by running many experiments using different criteria to establish when it has found or missed a string in the data base. The method uses the same weights as the ones used by SSEARCH to precompute statistical information. We remark that the above experimental methodology [or similar ones—see Henikoff and Henikoff ( 1992)] can also be used to evaluate the performance of substitution matrices. Indeed, one fixes the method once and for all and the experiments are performed by changing the matrices.

8. SELF-ALIGNMENTS

Periodicities, repetitions, palindromes and other regularities in a string x[\ : n] are found of biological significance in various contexts. Here we concentrate on the two families represented by non-overlapping repeats and tandem repeats. Consider two xr = : and = : with £ > r. Let : : be substrings x[j r] x'e x[£ t] Score(x[j r], x[£ t]) the similarity score between those two strings, i.e., the sum of the weights of the operations in an edit script transforming x[j : r]intox[€ : i]. Such an edit script is associated with an alignment of x[j : r] withx[€ : t]. We refer to that alignment as a repeat. A repeat is a tandem repeat if £ = r + 1. To keep things simple, let us assume that, among all possible repeats, we are interested in finding the best one, that is, the one of maximum score. This problem has been proposed by Miller (1992), who also gave a practical algorithm to solve it. The heart of the algorithm is the dynamic programming recurrence (2) that is used for local alignment. However, the computation of that recurrence needs to be modified to account for the facts that we are aligning a string with itself and that we are interested in non-overlapping regions of similarity. Kannan and Myers (1993) subsequently devised an algorithm with an 0(n2 log2 n) time performance, by combining some of the ideas in Miller (1992) and Apostólico et al. (1990). More recently, Schmidt (1995) generalized the notion of locally optimal alignment for strings [the one due to Sellers (1980)] to locally optimal (non-overlapping) repeats and to locally optimal tandem repeats. She also devised an 0(n2\ogn) time algorithm that computes all such locally optimal repeats. The data structures she devised may be useful in other sequence alignment contexts. Intuitively, such data structures support fast answers to queries on shortest paths in the grid graphs associated with the dynamic programming recurrence (2). For completeness, we mention that a simplified version of the problem of finding all locally optimal tandem repeats had been considered in Landau and Schmidt (1993). Algorithms dealing with similar "incremental" versions of string editing and longest common subsequences were produced in the mid 80's by Myers (1986) and, in a distant context of searching for cliques on some special classes of graphs, by Apostólico et al. (1992). A recent paper by Landau, Mayers and Schmidt (1995) combines the incremental methods for edit distance without mismatches of Milosavoljevic and Jurka (1993) with the incremental methods for the Levenshtein edit distance and discusses a number biological applications where this is useful. In all the algorithms mentioned above, "tandem repeat" is used to indicate that a certain (perhaps approx- imate) pattern repeats twice in the given sequence. A natural question to ask is how to find patterns that (approximately) repeat a certain number of times. (In some cases, it is advantageous to give as a parameter the type of pattern one looks for or the number of times any unspecified pattern approximately repeats.) This is a generalization of the well known problem in combinatorial algorithms on words that consists of finding all repetitions in a string [cf., e.g., Apostólico and Preparata (1983) and Crochemore and Rytter (1994)]. There are two heuristics to identify (approximate) repeated patterns. One is based on Data Compression Methods 186 APOSTÓLICO AND GIANCARLO

[see Storer (1988) for an extensive presentation of Data Compression]. The heuristic by Milosavljevic and lurka (1993) uses the well known Lempel-Ziv Data Compression method (Lempel and Ziv, 1976; Ziv and Lempel, 1977) to find exact repeats. Other heuristics of this type attempt to capture repeatedness of a sequence using compression schemes which are specifically designed for genetic sequences. Among which the algo- rithm Biocompress by Grumbach and Tahi (1993) combines repeats and generic palindromes, and the one by Rivals et al (1995) deals with approximate tandem repeats. A different type of heuristic, due to Benson and Waterman (1994), first looks for "suspicious patterns" within the sequence, i.e., patterns that may give rise to repeats and then tries to validate such patterns through a wrap around dynamic programming scheme. Intuitively, this is a method that computes an alignment of sequences according to recurrence ( 1 ) or (2), except that one can view the computation as taking place on a torus rather than a matrix. This variation of dynamic programming seems to be due to Fischett et al (1992) and Myers and Miller (1988).

9. MULTIPLE SEQUENCE ALIGNMENT

Given a set of strings x\, x2,..., xk over an alphabet E, a multiple alignment A for those strings is a two- dimensional matrix of k rows satisfying the following conditions. The entries of A are either symbols from the input alphabet or the empty symbol X (the identity under concatenation), and concatenating the entries in row i of A yields the string x,. Given a cost function d on T,k, an optimal alignment is one that minimizes the sum of the costs on all columns of a multiple alignment matrix A. Formally, we want a multiple alignment matrix A = (a¡j)i<¡ 0 for the LCS, the existence of acceptable approximation algorithms can be confuted (Jiang and Li, 1993). The general multiple alignment problem inherits NP-completeness from its LCS and SCS specializations. In fact, NP-completeness results relative to LCS and SCS problems were first achieved by Maier (1978). More recent results are inTimkovskii (1990) and Bodlander et al (1994).

9.1. Minimum sum ofpairs For this version of the multiple sequence alignment problem, the cost function d in (7) is given by:

d(alja2j akj) = £i

SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 187 where c(A, A) = 0, c(a, b) is an entry in one of the substution matrices discussed in Section 4.1 when both a and b are in the alphabet E. When only one of the arguments is A, c is a gap cost (see Section 4.4). Moreover, c must satisfy the triangle inequality. Intuitively, an optimal multiple alignment in this case is one that best accommodates the pairwise alignments of the strings involved subject to the constraint that all strings must be aligned. For the case k = 2, it reduces to standard two-sequence alignment discussed in previous sections. Notice that minimum sum ofpairs alignments (SP alignments, for short) tend to maximize the number of positions at which all, or nearly all, of the aligned strings agree. Because of this property, the method is useful in multiple alignments of biological sequences, e.g., proteins, where we want to measure the "variability" of the sequences rather than identify plausible ancestors (Altschul and Lipman, 1989; Carillo and Lipman, 1988). In principle, it is not difficult to set up a solution in 0(2knk) time and 0(nk) space. However, SP alignment is NP-complete (Wang and Jiang, 1994). We point out that NP-completeness of this problem cannot be directly inferred from the NP-completeness of the general alignment problem given by (7), since the cost function d is now specified. We also point out that, despite the expensiveness of this alignment method, it is used in several available software systems for multiple sequence alignment (Lipman et al, 1989; Schuler et al, 1992). Even before the computational complexity of SP was settled, there have been two lines of research aiming at the design of fast algorithms for the problem. A common feature of both efforts is to find a good matrix A. That is, a matrix A that, although not optimal, has a cost "close" to the optimum. That is done by using an observation, which we now state. Let A be any SP alignment and let A¡j be the pairwise alignment of string x¡ and x¡ induced by A. That is, Ajj is obtained from A by deleting all of its rows, except the ¡-th and the 7-th, and then by deleting all columns that have only A's. Now, if there is a multiple alignment A such that A¡j gives the optimal pairwise alignment of x, with Xj, for each 1 < i < j < k, then A is an optimal SP alignment. This observation suggests to look for a good alignment using the following scheme. First, compute the (íj) optimal pairwise alignments of the k strings. Next, by suitably combining those pairwise alignments, build an alignment A in such a way that the cost of its induced pairwise alignments A¡j is "close" to the cost of the of and for all 1 < i < < n. optimal pairwise alignment x, x;, j Unfortunately, it is not always possible to combine all pairwise alignments into a multiple alignment A. That is due to the fact that some pairwise alignments may be incompatible for such a combination. In intuitive terms, a pairwise alignment of two strings x and y is incompatible with a multiple alignment of k strings involving also x and y) when the way in which x and y are aligned in multiple is not equal to the pairwise alignment. Because of possible incompatibility, one usually resorts to the less ambitious task of choosing a subset of pairwise alignments (usually k 1) that are compatible, i.e., they can be combined into a multiple — alignment. For instance, in the Feng and Doolittle (1987) alignment algorithm, one chooses a spanning tree from the complete graph of all pairwise alignments. Now, let's see how the above observations are used. Carrillo and Lipman (1988) have proposed a method that "cuts down" the search space in which the optimal solution lies. Indeed, they give a method to compute upper bounds on the cost of pairwise alignments induced by the optimal alignment we are seeking. Then, the reduction of the search space is achieved by limiting the search to alignment matrices that induce pairwise alignments of cost within the upper bounds. Essential for the determination of those bounds is to find a good, although not optimal, multiple alignment A. Significant reductions of the computational effort have been reported in Lipman et al. (1989). Another line of research tries to devise approximation algorithms that guarantee a provably good solution. Specifically, this consists of computing a matrix A, again, not necessarily optimal, but for which one can prove "a priori" how close it is to the optimal solution. The first polynomial-time approximation scheme for this problem has been devised recently by Bafna et al. ( 1996). They have shown how to obtain a performance ratio of e/(e 2) in time 0((ke)0le>), where e > 0 is a constant. The best performance ratio previously available is — 2 l/k, for any fixed /. The algorithm that achieves it, due to Bafna, Lawler and Pevzner (1994), has a — running time that increases exponentially with increasing /. Additional early work in this area is due to Pevzner (1992), Gusfield (1993) and Bafna and Pevzner (1993). From the practical point of view, it seems that the method by Gusfield is the one that, so far, yields biologically plausible solutions that are only 2% away from optimal (Pevzner, 1992). We now briefly outline the main ideas at the heart of those approximation algorithms. Consider an undirected weighted graph G. Let T be a spanning tree for G. The routing cost of T is the sum, taken over all pairs of nodes in the tree, of the weights of edges on the path of T that connects each pair. Now, let G be the complete graph naturally associated with all pairwise alignments of the k sequences and, for each edge, let its weight be the edit distance of the sequences at its end points. By elaborating on the Feng-Doolittle ( 1987) heuristics mentioned earlier and on further observations by Gusfield ( 1993), Bafna 188 APOSTÓLICO AND GIANCARLO et al. (1996), have reduced the problem of finding a good approximation algorithm for SP alignment to that of finding a good approximation algorithm for the problem of computing a spanning tree of G with minimum routing cost. Even though this latter problem is also known to be NP-complete (Johnson et al, 1978), Bafna et al. were able to set up an approximation scheme for it. Another type of investigation in the design of good approximation algorithms for SP is based on a careful analysis of the connection between the compatibility of pairwise alignments to form multiple alignment of k strings and the choice of which pairwise alignments to use [see, for instance, Bafna et al. ( 1994) and Pevzner (1992)]. Along this line, first the notion of compatibility has been extended from pairwise alignment to multiple alignments involving / strings, I < k. That is, the optimal alignments from which we try to build A consist of/ strings each (so, now we have (,) possible alignments to choose from). Moreover, that notion has been formalized in terms of a graph having some specific topological constraints [such graphs are called configurations in Pevzner (1992)]. Finally, the problem of finding a good multiple alignment A has been reduced to that of finding an optimal configuration satisfying some additional topological constraints. This latter task can be done in polynomial time. We remark that not all of the (;) possible alignments of / strings are used to obtain A.

9.2. Traces As we have discussed in the previous subsection, even if we have all (ij) possible pairwise alignments, only a limited number is used to build the "approximate solution" A. In view of the observation in the previous subsection, there is a fundamental question that needs to be addressed: Given a set of pairwise alignments, how can we find a multiple alignment that is as close as possible to all pairwise alignments in the set? This question was brought to light by Kececioglu (1993), who formalized it and gave algorithms solving it. We briefly review this multiple alignment method. Given k strings x\,x2,... ,xk fix a set F of pairwise alignments of those strings. F can be represented by an alignment graph G (V, E) together with a partial order relation "<". V is given by pairs (j, i), for — 1 < j < k. The pair (j, /) represents character x;[/]. There is an edge between two vertices if and only if the two characters represented by the vertices are matched in the corresponding pairwise alignment in F. Moreover, given two vertices v and w, v < w if and only if the character in v immediately precedes the character in w in one of the strings. This partial order relation captures the ordering of characters within a string. One can observe the following. A path in an alignment graph is a set of characters that can be compatibly aligned into a column of a multiple alignment matrix A. Generalizing, the connected components of a set of edges in G can be seen as columns in a multiple alignment A. Those columns are valid if they can be ordered so as to respect the "<" relation character by character (on the rows of A). Informally, the maximum trace problem consists of identifying an optimal subset of the edges of E such that their connected components can be combined into a valid multiple alignment A. Optimality is defined so as to "maximize" the use of the information available from the pairwise alignments in F. This problem can be formally stated so as to satisfy formulation (7). Moreover, it contains Minimum Sum of Pairs as a special case (Kececioglu, 1993). Kececioglu showed that the general problem is NP-complete and he devised a branch and bound solution for it.

9.3. Multiple sequence alignment under an evolutionary tree Fix a set X of k strings and let F be a set of hypothetical ancestral strings. An evolutionary tree Tx,y for X is a weighted tree of \X\ + \Y\ nodes, where the leaves are associated bijectively to the strings in X, and the internal nodes (are associated bijectively) to the strings in Y. The weight or cost of an edge is usually taken to be the edit distance between the strings at its endpoints, i.e., the cost of the pairwise alignment. The cost of a tree is the sum of the costs of its edges. In its most general form, multiple sequence alignment under an evolutionary tree consists of finding a minimal cost tree Tx,y for X. Minimization is carried out over all sets Y and trees Tx.y. The optimal multiple alignment matrix A can be recovered from the optimal tree since any tree of pairwise alignments can be merged into a multiple alignment that complies with the alignments in the tree. An important special case, due to Sankoff (1975) and which was the first formulation of this kind of alignment, is for a fixed tree. That is, the topology of the tree is fixed and we must optimize over the strings in Y only (satisfying the constraint given by the topology of the tree). Following Sankoff (1975), this "family" of multiple sequence alignments can be cast in the form given by (7). We also point out that this type of alignment tries to minimize the number of mutations from the species represented by strings in X to their ancestors, represented by the strings in Y. SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 189

For this class of problems, there are a set of negative results stating that the existence of approximation algorithms that can get arbitrarily close to the optimum is unlikely, even if the topology of the tree is given (Wang and Jiang, 1994). Without any tree being assumed, an approximation with a performance ratio of 2 in 0(k2n2) time has been established by Gusfield (1993). When the tree is fixed, Sankoff (1975) provides an exact algorithm. Studies that generalize to this problem some of the ideas of Carrillo and Lipman for SP alignment are reported in Altschul and Lipman (1989) and Pevzner (1992). An approximation with performance ratio of 2 for fixed tree with triangle inequality was afforded in time 0(k2n2 + k3) (Jiang et al., 1994). A reduction of this time to 0(k2n2 + k2), along with a much improved approximation scheme, has appeared quite recently in Wang and Gusfield (1996).

9.4. Dot-plots Multiple sequence comparisons may also be based on measures other than edit scripts, e.g., dot-plots (Collins and Coulson, 1987). In its simplest version, the dot-plot associated with a pair of strings is the raw binary matrix of matches between those two strings, i.e., entry (i, j) = 1 iff the /th symbol of the first string is identical to the jth symbol of the second string. Criteria more general than mere matching can be adopted, e.g., weighted matches. In the most striking cases, just glancing at a binary dot-plot one finds regions of high similarity in the underlying strings: such regions are in fact overwhelmingly exposed through visual impact from dense dot areas clustered along diagonals. Working with a set of strings, the issue becomes one of "merging" multiple pairwise dot-plots into a single, meaningful region of density in the hyperspace determined by that set. A natural problem in this context is that of how to define consistency, i.e., the degree of significance which could be assigned to a dot in one of the planes given the existence of the other dot- plots. Dot-plots cannot be simply superimposed, because of deletions and insertions in the various alignment pairs. The alternative proposed in Vingron and Argos (1991) consists of filtering the individual original plots according to a consistency criterion that results in an easy and elegant algebraic computation. Specifically, assume 3 strings are given and let A(i,2), A(i,3) and Arzj) be the pairwise plots, with obvious meaning. The criterion consists of requiring that a similarity (/', j) in A(i,3) be supported by the appearance, for some k, of dots at (/', k) and (k, j) in Ai\¿) and A¡2,y¡, respectively. This validation of dots translates thus to a standard boolean matrix multiplication and can be performed accordingly. The result is a filtered dot-matrix which is now consistent with the other two, according to this criterion. Filtration of the triple is obtained by subjecting the other two plots to the same treatment. For many strings, the problem reduces to iteration over sets of 3 strings. In Pevzner and Vingron (1995), the problem is modeled as a multipartite graph, and "having a similarity in common" is interpreted in terms of a suitably introduced notion of consistency on multipartite graphs. Finding all there is in common between the dot-plots becomes then the problem of finding the maximum consistent subgraph to a given multipartite graph. Algorithms based on this formalization have successfully performed on biological examples.

10. GENOME COMPARISON AND RELATED SORTING PROBLEMS

As mentioned in Section 2, evolution at the level of the genome seems to involve non-local, large scale operations, which can rearrange a whole segment of a chromosome in a single evolutionary event. These non- local operations are: segment inversion, which replaces a segment of the chromosome with the reverse DNA string; segment transposition, which moves a segment to a new location of the genome; segment translocation, which exchanges prefixes or suffixes of two substrings in the same string; segment duplication, which copies a segment to a new location; finally, segment insertion and segment deletion, with obvious meaning. Notions of distance can be defined that quantify the "evolutionary change" of one genome into the other by means of (subsets of) the above operations [see Sankoff (1992) and Hannehalli and Pevzner and references therein]. The mechanism through which one can define a distance between genomes is the same used to define any other metric on strings (see Section 2). One formally defines and assigns weights to the operations of interest (say, inversion, transposition and translocation). Then, given two genomes, the distance between the two is defined in much the same way as in Section 2, except that the transformation of one genome into the other uses the new set of operations. In what follows, we will assume, for simplicity, that the two genomes we are comparing consist of a single chromosome. 190 APOSTÓLICO AND GIANCARLO

We need some information on how genomes are represented. Most of the current data for genomes is in the form of maps that provide the location of genes along the chromosomes. For a single chromosome containing n distinct genes, a map can be represented as a permutation of the integers 1,2,...,«, e.g., (2, 5, 4, 3, 1) is an example of such a map for a "chromosome" with five genes. Sometimes, the genes have an orientation on the map [the reader is referred to Kececioglu and Sankoff (1994) and Hannenhalli and Pevzner (1995) for motivation]. In that case, each gene is assigned a sign: + or —. So, the oriented version of our example "chromosome" could be (+2, —5, +4, +3, +1). For the time being, we will concentrate mainly on maps in which there is no orientation. Consider two chromosomes we want to compare. When the two maps consist of the same « genes, then the operations of interest are inversion, transposition and translocation. In what follows, we will mainly concentrate on this case and we will define some of those operations formally. We anticipate that all those operations will have unit weights. When the two maps are not composed of the same « genes, then duplication, insertion and deletion are also relevant. However, we will not consider this case.

10.1. Inversion distance

We are given the maps of two chromosomes, each consisting of some permutation of the same « genes. Let these two maps be described by the permutations a = (o\, a2,..., oH) and r = (tj, x2,..., t„). We model the inversion operation described earlier by means of the reversal of the interval [i, j]. Formally, a reversal of the interval [i, j] is the permutation p = i, i + 1,..., j —* j, j — 1,..., i. Applying p to a by the composition o p has the effect of reversing the order of the genes ct, ,_o¡. For instance, let o (2,4, 3, 1 ). A reversal

[2, 3] applied to o would yield the new permutation (2, 3,4, 1). — We are interested in the following problem. Given the permutations o and r, find a series of reversals Pl, p2,..., Pd such that a p\ P2- Pd T, where d is minimum. Parameter d is referred to as the reversal

— distance. Notice that the problem reduces to finding the reversal distance between the permutation

problem, sorting by generate symmetric group Sn, an alternative statement of the problem is: given an arbitrary permutation (¡> of S„, find the shortest product of generators that equals . We let d(n) min^g^ d((p), where d((p) is the reversal distance of from the identity permutation. — Sorting by reversal was posed in 1982 by Watterson et al (1982), who devised also an heuristic algorithm computing it. Very recently, Kececioglu and Sankoff (1995) began a formal study of the computational complexity of this problem and conjectured that it is NP-Hard. NP-completeness of the problem has been settled (Carpara, 1997). The relationship between sorting by reversal and other fairly well understood problems in Computer Science is discussed in Kececioglu and Sankoff (1995). Kececioglu and Sankoff also devised two algorithms. One is a polynomial time approximation algorithm that finds a solution within a factor of two from the optimum. It is the first algorithm for which one can show a performance guarantee bound on the solution. The other one is a branch and bound algorithm that always finds an exact solution to the problem. In order to have a fast convergence of the branch and bound search, one needs to come up with nontrivial lower and upper bounds on the reversal distance d. Both algorithms have been shown to perform well on biologically important data Kececioglu and Sankoff (1995). Bafna and Pevzner (1996) have recently obtained a 7/4 approximation algorithm. In order to obtain that result, they had to study some combinatorial and probabilistic problems on S„ that are interesting in their own right and relevant to molecular biology as well. They proved that d(n) = n 1, settling a conjecture by Gollan [see Bafna and Pevzner (1996) and Kececioglu and Sankoff (1995)]. They— also studied the problem of expected reversal distance between random permutations and showed that it is very close to d(n). That is an indication of the fact that reversal distance provides a good separation between related and non-related sequences in molecular evolution studies. A related problem is the one of sorting signed permutations by reversals. Here the input is still a permutation on (1,2,...,«}, but each element of has a + or sign, e.g., (+1, —5, +4, —3, +2) is a — signed permutation. Now, a reversal [;', j] is defined as above, but it changes also the signs of the elements ,-, 4>i+\_,

10.2. Some other notions ofdistance Another operation for which a distance has been defined is transposition. Its relevance to biology is addressed in Hannenhalli etal. (1994). Again, the two genomes are (unsigned) permutations o and r of the integers in [1, n]. A transposition p(i, j, k) exchanges the "place of the integers" [/', j 1] and [j, it — 1] in the interval — [/', k 1]. For instance p(2, 4, 6) transforms (1, 2, 3, 4, 5) into (1, 4, 5, 2, 3). In general, o p(i, j, k) has — the effect of exchanging o¡,..., a¡-\ with a¡,..., ok-\. The transposition distance between o and r can be defined in the same way as the inversion distance between those two permutations, except that we apply transposition transformations rather than reversals. Again, since transpositions generate the symmetric group S„, the transposition distance between o and r can be reduced to the computation of the transposition distance between

ACKNOWLEDGMENTS

This paper is based on presentations and literature given at the Workshop on "Sequence Alignment" which was held in Princeton, NJ, Nov. 10-12, 1995, as part of the DIMACS Special Year on Mathematical Support for Molecular Biology. The authors gratefully acknowledge the support and dedication of their colleagues in the Workshop Committee, most of whom also served as lecturers delivering the notes that the present paper tried to capture: S. Altschul, D. Brutlag, R. Doolittle, M. Farach, P. Green, D. Gusfield, S. Henikoff, J. Kececioglu, D. Lipman, W. Miller, W. Pearson, P. Pevzner, D. Sankoff, J. Schmidt, G. Stormo, M. Vingron, M. Waterman. In particular, D. Gusfield, J. Schmidt, M. Vingron and M. Waterman provided many useful comments on an early version of this paper, and so did J. Bentley, V. Brimkov, S. Lonardi, and the Referees. Finally, thanks are expressed to DIMACS and to the Steering Committee of its 1994-95 Special Year for devising this topic, stimulating and supporting every aspect ofthat meeting. A. Apostólico was partially supported by NSF Grants CCR-92-01078 and CCR-97-00276, by NATO grant CRG 900293, by British Engineering and Physical Sciences Research Council Grant GR/L19362, by the National Research Council of Italy, by the ESPRIT III Basic Research Programme of the EC under contract No. 9072 (Project GEPPCOM), and by the Italian Ministry of Research. R. Giancarlo was partially supported by MURST Grant "Algoritmi, Strutture di Calcólo e Sistemi Informativi" and by the National Research Council of Italy; part of this work was done while the author was visiting AT&T Bell Labs., Murray Hill, N.J., U.S.A.

REFERENCES

Ahuja, R., and Magnanti, T., and Orlin, J. 1993. Network Flows: Theory, Algorithms and Applications, Prentice Hall. Altschul, S.F. 1991. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Bio., 219:555-565. Altschul, S.F. 1993. A protein alignment scoring system sensitive to all evolutionary distances. J. Mol. Evoi, 36:290-300. 192 APOSTÓLICO AND GIANCARLO

Altschul, S.F, Gish, W., Miller, W., Myers, E.W., and Lipman, DJ. 1990. Basic local alignment search tool. J. Mol. Bio., 215:403^10. Altschul, S.F., and Lipman, D.J. 1989. Trees, stars, and multiple biological sequence alignment. SIAM J. Appl. Math, 49(1): 197-209. Apostólico, A., Atallah, M.J., and Hambrusch, S. 1992. New clique and independent set algorithms for circle graphs. Discrete Applied Mathematics, 37:1-24. Apostólico, A., Atallah, M.J., Larmore, L.L., and McFaddin, S. 1990. Efficient parallel algorithms for string editing and related problems. SiamJ. on Computing, 19:968-988. Apostólico, A., Browne, S., and Guerra, C. 1992. Fast linear space computations of longest common subsequences. Theoretical Computer Science, 92:3-17. Apostólico, A., and Guerra, C. 1985. A fast linear space algorithm for computing longest common subsequences. In Proceedings of the 23rd Allerton Conference, Monticello, 111. Apostólico, A., and Guerra, C. 1987. The longest common subsequence problem revisited. Algorithmica, 2:315-336. Apostólico, A., and Preparata, F.P. 1983. Optimal off-line detection of repetitions in a string. Theoretical Computer Science, 22:297-315. Argos, P., Vingron, M., and Vogt, G. 1991. Protein sequence comparison: Methods and significance. Protein Engineering, 4:375-383. Arratia, R., Gordon, L., and Waterman, M.S. 1986. An extreme value theory for sequence matching. Ann. Stat., 14:971- 993. Arratia, R., and Waterman, M.S. 1994. A phase transition for the score of matching random sequences allowing deletions. Ann. Appl. Prob. 4:200-225. Arratia, R., and Waterman, M.S. 1989. The Erdos-Renyi strong law for pattern matching with a given proportion of mismatches. Ann. Prob., 17:1152-1169. Bafna, V, Lancia, G., and Ravi, R. 1996. A polynomial-time approximation scheme for tree driven SP alignment problem. manuscript. Bafna, V, Lawler, E., and Pevzner, P.A. 1994. Approximation algorithms for multiple sequence alignment. In M. Crochemore and D. Gusfield, editors, Proc. Combinatorial Pattern Matching 94 (Lecture Notes in Computer Science, Vol. 807), 43-53, Berlin, Springer-Verlag. Bafna, V, and Pevzner, P. 1996. Genome rearrangments and sorting by reversals. Siam J. on Computing, 25(2):272-289. Bafna, V, and Pevzner, P. 1993. Approximate methods for sequence alignment, manuscript. Bafna, V, Pevzner, P. 1995. Sorting permutations by transpositions. In Proc. 6th Symposium on Discrete Algorithms, 614-621. ACM-SIAM. Benson, G., and Waterman, M.S. 1994. A method for fast database searches for all k-nucleotide repeats. Nucleic Acids Research, 22:4828^1836. Bodlaender, H.L., and Fellows, M.R., and Hallett, M.T. 1994. Beyond NP-completeness for problems of bounded width: Hardness for the w hierarchy (extended abstract). ACM Sympos. Theroy Comput., 26:449^-58. Carrillo, H., and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5): 1073- 1082. Carrol, L. 1879. A new puzzle. Vanity Fair. Carpara, A. 1997. Sorting by Reversals is Difficult. RECOMB 97: Proceedings ofthe FirstAnnual International Conference on Computational Molecular Biology, pages 75-83. Chan, S.C., Wong, A.K.C., and Chui, D.K.Y. 1992. A survey of multiple sequence comparison methods. Bull. Math. BioL, 54:563-598. Chao, K.M., Hardison, R.C., and Miller, W. 1993. Locating well-conserved regions within a pairwise alignment CABIOS, 4:387-396. Chao, K.M. 1994. Computing all suboptimal alignments in linear space. In M. Crochemore and D. Gusfield, editors, Proc. Combinatorial Pattern Matching 94 (Lecture Notes in Computer Science, Vol. 807), 1-14, Berlin, Springer-Verlag. Chvatal, V, and Sankoff, D. 1975. Longest common subsequence of two random sequences. J. ofAppl. Probab., 12:306- 315. Collins, J.F., and Coulson, A.F.W. 1987. Molecular sequence comparison and alignment. In M.J. Bishop and C.J. Rawlings, editors, Nucleic Acid and Protein Sequence Analysis: A Practical Approach, 323-358. IRL Press, Oxford. Crochemore, M., and Rytter, W. 1994. Text Algorithms. Oxford University Press. Dayhoff, M.O., and Eck, R.V 1968. Atlas of Protein Sequences and Structures. National Biomédical Res. Foundation. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In M.O. Dayhoff, editor. Atlas of Protein Sequence and Structure, 345-352. Nat. Biomed. Res. Found., 5, supp. 3. Dobzhanski, T., and Sturtvant, A.H. 1938. Inversions in the chromosomes of drosophila pseudoscura. Genetics, 23:28-64. Doolittle, R.F. 1981. Similar amino acid sequences: Chance or common ancestry? Science, 214:149-159. Eisner, M., and Severance, D. 1976. Mathematical techniques for efficient record segmentation in large shared databases. J. ofACM, 23:619-635. Feng, D., and Doolittle, R. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal ofMolecular Evolution, 25:351-360. SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 193

Feng, D.F., Johnson, M.S., and Doolittle, R.F. 1985. Aligning amino acid sequences: Comparison of commonly used methods. /. Mol. Evol, 21:112-125. Fischetti, V. Landau, G., Schmidt, J., and Sellers, P. 1992. Identifying periodic occurrences of a template with applications to protein structure. In Third Annual Symposium, CPM 92, Tucson, Arizona, April 29-May 1, 1992. Proceedings. (Lecture Notes in Computer Science, Vol. 644), 111-120. Fitch, W.M.. and Margoliash, E. 1967. Construction of phylogenetic trees. Science, 155:279-284. Fitch, W, and Smith, T.F. 1983. Optimal sequence alignment. Proc Nat Acad Sei USA, 80:1382-1386. Gallager, R.G. 1968. Information Theory and Reliable Communication, John Wiley and Sons. Gatlin, L.L. 1966. The information content of DNA. J. Theor. Biol, 10:281-300. Giancarlo, R. 1997. Dynamic programming-special cases. In A. Apostólico and Z. Galil, editors, Pattern Matching Algorithms, Oxford University Press, NY, 201-236. Goad, W.M. 1989. Sequence analysis- contributions by Ulam to molecular genetics, in: From Cardinals to Chaos. Reflections on the life and legacy of Stanislaw Ulam N.G. Cooper, editor, Cambridge University Press, 288-291. Goldstein, L., and Waterman, M.S. 1994. Approximations to profile score distributions. J. Comp. Biol, 1:93-104. Gotoh, 0.1982. An improved algorithm for matching ofbiological sequences. Journal ofMolecular Biology, 162:705-708. Griffin, A.M., and Boursnell, M.E.G. 1990. Analysis of the nucleotide sequence of DNA from the region of the thymidine kinage gene of infectious laryngotracheitis virus; potential evolutionary relationships between the herpes virus subfamilies. 7. Gen. Virol, 71:841-850. Grumbach, S., and Tahi, F. 1993. Compression of DNA sequences. Data Compression Conference, IEEE Computer Society Press, 340-350. Gusfield, D. 1983. Parametric combinatorial computing and a problem of program module decomposition. J. of ACM, 30:551-563. Gusfield, D. 1993. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol, 55(1):141-154. Gusfield, D., Balasubramanian, K., and Naor, D. 1994. Parametric optimization of sequence alignment. Algorithmica, 12:312-326. Gusfield, D., and Stelling, P. 1996. Parametric and inverse-parametric sequence alignment with xparal. In R. Doolittle, editor, Methods in Enzymology vol. 266. Computer Methods for Macromolecular Sequence Analysis, pages 481^194. Academic Press.

Hamming, R.W. 1950. Error detecting and error correcting codes. Bell System Tech. J., 29:147-160. Hannenhalli, S., and Pevzner, P. Towards a computational theory of genome rearrangement. In I. van Leeuwen, editor, Computer Science Today. Hannenhalli, S., Chappey, C, Koonin, E., and Pevzner, P. 1994. Algorithms for genome rearrangements: Herpesvirus evolution as a test case. In 3rd Int. Conference on and Complex Genome Analysis. Hannenhalli, S., and Pevzner, P. 1995. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). In Proc. 27th Symposium on Theory of Computing, 178-187. ACM. Henikoff, S., and Henikoff, J. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. of Sei., USA, 89:10915-10919. Henikoff, S., and Henikoff, J. 1993. Performance evaluation of amino acid substitution matrices. Proteins: Structure, Function and Genetics, 17:49-61. Hirschberg, D.S. 1997. Algorithms for the longest common subsequence problem. JACM, 24, 4:664-675. Hirschberg, D.S. 1985. A linear space algorithm for computing maximal common subsequences. CACM, ¡8/6:341-343. Hsu, W.J., and Du, M.W. 1984. Computing a longest common subsequence for a set of strings. BIT, 24:45-59. Hunt, J.W., and Szymanski, T.G. 1977. A fast algorithm for computing longest common subsequences. CACM, 20,5:350-353. Irving, R.W., and Fraser, C.B. 1992. Two algorithms for the longest common subsequence of three (or more) strings. In Third Annual Symposium, CPM 92, Tucson, Arizona, April 29-May 1, 1992. Proceedings. (Lecture Notes in Computer Science, Vol. 644), 214-229. Itoga, S.J. 1981. The string merging problem. BIT, 21:20-30. Jacobson, G., and Vo, K.P. 1992. Heaviest increasing/common subsequence problems. In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 92 (Lecture Notes in Computer Science, Vol. 644), 52-66, Berlin. Springer-Verlag. Jiang, T., Lawler, E.L., and Wang, L. 1994. Aligning sequences via an evolutionary tree: Complexity and approximation. ACM Sympos. Theory Comput., 26:760-760. Jiang, T., and Li, M. 1993. Optimization problems in molecular biology. In D.Z. Du and J. Sun, editors, Advances in Optimization and Approximation. Johnson, D.S.. Lenstra, J.K., and Rinnooy Kan, A.H.E. 1978. The complexity of the Network Design Problem. Networks, 8:279-285. Kanehisa, M.I., and Goad, W.B. 1982. Pattern recognition in nucleic acid sequences I: A general method for finding local homologies and simmetries. Nucl. Acid Res., 10:247-264. Kannan, S.K., and Myers, E.W. 1993. An algorithm for locating non-overlapping regions of maximum alignment score. 194 APOSTÓLICO AND GIANCARLO

In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 93 (Lecture Notes in Computer Science, Vol. 684), 74-86, Berlin, Springer-Verlag. Kaplan, H., and Shamir, R., and Tarjan, R.E. 1997. Faster and Simpler Algorithm for Sorting Permutatios by Reversals. RECOMB 97: Proceedings of the First Annual International Conference on Computational Molecular Biology, 163. Karlin, S., and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring systems. In Proc. Nat. Acad. of Sei., U.S.A., volume 87, 2264-2268. Karlin, S., Dembo, A., and Kawabata, T. 1990. Statistical significance of high scoring segments from molecular sequences. Ann. Stat. 18:571-581. Karlin, S., Glandour, G., Ost, F, Tavare, S., and Korn, L.J. 1983. New approaches for computer analysis of nucleic acid sequences. Proc Nat Acad Sei USA, 80:5660-5664. Karlin, S., Mocarski, E.S., and Schachtel, G.A. 1994. Molecular evolution of herpesviruses: genomic and protein sequence comparison. /. Virol, 68:1886-1902. Karlin, S., and Ost, F. 1988. Maximum length of common words among random letter sequences. Ann. Prob., 16:535-563. Edmonds, J., and Karp, R.M. 1972. Theoretical improvements in algorithmic efficiency for network flow problems. /. of ACM, 19:248-264. Kececioglu, J. 1993. The maximum weight trace problem in multiple sequence alignment. In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 93 (Lecture Notes in Computer Science, Vol. 684), 106-119, Berlin, Springer-Verlag. Kececioglu, J., and Sankoff, D. 1994. Efficient bounds for oriented chromosome inversion distance. In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 94 (Lecture Notes in Computer Science, Vol. 807), 307-325, Berlin, Springer-Verlag. Kececioglu, J., and Sankoff, D. 1995. Exact and approximation algorithms for sorting by reversal. Algorithmica, 13(1/2): 180-210. Kececioglu, J.D., and Ravi, R., Of Mice and Men: Algorithms for evolutionary distances between genomes with translocation. In Proc. 6th Symposium on Discrete Algorithms, 604-613. ACM-SIAM. Kolmogorov, A.N. 1965. Three approaches to the quantitative definition of information. Problemi Pederachi Inf., 1. Landau, G.M. 1995. Myers, G., and Schmidt, J.P. Incremental string comparison, manuscript. Landau, G.M., and Schmidt, J.P. 1993. An algorithm for locating tandem repeats. In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 93 (Lecture Notes in Computer Science, Vol.

684), 120-133, Berlin, Springer-Verlag. Lander, E., Mesirov, J.P, and Taylor, W. 1989. Study of protein sequence comparison metrics on the Connection Machine CM-2. J. Supercomputing, 3:255-269. Lawler, E.L. 1976. Combinatorial Optimization, Networks and Matroids. Holt, Rinehart and Winston. Lempel, A., and Ziv, J. 1976. On the complexity of finite sequences. IEEE Trans, on Information Theory, 22:75-81. Lesk, A.M., Levitt, M., and Chothia, C. 1986. Alignment of amino acid sequences of" distantly related proteins using variable gap penalties. Protein Engineering, 1:77-78. Levenshtein, V.l. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Dokl, 6:707-710. Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. 1989. A tool for multiple sequence alignment. Proc. Nat. Acad. Sei. (USA), 86:4412^415. Lipman, D.J., and Pearson, W.L. 1985. Rapid and sensitive protein similarity searches. Science, 2:1435-1441. Maier, D. 1978. The complexity of some problems on subsequences and supersequences. J. Assoc. Comput. Mach., 25(2):322-336. Martin-Löf, P. 1966. The definition of random sequences. Information and Control, 9. McLachlan, A.D. 1971. Tests for comparing related amino acid sequences, cytochrome c and cytochrome c551. J. Mol. Biol, 61:409-424. Meggiddo, N. 1979. Combinatorial optimization with rational objective functions. Math. Oper. Res., 4:414-424. Mevissen, H.T., and Vingron, M. 1997. Quantifying the local reliability of a sequence alignment. Protein Engin., 9:127-132. Miller, W. 1992. An algorithm for locating a repeated region, manuscript. Milosavljevic, A., and Jurka, J. 1993. Discovering simple DNA sequences by the algorithmic significance method. CABIOS, 9:407^111. Myers, E.W. 1986. Incremental alignment algorithms and their applications. Technical Report, U. Az.:l-23. Myers, E.W. 1986. An O(ND) difference algorithm and its variations. Algorithmica, 1:251-266. Myers, E.W., and Miller, W. 1988. Optimal alignment in linear space. Comput. Appl. Biosci., 4:11-17. Kambayashi, Y., Nakatsu, N., and Yajima, S. 1982. A longest common subsequence algorithm suitable for similar text strings. Acta Informática, 18:171-179. Naor, D., and Brutlag, D. 1993. On suboptimal alignments of biological sequences. In Fourth Annual Symposium, CPM 93, Padova, Italy. Proceedings. (Lecture Notes in Computer Science, Vol. 684), 179-196. Needleman, S.B., and Wunsch, CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, 48:433-443. SEQUENCE ALIGNMENT IN MOLECULAR BIOLOGY 195

Palmer, J., and Herbon, L. 1988. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J. Mol. Evol., 27:87-97. Pascarella, S., and Argos, P. 1992. Analysis of insertion/deletion in proteins. J. Mol. Biol., 224:461-471. Pearson, W.L. 1991. Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith- Waterman and FASTA algorithms. Genomics, 11:635-650. Pearson, W.L., and Lipman, W.R. 1988. Improved tools for biological sequence comparison. Proc. Nat. Acad. Sei., USA, 85:2444-2448. Pevzner, P. 1992. Multiple alignment, communication cost, and graph matching. SiamJ. on Applied Math., 52:1763-1779. Pevzner, P., and Vingron, M. 1995. Multiple sequence comparison and consistency on multipartite graphs. Adv. Appl. Math., 16:1-22. Popper, K.R. 1959. The Logic of Scientific Discovery. Hutchinson, London. Mohana Rao, J.K. 1987. New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int. J. Pept. Protein Res., 29:276-281. Risler, J.L., Delorme, M.O., Delacroix, H., and Henaut, A. 1988. Amino acid substitutions in structurally related proteins: A pattern recognition approach. Determination of a new and efficient scoring matrix. J. Mol. Biol., 204:1019-1029. Rivals, E., Delgrande, O., Delahaye, J.-P, and Dauchet, M. 1995. A first step towards chromosome analysis by compression algorithms. In N.G. Bourbakis, editor, First International IEEE Symposium on Intelligence in Neural and Biological Systems (IEEE Computer Society Press), 233-239. Sankoff, D. 1972. Matching sequences under deletion/insertion constraints. In Proc. Nat. Acad. ofSei., U.S.A., vol. 69,4-6. Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math., 28(1):35^12. Sankoff, D. 1992. Edit distance for genome comparison based on non-local operations. In A. Apostólico, M. Crochemore, Z. Galil, and U. Manber, editors, Proc. Combinatorial Pattern Matching 92 (Lecture Notes in Computer Science, Vol. 644), 121-135, Berlin, Springer-Verlag. Sankoff, D. 1996. The early introduction of dynamic programming into Computational Biology, manuscript. Sankoff, D., and Kruskal, J.B., editors. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, U.S.A. Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B.F., and Cedergren, R. 1992. Gene order comparison for phylogenetic inference: evolution of the mitochondrial genome. Proc. Nat. Acad. Sei., USA, 89:6575-6579. Schmidt, J.P 1995. All highest scoring paths in weighted grid graphs and its application to finding all approximate repeats in strings. In Proc. Third Israel Symposium on Theory of Computing and Systems, 61-11, New York, IEEE Computer Society Press. Schuler, G.D.. Altschul, S.F., and Lipman, D.J. 1992. A workbench for multiple sequence alignment. In Protein: Structure, Function and Genetics. Sellers, PH. 1974. On the theory and computation of evolutionary distance. SIAM J. Appl. Math., 26:787-793. Sellers, P.H. 1980. The theory and computation of evolutionary distances: Pattern recognition. J. ofAlg.. 1:359-373. Sellers. P.H. 1984. Pattern recognition in genetic sequences by mismatch density. Bull. Math. Biol., 46:501-514. Smith, R.F. and Smith, T.F. 1990. Automatic generation of primary sequence patterns from set of related protein sequences. Proc. Nat. Acad. ofSei, USA, 87:118-122. Smith, T.. and Waterman, M.S. 1981. Identification of common molecular subsequences. /. Mol. Biol., 147:195-197. Smith, T., Waterman, M.S., and Burks. 1985. The statistical distribution of nucleic acid similarities. Nucleic Acid Research, 13:645-656. States, D.J., Gish, W., and Altschul, S.F. 1991. Improved sensitivity of nucleic acid databases searches using application specific scoring matrices. METHODS: A companion to Method, in Enzim., 3:66-70. Storer, J. 1988. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD. Taylor. W.R. 1986. The classification of amino acid conservation. J. ofTheor. Biol., 119:205-218. Timkovskii. V.G. 1990. Complexity of common subsequence and supersequence problems and related problems. Cybernetics, 25(5):565-580. Ulam, S.M. 1972a. Some combinatorial problems studied experimentally on computing machines, in: Applications of Number Theory to Numerical Analysis S.K. Zaremba, editor, Academic Press, 1-10. Ulam, S.M. 1972b. Some ideas and prospects in biomathematics. Annual Review of Biophysics and Bioengineering, 1:277-291. Vingron, M., and Argos, P. 1990. Determination of reliable regions in protein sequence alignment. Protein Engin., 7:565-569. Vingron, M., and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol., 218:33-43. von Mises, R. 1939. Probability, Statistics and Truth. MacMillan, New York. Wagner, R.A., Fischer, M.J. 1974. The string to string correction problem. J. Assoc. Comput. Mach., 21:168-173. Wang, L., and Gusfield, D. 1996. Improved approximation algorithms for tree alignment. In D. Hirschberg and G. Myers, editors, Proc. Combinatorial Pattern Matching 96 (Lecture Notes in Computer Science), Berlin, Springer-Verlag. Wang, L., and Jiang, T. 1994. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337-348. 196 APOSTÓLICO AND GIANCARLO

Watanabe, S. 1969. Knowing and Guessing. Wiley, New York. Waterman, M.S. 1994. Parametric and ensemble sequence alignment. Bull. Math. Biol, 56:743-767. Waterman, M.S. 1995. Introduction to Computational Biology. Maps, Sequences and Genomes. Chapman Hall. Waterman, M.S., and Byers, T.H. 1985. A dynamic programming algorithm to find all solutions in a neighborhood of the optimum. Math. Biosci., 77:179-188. Waterman, M.S., and Eggert, M. 1987. A new algorithm for best subsequence alignments with applications to tRNA-rRNA comparisons. J. Mol. Biol, 197:723-728. Waterman, M.S., Eggert, M., and Lander, E. 1992. Parametric sequence comparisons. Proceedings of the National Academy ofSciences of the USA, 89:6090-6093. Waterman, M.S. 1989. Sequence alignments. In M.S. Waterman, editor, Mathematical Analysis of DNA Sequences, 53-92. CRC. Waterman, M.S. 1994. Estimating of statistical significance of sequence alignments. Phil. Trans. R. Soc. London (B), 344:383-390. Waterman, M.S., Gordan, L., and Arratia, R. 1987. Phase transition in sequence matches and nucleic acid structure. Proc. Nati Acad. Sei., USA, 84:1239-1243. Waterman, M.S., and Vingron, M. 1994. Sequence comparison significance and Poisson approximation. Statistical Sciences, 9:367-381. Vingron, M., and Waterman, M.S. 1994. Sequence Alignment and Penalty Choice. Review of Concepts, Case Studies and Implications J. Mol. Biol, 235:1-12. Watterson, G.A., Ewens, W.J., Hall, T.E., and Morgan, A. 1982. The chromosome inversion problem. / ofTheor. Biol, 99:1-7. Wilbur, W. J., and Lipman, D.J. 1984. The context dependent comparison of biological sequences. SIAM J. Appl. Math., 44:557-567. Wilbur, W.J., and Lipman, D.J. 1983. Rapid similarity searches of nucleic acid and protein data banks. In Proc. Nat. Acad. Sei. USA 80, 726-730. Zimmer, R., and Lengauer, T. 1997. Fast and Numerically Stable Parametric Alignment of Biosequences. RECOMB 97: Proceedings of the First Annual International Conference on Computational Molecular Biology, 344-353. Ziv, J., and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans, on Information Theory, 23:337-343.

Zuker, M. 1991. Suboptimal sequence alignment in Molecular Biology, alignment with error analysis. /. Mol. Biol, 221:403-420.

Address reprint requests to: Alberto Apostólico Departimento di Elettronica e Informática Universita di Padova Via Gradenigo 6/A 35131 Padova, Italy

axa @ art.de i. unipd. it

Received for publication February 24, 1997; accepted as revised November 9, 1997. This article has been cited by:

1. David A. Morrison. 2006. L. A. S. JOHNSON REVIEW No. 8. Multiple sequence alignment for phylogenetic purposes. Australian Systematic Botany 19:6, 479. [CrossRef] 2. Kristel Van Steen, Benjamin A. Raby, Geert Molenberghs, Herbert Thijs, Mieke De Wit, Monika Peeters. 2005. An equivalence test for comparing DNA sequences. Pharmaceutical Statistics 4:3, 203-214. [CrossRef] 3. Andras Fiser. 2004. Protein structure modeling in the proteomics era. Expert Review of Proteomics 1:1, 97-110. [CrossRef] 4. Alberto Apostolico, Fang-Cheng Gong, Stefano Lonardi. 2004. Verbumculus and the discovery of unusual words. Journal of Computer Science and Technology 19:1, 22-41. [CrossRef] 5. Fern Y Hunt, Anthony J Kearsley, Agnes O??Gallagher. 2004. Constructing Sequence Alignments from a Markov Decision Model with Estimated Parameter Values. Applied Bioinformatics 3:2, 159-165. [CrossRef] 6. Marc A. Marti-Renom, Ashley C. Stuart, Andras Fiser, Roberto Sanchez, Francisco Melo, Andrej Sali. 2000. COMPARATIVE PROTEIN STRUCTURE MODELING OF GENES AND GENOMES. Annual Review of Biophysics and Biomolecular Structure 29:1, 291-325. [CrossRef] 7. RICHARD MOTT, ROGER TRIBE. 1999. Approximate Statistics of Gapped AlignmentsApproximate Statistics of Gapped Alignments. Journal of Computational Biology 6:1, 91-112. [Abstract] [PDF] [PDF Plus] 8. Julie D Thompson, Olivier PochSequence Alignment . [CrossRef]