<<

JOURNAL OFCOMPUTATIONAL BIOLOGY Volume10, Number 6,2003 ©MaryAnn Liebert,Inc. Pp. 803–819

AnEulerian Approach to Global Multiple Alignment for DNASequences

YU ZHANG1 andMICHAEL S.WATERMAN 2

ABSTRACT

With the rapid increase in the datasetof genomesequences, the multiple sequence alignment problemis increasingly importantand frequently involvesthe alignmentof alargenumber ofsequences. Manyheuristic algorithmshave been proposedto improve the speed ofcom- putationand the qualityof alignment. W eintroduce anovelapproach that is fundamentally different fromall currently availablemethods. Our motivationcomes from the Eulerian methodfor fragment assembly in DNAsequencing thattransforms all DNAfragmentsinto adeBruijn graphand then reduces sequence assembly toa Eulerian pathproblem. The paperfocuses onglobal multiple alignmentof DNA sequences, whereentire sequences are alignedinto oneconŽ guration. Our mainresult is analgorithm with almostlinear compu- tationalspeed with respect tothe totalsize (numberof letters) ofsequences tobe aligned. Fivehundred simulated sequences (averaging500 bases per sequence andas lowas 70% pairwise identity) havebeen alignedwithin three minutes onapersonal computer,andthe qualityof alignment is satisfactory.As aresult, accurateand simultaneous alignmentof thousandsof long sequences within areasonableamount of time becomespossible. Data froman Arabidopsis sequencing projectis used todemonstrate the performance.

Key words: multiple sequencealignment, de Bruijn graph,Eulerian path.

1.INTRODUCTION

ultiplesequence alignment is routinely used toŽ ndconserved regions in molecular se- Mquences.When a setof relatedsequences is aligned by multiple sequence alignment algorithms, the hiddencommonalities among sequences are revealed. Thus they provide a criticalcomputational tool to retrievehidden information among huge and exponentially growing genetic data. Amathematicallyoptimal solution of the multiple sequence alignment problem is achieved by using a dynamicprogramming algorithm (Sankoff, 1975; W aterman et al.,1976).Because the time and memory costof thedynamic programming algorithm is exponentialto thenumber of sequences,i.e., 2.LN / where L sequencelength and N numberof sequences, it is impossible to put this method into practical D D

1Departmentof Mathematics,University of SouthernCalifornia, 1042 W .36thPlace (DRB 289),Los Angeles, CA 90089-1113. 2Departmentof Biological Sciences, University of Southern California, Los Angeles, CA 90089-1340.

803 804 ZHANG AND WATERMAN useexcept on a smallset of sequences (see Kececioglu [1993] for extensionsof this approach). Many heuristicalgorithms achieve reasonably good solutions with limited time and memory usage, and some of themhave already been popularly used in modern molecular biology research. Mostcurrently available algorithms are in one of two classes. The Ž rst classis those algorithms using theprogressive alignment strategy (W atermanand Perlwitz, 1984; Feng and Doolittle, 1987). A multiple alignmentis gradually built up by aligning the closest pair of sequences Ž rst andthen aligning the next closestpair of sequences, or onesequence with a setof alignedsequences or twosets of alignedsequences. Thisprocedure is repeated until all given sequences are aligned together .Thekey idea is that the pair of sequenceswith minimum distance is mostlikely to having been obtained from themost recent evolutionary divergenceand that the pairwise alignment of these two speciŽ c sequencesprovides the most “ reliable” informationthat can be extracted. Many programs based on this method exist. Some of themconstruct an alignmentthroughout the entire length of sequences,such as MULTAL(Taylor,1988), AMUL T(Bartonand Sternberg,1987a, 1987b), MSA (Lipman et al.,1989),CLUST ALW (Higginsand Sharp, 1989; Thompson et al.,1994).Other programs try Ž rst toidentify an ordered series of highly conserved regions, then proceedto alignthe intervening regions, such as GENALIGN (Martinez,1988) and ASSEMBLE (Vingron andArgos, 1991). Among these progressive alignment methods, CLUST ALW,probablythe best-known andmost popular program used currently, builds a guidetree from thepairwise alignment scores and mergessubsets of sequences according to the . A recentalgorithm T -Coffee(Notredame et al., 2000) issimilar to CLUST ALWbutuses a consistencymeasure that may reduce the potential errors causedby progressivealignment. Theother class of alignmentalgorithms uses iterative reŽ nement strategies to improvean initialalignment (Sankoff et al.,1976;Sankoff and Kruskal, 1983). The basic idea for iterativemethods is to reŽ ne the initialalignment iteratively by local optimization. In HMMT (Eddy,1995), e.g., the model parameters arereestimated at each iteration. Iteration continues until convergence or reaching the maximal user- deŽned number of iterations. Currently available programs that apply iterative strategies include DIALIGN (Morgenstein et al.,1996),which uses a localalignment approach to construct multiple alignments based onsegment– segment comparisons, where the segments are incorporated into a multiplealignment by an iterativeprocess. The PRRP program (Gotoh, 1996) optimizes an alignment by iteratively dividing the sequencesinto two groups and then realigning them using a globalgroup-to-group alignment algorithm. SAGA (Notredameand Higgins, 1996) uses a geneticalgorithm to select from anevolving population thealignment which optimizes an objective function (OF). AnOF namedCOFFEE (Notredameet al., 1998)is used in SAGA thatmeasures the consistency between the multiple alignment and a libraryof CLUSTALWpairwisealignments. HMMT (Eddy,1995) and SAM (Hugheyand Krogh, 1996) apply a stochasticiterative strategy that maximizes the probability based on a hiddenMarkov model (HMM), but theyare often used to reŽ ne prealigned sequences (Notredame, 2002). Thenumerous alignment programs mentioned above have their own advantages and drawbacks. However, thereare some common issues for allof thecurrently available alignment algorithms: 1) robustness under certainconditions, such as gap-richregions, repeat-rich regions, etc.; 2) local optima problems, especially for iterativemethods; 3) timeefŽ ciency and memory usage. The time cost for allcurrent algorithms is at bestproportional to the square of the number of sequences to be aligned. Althoughmany heuristic algorithms have been applied to alleviatethe huge time and memory expenses, thosecosts remain a barrierfor practicalapplication when confronted with thousands of inputsequences, or millionsof letters in each sequence, which are not unusual numbers even in bacterial genomes. Certainly, handlinghundreds of sequences with lengths in the thousands is a worthygoal.

2.MOTIV ATION

Thispaper presents a novelapproach to the global multiple DNA sequencealignment problem using Eulerianpaths. W ecallthe method “ EulerAlign.”Themost signiŽ cant advantages of EulerAlign are its lineartime and memory cost with respect to the total size of sequences to be aligned and improved alignmentaccuracy. Our motivationcomes from thealgorithm for fragmentassembly in DNA sequencingusing the Eulerian superpathapproach (EULER) (Iduryand W aterman,1995; Pevzner et al.,2001).In that algorithm, a ANEULERIAN PATHAPPROACH 805

FIG. 1. Anextreme version of fragment assembly is multiple sequence alignment. fragmentassembly problem is Žrst reducedto aneasy-to-solve Eulerian path problem in thede Bruijngraph, andan Eulerian path is found that corresponds to the consensus sequence of the genome. A signiŽcant contributionof theEulerian approach is that it discards the traditional “ overlap-layout-consensus”outline. Inanother words, it obtainsa consensussequence before knowing an alignment and without doing pairwise alignments. Aglobalmultiple alignment problem is in fact an extreme version of the fragment assembly problem. As shownin Fig.1, ifalmost all fragments come from thesame region of agenome,then EULER should outputa consensussequence of that region. For multiplesequence alignment, if this consensus sequence isthe closest one to all given sequences, one would expect to achieve an accurate alignment using a consensusscoring scheme. Notethat the idea is similar to the star method: given N sequencesto be aligned, the star method Žrst computesthe alignments of all sequence pairs and picks one sequence among N sequencesas the consensusthat is closestto all other sequences. EulerAlign attempts to Ž ndsucha consensusthat consists offragmentsof the N sequencessuch that the conserved regions are ampliŽ ed and noise is suppressed. To understandthis, assume all input sequences are derived from acommonancestral sequence; then EulerAlign isused to Ž ndthis ancestral sequence. By comparing each sequence with the ancestral, it will be able to distinguishbetween conserved letters and mutations. Consequently, the global multiple alignment will be unambiguouslyconstructed. Acrucialrequirement of EULER is the almost “ error-free”data that is obtained by an error-correction procedure.In addition, EULER doesn’ t evaluatethe quality of its consensus path until the Ž nalquality assessmentstage. For themultiple alignment problem, however, people are more interested in aligning distantlyrelated sequences, and for EulerAlign,the consensus sequence obtained must be “accurate.”One optionis to doerror-correctionaggressively so all“ errors”(differences between sequences) are eliminated, butthis is not likely to succeed when aligning distantly related sequences. Instead, EulerAlign applies a somewhatdifferent procedure than does EULER. EulerAlign transforms the initially tangled graph into a directedacyclic graph (DAG), andduring the transformation it tries to retain k-tuplescommon to several sequences.In addition, the weight for eachedge is assigned in such a waythat the consensus sequence hasthe maximum weight. By doing these two steps, a consensus-Žnding problem is transformed into a heaviestpath problem.

3.THE ALGORITHM

Abriefoutline of our approach is as follows: (1) Constructa directedde Bruijn graph using all k- tuplesfrom thesequences to be aligned. (2) Transformthe de Bruijn graph to a DAG. (3) Extracta 806 ZHANG AND WATERMAN

FIG. 2. A k-tuplerepresented by two vertices and one directed edge connecting them.

consensuspath from theDAG accordingto the weights of edges. (4) Dofast pairwise alignment between theconsensus path and each input sequence. (5) Constructthe Ž nalmultiple alignment according to the pairwisealignments. The central part of EulerAlign is the transformation from theconstructed de Bruijn graphto a DAG thatretains the commonalities among sequences in its edges and allows us to easily constructthe consensus path. Weshouldemphasize that EULER combines all reads and their reverse complements into the graph since,in assembly, fragment orientations are unknown, whereas in our case the orientations are given.

3.1.Graph construction A k-tuple is k consecutiveletters in a sequence.Each k-tuplein givensequences can be transformedinto astructureof two vertices connected by a directededge, where the vertices are the .k 1/-tuplescontained ¡ in the k-tupleand the edge represents the k-tupleitself with the direction from thepreŽ x .k 1/-tuple to the sufŽ x .k 1/-tuple.For example,a 5-tuple“ CCTTA”containstwo 4-tuples ‘ CCTT’+‘ CTT ¡ A’with ¡ overlappingletters “ CTT.”Each4-tuple is a vertexand the directed edge connecting them represents the 5-tuple“ CCTTA.”This structure is the basic unit in our de Bruijn graph (Fig. 2). Our deBruijn graph is constructedby collecting all k-tuplesamong the sequences. Each k-tuplestructure isadded into the graph successively, and if theincoming k-tuplehas vertices or an edge identical to the verticesor edgesin the alreadyexisting graph, then they are merged together .Sequenceinformation is stored inedges.An example is shownFig. 3. Underthis construction, each k-tuplefrom thesequences corresponds

FIG. 3. Constructionof the de Bruijn graph for CCTT AG and k 5. D ANEULERIAN PATHAPPROACH 807

FIG. 4. Anexample of theinitial de Bruijngraph. Branches and loops are due to the differences and repeats among sequences.All thick edges have multiplicities larger than 1. .N 10; L 60bp; k 6/. D D D toa uniqueedge in the graph, and each .k 1/-tuplecorresponds to a uniquevertex. Consequently, each inputsequence is represented by a paththrough ¡ this graph. Inorder to Ž ndtheconsensus path, which is not necessarily in oneof thegiven sequences, we deŽne the multiplicityof an edge to be the number of sequences containing this edge. Obviously, an edge (of Žxed length)with high multiplicity is more likely to represent a consensus(Fig. 4). By deŽ ning multiplicity in thisway, EulerAlign can Ž nda reasonableconsensus path and avoid being biased by repeats. Aconsensuspath is deŽ ned as an acyclic path through the graph. Each edge is visited at most once, andthe weight of thispath (according to the multiplicities and lengths of edges)is maximized. In another words,Ž ndingconsensus is a heaviestpath problem, which is identical to a shortestpath problem with negativeweights. A shortestpath problem with negative weights can be solved in linear time when the graphis aDAG. Our heuristicis totransform the graph to a DAG suchthat we canŽ ndthe consensus in lineartime.

3.2.Transforming the graphto aDAG Becauseof therandom matches, repeats, and mutations in DNA sequences,the initial graph is extremely tangled.A tangleis deŽ ned to be a vertexthat has more than one incoming or outgoing edge. As the numberof incomingor outgoing edges in a tangleincreases, so do the chances that the tangle will result incycles. The goal is toeliminate as many cycles as possible while keeping all the similarity information amongsequences untouched (represented by edges with high multiplicities). Note that only cycles are our targets,while tangles are not.

Claim: DeŽ ne a left edge for vertex vi tobe an edge that points to vi , denoted as E v , and a right ! i edge for vertex vi tobe an edge that starts from vi andpoints to another vertex, denoted as Evi . If a n ! vertex vi hastwo or more left edges E v n 1;2;::: thatare contained in the same sequence path, then f ! i g D theremust exist a cyclein the graph and vi isavertexon the .

Proof. If apathof a sequencecontains two left edges E1 and E2 of v ,thenwhen walking though vi vi i 1 ! ! 2 this path, vi willbe visited when visiting E v , and vi willbe visited again when visiting E v . Since ! i ! i thepath is well connected, it follows that this path contains a cyclewith vi asa crossvertex. Thus, the graphcontains a cycle.

Notethat the reverse of this claim is not true; i.e., after removing all cycles found by this procedure, theresulting graph is not necessarily a DAG. Buta goodpoint is that after removing all such cycles, in ourdata the acyclic consensus we foundwas satisfactory. 808 ZHANG AND WATERMAN

Thebasic rules of our transformation are similar to those of EULER (Pevzner et al., 2001):

(1) If acycleis found by theabove procedure, i.e., the correspondingsequence path is ofform “ E1 E1 vi vi 2 2 1 2 ¢ ¢ ¢ ! ! E v Ev ,”andthe right edges Ev and Ev aredifferent, then we canmake two superedges ¢ ¢ ¢ ! i i ! ¢ ¢ ¢ i ! i ! (anedge that represents two short continuous edges connected by acommonvertex) E1 and E2 vi vi 1 1 2 2 ! ! ! ! to replace E v , Ev , E v , Ev ,suchthat the cycle is eliminated (Fig. 5a). ! i i ! ! i i ! (2) Ina situationsimilar to (1), but now E1 and E2 arethe same, denoted as E ,we againmake two vi vi vi 1 2 ! ! ! superedges E v and E v byseparating Evi suchthat the cycle is simpliŽ ed but not eliminated ! i ! ! i ! ! (Fig.5b). In this case, the sequence information stored in Evi ispartitioned into two superedges 1 2 ! E v and E v accordingto which left edge of vi theycome from, andthe multiplicities for ! i ! ! i ! superedges E1 and E2 arecomputed. This is distinct from EULER,because we knowthe vi vi entiresequences ! !and in most ! !cases it is easy to partition and assign sequences by looking at the correspondingleft/ rightedges.

1 Becausewe aredoing multiple sequence alignment, many other sequence paths may visit vi from Ev , i ! E2 orother edges and leave v indifferent directions. Thus, one issue concerning our graph transfor- vi i mation! is whether the transformations lead to the loss of similarity information and whether the order in whichwe performtransformations determines whether we willretain the maximum similarity information. Atransformationis safe ifthe transformation does not introduce the loss of similarity information. An exampleis shown in Fig. 6a. Otherwise the transformation is unsafe, as shown in Fig. 6b. W econclude thatunder the safe transformation,the desired consensus path will be well maintained. As aresult,the currentversion of EulerAlign attempts to remove all cycles by performing safe transformationsŽ rst and leavesall unsafe transformations for later. InFig. 7 isa deBruijn graph transformed from theoriginal graph in Fig. 4. Note that a loopin the originalgraph is removed, and it is a DAG now.The thick path represents our heaviest consensus path. Intuitively,the total number of possiblepaths in the graph is decreasing as transformations are performed, andthus we havefewer choicesof paths. However, when safe transformations are performed Ž rst,the desiredconsensus path is well maintained, and the number of necessary unsafe transformations to obtain thedesired graph is minimized.On the other hand, it isworthmentioning that safe transformationscannot eliminateall cycles. In some situations, no transformation is safe. But we willshow by example that, in

FIG. 5. (a) A tangle at vi iseliminated by making a copy vi0 of vertex vi . (b) A tangle at vi iseliminated by making a copy vi0 of vertex vi ,andseparating Evi .Thecycle is not eliminated yet, due to vertex vj .Whenthis step is repeatedseveral times, the cycle will be Ž nally! eliminated. ANEULERIAN PATHAPPROACH 809

FIG. 6. Examplesof safe/ unsafetransformation. Dashed lines show the directions of sequencepaths. Multiplicities are 1 1 shownon each edge. ( a)Differentsequences (with different grey scales) share E v and Ev .Aftertransformation, ! i i ! theyremain unchanged. ( b)Aftertransformation, the similarity information in E2 islost. Thus it is an unsafe vi transformation. ! ourexperience, the global multiple alignments constructed by EulerAlign are indeed more accurate than ofothermethods.

3.3.Consensus path and Ž nalalignment After performingthe above transformations, we applya greedyalgorithm to Ž nda heaviestpath within lineartime. The weight for eachedge is proportional to its multiplicity and length. Although the greedy algorithmdoes not guarantee optimal solutions, the results are satisfactory. Note that we canalways refer backto therigorous heaviest path algorithm by arbitrarily removing all edges that complete a cycle(other thanthe cycles deŽ ned by ourclaim) in linear time and thus guarantee an optimal solution for thisversion oftheproblem.

1

2

0 3 1 4

2 5 8

0 9 4 3 6 10 8 9 8 7 9 5 8 9

6 9

7

8

9

FIG. 7. ADAG achievedby doingsafe transformation. The bold path is the desired consensus path, and multiplicities areshown on each edge. 810 ZHANG AND WATERMAN

Our nextstep after obtaining a consensuspath is quitestraightforward. EulerAlign uses a bandedpairwise alignmentalgorithm (the positional shifts between two candidate letters in two sequences are bounded by aconstant)to align the consensus sequence with each input sequence and then combines the alignments to constructthe Ž nalglobal multiple alignment. As mentionedbefore, this alignment attempts to maximize theconsensus score, and it is most appropriate for thecase that all given sequences are derived from a commonsequence. However, the algorithm performs wellwith sequences derived from anevolutionary process(see resultssection).

4.PROBABILITY ANAL YSIS

Assumeall input sequences are derived from acommonancestral sequence S0.Wewillshow that when thenumber of input sequences increases, the consensus path found in the graph will be asymptotically “identical”to S0. Let N denotethe number of sequences to be aligned, L denotethe average sequence length, k denote thesize of the k-tupleused in construction of thegraph, and ¯ denotethe mutation rate (by this we mean theprobability of a substitutionor indelin a sequenceat a givenposition). Recallthat the consensus path is the heaviest path in the graph weighted by the multiplicity and the lengthfor eachedge. An edge weight is denoted as W.e/.Tobesimple, deŽ ne W.e/ multiplicityof D theedge. Then, with no mutations along the evolution history, all N sequencesare exactly identical to S0,andthe multiplicity for anyedge is N.Withmutations present, there will appear many single edges correspondingto themutated k-tuples,and the weight of edges in S0 (denoted as W.e S0/)willdecrease. 2 Wewilluse the Large Deviation Theorem (L.D.T.) for Binomials(see Arratiaand Gordon [1989] for an exposition)to estimate the expected min W.e S0/ basedon thegiven parameters .N; L; k; ¯/ as deŽ ned f 2 g above.If thisexpected minimum weight is signiŽ cantly larger than the expected max W.e S0/ , our f 62 g consensuspath should consist of edges presented in S0 andhence be accurate. Approximationby L.D.T . Considera binomialrandom variable X B.N; p/, where p isthe proba- » bilityof success. L.D.T .for Binomialsis used to approximate the probability of x ormore successes in N independenttrials when the speciŽ ed fraction of successes, ® x=N, satisŽ es 0 < p < ® < 1: DeŽ ne p.1 ®/ ´ ® 1 ® theodds ratio r ¡ andthe relative entropy H H.®; p/ .®/log. / .1 ®/log. ¡ /. Then D ®.1 p/ ´ ´ p C ¡ 1 p L.D.T.givesthe following ¡ approximation (Arratia andGordon, 1989): ¡

1 1 NH P .X ®N/ e¡ , as N : (1) ¸ » 1 r £ p2¼®.1 ®/N £ ! 1 ¡ ¡ Inour case, we deŽne a “success”to be a k-tuplethat contains at least one mutation; then the number of th mutated k-tuplesat the i positionamong N sequencesis a randomvariable Xi witha binomialdistribution B(N,p), where p P .the k-tuplehas mutations / 1 .1 ¯/k.For example,if ¯ 0:05 and k 16, D D ¡ ¡ D D then p 0:56.Applying Equation (1), it can be shownthat for Žxed ® and i , P .X ®N/ convergesto 0 D i ¸ quickly as N increases.Therefore, for Žxedsequence length L andassuming the independence of k-tuples L k 1 in S0, P .max Xi ; i ®N/ 1 P .Xi < ®N; i/ 1 P ¡ C .X1 < ®N/ 1 .1 P .X1 L k 1 f 8 g ¸ D ¡ 8 D ¡ D ¡ ¡ ¸ ®N// ¡ C 0, as N ncreases.So max Xi i allpossible positions in S0 isasymptotically less than ®N, and!thus min W.e S / N f maxj XD i isasymptotically greater g than .1 ®/N as N f 2 0 g D ¡ f i j 8 g ¡ increases.For example,if ¯ 0:05 and k 16,weights of alledges in S0 areasymptotically greater than D D 0:44N. Onthe other hand, we estimate max W.e S0/ bythemaximum number of identical k-tuplesmatching f 62 g bychance, and if it is larger than min W.e S0/ ,ourconsensus will be biased. Assume the sequences arerandomly generated with uniformly fdistributed 2 gletters and all N .L k 1/ k-tuplesare independent; £ ¡ C leta successbe a k-tupleidentical to a predeŽned k-tuple,with p 4 k.Underthese assumptions, the D ¡ numberof identical k-tuplesmatching the predeŽ ned k-tupleby chance is a binomialrandom variable Y B.N.L k 1/; p/.Weusuallypick a sufŽciently large k so that 4k NL.ByEquation (1), the » ¡ C ¸ expectednumber of events with at least ®N.L k 1/ k-tuplesmatching by chance can be estimated ¡ C by 4kN.L k 1/P .Y ®N.L k 1// 4kpN.L k 1/e N.L k 1/H .Thisnumber converges ¡ C ¸ ¡ C / ¡ ¡ ¡ ¡ C to0 quicklyas N increases.Therefore, the maximum number of identical k-tuplesmatching by chance is ANEULERIAN PATHAPPROACH 811 asymptoticallyless than ®N.L k 1/ for any ® p.Thisimplies that max W.e S0/ isasymptotically ¡ C ¸ f k 62 g 1 less than ®N.L k 1/.As mentioned,usually we choose k to satisfy p 4¡ .NL/¡ , so that ® can besufŽ ciently small ¡ Cand the inequality max W.e S / < ®N.L k 1D/ < min· W.e S / holds for f 62 0 g ¡ C f 2 0 g large N. Theabove computation is over-simpliŽ ed and many assumptions are unrealistic. But we hopeit demon- stratesthe ability of ourheaviest path method to discoverthe underlying ancestral sequence. Furthermore, theapproximation by L.D.T .for Binomialsprovides a guidancein choosing tuple-size k, for boththe lower-boundand the upper-bound.

5.SIMULA TIONRESUL TS

EulerAlignis tested on a SUN UltraSPARC750MHz(CPU speed)workstation for simulatedsequences, rangingfrom 6to500 sequences. Each sequence is from 200to 1000 bp long,and the similarities between sequencesare from 90%to 70%, i.e., the mutation rate per base ranges from 5%to 16%. Substitutions, insertions,and deletions are uniformly and randomly distributed in all sequences, and in some cases accordingto an evolutionary tree model. Weusethree different ways to evaluate alignments:

(1) Sum-of-pairs(SP): Oneof the typical scoring schema. Although this score will not be appropriatewhen bothclosely related and distantly related sequences are present, it is appropriate for theequidistant sequences(type A sequencesmentioned below). The SP scoreis normalized by the number of pairs N.N 1/ 2¡ . (2) AligningAlignment Score (AA): From anancestral sequence and a phylogenetictree along which all homologoussequences are derived, one can unambiguously construct the TRUE multiple alignment by followingthe divergence history along the tree. By usingsimulated data, we knowexactly the ancestral sequenceand the tree by which those mutations are generated. Thus, the true multiple alignment is known.By measuring the distance between the true alignment and the alignment obtained by different methods,respectively, one can tell which method gives a betteralignment. Aligning Alignment serves asageneralizedversion of pairwisealignment by regardingeach column in one alignment as a“letter” anddeŽ ning a scoringfunction for theseletters (W atermanand Perlwitz, 1984; W aterman,1995). The AAscoreis normalized by N. (3) Identity(ID): Anidentity measure between a givenalignment A andthe true alignment A¤ mentioned ij k;i

(A) Allsubstitutions, insertions and deletions for eachsequence AEi , i 1 N aregenerated randomly anduniformly on apredeterminedsequence A , where A israndom. D The ¢ ¢ ¢resulting sequences exactly E0 E0 Žttothe consensus alignment model. (B)Firstan evolutionary tree, in which the branch lengths are all equal, is generated, and then different mutations,with equal probability on each sequence based on its parent sequence from thetree, aregenerated. Although this tree is not correct from anevolutionary point of view because of the equal-lengthbranches, it introduces mutation dependence into sequences and creates sequences that shouldŽ tthemodel of CLUSTALW.Moreimportantly, we wantto testEulerAlign on nonequidistant sequences.

Tables1 and2 comparethe results from EulerAlignand CLUST ALWonboth types of sequences, with various N, L,andmutation rates. The alignment scoring function is as follows: match 0; mismatch D D 812 ZHANG AND WATERMAN

Table 1. Multiple Alignment Comparison between EulerAlign and CLUSTALW on Equidistant Sequences (Type A)

Sequences SP AA ID Time/sec

N L MutationEul Clw EulClw Eul Clw Eul Clw

6 200 5.2% 48 49 19 2997.4% 93.8% 1 2 10 200 5.2% 53 58 13 3798.2% 91.7% 1 3 15 200 5.2% 52 66 17 4697.7% 91.6% 3 4 50 200 5.2% 53 119 14 8397.9% 85.1% 4 32 6 500 5.2% 120 121 34 5597.9% 96.0% 3 6 10 500 5.2% 135 148 35 8397.8% 92.4% 4 14 15 500 5.2% 130 157 41 9297.5% 92.1% 8 25 50 500 5.2% 141 309 44 23097.3% 84.1% 11 189 100 500 5.2% 131 396 39 29897.7% 81.8% 22 643 500 500 5.2% 130 719 38 64097.8% 64.4% 132 14348

6 200 10.5% 113 103 57 10190.8% 76.6% 1 1 10 200 10.5% 140 118 91 12986.0% 70.8% 1 2 15 200 10.5% 133 140 71 14989.7% 68.2% 3 4 50 200 10.5% 146 188 82 20086.9% 59.5% 4 29 6 500 10.5% 335 283 191 26488.3% 77.3% 3 7 10 500 10.5% 334 332 201 31787.1% 72.0% 3 14 15 500 10.5% 326 339 182 35788.8% 70.3% 7 25 50 500 10.5% 380 493 268 50683.3% 61.2% 9 183 100 500 10.5% 380 580 245 60684.5% 55.9% 19 628 500 500 10.5% 349 751 195 86588.1% 43.6% 113 14093

6 200 16.4% 232 154 334 28251.5% 45.6% 1 1 10 200 16.4% 237 163 236 26064.2% 44.1% 1 2 15 200 16.4% 252 190 289 28856.8% 43.1% 2 3 50 200 16.4% 262 211 287 33259.0% 34.0% 3 19 6 500 16.4% 593 389 666 62359.2% 48.5% 3 6 10 500 16.4% 591 429 634 65261.2% 43.8% 3 12 15 500 16.4% 621 459 702 70958.3% 42.4% 5 21 50 500 16.4% 627 541 686 81859.7% 36.4% 7 129 100 500 16.4% 644 565 726 87557.7% 32.7% 16 397 500 500 16.4% 628636 685 1012 60.9% 29.2% 98 8514

1; gapopen 4;gapextension 1,an afŽ ne distance measure. The relations between mutation rates and D D sequencesimilarities are 5 :2% 90%, 10:5% 80%; 16:4% 70%. » » » As shownin the tables, the quality of our alignments for bothequidistant sequences (T able1) and nonequidistantsequences (T able2) is generally better than CLUST ALW,especiallyfor large N, although typeB sequencesŽ tCLUSTALW’smodel.More signiŽ cantly, the quality of EulerAlign’ s alignmentsis almostinvariant with N. When N is small ( 10),our SP scoresare worse thanCLUST ALW,butAA and · IDscoresare generally better .Finally,the time cost of EulerAlignis approximatelylinear with respect to N.Figure8 showstwo almost identical alignments obtained by EulerAlign and CLUST ALW. Wealsocompared EulerAlign with a programusing hidden Markov models, SAM (Hugheyand Krogh, 1996),for whichthe simulated equidistant sequences perfectly Ž tthemodel. It turns out that SAM outper- forms CLUSTALWonequidistant sequences when evaluated by the AA scoringscheme. But EulerAlign worksat least as well as SAM, and we believethat EulerAlign will be superior to SAM whenthe input sequencesare more complex. In addition, with some modiŽ cation, our method can be extended to many otherapplications, including multiple local alignment and repeat Ž nding,which will be presentedin another paper. ANEULERIAN PATHAPPROACH 813

Table 2. Multiple Alignment Comparison between EulerAlign and CLUSTALW on Nonequidistant Sequences (Type B)a

Sequences SP AA ID Time/sec

N L MutationEul Clw Eul Clw Eul Clw Eul Clw

6 500 5.2% 110 110 36 61 98.1% 95.2% 3 6 10 500 5.2% 162 172 71 109 96.4% 91.1% 4 13 15 500 5.2% 163 180 60 120 96.6% 90.9% 6 31 50 500 5.2% 138 242 48 160 97.1% 88.8% 11 181 100 500 5.2% 151 399 50 318 97.2% 80.1% 24 631

6 500 10.5% 236 219 91 175 94.4% 84.9% 3 7 10 500 10.5% 380342 280 375 83.3% 69.2% 4 14 15 500 10.5% 356340 298 338 83.8% 76.0% 7 25 50 500 10.5% 397494 283 501 83.2% 63.0% 10 181 100 500 10.5% 345582 214 610 86.9% 60.3% 18 623

6 500 16.4% 527362 571 557 65.2% 52.2% 2 6 10 500 16.4% 550396 629 647 63.9% 47.8% 4 13 15 500 16.4% 583438 613 635 66.1% 48.8% 4 24 50 500 16.4% 626540 697 838 59.7% 38.3% 8 119 100 500 16.4% 601623 649 868 62.6% 39.2% 17 500

a“Eul”is EulerAlign, Clw is CLUSTALW, N is thenumber of sequences tobe aligned, L isthe average lengthof each sequence, and“ mutation”the mutation rate, correspondingto the similarity between sequences.

FIG. 8. Anexample of alignmentsfrom EulerAlign (E) and CLUST ALW(C).T otal8 sequenceswith average length 60 bases and 90%identities are aligned. The visual tool used is ClustalX (Thompson et al., 1997). »

6.COMPUT ATIONALCOMPLEXITY

From theresults shown in T ables1 and2, the time cost for ouralgorithm is modest. When N 500 D and L 500,EulerAlign runs three minutes while CLUST ALWruns3 4hours.The time cost for EulerAlignD is O.NL):(i) for constructionand transformation of the graph, » it is O.NL/, where NL » » isthe total size of sequences, (ii) for Žndingthe heaviest path, it is O. 6 NL/,becauseeach vertex » j j connectsat most 2 6 othervertices, where 6 isthe size of the alphabet set 6,(iii)for banded j j j j pairwisealignment, it is O.NL/. » 814 ZHANG AND WATERMAN

Inaddition, EulerAlign’ s memoryusage is also efŽ cient, because the size of our de Bruijn graph constructedby k-tuplesis proportional to the size of input sequences, say, O.kNL/,andthe memory for bandedpairwise alignment is O.L/.Consequently,the total memory usage is O.kNL/.

7.APPLICA TIONON ARABIDOPSISSEQUENCES

Wehaveapplied EulerAlign on the results of resequencing a portionof the Arabidopsis genome from 96individuals in order to Ž ndpolymorphism. Data is provided by Prof. M.Nordborgand his group at USC. Eachindividual genome is sequenced from boththe forward andthe reverse complement strands, andeach letter has a qualityvalue computed by Phred (Ewing and Green, 1998). W eimplementquality valuesinto EulerAlign by twosteps: Ž rst,we redeŽne theweight of edgesbased on qualityvalues; second, we applyquality values in the scoring matrix of pairwise alignment. TodeŽne theweight of edgesfrom qualityvalues, we needto combine three types of information:edge length,multiplicity, and quality values of all letters in the edge. Recall that the Phred score is deŽned as Q D 10 log.P.® isincorrect ® observed//,andthus a higherquality value means a moreaccurate base-call ¡ £ j ofthe letter .Regardlessof the genetic variations, sampling sequences from N individualsis equivalent tosequencing one individual N times. Let G denotethe individual being sequenced and Si denote the sequenceobtained from the ith sequencing.If a k-tupleis calledin n ( > 1)sequencingprocesses from the sameposition, the probability that the k-tupleis indeedin G shouldbe higherthan the probability computed i;j from onesequencing result. Based on this fact, let Te denote the k-tuplein anedge e and let Te denote the actualoccurrence of Te in Si at position j.WeredeŽne the weight function of an edge e withmultiplicity n .n/ bycalculating the probability P P .Te G Te calledin n sequencesfrom thesame position in G / Te D 2 j (we assumeall k-tuplesin edge e arecalled from thesame position in G,becauseour graph is an alignment)and the mean value P 1 6 P .T i;j G T i;j called/, where M isthe total number of NTe M i;j e e D 2 j .n/ Te calledat any position in any sequence, and Ž nallytaking the log ratio P divided by PT . That is, Te N e P .n/ W .e/ log Te .Alargerweight means the k-tuplesare more accurate. By assuming the independence P D NTe ofletters, we calculatethis log ratio for eachletter in Te individually,where the quality values for each letterfrom Phredcan be directly applied, and then do a summing.Since each edge is constructed from overlapping k-tuples,the overlapping parts among different edges will be counted only once; i.e., during theheaviest path Ž ndingprocess, the weight for eachedge is adjusted to disregard the overlapping parts. Followingthe above formulation, the heaviest path weight is the log ratio of the actual probability that thepath is correct divided by the expected probability that the path is correct, and the heaviest path will consistof edgesrepresenting the most accurately called k-tuples.W ealsotried other formulations for the weightfunctions of edges, such as log likelihood ratio models, and obtained similar results. Whendoing pairwise alignment, instead of using an arbitrary scoring matrix, we applythe quality values tocomputean average score for aligningtwo letters. The quality values for eachsequence are given, while thequality values for consensuscan be computed from thepath, or alternativelyone can deem the consensus completelycorrect. The average score of aligningtwo letters ®, ¯ isthencomputed as S 6S P , N®¯ D ®0¯0 ®0¯0 where ® and ¯ areall possible letters, S isa predeŽned score, and P P.® and ¯ aretrue letters 0 0 ®0¯0 ®0¯0 D 0 0 j Observations / iscomputed from bothquality values of letter ® and ¯.Gappenalties remain arbitrary in thecurrent version of EulerAlign.Figure 9 showspart of thedataset At_000000072 aligned by EulerAlign. Wecomputedthe alignments of Ž vesequence sets by EulerAlign and CLUST ALWrespectively.T o applyquality values in CLUST ALW,we replacedDNA lettersof different qualities by different protein alphabets.For example,the letter A ofquality40 was replacedby W ,whereasA ofquality 20 was replaced byV .Thenwe providedCLUST ALWwitha speciŽc scorematrix to do the alignment. This work was doneby Tina Hu in Prof. M.Nordborg’s groupat USC. For thesake of consistency, we useda modiŽed versionof sum-of-pair scores to measure the alignments computed with quality values. That is, for a match/mismatch,we computedan average match/ mismatchscore as described before; for anindel, the gappenalty is inversely related to the quality value of the corresponding letter .Totestthe robustness of EulerAlign,we alsocomputed the alignments without using quality values and compared them using the typicalsum-of-pair score. T able3 showsthe results. ANEULERIAN PATHAPPROACH 815

FIG. 9. Partof EulerAlign’s alignmentof At_000000072.The visual tool used is ClustalX(Thompson et al., 1997).

Thealignments by CLUSTALW usingquality values are worse thanthose by EulerAlignon allsequence sets.W etesteddifferent score matrices and gap penalties but observed similar results. Since each individual genomewas sequencedfrom boththe forward andthe reverse complement strands, the asymmetric base- callsbetween two strands may complicate the alignment. In addition, the quality values in each sequence varygreatly. These observations imply the necessity to combine the forward andthe reverse complement strandsinto one sequence before doing alignments. Onemay argue that one can simply choose the sequence with the highest average quality value as the consensus.Doing this, in addition to its limited  exibility,gives results that are surprisingly bad compared tousing the consensus obtained from thegraph, although the two consensus sequences are very similar toeach other .Totestwhether the result is due to the poor quality regions, we cutall sequences at both endsto obtain a setof putative sequences that should align properly with high quality values. Although theresult is much better than that when using full length sequences, the conclusion remains unchanged. As shownin Fig.10, the maximal normalized multiple alignment score is 0.655 (similarity score) by using eachsequence as the consensus respectively, whereas we get0.693 by using our graph consensus. On the

Table 3. Multiple Alignment Comparison between EulerAlign and CLUSTALW on Arabidopsis Sequencesa

With qual Withoutqual

Sequences N L EulClw EulClw Time byEul

At0072 184751 214 216 188 194 48 sec At0076 188761 234 335 199 210 75 sec Arab008 181720 299 431 256 287 39 sec Arab022 190701 198 288 159 175 36 sec Arab042 187700 182 305 160 174 40 sec

a(Mismatch,GapOpen, GapExt) (1,4, 1); Eul: EulerAlign; Clw: CLUST ALW;time by D CLUSTALW: > 40 mins. 816 ZHANG AND WATERMAN

FIG. 10. Thenormalized sum of pairs score obtained by directlyusing each sequence as consensusagainst average qualityvalue of that sequence. Each point represents a sequence.The sequence achieving the best alignment score is notthe sequence with the highest average quality value. otherhand, we havethe reference sequence from theassembled genome from TIGR, whichis regarded asthe correct sequence. W ethendraw twohistograms for thesimilarity between each sequence to the referenceand to our consensus. The curve of the similarity score distribution to our consensus turns out tobe narrower thanthat to the reference (Fig. 11), which shows that our consensus is a betterestimation ofthe“ center”than is the reference based on the given data.

8.DISCUSSION AND OPENPROBLEMS

EulerAlign’s mostsigniŽ cant advantages are its capability to handlea largenumber of sequencessimul- taneouslyand its extraordinary speed. In addition, a consensussequence that captures the most conserved similaritiesis obtained before getting an alignment,a distinguishingfeature with regard to otheralignment algorithms.EulerAlign is very  exiblein that additional information can be easilyincorporated into the de Bruijngraph. For example,if input sequences have different weights, the weights can be easily incorpo- ratedinto the multiplicity for edgesby modifying the deŽ nition. If position-speciŽc informationis given, besidesdoing position-speciŽ c pairwisealignment, we canalso construct the de Bruijn graph according tothis information. Althoughour Eulerian approach to global multiple sequence alignment has been successful, there exist somepractical problems and we willnow discuss them.

Choiceof k-tuple size Thechoice of tuple-size k affectsthe quality of the consensus path, because generally one mutation in asequencewill cause a singleedge to diverge away from thecommon edges by .2k 1/ letters.Thus, ¡ ANEULERIAN PATHAPPROACH 817

FIG. 11. Distributionof similarity between each sequence to eitherthe reference sequence (grey bar) or ourconsensus (whitebar). The grey bars are slightly shifted to the left for visualization purpose. the larger k is,the fewer commonalities(or multiplicitiesfor edges)are retained in the graph. Of course, the smaller k is,the more likely it is that a k-tupleis not unique in the sequence. Currently,EulerAlign arbitrarily chooses a k toconstruct the de Bruijngraph. For small N anddistantly relatedsequences, a small k . 10/ ischosen in order to get high multiplicities for edges.For large N · andclosely related sequences, a large k . 16/ maybe proper because a graphconstructed using large k ¸ isalways simpler to solve (fewer randommatches result in fewer tanglesand cycles) than the one using smallk. The upper bound of k canbe estimated by L.D.T .

Graphtransformation may lose information As mentionedabove, sometimes the safe transformationscannot remove all cycles in the graph, and thus severalunsafe transformations may be performed in later stages to remove all remaining cycles. Unsafe transformationsinevitably cause the loss of similarity information among sequences that are represented bymultiplicitiesof edges.One reason that a transformationis unsafe comes from ourtransformation rules. Weconsideronly two left edges at one time, which is not sufŽ cient. One improvement is to consider all leftedges of avertexsimultaneously. On the other hand, since we aredoing the global sequence alignment, we canperform a bandedgraph construction; i.e., when merging a k-tupleto an edge, in addition to the requirementof identical k-tuples,their corresponding positions must also be within a window.Thisfeature hasbeen implemented as an option in current version of EulerAlign.

Arbitrary scoring function Inthe pairwise alignment step, the scoring function is arbitrarily chosen. This is acommonissue for all alignmentalgorithms that use arbitrary scoring functions, because no scoringfunction exists that is optimal for alltypes of input sequences. Consequently, how to choose the optimal scoring functions for different datais a problem.However, because our graph itself is a representationof multiple sequence alignment, 818 ZHANG AND WATERMAN thedrawback from anarbitrary scoring function might be solved by using data from thegraph; e.g., the existenceof many sequence paths sharing a commonedge in the graph provides us more conŽ dence to alignthem together at the corresponding positions. Our deBruijn graph is in fact another representation of multiplealignment, in which the alignments of lettersare represented by many sequence paths sharing common edges. Since one mutation in sequences willcause a singlepath to diverge from theconsensus by .2k 1/ letterswhen using a k-tuple, a de ¡ Bruijngraph inaccurately represents the multiple alignment. However, the graph representation has its own advantagesin searching local commonalities and showing sequence variability. In fact, the Eulerian path approachto the local multiple alignment problem is a naturalextension to its global counterpart. W ewill presentthe local alignment algorithm in another paper .

NOTEADDED IN PROOF

After thepreparation of our paper, we becameaware ofa paper“ Multiplesequence alignment using partialorder graphs” (Lee, C., Grasso,C., andSharlow, M. 2002. ,18,452– 464). In that paper,a graphstructure is embedded in the dynamic programming algorithm to efŽ ciently align a large numberof sequences. The authors’ method successively reduces the space requirement due to the graph structureapplied. A seriesof pairwise alignments are employed in an arbitrary order under a certain scoringscheme. As withany pairwise method, errors caneasily accumulate during the alignment process thatresults in misalignment. As acomparison,CLUST ALW requires O.N 2/ timesimply to Ž nda better orderto do the pairwise alignments.

ACKNOWLEDGMENTS

WethankProfessor MagnusNordborg at USC for kindlyproviding the Arabidopsis sequencing data andare grateful to Professor LeiLi at USC andDr .HaixuT angat UCSD for manyhelpful discussions. Thisresearch was supportedby NIH grantR01 HG02360-01.

REFERENCES

Altschul,S.F .,Carroll,R.J., and Lipman, D.J. 1989. W eightsfor data related by a tree. J.Mol.Biol. 207,647–653. Arratia,R., andGordon, L. 1989.Tutorial on large deviations for the binomial distribution. Bull.Math. Biol. 51, 125–131. Barton,G.J., and Sternberg, M.J.E. 1987a. Evaluation and improvements in the automatic alignment of protein se- quences. ProteinEng. 1, 89–94. Barton,G.J., and Sternberg, M.J.E. 1987b. A strategyfor the rapid multiple alignment of proteinsequences conŽ dence levelsfrom tertiary structure comparisons. J.Mol. Biol. 198,327– 337. Dayhoff,M.O., Schwartz, R.M., and Orcutt, B.C. 1978.A modelof evolutionarychange in proteins, in DayHoff,M.O., ed., Atlasof ProteinSequence and Structure ,345–352, National Biomedical Research Foundation, W ashington,DC. Eddy,S.R. 1995.Multiple alignment using hidden Markov models. ISMB 3, 114–120. Ewing,B., andGreen, P .1998.Base-calling of automatedsequencer traces using phred. II. errorprobabilities. Genome Res. 8,186–194. Feng,D., and Doolittle, R.F .1987.Progressive sequence alignment as a prerequisiteto correct phylogenetic trees. J.Mol.Evol. 25,351– 360. Gotoh,O. 1996.SigniŽ cant improvement in accuracyof multiple protein sequence alignments by iterative reŽ nement asassessed by reference to structural alignments. J.Mol.Biol. 264,823– 838. Gotoh,O. 1999.Multiple sequence alignment: Algorithms and applications. Adv.Biophys. 36,159– 206. Higgins,D.G., and Sharp, P .M.1989.Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5, 151–153. Hughey,R., and Krogh, A. 1996.Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS 12, 95–107. Idury,R., andW aterman,M.S. 1995.A newalgorithm for DNA sequenceassembly. J.Comp.Biol. 2, 291–306. Kececioglu,J.D. 1993. The maximum weight trace problem in multiple sequence alignment. CPM, 106–119. ANEULERIAN PATHAPPROACH 819

Lipman,D.J., Altschul, S.F .,andKececioglu, J.D. 1989. A toolfor multiple sequence alignment. Proc.Natl. Acad. Sci. USA. 86,4412– 4415. Martinez,H.M. 1988. A exiblemultiple sequence alignment program. Nucl.Acids Res. 16,1683– 1691. McClure,M.A., V asi,T .K.,and Fitch, W .M.1994.Comparative analysis of multiple protein-sequence alignment methods. Mol.Biol. Evol. 11,571– 592. Morgenstern,B. 1999.DIALIGN 2:Improvementof the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15,211– 218. Morgenstern,B., Dress, A., andW erner,T .1996.Multiple DNA andprotein sequence alignment based on segment- to-segmentcomparison. Proc.Natl. Acad. Sci. USA 93,12098-12103. Notredame,C. 2002.Recent progresses in multiple sequence alignment: A survey. Pharmacogenomics 3, 131–144. Notredame,C., and Higgins, D.G. 1996.SAGA: Sequencealignment by genetic algorithm. Nucl.Acids Res. 24, 1515–1524. Notredame,C., Higgins,D.G., and Heringa, J. 2000.T -Coffee:A novelmethod for fast and accurate multiple sequence alignment. J.Mol. Biol. 302,205– 217. Notredame,C., Holm,L., and Higgins, D.G. 1998.COFFEE: Anobjective function for multiple sequence alignments. Bioinformatics 14,407– 422. Pevzner,P .A.,T ang,H., and Waterman, M.S. 2001.An Eulerian path approach to DNA fragmentassembly. Proc. Natl. Acad.Sci. USA 98,9748– 9753. Sankoff,D. 1975.Minimal mutation trees of sequences. SIAMJ. Appl. Math. 28, 35–42. Sankoff,D., Cedergren, R.J., and Lapalme, G. 1976.Frequency of insertion– deletion, transversion, and transition in theevolution of 5S ribosomalRNA. J.Mol.Evol. 7, 133–149. Sankoff,D., and Kruskal,J. 1983. Time Warps,String Edits, and Macromolecules:The Theory and Practice of Sequence Comparison,Addison-Wesley,Reading, MA. Smith,R.F .,andSmith, T .F.1990.Automatic generation of primary sequence patterns from sets of related protein sequences. Proc.Natl. Acad. Sci. USA 87,118– 122. Smith,R.F .,andSmith, T .F.1992.Pattern-induced multisequence alignment (PIMA) algorithm employing secondary structure-dependentgappenalties for use in comparative protein modeling. ProteinEng. 5, 35–41. Taylor,W .R.1988. A exiblemethod to align a largenumber of sequences. J.Mol. Evol. 28,161– 169. Thompson,J.D., Higgins, D.G., and Gibson, T .J.1994.CLUST ALW:Improving the sensitivity of progressivemultiple sequencealignment through sequence weighting, position-speciŽ c gappenalties and weight matrix choice. Nucl. Acids Res. 22,4673– 4680. Thompson,J.D., Gibson, T .J.,Plewniak, F .,Jeanmougin,F .andHiggins, D.G. 1997.The ClustalX windows interface: Flexiblestrategies for multiple sequence alignment aided by quality analysis tools. Nucl.Acids Res. 24,4876– 4882. Thompson,J.D., Plewniak, F .,andPoch, O. 1999.A comprehensivecomparison of multiple sequence alignment programs. Nucl.Acids Res. 27,2682– 2690. Vingron,M., andArgos, P .1991.Motif recognition and alignment for many sequences by comparison of dotmatrices. J.Mol.Biol. 218, 33–43. Vingron,M., andW aterman,M.S. 1996.Alignment networks and electrical networks. Disc.Appl. Math. 71,297– 309. Waterman,M.S. 1995. Introductionto Computational Biology ,Chapmanand Hall, London. Waterman,M.S., and Jones, R. 1990.Consensus methods for DNA andprotein sequence alignment. MethodsEnzymol. 183,221– 237. Waterman,M.S., and Perlwitz, M.D. 1984. Line geometries for sequence comparisons. Bull.Math. Biol. 46,567– 577. Waterman,M.S., Smith, T .F.,andBeyer, W .A.1976.Some biological sequence metrics. Adv. Math. 20,367– 387.

Addresscorrespondence to: Yu Zhang Departmentof Mathematics Universityof Southern California 1042W est36th Place (DRB 289) LosAngeles, CA 90089-1113

E-mail: [email protected]