An Eulerian Path Approach to Global Multiple Alignment for DNA
Total Page:16
File Type:pdf, Size:1020Kb
JOURNAL OFCOMPUTATIONAL BIOLOGY Volume10, Number 6,2003 ©MaryAnn Liebert,Inc. Pp. 803–819 AnEulerian Path Approach to Global Multiple Alignment for DNASequences YU ZHANG1 andMICHAEL S.WATERMAN 2 ABSTRACT With the rapid increase in the datasetof genomesequences, the multiple sequence alignment problemis increasingly importantand frequently involves the alignmentof alargenumber ofsequences. Manyheuristic algorithmshave been proposedto improve the speed ofcom- putationand the qualityof alignment. W eintroduce anovelapproach that is fundamentally different fromall currently availablemethods. Our motivationcomes from the Eulerian methodfor fragment assembly in DNAsequencing thattransforms all DNAfragments into ade Bruijn graphand then reduces sequence assembly toa Eulerian pathproblem. The paperfocuses onglobal multiple alignmentof DNA sequences, whereentire sequences are aligned into onecon guration. Our mainresult is analgorithm with almostlinear compu- tationalspeed with respect tothe totalsize (numberof letters) ofsequences tobe aligned. Fivehundred simulated sequences (averaging500 bases per sequence andas lowas 70% pairwise identity) havebeen aligned within three minutes onapersonal computer,andthe quality ofalignment is satisfactory.As aresult, accurateand simultaneous alignmentof thousandsof long sequences within areasonableamount of time becomespossible. Data froman Arabidopsis sequencing projectis used todemonstrate the performance. Key words: multiple sequencealignment, de Bruijn graph,Eulerian path. 1.INTRODUCTION ultiplesequence alignment is routinely used to ndconserved regions in molecular se- Mquences.When a setof relatedsequences is aligned by multiple sequence alignment algorithms, the hiddencommonalities among sequences are revealed. Thus they provide a criticalcomputational tool to retrievehidden information among huge and exponentially growing genetic data. Amathematicallyoptimal solution of the multiple sequence alignment problem is achieved by using a dynamicprogramming algorithm (Sankoff, 1975; W aterman et al.,1976).Because the time and memory costof thedynamic programming algorithm is exponentialto thenumber of sequences,i.e., 2.LN / where L sequencelength and N numberof sequences, it is impossible to put this method into practical D D 1Departmentof Mathematics,University of SouthernCalifornia, 1042 W .36thPlace (DRB 289),Los Angeles, CA 90089-1113. 2Departmentof Biological Sciences, University of Southern California, Los Angeles, CA 90089-1340. 803 804 ZHANG AND WATERMAN useexcept on a smallset of sequences (see Kececioglu[1993] for extensionsof this approach). Many heuristicalgorithms achieve reasonably good solutions with limited time and memory usage, and some of themhave already been popularly used in modern molecular biology research. Mostcurrently available algorithms are in one of two classes. The rst classis those algorithms using theprogressive alignment strategy (W atermanand Perlwitz, 1984; Feng and Doolittle, 1987). A multiple alignmentis gradually built up by aligning the closest pair of sequences rst andthen aligning the next closestpair of sequences, or onesequence with a setof alignedsequences or twosets of alignedsequences. Thisprocedure is repeated until all given sequences are aligned together .Thekey idea is that the pair of sequenceswith minimum distance is mostlikely to having been obtained from themost recent evolutionary divergenceand that the pairwise alignment of these two speci c sequencesprovides the most “ reliable” informationthat can be extracted. Many programs based on this method exist. Some of themconstruct an alignmentthroughout the entire length of sequences,such as MULTAL(Taylor,1988), AMUL T(Bartonand Sternberg,1987a, 1987b), MSA (Lipman et al.,1989),CLUST ALW (Higginsand Sharp, 1989; Thompson et al.,1994).Other programs try rst toidentify an ordered series of highly conserved regions, then proceedto alignthe intervening regions, such as GENALIGN (Martinez,1988) and ASSEMBLE (Vingron andArgos, 1991). Among these progressive alignment methods, CLUST ALW,probablythe best-known andmost popular program used currently, builds a guidetree from thepairwise alignment scores and mergessubsets of sequences according to the tree. A recentalgorithm T -Coffee(Notredame et al., 2000) issimilar to CLUST ALWbutuses a consistencymeasure that may reduce the potential errors causedby progressivealignment. Theother class of alignmentalgorithms uses iterative re nement strategies to improvean initialalignment (Sankoff et al.,1976;Sankoff and Kruskal, 1983). The basic idea for iterativemethods is to re ne the initialalignment iteratively by local optimization. In HMMT (Eddy, 1995), e.g., the model parameters arereestimated at each iteration. Iteration continues until convergence or reaching the maximal user- dened number of iterations. Currently available programs that apply iterative strategies include DIALIGN (Morgenstein et al.,1996),which uses a localalignment approach to construct multiple alignments based onsegment– segment comparisons, where the segments are incorporated into a multiplealignment by an iterativeprocess. The PRRP program (Gotoh, 1996) optimizes an alignment by iteratively dividing the sequencesinto two groups and then realigning them using a globalgroup-to-group alignment algorithm. SAGA (Notredameand Higgins, 1996) uses a geneticalgorithm to select from anevolving population thealignment which optimizes an objective function (OF). AnOF namedCOFFEE (Notredameet al., 1998)is used in SAGA thatmeasures the consistency between the multiple alignment and a libraryof CLUSTALWpairwisealignments. HMMT (Eddy, 1995) and SAM (Hugheyand Krogh, 1996) apply a stochasticiterative strategy that maximizes the probability based on a hiddenMarkov model (HMM), but theyare often used to re ne prealigned sequences (Notredame, 2002). Thenumerous alignment programs mentioned above have their own advantages and drawbacks. However, thereare some common issues for allof thecurrently available alignment algorithms: 1) robustness under certainconditions, such as gap-richregions, repeat-rich regions, etc.; 2) local optima problems, especially for iterativemethods; 3) timeef ciency and memory usage. The time cost for allcurrent algorithms is at bestproportional to the square of the number of sequences to be aligned. Althoughmany heuristic algorithms have been applied to alleviatethe huge time and memory expenses, thosecosts remain a barrierfor practicalapplication when confronted with thousands of inputsequences, or millionsof letters in each sequence, which are not unusual numbers even in bacterial genomes. Certainly, handlinghundreds of sequences with lengths in the thousands is a worthygoal. 2.MOTIV ATION Thispaper presents a novelapproach to the global multiple DNA sequencealignment problem using Eulerianpaths. W ecallthe method “ EulerAlign.”Themost signi cant advantages of EulerAlign are its lineartime and memory cost with respect to the total size of sequences to be aligned and improved alignmentaccuracy. Our motivationcomes from thealgorithm for fragmentassembly in DNA sequencingusing the Eulerian superpathapproach (EULER) (Iduryand W aterman,1995; Pevzner et al.,2001).In that algorithm, a ANEULERIAN PATHAPPROACH 805 FIG. 1. Anextreme version of fragment assembly is multiple sequence alignment. fragmentassembly problem is rst reducedto aneasy-to-solve Eulerian path problem in thede Bruijngraph, andan Eulerian path is found that corresponds to the consensus sequence of the genome. A signicant contributionof theEulerian approach is that it discards the traditional “ overlap-layout-consensus”outline. Inanother words, it obtainsa consensussequence before knowing an alignment and without doing pairwise alignments. Aglobalmultiple alignment problem is in fact an extreme version of the fragment assembly problem. As shownin Fig.1, ifalmost all fragments come from thesame region of agenome,then EULER should outputa consensussequence of that region. For multiplesequence alignment, if thisconsensus sequence isthe closest one to all given sequences, one would expect to achieve an accurate alignment using a consensusscoring scheme. Notethat the idea is similar to the star method: given N sequencesto be aligned, the star method rst computesthe alignments of all sequence pairs and picks one sequence among N sequencesas the consensusthat is closestto all other sequences. EulerAlign attempts to ndsucha consensusthat consists offragmentsof the N sequencessuch that the conserved regions are ampli ed and noise is suppressed. To understandthis, assume all input sequences are derived from acommonancestral sequence; then EulerAlign isused to ndthis ancestral sequence. By comparing each sequence with the ancestral, it will be able to distinguishbetween conserved letters and mutations. Consequently, the global multiple alignment will be unambiguouslyconstructed. Acrucialrequirement of EULER is the almost “ error-free”data that is obtained by an error-correction procedure.In addition, EULER doesn’ t evaluatethe quality of its consensus path until the nalquality assessmentstage. For themultiple alignment problem, however, people are more interested in aligning distantlyrelated sequences, and for EulerAlign,the consensus sequence obtained must be “accurate.”One optionis to doerror-correctionaggressively so all“ errors”(differences between sequences) are eliminated, butthis is not likely to succeed when aligning distantly related sequences. Instead, EulerAlign applies a somewhatdifferent procedure than does