SampleAlignD:AHighPerformanceMultipleSequenceAlignmentSystemusingPhylogenetic SamplingandDomainDecomposition

FahadSaeedandAshfaqKhokhar

Email:{fsaeed2,ashfaq}@uic.edu DepartmentofElectricalandComputerEngineering UniversityOfIllinoisatChicago Chicago,IL60607 Abstract Multiple (MSA) is one of the most computationally intensive tasks in . Existing best known solutions for multiple sequence alignment take several hours (in some cases days) of computation time to align, for example, 2000 homologous sequences of average length 300. Inspired by the Sample Sort approach in parallel processing, in this paper we propose a highly scalable multiprocessor solution for the MSA problem in phylogenetically diverse sequences. Our method employs an intelligent scheme to partition the set of sequences into smaller subsets using k-mer count based similarity index, referred to as k-mer rank. Each subset is then independently aligned in parallel using any sequential approach. Further fine tuning of the local alignments is achieved using constraints derived from a global ancestor of the entire set. The proposed Sample-Align-D has been implemented on a cluster of workstations using MPI message passing library. The accuracy of the proposed solution has been tested on standard benchmarks such as PREFAB. The accuracy of the alignment produced by our methods is comparable to that of well known sequential MSA techniques. We were able to align 2000 randomlyselectedsequencesfromthe Methanosarcina acetivorans in less than 10 minutes using Sample-Align-D on a 16 node cluster, compared to over 23 hours on sequential MUSCLE system running on a single cluster node.

1. Introduction MultipleSequenceAlignment(MSA)isafundamentalproblemofgreatsignificanceincomputational biology as it provides vital information related to the evolutionary relationships, identify conserved motifs, and improves secondary and tertiary structure prediction for RNA and . In theory, multiplesequencealignmentcanbeachievedusingpairwisealignment,eachpairgettingalignmentscore andthenmaximizingthesumofallthepairwisealignmentscores.Optimizingthisscore,however,isNP complete [1] and dynamic programming based solutions have complexity of O( LN) , where N is the numberofsequencesand Listhelengthofeachsequence,thusmakingsuchsolutionsimpracticalfor largenumberofsequences. Theseaccurateoptimizationmethodsarealsoveryexpensiveintermsofmemoryandtime.Thisis why most multiple sequence alignments techniques rely on , most popular being CLUSTALW [27], TCoffee [28], MUSCLE [3, 12] and ProbCons [29]. These are usually complexcombinationofadhocproceduresmixedwithsomeelementsofdynamicprogramming,thusthe resulting methods do notscale well.These methods yield extremely poor performance for very large numberofsequences.Forexample,CLUSTALW[27]isestimatedtotake1yeartoalign5000sequences ofaveragelengthof350[3].MUSCLEisclaimedtobethefastestandsomewhatmostaccuratemultiple alignmenttooltilltodate.Itclaimstoalign5000syntheticsequencesofaveragelength350in 7minutes on a contemporary desktop computer [3]. We also performed experiments with real data sets of Methanosarcina acetivorans genome sequence, having 5 million base pairs and is by far the largest known archeal genome [31]. The MUSCLE system takes about 1400 minutes (~23 hrs) to align randomlyselected2000sequences,withaveragelengthof316.OurprojectionisthatMUSCLEsystem willtakemorethan30daystoalignthewholegenome. The computation demands of these heuristics make the design of parallel approaches to the MSA problemhighlydesirable.Therehavebeennumerousattemptstoparallelizeexistingsequentialmethods. CLUSTALW [27] is by far the most often parallelized algorithm [4]. James et . al. in [5] parallelized CLUSTALWforPCclustersanddistributed/sharedmemoryparallelmachines.HT[6]isparallel solutionforheterogeneousMultipleSequenceAlignmentandMultiClustal[6]isaparallelversionofan optimizedCLUSTALW.Inthesesolutions,thefirsttwostages,i.e.pairwisealignmentandguidetree, areparallelized,andthethirdstage,finalalignment,ismostlysequential,thuslimitingtheamountofhe achievablespeedup[6].DifferentmodulesoftheMUSCLEsystemhavealsobeenparallelized[7].Other parallelizationeffortsincludeparallelmultiplesequencealignmentwithphylogenysearchbysimulated annealingbyZolaetal[8],MultithreadingMultipleSequenceAlignmentbyChaichoompuetal[9]and Schmollinger et al’s parallel version of DIALIGN [10]. Although there seems to be a considerable amountofefforttoimprovetherunningtimesforlargenumberofsequencesusingparallelcomputing,it mustbenotedthatalmostalltheexistingsolutionshavebeenaimedatparallelizingdifferentmodulesof aknownsequentialsystem. Afewattemptshavealsobeenmadetocuteachsequencesintopiecesandcomputethepiecewise alignment over all the sequences to achieve multiple sequence alignment. In [22], each sequence is ‘broken’inhalf,andhalvesareassignedtodifferentprocessors.TheSmithWaterman[21]algorithmis appliedtothesedividedsequences.Thesequencesarealignedusingdynamicprogrammingalgorithms, andthencombinedusingCombineandExtendtechniques. The Combine and Extend methods follow certainmodelsdefinedtoachievealignmentofthecombinationofsequences.Thesemethodspaylittleor noattentiontothequalityoftheresultsobtained.Theendresultshaveconsiderablelossofsensitivity. Theconstraintsinthesemethodsaresolelydefinedbythemodelsused,thuslimitingthescopeofthe methodsforwidevarietyofsequences. Inourpreviouswork[34,35],wehavedevelopedsamplingbasedsequentialandparallelsolutionsfor the alignment of homologous sequences. The sampling techniques used in [34] were based on the assumptionthattheunderlyingsequencesethasphylogeneticallyhighercorrelation.Inthispaperwe proposeasolutionforaligningphylogenetically diverse setofsequences,thereforereferredtoasSample AlignD.Theproposedapproachisbasedondomaindecompositionwheredatadomainisdistributed amongprocessorsandlocalalignmentsareperformedineachprocessor.Attheend,thesummaryresults ofthelocalalignmentsaresharedamongallprocessor,andfinaladjustmentsinthealignmentsaremade basedonthisglobalinformation. OurmethoddrawsitsmotivationfromtheSampleSort[13]approachthathasbeenintroducedtosort largesequenceofnumbersonmultiprocessorordistributedplatforms.ThesortingandMSAproblems shareacommoncharacteristic,i.e.,anycorrectsolutionrequirescomparisonofeachpairofdataitems.In SampleSort,asmallsample(<< N)representingtheentiredatasetischosenoverdistributedpartitions usingsomesamplingtechniquesuchasRegularSampling.Thesampleisthenusedtodefine pbuckets, where pisthenumberofprocessorsinthesystem.Thebucketboundariesaremadeknowntoallthe processors.Eachprocessorthenplacesitslocaldataitemsintocorrespondingbuckets.Thebucketsare thenindividuallysortedtoachieveanoverallsortedsequence.Weuseasimilarsamplingapproachto redistributesequencesoveralltheprocessorsbasedonkmerranksuchthatsequenceswithsimilarkmer rank values are available on a single processor. The sequences with similar kmer ranks are aligned sequentiallyatdifferentprocessors.Thefinalalignmentisachievedbycomputingaglobalancestorofthe underlyinghomologoussetofsequencesandprofilealigningeachsequencewiththisancestor. Therestofthepaperisorganizedasfollows:Section2givesthedescriptionofourmethodknownas SampleAlignD,theassumptions,andthecomplexityanalysisofthesystem.Section3givesthedetails ofthecommunicationcostandloadbalancingaspectsofthesystem.Section4describestheperformance evaluation in terms of execution time, scalability, and quality of the alignments obtained. Section 5 discussestheconclusionsandthefuturework. 2.SampleAlignD Theobjectiveofourworkistodevelopahighlyscalabledistributedmultiplesequencealignmentmethod basedonwellknownsequentialtechniquessuchthatmultiplesubsetscouldbealignedinparallelwhile stillachievingglobalalignmentwithrespectstotheentireset.TheproposedSampleAlignDapproach usesanideaderivedfromtheSampleSorttechnique[13]wellknownintheareaofparallelcomputation to guide the distribution of sequences among processors using regular sampling. We use kmer rank definedbelowtoachievethelocalizationofsimilarsequencesonasingleprocessor. kmerRank: A kmer is a contiguous subsequence of length k and related sequences tend to have more kmer in commonthanmaybeexpectedbychance[3].Thekmerdistancebetweenanytwosequences xiand xjis definedasfollows:

ri, j = ∑τ min[nxi (τ ),nx j (τ )]/[min( xi , x j − k +1 τ

Hereτisa kmeroflength k, nx i(τ)and nx j(τ)arethenumberoftimesτoccursin xiand xjrespectively, and| xi|and| xj|arethesequencelengths.Wedefineaveragekmerdistanceofsequencex itoalltheother sequencesasfollows: 1 N Di = ∑ ri, j N j=1

Finally, the kmer rank of asequence xiwithrespecttoallthesequenceinthedatasetis defined as follows[15]: Ri=log(0.1+ Di) Thekmerdistancecanbeusedtorapidlyconstructphylogenetictrees.For Nunalignedsequencesof length L,thekmerrankgivesanapproximateestimateofthefractionalidentitythathasalsobeenusedin CLUSTALW.Ourmotivationforusingthistypeofsimilarityindexingistoimproveprocessingspeed without the need to align the sequences globally. Edgar RC [15] has shown that kmer similarities correlatewellwithfractionalidentity.Inthefollowingwefirstgiveaveryshortandintuitivedescription oftheproposedSampleAlignDapproachalongwiththealgorithm. ThemainideaintheSampleAlignDapproachistocollectarelativelysmallsampleofsequences(<< N)thatisrepresentativeoftheentiredatasetofNsequences.Thekmerrankiscomputedforallthe sequencesinthesampleandthenthisrankisusedtoredistributeandlocallyalignthedatasetsineach processorinparallelusingasequentialMSAalgorithm.Furtherfinetuningofthealignmentisperformed usingaglobalancestortemplateandthusachievingglobalalignment.Theglobalancestoriscomputed usinglocalancestorsthatareavailableattheendofthelocalalignmentphasewithineachprocessor. Basedonthisapproach,thecomplexityofthealignmentisreducedtoaligning psetsofsequences,each of size N/p with some, additional cost incurred on communication and fine tuning. A more detailed algorithmisoutlinedbelow. SampleAlignD(SequencesN) pprocessorarebeingusedforcomputation Input←Nsequencesofaminoacidsx1,x2…xN: Output←Gapsareinsertedineachofthex’ssothat • Allsequenceshavethesamelength • Scoreoftheglobalmapismaximum  Assume N/p sequencesoneachofthe pprocessors  Locallycomputekmerrankofallthesequencesineachprocessor  Sortthesequenceslocallyineachprocessorbasedonkmerrank  Chooseasamplesetof ksequencesineachprocessor,where k << N/p  Sendthe ksamplesfromeachprocessortoalltheprocessors.  Computethekmerrankofeachsequenceagainstthe k*p samples.  Sortthesequenceslocallyineachprocessorbasedonthenewkmerrank  Usingregularsampling,choosep1sequencesfromeachprocessorandsendonlytheirrankstoa rootprocessor.  Sortallthe p*(p-1) ranksattherootprocessorsanddividetherangeofranksinto pbuckets.  Sendthebucketboundariestoalltheprocessors.  Redistributedsequencesamongprocessorssuchthatsequenceswithkmerrankintherangeof bucket iareaccumulatedatprocessor i,where 0 > i < p+1 .  Alignsequencesineachprocessorusinganysequentialmultiplealignmentsystem  BroadcasttheLocalAncestortotherootprocessor  Determine‘Global’AncestorGAattherootprocessorbyaligning‘local’ancestorsreceivedfrom alltheprocessors  BroadcastGAtoalltheprocessors  Realign each of the sequences in p processors based on ancestor GA using profileprofile alignment.i.e.Eachoftheprofilesofalignedsequencesaretweakedusingtheancestorprofile, withconstraints.  Glueallthealignedsequencesattherootprocessor.  END

2.1 AssumptionsandLimitations Westatecertainassumptionsandlimitationsthataretypicalofparallelsystems/alignmenttoolsandare notoverlyrestrictiveorextensive. • Currently,themethodisonlybeingtestedforhomologoussequences.Themethodisalsoenvisionedto workbestwhenthesequencesarenothighlydivergent. • Itisassumedthatthesequencessimilarityisuniformlydistributedandredistributionstepintheabove algorithmprovidesstatisticalguaranteeofuniformloadoveralltheprocessors. • The method would work best when the number of sequences is large, so that the interprocessor communicationismuchlessthanthetimerequiredtoalignthesetofsequencesonasingleprocessor.

2.2 TheSampleAlignDAlgorithm:Details Itisassumedthattheinputconsistsof Nsequencesofaminoacids x1, x 2... x N.Theoutputofthealgorithm isagainasetof Nsequencessuchthatthegapsareinsertedinawaythatthesequenceshaveequallength andscoreoftheglobalmapismaximized.Thescoreiscalculatedasthesumofpairs’scores.The N sequences are divided equally among p processors, i.e. each processor is assigned N/p sequences. A similarityindexbasedonkmerrankofeachsequenceiscomputedlocallyoneachprocessor.

2.3.1. Globalised K-mer Rank Inourearlierpaper[34],weassumedphylogeneticallyhighercorrelationamongsequences.Therefore,k merrankcomputedineachprocessorlocally,independent of thesequencesattheother processorsis similartotherankthatmighthavebeendeterminedgloballyconsideringallthesequences.Weshowedin [34]thatthisistruewhenthesequencesarenothighlydivergent. Ifthesequencesarenothighlydivergent,kmerrankcomputedusingthisdistributedapproachmaybe verydifferentcomparedtothecentralizedcasewhererankofeachsequenceiscomputedbyconsidering allthe Nsequences. Inordertoaddressthisdiversity,wecollect ksamplesequencesfromeachprocessorsuchthatthese k samplesrepresentthecorrespondingsetof N/p sequences,yieldingatotalof k*p samples.Collectively,it is safe to assume that these k*p samples represent the entire set of N sequences. We use these k*p sequencestobuiltaphylogenetictree,andforeachsequencecomputeitskmerrankusingthistree.Thus therankcomputationforeachsequenceisagainstaglobalsample. Subsequently,redistributionbasedonthissamplingtechniquealsoensuresthatsequencesaccumulated ineachprocessorare‘similar’toeachother.InFig.1wecomparetherankcomputedusingsamplesonly (referredtoasglobalizedrank)withtherankcomputedusingallthesequences(referredtoascentralized rank).

Fig.1. Distributionofkmerranksfor500sequenceswhencomputedonacentralsystemand globalisedkmersystems. Thestatisticsofthetwoapproachesfor5000sequencesarepresentedinTable1.Ascanbeseenthat thestandarddeviationforthetwosetsofranksis0.58andthattheaverageoftheglobalisedapproachis higherthanthatofthecentralizedapproach.Thisisduetothefactthatintheglobalizedapproacheach sequence is compared against a small set of sequences, where as in the centralized approach each sequencesiscomparedagainstallthe Nsequences,yieldinglargervariationsinthekmerrankandmaking theaveragesmaller. (Maximum,Minimum)Central (1.44827 ,0.0 ) AverageCentralized 0.722962 (Maximum,Minimum)Globalized (1.46207 ,0.0 ) AverageGlobalized 1.11302 Variancew.r.t.Centralized 0.33190 StandardDev.w.r.tCentralized 0.576377

Table1.StatisticalcomparisonofthekmerrankcomputedonaGloablizedsystemvsCentralsystem.

2.3.2 Redistribution Based on k-mer Rank Each processor sorts its w=N/p sequences based onkmer ranks using a sequentialsorting algorithm. Fromeachofthe plocallysortedlists, k = (p-1) evenlyspacedsamplesarechosen.Thekmerranksof these p-1samples(pivots)dividethelocalsetofsequencesinto porderedsubsets.Thekmerranksof these p-1samplesfromalltheprocessorsaregatheredattherootprocessoryieldingasetYofsize p(p-1). ThisregularsampledsetYissortedtocomputetheorderedlistY 1,Y 2,Y 3…Y p(p1) determiningtherange of kmer ranks over allthe processors.Then Y p/2 , Y p+p/2 ... Y (p2)p+p/2 are chosen as pivots ( p in total) dividingthekmerrankrangeinto pbuckets.Thesepivotsarethenbroadcasttoalltheprocessors.Each processorsendsthesequenceshavingkmerrankintherangeofbucket itoprocessor i.Fortheboundon thesizeofthedatasetineachprocessorafterredistribution,werefertotheanalysisinSection3.

2.3.3 The Alignment NextasequentialMSAprogramisrunoneachprocessor.Sinceourultimategoalistohaveasequence alignmentfor NsequencesandnotonsomesubsetofNsequences.Therefore,awayhastobedefined that would concatenate these ‘chunks’ of aligned sequences so that a ‘global’ alignment of multiple sequencescanbeobtained.In[12]ithasbeenobservedthatmultiplesequencealignmentforhomologous sequencescanbeobtainedbyaligningeachsequencetotherootprofile.Thisapproachissimilartothe oneusedinthePSIBLAST[19],whereaprofileisusedtoalignanyquerysequencewiththesequences thathavegeneratedtheprofile.Asimilarideaforancestorconstrainedmultiplesequencealignmenthas beenproposed,althoughwithoutdomaindecomposition,forprogressivealignment[33]. We use a similar concept along with domain decomposition of the sequences. We extract the local ancestor from each processor after locally aligning each subset in parallel.Alloftheselocalancestors are collected at the root processor and are aligned using a sequential multiple sequence alignment algorithm. The ancestor of all the local ancestors, referred to as the ‘global’ ancestor, is then broadcast back to all the processors. The ‘global’ ancestor is then used to perform a profileprofile alignment usingthemethodin[12],i.e.eachof the locally aligned sequences (referred to as profile) in the Fig.2. ExampleofAncestorprofilebeingusedtotweaklocally processorisalignedwiththeglobal alignedprofilesindifferentclusternodes. ancestor profile. The profileprofile alignment with the template of the ancestorisperformed,togetabetterSPscoreandhenceamultiplesequencealignment.Thekindoffine tuningthatmaybeexpectedfromtheancestorcanbedepictedinFig.2. AsshowninFig.2,therearetwosetsofsequencesthatarealignedindependentofeachother.To perform a one global alignment of these multiple sequences, these subsets of independently aligned sequencesaretweaked,usingtheglobalancestorasatemplate.Afterthisstep,thetweakedsequencesare just‘joined’togetherandSPscoreisobtainedbyjustconcatenatingthesequences.

3.AnalysisofComputationandCommunicationCosts: ForanalysispurposesweassumethatthesequentialMSAalgorithmbeingusedistheMUSCLEsystem [12].Therefore,thecomplexitycomputationsofsequentialcomponentsarebasedontheanalysisgivenin [12].Heretheassumptionisthatinitiallyeachprocessorhas w= N/p sequences,where N isthenumber ofsequencesandpisthenumberofprocessors.Theaveragelengthofasequenceis L.Inthefollowing wefirstoutlineallthecomputationcostsandstoragerequirement.Thisisfollowedbytheanalysisofthe communicationoverhead. ComputationCosts: STEP O(Time)O(Space) 1. kmerrankcomputationon( w=N/p )sequences w2 L w+L 2. Sortingof N/p sequencesbasedonkmerrank w logw log w 3. Sample k = p-1sequences w p 4. kmerrankcomputationof( k*p )sequences p4L p2+L inrootprocessor 5. Sortingof k*p sample kmerranks (k*p)log(k*p) log(k*p) 6. Kmerrankcomputationofeachof(w=N/p) w[(k*p+1) 2 L] w(k*p+L) Sequencesagainstk*psamples 7. MUSCLEexecutedon( w=N/p ) sequencesinparallel. w4+wL2 w2+L 2 8. Ancestorextractionfromeachofthe pprocessors p2 p2 +exporttotherootprocessor. 9. MUSCLEexecutedonlocal ancestors( pelements) (p) 4+(p)L 2 (p) 2 + L 2 10.Profilealignmentwithallcombined alignedsequencesoneachofthe wL2 w processor TOTALComputationCost(for w = N/p ) O((N/p ) 4+ (N/p ) L2) O((N/p ) 2+ L 2 ) CommunicationCosts: No matter how powerful a machine may be, interprocessor communication overhead is a factor that limits the performance of a distributed message passing parallel systems [24]. Fortunately, the communicationcostofoursystemismuchlessthanthecostofthealignments.Essentially,theproposed SampleAlignDalgorithmhastworoundsofcommunication..Inthefirstround,samplesarecollected attherootprocessorandpivotsbroadcastfromtherootprocessor.Inthesecondround,sequencesare redistributed to achieve better alignments and balanced load distribution. For the analysis of the communicationcostswehaveadoptedthecoarsegrained computation model assumed in [20, 16, 2]. However,weignorethemessagestartupcostsandassumeunittimetotransmiteachdatabyte.

WehaveassumedRegularSamplingstrategyduetofollowingreasons: 1. Thestrategyisindependentofthedistributionoforiginaldata,comparedtosomeotherstrategies suchasHuangandChow[25]. 2. It helps in partitioning of data into ordered subsets of approx. equal size. This presents an efficient strategy for load balancing as unequal number of sequences on different processors would mean unequal computation load, leading to poor performance. In the presence of data skew,regularsamplingguaranteesthatnoprocessorcomputesmorethan( 2N/p )sequences[26]. 3. Ithasbeenshownin[26]thatregularsamplingyieldsoptimalpartitioningresultsaslong N>p3, i.e.,thenumberofdataitems Nismuchlargerthanthenumberofprocessors p,whichwouldbea normalcaseintheMSAapplication.

First Communication Round: Assuming k = p-1,i.e.,eachprocessorchooses p-1samples,thecomplexityofthefirstphaseisO(p 2L)+ O(plogp)+O(k*plogp) ,whereO( p2L) isthetimetocollect p(p-1) samplesofsize Leachattheroot processors,O( plogp)isthetimerequiredtobroadcast p-1pivotstoalltheprocessorand(k* plogp) isthe timerequiredtobroadcastk*psequencestoalltheprocessors.

Second Communication Round: Inthesecondroundeachprocessorsendsthesequenceshavingkmerrankintherangeofbucket ito processor i.Eachprocessorpartitionitsblocksintopsubblocks,oneforeachprocessor,usingpivotsas bucketboundaries.Eachprocessorthensendsthesubblockstotheappropriateprocessor.Thesesub blockscanvaryfrom0to N/p sequencesdependingontheinitialdatadistribution.Takingtheaverage casewheretheelementsintheprocessoraredistributeduniformly,theneachsubblocksizeis N/p 2.Thus this step would require O (N/p) time assuming an alltoall personalized broadcast communication primitive[16,2].However,inthefollowingweshowthatbasedonregularsamplingnoprocessorwill receivemorethan 2N/p elementsintheworstcase.Thereforestilltheoverallcommunicationcostwillbe O( N/pL). Let’sdenotethepivotschoseninthefirstphasebythearray:y1,y2,y3…yp1.Considerprocessori=

1, where 1 ≤i ≤p,allthedatatobealignedbyprocessor1mustbe≤y1intermsofitkmerrank.Since 2 thereare p -p-p/2 sequencesinthesamplethathavekmerrank>y 1,correspondinglythereareatleast 2 2 (p -p-p/2 ) w/p sequencesintheentiredatasetwhoserankis>y1.Inotherwordsthereare N-(p -p-p/2) w/p= (p+p/2) w/p < 2w sequencesinthedatasetswhichhavekmerrank≤y1.Thesizeofdatatobe locallyalignedbyanyprocessoristhereforealwayslessthan2w .Duetopagelimitations,wereferto[26] for further details on the analysis of this bound. The collection of the p local ancestors at the root processorandthebroadcastoftheglobalancestorwillcostO( Llogp)communicationoverheadeach. Thereforethetotalcommunicationcostis: O (p2L) + O (plogp) + O( N/pL )+O( Llogp ) ThetotalasymptotictimecomplexityTofthealgorithmwouldbe: = O(N/p) 4+ O (N/p) L 2 + O(p2L)+ O(plogp + O(N/pL) + O(Llog p)+O(k* plogp) Computationcost Communicationcost = O(( N/p ) 4 + (N/p ) L2 + (p2L) + (N/pL ))

4.PerformanceEvaluation The performance evaluation of the SampleAlignD AlgorithmiscarriedoutonaBeowulfClusterconsisting of16Pentium IIIprocessors,eachrunningat550MHz, with2levelsofcache(L1:16KandL2:512K),and384 MBDRAMmemory.Asfortheinterconnectionnetwork, Fig.3. Distributionofkmerrankofthesequencesused thesystemusesIntelGigabitNIC’soneachclusternode. intheexperiments. Theoperatingsystemoneachnodeis RedHat 7.3 Observed timing for Algorithm for sequences of (Kernel level: 2.4.2028.7). For performance evaluation we N=5000,10000,20000 haveusedbothsyntheticandrealdatasets. 1400 Toinvestigatetheresourcerequirementsandtheexecution 1200 1000 5000 sequences timeondifferentsizeinputswehaveusedthesyntheticdata 800 10000 sequences 600 setgeneratedusingrosesequencegenerator[14].Threesetsof 20000 sequences 400 sequences(N=5000,10000,and20000)weregeneratedusing Time(Seconds) 200 the standard input parameters for the rose generator. The 0 averagesequencelengthwassettobe300andtherelatedness 0 5 10 15 20 Processors was set to be 800. This relatedness value assured that the sequencesthusgeneratedwereinfactnotveryclosetoeach Fig.4. Scalabilityoftheexecutiontimewithrespectto otherandmayresembletherealdatasetofsequences. thenumberofprocessorsbeingused. Furthermore,intheseexperimentsitwasmadesurethatthek mer rank distribution for the sequences is in general evenly distributed. A sample of kmer rank distributionforN=5000sequencesusedinourexperimentsisshowninFig.3. After the Scalibility Performance sequences were generated 50 according to the experimental 40 30 5000 Sequences setup, the files Speed-up 10000 Sequences were divided into 20 20000 Sequences equal parts and 10 ‘placed’ on the 0 clusternodes’hard 1 4 8 12 16 drives prior to the Number of Procs experiments. Then an instance of Fig.6. Executiontimeonrandomlyselected Fig.5. SuperlinearspeedupforSample SampleAlignD 2000 sequences from the Methanosarcina Align Dwithincreasingnumberof Algorithm was Acetivorans genome . initiated on each ofthenodesandthetimerequiredfortheactualalignment was noted.We were able to align 20000 sequencesinjustaround25seconds.Therearenoreportsofaligningthishugenumberofsequencesin the literature to the best of authors’ knowledge. The best that the authors found was for N=5000 sequences[12].Tcoffee[28]isreportedtonotabletohandlemorethan10 2sequences.MAFFT[23] scriptFFTNSIisreportedtoalign5000sequencesin10minutesandMUSCLEitselfwithoutrefinement in7minutes.Itisanticipatedthatwithrefinementincluded,MUSCLEisboundtotakethesameamount asFFTNSI.ItisalsoestimatedthatCLUSTALW[27]wouldtakeapproximately1yeartoalignthese manysequences[12].AsshowninFig.4,inthecase of Sample–Align, the execution time decreases sharplywiththeincreaseinthenumberofprocessors.WegotsuperlinearspeedupfortheSample AlignDandtheobservedspeedupcurvesareshowninFig.5.Thisisprimarilybecausethecomputation complexitydecreasesbyO( p4)withtheincreaseinnumberofprocessor.Itcanbeobserved,however, thatforthedatasetsofN=5000and10000,thespeedupcurvegoesupfor4,8and12processorsbut deteriorateswhenallthe16processorsareused.Theslowdownsindicatethatthegranularityofwork assigned to each processor decreases. With the increase of the dataset to 20000, we get much better speedupcurves. WehavealsoexperimentedwithrealproteinsequencesfromtheMethanosarcina acetivorans genome. Fig.6depictstheexecutiontimeofaligningrandomlyselected2000sequencesfromthe Methanosarcina acetivorans genome usingdifferentnumberofprocessors.. Notethatittookmorethan23hourstoalign thissetofsequencesusingthesequentialMUSCLEsystemonasingleclusternode,whereasittookonly 9.82minutestoalignthesamenumberofsequencesusingtheproposedSampleAlignDalgorithm.This isa142foldspeedupusing16processors!

4.1.QualityAssessment WehaveusedthePREFABbenchmarktoassessthequality of the alignments produced by Sample AlignD AlgorithmWe haveusedtheaccuracy measureQ[3],definedas,thenumberofcorrectly alignedresiduepairsdividedbythenumberofresiduepairs METHOD QScore inthereferencealignment.Theresultsfromthisbenchmark SampleAlignD 0.544 are presented belowinTable 2. Forthe SampleAlgorithm, MUSCLE 0.645 results correspond to execution on a 4 processor cluster MUSCLEp 0.634 system. TCoffee 0.615 Ascanbeseenfromthevaluesabovethatourmethod NWNSI 0.615 providesqualityofalignmentcomparabletotheother well FFTNSI 0.591 known methods. For PREFAB, SampleAlignD Algorithm CLUSTALW 0.563 yieldsqualityveryclosetothatofCLUSTALW. Table2.QScoresobtainedforeachmethod usingPREFAB. Itshouldbenotedthatwearegettingquality 1comparabletoCLUSTALWandexecutiontimebetter thanMUSCLEitself.Webelievethatqualityismuchhigherinourcasewhenthesequencesarelargein number. However, due to the absence of large size benchmark datasets we cannot report supporting resultsatthetimeofthispublication.InthecaseofPREFAB,itcontains1000setofapprox.2030 sequenceseach.Eachsetisalignedindependentlyoftheothersetstoaccessthequalityofthealignment programonmultiplesetofsequencesofvaryingdivergence.InthecaseoftheproposedSampleAlignD Algorithm, partitioning each set of 20 to 30 sequences even on a 4 processorsystemistoofinegraintoaccessthetruequalityofalignment. Moredetailedqualityanalysisresultswillbepresentedinthefullversion of this paper. A snap shot of the alignment produced by the Sample AlignDAlgorithmforthegenomesequencesisgiveninFig.7.

5 ConclusionsandFutureResearch Wehaveaddressedthelongstandingproblemofaligninglargenumber ofmultiplesequencesinareasonableamountoftime.Wehaveaddressed the problem using a completely distributed approach to the problem, usinghighscalabletechniquessimilartosamplesort.Thesequencesare distributedamongtheprocessorsaccordingtokmerrankandarealigned in a distributed manner independently of the other sequences. The independently aligned sequences are then aligned with the global ancestorasisdescribedinthepaper.Thesequencesarethenjoinedina root processor giving a meaningful alignment. Our results show super linearspeedupwithcomparablequalityofalignment. Currentlyweareworkingonaccessingthequalityofthemethodusing otherstandardbenchmarkssuchasBAliBASE,SMARTandSABmark. It must be noted however, that these benchmarks are not designed to accessthequalityofthealignmentsproducedinadistributedmannerand the size of these benchmarks may limit the accuracy of the quality accessed.Therefore,it would be desirableto develop benchmarks that maybeusedtoaccessthequalityofthesealignmentsformulatedusing distributedsystemsasours.Weareworkingonsequentialheuristicsto improve the quality of the alignment produced from the methods discussed above. These heuristics may then be further parallelized to incorporateinthedistributedapproachpresentedinthispaper. Anotherareathatisinterestingtoexploreisthekindofsequencesand Fig. 7. A Snap shot of the alignment families that may be aligned using the methods discussed, with producedbySampleAlignAlgorithmfor reasonableaccuracy.Itcanbeseenhoweverthattheremightalwaysbea the sequences in the Methanosarcina needtorefinethe‘global’multiplesequencealignmentforsomeofthe acetivorans genome most divergent familiesand sequences. An efficientmethod to dothat with small time complexity would be necessary for some families of sequences being aligned, thus makingthesystemworkingforawiderangeoffamiliesbutstillkeepingtheadvantageofhighscalability andperformance.

1Someofthesequencescoreswerediscardedintheautomaticqualityestimationprocess. References:

[1] L. Wang and T. Jiang, “On the Complexity of Multiple Sequence Alignment,” Journal of Computational Biology ,1(4):337–348,1994 [2] V. Kumar, A.Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Design and AnalysisofAlgorithms,TheBenjamin/CummingsPublishingCompany,Inc.,2 nd edition2006. [3] R..Edgar,MUSCLE:MultipleSequenceAlignmentwithHighAccuracyandHighThroughput, Nucleic Acids Research ,Vol.32,No.5,2004. [4] Jeorslav Zola: Parallel Server for Multiple Sequence Alignment, Thèse de Doctorat, spécialité informatique,InstitutNationalPolytechniquedeGrenoble,December2005. [5] J. Cheetham et al, “Parallel CLUSTALW for PC Clusters,” Computational Science and Its Applications – ICCSA,Springerlink,2003. [6] D.Mikhailov,H.Cofer,andR.Gomperts,ParallelClustalW,HTClustal,andMULTICLUSTAL, www.sgi.com/solutions/sciences/chembio [7] X.Deng,E.Li,J.Shan,andW.Chen,“ParallelImplementationandPerformanceCharacterization ofMUSCLE, International Parallel and Distributed Processing Symposium (IPDPS) ,2006. [8] J.Zola,D.Trystram,A.Tchernykh,andC.Brizuela,“ParallelMultipleSequenceAlignmentwith Local Phylogeny Search by Simulated Annealing,” Sixth IEEE International Workshop on High Performance Computational Biology .March2007. [9] K. Chaichoompu, S. Kittitornkun, and S. Tongsima, “MTClustalW: Multithreading Multiple Sequence Alignment,” Sixth IEEE International Workshop on High Performance Computational Biology, March2007. [10] M.Schmollinger,K.Nieselt,M.Kaufmann,andB.Morgenstern,“DIALIGNP:Fastpairwiseand MultipleSequenceAlignmentusingParallelProcessors,”BMC ,2004. [11] I.M. Wallace et al, “Evaluation of Iterative Algorithms for Multiple Sequence Alignment,” Bioinformatics Oxford Journal ,Vol.21No.8,2005. [12] R.C.Edgar,“MUSCLE:AMultipleSequenceAlignmentMethodwithReducedTimeandSpace Complexity,”BMC Bioinformatics ,5:113,2004. [13] W.D. Frazer and A.C. McKellar, “Samplesort: A Sampling Approach to Minimal Storage Tree Sorting,”J. ACM ,17(3):496–507,1970. [14] J. Stoye, D. Evers, and F. Meyer, “Rose: Generating Sequence Families,” Bioinformatics , 14, 157±163,1998 [15] R.C. Edgar, “Local Recognition and Distance Measures in Linear time using CompressedAminoAcidAlphabets,”Nucleic Acids Res., 32(1):380385.2004. [16] S.Hambrusch,F.Hameed,andA.Khokhar,“CommunicationOperationsonCoarsegrainedMesh Architectures,”Parallel Computing ,Vol.21,pp.731751,1995 [17] R.C. Edgar, “Local Homology Recognition and Distance Measures in Linear time using CompressedAminoAcidAlphabets,”Nucleic Acids Res., 32(1):380385.2004. [18] S.F. Altschul, “Amino Acid Substitution Matrices from an Information Theoretic Perspective,” Journal of Molecular Biology ,219(3):555565,1991. [19] S.F. Altschul, T,L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped BLAST and PSIBLAST: A New Generation of Protein Search Programs,” Nucleic Acids Research, Vol.25,No.17,3389–3402,1997. [20] S. Hambrusch, and A. Khokhar, “C3: An Architecture Independent Model for Coarsegrained ParallelMachines,”Proceedings of IEEE Symposium on Parallel and Distributed Processing ,pp. 544551,October,1994 [21] T.F.SmithandM.S.Waterman,“IdentificationofCommonMolecularSubsequences,”Journal of Molecular Biology ,147(1):195197.1981 [22] F.Zhang,X.Z.Qiao,andZ.Y.Liu,“AParallelSmithWatermanAlgorithmBasedonDivideand Conquer,”Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP),2002. [23] K. Katoh,K.Misawa,K.Kuma,andT.Miyata,“MAFFT:A Novel Method for Rapid Multiple Sequence Alignment based on Fast Fourier Transform,” Nucleic Acids Res., 30(14):30593066, 2002. [24] Shahid Bokhari and David M. Nicol, “Balancing Contention and Synchronization on the Intel Paragon”,IEEEConcurrency,1997 [25] J.S.Huang,andY.C.Chow,“ParallelSortingandDataPartitioningbySampling,”7th international Computer Software and Applications Conference ,1983,627631 [26] H. Shi and J. Schaeffer, “Parallel Sorting by Regular Sampling,” Journal of Parallel and Distributed Computing ,vol.14,no.4,pp.361372,1992. [27] J.D. Thompson, D.G. Higgins, and T.J Gibson, CLUSTALW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Positionsspecific Gap PenaltiesandWeightMatrixChoice,”Nucleic Acids Research ,22:46734680,1997. [28] C.Notredame,D.Higgins,andJ.Heringa,“TCoffee: A Novel Method for Multiple Sequence Alignments,”Journal of Molecular Biology ,Vol.302,pp205217,2000. [29] C.B. Do, M.S.P. Mahabhashyam, M. Brudno, and S. Batzoglou, “PROBCONS: Probabilistic ConsistencybasedMultipleSequenceAlignment,”Genome Research 15:330340,2005. [30] M. Brudno, A. Poliakov, A. Salamov, G.M. Cooper, A. Sidow, E.M. Rubin, V. Solovyev, S. Batzoglou,andI.Dubchak,“AutomatedWholeGenomeMultipleAlignmentofRat,Mouse,and ,” Genome Res .,14:685–692,2004. [31] J.E.Galaganet.al,“TheGenomeofM.AcetivoransRevealsExtensiveMetabolicandPhysiological Diversity,”Genome Res .,12:532542,Apr2002. [32] J.Felsenstein,“EvolutionaryTreesfromDNASequences:AMaximumLikelihoodApproach,”J. Mol. Evol. 17: 368–376,1981. [33] N. Bray and L. Pachter, “MAVID: Constrained Ancestral Alignment of Multiple Sequences”, GenomeResearch,14:693699,2004. [34] Fahad Saeed and Ashfaq Khokhar, “SampleAlign: A Highly Parallel Approach to Multiple SequenceAlignmentonDistributedComputingPlatforms”SubmittedtoRECOMB2008 [35]FahadSaeedandAshfaqKhokhar,“Sampler:Aphylogeneticsamplingbasedsystemforhigh performancemultiplesequencealignment”,submittedtoISBRA2008