<<

Fast Sequence Database Search Using Intra-Sequence Parallelized Smith-Waterman Algorithm

Chung-E Wang Department of Computer Science California State University, Sacramento Sacramento, CA 95819-6021 [email protected]

Abstract - In this paper, we describe a fully Gotoh [2]. Gotoh’s algorithm can find an optimal pipelined VLSI architecture for sequence database alignment of two biological sequences in O(mn) search using Smith-Waterman algorithm. The computational steps. It was a great improvement for architecture makes use of the principles of aligning two sequences. However, it’s not fast parallelism and pipelining to the greatest extent in enough for sequence database searches. A sequence order to take advantages of both intra-sequence and database search is to compare a query sequence with inter-sequence parallelization and to obtain high a database of sequences, and identify database speed and throughput. First, we describe a parallel sequences that resemble the query sequence above a Smith-Waterman algorithm for general SIMD certain threshold. computers. The parallel algorithm has an execution Due to substantial improvements in time of O(m+n), where m and n are lengths of the multiprocessing systems and the rise of multi-core two biological sequences to be aligned. Next, we processors, parallel processing became a trend of propose a VLSI implementation of the parallel accelerating Smith-Waterman’s algorithm and algorithm. Finally, we incorporate a pipeline sequence database searches. Many enhancements of architecture in the proposed VLSI circuit and result Smith-Waterman algorithm based on the idea of in a pipeline processor that can do sequence parallel processing have been presented [3-18]. database searches at the speed of O(m+n+L), where However, all these enhancements can only speed up L is the number of sequences in the database. Smith-Waterman algorithm by a constant factor. That is, all these enhancements still require an execution Keywords: , sequence database time of O(mn) to align two biological sequences. search, Smith-Waterman algorithm, parallel Besides parallel processing, heuristic methods are algorithm, VLSI circuit, pipelined architecture. other commonly used approaches for speeding up sequence database searches. A heuristic method is a 1. Introduction method which is able to produce an acceptable solution to a problem but for which there is no proof A sequence alignment is a way of matching two that it’s an optimal solution. Heuristic methods are biological sequences to identify regions of similarity intended to gain computational performance, that may be a consequence of functional, structural or potentially at the cost of accuracy or precision. evolutionary relationships between the two biological Popular alignment search tools such as FASTA [19], sequences. Sequence alignment is a fundamental BLAST [20] and BLAT [21] are in this category. operation of many applications such They did successfully gain some speed. However, the as genome assembly, sequence database search, sensitivity is compromised. For sequence database multiple sequence alignment, and short read searches, sensitivity refers to the ability to find all mapping. database sequences that resemble the query sequence The Smith-Waterman algorithm [1] is the most above a threshold. sensitive but slow algorithm for performing sequence Without sacrificing any sensitivity, in this paper, alignment. Here sensitivity refers to the ability to find we first describe a parallel Smith-Waterman the optimal alignment. Smith-Water algorithm algorithm for general SIMD (Single Instruction requires O(m2n) computational steps, where m and n stream - Multiple Data stream) computers. This are lengths of the two sequences to be aligned. parallel algorithm requires an execution time of Smith-Waterman algorithm was later improved by O(m+n) to align two biological sequences. Second, we propose a VLSI (Very-Large-Scale Integration) implementation of the parallel algorithm. Finally, we cells of the same diagonal and thus can be computed use the pipeline technique to overlap the execution simultaneously. Furthermore, cells of a diagonal only times of alignment checking of database sequences. depend on cells of the previous two diagonals. So, the The resulting pipeline has a throughput rate of O(1) basic idea of the parallel algorithm is to fill the execution time per database sequence. Consequently, matrix diagonal by diagonal starting from the top-left the time complexity of the proposed pipeline corner. Moreover cells of a diagonal are filled processor is O(m+n+L), where L is the number of simultaneously with multiple processors. 2 3 4 5 6 7 l al al l l l sequences in the database. a n n a a a gn g g gn gn gn ia ia ia ia ia ia D D D D D D 8 al gn E1,1 E1,2 E1,3 E1,4 E1,5 E1,6 ia 2. The Smith-Waterman Algorithm D H1,1 H1,2 H1,3 H1,4 H1,5 H1,6 F1,1 F1,2 F1,3 F1,4 F1,5 F1,6 9 al E E E E E E gn 2,1 2,2 2,3 2,4 2,5 2,6 ia D The Smith-Waterman algorithm is used to H2,1 H2,2 H2,3 H2,4 H2,5 H2,6 0 F2,1 F2,2 F2,3 F2,4 F2,5 F2,6 1 al compute the optimal local-alignment score. Let A = gn E3,1 E3,2 E3,3 E3,4 E3,5 E3,6 ia D a a ... a and B = b b ... b be the two sequences to H3,1 H3,2 H3,3 H3,4 H3,5 H3,6 1 2 m 1 2 n F3,1 F3,2 F3,3 F3,4 F3,5 F3,6 be aligned. A weight w(ai, bj) is defined for every E4,1 E4,2 E4,3 E4,4 E4,5 E4,6 H4,1 H4,2 H4,3 H4,4 H4,5 H4,6 pair of residues ai and bj. Usually w(ai, bj) <= 0 if ai ≠ F4,1 F4,2 F4,3 F4,4 F4,5 F4,6 bj, and w(ai, bj) > 0 if ai = bj. The penalties for starting a gap and continuing a gap are defined as ginit Fig.Fig. 1 1.. C oComputationalmputational dep edependenciesndencies in the and gext respectively. The optimal local alignment Smith-Waterman alignment matrix. score S can be computed by the following recursive relations: The pseudo code of the parallel algorithm is given in Figure 2. Note that in the pseudo code, we use the Ei,j = max { Ei,j-1-gext, Hi,j-1-ginit } (1) in-parallel statement to indicate things to be done by Fi,j = max { Fi-1,j-gext, Hi-1,j-ginit } (2) processors simultaneously. The in-parallel statement Hi,j = max {0, Ei,j, Fi,j, Hi-1,j-1+w(ai,bj) } (3) has the following syntax: S = max { Hi,j }; (4) In parallel, all processor i, lo<=i<=hi do { The values of E , F and H are 0 when i<1 and j<1. … i,j i,j i,j } Smith-Waterman algorithm is a dynamic programming algorithm. A dynamic programming Only processors with processor numbers between lo algorithm is an algorithm that stores the results of and hi are activated. Also note that in the pseudo certain calculations in a data structure, which are later code, ji is a variable for processor i only. In other used in subsequent calculations. Smith-Waterman words, different processors have different j’s. algorithm uses a m×n matrix, called alignment Furthermore, array maxH is used for processors to matrix, to store and compute H, E, and F values keep track of the largest Hi,j. Since there are m+n-1 column by column. diagonals, the loop repeats m+n-1 times and thus the parallel algorithm has an execution time of O(m+n). 3. An Intra-sequence Parallelized In parallel, all processor i, 1<= i <=m do Smith-Waterman Algorithm For E[i][0] = F[i][0] = H[i][0] = maxH[i] = 0; In parallel, all processor i, 1<= i <=n do SIMD Computers E[0][i] = F[0][i] = H[0][j] = 0; for (diag=2; diag<=m+n; ++diag) { if (diag<=m+1) first = diag-1; Intra-sequence parallelization means the else first = m; parallelization is within a single pair of sequences, in if (diag-n<=1) last = 1; contrast to inter-sequence parallelization where the else last = diag-n; In parallel, all processor i, first>= i >= last do { parallelization is carried out across multiple pairs of ji = diag – i; sequences. E[i][ji] = max {E[i][ji-1]-gext, H[i][ji-1]-ginit}; Figure 1 shows the computational dependencies of F[i][ji] = max {F[i-1][ji]-gext, H[i-1][ji]-ginit}; Smith-Waterman alignment matrix. Most of the H[i][ji] = max {0,E[i][ji], F[i][ji], H[i-1][ji-1]+w(ai,bji)}; if (H[i][ji]>maxH[i]) maxH[i] = H[i][ji]; existing parallelized Smith-Waterman algorithms in } the literature are originated from this dependency } structure. Since cells on a bottom-left to top-right S = max { maxH[1], maxH[2], …max H[m]}; diagonal have the same sum of indices, we number those diagonals with their sums of indices. Fig. 2. The pseudo code Noticeably cells of a diagonal don’t dependent on 4. A VLSI Implementation Additionally, in order to keep track of the largest hi,j value, register Hi,j of processing element PEi,j is First we design a small processing element connected to register maxHi as shown in Figures 3. according to the recursive relations (1), (2) and (3). Next, we map our algorithm into a VLSI circuit. Processing elements will be used to implement cells Figure 4 depicts our VLSI circuit at the register level of Smith-Waterman’s alignment matrix. Thus, each for m=4 and n=6. Processing elements and registers processing element will be numbered with the indices maxHi are connected together according to recursive of its corresponding cell in Smith-Waterman’s relations (1), (2), (3) and (4). As shown in Figure 4, alignment matrix. As follows, processing element processing elements are arranged into levels such that processing elements corresponding to cells of a PEi,j will be used to compute values Ei,j, Fi,j and Hi,j. As shown in Figure 3, processing element PE diagonal are placed on the same level and thus will i,j compute their values at the same time. For the clarity consists of 3 registers Ei,j, Fi,j and Hi,j and several simple combinational circuits such as adders and of the logic of the VLSI circuit, in Figure 4, levels are comparators (for finding maximum of two or three numbered with their corresponding diagonal numbers. values). Since Hi,j depends on Ei,j and Fi,j, processing element PEi,j has two computational states and thus Ein H Fin requires two clock signals to complete its gext ginit in ginit gext computation of values Ei,j, Fi,j and Hi,j. Since the - - - - longest data path in the processing element consists MAX MAX w(ai,bj) of an adder and a comparator, the clock period, i.e. Hi,j a A B b time between each clock signal, only needs to be set i j Ei,j s1 + Fi,j s1 to the time delay caused by an adder and a 0 MAX MAX comparator. s1 s2 Each processing element PEi,j has 5 input ports to maxH i Hi,j State maxHi s2 and 3 output ports. Input ports A and B are for register clock Eout Hout Fout inputting residues ai and bj from the two sequences to be aligned. Input ports Ein, Fin and Hin are for inputting corresponding values from cells on which Fig. 3. Processing element PEi,j and register PEi,j depends. Output ports Eout, Fout and Hout are for maxHi exporting values to cells depending on PEi,j .

Level 0 0 0

Ein Hin Fin A PE1,1 B 2

Eout Hout Fout 0 0 0 0 A e c Ein Hin Fin Ein Hin Fin en u A PE2,1 B A PE1,2 B eq 3 S Eout Hout Fout Eout Hout Fout S 0 0 0 0 eq u e Ein Hin Fin Ein Hin Fin Ein Hin Fin n ce A PE3,1 B A PE2,2 B A PE1,3 B B 4

Eout Hout Fout Eout Hout Fout Eout Hout Fout

0 0 0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE4,1 B A PE3,2 B A PE2,3 B A PE1,4 B 5

Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout 0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE4,2 B A PE3,3 B A PE2,4 B A PE1,5 B 6 Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout 0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE4,3 B A PE3,4 B A PE2,5 B A PE1,6 B 7

Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout

Ein Hin Fin Ein Hin Fin Ein Hin Fin H1,j’s H2,j’s H3,j’s H4,j’s A PE4,4 B A PE3,5 B A PE2,6 B 8 Eout Hout Fout Eout Hout Fout Eout Hout Fout

MAX MAX MAX MAX Ein Hin Fin Ein Hin Fin A PE4,5 B A PE3,6 B 9

Eout Hout Fout Eout Hout Fout maxH1 maxH2 maxH3 maxH4

Ein Hin Fin A PE4,6 B 10

Eout Hout Fout MAX

Clock S

Fig. 4. The VLSI circuit, m=4, n=6. 5. A Pipeline Architecture for data, even though pipelining cannot speed up the process of a single datum. The space-time diagram in Sequence Database Searches Figure 6 reveals the advantages of the pipelined architecture in processing database sequences. The Since sequence database search is different from diagram shows the succession of the levels in the sequence alignment, first, we modify our processing pipeline with respect to time. From the diagram one element PEi,j as shown in Figure 5. Instead of keeping can observe how independent sequences are track of largest hi,j, the new processing element PEi,j processed concurrently in the pipeline. will generate a SELECT signal when its Hi,j value is Space above a threshold. Sequence 3 Level 2 Level 3 Level 4 Level 5 Level 6 Level 7 Level 8 Level 9 Level 10 Sequence 2 Level 2 Level 3 Level 4 Level 5 Level 6 Level 7 Level 8 Level 9 Level 10 Ein H Fin gext ginit in ginit gext Sequence 1 Level 2 Level 3 Level 4 Level 5 Level 6 Level 7 Level 8 Level 9 Level 10 - - - - 1 2 3 4 5 6 7 8 9 10 11 Time MAX MAX w(ai,bj) Fig. 6. Pipeline space-time diagram ai A B bj Ei,j s1 Fi,j s1 + 0 Since there are m+n-1 levels in the VLSI circuit,

threshhold MAX it’s very natural to organize the entire architecture as s1 s2 a linear pipeline with m+n-1 stages. To do so, as Selectout > Hi,j s2 State shown in Figure 7, we add m+n-1 registers to the register clock Eout Hout Fout VLSI circuit to hold database sequences one for each stage, i.e. level. Additionally, each database sequence Fig. 5. Modified processing element PE . i,j register has a S flag which will be used to indicate

whether the database sequence is selected or not. As To speedup sequence database searches, a pipelined shown in Figure 7, processing elements’ select architecture is incorporated in the VLSI circuit. signals are connected to S flags of corresponding Pipelining is a natural concept for increasing the throughput of a system when processing a stream of

CLK1 Pipeline State 0 0 0 stage register CLK2 Database Sequences Ein Hin Fin A PE B select 1,1 1 S Eout Hout Fout

0 0 0 0

Ein Hin Fin Ein Hin Fin select A PE2,1 B A PE1,2 B 2 S h1,1 Eout Hout Fout Eout Hout Fout

Q u 0 0 0 0 e ry

se Ein Hin Fin Ein Hin Fin Ein Hin Fin q u A PE B A PE B A PE B e select 3,1 2,2 1,3 n 3 S h2,1 h1,2 c Eout Hout Fout Eout Hout Fout Eout Hout Fout e

0 0 0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE B A PE B A PE B A PE B select 4,1 3,2 2,3 1,4 4 S h3,1 h2,2 h1,3 Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout

0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin 5 A PE4,2 B A PE3,3 B A PE2,4 B A PE1,5 B select h3,2 h2,3 h1,4 S Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout

0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin Ein Hin Fin select A PE4,3 B A PE3,4 B A PE2,5 B A PE1,6 B 6 S h3,3 h2,4 h1,5 Eout Hout Fout Eout Hout Fout Eout Hout Fout Eout Hout Fout

Ein Hin Fin Ein Hin Fin Ein Hin Fin select A PE4,4 B A PE3,5 B A PE2,6 B 7 S h3,4 h2,5 Eout Hout Fout Eout Hout Fout Eout Hout Fout

Ein Hin Fin Ein Hin Fin A PE4,5 B A PE3,6 B select h 8 S Eout Hout Fout 3,5 Eout Hout Fout

Ein Hin Fin A PE B select 4,6 9 S Eout Hout Fout Clock Fig. 7. The single-chip pipeline processor, m=4, n=6. database sequence registers. Furthermore, according a pipelined architecture into the VLSI circuit, we can to recursive relation (3), Hi,j depends on Hi-1,j-1 which speed-up sequence database searches without is not in the immediate previous level of Hi,j. So, for sacrificing the sensitivity. The resulting pipeline the purposes of buffering and synchronization, we processor can do sequence database searches at the add hi,j buffer registers to hold Hi,j values. Since a speed of O(m+n+L), where L is the number of processing element PEi,j needs two clock signals to sequences in the database. Moreover, we have compute its values, our pipeline takes 2 clock signals demonstrated the scalability of the pipeline processor. to move from one pipeline stage to another. Apparently as soon as the first database sequence 7. References completes its alignment checking, every one stage time there is a database sequence completes its [1] TF Smith, MS Waterman, Identification of common alignment checking. Consequently, the pipeline molecular subsequences. J Mol Biol. 147, 195–197 (1981). processor has a throughput rate of one database sequence per two clock signals. In other words, the [2] O. Gotoh, An improved algorithm for matching pipeline has a time complexity of O(1) time per biological sequences. J Mol Biol. 162, 705–708 (1982). database sequence. Moreover, since the first [3] M Borah, RS Bajwa, S Hannenhalli, MJ Irwin, A SIMD sequence takes O(m+n) time to completes its solution to the sequence comparison problem on the MGAP. alignment checking, the total time complexity of the [4] B Alpern, L Carter, KS Gatlin, Microparallelism and high pipeline processor is O(m+n+L), where L is the performance protein matching. Proceedings of the 1995 number of sequences in the database. For most bioinformatics applications, m and n are ACM/IEEE Supercomputing Conference, San Diego, in thousands and thus m×n is in millions. Since there California, Dec 3-8, 1995. are m×n processing elements in our VLSI circuit, the [5] R Hughey, Parallel hardware for sequence comparison pipeline processor requires millions of combinational and alignment. Comp Appl Biosci. 1996;12:473–479. circuits and registers. As of today, a VLSI microchip [6] A Wozniak, Using video-oriented instructions to speed can have billions of transistors. Since a register or a up sequence comparison. Comput Appl Biosci. 13, 145–150 simple combinational circuit such as adder or (1997) comparator doesn’t need thousands of transistors, billions of transistors should be enough for millions [7] T Rognes, E Seeberg, Six-fold speed-up of Smith- of our processing elements. As a result, it’s possible Waterman sequence database searches using parallel to implement the pipeline processor in a single VLSI processing on common microprocessors. Bioinformatics. 16, microchip. 699–706 (2000). For some applications such as genome assembly, [8] ITS Li, W Shum, K Truong, 160-fold acceleration of the the length of the query sequence may be more than Smith-Waterman algorithm using a field programmable gate thousands and thus requires multiple microchips to array (FPGA). BMC Bioinformatics. 8, 185 (2007). implement the pipeline processor. Figure 8 demonstrates the scalability of the pipeline processor. [9] M Farrar, Striped Smith-Waterman speeds database In circuit design, scalability refers to the ability to be searches six times over other SIMD implementations. expanded to cope with increased use. As shown in Bioinformatics. 23, 156–161 (2007). Figure 8, we decompose the pipeline circuit into three [10] Z. Nawaz, M. Shabbir, Z. Al-Ars, and K. Bertels, sub-circuits each of which can be implemented in a “Acceleration of smith-waterman using recursive variable microchip. Type 1 chip is for the head of a pipeline. expansion,” in DSD-2008, pp. 915–922, September 2008. Type 3 chip is for the tail of a pipeline. Type 2 chip is for the middle part of a pipeline. To handle long [11] A Szalkowski, C Ledergerber, P Krähenbühl, C query sequences, we simply use more type 2 chips in Dessimoz, SWPS3 - fast multithreaded vectorized Smith- the middle to lengthen the pipeline. Waterman for IBM Cell/B.E. and x86/SSE2. BMC Res Notes. 1, 107 (2008). 6. Conclusion [12] A Wirawan, CK Kwoh, NT Hieu, B Schmidt, CBESW: Sequence Alignment on the Playstation 3. BMC In this paper, we have described a O(m+n) time Bioinformatics. 9, 377 (2008). intra-sequence parallelized Smith-Water algorithm [13] W Rudnicki, A Jankowski, A Modzelewski, A for general SIMD computers, where m and n are Piotrowski, A Zadrożny, The new SIMD Implementation of lengths of the two sequences to be aligned. We have the Smith-Waterman Algorithm on Cell Microprocessor. shown a VLSI implementation of the parallel algorithm. We have also shown that by incorporating Fund Infor. 96, 181–194 (2009) [14] Y Liu, DL Maskell, B Schmidt, CUDASW++: optimizing Smith-Waterman sequence database searches for Scanning of Sequence Databases with the Smith-Waterman CUDA-enabled graphics processing units. BMC Res Notes. 2, Alg. GPU Comput. Gems, Emerald Edition. (Morgan 73 (2009). Kaufmann, 2011), pp. 155–157. [15] Ł Ligowski, WR Rudnicki, An efficient implementation [18] R Torbjorn, Faster Smith-Waterman database searches of Smith Waterman algorithm on GPU using CUDA, for with inter-sequence SIMD parallelization. BMC massively parallel scanning of sequence databases. Eighth Bioinformatics 12:221 2011. IEEE International. Workshop on High Performance [19] WR. Pearson, DJ. Lipman, Improved tools for biological , Rome, Italy, 2009. sequence comparison. PNAS, 85 (8): 2444-8. (1988) [16] Y Liu, B Schmidt, DL Maskell, CUDASW++2.0: [20] SF. Altschul, W. Gish, W. Miller, EW. Myers, DJ. enhanced Smith-Waterman protein database search on Lipman, "Basic local alignment search tool". J Mol Biol 215 CUDA_enabled GPUs based on SIMT and virtualized SIMD (3): 403–410. (October 1990) abstractions. BMC Res Notes. 3, 93 (2010). [21] WJ. Kent, "BLAT--the BLAST-like alignment tool". [17] Ł Ligowski, WR Rudnicki, Y Liu, B Schmidt, Accurate Genome research 12 (4): 656–664. (2002)

pipeline stage 0 0 0 Database Sequences Ein Hin Fin A PE1,1 B 1

Eout Hout Fout Q b u 1 e ry 0 0 0 0 s eq u e Ein Hin Fin Ein Hin Fin n c h1,1 e 2 A PE2,1 B A PE1,2 B b2 Microchip 1 Eout Hout Fout Eout Hout Fout

0 0 0 0 b3 Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE3,1 B h2,1 A PE2,2 B h1,2 A PE1,3 B 3

Eout Hout Fout Eout Hout Fout Eout Hout Fout

0 0 Q u Ein Hin Fin Ein Hin Fin Ein Hin Fin e ry A PE3,2 B h A PE2,3 B h1,3 A PE1,4 B 2,2 b4 se 4 E H F E H F E H F q out out out out out out out out out u e n c e 0 0 b5 Microchip 2 Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE B h A PE B h A PE B b2 3,3 2,3 2,4 1,4 1,5 5 Q Eout Hout Fout Eout Hout Fout Eout Hout Fout u e ry

se q u b3 e n c e

0 0

Ein Hin Fin Ein Hin Fin Ein Hin Fin A PE3,4 B h2,4 A PE2,5 B h1,5 A PE1,6 B 6 Eout Hout Fout Eout Hout Fout Eout Hout Fout

Ein Hin Fin Ein Hin Fin A PE3,5 B h2,5 A PE2,6 B 7 E H F E H F b4 out out out out out out Microchip 3

Q u e ry Ein Hin Fin se b5 q A PE3,6 B u 8 e E H F n out out out c e

b6

Fig. 8. A multiple-chip pipeline processor, m=3, n=6.