Fast Sequence Database Search Using Intra-Sequence Parallelized Smith-Waterman Algorithm
Chung-E Wang
Department of Computer Science
California State University, Sacramento
Sacramento, CA 95819-6021
[email protected]

Abstract - In this paper, we describe a fully pipelined VLSI architecture for sequence database search using the Smith-Waterman algorithm. The architecture makes use of the principles of parallelism and pipelining to the greatest extent in order to take advantage of both intra-sequence and inter-sequence parallelization and to obtain high speed and throughput. First, we describe a parallel Smith-Waterman algorithm for general SIMD computers. The parallel algorithm has an execution time of O(m+n), where m and n are the lengths of the two biological sequences to be aligned. Next, we propose a VLSI implementation of the parallel algorithm. Finally, we incorporate a pipeline architecture into the proposed VLSI circuit, resulting in a pipeline processor that can do sequence database searches at the speed of O(m+n+L), where L is the number of sequences in the database.

Keywords: Sequence alignment, sequence database search, Smith-Waterman algorithm, parallel algorithm, VLSI circuit, pipelined architecture.

1. Introduction

A sequence alignment is a way of matching two biological sequences to identify regions of similarity that may be a consequence of functional, structural or evolutionary relationships between the two biological sequences. Sequence alignment is a fundamental operation of many bioinformatics applications such as genome assembly, sequence database search, multiple sequence alignment, and short read mapping.

The Smith-Waterman algorithm [1] is the most sensitive, but slow, algorithm for performing sequence alignment. Here sensitivity refers to the ability to find the optimal alignment. The Smith-Waterman algorithm requires O(m²n) computational steps, where m and n are the lengths of the two sequences to be aligned. The Smith-Waterman algorithm was later improved by Gotoh [2]. Gotoh's algorithm can find an optimal alignment of two biological sequences in O(mn) computational steps. That was a great improvement for aligning two sequences, but it is still not fast enough for sequence database searches. A sequence database search compares a query sequence with a database of sequences and identifies the database sequences that resemble the query sequence above a certain threshold.

Due to substantial improvements in multiprocessing systems and the rise of multi-core processors, parallel processing became a trend for accelerating the Smith-Waterman algorithm and sequence database searches. Many enhancements of the Smith-Waterman algorithm based on the idea of parallel processing have been presented [3-18]. However, all of these enhancements speed up the Smith-Waterman algorithm only by a constant factor; that is, they still require an execution time of O(mn) to align two biological sequences.

Besides parallel processing, heuristic methods are another commonly used approach for speeding up sequence database searches. A heuristic method is a method which is able to produce an acceptable solution to a problem but for which there is no proof that the solution is optimal. Heuristic methods are intended to gain computational performance, potentially at the cost of accuracy or precision. Popular alignment search tools such as FASTA [19], BLAST [20] and BLAT [21] are in this category. They do successfully gain speed, but their sensitivity is compromised. For sequence database searches, sensitivity refers to the ability to find all database sequences that resemble the query sequence above a threshold.

Without sacrificing any sensitivity, in this paper we first describe a parallel Smith-Waterman algorithm for general SIMD (Single Instruction stream - Multiple Data stream) computers. This parallel algorithm requires an execution time of O(m+n) to align two biological sequences. Second, we propose a VLSI (Very-Large-Scale Integration) implementation of the parallel algorithm. Finally, we use the pipeline technique to overlap the execution times of alignment checking of database sequences. The resulting pipeline has a throughput rate of O(1) execution time per database sequence. Consequently, the time complexity of the proposed pipeline processor is O(m+n+L), where L is the number of sequences in the database.

2. The Smith-Waterman Algorithm

The Smith-Waterman algorithm is used to compute the optimal local-alignment score. Let A = a1 a2 ... am and B = b1 b2 ... bn be the two sequences to be aligned. A weight w(ai, bj) is defined for every pair of residues ai and bj. Usually w(ai, bj) <= 0 if ai ≠ bj, and w(ai, bj) > 0 if ai = bj. The penalties for starting a gap and continuing a gap are defined as ginit and gext respectively. The optimal local-alignment score S can be computed by the following recursive relations:

Ei,j = max { Ei,j-1 - gext, Hi,j-1 - ginit }                  (1)
Fi,j = max { Fi-1,j - gext, Hi-1,j - ginit }                  (2)
Hi,j = max { 0, Ei,j, Fi,j, Hi-1,j-1 + w(ai, bj) }            (3)
S = max { Hi,j }                                              (4)

The values of Ei,j, Fi,j and Hi,j are 0 when i < 1 or j < 1.

The Smith-Waterman algorithm is a dynamic programming algorithm. A dynamic programming algorithm is an algorithm that stores the results of certain calculations in a data structure, which are later used in subsequent calculations. The Smith-Waterman algorithm uses an m×n matrix, called the alignment matrix, to store and compute the H, E, and F values column by column.
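To make relations (1)-(4) concrete, the following minimal C sketch fills the alignment matrix sequentially, column by column. The weight function w(), the penalty values GINIT and GEXT, and the example sequences in main() are illustrative assumptions for demonstration, not values taken from the paper.

/* A minimal, illustrative C sketch of relations (1)-(4): the sequential
 * Smith-Waterman computation with affine gap penalties. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GINIT 3   /* assumed gap-open penalty   (ginit) */
#define GEXT  1   /* assumed gap-extend penalty (gext)  */

static int w(char a, char b) { return a == b ? 2 : -1; }  /* assumed weights */
static int max2(int x, int y) { return x > y ? x : y; }

/* Returns the optimal local-alignment score S for A (length m) and B (length n). */
int smith_waterman(const char *A, int m, const char *B, int n)
{
    /* (m+1) x (n+1) matrices; row 0 and column 0 form the zero boundary. */
    int stride = n + 1;
    int *E = calloc((size_t)(m + 1) * stride, sizeof *E);
    int *F = calloc((size_t)(m + 1) * stride, sizeof *F);
    int *H = calloc((size_t)(m + 1) * stride, sizeof *H);
    int S = 0;

    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= n; ++j) {
            E[i*stride + j] = max2(E[i*stride + j-1] - GEXT,          /* (1) */
                                   H[i*stride + j-1] - GINIT);
            F[i*stride + j] = max2(F[(i-1)*stride + j] - GEXT,        /* (2) */
                                   H[(i-1)*stride + j] - GINIT);
            int h = max2(0, max2(E[i*stride + j], F[i*stride + j]));
            h = max2(h, H[(i-1)*stride + (j-1)] + w(A[i-1], B[j-1])); /* (3) */
            H[i*stride + j] = h;
            if (h > S) S = h;                                         /* (4) */
        }
    }
    free(E); free(F); free(H);
    return S;
}

int main(void)
{
    const char *A = "HEAGAWGHEE", *B = "PAWHEAE";   /* illustrative inputs */
    printf("S = %d\n", smith_waterman(A, (int)strlen(A), B, (int)strlen(B)));
    return 0;
}

The doubly nested loop makes the O(mn) cost of the sequential computation visible: every cell of the alignment matrix is visited exactly once.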
3. An Intra-sequence Parallelized Smith-Waterman Algorithm For SIMD Computers

Intra-sequence parallelization means the parallelization is within a single pair of sequences, in contrast to inter-sequence parallelization, where the parallelization is carried out across multiple pairs of sequences.

Figure 1 shows the computational dependencies of the Smith-Waterman alignment matrix. Most of the existing parallelized Smith-Waterman algorithms in the literature originate from this dependency structure. Since cells on a bottom-left to top-right diagonal have the same sum of indices, we number those diagonals with their sums of indices. Noticeably, cells of a diagonal do not depend on cells of the same diagonal and thus can be computed simultaneously. Furthermore, cells of a diagonal only depend on cells of the previous two diagonals. So, the basic idea of the parallel algorithm is to fill the matrix diagonal by diagonal, starting from the top-left corner. Moreover, cells of a diagonal are filled simultaneously with multiple processors.

Fig. 1. Computational dependencies in the Smith-Waterman alignment matrix.

The pseudo code of the parallel algorithm is given in Figure 2. Note that in the pseudo code, we use the in-parallel statement to indicate things to be done by processors simultaneously. The in-parallel statement has the following syntax:

In parallel, all processor i, lo<=i<=hi do {
  …
}

Only processors with processor numbers between lo and hi are activated. Also note that in the pseudo code, ji is a variable local to processor i; in other words, different processors have different j's. Furthermore, the array maxH is used for the processors to keep track of the largest Hi,j. Since there are m+n-1 diagonals, the loop repeats m+n-1 times and thus the parallel algorithm has an execution time of O(m+n).

In parallel, all processor i, 1<=i<=m do
  E[i][0] = F[i][0] = H[i][0] = maxH[i] = 0;
In parallel, all processor i, 1<=i<=n do
  E[0][i] = F[0][i] = H[0][i] = 0;
for (diag=2; diag<=m+n; ++diag) {
  if (diag<=m+1) first = diag-1;
  else first = m;
  if (diag-n<=1) last = 1;
  else last = diag-n;
  In parallel, all processor i, first>=i>=last do {
    ji = diag - i;
    E[i][ji] = max {E[i][ji-1]-gext, H[i][ji-1]-ginit};
    F[i][ji] = max {F[i-1][ji]-gext, H[i-1][ji]-ginit};
    H[i][ji] = max {0, E[i][ji], F[i][ji], H[i-1][ji-1]+w(ai,bji)};
    if (H[i][ji]>maxH[i]) maxH[i] = H[i][ji];
  }
}
S = max { maxH[1], maxH[2], ... maxH[m] };

Fig. 2. The pseudo code of the parallel algorithm.
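For readers who want to experiment on a conventional shared-memory machine rather than a SIMD computer, the diagonal-by-diagonal scheme of Figure 2 can be mirrored with an OpenMP parallel loop over each diagonal. The sketch below is only an illustrative software analogue of the pseudo code; the penalty values and the weight function w() are assumptions, not the paper's.

/* Illustrative shared-memory analogue of Figure 2: one OpenMP parallel loop
 * per anti-diagonal in place of the in-parallel statement. */
#include <stdlib.h>

#define GINIT 3
#define GEXT  1

static int w(char a, char b) { return a == b ? 2 : -1; }
static int max2(int x, int y) { return x > y ? x : y; }

int smith_waterman_diag(const char *A, int m, const char *B, int n)
{
    int stride = n + 1;
    int *E = calloc((size_t)(m + 1) * stride, sizeof *E);
    int *F = calloc((size_t)(m + 1) * stride, sizeof *F);
    int *H = calloc((size_t)(m + 1) * stride, sizeof *H);
    int *maxH = calloc((size_t)(m + 1), sizeof *maxH);

    /* Diagonal diag holds the cells (i, j) with i + j == diag.  Cells on one
     * diagonal only read cells of the two previous diagonals, so the inner
     * loop over i can safely run in parallel. */
    for (int diag = 2; diag <= m + n; ++diag) {
        int first = (diag <= m + 1) ? diag - 1 : m;   /* largest active row  */
        int last  = (diag - n <= 1) ? 1 : diag - n;   /* smallest active row */
        #pragma omp parallel for
        for (int i = last; i <= first; ++i) {
            int j = diag - i;                          /* the paper's ji     */
            E[i*stride + j] = max2(E[i*stride + j-1] - GEXT,
                                   H[i*stride + j-1] - GINIT);
            F[i*stride + j] = max2(F[(i-1)*stride + j] - GEXT,
                                   H[(i-1)*stride + j] - GINIT);
            int h = max2(0, max2(E[i*stride + j], F[i*stride + j]));
            h = max2(h, H[(i-1)*stride + (j-1)] + w(A[i-1], B[j-1]));
            H[i*stride + j] = h;
            if (h > maxH[i]) maxH[i] = h;              /* per-row maximum    */
        }
    }
    int S = 0;                                          /* final reduction    */
    for (int i = 1; i <= m; ++i) S = max2(S, maxH[i]);
    free(E); free(F); free(H); free(maxH);
    return S;
}

This function can be substituted for smith_waterman() in the earlier sketch. Without -fopenmp the pragma is ignored and the loop simply runs sequentially; the result is the same either way, because each diagonal acts as a synchronization barrier, exactly as in the pseudo code.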
4. A VLSI Implementation

First we design a small processing element according to the recursive relations (1), (2) and (3). Next, we map our algorithm into a VLSI circuit. Additionally, in order to keep track of the largest Hi,j value, register Hi,j of processing element PEi,j is connected to register maxHi, as shown in Figure 3.
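As a purely behavioral sketch (not the circuit of Figure 3), the C function below shows what a single processing element built from relations (1)-(3) would compute in one step: it combines the E and H values arriving from the left, the F and H values from above, and the H value from the upper-left neighbor, and updates a running-maximum register. The struct, port names and maxH interface are assumptions made for illustration.

/* Behavioral sketch of one processing element PEi,j per relations (1)-(3). */
typedef struct { int E, F, H; } pe_out_t;

static int pe_max2(int x, int y) { return x > y ? x : y; }

pe_out_t pe_step(int E_left, int H_left,      /* from PE(i, j-1)    */
                 int F_up,   int H_up,        /* from PE(i-1, j)    */
                 int H_diag,                  /* from PE(i-1, j-1)  */
                 int w_ab,                    /* w(ai, bj)          */
                 int ginit, int gext,
                 int *maxH)                   /* row register maxHi */
{
    pe_out_t o;
    o.E = pe_max2(E_left - gext, H_left - ginit);        /* relation (1) */
    o.F = pe_max2(F_up   - gext, H_up   - ginit);        /* relation (2) */
    o.H = pe_max2(pe_max2(0, o.E),                       /* relation (3) */
                  pe_max2(o.F, H_diag + w_ab));
    if (o.H > *maxH) *maxH = o.H;                        /* feed maxHi   */
    return o;
}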