A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on Fpgas

A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs Nae-Chyun Chen, Yu-Cheng Li and Yi-Chang Lu Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, 10617 Email: [email protected], [email protected], [email protected] Abstract—FM-index is an efficient data structure for string of FM-index has much higher time overhead in comparison search and is widely used in next-generation sequencing (NGS) with traditional algorithms. Therefore, on-chip indexing be- applications such as sequence alignment and de novo assembly. comes necessary and important. Our previous research [11] Recently, FM-indexing is even performed down to the read level, raising a demand of an efficient algorithm for FM-index has demonstrated an FM-index constructor with a lightweight construction. In this work, we propose a hardware-compatible iterative algorithm [10] with an ASIC, but the cost of the Self-Aided Incremental Indexing (SAII) algorithm and its hard- indexer is still high when compared to the work proposed ware architecture. This novel algorithm builds FM-index with here. no memory overhead, and the hardware system for realizing In this work, we propose a novel memory efficient FM- the algorithm can be very compact. Parallel architecture and a special prefetch controller is designed to enhance computational index construction algorithm, Self-Aided Incremental Indexing efficiency. An SAII-based FM-index constructor is implemented (SAII), which is suitable for hardware realization. This algo- on an Altera Stratix V FPGA board. The presented constructor rithm builds the FM-index incrementally. In each iteration, it can support DNA sequences of sizes up to 131,072-bp, which utilizes a meta-index to construct the complete index. Since is enough for small-scale references and reads obtained from SAII makes use of the FM-index itself for construction, it has current major platforms. Because the proposed constructor needs very few hardware resource, it can be easily integrated into no memory overhead for FM-index-based applications. Only different hardware accelerators designed for FM-index-based few computational logic units are needed. In our hardware applications. system, the processing speed is accelerated by a special Keywords-next-generation sequencing, FM-index, Burrows- prefetch mechanism and a parallel architecture. We choose Wheeler transform, self-aided index construction Altera Stratix V FPGA as the evaluation platform. The SAII FM-index constructor is very compact in terms of logic usage, I. INTRODUCTION so it can be integrated with other functional blocks to form a Burrows-Wheeler transform (BWT) [1] is a string rearrange- complete hardware pipeline in emerging NGS applications. ment algorithm first proposed for data compression. Making use of the properties between BWT and its original string, II. BACKGROUND FM-index [2] is designed for efficient string searching. For A. Burrows-Wheeler Transform a certain string, the data structure of its FM-index contains the BWT, the suffix array (SA) and two auxiliary tables of To construct the BWT of target sequence X, a simple the target string. With its high efficiency in both time and method is via the translation of suffix array (SA) with Eq. 1. memory, FM-index is widely adopted by many next-generation Also, BWT can be obtained by collecting all characters in the arXiv:2102.03045v1 [cs.AR] 5 Feb 2021 sequencing (NGS) applications [3], [4], [5], [6]. last column of sorted suffixes. Fig. 1 shows an example of the For FM-index-based sequence aligners such as BWA- BWT of sequence X=ACGATTG$, where character $ is the backtrack [3] and Bowtie2 [5], the reference sequence is end-of-string character. The lexical order is $<A<C<G<T. indexed for fast read-locating processes. For this kind of ( X[SA[i] − 1] if SA[i] > 0 aligners, because only the reference sequence needs to be BW T [i] = (1) indexed, the hardware accelerators ([7], [8]) usually build $ if SA[i] = 0 the FM-index externally using CPUs. Since the length of a reference sequence can be as large as three billion base pairs (bps), the memory usage of naive index constructing method B. FM-index and Backward Search Algorithm is unaffordable. Therefore, how to reduce memory usage in FM-index is extended from BWT and suffix array, with two FM-indexing becomes an important issue [9], [10]. auxiliary tables—C array and O table. The definitions for C Recently, some de novo assemblers ([6], for example) apply array and O table are shown in Eq. 2 and 3, where n is the FM-index for overlap finding of reads. Also, the reference length of target sequence X. and reads are both indexed in new sequence aligners such as BWA-SW [4]. For these applications, external construction C(a) ≡ sizef0 ≤ j ≤ n − 2 : X[j] < ag (2) Sorted suffixes table O table C array SA index BWT SA index BWT 0 $ A G O(T,0)=0 0 1 2 3 4 5 6 7 A C G T A C G T C(C)=1 0 $ A G 1 A C $ 7 $ A C G C T T G 0 0 1 0 0 1 3 5 1 A C $ 2 C G A 1+1 2 C G A 0 A C G C T T G $ 0 0 1 0 C(T)=5 R(CT) O(C,5)=1 3 C T G R(CT) 3 C T G 1 C G C T T G $ A 1 0 1 0 2 O(C,7)=2 4 G $ T 4 G $ T O(T,7)=2 3 C T T G $ A C G 1 0 2 0 5 G C C 5 G C C 0+1 6 G $ A C G C T T 1 0 2 1 R(T) 6 T G T 6 T G T 2 G C T T G $ A C 1 1 2 1 2 R(T) 7 T T C 7 T T C 5 T G $ A C G C T 1 1 2 2 (a) (b) 4 T T G $ A C G C 1 2 2 2 SA BWT Fig. 2. An example of searching query sequence CT on indexed target string X=ACGCTTG$. (a) and (b) show the first and second iteration respectively. Fig. 1. The FM-index of target string ACGCTTG$. This data structure includes SA; BW T; C array and O table. BW T is the last column of the sorted suffixes table. and follows Eq. 6. Therefore, we can insert ai−1 into the FM- index of L, generating the new FM-index of ai−1L without ( sorting the whole string all over again. sizef0 ≤ j ≤ i : BW T [j] = ag; i ≥ 0: O(a; i) ≡ The algorithm of SAII is shown in Alg. 1, and an example 0; otherwise. of a target sequence ACGCT$ is provided in Fig. 3. In the (3) first iteration, the initial character is $ and the BWT is also $. With C array and O table, the position of query aW can The corresponding O table and R are calculated. In the second be efficiently located within a lower bound R(aW ) and upper iteration, a new character T is added to the target sequence. bound R(aW ) as shown in Eq. 4 and 5. Then we use O and C obtained in the previous iteration to R(aW ) = C(a) + O(a; R(W ) − 1) + 1 (4) calculate the new R, and the updated $ is inserted to this R position to form a new BW T . With this updated BW T , we R(aW ) = C(a) + O(a; R(W )) (5) recalculate O and C for this iteration. Then SAII is ready In [2], Ferragina and Manzini have proved that R(aW ) ≤ to move on to next iteration. After all the characters in the R(aW ) if and only if aW is a substring of X. This searching target sequence are read, the corresponding FM-index of the algorithm starts from the end of the query sequence and reference is correctly built. extends iteratively. Therefore it is also called backward search Because the construction process is entirely based on FM- algorithm. index itself, nearly no extra computational resources are needed and the memory overhead is zero for FM-index- III. SELF-AIDED INCREMENTAL INDEXING (SAII) based applications. The time complexity of SAII algorithm ALGORITHM is O(n2k−1), where k is completeness of the O table. The An example of backward search algorithm is shown in details of k is given in Sec. IV-B. Fig. 2. The initial values fR(;); R(;)g are set at f0; n − 1g. Here we discuss the mathematical insights of the lower bound Algorithm 1 Self-Aided Incremental Indexing Algorithm in backward search algorithm. R(aW ) is the sum of C(a), Require: target sequence X O(a; R(W ) − 1) and 1. C(a) records the total number of Ensure: BW T , C array and O table characters lexically smaller than a in target sequence X. The 1: Initialize C array and O table O(a; R(W ) − 1) term is the occurrence of aW in X. With an 2: n length (X) additional offset, R(aW ) represents the suffix array index of 3: BW T $ lexically smallest aW sequence. Similar concept can be used 4: q 0 to account for the upper bound. If aW only occurs once in 5: for i from n − 2 to 0 do X, R(aW ) is equal to R(aW ). It is also of interest that what 6: BW T [q] X[i] would happen if aW is not a substring of X.

Load more