BASS: Approximate Search on Large String Databases
Jiong Yang (UIUC), Wei Wang (UNC Chapel Hill), Philip Yu (IBM)

Abstract

In this paper, we study the problem of how to build an index structure for large string databases that efficiently supports various types of string matching without the necessity of mapping the substrings to a numerical space (e.g., string B-tree and MRS-index) or the restriction to in-memory practice (e.g., suffix tree and suffix array). Towards this goal, we propose a new indexing scheme, the BASS-tree, to efficiently support general approximate substring match (in terms of certain symbol substitutions and misalignments) in sublinear time on a large string database. The key idea behind the design is that all positions in each string are grouped recursively into a fully balanced tree according to the similarities of the subsequent segments starting at those positions. Each node is labeled with a regular expression that describes the commonality of the substrings indexed through the subtree. Any search can then be quickly directed to the portion of the database with a high potential of matching. With the BASS-tree in place, wild card(s) in the query pattern can also be handled in a seamless way. In addition, the search for a long pattern can be decomposed into a series of searches for short segments followed by a process to join the results. Our experiments demonstrate that the BASS-tree can improve performance by an order of magnitude over alternative methods.

1 Introduction

String data naturally exists in many applications, including web documents, E-Commerce data, event sequences, and biological sequences, which generally involve very large databases whose size still grows exponentially. Searching on these large string databases has become a common practice, and the need for effective string indexes has become urgent. For example, biologists frequently need to search for similar samples of a certain DNA or protein region in a massive database of decoded sequences. Nowadays, the amount of mapped sequences exceeds 30 Giga base pairs and still grows at an exponential pace. However, for lack of an effective index, flat files continue to be used as the standard format to store these huge biological sequences. Searching on these sequences is usually carried out by sequentially scanning the entire database using a screening approach to identify the set of desired sequences (e.g., BLAST [1, 2]). This approach inevitably suffers from a prolonged response time when dealing with a large amount of sequences.

Similarity search on a string database can be classified into two categories: exact match and approximate match. The search for an exact match looks for substrings in the database that are exactly identical to the query pattern, while the search for an approximate match allows some types of imperfection, such as substitutions between certain symbols, some degree of misalignment, and the presence of "wild-cards" in the query pattern. Supporting approximate match is very important to many applications. For instance, biologists have observed that mutations between certain pairs of amino acids may occur at a noticeable probability in some proteins, and such a mutation usually does not alter the biological function of the proteins.

Until very recently, many researchers have suggested using the suffix tree or the suffix array as an index of the string database [10, 14, 19, 25, 30]. However, this approach may not be the ideal choice due to the high I/O costs associated with constructing disk-resident structures [10, 14] and performing truly approximate match queries. Recently proposed indexes such as the String B-tree [11], the String R-tree [15], and the MRS-index [17] transform the strings into a numerical space where a traditional multidimensional index structure is applied. Even though these techniques have proved successful in their own application domains, the complicated relationship between symbols has not been taken into full consideration. The String B-tree and String R-tree were designed for exact match and hence do not provide efficient support for approximate match allowing symbol substitutions, while the MRS-index does not allow different substitutions to be distinguished by associating them with different scores.

Therefore, in this paper, we focus on the issue of supporting symbol substitution in searching for approximate matches. A score is assigned to each pair of symbols to represent the functional similarity between the two symbols. The higher the score, the greater the similarity. A so-called score matrix is used to organize the set of scores. The score matrix is usually obtained through ample empirical studies and proper analytical inferences¹. Figure 1 shows a score matrix, BLOSUM 50 [13], defined on the set of amino acids. Since their publication [13], the BLOSUM (BLOcks SUbstitution Matrix) score matrices have been the most popular scoring schemes used in evaluating the similarities between protein sequences. The entries on the diagonal correspond to identical amino acid pairs and have high positive scores. Entries in the rest of the matrix specify scores for substitution operations².

¹ There is an extensive literature [8, 12] concerning how credible scores should be obtained and justified.

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R   -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N   -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D   -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C   -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q   -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E   -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G    0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H   -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I   -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L   -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K   -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M   -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F   -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P   -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S    1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W   -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y   -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V    0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5

Figure 1. The BLOSUM 50 Score Matrix

The similarity search can then be posed by specifying a string/pattern and a similarity threshold, where the similarity threshold is used to define the separation between the approximate matches of the pattern and mismatches. The higher the similarity threshold, the more restrictive the approximate match. This model of

We have applied the BASS-tree index on a protein database and have been able to achieve at least an order of magnitude speed-up over other schemes. In this paper, we will use the protein database as an example to present the structure of the BASS-tree and its potential advantage in supporting approximate match search, although the BASS-tree can be applied to other types of strings.

The remainder of this paper is organized as follows. Section 2 gives an overview of some related work. The approximate match query is formally introduced in Section 3 and the model of the BASS-tree is presented in Section 4. Section 5 describes the processing of approximate queries using the BASS-tree. An extensive experimental study is provided in Section 6 before the final conclusions are drawn in Section 7.

2 Related Work

In this section, we give a brief overview of previous research on approximate matching and on indexing string databases. Interested readers may refer to the individual papers for detailed discussions.

Most previous work on approximate string match falls into three categories. The first category of approaches utilizes a filtering algorithm to quickly eliminate large parts of the strings. The filter used is usually a simpler and necessary condition of the matching criterion. A family of algorithms that fall in this category use the notion of q-gram [22, 26, 27, 28].
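To illustrate how a filter acts as a simpler, necessary condition of the matching criterion, the following is a minimal sketch of a classic q-gram counting filter. It is not the BASS-tree, nor a reproduction of any specific algorithm in [22, 26, 27, 28]. It assumes, for simplicity, that only substitutions are allowed (at most k of them) and that all candidate windows have the same length m as the pattern. Under these assumptions, each substitution can destroy at most q of the pattern's m - q + 1 q-grams, so any true match must share at least m - q + 1 - k*q q-grams with the pattern. All function names are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """Return the multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Size of the multiset intersection of the q-grams of a and b."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filtered_search(text, pattern, k, q):
    """Find windows of text within k substitutions of pattern.

    Returns (matches, verified): the start offsets of the true matches
    and the number of windows that survived the filter and had to be
    verified. Assumption: substitutions only, no insertions/deletions.
    """
    m = len(pattern)
    # Necessary condition: a window with at most k substitutions keeps
    # at least this many of the pattern's q-grams. If the bound is <= 0
    # the filter is vacuous and every window is verified.
    threshold = m - q + 1 - k * q
    matches, verified = [], 0
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if shared_qgrams(window, pattern, q) < threshold:
            continue  # cheap rejection: cannot be a match
        verified += 1
        if hamming(window, pattern) <= k:  # exact (expensive) check
            matches.append(start)
    return matches, verified


if __name__ == "__main__":
    text = "ACGTACGTTTGACCA"
    print(filtered_search(text, "ACGTAC", k=1, q=2))
```

The filter never discards a true match, because the q-gram bound is only a necessary condition; it merely reduces the number of windows that reach the verification step. Recomputing the q-gram multiset for every window keeps the sketch short; a practical filter would update the counts incrementally as the window slides.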