
LS-Join: Local Similarity Join on String Collections

Jiaying Wang, Xiaochun Yang, Member, IEEE, Bin Wang, and Chengfei Liu, Member, IEEE

Abstract—String similarity join, an essential operation in applications including data integration and data cleaning, has attracted significant attention in the research community. Previous studies focus on global similarity join. In this paper, we study local similarity join with edit distance constraints, which finds string pairs from two string collections that have similar substrings. We study two kinds of local similarity join problems: checking local similar pairs and locating local similar pairs. We first consider the case where, if two strings are locally similar to each other, they must share a common gram of a certain length. We show how to do efficient local similarity verification based on a matching gram pair. We propose two pruning techniques and an incremental method to further improve the efficiency of finding matching gram pairs. Then we devise a method to find the longest similar substring pair for two locally similar strings. We conducted a comprehensive experimental study to evaluate the efficiency of these techniques.

Index Terms—Local Similarity Join, Edit Distance, Similar Substrings, Filtering.

• Jiaying Wang, Xiaochun Yang, and Bin Wang are with the School of Computer Science and Engineering, Northeastern University, Liaoning 110819, China. Xiaochun Yang is the corresponding author. E-mail: [email protected]; {yangxc,bwang}@mail.neu.edu.cn.
• Chengfei Liu is with the Department of Computer Science and Software Engineering, Swinburne University of Technology, Australia. E-mail: [email protected].

1 INTRODUCTION

The problem of similarity join, which is to find similar string pairs from two string collections, is relevant to many data cleaning and data integration applications [11], [24]. Various functions can be used to quantify the similarity between two strings, such as edit distance and Jaccard. Many approaches [2], [15], [17], [19], [22], [28], [31] have been developed to solve this problem.

Existing studies focus on global similarity join. However, in applications such as data integration [13] and bioinformatics [1], it is often important to find similar substring pairs, even if two strings are not similar globally. The following are two motivating examples.

Example 1. In data integration, users often want to match the same entity from different sources. Fig. 1 shows an example of an online shopping mall's purchase list and a supply list from its supplier. Record r1 in Fig. 1(a) and record s1 in Fig. 1(b) describe the same Samsung camera model. In particular, they have two substrings that have slightly different representations. Finding this kind of pairs can help us locate records related to the same product, so that we can do a deeper analysis to remove duplicates and integrate information from different sources.

(a) Purchase of goods.
ID   Strings
r1   Samsung DV150F 16.2MP Smart Camera
r2   Canon EOS ELAN 7E(35mm) SLR Camera
r3   Canon PowerShot SX170 IS MP Digital Camera
r4   Sony W800/B 20 MP Digital Camera

(b) Supply of goods.
ID   Strings
s1   New Samsung’s DV150FX camera $85.00
s2   Best for Beginners: Canon EOS ELAN 7/7E $449.99
s3   Memory Card for Canon EOS ELAN 7/7E $15.99
s4   New Canon’s PowerShot SX170IS Camera $99.95

Fig. 1: Two product tables. (In the original figure, the similar substrings are underlined.)

Example 2. A fundamental problem in protein sequence comparison is to decide whether two sequences share common structural and functional features based on similarity observed in their amino acid sequences. The decision can help scientists detect biologically similar living organisms in a large genome bank. Fig. 2 shows two bio-sequences that are globally dissimilar but share similar substrings. In particular, the underlined substrings of s3 and s5 are similar.

ID   Strings
s1   DCCADGGCRAARDCRCDD
s2   AGACAGCRRAARCDRAGG
s3   GCAGTACTCAACGATAGC
s4   GGATTACCTAGGCATTCT
s5   ATCATGCACTACTGAACG
s6   GGATTACCTAAGCATTCT

Fig. 2: Bio-sequences. (In the original figure, the similar substrings are underlined.)

In this paper, we study the problem of local similarity join, which finds pairs from two string collections such that they share similar substrings. Following [3], we evaluate local similarity by using length and edit distance constraints, since edit distance has been widely used for evaluating string similarity. In [3], the local similarity matching problem is defined as matching any l-length pattern with k errors. It finds all locations in the text where an l-length substring of P occurs with at most k differences. In real applications, l can be set to be the minimal entity length or phrase length, and k should satisfy k < l. Different from the similarity matching problem in

Digital Object Identifier no. 10.1109/TKDE.2017.2687460
1041-4347 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for information.

[3], we focus on the local similarity join problem. To the best of our knowledge, this is the first study of the local similarity join problem under an edit distance constraint. It contains two sub-problems: checking local similar pairs and locating local similar pairs. We develop techniques to solve the problem efficiently. We make the following contributions:

• We develop a local similarity join framework for string collections in Section 3. It consists of three steps: (i) finding and pruning matching gram pairs to generate candidate gram pairs, (ii) verifying candidates by extending candidate gram pairs to substring pairs and calculating the edit distance of substring pairs, and (iii) updating the index. The framework can support self-join as well as join between two collections of strings.
• We first focus on the sub-problem of “checking local similar pairs”. Local similarity verification based on a matching gram pair is not trivial. Naively it needs to enumerate all the substring pairs. The existing extension-based method, which conducts exact extension in one string and similar extension in another string, cannot solve the problem. How can we do efficient verification without enumerating all the substring pairs? We propose techniques in Section 4 to tackle this question.
• Furthermore, there are many matching grams, some of which cannot be extended to the final results, whereas others cause duplicate verifications. Existing pruning methods do not work for the local similarity problem, thus in Section 5 we propose two new orthogonal pruning techniques to reduce candidates, and an incremental method to boost the process.
• We extend the techniques of “checking local similar pairs” to the sub-problem of “locating local similar pairs” in Section 6. We show that our techniques are general enough to cover these two sub-problems.
• We conduct extensive experiments on real and synthetic datasets to show the efficiency of the proposed techniques in Section 7.

2 PRELIMINARIES

2.1 Problem Description

Let Σ be an alphabet of characters. For a string s consisting of characters in Σ, we use |s| to denote its length, s[i] to denote the i-th character (starting from 1), and s[i, j] to denote the substring from position i to position j (a.k.a. the “gram” at position i with length j − i + 1). We use s^{-1} to denote the reversed string of s. For example, if s = abc, s^{-1} = cba. We use ed(r, s) to denote the edit distance between string r and string s, which is the minimum number of single-character edit operations (insertion, deletion, and substitution) needed to transform r into s.

Definition 1. [Local similar pair] Given an edit distance threshold τ and a length l (τ < l), two strings r and s are “locally similar under τ w.r.t. l,” denoted by ed_l(r, s) ≤ τ, iff there exists a substring α of r and a substring β of s, such that |α| ≥ l, |β| ≥ l, and ed(α, β) ≤ τ. Very often we omit “under τ w.r.t. l” if the parameters are clear from the context.

We say ⟨α, β⟩ is a longest local similar pair of r and s if min(|α|, |β|) is maximum among all the similar substring pairs of r and s.

Consider the example in Fig. 1. Let l = 10 and τ = 2. Then ⟨r1, s1⟩ is a local similar pair under τ w.r.t. l, since substring r1[1, 14] = Samsung_DV150F and substring s1[5, 20] = Samsung’s_DV150F have an edit distance of 2 and their lengths are not less than 10. ⟨r1[1, 14], s1[5, 20]⟩ is also their longest similar substring pair.

Below we define two sub-problems.

Checking local similar pairs (also called the LS-JOIN CHECKING problem or the “LJC problem”): Given two string collections R and S, let l be a length threshold and τ be an edit distance threshold. The problem of checking local similar pairs is to find all similar string pairs ⟨r, s⟩ ∈ R × S, such that ed_l(r, s) ≤ τ. We use “R ⋈^C_{l,τ} S” to represent the operation of checking local similar pairs.

Example 3. For the two string sets R and S in Fig. 1, we have R ⋈^C_{l=10,τ=2} S = {⟨r1, s1⟩, ⟨r2, s2⟩, ⟨r2, s3⟩, ⟨r3, s4⟩}. Similarly, for the string set S shown in Fig. 2, we have S ⋈^C_{l=10,τ=2} S = {⟨s3, s5⟩, ⟨s4, s6⟩}.

Locating local similar pairs (also called the LS-JOIN LOCATING problem or the “LJL problem”): For each local similar string pair ⟨r, s⟩ ∈ R × S with ed_l(r, s) ≤ τ, locate the longest local similar pair ⟨α, β⟩ of ⟨r, s⟩ under τ w.r.t. l (see footnote 1). We use “R ⋈^L_{l,τ} S” to represent the operation of locating local similar pairs.

1. Notice that there could be multiple pairs of longest similar substrings. In that case, we only find one of them.

Example 4. Consider the two string sets R and S in Fig. 1. We have R ⋈^L_{l=10,τ=2} S = {⟨r1[1, 14], s1[5, 20]⟩, ⟨r2[1, 17], s2[21, 39]⟩, ⟨r2[1, 17], s3[17, 35]⟩, ⟨r3[1, 21], s4[5, 27]⟩}. Similarly, for the string set S in Fig. 2, S ⋈^L_{l=10,τ=2} S = {⟨s3[1, 13], s5[6, 18]⟩, ⟨s4[1, 18], s6[1, 18]⟩}. The underlined substrings in Figs. 1 and 2 are the answers of the examples.

2.2 Edit Distance Matrix

Edit distance can be computed using a matrix-filling dynamic programming algorithm [23], which reserves a matrix to hold the edit distances between all the prefixes of two strings. We use ⟨i, j⟩ to denote the cell at the i-th row and j-th column in an edit distance matrix, and use D(i, j) to denote the cell value. The edit distance matrix is constructed based on the recurrence relation given in Equation 1, in which c_ij = 1 if s1[i] ≠ s2[j]; otherwise c_ij = 0. Initially, D(i, 0) = i and D(0, j) = j. For two strings s1 and s2, we have ed(s1[1, i], s2[1, j]) = D(i, j), so ed(s1, s2) = D(|s1|, |s2|).

$$D(i, j) = \min \begin{cases} D(i, j-1) + 1, \\ D(i-1, j) + 1, \\ D(i-1, j-1) + c_{ij}. \end{cases} \qquad (1)$$

Fig. 3 shows an example of edit distance computation between strings ARDCRC and ARCDRA. Their edit distance is D(6, 6) = 3.

The computation can be improved if we just want to check whether the edit distance of two strings is within a threshold τ. Since |i − j| ≤ D(i, j) ≤ τ, the computation of a cell ⟨i, j⟩ can be skipped if |i − j| > τ. For example, let τ = 2; then the grey cells in Fig. 3 do not need to be computed.
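The recurrence of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `edit_distance_matrix` is hypothetical. It fills the whole matrix for the two strings used in the Fig. 3 example and checks the value D(6, 6) = 3 reported in the text.

```python
from typing import List


def edit_distance_matrix(s1: str, s2: str) -> List[List[int]]:
    """Fill the full edit distance matrix D of Equation 1.

    D[i][j] is the edit distance between the prefixes s1[1, i] and s2[1, j]
    (1-based positions as in the paper; Python strings are 0-based).
    """
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # D(i, 0) = i
    for j in range(m + 1):
        D[0][j] = j  # D(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if s1[i - 1] == s2[j - 1] else 1  # c_ij
            D[i][j] = min(D[i][j - 1] + 1,      # insertion
                          D[i - 1][j] + 1,      # deletion
                          D[i - 1][j - 1] + c)  # match or substitution
    return D


if __name__ == "__main__":
    D = edit_distance_matrix("ARDCRC", "ARCDRA")
    for row in D:
        print(row)
    print("ed =", D[6][6])  # 3, as stated for the Fig. 3 example
```

Printing the rows gives a text rendering of the matrix that Fig. 3 depicts, with rows indexed by ARDCRC and columns by ARCDRA.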

[Fig. 4 diagram: the strings s1, ..., sn are processed one at a time against a gram inverted index; the strings before the current string sk have already been processed. Steps: 1. Generate candidates (find and prune matching gram pairs). 2. Verify candidates (extend candidate pairs and calculate edit distance). 3. Update inverted index.]

[Fig. 3 matrix: edit distance matrix of ARDCRC (rows) and ARCDRA (columns); cells that need not be computed for τ = 2 are shown in grey.]

Fig. 3: Edit distance matrix.
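The threshold restriction of Section 2.2, under which cells with |i − j| > τ are skipped, can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the paper: the name `within_threshold` is hypothetical, and values above τ are capped at τ + 1 because their exact size does not matter for the check.

```python
def within_threshold(s1: str, s2: str, tau: int) -> bool:
    """Return True iff ed(s1, s2) <= tau, filling only cells with |i - j| <= tau."""
    n, m = len(s1), len(s2)
    if abs(n - m) > tau:
        return False  # the final cell <n, m> lies outside the band
    inf = tau + 1  # stands for "greater than tau"
    # Row 0: D(0, j) = j inside the band, "greater than tau" outside.
    prev = [j if j <= tau else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        if i <= tau:
            cur[0] = i  # D(i, 0) = i is inside the band
        lo, hi = max(1, i - tau), min(m, i + tau)
        for j in range(lo, hi + 1):
            c = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + c, inf)
        prev = cur
    return prev[m] <= tau


if __name__ == "__main__":
    print(within_threshold("ARDCRC", "ARCDRA", 2))  # False: the distance is 3
    print(within_threshold("ARDCRC", "ARCDRA", 3))  # True
```

Only O(τ) cells per row are touched, so the check costs O(τ × min(|s1|, |s2|)) time instead of O(|s1| × |s2|).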

We say two cells ⟨i1, j1⟩ and ⟨i2, j2⟩ are in the same diagonal if i1 − j1 = i2 − j2. It is straightforward to verify a property of the edit distance matrix: D(i, j) ≥ D(i − 1, j − 1), which is called the edit distance diagonal property.

3 A FRAMEWORK FOR CHECKING LOCAL SIMILAR PAIRS

In this section, we study how to solve the LJC problem, i.e., checking local similar pairs.

Naively we can solve the LJC problem using a global similarity join method (e.g. Pass-Join [17]) as follows. It first builds an index based on string set R, then performs similarity search on string set S. For each string r ∈ R, it converts r into a substring collection C_r, which contains all the substrings of r with a length of at least l. Then it inserts all the substrings into the index. For each string s ∈ S, it makes the same conversion on s to get C_s, and does a similarity search for each substring in C_s with edit distance threshold τ. If a substring in C_r is found to be similar to a substring in C_s, it takes the string pair ⟨r, s⟩ as a result.

This naive approach generates (|s| − l + 1) × (|s| − l + 2) / 2 substrings for a string s, and Σ_{s∈S} (|s| − l + 1) × (|s| − l + 2) / 2 substrings for the entire string set S. Clearly this approach is inefficient.

Now we present our framework LS-JOIN based on the assumption that if two strings are locally similar, they must share at least one common q-gram. In Theorem 1 we show that this assumption always holds when the gram length q satisfies a special condition, and we guarantee that our approach will not miss any local matching results.

Fig. 4 illustrates the process in the LS-JOIN framework. The whole computation is an iterative process, which consists of three steps: (i) generate candidates, (ii) verify candidates, and (iii) update the inverted index. Without loss of generality, we first focus on the self-join case, i.e., R = S. Algorithm 1 shows the process. The index I is initially empty. We start from the first string s1 in the string collection. For each q-gram u in string s1, we find its matching grams in the inverted index I. Since I is empty, no matching grams can be found. So we directly update the index I as follows. For each gram u at position p_u in s1, if it is not an index key in I, we build an inverted list I_u for u. We then append ⟨s1, p_u⟩ to the list I_u. Then we process the second string s2. For each gram u in s2, we find its matching grams in the inverted index I. Let v = ⟨s1, p_v⟩ be a matching gram of u. We verify the local similarity of ⟨s1, s2⟩ using the Verify function. If the verification passes, the algorithm inserts ⟨s1, s2⟩ into the result set A. After processing all the grams of the current string s2, we update the index I using all grams in s2. We keep processing all strings in the collection as above.

Fig. 4: Process of LS-JOIN.

ALGORITHM 1. LS-JOIN(S, l, τ)
Input: S: A string set
       l: A given length threshold
       τ: A given similarity threshold
Output: A = {⟨s_i, s_j⟩ ∈ S × S | ed_l(s_i, s_j) ≤ τ}
1 I ← ∅; // initialize index I to be empty
2 foreach s_i ∈ S do
3     foreach u ∈ grams(s_i, q) do
4         foreach ⟨s_j, p_v⟩ ∈ I_u do
5             if ⟨s_i, s_j⟩ ∉ A and Verify(s_i, s_j, u, v) then
6                 A.add(⟨s_i, s_j⟩);
7     foreach u ∈ grams(s_i, q) do
8         I_u.add(⟨s_i, p_u⟩);
9 return A;

For the join between two different sets R and S, we first insert the grams of each string r ∈ R into index I, then search each string s ∈ S using the index to find local similar pairs.

4 LOCAL SIMILARITY VERIFICATION

In this section, we show how to do efficient local similarity verification using a matching gram pair ⟨u, v⟩. In Section 4.1, we introduce a basic method. Then in Section 4.2 we propose several optimization techniques to improve the process.

4.1 Verification on a Matching Gram Pair

Consider two strings s_i and s_j. Suppose there is a q-gram u = s_i[p_u, p_u + q − 1] matching another q-gram v = s_j[p_v, p_v + q − 1]. Naively we need to extend u to a substring α = s_i[a, b] (1 ≤ a ≤ p_u, p_u + q − 1 ≤ b ≤ |s_i|) and v to a substring β = s_j[c, d] (1 ≤ c ≤ p_v, p_v + q − 1 ≤ d ≤ |s_j|). Then we check if |α| ≥ l, |β| ≥ l, and ed(α, β) ≤ τ. Fig. 5 shows an example, where u = v = RA. We first extend it to a substring pair ⟨GCRAAR, GCRRAAR⟩. Then we check if they satisfy the local similarity constraints.

[Fig. 5 diagram: s_i: D C C A D G G C R A A R D C R C D D and s_j: A G A C A G C R R A A R C D R A G G, with the matching grams u and v marked; the matching gram pair separates a substring pair on the left from a substring pair on the right.]

Fig. 5: Extension from a matching gram pair.

The naive method needs to enumerate all the substring pairs around a matching gram pair to do verification, which

can be improved based on the following observation: the matching gram pair ⟨u, v⟩ separates ⟨α, β⟩ into two parts: a substring pair ⟨α_l, β_l⟩ on the left (if any) and a substring pair ⟨α_r, β_r⟩ on the right (if any). A basic verification process is to enumerate all possible substrings on the left and right to check if any of the combinations satisfies |α_l| + |α_r| ≥ l − q and |β_l| + |β_r| ≥ l − q, as well as ed(α_l, β_l) + ed(α_r, β_r) ≤ τ. The correctness and completeness of the basic verification method are formalized in Theorem 1.

Theorem 1. When 1 ≤ q ≤ l/(τ + 1), the basic verification method can find all local similar string pairs correctly and completely.

Proof. Please refer to Appendix A.

We can easily compute the edit distance for all substring pairs on the left and right by building edit distance forward and backward matrices as shown in Fig. 6(a). The formal definition of the edit distance forward and backward matrices is given in Definition 2.

Definition 2 (Edit Distance Forward and Backward Matrices). Given two strings s1 and s2, the edit distance forward matrix is the edit distance matrix of s1 and s2, while the edit distance backward matrix is the edit distance matrix of s1^{-1} and s2^{-1}.

We utilize two strings α_r = s_i[p_u + q, |s_i|] and β_r = s_j[p_v + q, |s_j|] to build an edit distance forward matrix D_F (see Fig. 6(c)). The cell D_F(i2, j2) represents the edit distance of the substring pair ⟨s_i[p_u + q, p_u + q + i2 − 1], s_j[p_v + q, p_v + q + j2 − 1]⟩. Similarly we construct a reverse string α_l^{-1} for α_l = s_i[1, p_u − 1] and a reverse string β_l^{-1} for β_l = s_j[1, p_v − 1]. We then build an edit distance backward matrix D_B (see Fig. 6(b)). The cell D_B(i1, j1) represents the edit distance of the substring pair ⟨s_i[p_u − i1, p_u − 1], s_j[p_v − j1, p_v − 1]⟩. We verify the two strings’ local similarity by checking if a cell ⟨i1, j1⟩ ∈ D_B and a cell ⟨i2, j2⟩ ∈ D_F satisfy i1 + i2 ≥ l − q, j1 + j2 ≥ l − q and D_B(i1, j1) + D_F(i2, j2) ≤ τ.

In this way, the number of combinations of cell pairs equals (p_u + 1) × (p_v + 1) × (|s_i| − p_u − q + 1) × (|s_j| − p_v − q + 1). In the worst case where the matching gram pair is located in the middle of the two strings, it requires ((|s_i| − q)/2 + 1)^2 × ((|s_j| − q)/2 + 1)^2 combinations, which is O(|s_i|^2 × |s_j|^2).

4.2 Optimization Techniques

Now we show three techniques to optimize the verification process.

4.2.1 A Length-based Approach

First we show that the verifications can be restricted to a small region based on the length threshold, denoted by l constraint. Notice that the l constraint is different from the length filtering used in [14], [35].

Lemma 1. Given a length l and an edit distance threshold τ, if two strings s_i and s_j satisfy ed_l(s_i, s_j) ≤ τ, there must exist a substring α of string s_i and a substring β of string s_j, such that ed(α, β) ≤ τ and one of the following conditions holds:
(i) |α| = l and l ≤ |β| ≤ l + τ;
(ii) |β| = l and l ≤ |α| ≤ l + τ.

Proof. Please refer to Appendix A.

According to Lemma 1, to verify if two strings are locally similar to each other, a conservative length to get the result is l + τ. Since we have found a matching gram of length q, the extensions of matching grams can be restricted to within e = l + τ − q characters to the left and the right. The length-based approach requires O(e^4) combinations of ⟨α_l, β_l⟩ and ⟨α_r, β_r⟩.

For example, consider the same example in Fig. 5, and let l = 6, τ = 2 and q = 2, thus e = 6. In Fig. 6(a), none of the cells in the grey area need to be computed. There are 49 cells in each of the backward and forward matrices. So the number of cell pairs is reduced from 6561 to 2401.

4.2.2 A Threshold-based Approach

The edit distance threshold is widely used to reduce verification cost [7], [35]. Next we show that it can be used to reduce the number of cells. Since we are only interested in pairs with edit distance not greater than τ, any pair with a value greater than τ will not be considered. Consider the same example in Fig. 6. In the backward matrix D_B and forward matrix D_F, all the cells in grey color or whose values are marked in grey color do not need to be computed. There are 14 cells left in the backward matrix and 24 cells left in the forward matrix, so the number of cell pairs is reduced from 2401 to 336.

We can further terminate the process early as follows.

• (Early termination 1, ET1 for short) We stop the process if a cell ⟨i, j⟩ satisfies i ≥ l − q, j ≥ l − q and D(i, j) ≤ τ, since it is enough to claim that the two strings are locally similar.
• (Early termination 2, ET2 for short) We stop the process if all the cells in a row are greater than τ, since all subsequent cells must be greater than τ according to the edit distance diagonal property.

Consider the example in Fig. 7. The process in the forward matrix can be stopped when we reach the cell ⟨4, 4⟩ (ET1), and the process in the backward matrix can be stopped when all the cells in the 4-th row are greater than 2 (ET2). Therefore, the number of cell pairs is reduced from 336 to 280.

In this way, the number of cells left in a matrix is O(l × τ), so the complexity of the threshold-based approach is O(l^2 × τ^2).

4.2.3 A Skyline-Based Approach

Ideally we hope to further reduce the number of cell pairs. The challenge is whether we can find a smallest subset of cells for combination without losing any local join results. Next, we propose techniques to address this challenge.

We first use the example shown in Fig. 7 to illustrate the idea of our approach. Consider the two cells ⟨4, 3⟩ and ⟨4, 2⟩ in the forward matrix D_F. The values D_F(4, 3) = 1 and D_F(4, 2) = 2 show that the substring pair ⟨ARDC, ARC⟩ is “better” than ⟨ARDC, AR⟩, since the former provides a longer substring pair with smaller edit distance. We say that the cell ⟨4, 3⟩ dominates the cell ⟨4, 2⟩. The formal definition of cell domination is given in Definition 3.

Definition 3 (CELL DOMINATION). Given two cells ⟨i1, j1⟩ and ⟨i2, j2⟩ in an edit distance matrix D. The cell ⟨i1, j1⟩

[Fig. 6 panels: (a) Whole matrix. (b) Backward matrix D_B. (c) Forward matrix D_F.]

Fig. 6: Calculating forward and backward matrices.

dominates ⟨i2, j2⟩ (denoted by ⟨i1, j1⟩ ≺ ⟨i2, j2⟩), if i1 ≥ i2, j1 ≥ j2, and D(i1, j1) ≤ D(i2, j2).

We can see that verifying those pairs with dominating cells is enough to get all join results. The reason is that verifying a longer pair with smaller edit distance can always cover the results of their substring pairs with larger edit distances.

Lemma 2. In an edit distance matrix, a cell can be discarded without affecting the result of local similarity verification if it is dominated by another cell.

According to Lemma 2, to verify the local similarity, we only need to consider those cells that are not dominated by other cells. The goal is to compute the skyline (i.e. the minimum cell dominating set) of cells, denoted as SKY(D), which includes all the cells in an edit distance matrix D that are not dominated by any other cell, namely:

$$SKY(D) = \{\, c \in D \mid \nexists\, c' \in D \text{ such that } c' \prec c \,\} \qquad (2)$$

Example 5. Consider the backward matrix in Fig. 7. All the cells except those in the red box are dominated by other cells, and thus can be discarded. The arrowed lines represent the dominating relationships. The skyline of cells in the backward matrix is SKY(D_B) = {⟨0, 0⟩, ⟨2, 3⟩, ⟨3, 4⟩}. Similarly the skyline of cells in the forward matrix is SKY(D_F) = {⟨2, 2⟩, ⟨3, 4⟩, ⟨3, 5⟩, ⟨4, 3⟩, ⟨4, 4⟩}.

[Fig. 7 panels: Backward matrix D_B; Forward matrix D_F. Combinations: ⟨D_B(0, 0), D_F(3, 5)⟩, ⟨D_B(0, 0), D_F(4, 4)⟩, ⟨D_B(2, 3), D_F(3, 4)⟩, ⟨D_B(2, 3), D_F(4, 3)⟩, ⟨D_B(3, 4), D_F(2, 2)⟩.]

Fig. 7: Reducing cell combinations.

Theorem 2. Given an edit distance threshold τ, the skyline SKY(D) contains at most (τ + 1)(τ + 2)/2 cells.

Proof. Please refer to Appendix A.

Next we show that not all the cells in SKY(D_B) and SKY(D_F) need to be combined. Consider the three cells in SKY(D_B) and the five cells in SKY(D_F) in Example 5. We will get 15 combinations of cell pairs. However, this number can be further reduced. We only need to consider a pair of cells ⟨i1, j1⟩ ∈ SKY(D_B) and ⟨i2, j2⟩ ∈ SKY(D_F) where D_B(i1, j1) + D_F(i2, j2) = τ. The reason is that for two strings s_i and s_j, if ed_l(s_i, s_j) ≤ τ, there must be a substring pair ⟨α, β⟩ of ⟨s_i, s_j⟩ which satisfies |α| ≥ l, |β| ≥ l and ed(α, β) ≤ τ. Then we can extend α or β to find a new substring pair ⟨α′, β′⟩ which satisfies |α′| ≥ l, |β′| ≥ l and ed(α′, β′) = τ. Therefore, we only care about a cell pair when their value summation is τ. Notice that if the extensions reach the boundaries of the strings, we may not be able to find such a cell pair, since every cell value summation could be less than τ. In this case, the two strings are locally similar if they satisfy |s_i| ≥ l and |s_j| ≥ l.

In order to do this efficiently, we group skyline cells with the same value into a group and rank the groups in ascending order. Then we combine cells in two groups only when their edit distance summation is τ. Consider the same example in Fig. 7. In SKY(D_B), the cell values in groups {⟨0, 0⟩}, {⟨2, 3⟩}, and {⟨3, 4⟩} are 0, 1, and 2, respectively. Similarly, in SKY(D_F) the cell values in groups {⟨2, 2⟩}, {⟨3, 4⟩, ⟨4, 3⟩}, and {⟨3, 5⟩, ⟨4, 4⟩} are 0, 1, and 2, respectively. We do not need to combine cell ⟨3, 4⟩ in the third group in SKY(D_B) with cells in the third group in SKY(D_F), since their edit distance summation is larger than 2. And cell ⟨2, 3⟩ in the second group in SKY(D_B) does not need to be combined with cell ⟨2, 2⟩ in the first group in SKY(D_F), since their edit distance summation is smaller than 2. In this way, the number of combinations of cell pairs is reduced from 15 to 5 as depicted in Fig. 7. The time complexity of cell combination is O(τ^3), and the total time complexity of the skyline-based approach is O(l × τ + τ^3).

5 PRUNING MATCHING GRAM PAIRS

In this section, we show that not every matching gram pair needs to be verified using the technique in Section 4. Some matching gram pairs either cause duplicate verifications or generate a substring pair that cannot pass the verification. However, no existing pruning methods can be directly used in the LJC problem. To solve the problem, we propose two new orthogonal pruning techniques to reduce candidates and an incremental method to boost the process.

5.1 Discarding Consecutive Matching Gram Pairs

First we show that not all the matching gram pairs need to be verified. It is based on the observation that the verifications on several gram pairs always produce duplicate results. We call those gram pairs consecutive matching gram pairs.
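To make the duplication concrete, the sketch below lists the matching q-gram pairs of two strings and keeps only those that are not immediately preceded, one position to the left in both strings, by another matching gram pair. It is an illustration under stated assumptions rather than the authors' code: the helper names `matching_gram_pairs` and `keep_run_heads` are hypothetical, positions are 1-based as in the paper, and the two strings are s1 and s2 from Fig. 2.

```python
from collections import defaultdict
from typing import List, Tuple


def matching_gram_pairs(si: str, sj: str, q: int) -> List[Tuple[int, int]]:
    """Return all 1-based (pu, pv) with si[pu, pu+q-1] == sj[pv, pv+q-1]."""
    index = defaultdict(list)  # q-gram of sj -> list of start positions
    for pv in range(1, len(sj) - q + 2):
        index[sj[pv - 1:pv - 1 + q]].append(pv)
    pairs = []
    for pu in range(1, len(si) - q + 2):
        for pv in index.get(si[pu - 1:pu - 1 + q], []):
            pairs.append((pu, pv))
    return pairs


def keep_run_heads(si: str, sj: str,
                   pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Drop a pair when the characters just left of both grams are equal.

    Equal left characters mean the gram pair starting one position earlier
    in both strings also matches, so the current pair only repeats work.
    """
    kept = []
    for pu, pv in pairs:
        if pu > 1 and pv > 1 and si[pu - 2] == sj[pv - 2]:
            continue
        kept.append((pu, pv))
    return kept


if __name__ == "__main__":
    si = "DCCADGGCRAARDCRCDD"
    sj = "AGACAGCRRAARCDRAGG"
    pairs = matching_gram_pairs(si, sj, q=2)
    heads = keep_run_heads(si, sj, pairs)
    print(len(pairs), "matching pairs,", len(heads), "kept")
    print((9, 9) in heads, (10, 10) in heads, (11, 11) in heads)
```

For these strings the pairs at (9, 9), (10, 10) and (11, 11) all match (RA, AA, AR), but only (9, 9) is kept, because the characters at position 8 differ (C versus R).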

Definition 4 (CONSECUTIVE MATCHING GRAM PAIRS). We could utilize an l-length sliding window to de- Consider two strings si and sj. Let u = si[i1,i1 + q − 1] termine which matching gram pair needs to be verified.  and u = si[i1 +1,i1 + q] be two grams of si, and Matching grams can also be found by using inverted index.  v = sj[i2,i2 + q − 1] and v = sj[i2 +1,i2 + q] be two Consider two strings si and sj, suppose that the q-grams   grams of sj.Ifu = v and u = v , the pairs of u, v and in string sj are indexed, so that we can scan the q-grams   u ,v are called consecutive matching gram pairs. within an l-length window of string si to find their matching grams. And we verify the matching gram pairs only when Given a group of consecutive matching gram pairs, only the count number is not less than LB. one of them needs to be extended for further verification. For an l-length window in string si, we utilize a min Theorem 3. If two strings si and sj have a group of consecutive heap to get a sorted matching gram list, so that counting matching gram pairs, verifying any of them will not lose any local grams in an l-length window can stop if it finds a gram similarity result compared with verifying all of them. outside the current window. Fig. 9 shows an example. Let l =5, τ =1and q =2, Proof. Please refer to Appendix A. we have LB =2. Consider the first window in string si. It contains four grams g1, g2, g3, and g4. We first get the It is easy to check if a matching gram pair gi has a con- sorted matching gram list of string sj as shown on the right. secutive matching gram pair by checking its left characters. Counting matching grams in the window at position 2 in If the left characters are the same, we can safely discard sj stops when it finds that gram g2 is outside the current gi without verification since there must exist a consecutive window. So there is only one matching gram pair in this matching gram pair on the left of gi. For example, con- window. 
Since LB =2, the matching gram pair g1 from sider two strings si and sj as shown in Fig. 8. Let gram si and sj does not need to be verified. While counting length q =2. si[9, 10],sj[9, 10], si[10, 11],sj[10, 11], matching grams in the window at position 8 will find three and si[11, 12],sj[11, 12] are consecutive matching gram matching grams g2, g3, and g4. Since the gram count is pairs. In this case, we only need to verify the gram pair larger than 2, these gram pairs need to be verified. Notice si[9, 10],sj [9, 10], since only its left characters C and R are that we can utilize both Theorem 3 and Theorem 4. In this different. example, the matching grams g3 and g4 do not need to be

1 2 3 4 5 6 7 8 9 101112131415161718 verified. s i: D C C A D G G C R A A R D C R C D D

s 5.3 Incremental Local Count Filtering j: A G A C A G C R R A A R C D R A G G R A In this section, we show that local count filtering can be A A used incrementally since neighbor windows share several A R common grams. By doing so, the performance of choosing Fig. 8: Pruning consecutive matching gram pairs. matching gram pairs can be improved significantly.

5.3.1 Incremental Techniques 5.2 Local Count Filtering We first consider the case for a fixed l-length window in a string si. We slide an l-length window through all the Count-filtering principle [8] has been widely used in global positions of the matching grams in sj. During the process, similarity join. It mandates that if a string x and a string y we update the count of matching grams. We subtract one = are within edit distance τ, they must share at least LB when a matching gram leaves the current window and add (| | | |) − +1− × max x , y q q τ common q-grams. one when a new matching gram arrives in the window. In this section, we show how to introduce count filtering Reconsider the example in Fig. 9. For the window at LJC in the scenario of problem to prune those matching position 1 in si, the window at position 8 in sj contains three gram pairs that could not generate the final results. matching grams (g2, g3, and g4). When sliding the window 9 s g2 Theorem 4. If two strings si and sj satisfy edl(si,sj) ≤ τ, to position in j, gram leaves from the window and there must exist an l-length substring x in si and an l-length no new matching gram arrives, so the number of matching 2 substring y in sj, such that x and y share at least LB = l +1− grams in this window is . × ( +1)≥ 1 1 ≤ ≤ l The process can be further improved based on the obser- q τ common q-grams, in which q τ+1 . vation: every time we slide the window for one step in sj, Proof. Please refer to Appendix A. only one gram leaves from the window, which results in the number of matching grams reducing one. Let the number Local count filtering. 
Given a length threshold l, an edit of grams in a window be n and n>LB, we can skip τ q 1 ≤ q ≤ l ) distance threshold , and a gram length ( τ+1 , n − LB steps sliding, since if we slide the window within if none of an l-length substring α in si and an l-length n − LB steps, the gram count must be greater or equal to ( = +1− ×( +1)) substring β in sj share LB LB l q τ common LB. Consider the same example in Fig. 9, let LB =2. For q s s ed (s ,s ) ≤ τ -gram, then i and j cannot satisfy l i j . the window at position 1 in si, when we slide the window 8 Theorem 5. Local count filtering will not cause any false dis- to position , the number of matching grams in the window 3 10 missal. is . We can directly slide the window to position . When sliding the window from position 1 to 2 in string Proof. Please refer to Appendix A. si, we hope to avoid counting matching grams in every 7

window in $s_j$ as we did for the first window in $s_i$. In general, when the window is slid from position $p$ to $p+1$ in $s_i$, the leftmost gram $g_l$ in the window at position $p$ leaves and the rightmost gram $g_r$ in the window at position $p+1$ arrives. So we need to delete the matching grams of $g_l$ and insert the matching grams of $g_r$ into the sorted matching gram list of $s_j$. Consider the same example in Fig. 9. When we slide the window in $s_i$ from position 1 to position 2, gram $g_1$ leaves and gram $g_5$ arrives. Correspondingly, in $s_j$, we need to remove the matching grams $g_1$ at positions 2 and 13 and insert the matching grams $g_5$ at positions 3 and 17.

Let $g_r^m$ be a matching gram of $g_r$ in $s_j$. We only need to verify the matching gram pair $\langle g_r, g_r^m \rangle$ if the count number of the window at position $g_r^m - l + q$ in $s_j$ is greater than or equal to $LB$. Notice that we do not need to verify other gram pairs. The reason is as follows. Before inserting the new gram pair, if other gram pairs belong to a window containing at least $LB$ matching grams, then they have already been verified. Otherwise, those gram pairs cannot pass local count filtering without the new gram pair. In this case, verifying the new gram pair is enough. The maximal count number arises when we align the end of an $l$-length window with the end of $g_r^m$, thus we only need to check the window at position $g_r^m - l + q$ in $s_j$.

Consider the example in Fig. 9. For the first matching gram $g_5$ in $s_j$, the count number is 1, so the matching gram pair $\langle s_i[5, q+4], s_j[3, q+2] \rangle$ does not need to be verified. For the second matching gram $g_5$ in $s_j$, the count number is 2, so the matching gram pair $\langle s_i[5, q+4], s_j[17, q+16] \rangle$ needs to be verified.

[Fig. 9: Local count filtering ($l = 5$, $\tau = 1$, $q = 2$).]

5.3.2 Detailed Algorithm

Algorithm 2 shows the detailed algorithm. It visits each string $s_i \in S$. For the first $l$-length window of string $s_i$, it finds the matching grams using an inverted index $I$, and gets the sorted matching gram lists (line 5). For a string $s_j$ that contains several matching grams, the function first sets the current window aligned with the first gram (line 7), then checks the remaining grams until the next gram is out of the current window (line 9). Then it checks the gram count in the current window. If the number is not less than $LB$ (line 11), it prunes the consecutive matching gram pairs and verifies the remaining gram pairs (line 13). Notice that since we use two cursors $x$ and $y$ to denote the first gram and the last gram of the current window, the gram count can be directly computed from the cursors as $y - x + 1$. After the verification, it skips $y - x + 1 - LB$ sliding steps (line 15). Then it does incremental local count filtering on the other $l$-length windows (line 18). In the end, it updates the inverted index (line 20).

ALGORITHM 2. LS-JOIN-WINDOW(S, l, τ)
Input: S: A string set
       l: A given length threshold
       τ: A given similarity threshold
Output: A = {⟨s_i, s_j⟩ ∈ S × S | ed_l(s_i, s_j) ≤ τ}
 1  I ← ∅;
 2  LB ← l − q × (τ + 1) + 1;
 3  foreach s_i ∈ S do
 4      w ← s_i[1, l];
 5      G ← sorted_gramLists(w);
 6      foreach G_sj ∈ G do
 7          x ← y ← 1;
 8          while x ≤ |G_sj| do
 9              while G_sj[y + 1].v − G_sj[x].v ≤ l − q do
10                  y ← y + 1;
11              if y − x + 1 ≥ LB then
12                  for k ← x to y do
13                      if s_i[G_sj[k].u − 1] ≠ s_j[G_sj[k].v − 1] And Verify(G_sj[k]) then
14                          A.add(⟨s_i, s_j⟩);
15                  x ← y + 1 − LB;
16              x ← x + 1;
17      for x ← 2 to |s_i| − l + 1 do
18          IncrementalFiltering(x);
19      foreach u ∈ grams(s_i, q) do
20          I_u.add(⟨s_i, p_u⟩);
21  return A;

A special case: When $LB = 1$, the local count filtering becomes unnecessary. In this case, one matching gram pair is enough to pass the local count filtering, and all the matching gram pairs except the consecutive matching gram pairs need to be verified. We can modify the process as follows. We start the process by finding a matching gram pair of length $q$. If their left characters are different, we directly do the local similarity verification. In this way, we save the costs of both maintaining matching gram lists and counting grams in an $l$-length sliding window.

Notice that $LB = 1$ is a sufficient condition of $q = \lfloor \frac{l}{\tau+1} \rfloor$ but not a necessary condition. That is, $LB = 1$ leads to $q = \lfloor \frac{l}{\tau+1} \rfloor$, whereas $q = \lfloor \frac{l}{\tau+1} \rfloor$ cannot lead to $LB = 1$.

Complexity: We first analyze the space complexity. The inverted index includes grams and their inverted lists. A string $s$ generates $|s| - q + 1$ grams. Suppose we can encode a string in constant space; then the space complexity of the grams is $O(\sum_{s \in S} (|s| - q + 1))$. As each gram has a string id and position, the space complexity of the inverted lists is also $O(\sum_{s \in S} (|s| - q + 1))$. And for an $l$-length window

at position $i$ in string $s$, the space complexity of the sorted matching gram lists is $O(\sum_{x \in grams(s[i, i+l-1], q)} w(x))$, in which $w(x)$ is the list length of gram $x$. So the maximum space cost of the sorted matching gram lists is

$$O\Big(\max_{s \in S,\ i \in [1, |s|-l+1]} \sum_{x \in grams(s[i, i+l-1], q)} w(x)\Big).$$

Next, we analyze the time complexity. Recall that for each string $s \in S$, we use three steps to get all its local similar pairs. The first step is to find candidates, which includes two parts. The first part generates matching gram pairs and its time complexity is $O(\sum_{x \in grams(s,q)} w(x))$, since for each gram $x$ decomposed from $s$, we need to scan the inverted list of $x$ to generate all matching gram pairs. The second part generates candidates and its time complexity can be estimated as $O(\sum_{x \in grams(s,q)} w(x) \times \log(L))$, where $L$ represents the average matching gram number in the processed strings for the grams in an $l$-length window of string $s$, and $\log(L)$ is the cost to keep these matching grams in ascending order. The time complexity for the first step is

$$O\Big(\sum_{x \in grams(s,q)} w(x) \times \log(L)\Big).$$

The second step verifies candidates. As we know, a matching gram pair $\langle u, v \rangle$ is a candidate only if there exists a pair of substrings extended from $\langle u, v \rangle$ that contain enough common grams (see Theorem 4). Let $Pr(X \ge LB)$ be the probability of a matching gram pair being a candidate. Then the candidate number can be estimated as $\sum_{x \in grams(s,q)} w(x) \times Pr(X \ge LB)$. According to the analysis of verification in Section 4, the time complexity of verification for a gram pair is $O(l \times \tau + \tau^3)$, so the time complexity of the second step is

$$O\Big(\sum_{x \in grams(s,q)} w(x) \times Pr(X \ge LB) \times (l \times \tau + \tau^3)\Big).$$

The third step updates the inverted index. Since the update cost for a gram is $O(1)$ and there are $|s| - q + 1$ grams for string $s$, the time complexity of the third step is $O(|s| - q + 1)$.

Therefore the total time complexity for the string set $S$ is

$$O\Big(\sum_{s \in S} \sum_{x \in grams(s,q)} w(x) \times \big(Pr(X \ge LB) \times (l \times \tau + \tau^3) + \log(L)\big)\Big).$$

Analysis of gram length q. Below we show that the gram length $q$ can affect the performance. Given a string $s \in S$, for a $q$-gram $x \in grams(s, q)$, we can use each gram $y$ in the inverted list of $x$ to construct a matching gram pair $\langle x, y \rangle$. So a larger $q$ requires less time to generate all matching gram pairs since it is associated with a shorter inverted list of $x$ (i.e., a smaller $w(x)$). The number of candidates for gram $x$ equals $w(x) \times Pr(X \ge LB)$. The lower bound of local count filtering is $LB = l + 1 - q \times (\tau + 1)$. Clearly, a larger $q$ leads to a smaller $LB$, and hence a larger $Pr(X \ge LB)$. So the time for verifying candidates can vary, since the number of candidates $w(x) \times Pr(X \ge LB)$ does not change monotonically with $q$.

We report the effect of gram length $q$ for different data sets in Section 7.3.

6 LOCATING LOCAL SIMILAR PAIRS

In this section, we focus on the second problem, locating local similar pairs (LJL problem). We show how to extend the LS-Join approach to efficiently compute the longest similar substring pairs and construct matching gram pairs for the LJL problem.

6.1 Locating Technique

To find the longest similar substring pairs, we slightly modify the method proposed in Section 4 as follows. When computing the skyline of cells, the extension length of a matching gram pair is not restricted by the length $e$, in which $e = l + \tau - q$. Thus all the techniques proposed in Section 4 can be used except the $l$-constraint and early termination 1 (ET1).

When combining the left and right cells, we can get several similar matching substrings. For two strings $s_i$ and $s_j$, we record the first local similar substring pair $\langle \alpha, \beta \rangle$, and store the current longest matching length as $\min(|\alpha|, |\beta|)$. Then we update it if a new matching substring pair is longer than the current one.

Example 6. Consider the example in Fig. 10. Let $l = 6$ and $\tau = 2$. In the backward matrix, the skyline of cells is $SKY(D_B) = \{\langle 0,0 \rangle, \langle 2,3 \rangle, \langle 3,4 \rangle\}$, and in the forward matrix, the skyline of cells is $SKY(D_F) = \{\langle 2,2 \rangle, \langle 3,4 \rangle, \langle 4,3 \rangle, \langle 5,5 \rangle\}$. Among them, the longest matching pair is the combination of cells $\langle 2,3 \rangle$ and $\langle 4,3 \rangle$, where $\min(|\alpha|, |\beta|) = 8$. The longest similar substrings are $\alpha$ = GCRAARDC and $\beta$ = GCRRAARC.

[Fig. 10: Computing longest matching pair ($s_i$ = DCCADGGCRAARDCRCDD, $s_j$ = AGACAGCRRAARCDRAGG).]

The time complexity of computing the skyline of cells in matrix $D$ is $O(l' \times \tau)$, in which $l'$ is the longest extension length in one direction. It depends on the string length and on when the computing process can be terminated early using ET2. In the worst case, we have $l' = \min(|s_i|, |s_j|) - q$. The time complexity of combining the skylines of cells in the backward and forward matrices is still $O(\tau^3)$. So the time complexity of computing the longest matching pair for a matching gram pair is $O(l' \times \tau + \tau^3)$.

6.2 Local Count Filtering With Incremental Window Length

To find matching gram pairs for the LJL problem, we can naively still use the techniques proposed in Section 5. This can be further improved based on the following observation. For two strings $s_i$ and $s_j$, if we have already found their

substrings $\alpha$ and $\beta$ satisfying $ed_h(\alpha, \beta) \le \tau$, in which $h \ge l$, we can improve the filter condition by discarding the matching gram pairs which cannot extend to similar substring pairs with length longer than $h$. The problem is transformed to finding matching gram pairs for $ed_{h+1}(s_i, s_j) \le \tau$. Therefore the lower bound on the number of matching grams in Section 5.2 is changed to $LB' = h + 2 - q \times (\tau + 1)$ for an $(h+1)$-length substring pair.

Notice that, when processing string $s_i$ using the gram inverted index $I$, we could find local similar pairs of different lengths. For example, we need to keep the window length $l$ in $s_i$ for one string $s_j$, whereas we increase the window length in $s_i$ from $l$ to $l + 1$ for another string $s_k$, since we aim to locate the longest local similar pair for each pair of strings.

7 EXPERIMENTS

We implemented the proposed techniques and conducted an extensive experimental study using three real data sets (footnote 2):

(1) IMDB, which contains 1,568,893 film names taken from the IMDB website (footnote 3). The average length is 26, and the alphabet size is 161.
(2) DBLP, which contains 1,158,648 bibliography records taken from the DBLP website (footnote 4). The average length is 74, and the alphabet size is 93.
(3) Protein, which contains 508,038 protein sequences obtained from the UniProt database (footnote 5). The average length is 347, and the alphabet size is 25.

2. The data sets are also used in [35].
3. http://www.imdb.com/
4. http://www.informatik.uni-trier.de/~ley/db
5. http://www.uniprot.org/

To better study the scalability of the proposed method, we also performed experiments using synthetic string collections. Table 1 shows the parameter ranges.

TABLE 1: Parameters in synthetic data set.

| Parameters | Range |
| --- | --- |
| Number of strings (×1,000,000) | 1, 2, 3, 4, 5 |
| Average string length | 30, 60, 120, 240, 480 |
| Alphabet size | 4, 10, 26, 36, 100 |

All the algorithms were implemented in C++ and compiled with G++ 4.7 with a "-O3" flag. The experiments were run on a machine with a 2.93 GHz Intel Core CPU and 8 GB main memory, running Ubuntu.

7.1 Comparison with Global Similarity Join

Compared to global similarity join, local similarity join can identify more local similar results. For example, the similar IMDB records "Mission: Impossible II" and "Mission: Impossible - Operation Surma" can be detected using local similarity join but are not found using global similarity join. Similarly, the DBLP records "Data structures and algorithms for approximate string matching" and "Filter algorithms for approximate string matching" can be detected using local similarity join but not global similarity join. The same situation occurred on the Protein data set. Table 2 shows that, compared with local similarity join, a large proportion of similar substrings cannot be identified by global similarity join methods.

TABLE 2: Comparison of locating similar strings.

| Data sets | (#global join results / #local join results) × 100% |
| --- | --- |
| IMDB ($l = 15$, $\tau = 2$) | 0.4% |
| DBLP ($l = 50$, $\tau = 5$) | 19% |
| Protein ($l = 100$, $\tau = 7$) | 4.4% |

7.2 Overall Performance

We evaluated the overall performance of the proposed methods. Two methods were evaluated. (1) The Check method, which uses the skyline-based verification method in Section 4 and the incremental local count filtering method in Section 5.3 to find local similar string pairs. (2) The Locate method, which uses local count filtering with incremental window length and the locating method in Section 6 to find the longest local matching pairs. The whole time cost was spent on (i) constructing matching gram pairs, (ii) verifying matching gram pairs, and (iii) updating the inverted index. We show these time costs separately and use subscripts "C", "V", and "U" to denote them, respectively. Notice that the cost of updating the inverted index is very small compared with the other two steps.

Fig. 11 shows the performance using different edit distance thresholds. For a fixed $l$, the time increased when we increased the threshold $\tau$, since more candidates were generated for a larger edit distance, which required more time for verification.

Fig. 12 shows the performance using different length thresholds. We fixed the threshold $\tau$ and varied the length threshold $l$. When we increased the threshold $l$, the time cost decreased, since we found fewer candidates for a larger threshold $l$. The Check method performed better than the Locate method, since the latter needs more time to locate the longest matching substrings. Take the Protein data set as an example: when $l = 120$, it took 522.8 seconds to join 508,038 strings, and 576.1 seconds to compute their longest substring pairs.

7.3 Gram Pair Construction Methods

We evaluated three gram pair construction methods: (1) the basic pruning method, which generates every matching gram pair as discussed in Section 3, denoted by GenGram-Basic; (2) the local count filtering method, which checks the gram count in an $l$-length substring pair as discussed in Section 5.2, denoted by GenGram-LocalCount; and (3) the GenGram-LocalCount method combined with the consecutive matching grams pruning technique described in Section 5.1, denoted by GenGram-NoRedundant. We used the GenGram-Basic method as the baseline to show the pruning power of GenGram-LocalCount and GenGram-NoRedundant. The setting was the same as in Fig. 11. Notice that the gram length $q$ must satisfy $q \le \lfloor \frac{l}{\tau+1} \rfloor$ according to the analysis in Section 5.

Pruning power: We use $1 - \frac{\#gram}{\#gram_b}$ to represent pruning power, in which $\#gram$ is the number of gram

[Fig. 11: Performance with different edit distance thresholds. Panels: (a) IMDB ($l = 15$), (b) DBLP ($l = 50$), (c) Protein ($l = 100$). x-axis: edit distance threshold $\tau$; y-axis: running time; series: Check_C, Check_V, Check_U, Locate_C, Locate_V, Locate_U.]

[Fig. 12: Performance with different length thresholds. Panels: (a) IMDB ($\tau = 2$), (b) DBLP ($\tau = 5$), (c) Protein ($\tau = 7$). x-axis: length threshold $l$; y-axis: running time; series: Check_C, Check_V, Check_U, Locate_C, Locate_V, Locate_U.]

[Fig. 13: Pruning power of gram pair construction methods. Panels: (a) IMDB ($l = 15$), (b) DBLP ($l = 50$), (c) Protein ($l = 100$). x-axis: edit distance threshold $\tau$; y-axis: pruning power; series as labeled in the legend: GenGram-AllPairs, GenGram-NoRedundent.]

[Fig. 14: Performance of incremental gram pair construction method. Panels: (a) IMDB ($l = 15$), (b) DBLP ($l = 50$), (c) Protein ($l = 100$). x-axis: edit distance threshold $\tau$; y-axis: candidate generation time; series: NoIncremental, Incremental.]

[Fig. 15: Performance of different verification methods. Panels: (a) IMDB ($l = 15$), (b) DBLP ($l = 50$), (c) Protein ($l = 100$). x-axis: edit distance threshold $\tau$; y-axis: verification time (sec., logarithmic scale); series: Verify-Basic, Verify-Length, Verify-Threshold, Verify-Dominate.]

pairs constructed by GenGram-LocalCount or GenGram-NoRedundant, and $\#gram_b$ is the number of gram pairs constructed by GenGram-Basic. Fig. 13 shows the pruning power of the two methods. We found that GenGram-Basic generated many unnecessary gram pairs in all the settings. Take the IMDB data as an example. When $\tau = 1$, 99.91% of the gram pairs were pruned using GenGram-LocalCount, while 99.97% of the gram pairs were pruned using GenGram-NoRedundant. When $\tau = 4$, 93.65% of the gram pairs were pruned using GenGram-LocalCount, and 96.61% of the gram pairs were pruned using GenGram-NoRedundant. With the increase of threshold $\tau$, the pruning power of GenGram-LocalCount decreased. The reason is that when $\tau$ increased, the gram count bound $LB$ decreased, so the pruning power became weaker. For the Protein data set, when $\tau = 11$, only 52.31% of the gram pairs were pruned by GenGram-LocalCount, while 88.29% of the gram pairs were pruned by GenGram-NoRedundant.

Effect of gram length q: We evaluated the pruning power of the GenGram-NoRedundant method for different gram lengths $q$. The results are shown in Table 3. As $q$ increased, the pruning power first increased, then decreased. The reason is that when $q$ increased, the inverted lists became shorter, but $LB$ also decreased. These two factors were "competing" in terms of their effect on the number of candidates. We also evaluated the total running time for different gram lengths. Table 4 shows the results. Similar to the results shown in Table 3, as $q$ increased, the time cost first decreased, then increased. The total time cost depends on both the filtering and verification processes. On the IMDB data set, the highest pruning power occurred when $q = 3$, but this setting did not take the shortest time. The result is consistent with our analysis in Section 5.3.2: although it took less time in the verification step, it took more time in the filtering step.

TABLE 3: Effect of gram length q (pruning power).

| Data sets | | | | |
| --- | --- | --- | --- | --- |
| IMDB ($l = 15$, $\tau = 2$) | $q = 2$: 84.2% | $q = 3$: 99.8% | $q = 4$: 98.7% | $q = 5$ ($LB = 1$): 79.1% |
| DBLP ($l = 50$, $\tau = 5$) | $q = 5$: 83.2% | $q = 6$: 94.6% | $q = 7$: 98.4% | $q = 8$: 80.8% |
| Protein ($l = 100$, $\tau = 7$) | $q = 7$: 86.1% | $q = 8$: 92.7% | $q = 9$: 95.5% | $q = 10$: 84.9% |

TABLE 4: Time cost for different gram lengths (total running time in seconds).

| Data sets | | | | |
| --- | --- | --- | --- | --- |
| IMDB ($l = 15$, $\tau = 2$) | $q = 2$: 18,707.8 | $q = 3$: 4,742.4 | $q = 4$: 2,896.4 | $q = 5$ ($LB = 1$): 3,036.9 |
| DBLP ($l = 50$, $\tau = 5$) | $q = 5$: 18,443.6 | $q = 6$: 12,914.9 | $q = 7$: 11,316.6 | $q = 8$: 13,126.8 |
| Protein ($l = 100$, $\tau = 7$) | $q = 7$: 1,685.8 | $q = 8$: 1,177.5 | $q = 9$: 971.5 | $q = 10$: 1,112.9 |

Performance of incremental gram pair construction method: We compared two methods: (1) the basic gram pair construction method, denoted by NoIncremental, and (2) the incremental method discussed in Section 5.3, denoted by Incremental. Fig. 14 shows that the Incremental method avoided a lot of redundant computation compared to the NoIncremental method, and thus reduced the time cost. The gap became bigger for larger length thresholds.

7.4 Evaluating Verification Methods

We evaluated four verification methods: (1) the Verify-Basic method proposed in Section 4.1 (it enumerates all cell pairs in the backward and forward matrices); (2) the Verify-Length method proposed in Section 4.2.1 (it reduces cell pairs based on the length threshold); (3) the Verify-Threshold method proposed in Section 4.2.2 (it further reduces cell pairs based on the edit distance threshold); and (4) the Verify-Skyline method proposed in Section 4.2.3 (it only uses dominating cells for verification).

Efficiency: We compared the verification time of the four methods in Fig. 15. Take the IMDB data set as an example. The performance of Verify-Basic was the worst. Verify-Length was 2 to 6 times faster than Verify-Basic. Verify-Threshold further reduced the time cost (3 to 45 times faster than Verify-Basic), while Verify-Skyline achieved the best performance (20 to 90 times faster than Verify-Basic). On the other data sets, with the increase of average string length and length threshold $l$, the gap between those methods became bigger. In particular, Verify-Basic could not finish the job in 100,000 seconds, and thus was stopped. The results are consistent with our earlier analysis. As the time complexity of Verify-Basic is $O(|s_1|^2 \times |s_2|^2)$, its performance was the worst. Verify-Length reduced it to $O(e^4)$. Verify-Threshold improved the time complexity to $O(l^2 \times \tau^2)$, while Verify-Skyline further reduced it to $O(l \times \tau + \tau^3)$.

7.5 Scalability

To better study the scalability of the proposed method, we performed experiments using synthetic data sets. We varied the sizes of the data sets, average string lengths, and alphabet sizes to evaluate the performance of our method. For each data set, we ran three local similarity join locating jobs. We set the length thresholds to 20, 50, and 100 respectively, and the edit distance threshold to 10% of the corresponding length threshold. The results are shown in Fig. 16. We first set the average string length to 120 and the alphabet size to 26. We varied the data sizes from 1 million to 5 million and tested the time cost of our method. The results are shown in Fig. 16(a). The performance of our method had a nearly linear relationship with data size. Then we set the number of strings to 1 million, and varied the average string length from 30 to 480, as shown in Fig. 16(b). It took more time on longer strings, because longer strings generated more candidates and results. Finally we varied the alphabet size from 4 to 100, as shown in Fig. 16(c). Our method took more time on data sets with smaller alphabets, because strings over a smaller alphabet have a higher probability of being locally similar.

7.6 Index Sizes

We evaluated index sizes on the three real data sets. Table 5 shows the index sizes using different gram lengths. With increasing gram length $q$, the index size decreased, which is consistent with our space analysis in Section 5.3.2. The index sizes for the Check and Locate methods are the same. We also tested the index size on synthetic data sets. The results are shown in Fig. 17. We can see that

the index size of our method had a nearly linear relationship with the number of strings and average string length. The alphabet size has little effect on the index size.

[Fig. 16: Scalability. Panels: (a) Avg Len = 120, $|\Sigma| = 26$; (b) $|S|$ = 1,000,000, $|\Sigma| = 26$; (c) $|S|$ = 1,000,000, Avg Len = 120. x-axes: number of strings (×1,000,000), average string length, alphabet size; y-axis: running time; series: ($l = 20$, $\tau = 2$), ($l = 50$, $\tau = 5$), ($l = 100$, $\tau = 10$).]

[Fig. 17: Index size. Panels: (a) Avg Len = 120, $|\Sigma| = 26$; (b) $|S|$ = 1,000,000, $|\Sigma| = 26$; (c) $|S|$ = 1,000,000, Avg Len = 120. x-axes: number of strings (×1,000,000), average string length, alphabet size; y-axis: index size (MB); series: ($l = 20$, $\tau = 2$), ($l = 50$, $\tau = 5$), ($l = 100$, $\tau = 10$).]

TABLE 5: Index size (MB).

| Data sets | | | | |
| --- | --- | --- | --- | --- |
| IMDB ($l = 15$, $\tau = 2$) | $q = 2$: 313.89 | $q = 3$: 298.13 | $q = 4$: 287.13 | $q = 5$: 277.84 |
| DBLP ($l = 50$, $\tau = 5$) | $q = 5$: 683.49 | $q = 6$: 661.37 | $q = 7$: 648.92 | $q = 8$: 632.28 |
| Protein ($l = 100$, $\tau = 7$) | $q = 7$: 1,591.83 | $q = 8$: 1,543.28 | $q = 9$: 1,517.73 | $q = 10$: 1,498.96 |

7.7 Comparison with Other Methods

We chose four existing well-known approximate string search and join methods and made the following modifications to make them support local similarity join. (1) Pass-Join [17] is the most efficient global similarity join method reported in the literature. We modified it by searching substrings with lengths between $l$ and $l + \tau$, denoted LPass-Join. (2) The approach [6] is the most efficient approximate entity extraction method reported in the literature. We modified it by enumerating all the $l$-length substrings as entities, denoted LTaste. If it finds a string $r$ containing a substring $\alpha$ with length not less than $l$, and $\alpha$ is similar to an entity $\beta$ in $s$, we add $\langle r, s \rangle$ into the result set. (3) The approach [27] is the most efficient approximate substring matching method reported in the literature. We modified it by searching all the $l$-length substrings in each string, denoted LASM. For an $l$-length substring $\alpha$ in string $r$, if it finds a similar substring $\beta$ with length not less than $l$ in string $s$, we add $\langle r, s \rangle$ into the result set. (4) We modified the $(l, \tau)$ local similarity search method [3] to support local similarity join, denoted LSSM. We modified it into a loop process which searches all the strings and checks whether the matching substrings satisfy the local similarity constraints.

The comparison results are shown in Fig. 18. Since all the other methods require a long time to find all similar substring pairs, we had to use smaller data sets for the comparison. The performance of all the other methods decreased significantly with increasing edit distance threshold. For all three data sets, our method performed much better than the other methods. The comparison results show that modifying existing approximate string methods cannot efficiently solve the local similarity join problem.

[Fig. 18: Comparison with other methods. Panels: (a) IMDB ($|S|$ = 90,000, $l = 15$), (b) DBLP ($|S|$ = 80,000, $l = 50$), (c) Protein ($|S|$ = 30,000, $l = 100$). x-axis: edit distance threshold $\tau$; y-axis: running time; series: LPass-Join, LTaste, LASM, LSSM, LS-Join.]
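The adapted baselines in Section 7.7 all come down to enumerating substrings and testing them against the edit distance threshold. As a point of reference only, the following is a minimal brute-force sketch of the local similarity predicate in Python. It is not any of the compared methods: the function names are hypothetical, and it assumes that two strings are locally similar when they contain substrings of length at least $l$ within edit distance $\tau$, with candidate lengths restricted to $[l, l + \tau]$ as in the LPass-Join modification.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def substrings(s, lo, hi):
    """Yield every substring of s whose length lies in [lo, hi]."""
    for n in range(lo, hi + 1):
        for start in range(len(s) - n + 1):
            yield s[start:start + n]


def locally_similar(r, s, l, tau):
    """Brute-force check (assumed definition): some substring pair with
    lengths in [l, l + tau] is within edit distance tau."""
    return any(edit_distance(a, b) <= tau
               for a in substrings(r, l, l + tau)
               for b in substrings(s, l, l + tau))


# Strings from Example 6, with l = 6 and tau = 2.
print(locally_similar("DCCADGGCRAARDCRCDD", "AGACAGCRRAARCDRAGG", 6, 2))
```

Every pair of strings is checked against a quadratic number of substring pairs here, which is one way to see why such enumeration-based adaptations degrade quickly as $\tau$ grows.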

7.8 Evaluating Parallelism

We also implemented a parallel local similarity join method. To execute the local similarity join operation in parallel, we follow the same approach as [10]: we first split the dataset $S$ into several small datasets, and use threads to process them as follows. Suppose $m$ threads are used. We first split the whole dataset into $m$ small datasets $S_1, \ldots, S_m$. We conduct a self join on each small dataset $S_i$. Then we use each thread to conduct a join on pairwise datasets $S_i$ and $S_j$, in which i

This similarity scoring scheme is specially defined for bio-applications based on a given input Expectation value (a.k.a. E-value) [1], which is not suitable for other applications like web documents, records in product lists, or English texts. The edit distance and length constraints defined in this paper can support local similarity join for various applications, as shown in our experiments. Under our definition, we propose an efficient approach which is guaranteed to return all results with neither false positives nor false negatives.

[Fig. 19: Performance of the parallel method. Panels: (a) IMDB ($l = 15$), (b) DBLP ($l = 50$), (c) Protein ($l = 100$). x-axis: number of threads (1, 2, 4, 8); y-axis: running time; one series per edit distance threshold $\tau$.]
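The partitioning scheme of Section 7.8 can be sketched as a task list over sub-collections. The Python sketch below is illustrative only and is not the authors' C++ implementation. The names `split`, `join_task`, and `parallel_join` are hypothetical, the round-robin split is an arbitrary choice, and, because the sentence describing the index condition is cut off in the text, the sketch assumes that cross joins are run for index pairs with $i < j$ so that each unordered pair of strings is compared exactly once. The `similar` argument stands in for any local similarity predicate.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, product


def split(strings, m):
    """Round-robin split into m sub-collections (hypothetical choice)."""
    return [strings[k::m] for k in range(m)]


def join_task(left, right, is_self_join, similar):
    """Join two sub-collections; a self join compares each pair once."""
    pairs = combinations(left, 2) if is_self_join else product(left, right)
    return [(a, b) for a, b in pairs if similar(a, b)]


def parallel_join(strings, m, similar):
    parts = split(strings, m)
    # Self joins (i == j) plus cross joins for i < j (assumed reading).
    tasks = [(i, j) for i in range(m) for j in range(i, m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        futures = [pool.submit(join_task, parts[i], parts[j], i == j, similar)
                   for i, j in tasks]
        return [pair for f in futures for pair in f.result()]


# Usage with a placeholder predicate that accepts every pair.
result = parallel_join(["a", "b", "c", "d", "e"], 2, lambda x, y: True)
print(len(result))
```

With $m$ sub-collections this produces $m$ self joins and $m(m-1)/2$ cross joins. Python threads only illustrate the task decomposition here; they would not reproduce the speedups reported for a native multi-threaded implementation.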

[10] Y. Jiang, D. Deng, J. Wang, G. Li, and J. Feng. Efficient parallel partition-based algorithms for similarity search and join with edit distance constraints. In Joint 2013 EDBT/ICDT Conferences, EDBT/ICDT '13, Genoa, Italy, March 22, 2013, Workshop Proceedings, pages 341–348, 2013.
[11] Y. Jiang, G. Li, J. Feng, and W.-S. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625–636, 2014.
[12] Y. Kim and K. Shim. Efficient top-k algorithms for approximate substring matching. In SIGMOD, pages 385–396. ACM, 2013.
[13] M. Lenzerini. Data integration: A theoretical perspective. In Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 233–246, 2002.
[14] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México, pages 257–266, 2008.
[15] C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303–314, 2007.
[16] G. Li, D. Deng, and J. Feng. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In SIGMOD Conference, pages 529–540, 2011.
[17] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. VLDB, 5, 2011.
[18] G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. VLDB J., 20(4):617–640, 2011.
[19] J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033–1044, 2011.
[20] L. A. Ribeiro and T. Härder. Efficient set similarity joins using min-prefixes. In Advances in Databases and Information Systems, pages 88–102. Springer, 2009.
[21] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. In Journal of Molecular Biology, pages 195–197, 1981.
[22] B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical report, Department of Informatics, University of Zurich, April 2007.
[23] E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1):100–118, 1985.
[24] S. Wandelt, D. Deng, S. Gerdjikov, S. Mishra, P. Mitankin, M. Patil, E. Siragusa, A. Tiskin, W. Wang, J. Wang, and U. Leser. State-of-the-art in string similarity search and join. SIGMOD Record, 43(1):64–76, 2014.
[25] J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219–1230, 2010.
[26] J. Wang, X. Yang, and B. Wang. Cache-aware parallel approximate matching and join algorithms using BWT. In Joint 2013 EDBT/ICDT Conferences, EDBT/ICDT '13, Genoa, Italy, March 22, 2013, Workshop Proceedings, pages 404–412, 2013.
[27] J. Wang, X. Yang, B. Wang, and C. Liu. An adaptive approach of approximate substring matching. In Database Systems for Advanced Applications - 21st International Conference, DASFAA 2016, Dallas, TX, USA, April 16-19, 2016, Proceedings, Part I, pages 501–516, 2016.
[28] W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916–1929, 2013.
[29] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, pages 759–770, 2009.
[30] C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane. Efficient error-tolerant query autocompletion. PVLDB, 6(6):373–384, 2013.
[31] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.
[32] X. Yang, H. Liu, and B. Wang. ALAE: accelerating local alignment with affine gap exactly in biosequence databases. PVLDB, 5(11):1507–1518, 2012.
[33] X. Yang, B. Wang, C. Li, J. Wang, and X. Xie. Efficient direct search on compressed genomic data. In ICDE, pages 961–972, 2013.
[34] X. Yang, Y. Wang, B. Wang, and W. Wang. Local filtering: Improving the performance of approximate queries on string collections. In SIGMOD 2015, pages 377–392, 2015.
[35] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915–926, 2010.

Jiaying Wang is currently a PhD candidate in the School of Computer Science and Engineering, Northeastern University, Liaoning, China. His research interests mainly include approximate string matching, query processing and optimization.

Xiaochun Yang received the PhD degree in computer science from Northeastern University, China, in 2001. She is a professor in the Department of Computer Science at Northeastern University, China. Her research interests include data quality, and data privacy. She is a member of the ACM, the IEEE Computer Society, and a senior member of the CCF.

Bin Wang received the PhD degree in computer science from Northeastern University in 2008. He is currently an associate professor in the Computer System Institute at Northeastern University. His research interests include design and analysis of algorithms, queries processing over streaming data, and distributed systems. He is a member of the CCF.

Chengfei Liu received his PhD degree in computer science from Nanjing University, China, in 1988. Currently, he is a professor in the Swinburne University of Technology, Australia. His research interests include keywords search on structured data, query processing, and refinement for advanced database applications, query processing on uncertain data and big data, and data-centric workflows. He is a member of the IEEE and the ACM.

APPENDIX A

Proof of Theorem 1. We first prove the correctness. According to the definition of edit distance, we have $ed(\alpha, \beta) \le ed(\alpha_l, \beta_l) + ed(u, v) + ed(\alpha_r, \beta_r)$. If the algorithm checks a combination that satisfies $|\alpha_l| + |\alpha_r| \ge l - q$, $|\beta_l| + |\beta_r| \ge l - q$, and $ed(\alpha_l, \beta_l) + ed(\alpha_r, \beta_r) \le \tau$, then, since $u$ and $v$ are matching $q$-grams, we have $|\alpha| \ge l$, $|\beta| \ge l$, and $ed(u, v) = 0$. Therefore, we get $ed(\alpha, \beta) \le \tau$.

We next prove the completeness by contradiction. Assume there exists a local similar substring pair $\langle \alpha, \beta \rangle$ with length $\ge l$ satisfying $ed(\alpha, \beta) \le \tau$ and they share at least one common $q$-gram ($q \le \lfloor \frac{l}{\tau+1} \rfloor$), but our verification method cannot find this answer. According to Definition 1, we know $l > \tau$, so $\lfloor \frac{l}{\tau+1} \rfloor \ge 1$. This is a reasonable assumption, since if $l \le \tau$, any two strings with length not less than $l$ are locally similar, because the allowed edit distance exceeds the length constraint. According to this assumption, we cannot find the local similar pair $\langle \alpha, \beta \rangle$ from any matching gram pair; therefore, in the alignment corresponding to $ed(\alpha, \beta)$, the longest matching substring cannot be longer than $q - 1$. In such an alignment, there are in total at most $\tau$ edit operations and $(\tau + 1)$ matching substrings, each of length at most $(q - 1)$. So the length of the string $\alpha$ (or $\beta$) satisfies $|\alpha| \le \tau + (\tau + 1)(q - 1)$. Since $|\alpha| \ge l$, we get $l \le \tau + (\tau + 1)(q - 1)$. However, based on $q \le \lfloor \frac{l}{\tau+1} \rfloor$, we get $\tau + (\tau + 1)(q - 1) = (\tau + 1)q - 1 \le l - 1 < l$, which is a contradiction.

[...] $> l$, we can construct two substrings $\alpha'$ and $\beta'$ by removing the last $|\alpha| - l$ characters from $\alpha$ and $\beta$ respectively, so $|\alpha'| = l$ and $l \le |\beta'| \le l + \tau$. According to the edit distance diagonal property, we have $ed(\alpha', \beta') \le \tau$, so condition (i) holds. Similarly, when $|\alpha| \ge |\beta|$, condition (ii) holds.

Proof of Theorem 2. We first prove that there cannot be [...] we know $\langle i, j \rangle \prec \langle i', j' \rangle$; otherwise $\langle i', j' \rangle \prec \langle i, j \rangle$. In either case, $SKY(D)$ cannot contain both of them. Next consider the case when two cells are on neighbor diagonals. Suppose there is a cell $\langle i, j \rangle$ and another cell $\langle i', j' \rangle$, which is on the same diagonal as cell $\langle i + 1, j \rangle$. Let $D(i, j) = D(i', j')$. If $i' \ge i + 1$, then $\langle i', j' \rangle \prec \langle i, j \rangle$; otherwise $\langle i, j \rangle \prec \langle i', j' \rangle$. In either case, they cannot both be in $SKY(D)$. The case where cell $\langle i', j' \rangle$ is on the same diagonal as cell $\langle i - 1, j \rangle$ can be proved in the same way.

Next we prove that the number of dominating cells with value $t$ cannot be more than $t + 1$. Since we have $|i - j| \le D(i, j) \le t$, the diagonals containing cells with value $t$ cannot be more than $2t + 1$. As there cannot exist two cells with the same value $t$ on two neighbor diagonals, there can be at most $t + 1$ dominating cells with value $t$ on the $2t + 1$ diagonals. In summary, the number of cells in $SKY(D)$ is at most $1 + 2 + \dots + (\tau + 1) = \frac{(\tau+1)(\tau+2)}{2}$.

Proof of Theorem 3. We prove it by contradiction. First consider two consecutive matching gram pairs $\langle u_1 = s_i[i_1, i_1 + q - 1], v_1 = s_j[i_2, i_2 + q - 1] \rangle$ and $\langle u_2 = s_i[i_1 + 1, i_1 + q], v_2 = s_j[i_2 + 1, i_2 + q] \rangle$. Suppose a pair of local similar substrings can only be extended based on $\langle u_2, v_2 \rangle$ but not $\langle u_1, v_1 \rangle$. This case is only possible when the similar substrings start from $\langle u_2, v_2 \rangle$. Suppose the similar substrings are $s_i[i_1 + 1, b]$ and $s_j[i_2 + 1, d]$, and we have $ed(s_i[i_1 + 1, b], s_j[i_2 + 1, d]) \le \tau$, $b - i_1 \ge l$ and $d - i_2 \ge l$. Since $u_1 = v_1$, we have $s_i[i_1] = s_j[i_2]$, thus $ed(s_i[i_1, b], s_j[i_2, d]) \le \tau$. This is a contradiction, since we assumed that a local similar substring pair can only be found based on $\langle u_2, v_2 \rangle$. Similarly, we can prove that if the local similarity can be verified based on $\langle u_1, v_1 \rangle$, it can also be verified based on $\langle u_2, v_2 \rangle$.

Next, consider a group of consecutive matching gram pairs. If the local similarity can be verified based on one of them, $\langle u_x, v_x \rangle$, then according to the proof above it can also be verified based on $\langle u_{x-1}, v_{x-1} \rangle$ and $\langle u_{x+1}, v_{x+1} \rangle$. Continuing this process, we find that verification on any one of the consecutive matching gram pairs is enough to check the local similarity.

Proof of Theorem 4. According to Case 1 in Lemma 1, [...] common $q$-grams. Similarly, we can prove the correctness for Case 2. Next we prove that $LB \ge 1$ always holds when $1 \le q \le \lfloor \frac{l}{\tau+1} \rfloor$. Since $l > \tau$ (see Definition 1), we have $\lfloor \frac{l}{\tau+1} \rfloor \ge 1$. As long as $1 \le q \le \lfloor \frac{l}{\tau+1} \rfloor$, we have $LB = l + 1 - q \times (\tau + 1) \ge 1$. Notice that $LB$ decreases as the gram length $q$ increases, thus $\lfloor \frac{l}{\tau+1} \rfloor$ is also the maximum value that can be chosen for $q$.

Proof of Theorem 5. Apparently $\tau$ [...] on-the-fly; therefore, no local similar pair is pruned by the local count filtering, since such a pair must share at least $LB$ common $q$-grams.
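As an illustrative aside, the arithmetic behind the proofs of Theorems 1 and 4 can be checked with a short Python sketch. The function names below are hypothetical and not part of the paper; the sketch only assumes the quantities stated in the proofs, namely the length threshold l, the edit distance threshold τ (`tau`), the gram length q, the bound LB = l + 1 − q(τ + 1), and the length bound τ + (τ + 1)(q − 1).

```python
def max_gram_length(l: int, tau: int) -> int:
    """Largest integer q with q <= l / (tau + 1); requires l > tau (Definition 1)."""
    if l <= tau:
        raise ValueError("the length threshold l must exceed tau")
    return l // (tau + 1)


def count_lower_bound(l: int, q: int, tau: int) -> int:
    """LB = l + 1 - q * (tau + 1), as used in the proof of Theorem 4."""
    return l + 1 - q * (tau + 1)


def longest_without_shared_gram(q: int, tau: int) -> int:
    """Length bound from the proof of Theorem 1: at most tau edit operations
    separate at most tau + 1 matching runs, each of length at most q - 1."""
    return tau + (tau + 1) * (q - 1)


if __name__ == "__main__":
    # Exhaustively check the inequalities for small parameter values.
    for l in range(2, 40):
        for tau in range(0, l):
            q_max = max_gram_length(l, tau)
            assert q_max >= 1
            for q in range(1, q_max + 1):
                # Theorem 4: LB stays positive for every admissible q.
                assert count_lower_bound(l, q, tau) >= 1
                # Theorem 1: a pair with no common q-gram is shorter than l.
                assert longest_without_shared_gram(q, tau) < l
            # One step beyond q_max, the count bound no longer helps.
            assert count_lower_bound(l, q_max + 1, tau) < 1
    print(max_gram_length(10, 2), count_lower_bound(10, 3, 2))  # prints "3 2"
```

For example, with l = 10 and τ = 2 the largest admissible gram length is 3, and two locally similar substrings are then guaranteed to share at least 2 common 3-grams under the stated bound.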