Top-K String Similarity Search with Edit-Distance Constraints

Home , Edit distance

Top-k String Similarity Search with Edit-Distance Constraints Dong Deng† Guoliang Li† Jianhua Feng† Wen-Syan Li‡ †Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China. [email protected]; {liguoliang, fengjh}@tsinghua.edu.cn ‡SAP Labs, Shanghai, China. [email protected]

Abstract— String similarity search is a fundamental operation For example, the edit distance between “schwarzenegger” in many areas, such as data cleaning, information retrieval, and and “shwarseneger” is 3. Edit distance can be used to k bioinformatics. In this paper we study the problem of top- string capture typographical errors for text documents and evaluate similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings similarities for homologous proteins or genes [18], and is with the smallest edit distances to the query string. Existing widely adopted in many real-world applications. methods usually try different edit-distance thresholds and select We can extend existing threshold-based similarity search an appropriate threshold to find top-k answers. However it is methods [10] to support our problem as follows. We increase rather expensive to select an appropriate threshold. To address the edit-distance threshold by 1 each time (initialized as 0). this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute For each threshold, we use existing methods to find those edit distance. We prune unnecessary entries in the dynamic- strings with edit distances to the query string no larger than the programming matrix and only compute those pivotal entries.We threshold. If there are smaller than k similar strings, we check extend our techniques to support top-k similarity search. We the next threshold; otherwise we compute top-k similar strings develop a range-based method by grouping the pivotal entries with this threshold. However this method is rather expensive to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly because it executes multiple similarity search operations for outperforms state-of-the-art approaches on real-world datasets. different thresholds. To address this problem, we propose a progressive framework to efficiently find top-k answers. I. INTRODUCTION A well-known method to compute the edit distance between String similarity search takes as input a set of strings two strings employs a dynamic-programming algorithm using and a query string, and outputs all the strings in the set a matrix (see Section III-A). Notice that we do not need that are similar to the query string. It is an importan- to compute all entries of the matrix. Instead we propose a t operation and has many real applications such as data progressive method which prunes unnecessary entries and only cleaning, information retrieval, and bioinformatics. For exam- computes some entries. We extend this technique to support ple, most search engines support query suggestions, which top-k similarity search (see Section III-B). To further improve can be implemented using the similarity search operation. the performance, we propose pivotal entries and only need to Consider a query log {“schwarzenegger,”“russell,”...} compute the pivotal entries in the matrix (see Section IV). We and a query “shwarseneger.” String similarity search returns develop a range-based method to group the pivotal entries to “schwarzenegger” as a suggestion. It has attracted signifi- avoid duplicated computations (see Section V). To summarize, cant attention from the database community recently [10]. we make the following contributions. Existing similarity search methods [10] require users to • We devise a progressive framework to address the prob- specify a similarity function and a similarity threshold. They lem of top-k string similarity search. find those strings with similarities to the query string within • We propose pivotal entries to find top-k answers, which the given threshold. However it is rather hard to give an can avoid many unnecessary computations by pruning appropriate threshold, as a small threshold will involve many large numbers of useless entries. dissimilar answers and a large threshold may lead to few results. To address this problem, in this paper we study the • We develop a range-based method to group the pivotal problem of top-k string similarity search, which, given a entries so as to avoid duplicated computations. collection of strings and a query string, returns the top-k most • Experimental results show that our method significantly similar strings to the query string. outperforms existing approaches. There are many similarity functions to quantify the similari- Paper Structure. We formalize the top-k string similarity ty of two strings, such as Jaccard similarity, Cosine similarity, search problem and review related works in Section II. A and edit distance. In this paper, we focus on edit distance. progressive framework is proposed in Section III. We devise The edit distance of two strings is the minimum number a pivotal-entry based method in Section IV and a range- of single-character edit operations (i.e. insertion, deletion, based method in Section V. Experiment results are provided and substitution) needed to transform one string to another. in Section VI. We conclude the paper in Section VII.

978-1-4673-4910-9/13/$31.00 © 2013 IEEE 925 ICDE Conference 2013 II. PRELIMINARIES string, and proved that two strings are similar only if their gram A. Problem Formulation sets share enough common grams. They used inverted indexes to index the grams. Given a query string, they generated its Given a collection of strings and a query string, top-k string grams, retrieved the corresponding inverted lists, and merged similarity search finds top-k strings with the highest similar- the inverted lists to find similar answers. We can extend these ities to the query string. In this paper we use edit distance methods to support the top-k similarity search problem as to quantify the similarity of two strings. The edit distance of follows. We increase the edit-distance threshold by 1 each time two strings is the minimum number of single-character edit (initialized as 0). For each threshold, we find similar strings operations (i.e. insertion, deletion, and substitution) needed to of the query using existing methods [10]. If the size of the transform one string to another. Given two strings r and s, similar string set is not smaller than k, we terminate and return we use ED(r, s) to denote their edit distance. For example, ED k strings with the minimum edit distances. However these (“seraji”, “sraijt”) = 3. Next we formulate the problem of methods have to enumerate different edit-distance thresholds top-k string similarity search with edit-distance constraints. and involve many unnecessary computations. Navarro studied k Definition 1 (Top- String Similarity Search): Given a the approximate string matching problem [14], which, given a S q k string set and a query string , top- string similarity query string and a text string, finds all substrings of the text R⊆S |R| = k search returns a string set such that and for that are similar to the query. any string r ∈Rand s ∈S−R, ED(r, q) ≤ ED(s, q). Similarity Joins: There are many studies on string similarity Example 1: Consider the string set in Table I and a joins [5], [1], [2], [4], [15], [18], [19], [16], [18], [11], [17]. query “srajit”. Top-3 string similarity search returns Given two string sets, a similarity join finds all similar string {“surajit”, “seraji”, “sarit”}. The edit distances of the pairs. Xiao et al. [19] studied top-k similarity joins by using three strings to the query are respectively 1, 2 and 2. The edit a prefix filtering based technique. Their problem is different distances of other strings to the query are not smaller than 2. from ours which finds top-k similar pairs from two sets. They dynamically tuned the thresholds to find top-k pairs. Jestes et TABLE I al. [7] extended similarity joins to support probabilistic strings. A STRING SET S AND A QUERY q =“srajit” ID s1 s2 s3 s4 s5 s6 Fuzzy Prefix Search: Ji et al. [8] and Li et al. [13], [12] String sarit seraji suijt suit surajit thrifty utilized trie structures to support fuzzy prefix search. They B. Related Works specified a threshold and computed results based on the thresholds. Wang et al. [16] proposed a trie-based framework Top-k String Similarity Search: Yang et a. [20] proposed to support similarity joins. Although they used a trie structure a gram-based method to support top-k similarity search. It to support approximate search, they focused on prefix search increased thresholds by 1 each time from 0 and tuned the gram and their methods cannot support top-k similarity search. length dynamically. However it needed to build redundant inverted indexes for different gram lengths and resulted in low III. A PROGRESSIVE FRAMEWORK efficiency. Zhang et a. [21] indexed signatures (e.g., grams) of We first propose a method to progressively compute edit B+ B+ strings using a -tree and utilized the -tree to compute distance (Section III-A) and then develop a progressive frame- k B+ top- answers. It iteratively traversed the -tree nodes, work to support top-k search (Section III-B). computed a lower bound of edit distances between the query A. Progressively Computing Edit Distance and strings under the node, and used the lower bound to update the threshold. However this method had to enumerate many We first consider the problem of computing the edit distance strings to adjust the threshold. Kahveci et a. [9] transformed between two strings. The traditional method uses a dynamic r s a set of contiguous substrings into a Minimum Bounding programming method. Given two strings and , it utilizes a D |r| +1 |s| +1 Rectangle (MBR) and used the MBR to estimate the edit- matrix with rows and columns to compute |s| s s[j] distance threshold of top-k answers. However the MBR-based their edit distance. Let denote ’s length, denote the j s s[i, j] s estimation usually estimated a large threshold and thus this -th character of , and denote ’s substring from the i j D[i][j] method resulted in low efficiency. -th character to the -th character. is the edit distance r[1,i] s[1,j] Different from existing methods, we first identify the strings between the prefix and the prefix . Obviously D[i][0] = i 0 ≤ i ≤|r| D[0][j]=j with smaller edit distances to the query string and use the edit for each and for 0 ≤ j ≤|s| D[i][j] distances of these strings as bounds to facilitate computing each . Then it iteratively computes for 1 ≤ i ≤|r| 1 ≤ j ≤|s| top-k answers. We propose a progressive method to efficiently and as follows: identify such strings and compute the tight bounds. D[i][j]=min(D[i−1][j]+1,D[i][j−1]+1,D[i−1][j−1]+δ), String Similarity Search with Thresholds: There are many (1) studies on approximate string search [3], [10], [6], [21], which, where δ =0if r[i]=s[j]; otherwise δ =1. D[|r|][|s|] is given a set of strings, a query string, a similarity function, exactly the edit distance between r and s. The time complexity and a threshold, finds all strings with similarities to the query is O(|r|×|s|) and the space complexity is O(min(|r|, |s|)). string within the threshold. Existing methods usually employed For example, Figure 1(a) shows the matrix D to compute the a gram-based method. They first generated q-grams of each edit distance of r =“seraji” and s =“srajit”.

926 may exist entries i+t, j +t such that D[i+t][j +t]=x+1 t ≥ 2 (i +1,j+1) for . We use FINDMATCH find such entries. (2) Insertion: As we can insert r[i+1] after s[j], D[i+1][j] ≤ D[i][j]+1 = x +1.Ifi +1,j ∈ H, then D[i +1][j]= x +1, thus we add i +1,j into Ex+1. Similarly, we use (i +1,j) x +1 FINDMATCH to find entries whose values are . (3) Deletion: As we can delete s[j +1] from s, D[i][j +1]≤ D[i][j]+1 = x +1 i, j +1 ∈ H D[i][j +1]= (a) Traditional method (b) Progressive method .If , then x +1 i, j +1 E Fig. 1. Computing edit distance of two strings , thus we add into x+1. Similarly, we use FINDMATCH(i, j +1) to find entries whose values are x +1. A progressive method: We propose a progressive method to We can prove that our progressive method correctly com- compute the edit distance, which only computes some entries putes Ex as formalized in Lemma 1. in the matrix, instead of all entries. We use i, j to denote E the entry of the i-th row and the j-th column of the matrix. Lemma 1: The entry set x computed by our method satisfies (1) completeness: if D[i][j]=x, i, j must be in Let Ex denote the set of entries in the matrix whose values Ex; and (2) correctness: if i, j∈Ex, D[i][j]=x. are x. Given two strings r and s,if|r|, |s| ∈ Ex,wehave ED(r, s)=x. To compute the edit distance of r and s,we Example 3: Table II illustrates how to compute the edit iteratively compute Ex from x =0. Initially we compute E0. distance between “seraji” and “srajit”. Firstly, we com- If |r|, |s| ∈ E0, the edit distance between r and s is 0 and we pute E0 = FINDMATCH(−1, −1) = {0, 0, 1, 1}. Then terminate the computation; otherwise we compute E1 based on we compute E1 based on E0. Consider 0, 0∈E0.We E0. Iteratively we compute the value x such that |r|, |s| ∈ Ex want to add 1, 1 (substitution), 1, 0 (insertion), and 0, 1 and return x as the edit distance. (deletion), into E1.As1, 1∈H = E0, we do not add it Example 2: Figure 1(b) shows the matrix D to compute into E1.For1, 1, we want to add 2, 2 (substitution), 2, 1 the edit distance between “seraji” and “srajit”. We first (insertion), and 1, 2 (deletion), into E1.For2, 1∈E1,we 2, 1 3, 2 4, 3 5, 4 compute E0 = {0, 0, 1, 1}.As|r|, |s| ∈ E0, we compute use FINDMATCH operation to add , , , , 6, 5 E E 6, 6∈E E1 = {0, 1, 1, 0, 1, 2, 2, 1, 2, 2, 3, 2, 4, 3, 5, 4, into 1. Similarly we compute 2.As 2,we 6, 5}.As|r|, |s| ∈ E1, we compute E2 = {0, 2, 1, 3, return 2 as the edit distance of the two strings. 2, 0, 2, 3, 3, 1, 3, 3, 4, 2, 4, 4, 5, 3, 5, 5, 6, 4, Complexity: Given two strings r and s, suppose their edit 6, 6}.As|r| =6, |s| =6∈E2, the edit distance distance is τ. For any entry i, j∈Ex,wehave|i − j|≤ between “seraji” and “srajit” is 2 and we terminate the ED(r[1,i],s[1,j]) ≤ x. For each i, i − x ≤ j ≤ i + x, thus computation. Notice that we can prune many useless entries. |Ex|≤(2x+1) ×min(|r|, |s|). The space complexity is O τ × min(|r|, |s|) . In addition, each entry in Ex+1 is computed Algorithm to compute Ex: We employ an iterative method to from at most three entries (left entry, top entry, top-left entry) compute Ex. Initially we compute E0, and then we compute E O( τ |E |)∗ E E in x. Thus the time complexity is x=0 x . As there x+1 based on x. For ease of presentation, we first introduce τ are at most (2τ +1)× min (|r|, |s|) entries in ∪ Ex, the an operation called FINDMATCH, which, given two prefixes x=0 O τ × min (|r|, |s|) of two strings, finds the entries with matching characters after time complexity is . r s the two prefixes. Formally, given two strings and , and two B. Progressive Similarity Search integers i and j,FINDMATCH(r, s, i, j) returns a set of entries k i + t, j + t such that s[i +1,i+ t]=r[j +1,j+ t]. If the We extend the progressive method to support top- similar- S two strings are clear in the context, FINDMATCH(r, s, i, j) ity search. Given a collection of strings, , and a query string q s ∈S is abbreviated as FINDMATCH(i, j). For example, consider , for each string , we compute its entry set, denoted by E (s) E s, i, j s ∈S two strings “seraji” and “srajit”. For i =2,j =1, x . Let x denote the set of triples where i, j E (s) s, i, j∈E i = |s| as s[i +1,i +4]= r[j +1,j +4]= “raji”, i +4,j + and in x . For each triple x,if j = |q| s q x 4 will be returned by the FINDMATCH operation. We have and , the edit distance between and is and we R |R| ≥ k FINDMATCH(2, 1) = {3, 2, 4, 3, 5, 4, 6, 5}. add it into the result set .If , we have found the top-k answers and terminate the iteration. Initially we use the FINDMATCH operation to compute E0. E x =0 For each entry i, j∈E0,asD[i][j]=0,wehavei = j and Next we discuss how to compute x.For ,we s ∈S IND ATCH r[1,i]=s[1,j]. Obviously E0 = FINDMATCH(−1, −1).Next enumerate each string and use the F M E operation to generate the entry set for s. For each entry we use an EXTENSION operation to compute x+1 based on x i, j∈E0(s) s, i, j E x +1 Ex. Let H=∪ Et. For each i, j∈Ex,EXTENSION applies , we add triple into 0.For ,we t=0 s, i, j∈E the following operations. enumerate each triple x and use the EXTENSION operation to compute Ex+1(s) based on Ex(s). For each pair (1) Substitution: As we can substitute r[i +1] for s[j +1], i ,j in Ex+1(s),weadds, i ,j into E . D[i+1][j +1] ≤ D[i][j]+1 = x+1.Ifi+1,j+1 ∈ H (i.e., x+1 D[i+1][j +1] >x D[i+1][j +1]= x+1 ∗ ), then , thus we add We use a hash table to implement H, and thus the complexity to check whether an i +1,j+1 into Ex+1.Asr[i +2] may match s[j +2], there entry is in H is O (1).

927 TABLE II `5 a PROGRESSIVELY COMPUTING EDIT DISTANCE (“srajit”,“seraji”) `5 a ` 5 a (a) E = (−1, −1) = {0, 0, 1, 1} 0 FINDMATCH `5a `5a `5 a (b) E1 E0 Computing based on E0 0,0 1,1 `5a ` 5 a Substitution Insertion Deletion Substitution Insertion Deletion `5a `5a EXTENSION 1, 1 1, 0 0, 1 2, 2 2, 1 1, 2 3, 2, 4, 3, 5, 4, 6, 5 E1 1, 0, 0, 1, 2, 2, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 1, 2 (c) Computing E2 based on E1 E1 1,0 0,1 2,2 2,1 3,2 4,3 5,4 6,5 1,2 Substitution 2,1 1,2 3,3 3,2 4,3 5,4 6,5 2,3 EXTENSION Insertion 2,0 1,1 3,2 3,1 4,2 5,3 6,4 2,2 7 7 7 77 7 Deletion 1,1 0,2 2,3 2,2 3,3 4,4 5,5 6, 6 1,3 E 6, 6 2 2,0 , 0,2 , 2,3 , 3,1 , 4,2 , 3,3 , 5,3 , 4,4 , 6,4 , 5,5 , , 1,3 Fig. 2. A trie structure for the strings in Table I However this method is expensive as it needs to enumerate FINDMATCH(n, q, j +1). every strings in S. Notice that we can share computations We can prove that Tx computed by our method satisfies on common prefixes of different strings. Consider two strings completeness and correctness as formalized in Lemma 2. s1,s2 and s1[1,i]=s2[1,i].Fort ≤ i,ifs1,t,j∈Ex, then Tx s2,t,j∈Ex, and vice versa. To share the computations on Lemma 2: The set computed by our method satisfies (1) (n, q[1,j]) = x n, j T common prefixes, we propose a trie-based method. completeness: If ED , is in x; and (2) n, j T (n, q[1,j]) = x We use a trie structure to index the strings in S. Each node correctness: If is in x, ED . on the trie is associated with a character. The character of the Example 4: Table III illustrates how to compute top-3 re- root is . The characters on a path from the root to a leaf node srajit n0, 0 T0 † sults of query “ ”. First, we add into . Then correspond to a string . Strings with the same prefixes share a we call FINDMATCH(n0,q,0) and add n1, 1 into T0 as n1 n common trie node. For simplicity, node is interchangeably matches q[1]. Next we extend each entry in T0 to generate T1. used with its corresponding prefix (the string composed of For n0, 0, we want to add n1, 1, n21, 1 (substitution), n characters from the root to node ). For example, Figure 2 n1, 0, n21, 0 (insertion), and n0, 1 (deletion). As n1, 1 shows a trie structure for strings in Table I. “suijt”, “suit”, is in T0, we do not add it into T1.Forn1, 1 in T0, we want n and “surajit” share a common prefix “su” (node 11). to add n2, 2n6, 2n11, 2 (substitution), n2, 1n6, 1 k Next we discuss how to use the trie structure to find top- n11, 1 (insertion), and n1, 2 (deletion) into T1.Forn1, 2, T answers efficiently. Let x denote the set of trie-based entries as its child n2 matches q[3], we use the FINDMATCH operation n, j (n, q[1,j]) = x n j with ED , where is a trie node and is to add n2, 3 into T1. Similarly we compute T1 and T2. n, j T n j = |q| an integer. For in x,if is a leaf node and , As there are 3 answers in T2, we terminate and return n20 n R |R| ≥ k we add into .If , we terminate the iteration. Next (“surajit”) with edit distance 1, and n5 (“sarit”) and n10 we discuss how to compute Tx. (“seraji”) with edit distance 2. For x =0, T0 is the set of entries n, j, where n Complexity: Given a query string q, an integer k and a string matches q[1,j]. We compute T0 as follows. As the root r set S, suppose the maximum edit distance between the top- matches the empty string ,weaddr, 0 into T0. Then k q τ n, j∈T we use the operation FINDMATCH(r, q, 0) to find the entries. results and is . For any entry x,wehave |n|−j ≤ (n, s[1,j]) ≤ x n If the root has no child with character q[1],FINDMATCH ED . Thus for each node ,we have |n|−x ≤ j ≤|n| + x. Let |T | denote the number of terminates; otherwise, there exists a child nc with character trie nodes in the first |q| + τ levels. We have |Tx|≤(2x + q[1],FINDMATCH adds nc, 1 into T0. Next for node nc, 1)|T | |T | FINDMATCH checks whether it has a child with label q[2] . In practice, we can prune many trie nodes, and x T is much smaller than (2x+1)|T |. Thus the worst-case space and repeats the above steps. Iteratively we get 0. τ O τ ×|T| n, j∈∪ Tx Next we use the EXTENSION operation to compute Tx+1 complexity is . For any entry x=0 , x based on Tx. Let H=∪ Tx.Forn, j in Tx,EXTENSION we compute it from at most three entries, thus the worst-case t=0 O |∪τ T | = τ ×|T| applies the following operations. time complexity is x=0 x . n n (1) Substitution: For each child c of node ,wecan IV. PIVOTAL ENTRY BASED METHOD substitute the character of nc for q[j +1].Ifnc,j+1 ∈ H, In this section, we propose a method to reduce the size of we add nc,j +1 into Tx+1.Asnc may contain a child Ex in order to improve the performance of computing the edit with label q[j +2], next we call the FINDMATCH(nc,q,j+1) operation to find those entries with matching characters. distance between two strings (Section IV-A) and then extend this technique to support similarity search (Section IV-B). (2) Insertion: For each child nc of node n, we can insert character of nc after q[j].Ifnc,j ∈ H,weaddnc,j into A. Using Pivotal Entries to Compute Edit Distance Tx+1. We also call the operation FINDMATCH(nc,q,j) . Based on the complexity analysis in Section III-A, if we (3) Deletion: We can delete q[j +1] from q.Ifn, j +1 ∈ H, can reduce the size of Ex, we can improve the performance. we add n, j+1 into Tx+1. We also need to call the operation Here we discuss how to reduce the size of Ex. Consider an † i, j Ex i, j Ex To make each tire leaf node corresponds to a string and vice versa is to add a special entry in . The goal of keeping in is to add mark to the end of each string. For simplicity we do not show the mark in the figure. i+1,j, i, j +1, and i+1,j+1 into Ex+1.Ifi+1,j+1

928 TABLE III AN EXAMPLE FOR TOP-3 SIMILARITY SEARCH “srajit” ON S USING THE PROGRESSIVE SEARCH FRAMEWORK

(a) T0 = n0, 0∪FINDMATCH(0, 0) = {n0, 0, n1, 1} (b) Computing T1 based on T0 T0 n0, 0 n1, 1 Substitution Insertion Deletion Substitution Insertion Deletion n1, 1n21, 1 n1, 0n21, 0 n0, 1 n2, 2n6, 2n11, 2 n2, 1n6, 1n11, 1 n1, 2 EXTENSION n3, 2n7, 2n8, 3n9, 4n10, 5 n2, 3 n16, 2n17, 3n18, 4n19, 5n20, 6 n21, 1n1, 0n21, 0n0, 1n2, 2n6, 2n11, 2n2, 1n6, 1n11, 1n3, 2n7, 2n8, 3 T1 n9, 4n10, 5n16, 2n17, 3n18, 4n19, 5n20, 6n1, 2n2, 3 (c) Computing T2 based on T1 T1 n21, 1 n0, 1 n1, 0 n21, 0 n2, 2 n6, 2 n11, 2 n1, 2 n2, 1 n6, 1 n11, 1 Substitution n22, 2 n1, 2 n2, 1 n22, 1 n3, 3 n7, 3 n12, 3 n2, 3 n3, 2 n7, 2 n12, 2 EXTENSION n21, 2 n6, 1 n16, 3 n6, 3 n16, 2 n11, 1 n13, 4 n11, 3 Insertion n22, 1 n1, 1 n2, 0 n22, 0 n3, 2 n7, 2 n12, 2 n2, 2 n3, 1 n7, 1 n12, 1 n23, 2 n21, 1 n6, 0 n16, 2 n6, 2 n16, 1 n11, 0 n11, 2 Deletion n21, 2 n0, 2 n1, 1 n21, 1 n2, 3 n6, 3 n11, 3 n1, 3 n2, 2 n6, 2 n11, 2 n22, 2n21, 2n22, 1n23, 2n0, 2n2, 0n6, 0n11, 0n22, 0n3, 3 T2 n7, 3n6, 3n12, 3n16, 3n11, 3n12, 2n13, 4n1, 3n3, 1n7, 1n12, 1n16, 1

T1 n2, 3 n3, 2 n7, 2 n8, 3 n9, 4 n10, 5 n16, 2 n17, 3 n18, 4 n19, 5 n20, 6 Substitution n3, 4 n4, 3 n8, 3 n9, 4 n10, 5 n17, 3 n18, 4 n19, 5 n20, 6 EXTENSION n4, 5 n5, 6 Insertion n3, 3 n4, 2 n8, 2 n9, 3 n10, 4 n17, 2 n18, 3 n19, 4 n20, 5 Deletion n2, 4 n3, 3 n7, 3 n8, 4 n9, 5 n10, 6 n16, 3 n17, 4 n18, 5 n19, 6 n3, 4n2, 4n5, 6n4, 3n4, 2n8, 2n8, 4n9, 3 T2 n18, 3n19, 4n20, 5n9, 5n10, 6n10, 4n16, 3n17, 2n17, 4n18, 5n19, 6 is also in Ex, we can remove i, j from Ex. The main reason distance. For example, in Figure 3, 0, 0 is not a pivotal entry p is as follows. First, i +1,j +1 is already in Ex, thus it as D[1][1] =D[0][0]. 1, 1 is a pivotal entry and E0 = {1, 1}. cannot be added into Ex+1. Second we prove that i+1,j and Although 2, 1, 3, 2, 4, 3, 5, 4 are in E1, they are not p i, j+1 are not needed to add into Ex+1. Consider i+1,j.If pivotal entries, as they are dominated by 6, 5.WehaveE1 p D[i+1][j] |r| j>|s| support or as follows. r[1,m]=s[1,m] by checking whether r[i]=s[i] for i ∈ [1,m]. p p If i>|r| or j>|s|, Next we discuss how to compute Ex+1 based on Ex. We ﬁrst prove that there are at most 2x +1 pivotal entries in Ep . First, D[i][j]=min(D[i][j−1]+1,D[i−1][j]+1,D[i−1][j−1]+1); x for any entry i, j,if|i − j| >x,wehaveD[i][j] >x(As r[1,i] s[1,j] If i ≤|r| and j ≤|s|, the edit distance between and is not smaller than their length difference, i.e., D[i][j] ≥|i − j|). Thus for any p D[i][j]=min(D[i][j−1]+1,D[i−1][j]+1,D[i−1][j−1]+δ), entry i, j in Ex,wehave−x ≤i−j≤ x. Second, let Ex[y] denote the set of entries whose i-value minus j-value is y, δ =0 r[i]=s[j] δ =1 D[i][j]= where if ; otherwise .If i.e., y = i−j. For any y∈[−x, x], there is at most one pivotal D[i+Δ][j +Δ] i, j i+Δ,j+Δ p , we call is dominated by . entry in Ex[y]. We can prove it by contradiction. Suppose there p Then we introduce a concept called pivotal entry. are two entries i, j and i ,j in Ex[y]. Without loss of i >i D[i][j]=D[i][j]=x Deﬁnition 2 (Pivotal Entry): An entry i, j in Ex is called generality, suppose .As ,we a pivotal entry, if D[i +1][j +1]= D[i][j]. can prove that for 0 ≤ ≤i − i, D[i + ][j + ]=x, p as formalized in Lemma 4. Thus i, j is not a pivotal entry, Obviously |r|, |s| is a pivotal entry. Let Ex denote the set p which contradicts with the assumption. of pivotal entries in Ex.If|r|, |s| ∈ Ex, ED(r, s)=x.To p compute the edit distance of r and s, we iteratively compute Lemma 4: Given any two entries i, j and i ,j in Ex[y] p p Ex from x =0.If|r|, |s| ∈ Ex, we return x as their edit (i

929 Ep [y] TABLE IV Based on Lemma 4, we prove that x has at most one srajit seraji Ep 2x +1 PIVOTAL ENTRIES TO COMPUTE EDIT DISTANCE(“ ”,“ ”) pivotal entry and x has at most pivotal entries, as p (a) E0 = {1, 1} formalized in Lemma 5. p p (b) Computing E1 based on E0 p Ep Ep[0] = 1, 1 Lemma 5: Ex[y](−x ≤ y ≤ x) has at most one pivotal 0 0 p Substitution Insertion Deletion E 2x +1 p p p entry and x has at most pivotal entries. EXTENSION E [0] = 2, 2 E [1] = 2, 1 E [−1] = 1, 2 1 p1 1 E1 [1] = 6, 5 2x +1 Ep p Based on Lemma 5, we only keep entries in x. E1 1, 2 , 2, 2 , 6, 5 p (c) Ep Ep Ep={1, 3 2, 3 6, 6 7, 5 7, 6} For each entry Ex[y], it may be computed from three entries, Computing 2 based on 1 . 2 , , , , Ep Ep[−1] = 1, 2 i 1 1 and we only need to keep the entry with the maximal -value. Substitution Insertion Deletion p p p Using this property, next we propose an extension operation EXTENSION E2 [−1]=2, 3 E2 [0]=2, 2 E2 [−2]=1, 3 p p Ep[0]=6, 6 E E 2 called PIVOTALEXTENSION to compute x+1 based on x. Ep Ep[0] = 2, 2 p p ‡ 1 1 For any entry i, j = Ex[|i−j|] ∈ Ex ,PIVOTALEXTENSION Substitution Insertion Deletion p p p EXTENSION E [0]=3, 3 E [1]=3, 2 E [−1]=2, 3 applies the following operations. p2 p2 p2 E2 [0]=6, 6 E2 [1]=7, 6 E2 [−1]=2, 3 Ep Ep[1] = 6, 5 (1) Substitution: We can substitute r[i +1] for s[j +1], and 1 1 Substitution Insertion Deletion i+1,j+1 D[i+2][j +2] EXTENSION p p p may be a pivotal entry. As may be E2 [1]=7, 6 E2 [2]=7, 5 E2 [0]=6, 6 equal to D[i +1][j +1], we use the FINDPIVOTAL operation p to find entry i ,j =FINDPIVOTAL(i+1,j+1). If there is no 6, 6∈E2, we return 2 as the edit distance. p p entry in Ex+1[i−j], we add i ,j into Ex+1[i−j]; otherwise |Ep |≤2x +1 p Complexity: As x and each row (colum- suppose i ,j has been added into Ex+1[i−j].Ifi >i ,we |r| |s| p n) has at most ( ) entries, the space complexity is use i,j to update i,j and E [i−j]=i,j. x+1 Omin(τ,|r|, |s|) . The worst-case time complexity is still (2) Insertion: We can insert r[i +1] after s[j], and i +1,j O τ ×min(|r|, |s|) . Since we can prune many useless entries, may be a pivotal entry. As D[i +2][j +1] may be equal to this method can improve the performance. D[i +1][j], we use the FINDPIVOTAL operation to find the entry i ,j =FINDPIVOTAL (i +1,j) . If there is no entry in B. Using Pivotal Triples to Support Similarity Search p p Ex+1[i +1− j],weaddi ,j into Ex+1[i +1− j]; otherwise The definition of pivotal entries depends on the two given p suppose i ,j has been added into Ex+1[i+1−j].Ifi >i , strings. Consider a trie node n, a query string q, and an integer i,j i,j Ep [i+1−j]=i,j we use to update and x+1 . j. Suppose nc and nc are two children of n.IfED(nc,q[1,j+ 1]) = (n, q[1,j]) n, j (3) Deletion: We can delete s[j +1] from s, and i, j +1 ED , is a pivotal entry for strings under n (n ,q[1,j +1])= (n, q[1,j]) n, j may be a pivotal entry. As D[i +1][j +2] may be equal to node c.IfED c ED , is not n D[i][j +1], we use the FINDPIVOTAL operation to find the a pivotal entry for strings under node c. Thus pivotal entries entry i ,j =FINDPIVOTAL (i, j +1). If there is no entry in cannot apply to support multiple strings. To address this issue, p p we introduce a new concept. Ex+1[i−(j+1)], we add i ,j into Ex+1[i−(j+1)]; otherwise p suppose i ,j has been added into Ex+1[i−(j+1)].Ifi >i , Definition 3 (Pivotal Triple): Given an entry n, j, one of i,j i,j Ep [i−(j+1)]=i,j we use to update and x+1 . n’s children nc, and a query q, triple n, j, nc is called a p § (n ,q[1,j+1])= (n, q[1,j]) Iteratively, we can compute Ex+1 . Lemma 6 proves that pivotal triple, if ED c ED . our method can correctly compute the pivotal set. The pivotal triple n, j, nc means that for all strings under p p Lemma 6: Ex computed by our method satisfies (1) com- node nc, n, j is a pivotal entry. Let Tx denote the pivotal p pleteness: if i, j is a pivotal entry, i, j∈Ex; and (2) triple set of pivotal triples n, j, nc such that ED(n, q[1,j]) = p p correctness: if i, j∈Ex, i, j must be a pivotal entry. x. We use Tx[y] to denote the subset of pivotal triples in Tp with y = |n|−j. For example, consider the trie in Example 5: Table IV illustrates how to use pivotal entries to x Figure 2 and query “srajit”. Consider node n0 and its child compute the edit distance of “srajit” and “seraji”. First, p n1 (“s”) and child n21 (“t”). Let D[n][j]=ED(n, q[1,j]). E0 = FINDPIVOTAL(−1, −1) = {1, 1}. Then we compute p p p p As D[n0][0] = D[n21][1], n0, 0,n21 is a pivotal triple. As E1 based on E0. Consider 1, 1∈E0. We add E1[0] = 2, 2 p p D[n0][0] = D[n1][1], n0, 0,n1 is not a pivotal triple. Sim- (substitution), E1[1] = 2, 1 (insertion) and E1[−1] = 1, 2 p p ilarly n1, 1,n2, n1, 1,n6, n1, 1,n11 are pivotal triples. (deletion) into E1.For2, 1∈E1, we use FINDPIVOTAL p p Thus T0={n0, 0,n21, n1, 1,n2, n1, 1,n6, n1, 1,n11}. operation to find pivotal entries and set E1[1] = 6, 5.For p p p p We still iteratively compute Tx from x =0. For each triple 2, 2∈E1, we set E1[0] = 2, 2.For1, 2∈E1,we p p n, j, nc in Tx,ifn is a leaf node and j = |q|, the string set E1[−1] = 1, 2. Then we apply PIVOTALEXTENSION p p p corresponding to n is an answer and we add it into the result operation on E . We add E [0] = 6, 6 into E instead of 1 2 2 set R. If there are k answers in R, we terminate the iteration. 2, 2 and 3, 3 which are not pivotal entries. Finally, as Iteratively, we can compute the top-k answers efficiently. Next p ‡ p p we discuss how to compute Tx. As Ex[y] has at most one pivotal entry, we refer to Ex[y] as the corresponding pivotal entry. p § p Algorithm to Compute Tx: For x =0, from the root r, We can keep entries in Ex in order sorted by i-values and visit the entries in descending order to avoid unnecessary operations. for each of its children, nc, we find pivotal triples as follows.

930 RR RR RR RR RR RR 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 888 V]C:HV V]C:HV 11 . 11 . 888 888 888 888 888 888

(a) FINDTRIPLE/FINDQUADRUPLE (b) UPDATETRIPLE/UPDATEQUADRUPLE Fig. 4. Operations of the pivotal triple based method and the range-based method

Algorithm 1:TOPKPIVOTALSEARCH(S,q,k) Suppose nc is the child of n = r with label q[1] and let Input: S: A string set; q: A query; k: No of answers; n−nc denote the set of other children of n except nc. For each Output:R: A set of top-k answers for S and q; node ns ∈ n − nc,asED(n, q[0]) = ED(ns,q[1]) (q[0] = ), p 1 x =0; n, 0,ns is a pivotal triple and added into T0 [0] (the entry 2 for each child rc of root r do n, 0 is a pivotal entry for all strings under node ns). For p 3 Tx = FINDTRIPLE(r, x, rc,q) ; node nc,asED(n, q[0]) = ED(nc,q[1]), entry n, 0 is not a |R| ≥ k R pivotal entry for strings under node nc. Next for each child 4 if then return ; of node nc, we repeat the above step to ﬁnd pivotal triples 5 while true do p Tp Tp x under node nc. Iteratively we can compute T0. This is an 6 x+1 =TRIPLEEXTENSION ( x, ); iterative method and next we introduce an operation called 7 if |R| ≥ k then return R ; ++x FINDTRIPLE to directly compute the pivotal entries by calling 8 ; FINDTRIPLE(n = r, 0,nc,q) for every child nc of r. FINDTRIPLE(n, j, nc,q) is extended from the FINDPIV- Function FINDTRIPLE(n, j, nc, q) OTAL operation (Figure 4(a)). Let n1 = nc and nm denote Input: n:A node; q:A query; j:An integer; nc:n’s child p the last matching node, that is the label of nm is q[j + m] Output: T : A set of matching entries; p p and none of its children has a label of q[j + m +1].FIND- 1 if nc.label=q[j+1] then T ←n, j, nc;return T ; PIVOTAL adds the following possible pivotal triples (we will 2 n1 = nc and nm is the last matching node ; p use UPDATETRIPLE to remove the non-pivotal triples later). 3 for i∈[1,m−1] do T ←ni,j+i−1,ns∈ni−ni+1; Algorithm 1 shows the pseudo-code of FINDTRIPLE. 4 if nm is a leaf and |q| = j + m then R←nm ; p (1) For each node ni (1≤i≤m−1), its child ni+1 matches 5 for each child ns of nm do T ←nm,j+ m, ns; p q[j + i]. Thus ED(ni,q[1,j + i]) = ED(ni+1,q[1,j + i +1]) 6 return T ; and ni,j+i, ni+1 is not a pivotal triple. For ns ∈ ni −ni+1, p ns does not match q[j + i +1] and ni,j+ i, ns is a possible Function TRIPLEEXTENSION(Tx, x) p pivotal triple. Thus FINDTRIPLE adds ni,j+ i, ns. Input: Tx: A set of entries; x: An integer; n n n q[j + p (2) For m and each of its child s, s does not match Output: Tx+1: A set of entries; p m +1] and nm,j+ m, ns is a possible pivotal triple. Thus 1 foreach n, j, nc∈Tx do FINDTRIPLE adds nm,j+ m, ns. 2 for each child node nd of nc do p For example, consider the query “srajit” and the trie 3 Tx+1=UPDATETRIPLE(nc,j+1,nd) ; n p structure in Figure 2. For the root, as its child 21 does 4 Tx+1=UPDATETRIPLE(nc,j,nd) ; not match q[1], n0, 0 is a pivotal entry for node n21. Thus p 5 Tx+1 = UPDATETRIPLE (n, j +1,nc); n0, 0,n21 is a pivotal triple. As the child n1 matches q[1], Tp n0, 0 is not a pivotal entry for node n1. Thus n0, 0,n1 6 return x+1; is not a pivotal triple. Next for n1, all of its children do not p match q[2], n1, 1 is a pivotal entry for nodes n2, n6, n11. by) other triples in Tx+1[y]. We use function UPDATETRIPLE Thus n1, 1,n2, n1, 1,n6, n1, 1,n11 are pivotal triples. (nc,j+1,nd) to update pivotal triples as follows. p Next we discuss how to compute T based on Tp .We x+1 x UPDATETRIPLE (Figure 4(b)): We use function FIND- propose a new extension operation TRIPLEEXTENSION.For p TRIPLE (nc, j +1, nd, q) to ﬁnd possible pivotal triples in each pivotal triple n, j, nc in Tx[y = |n|−j],TRIPLEEX- p ˜p ˜p Tx+1[y], denoted by Tx+1. For each triple n ,j ,nc in Tx+1, TENSION applies the following operations. p let U denote the set of triples n ,j ,nc in Tx+1[y] such that (1) Substitution: For the child nc of node n, we can substitute |nc |−j = y and nc is a descendant or an ancestor of nc. the character of nc for q[j +1]. Thus for each child nd of If U = φ, n ,j ,nc does not affect (and is not affected by) p nc, nc,j+1 may be a pivotal entry for strings under node other triples, and we add n ,j ,nc into Tx+1[y]; otherwise p nc and we want to add triple nc,j +1,nd into Tx+1[y = for each triple n ,j ,nc , we check whether it affects (or is p |nc|−(j +1)].Asnc,j+1,nd may affect (or be affected affected by) n ,j ,nc and update Tx+1 as follows.

931 TABLE V AN EXAMPLE FOR TOP-3 SIMILARITY SEARCH “srajit” ON S USING THE PIVOTAL-BASED SEARCH FRAMEWORK p (a) T0 = {n0, 0,n21, n1, 1,n2, n1, 1,n6, n1, 1,n11} p p (b) Computing T1 based on T0 p p T0 T0 [0] n0, 0,n21 n1, 1,n2 n1, 1,n6 n1, 1,n11 p Substitution T [0]: n21, 1,n22 n2, 2,n3 n6, 2,n7 n11, 2,n12 n11, 2,n16n20, 6,φ p1 EXTENSION Insertion T [1]: n21, 0,n22 n2, 1,n3n3, 2,n4 n6, 1,n7n10, 5,φ n11, 1,n12n11, 1,n16 p1 Deletion T1 [−1]: n0, 1,n21 n1, 2,n2n2, 3,n3 n1, 2,n6 n1, 2,n11 p n21, 1,n22n2, 2,n3n6, 2,n7n11, 2,n12n0, 1,n21n1, 2,n6n1, 2,n11 T1 n21, 0,n22n11, 1,n12n11, 1,n16n3, 2,n4n2, 3,n3n10, 5,φn20, 6,φ p p (c) Computing T2 based on T1 p p T1 T1 [0] n21, 1,n22 n2, 2,n3 n6, 2,n7 n11, 2,n12 n20, 6,φ p Substitution T [0]: n22, 2,n23 n3, 3,n4 n7, 3,n8 n12, 3,n13n12, 3,n15 φ, 7,φ p2 EXTENSION Insertion T [1]: n22, 1,n23n23, 2,n24 n3, 2,n4 n7, 2,n8 n12, 2,n13n12, 2,n15 φ, 6,φ p2 Deletion T2 [−1]: n21, 2,n22 n2, 3,n3 n6, 3, 7 n11, 3,n12 n20, 7,φ p n22, 2,n23n3, 3,n4n12, 3,n15φ, 7,φn21, 2,n22 T2 n11, 3,n12n20, 7,φn12, 2,n13n12, 2,n15φ, 6,φn23, 2,n24n13, 4,n14 p p T1 T1 [-1] n0, 1,n21 n2, 3,n3 n1, 2,n6 n1, 2,n11 p Substitution T [−1]: n21, 2,n22 n3, 4,n4n5, 6,φ n6, 3,n7 n11, 3,n12n11, 3,n16 2p EXTENSION Insertion T [0]: n21, 1,n22 n3, 3,n4 n6, 2,n7 n11, 2,n12 n11, 2,n16 p2 Deletion T2 [−2]: n0, 2,n21 n2, 4,n3 n1, 3,n6 n1, 3,n11 p T2 n11, 3,n16n0, 2,n21n2, 4,n3n1, 3,n6n1, 3,n11n5, 6,φ p p T1 T1 [1] n21, 0,n22 n3, 2,n4 n10, 5,φ n11, 1,n12 n11, 1,n16 p Substitution T [1]: n22, 1,n23 n4, 3,n5 φ, 6,φ n12, 2,n13 n12, 2,n15 n16, 2,n17 p2 EXTENSION Insertion T [2]: n22, 0,n23 n4, 2,n5 φ, 5,φ n12, 1,n13n12, 1,n15 n16, 1,n17 p2 Deletion T2 [0]: n21, 1,n22 n3, 3,n4 n10, 6,φ n11, 2,n12 n11, 2,n16 p T2 n4, 3,n5φ, 6,φn10, 6,φn22, 0,n23n4, 2,n5φ, 5,φn12, 1,n13n12, 1,n15n16, 1,n17 (i) If j >j, nc must be a descendant of nc. For strings Complexity: Let |B| denote the number of trie nodes at the p p under node nc , n ,j is a pivotal entry, and we still keep (|q|+τ)-th level. As we only keep Tx to compute Tx+1, the triple n ,j ,nc . Let n1,n2,...,nm denote the nodes on the space complexity is O(τ|B|). As the update operation (e.g., path from n1 = n to nm = n .WehaveED(n1,q[1,j ]) = using a hash table) takes O(1) time, the worst-case time ED(n2,q[1,j +1])= ... = ED(nm,q[1,j ]) = x +1. complexity is O(τ ×|T|). As the method prunes many useless Thus n ,j ,nc is not a pivotal triple and we need to add trie nodes, it improves the performance(Section VI-A). Tp [y] n ,j + i +1,n the following triples into x+1 : i s for V. A R ANGE-BASED METHOD i ∈ [1,m−1] n ∈ n −n n n and s i i+1 ( i’s children except i+1). As there may be multiple triples with the same entry (ii) If j ≤ j , nc must be an ancestor of nc. Let n, j, we want to group them to improve the performance. p n1,n2,...,nm denote the nodes on the path from n1 = n to Consider an entry n, j in Ex. Node n may have multiple nm = n .WehaveED(n1,q[1,j ]) = ED(n2,q[1,j +1])= children such that n, j, nc is a pivotal triple. It is expensive ... = ED(nm,q[1,j ]) = x +1. Thus n ,j ,nc is not a to keep all such triples. For example, n1, 1 is a pivotal pivotal triple. As for strings under node nc, n ,j is a pivotal entry for nodes n2,n6,n11, and we need to keep three triples entry, we replace triple n ,j ,nc with n ,j ,nc. We also n1, 1,n2, n1, 1,n6, n1, 1,n11. In addition, it is expensive p add the following triples into Tx+1[y]: ni,j + i +1,ns for to enumerate the nodes in n − nc (the set of children of n i ∈ [1,m−1] and ns ∈ ni −ni+1 (ni’s children except ni+1). except nc). To address this issue, we propose a range-based (2) Insertion: For the child nc of node n, we can insert method by grouping trie nodes. character of nc after q[j]. Thus for each child nd of nc, nc,j We encode the trie structure as follows. For each leaf node, may be a pivotal entry for strings under node nd. We call we assign an ID in a pre-order, which is also the ID of its n function UPDATETRIPLE (nc,j,nd) to add triples. corresponding string. For each internal node , we maintain an ID range [ln,un], where ln (un) is the minimum (maximum) (3) Deletion: We can delete q[j +1] from q. Thus n, j +1 ID of strings under the node. In Figure 2, the ID range of may be a pivotal entry for strings under node nc. We call node n1 is [1, 5] which denotes all strings with a prefix of n1 function UPDATETRIPLE (n, j +1,nc) to add triples. (“s”) have IDs in [1, 5] and all IDs in [1, 5] have a prefix “s”. Iteratively we can compute the pivotal triple set and Al- The basic idea of the range-based method is as follows. gorithm 1 shows the pseudo-code. The correctness of the Consider node n and its child node nc with label q[j +1]. algorithm is formalized in Lemma 7. The previous method needs to enumerate all nc’s siblings in Tp Lemma 7: x computed by our method satisfies (1) com- n − nc. Instead the range-based method is to use a range [l, u] n, j, n Tp pleteness: If c is a pivotal triple, it is in x; and (2) n − nc d p to denote the nodes in and use an integer to denote correctness: If n, j, nc is in Tx, it is a pivotal triple. |n|. Suppose the range of node n(nc) is Rn =[ln,un](Rnc =

Example 6: Consider the trie in Figure 2. Table V shows [lnc ,unc ]).AsRnc ⊆ Rn,weuseRn − Rnc =[ln,lnc − 1] ∪ [u +1,u ] n − n the example to find top-3 answers of query “srajit”. For nc n to denote the nodes in c. To this end, we p n1, 1,n2 in T0, we do substitution and add n2, 2,n3.For propose a concept, called pivotal quadruple. insertion, n2, 1,n3 is not a pivotal entry as n3 matches q[2]. [l, u],d,j n , 2,n Definition 4 (Pivotal Quadruple): A quadruple Thus we call FINDPIVOTAL operation and add 3 4 .For is a pivotal quadruple¶, if it satisfies (1) l, u is a sub-range of deletion, we also extend n1, 2,n2 to n2, 3,n3. Similarly we compute all pivotal triples in Table V. ¶The quadruple should be l, u, d, j. For clarity, we use [l, u],d,j.

932 TABLE VI AN EXAMPLE FOR TOP-3 SIMILARITY SEARCH “srajit” ON S USING THE RANGE-BASED METHOD r (a) T0 = FINDMATCH(0, 0) = {[6, 6], 0, 0, [1, 5], 1, 1} r (b) T1 = {[1, 5], 2, 2, [6, 6], 1, 1, [6, 6], 0, 1, [1, 1], 2, 3, [2, 5], 1, 2, [1, 1], 3, 2, [2, 2], 6, 5, [3, 4], 2, 1, [5, 5], 7, 6, [6, 6], 1, 0} r T0 [1, 5], 1, 1 [6, 6], 0, 0 Substitution E1[0] Insertion E1[1] Deletion E1[−1] Substitution E1[0] Insertion E1[1] Deletion E1[−1] [1, 5], 2, 2 [1, 5], 2, 1[1, 1], 3, 2 [1, 5], 1, 2 [6, 6], 1, 1 [6, 6], 1, 0 [6, 6], 0, 1 EXTENSION [2, 2], 6, 5[3, 4], 2, 1 [1, 1], 2, 3 [5, 5], 7, 6 [2, 5], 1, 2 r T1 [1, 5], 2, 2, [6, 6], 1, 1, [6, 6], 0, 1, [1, 1], 2, 3, [2, 5], 1, 2, [1, 1], 3, 2, [2, 2], 6, 5, [3, 4], 2, 1, [5, 5], 7, 6, [6, 6], 1, 0 r r (c) Computing T2 based on T1 r r r T1 T1[−1] [1, 1], 2, 3 [2, 5], 1, 2 [6, 6], 0, 1 T1[0] [1, 5], 2, 2 [6, 6], 1, 1 Substitution E2[−1] [1, 1], 3, 4[1, 1], 5, 6 [2, 5], 2, 3 [6, 6], 1, 2 E2[0] [1, 5], 3, 3 [6, 6], 2, 2 EXTENSION [3, 4], 3, 3 [3, 3], 4, 4 Insertion E2[0] [1, 1], 3, 3 [2, 5], 2, 2 [6, 6], 1, 1 E2[1] [1, 5], 3, 2 [6, 6], 2, 1[6, 6], 3, 2 Deletion E2[−2] [1, 1], 2, 4 [2, 5], 1, 3 [6, 6], 0, 2 E2[−1] [1, 5], 2, 3 [6, 6], 1, 2 [4, 4], 3, 3 r T2 [2, 5], 2, 3[6, 6], 0, 2[1, 1], 2, 4[2, 5], 1, 3[6, 6], 2, 2[1, 1], 3, 3[1, 1], 5, 6[3, 3], 4, 4[4, 4], 3, 3[6, 6], 3, 2[6, 6], 1, 2 r r T1 T1[1] [1, 1], 3, 2 [2, 2], 6, 5 [3, 4], 2, 1 [5, 5], 7, 6 [6, 6], 1, 0 Substitution E2[1] [1, 1], 4, 3 [2, 2], 7, 6 [3, 4], 3, 2 [5, 5], 8, 7 [6, 6], 2, 1 EXTENSION Insertion E2[2] [1, 1], 4, 2 [2, 2], 7, 5 [3, 4], 3, 1 [5, 5], 8, 6 [6, 6], 2, 0 Deletion E2[0] [1, 1], 3, 3 [2, 2], 6, 6 [3, 4], 2, 2 [5, 5], 7, 7 [6, 6], 1, 1 [1, 1], 4, 3[2, 2], 7, 6[3, 4], 3, 2[5, 5], 8, 7[2, 2], 6, 6 Tr 2 [5, 5], 7, 7[6, 6], 2, 0[1, 1], 4, 2[2, 2], 7, 5[3, 4], 3, 1[5, 5], 8, 6 r a d-th level node’s range; (2) for any string s with ID in [l, u], in Tx[y = d − j], it applies the following extensions. (s[1,d +1],q[1,j +1]) = (s[1,d],q[1,j]) ED ED ; (3) strings (1) Substitution: For any strings with ID in [l, u], we can l − 1 u +1 with ID or do not satisfy conditions (1) or (2). substitute the (d+1)-th character of these strings with q[j +1]. The quadruple [l, u],d,j means that for each string s in Thus d +1,j+1 may be a pivotal entry for strings in [l, u], r range [l, u], d, j is a pivotal entry for s and q. Let Tx [l, u],d+1,j+1 is a potential pivotal quadruple and we want r denote the set of pivotal quadruples [l, u],d,j such that to add it into Tx+1. However, there may be some strings with ED(s[1,d],q[1,j]) = x where s is a string with ID in [l, u]. the (d+2)-th character matching q[j +2]. For such strings, r r We use Tx[y] to denote the subset of Tx with y = d − j. d +1,j+1 is not a pivotal entry and d +2,j+2 may be For example, consider the trie in Figure 2 and query a pivotal entry. To address this issue, we propose a function “srajit”. For any string s in [1,5], D[1][1]=D[2][2], UPDATEQUADRUPLE ([l, u],d+1,j+1) to find quadruples. [1, 5], 1, 1 D[0][0]=D[1][1] is a pivotal quadruple. As , UPDATEQUADRUPLE (Figure 4(b)): Based on the above [1, 5], 0, 0 [6, 6], 0, 0 is not a pivotal quadruple. Similarly is observation, we want to find the nodes in the (d+2)-th level Tr={[1, 5], 1, 1, [6, 6], 0, 0} also a pivotal quadruple. Thus 0 . with character q[j +2] within the range [l, u]. As there may Tr x =0 k We still iteratively compute x from .Ifwefind be multiple such nodes, to accelerate this operation we build Tr results from x, our algorithm terminates. an inverted index I, where each entry is an integer d and a r Algorithm to Compute Tx: For x =0, from the root r, character c and the corresponding value is a set of d-th level for each of its children, nc, we find pivotal quadruples as nodes with label c. Thus we first find all the nodes in the follows. Suppose nc is the child of n = r with label q[1].For (d+2)-th level with character q[j +2] using the index I and s [l ,u ] − [l ,u ] (s[0],q[0]) = any string with ID in n n nc nc , ED then use a binary search to find the nodes within [l, u].For (s[1],q[1]) [l ,l −1], 0, 0 [u +1,u ], 0, 0 ED . Thus n nc and nc n each of these nodes, n, we use the aforementioned function r are pivotal quadruples and added into T0 [0]. Next for node nc, FINDQUADRUPLE (n, j +2, q) to find all possible pivotal ˜r we repeat the above step to find pivotal quadruples under node quadruples, denoted by Tx+1. r nc. Iteratively we can compute T . This is an iterative method ˜r 0 Notice that the quadruples in Tx+1 may affect (or be and we introduce a function FINDQUADRUPLE(n, j, q) to di- r affected by) quadruples in Tx+1. Thus we need to update (r, 0,q) rectly compute the pivotal entries (using parameters ). quadruples as follows. For each quadruple [l ,u],d,j in FINDQUADRUPLE extends FINDTRIPLE by grouping nodes ˜r Tx+1, let U denote the set of quadruples [l ,u ],d ,j in (Figure 4(a)). Algorithm 2 shows the pseudo-code. It first finds r Tx+1[y] such that d − j = y and [l ,u ] has an overlap n = n, n , ···,n n the matching nodes 1 2 m, where m is the last with [l ,u].IfU = φ, we do not need to update and add matching node. For each node ni for 1 ≤ i ≤ m − 1, its r [l ,u],d,j into Tx+1[y]; otherwise for each quadruple n q[j + i] r child i+1 matches the query character . Instead of [l ,u ],d ,j in U, we update T as follows. n − n x+1 enumerating each node in i i+1, we group the siblings n n [l ,l − 1] (i) j >j and d >d: For any string s in [l ,u ],as of i+1 into two groups based on i+1: ni ni+1 [u +1,u ] [l ,l − ED(s[d ],q[j ]) = ED(s[d +1],q[j +1]), d ,j is still and ni+1 ni .FINDQUADRUPLE adds ni ni+1 1], |n |,j+ i [u +1,u ], |n |,j+ i a pivotal entry and we keep [l ,u ],d ,j . d ,j is not a i and ni+1 ni i . Similarly for n [l ,u ],n ,j+ m pivotal entry as ED(s[d ],q[j ]) = ED(s[d ],q[j ]) = x+1 and m,FINDQUADRUPLE adds nm nm m . r r j >j.Weadd[l ,l − 1],d,j and [u +1,u],d,j . Next we discuss how to compute Tx+1 based on Tx.We propose a new operation QUADRUPLEEXTENSION to support (ii) j ≤j and d ≤d : For any string s in [l ,u], d ,j is quadruple extensions. For each pivotal quadruple [l, u],d,j a pivotal entry as ED(s[d ],q[j ]) = ED(s[d +1],q[j +1]).

933 TABLE VII Algorithm 2:TOPKRANGESEARCH(S,q,k) DATASETS Input: S: A string set; q: A query; k: No of answers; Datasets Cardinality Avg Len Max Len Min Len Output:R: A set of top-k answers for S and q; Word 146,033 16.01 35 1 Author 10.27 million 22.02 383 8 1 x =0; r Email 6.4 million 26.58 57 7 2 Tx = FINDQUADRUPLE(r, 0,q) ; 3 if |R| ≥ k then return R ; pivotal quadruples as illustrated in Table VI. 4 while true do r r |B| 5 Tx+1 =QUADRUPLEEXTENSION (Tx, x); Complexity: Let denote the number of trie nodes at (|q|+τ−1) 6 if |R| ≥ k then return R ; the -th level, which is not larger than the number (|q|+τ) |B| 7 ++x ; of trie nodes at the -th level . As we only keep r r Tx to compute Tx+1, the space complexity is O(τ ×|B|). Function FINDQUADRUPLE(n, j, q) As the update operation (e.g., using a hash table) takes O(1) O(τ ×|T|) Input: n: A node; j: An integer; q: A query; time, the worst-case time complexity is . Output: Tr: A set of matching ranges; As the method groups many pivotal triples, it improves the 1 n1 = n and nm is the last matching node ; performance(Section VI-A). 2 for ni ∈{n1,n2,...,nm−1} do r VI. EXPERIMENTAL STUDY 3 T ←[lni ,lni+1 − 1], |ni|,j+ i − 1 ; Tr ←[u +1,u ], |n |,j+ i − 1 4 ni+1 ni i ; We conducted an extensive set of experimental studies on 5 if nm is a leaf and |q| = j + m then R←nm ; three real datasets. The first one is the Author dataset which r 6 T ←[lnm ,unm ], |nm|,j+ m ; is a set of author names and extracted from the publications r 7 return T ; in PubMed . The second one is the Word dataset, which is a set of common English words. The third one is a set of Email Tr x Function QUADRUPLEEXTENSION( x, ) addresses. We randomly selected 100 queries from the datasets r Input: Tx: A set of entries; x: An integer; and compared the average elapsed time. Table VII shows the r Output: Tx+1: A set of entries; detailed information of the three datasets. Figure 5 shows the [l, u],d,j∈Tr 1 foreach x do string length distributions of the three datasets. Tr = [l, u],d+1,j+1 2 x+1 UPDATEQUADRUPLE ( ); We compared our algorithms with state-of-the-art methods, Tr = [l, u],d+1,j ed 3 x+1 UPDATEQUADRUPLE ( ); AQ [20], B -Tree [21], and Flamingo [10]. The code of Tr = [l, u],d,j+1 4 x+1 UPDATEQUADRUPLE ( ); Bed-Tree was provided by the authors. The code of Flamingo r ∗∗ 5 return Tx+1; was download from their website . We extended it to support k d,j (s[d],q[j]) = (s[d],q[j]) = x +1 top- search by increasing the thresholds (initialized as 0). is not as ED ED As it is a famous threshold-based string similarity search and j ≤ j. We replace [l,u],d,j with [l,u],d,j [l,l − 1],d,j [u +1,u],d,j algorithm, we selected it as a baseline for comparison. AQ and add , . was implemented by ourselves in C++. All the algorithms were (2) Insertion: For any strings with ID in range [l, u],we implemented in C++ and compiled using GCC 4.2.4 with -O3 can insert the (d+1)-th character after q[j]. Thus d +1,j flag. All the experiments were run on a Ubuntu machine with may be a pivotal entry for strings in [l, u]. We call function an Intel Xeon X5670 2.93GHz CPU and 32 GB memory. UPDATEQUADRUPLE ([l, u],d+1,j) to update quadruples. A. Evaluation on Our Techniques (3) Deletion: For any strings with ID in range [l, u], we can In this section, we compare our proposed techniques, the delete q[j +1] from q. Thus d, j +1 may be a pivotal entry progressive-based method, the pivotal entry based method, and for strings in [l, u]. We call function UPDATEQUADRUPLE the range-based method. We first compared the number of ([l, u],d,j+1) to update quadruples. entries that were needed to compute of the three methods. Iteratively we can compute the pivotal quadruple set and Figure 6 shows the results by varying k on the three datasets. Algorithm 2 shows the pseudo-code. The correctness of the We can see that the pivotal entry based method involved algorithm is formalized in Lemma 8. smaller numbers of entries than the progressive-based method r on the Email dataset and the Author dataset. This is because Lemma 8: Tx computed by our method satisfies (1) com- r the pivotal entry based method only computed pivotal entries pleteness: If [l, u],d,j is a pivotal quadruple, it is in Tx; (2) r and pruned large numbers of useless entries. For example, correctness: If [l, u],d,j is in Tx, it is a pivotal quadruple. on the Email dataset, for k =50, the progressive-based Example 7: Consider the trie in Figure 2. Table VI shows method computed 0.8 billion entries, and the pivotal entry the example to find top-3 answers of query “srajit”. Con- based method computed 0.6 billion entries. On the Word [6, 6], 1, 1 Tr [6, 6], 2, 2 sider in 1. For substitution, we add . dataset, the pivotal entry based method was worse than the For insertion, [6, 6], 2, 1 is not a pivotal entry as the third progressive-based method. This is because if an entry is a character of string s6=“thrifty” matches q[2]=‘r’. Thus we pivotal entry for all children, the pivotal entry based method [6, 6], 3, 2 apply the FINDQUADRUPLE operation and add . http://www.ncbi.nlm.nih.gov/pubmed/ ∗∗ For deletion, we add [6, 6], 1, 2. Similarly we compute all http://flamingo.ics.uci.edu/

934 20000 1e+006 1.6e+006 1.4e+006 16000 800000 1.2e+006 12000 600000 1e+006 800000 8000 400000 600000 400000 4000 200000 200000 Numbers of strings 0 Numbers of strings 0 Numbers of strings 0 5 10 15 20 25 30 35 50 100 150 200 250 300 350 10 20 30 40 50 60 String Lenghts String Lenghts String Lenghts (a) The Word Dataset (b) The Author Dataset (c) The Email Dataset Fig. 5. String length distribution Progressive 3e+009 Progressive 2.5e+007 Progressive 1.2e+009 Pivotal Pivotal Pivotal Range 2.5e+009 1e+009 Range 2e+007 Range 2e+009 8e+008 1.5e+007 1.5e+009 6e+008 1e+007 1e+009 4e+008

Number of Entries 5e+006 Number of Entries 5e+008 Number of Entries 2e+008 0 0 0 1 5 10 25 50 100 1 5 10 25 50 100 1 5 10 25 50 100 Top-k Top-k Top-k (a) The Word Dataset (b) The Author Dataset (c) The Email Dataset Fig. 6. Evaluating Our Techniques - Number of Entries 6 1000 500 Progressive Progressive Progressive 5 Pivotal Pivotal Pivotal Range 800 Range 400 Range 4 600 300 3 400 200 2 200 100

Elapsed Time (ms) 1 Elapsed Time (ms) Elapsed Time (ms) 0 0 0 1 5 10 25 50 100 1 5 10 25 50 100 1 5 10 25 50 100 Top-k Top-k Top-k (a) The Word Dataset (b) The Author Dataset (c) The Email Dataset Fig. 7. Evaluating Our Techniques - Elapsed Time needed to maintain all such triples which may be expensive. B. Comparison with Existing Methods However the range-based method grouped such triples and In this section, we compare our range-based method with significantly reduced the number of entries. state-of-the-art methods, AQ, Bed-Tree, and Flamingo by In addition, the range-based method computed much smaller varying different k on the three datasets. As they needed tune numbers of entries than the pivotal entry based method and some parameters (e.g., gram length), we reported their best the progressive-based method on the three datasets. The main results. Figure 8 shows the experimental results. reason is that the range-based method grouped large numbers We can see that for small k values (k<50), AQ had the of pivotal entries and reduced the number of pivotal entries worst performance as it is rather time consuming to adaptively significantly. For example, on the Email dataset, for k = 100, select a good gram length. For large k values, AQ was better the pivotal entry based method and the progressive-based than Flamingo as the search time was larger than the time to method involved more than 1 billion entries, and the range- select a good gram length. For example, on the Word dataset, based method had only 0.1 billion entries. The experimental for k =25, AQ took 5 milliseconds and other methods took results consist with our analysis in Section IV. less than 3 milliseconds. In addition, Bed-Tree was better Next we compare the average elapsed time of different than AQ and Flamingo for large k values, as it dynamically methods by varying k. Figure 7 shows the results. updated the threshold and used the threshold to do pruning. We can see that the range-based method achieved the best However it had low pruning power for small k values. This is performance and outperformed the other two methods. The because for small k values, it may scan many irrelevant strings main reason is that, the range-based method pruned many and cannot use a tighter bound to do pruning. non-pivotal entries against the progressive-based method and Our method achieved the highest performance and outper- grouped the pivotal entries to avoid unnecessary computa- formed existing methods. This is because Bed-Tree only used tions. For example, on the Email dataset, for k =50, the the string level pruning, while our method can utilize the progressive-based method took 300 milliseconds, the pivotal character-level pruning. That is for a string, Bed-Tree can entry based method improved the time to 150 milliseconds, only take its edit distance to the query as a threshold. Our and the range-based method further reduced time to 50 mil- method can progressively compute edit distance and can use liseconds. This shows that our range-based pruning technique the edit distance of prefixes of a string and the query as can prune large numbers of unnecessary entries and improve a threshold. Thus our method outperformed Bed-Tree.For the performance significantly. example, on the Email dataset, for each k value, our method

935 10 1400 500 Flamingo Flamingo Flamingo AQ 1200 AQ AQ 8 BedTree BedTree 400 BedTree Range 1000 Range Range 6 800 300 4 600 200 400 2 100 Elapsed Time (ms) Elapsed Time (ms) 200 Elapsed Time (ms) 0 0 0 1 5 10 25 50 100 1 5 10 25 50 100 1 5 10 25 50 100 Top-k Top-k Top-k (a) The Word Dataset (b) The Author Dataset (c) The Email Dataset Fig. 8. Comparison with Existing Methods 2 100 k=100 k=25 k=5 140 k=100 k=25 k=5 k=100 k=25 k=5 k=50 k=10 k=1 k=50 k=10 k=1 k=50 k=10 k=1 120 80 1.5 100 60 1 80 60 40 0.5 40 20 Elapsed Time (ms) Elapsed Time (ms) 20 Elapsed Time (ms) 0 0 0 3 6 9 12 15 1 2 3 4 5 1 2 3 4 5 6 String Size (*10000) String Size (* 2 million) String Size (*million) (a) The Word Dataset (b) The Author Dataset (c) The Email Dataset Fig. 9. Scalability of Our Method achieved the best performance. For k = 100, Flamingo REFERENCES ed took more than 400 milliseconds, B -Tree improved the [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. time to 300 milliseconds, and our method only took less In VLDB, pages 918–929, 2006. than 80 milliseconds. The results show the superiority of our [2] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131–140, 2007. progressive based framework and our pivotal-entry-based and [3] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and range-based pruning techniques. efficient fuzzy match for online data cleaning. In SIGMOD Conference, pages 313–324, 2003. [4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for C. Scalability similarity joins in data cleaning. In ICDE, pages 5–16, 2006. In this section, we evaluate the scalability of our range- [5] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for based method. We varied the number of strings and evaluated free. In VLDB, pages 491–500, 2001. our method for finding top-k answers. Figure 9 shows the [6] M. Hadjieleftheriou, N. Koudas, and D. Srivastava. Incremental main- results on the three datasets. We can see that with the datasets tenance of length normalized indexes for approximate string matching. k In SIGMOD Conference, pages 429–440, 2009. increased, our method scaled very well for different values. [7] J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string similarity joins. For example, on the Email dataset, for k=100, our method took In SIGMOD Conference, pages 327–338, 2010. 27 milliseconds for 1 million strings, and the time increased to [8] S. Ji, G. Li, C. Li, and J. Feng. Efficient interactive fuzzy keyword search. In WWW, pages 433–439, 2009. 52 milliseconds for 3 million strings and 79 milliseconds for [9] T. Kahveci and A. K. Singh. Efficient index structures for string 6 million strings. Note that on the Author dataset the elapsed databases. In VLDB, pages 351–360, 2001. time for 10 million strings was smaller than that for 8 million [10] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257–266, 2008. strings. This is because a large string set may have more [11] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based possible answers and thus it may lead to early termination method for similarity joins. PVLDB, 5(3):253–264, 2011. for finding top-k answers. [12] G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. VLDB J., 20(4):617–640, 2011. [13] G. Li, J. Wang, C. Li, and J. Feng. Supporting efficient top-k queries VII. CONCLUSION in type-ahead search. In SIGIR, pages 355–364, 2012. In this paper, we have studied the problem of top-k string [14] G. Navarro. A guided tour to approximate string matching. ACM similarity search. We proposed a progressive framework to Comput. Surv., 33(1):31–88, 2001. k [15] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. support top- similarity search. We proposed pivotal entries In SIGMOD Conference, pages 743–754, 2004. to avoid unnecessary computations which can prune large [16] J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string numbers of useless entries. We extended this technique to similarity joins with edit-distance constraints. VLDB, 1219–1230, 2010. [17] J. Wang, G. Li, and J. Feng. Fast-join: An efficient method for fuzzy support similarity search. We devised a range-based method token matching based string similarity join. In ICDE, pages 458–469, by grouping the pivotal entries which can further reduce the 2011. number of entries. Experimental results show that our method [18] C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. VLDB, 933–944, 2008. significantly outperforms existing methods. [19] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, pages 916–927, 2009. ACKNOWLEDGMENT [20] Z. Yang, J. Yu, and M. Kitsuregawa. Fast algorithms for top-k approximate string matching. In AAAI, 2010. This work was partly supported by the National Natural [21] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: Science Foundation of China under Grant No. 61003004 and an all-purpose index structure for string similarity search based on edit No. 61272090, National Grand Fundamental Research 973 distance. In SIGMOD Conference, pages 915–926, 2010. Program of China under Grant No. 2011CB302206, and a Tsinghua project under Grant No. 20111081073.

936