Efficient Top-k Algorithms for Approximate Substring Matching

Younghoon Kim Kyuseok Shim Seoul National University Seoul National University Seoul, Korea Seoul, Korea [email protected] [email protected]

ABSTRACT to deal with vast amounts of data, retrieving similar strings There is a wide range of applications that require to query becomes a more challenging problem today. a large database of texts to search for similar strings or To handle approximate substring matching, various string substrings. Traditional approximate substring matching re- (dis)similarity measures, such as edit distance, hamming dis- quests a user to specify a similarity threshold. Without top- tance, Jaccard coefficient, cosine similarity and Jaro-Winkler k approximate substring matching, users have to try repeat- distance have been considered [2, 10, 11, 19]. Among them, edly different maximum distance threshold values when the the edit distance is one of the most widely accepted distance proper threshold is unknown in advance. measures for database applications where domain specific In our paper, we first propose the efficient algorithms knowledge is not really available [10, 12, 20, 27]. for finding the top-k approximate substring matches with Based on the edit distance, the approximate string match- a given query string in a set of data strings. To reduce the ing problem is to find every string from a string database number of expensive distance computations, the proposed whose edit distance to the query string is not larger than a algorithms utilize our novel filtering techniques which take given maximum distance threshold. There are diverse appli- advantages of q-grams and inverted q-gram indexes avail- cations requiring approximate string matching. For exam- able. We conduct extensive experiments with real-life data ple, it is useful to check whether the names or addresses pro- sets. Our experimental results confirm the effectiveness and vided by users as input are correct by querying databases to scalability of our proposed algorithms. search for the strings within a maximum distance threshold [29]. However, for databases with long strings, the contain- Categories and Subject Descriptors ment search of a query string called approximate substring matching is more reasonable. H.2 [Database Management]: Systems—query process- ing, textual databases Id String Id String s1 Jackson Pollock s4 Jacksomville s2 Jakob Pollack s5 Jakson Pollack Keywords s3 Jason Polock s6 Mackson Polock Top-k approximate substring matching; edit distance; in- Figure 1: An example of a string database D verted q-gram index Two different problem definitions have been introduced for approximate substring matching. In the first definition, 1. INTRODUCTION the problem is to discover all substrings of each string in a There is a wide range of applications that require to query database, whose edit distances to the query string are within a large database of texts to search for similar strings or sub- the given maximum threshold [4, 17, 27]. Consider a query strings. A list of possible applications includes substring string ‘Jackson’ and a database shown in Figure 1. If the matching or short snippet suggestions for web search [23], maximum edit distance threshold is 2, with ‘Jackson Pollock’ named entity recognition [27], bio-informatics for finding alone, the substrings of ‘Jacks’, ‘Jackso’, ‘Jackson’, ‘ackson’ DNA subsequences [20], music data retrieval [18] and spell and ‘ckson’ are all included in the query result. checking [26]. These applications generally require high per- Another popular definition of the approximate substring formance for real-time query processing to find similar sub- matching problem is to retrieve every string s whose sub- strings from text databases. While these applications are string edit distance to the query string σ is at most the given not all new, as there is an increasing trend of applications maximum threshold, where the substring edit distance be- tween σ and s is the smallest edit distance among the ones between σ and all substrings in s [13, 25]. For example, let us reconsider the string database in Figure 1 and the query Permission to make digital or hard copies of all or part of this work for string ‘Jackson’. If the maximum threshold is 2, the ap- personal or classroom use is granted without fee provided that copies are proximate substring matches are {‘Jackson Pollock’, ‘Jason not made or distributed for profit or commercial advantage and that copies Polock’, ‘Jacksomville’, ‘Jakson Pollack’, ‘Mackson Polock’}. bear this notice and the full citation on the first page. To copy otherwise, to All previous studies of approximate string or substring republish, to post on servers or to redistribute to lists, requires prior specific matching mentioned so far require each user to provide a permission and/or a fee. SIGMOD’13, June 22–27, 2013, New York, New York, USA. maximum distance threshold. However, in many applica- Copyright 2013 ACM 978-1-4503-2037-5/13/06 ...$15.00. tions, it is very difficult for each user to know a proper

385 threshold in advance since it depends on the application 2. PRELIMINARIES scenarios. An appealing alternative method is to compute In this section, we provide definitions used in our paper the most similar k strings or substrings without the need to and present a precise statement of the problem for the top-k specify a maximum distance threshold. Without such top-k approximate substring matching. approximate matching, users have to try different distance threshold values, which may lead to empty results (if the 2.1 Notations and our Problem Definition threshold chosen is too low) or too many results with a long Let Σ be a finite alphabet of size |Σ|. For a string s of running time (if the threshold is too high). It also supports Σ∗, we denote the length of s with |s|. We use s[i, j] with interactive applications, where users are presented with top- 1≤i≤j≤|s| to denote the substring of s which starts from k most similar strings or substrings progressively. position i and ends at position j. Similarly, s[i] with 1≤i≤|s| In our paper, we propose the efficient algorithms for the denotes the character at the i-th position of s. top-k approximate substring matching problem which find Edit distance: For two strings si and sj , si can be trans- the top-k strings whose substring edit distances to the query formed to sj by applying repeatedly the three operations: string are the smallest among all strings in a given database. insertion, deletion and substitution. The edit distance be- Contrast to traditional approximate substring matching, top- tween si and sj , which is denoted by d(si, sj ), is defined as k approximate substring matching (1) does not require each the minimum number of operations needed to transform si user to provide a maximum distance threshold, and (2) finds to sj (or sj to si). The dynamic programming algorithm the top-k strings in data whose ‘substring edit distances’ to to compute the edit distance between two strings si and sj a given query string are the k smallest values. To the best of takes O(|σ| |s|) time [14]. our knowledge, no existing work has addressed the top-k ap- proximate substring matching problem and our algorithms Substring edit distance: The substring edit distance from presented in this paper are the first work for the problem. a string s to another string σ, which is represented by dsub(s, σ), Example 1.1.: Consider the string data in Figure 1. Sup- is the minimum among the edit distance between σ and ev- pose that the query string σ is ‘Jackson’ and k = 3. Since s1 ery substring of s. The O(|σ||s|)-time algorithm based on has the substring which is exactly σ, its substring edit dis- dynamic programming to compute the substring edit dis- tance to σ is 0. For s4 and s5, the substring edit distances to tance is presented in [24]. (For more details, read [24, 25].) σ are 1 since s4 and s5 have ‘Jacksom’ and ‘Jakson’ respec- We next present the definition of our top-k approximate tively. For other strings, the substring edit distances to σ are substring matching problem. at least 1. Thus, the top-3 approximate substring matches are {‘Jackson Pollock’, ‘Jacksomville’, ‘Jakson Pollack’}. Definition 2.1.: (Top-k Approximate Substring Match- Given a set of strings D and a query string σ, a naive ing) Given a set of strings D and a query string σ, the top-k algorithm of top-k approximate substring matching exam- approximate substring matching problem is to find the k ap- ines every string s in D and compute the substring edit proximate strings TOPk(σ) = s1, s2, ..., sk in D which is distance dsub(s, σ) one by one. We refer to the brute-force an ordered sequence of k strings satisfying the following con- algorithm as TopK-NAIVE. Since computation of substring ditions: edit distance dsub(s, σ) is very expensive, we develop more • dsub(s1, σ) ≤ dsub(s2, σ) ≤ ≤ dsub(sk, σ) holds. efficient algorithms by utilizing q-grams in the query strings • For every sj ∈D such that sj ∈/ TOPk(σ), we have dsub(sk, σ) and available inverted q-gram indexes adopted widely to find ≤ dsub(sj , σ). approximate string matches [9]. This paper makes the fol- Example 2.2.: Consider the strings shown in Figure 1. lowing contributions: Suppose that the query string σ is ‘Jackson’ and we are inter- • We first propose the algorithm TopK-LB, which reduces ested in the top-2 approximate substring matches TOP2(σ). the number of substring edit distance computations in Since s1 includes σ as a substring, the substring edit dis- TopK-NAIVE, by checking each string s with a lower tance dsub(s1, σ) is 0. For the string s4, dsub(s4, σ) is 1 bound of dsub(s, σ) first based on dynamic programming. because s4 has a substring ‘Jacksom’. For other strings, the • We next present the algorithm TopK-SPLIT which allows substring edit distances to σ are at least 1. Thus, we get not to calculate the lower bounds of dsub(s, σ) for some TOP2(σ)={s1,s4}. strings by dividing data D into partitions based on q- grams and utilizing our proposed lower bound of dsub(s, σ) 2.2 The q-grams and Inverted q-gram Indexes between σ and all strings in each partition. We define q-grams and positional q-grams as follows. • We also develop the algorithm TopK-INDEX that can Definition 2.3.: The q-grams of a string s are s[i, i + speed up TopK-SPLIT by utilizing the posting lists of q − 1]s with all 1≤i≤(|s|−q+1). The positional q-grams of some q-grams in the query string, which can be obtained a string s are all pairs of each q-gram in s and its position from the existing inverted positional q-gram index, when- (i.e., (s[i, i + q − 1], i) with 1≤i≤(|s|−q+1)). ever it is available by DBMSs. • We conducted an extensive performance study of our pro- Example 2.4.: For the string s=‘Jackson’ and q=3, (‘Jac’, posed algorithms with real-life data sets. When there is 1), (‘ack’,2), (‘cks’,3), (‘kso’,4) and (‘son’,5) are the posi- no inverted q-gram index, we found that TopK-SPLIT tional 3-grams of the string s. is the best performer and faster than TopK-NAIVE by Given a q-gram and a string containing the q-gram, a pair an order of magnitude. Whenever the inverted indexes of its string id and the position where the q-gram occurs in are available, TopK-INDEX is the best and much faster the string is called a posting. With a set of strings D, the than TopK-SPLIT. Experimental results also confirm that posting list of a q-gram is the list of postings for every oc- our algorithms outperform the extended traditional algo- currence of the q-gram in all strings in D. Note that every rithms significantly and scale up well with large data sizes. posting list is primarily sorted in increasing order of string

386 q-gram posting list q-gram posting list q-gram posting list

Jacs1,1), ( (s 4,1) llo (s1,11) ack (s2,11), (s5,12) computing a substring edit distance requires to find the sub- ack (s1,2), (s4,2), (s6,2) loc (s1,12), (s3,10), (s6,11) Jas (s3,1) cks (s1,3), (s4,3), (s6,3) ock (s1,13), (s3,11), (s6,12) aso (s3,2) string with the minimum edit distance to σ, we can ignore Jak omv kso (s1,4), (s4,4), (s5,3),(s6,4) (s2,1), (s5,1) (s4,6) ako mvi (s4,7) the substrings of s whose lengths are larger than |σ|. Thus, son (s1,5), (s3,3), (s4,5), (s5,4), (s6,5) (s2,2) kob vil (s4,8) on_ (s1,6), (s3,4), (s5,5), (s6,6) (s2,3) ℓo(dsub(s, σ)) is the minimum one among ℓo(d(s[i,i+|σ|−1], ill n_P (s1,7), (s3,5), (s5,6), (s6,7) ob_ (s2,4) (s4,9) lle _Po (s1,8), (s2,6), (s3,6), (s5,7), (s6,8) b_P (s2,5) (s4,10) σ))s with 1≤i≤|s|−|σ|+1. aks 5 Pol (s1,9), (s2,7), (s3,7), (s5,8), (s6,9) lla (s2,9), (s5,10) (s ,2) oll (s1,10), (s2,8), (s5,9) lac (s2,10), (s5,11) kso (s5,3) To compute ℓo(dsub(s, σ)) based on the above Lemma, if olo (s3,8), (s6,10) Mac (s6,1) som (s4,6) we count the common q-grams between σ and every sub- Figure 2: An inverted index for the database D string s[i, i+|σ|−1] with 1≤i≤|s|−|σ|+1, it takes O(|σ| |s|) ids and secondarily sorted in ascending order of positions. time. We now present the algorithm called DYN-LB which With an inverted q-gram index of D, we can access the post- computes a tighter lower bound ℓo(dsub(s, σ)) based on dy- namic programming by utilizing the positions of matching ing lists with q-grams as keys. In Figure 2, we show the 2 inverted 3-gram index of the strings in Figure 1. q-grams with O(ℓ ) time, where ℓ represents the number of matching q-gram pairs between s and σ. Note that we 3. THE LOWER BOUNDS OF SUBSTRING generally have ℓ ≪ min(|s|, |σ|). Applying dynamic programming: We use the following EDIT DISTANCES notations for a query string σ and a string s in our dynamic In this section, to identify the strings in D which do not programming formulation. need to compute actual substring edit distances, we present • Let Xσ=(x1, p1),...,(x|Xσ |, p|Xσ |) be the sequence of po- how to compute the lower bounds of substring edit dis- sitional q-grams from σ, each q-gram of which appears in tances by utilizing q-gram properties. For correctness and the string s, satisfying pi < pi+1 for i=1,...,|Xσ|−1. Sim- efficiency, we develop our techniques to guarantee no false ilarly, let Ys=(y1, r1),...,(y|Ys|, r|Ys|) be the sequence of dismissals and few false positives, respectively. To achieve positional q-grams from s, each q-gram of which occurs in these goals, we devise several filtering methods. σ, satisfying ri < ri+1 for i=1,. . . ,|Ys|−1. • For a positional q-gram pair (xi, pi), (yj , rj ) with (xi, pi) 3.1 A Lower Bound of dsub s ( , σ) ∈ Xσ,(yj , rj ) ∈ Ys and xi=yj , let m[i, j] represent the ′ We denote the lower bound of the edit distance d(s, σ) lower bound of all edit distances d(σ[1, pi+q−1],s ) where ′ between a string s and a query string σ by ℓo(d(s, σ)). The s is every substring of s ending at position rj +q−1. following lemma borrowed from [9] allows us to compute • For two positional q-gram pairs {(xu, pu), (xi, pi)} ⊆ Xσ ℓo(d(s, σ)). and {(yv, rv), (yj , rj )} ⊆ Ys satisfying xu=yv, xi=yj , pu

387 1 2 3 4 5 6 7 89101112 σ J c s v e σ J a c k s no v e σ J s no v e σ J c k s no v e

s J c ? ? i s s ? ? ? ?? ? ? ? ?? s ... ? c ? ? ? s ... J c? ? ? ??

(a) σ and s with four common q-grams (b) Transforming `Jacksonvill' to the substrings of s ending with `ill' lo(dsub(s,σ))= 4 (m[1,1]=0) + 3 = 3 (m[2,2]=0) + 3 = 3 (m[3,4]=2) + 2 = 4 (m[4,3]=2) + 1 = 3 σ J a c k s o n v i l l e σ J c k s o n v i l l e σ J s o n v i l l e σ J a c k v i l l e σ s no v l e

s ? ?? ? ?? ?? ?? ?? s J c ? ?? ?? ?? ?? s J ?? ?? ?? ?? s ... ? ??? s s ... c ? ? ?

(c) Computing the lower bound of dsub(s,σ) using m[i,j]s

Figure 3: Computing the lower bound of dsub(s, σ) with DYN-LB

Computing ℓo(dsub(s, σ)): Now we show how to compute • Modifying σ[1, 4]=‘Jack’ to a substring of s ending with ℓo(dsub(s, σ)) using m[i, j]. When every q-gram in σ is mod- ‘ack’ followed by changing ‘acksonvill’ to ‘ack Will’, which ified, the lower bound of dsub(s, σ) is ⌈(|σ|−q+1)/q⌉ due to requires at least 2 edit operations (=m[2, 2]+max( ⌈(9−2−1) Lemma 3.2. Otherwise, suppose that the last matching q- /3⌉, |10−8|)). gram pair is (xi, pi), (yj , rj ) with (xi, pi) ∈ Xσ,(yj , rj ) • Transforming σ[1, 3]=‘Jac’ to s[1, 3]=‘Jac’ and converting ∈ Ys and xi=yj . The lower bound of dsub(s, σ) is the sum ‘Jacksonvill’ to ‘Jack Will’, which needs at least 3 edit of m[i, j] and the minimum number of edit operations to operations (=m[1, 1]+max(⌈(9−1−1)/3⌉, |11 − 9|)). modify every remaining q-gram in σ[pi+1, |σ|] so that there Now, m[4, 3] becomes 2. Similarly, we obtain m[3, 4]=2. Fi- is no matching q-gram between σ[pi+1, |σ|] and s[rj +1, |s|], nally, to compute the lower bound of dsub(s, σ) by Equa- which is ⌈(|σ|−pi−q+1)/q⌉ according to Lemma 3.2. Since we do not know the actual last matching positional q-gram tion (3), we examine the lower bounds by the five transfor- mations in Figure 3(c), such that every matching q-gram in pair, we consider every (xi, pi), (yj , rj ) ∈ X × Y such that Xσ is modified or each q-gram in X is the right most q-gram xi=yj as the last matching pair to compute ℓo(dsub(s, σ)). which is not changed. Then, we obtain 3 as the lower bound Thus, the lower bound of dsub(s, σ) is of dsub(s, σ). min{⌈(|σ| − q + 1)/q⌉, min {m[i, j] + ⌈(|σ| − pi − q + 1)/q⌉}. (3) 3.2 A Lower Bound of Substring Edit Distances 1≤i≤|Xσ |,1≤j≤|Ys| between a Query and a Set of Strings Computing δ(pu, pi, rv, rj): Given two positional q-gram Previously, we could compute the lower bound of dsub(s, σ) pairs {(xu, pu), (xi, pi)} ⊆ Xσ and {(yv, rv), (yj , rj )} ⊆ Ys using the common q-grams between a query string σ and the satisfying xu=yv, xi=yj , pu

388 − Function TopK-LB (D, k, σ) Proof: Due to Lemma 3.4, for every string s in DG′ which begin ′ ′ does not include any q-gram in G , dsub(s, σ) ≥ ⌈|G |/q⌉ 1. HT opK = an empty max-heap storing dist, str; − ′ 2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1}; holds. Thus, we have dsub(DG′ , σ) ≥ ⌈|G |/q⌉. 3. for each string s in D do While the lower bound of obtained by Lemma 3.5 is ef- 4. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G}; fective, it does not consider the positions of the q-grams in 5. LB = DYN-LB(R, |σ|); ′ 6. if |HT opK | d then ′ ′ 10. HT opK .insert(d,s); q-gram in G does not overlap with another q-gram in G , 11. H .deleteMax(); − T opK we can obtain a tighter lower bound of ℓo(dsub(DG′ , σ)). 12. end for 13. return HT opK ; Lemma 3.6.: Let G be the set of all q-grams in a query end ′ string σ. If we select a subset G ⊆ G in which every q- Figure 4: The TopK-LB algorithm gram does not overlap with another q-gram in G′, we have − ′ dsub(DG′ , σ) ≥ |G |. − 4. TOP-K ALGORITHMS FOR APPROXI- Proof: For each string s in DG′ , suppose that we perform edit operations on σ to convert it to a substring of s. Since MATE SUBSTRING MATCHING every q-gram in G′ does not appear in s, every q-gram in G′ We first introduce the brute-force algorithm TopK-NAIVE should be modified by at least an edit operation. However, which blindly examines every string s in a set of strings D ′ every q-gram in G does not overlap with another q-gram in and computes substring edit distances dsub(s, σ) one by one. G′ and thus a single edit operation cannot affect multiple Since distance computations are very expensive, to reduce − q-grams. Thus, for every string s in DG′ , we need at least the number of substring edit distances to compute, we pro- ′ |G | edit operations to transform σ to any substring of s. pose the algorithm TopK-LB which computes dsub(s, σ) only when the lower bound ℓo(d (s, σ)) is smaller than the cur- Example sub 3.7.: Suppose we have a query string σ=‘Jackson’ rent k-th smallest substring edit distance. The lower bound and G′={‘Jac’,‘kso’} in which every 3-gram does not over- ′ − ℓo(dsub(s, σ)) is computed by our novel dynamic program- lap with another q-gram in G . The strings in DG′ are the ′ ming algorithm in TopK-LB. strings which do not have even one of the 3-grams in G . Since TopK-LB utilizes the algorithm based on dynamic Since each 3-gram in G′ does not overlap with another one, − programming to compute the lower bound ℓo(dsub(s, σ)) with to transform σ to every substring of each string in DG′ , we every string s in D, we next propose the algorithm TopK- need at least 2 edit operations (an edit operation on each of − SPLIT which can even skip the expensive computation of ‘Jac’ and ‘kso’) in σ. Note that we have dsub(D , σ) ≥ 2 G′ ℓo(dsub(s, σ)) for some strings s in D. TopK-SPLIT divides by Lemma 3.6. D into partitions for a given q-gram set G′ and utilizes our ′ new lower bound of dsub(s, σ). We also develop a novel dy- If we fix the size of G ⊆ G, we obtain the tightest lower ′ − ′ namic programming algorithm to find the best q-gram set G bound ℓo(dsub(DG′ , σ)) when G is selected without any pair of overlapping q-grams. to be used at the beginning of TopK-SPLIT. Furthermore, whenever the k-th substring edit distance becomes smaller, Corollary 3.8.: Consider a query string σ and let G be TopK-SPLIT adaptively adjusts G′ so that we can skip even the set of q-grams in σ. We obtain the tightest lower bound more strings for computing the lower bounds of dsub(s, σ). − ′ ℓo(dsub(DG′ , σ)) among every possible G ⊆ G when the size TopK-SPLIT still examines every string in D and thus its of G′ is ⌊|σ|/q⌋. running time increases proportionally to the size of D. To Proof: This is because we cannot select a set of q-grams speed up TopK-SPLIT further, we develop TopK-INDEX with a larger size than ⌊|σ|/q⌋ in which every q-gram does that does not need to check every string in D by utilizing not overlap with another q-gram in the set. the inverted q-gram indexes, whenever they are available. We next present the above algorithms for finding the top-k We now introduce the following lemma to use Lemma 3.6 approximate substring matches in more details. effectively for top-k approximate substring matching. TopK-NAIVE: The naive algorithm examines every string Lemma 3.9.: Consider a set of strings D, a query string s in D and computes the substring edit distance dsub(s, σ). σ, the q-gram set G of σ and a subset G′ ⊆ G. Assume that To find the top-k approximate substring matches in D, we + − ′ we partitioned D into DG′ and DG′ . While examining every maintain a max-heap HT opK storing the k strings s with + ′ the smallest dsub(s , σ)s which are used as the keys in the string in DG′ , if there are at least k strings whose substring − max-heap HT opK . If the size of HT opK is less than k, we just edit distances to σ are at most ℓo(dsub(DG′ , σ)), we can find the top-k approximate substring matches in D by examining insert the string s to HT opK . Otherwise, we check whether + dsub(s, σ) is smaller than dsub(sR, σ) of the string sR at the the remaining strings in DG′ only by ignoring the remaining strings in D− . root of HT opK (i.e., whether dsub(s, σ) is smaller than the G′ k-th smallest substring edit distance so far). If it is satisfied, + Proof: While examining every string in DG′ one by one, let we delete the string at the root sR of HT opK and insert the sk denote the string with the k-th smallest substring edit dis- string s to HT opK . If not, we move to the next string and − tance dsub(sk, σ) so far. Since dsub(sk, σ) ≤ ℓo(dsub(DG′ , σ)), repeat the above step until we encounter the last string in ′ − ′ D. We refer to the naive algorithm as TopK-NAIVE. for every string s ∈ DG′ , dsub(s , σ) cannot get smaller than dsub(sk, σ) and we can not have any string of the top-k ap- − TopK-LB: Similar to TopK-NAIVE, we scan all strings in proximate substring matches in DG′ . D. However, for each string s in D, we generate the list of

389 Function TopK-SPLIT (D, k, σ) common positional q-grams R = (g1, p1),... (gn, pn) be- begin tween σ and s in increasing order of positions where the po- 1. HT opK = an empty max-heap storing dist, str; sition pi of each (gi, pi) in R is the position in s at which the 2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1}; 3. τ = ⌊|σ|/q⌋; q-gram gi appears. Then, we can compute the lower bound 4. cond = false; ℓo(dsub(s, σ)) by invoking DYN-LB presented in Section 3.1. 5. for each string s in D do If ℓo(dsub(s, σ)) is not smaller than the k-th smallest sub- 6. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G}; ′ string edit distance so far, we do not need to calculate the 7. if cond=true and ∃(gi, pi) ∈ R, gi ∈ G then part = ‘+’; 8. else part = ‘−’; actual value of dsub(s, σ) since s cannot be a string of the 9. if cond=false or part = ‘+’ then top-k approximate substring matches. We call the algorithm 10. LB = DYN-LB(R, |σ|); TopK-LB and present its pseudocode in Figure 4. 11. if |HT opK |d then 15. HT opK .insert(d,s); and we are interested in the top-2 approximate matches. 16. HT opK .deleteMax(); The first two strings s1 and s2, whose substring edit dis- 17. if HT opK .getMax().dist≤τ then tances to σ are 1 and 4 respectively, are inserted into the 18. cond = true; 19. τ = HT opK .getMax().dist; max-heap HT opK . For the next string s3=‘Jason Polock’, 20. G′ = BEST-G′(τ, G); the lower bound ℓo(dsub(s3, σ)) is 2 by DYN-LB. Since the 21. if cond=true and HT opK .getMax().dist<τ then 22. τ = H .getMax().dist; second smallest substring edit distance in HT opK is 4 and T opK 23. G′ = BEST-G′(τ, G); ℓo(dsub(s3, σ)) is smaller than 4, we have to compute dsub(s3, σ), 24. end for which turns out to be 3, and insert (s3, 3) into HT opK . Now 25. return HT opK ; end we have {(s1, 1), (s3, 3)} in HT opK and the second smallest substring edit distance so far becomes 3. Figure 5: The TopK-SPLIT algorithm With s4=‘Jacksomville’, ℓo(dsub(s4, σ)) becomes 1. Since when |G′| is large, we have a smaller number of strings in D ℓo(dsub(s4, σ)) is smaller than the second smallest substring ′ to examine before dsub(sR, σ) ≤ |G | holds. However, once edit distance in HT opK , we have to compute dsub(s4, σ) which the inequality is satisfied, with a large |G′|, the number of is 2 and (s4, 2) is added to HT opK . Then, the second small- the unseen strings belonging to D+ also becomes generally est substring edit distance so far becomes 2. For s5=‘Jakson G′ large. Pollack’, ℓo(dsub(s5, σ)) is 2 and thus we can skip computing ′ To satisfy dsub(sR, σ) ≤|G | as soon as possible, we ini- expensive dsub(s5, σ). Finally, with s6=‘Mackson Polock’, tially set |G′| with ⌊|σ|/q⌋. Note that the lower bound of ℓo(dsub(s5, σ)) is 1 and we should compute dsub(s5, σ). In − ′ summary, we calculated the substring edit distances with 5 dsub(DG′ , σ) is |G | and cannot be larger than ⌊|σ|/q⌋ due to Corollary 3.8. Furthermore, note that we need G′ to de- strings out of 6 strings. + termine DG′ to prune unseen strings only when dsub(sR, σ) TopK-SPLIT: Let G be the set of all q-grams in σ and let ≤|G′| is satisfied, since the inequality holds in the same lo- ′ ′ G be a subset of G such that every q-gram in G does not cation in D regardless of G′ with the same size of G′. Thus, overlap with another q-gram in the set. We conceptually once dsub(sR, σ)≤⌊|σ|/q⌋ holds, while scanning strings in D, + − + ′ divide the strings in D into DG′ and DG′ where DG′ is the we compute G of size dsub(sR, σ) and next examine the un- set of strings in D each of which has at least a q-gram in G′ + ′ seen strings in DG′ only. For a given size ρ of G , we will − + ′ and DG′ is (D−DG′ ). While scanning each string s, we can discuss how to select such a subset G minimizing the num- check easily whether the string s belongs to D+ or D− by + G′ G′ ber of strings in DG′ in Section 5. Thus, for the time being, checking the q-grams in s. we will assume that such a G′ is provided by the function Similar to TopK-LB, when examining every string s in D BEST-G′(ρ). one by one, TopK-SPLIT generates the list of common posi- Let τ be the k-th substring edit distance when dsub(sR, σ) tional q-grams R and skips computing dsub(s, σ) if ℓo(dsub(s, σ)) ≤⌊|σ|/q⌋ is first satisfied. While we examine the strings in is at least the k-th smallest substring edit distance so far. D, the k-the smallest substring edit distance monotonically However, unlike TopK-LB, TopK-SPLIT can even skip com- decreases. When the k-the smallest substring edit distance ′′ puting ℓo(dsub(s, σ)) by using Lemma 3.9 for each string as is changed from τ to τ , we can still test the unseen strings follows. + ′′ ′′ ′ in DG′ only. However, since any G satisfying |G |<|G | Let sR denote the string whose substring edit distance ′′ ′′ − and dsub(sR, σ) = τ ≤ |G | allows us not to examine DG′′ to σ (i.e., dsub(sR, σ)) is the k-th smallest value among the and we have |D− |≤|D− | generally, we can select such a G′′ strings examined so far. Among the unseen strings, we can G′′ G′ and check the unseen strings in D+ only. We can perform skip distance computations for the strings whose substring G′′ the above step whenever the k-th substring edit distance edit distances to σ are at least dsub(sR, σ). Since we split + − decreases. D into DG′ and DG′ , after dsub(sR, σ) becomes at most − The pseudocode of TopK-SPLIT is provided in Figure 5. ℓo(dsub(D ′ , σ)), we can skip every unseen string s belong- − G The lower bound of dsub(D , σ) (i.e., τ) is set to ⌊|σ|/q⌋ − G′ ing to DG′ . In this case, we have initially which is the largest one according to Corollary 3.8. ′ − The variable cond has false initially representing that the dsub(sR, σ) ≤ |G | ≤ dsub(D ′ , σ) (4) G k-th smallest substring edit distance so far exceeds τ. Until − ′ since the lower bound of dsub(DG′ , σ) is |G | due to Lemma 3.6 we set cond as true, while scanning each string s in D, if ′ regardless of the q-grams in G . However, for the unseen LB = ℓo(dsub(s, σ)) computed by DYN-LB is smaller than + strings s ∈ DG′ , we still have to compute dsub(s, σ). the k-th smallest substring edit distance so far, we com- ′ Intuitively, the larger |G | is, the earlier the condition of pute the actual value of dsub(s, σ) similar to TopK-LB. How- ′ dsub(sR, σ) ≤ |G | in Inequality (4) can be satisfied. Thus, ever, once the k-th smallest substring edit distance (i.e.,

390 Function TopK-INDEX (D, I, k, σ) begin While scanning each string in D, our new algorithm TopK- 1. HT opK = an empty max-heap storing dist, str; INDEX also scans the posting lists of the q-grams in G to- 2. G = {σ[i,i+q−1]|1≤i≤|σ|−q+1}; gether. We assume that the strings in D are sorted with ′ 3. τ = |G |; increasing order of string ids and we are also able to retrieve 4. cond = false; 5. while D.end() = false do the strings in D with string ids as keys by random I/O ac- 6. sidD = D.getCurrentId(); cess. We also assume that the posting list of every q-gram 7. sidI = I.getFrontierId(G); in G is stored with the increasing order of string id first and 8. s = null; 9. if cond = false and sidI > sidD then its position next in the inverted q-gram index. By reading 10. s = D.getString(sidD); the posting lists of the q-grams in G only, we can obtain the 11. R = {(s[i,i+q−1], i) |∀i s.t.1≤i≤|s|−q+1, s[i,i+q−1]∈G}; ′ common q-gram list R for the strings which share at least 12. if ∃(gi, pi) ∈ R, gi ∈ G then part = ‘+’; 13. else part = ‘−’; a q-gram with σ. Whenever we need to compute the actual 14. else substring distance dsub(s, σ), we actually access the string s 15. (R, part) = I.getPosQgrams(sidI ); stored in D. Since the size of each string in D is generally 16. if cond = false or part = ‘+’ then 17. LB = DYN-LB(R, |σ|); much larger than that of the query string σ, using the post- 18. if |HT opK | d then dsub(sR, σ) ≤ ℓo(dsub(DG′ , σ))), instead of sequentially read- 23. HT opK .insert(d,s); ing the unseen strings in D, we access the remaining strings 24. HT opK .deleteMax(); s ∈ D+ only by random I/O access with utilizing the in- 25. if H .getMax().dist≤τ then G′ T opK − 26. cond = true; verted q-gram index to skip the remaining strings s ∈ DG′ . 27. τ = HT opK .getMax().dist; Similar to TopK-SPLIT, whenever dsub(sR, σ) decreases, we 28. G′ = BEST-G′(τ, G); ′ ′ select a smaller G with |G | = dsub(sR, σ) to reduce the size 29. if cond=true and HT opK .getMax().dist<τ then − 30. τ = HT opK .getMax().dist; of DG′ . 31. G′ = BEST-G′(τ, G); The pseudocode of TopK-INDEX is presented in Figure 6. 32. end while The difference from TopK-SPLIT is the lines 6–16 in which 33. return HT opK ; end the common positional q-gram list R is produced by reading the posting lists of q-grams in G sequentially in increasing Figure 6: The TopK-INDEX algorithm order of string ids. HT opK .getMax().dist) becomes at most τ, we set cond to Note that every posting list in the inverted index is sorted true and select G′ by calling BEST-G′ in lines 17–21 so by string ids and positions. To obtain the list R in increasing that we can skip the computation of dsub(s, σ) for the un- order of string ids, we initially set the frontier of each posting − seen strings s ∈ DG′ by Lemma 3.9. Furthermore, whenever list of a q-gram in G as the location of the first posting in the the k-th smallest substring edit distance decreases, we select posting list. Then, we read the smallest id among the post- ′ G with size of HT opK .getMax().dist again in lines 23–26. ings in all frontiers and move the frontier to the next posting in its posting list. Let I denote the given inverted index. The Example 4.2.: Consider the strings D in Figure 1. Sup- call of I.getFrontierId(G) returns the smallest string id sidI pose we want to find the top-2 approximate substring matches among the ids in all frontiers, and I.getPosQgrams(sidI ) the query string σ = ’Jacksen’ using 3-grams. The lower returns the common positional q-gram list R between the − string of sidI and the query string σ by reading all postings bound of dsub(DG′ , σ) (i.e., τ) τ is initially set to 2(=⌊7/3⌋). First, the strings s1 and s2 are inserted into HT opK whose (gi, pi) from the posting lists of the q-grams in G such that substring edit distances to σ are 1 and 4 respectively. With gi=sidI . The invocation of I.getPosQgrams also informs + − s3 and s4, we compute dsub(s3, σ)(=3) and dsub(s4, σ)(=2), whether the string belongs to DG′ or DG′ . and then the second smallest substring edit distance in HT opK If we just follow the posting lists of the q-grams in G only becomes 2. to generate the list R, there may exist a string s in D which We now have 2 strings whose substring edit distances are may not have any common q-gram with the query string σ at most τ. After selecting the q-grams set G′ ⊆ G by BEST- and we may miss all of such strings. Thus, while we move ′ G , we can skip computing dsub(s, σ) with the strings s ∈ a cursor on D together in the increasing order of string ids, − if the smallest frontier id sidI is larger than the id sidD of DG′ due to Lemma 3.9. Suppose that the selected 3-gram set ′ G is {‘Jac’, ‘kse’}. Then, for the strings s5 and s6, we can D which the current cursor indicates, we read the string of ignore them because they do not have any common 3-gram sidD from D, generate R from the actual string in D, and with G′. In summary, we could find the top-2 approximate then perform the computations of lower bound and actual substring matches by calculating the substring edit distances distance if necessary. with 4 strings only. After cond becomes true, we always go to line 15 to read the posting lists to obtain the common q-gram list R with- TopK-INDEX: When the k-th smallest substring edit dis- out reading strings from D. The reason is that, after we ′ find at least k strings whose substring edit distances are at tance so far becomes at most |G | in TopK-SPLIT, we can − − most ℓo(dsub(D ′ , σ)), if the query string σ and a string even skip computing ℓo(dsub(s, σ)) with the strings s ∈ D . G G′ s in D do not share at least a q-gram, the string s be- However, we still have to read unseen string s in D to de- − − longs to DG′ and thus we do not need to check such strings termine whether s belongs to DG′ by checking if s has at ′ in D by Lemma 3.9. Furthermore, if the string s of sidI least a q-gram in G . We will next improve the inefficiency ′ ′ + shares at least a common q-gram with G (i.e., part is ‘+ ), by retrieving the unseen strings belonging to DG′ only with utilizing inverted q-gram indexes when they are available. we compute ℓo(dsub(s, σ)) in line 17. If the lower bound

391 i=15 σ[i−q+1, i] (i.e., the q-gram ending at the position i) al- σ ways appears in Q and (3) all q-grams in Q does not over- lap to each other. • Let θ[i, t] denote the q-gram set Q in Q(i, t) such that |L(Q)| is the minimum among all q-gram sets in Q(i, t). • Let u[i, t] represent |L(θ[i, t])|.

Figure 7: Computing u[i, t] and θ[i, t] Given the size ρ of G′, we select Q with the minimum ′ ℓo(dsub(s, σ)) is smaller than the k-th smallest substring edit |L(Q)| as G among all subsets Q ⊆ G consisting of ρ non- distance so far, we retrieve the string s from D by random overlapping q-grams in σ. In every possible such Q, the I/O access to compute dsub(s, σ) as in line 20. ending position of the rightmost q-gram σ[i−q+1, i] can be located in the positions i of σ with ρq≤i≤|σ|. Thus, for Example 4.3.: Consider the strings D in Figure 1. Sup- every substring σ[1, i] with ρq≤i≤|σ|, we will enumerate the pose that the query string σ is ‘Jacksen’ and we want to find best ρ non-overlapping q-gram set Q which not only contains the top-2 approximate substring matches from D using 3- σ[i−q+1, i] but also has the minimum |L(Q)|. The reason grams. The 3-gram set G of σ is {‘Jac’, ‘ack’, ‘cks’, ‘kse’, why we do not consider every i which is smaller than (ρq) is − ‘sen’}. Initially, the lower bound τ of dsub(DG′ , σ) is set to because the substring σ[1, i] with i < ρq cannot have ρ non- 2 (=⌊7/3⌋). overlapping q-grams. Since we use θ[i, ρ] to store the best At the beginning, the minimum string id among the fron- Q with ρ non-overlapping q-grams which not only contains tiers of the posting lists of q-rams in G is s1. The current σ[i−q+1, i] but also has the minimum |L(Q)|, we can find cursor of D is also at s1. Since we can obtain the common G′ by selecting the θ[i, ρ] with the smallest |L(θ[i, ρ])| among q-gram list R of s1 from the posting lists, in each posting θ[i, ρ]s with ρq≤i≤|σ|. In other words, if we compute θ[i, t]s list, we read the frontier posting and move the frontier to for every 1 ≤ t ≤ ⌊|σ|/q⌋ and t q ≤ i ≤ |σ|, we can select the next. We also move the cursor of D to the next string the q-gram set θ[i∗, ρ] as G′ where i∗ is chosen as follows: without examining s1. Since the current HT opK is empty, i∗ = arg min {u[i, ρ]} = arg min {|L(θ[i, ρ])|} . (5) we compute dsub(s1, σ)=1 and insert (s1, 1) into HT opK . tq≤i≤|σ| tq≤i≤|σ| Now, the minimum string id among the frontiers is s4. However, since the current cursor of D indicates s2, we read We next present a dynamic programming algorithm to the s2 and compute dsub(s2, σ)=4. Similarly, we read s3 compute θ[i, t] for t = 1, ..., ⌊|σ|/q⌋ and i = t q, ..., |σ|. from D to calculate dsub(s3, σ)=3 and update HT opK . Now Dynamic programming formulation: Since θ[i, t] con- the second smallest substring edit distance so far becomes 3. tains t number of non-overlapping q-grams in the substring We next read the frontier postings to obtain the common σ[1, i] including σ[i−q+1, i], we consider θ[j, t−1]∪{σ[i−q+1, i]} 3-gram list R between s4 and σ which is {(‘Jac’,1),(‘ack’,2), with 1≤j≤(i − q) to compute θ[i, t]. However, θ[j, t−1] (‘cks’,3)}. The current cursor of D moves to the next string does not exist for j=1,...,(t−1)q − 1 because it is impossi- s5. After we compute dsub(s4, σ)=2, the second smallest sub- ble to have (t−1) non-overlapping q-grams in the substring string edit distance so far becomes 2. From now, since we σ[1, j]. Thus, we enumerate θ[j, t−1] ∪ {σ[i−q+1, i]} for already have 2 strings whose substring edit distances are at (t−1) q≤j≤(i−q) only and select θ[j, t−1] ∪ {σ[i−q+1, i]} ′ ′ most τ=2, with G of size 2 obtained by calling BEST-G , we with the smallest |L(θ[j, t−1]) ∪ Lσ[i−q+1,i]| as θ[i, t]. Our can find the top-2 approximate substring matches by consid- recursive definition for θ[i, t] is ′ ering the strings sharing at least a 3-gram in G only due to ∗ Lemma 3.9. Let us assume that BEST-G′ selects G′={‘Jac’, θ[i, t] = θ[j , t − 1] ∪ {σ[i − q + 1, i]} (6) ‘kse’}. Then, for the strings s5 and s6, we can ignore them where because they do not appear in any posting list of ‘Jac’ or ∗ ‘kse’. In summary, we can find the top-2 approximate sub- j = arg min Lσ[i−q+1,i] ∪ L(θ[j, t − 1]) . (7) (t−1)q≤j≤i−q string matches by calculating the actual substring edit dis-  tances with 4 strings in D only. In Figure 7, we show an example to illustrate how we com- pute θ[15, 4]. We enumerate |Lσ[13,15] ∪ θ[j, 3]| with 9≤j≤12 ′ and select {σ[13, 15]} ∪ θ[11, 15] as θ[15, 4] which minimizes 5. SELECTING THE BEST G OF SIZE ρ |L(θ[15, 4])|. In the previous section, we assumed that G′ ⊆ G of size If we compute θ[i, t] for every t from 1 to ⌊|σ|/q⌋ and ρ is provided by the function BEST-G’ for TopK-SPLIT every i from tq to |σ| with the above dynamic programming and TopK-INDEX. Now, we present the algorithm BEST- algorithm, we can choose G′ with a size ρ by Equation (5). G’ which finds G′ of size ρ such that the number of strings We refer to this algorithm which finds G′ with a given size ′ + ′ with at least a q-gram in G in D (i.e., the size of DG′ ) is ρ as BEST-G (ρ). the smallest. We use the following notations to describe our Note that BEST-G′(ρ) cannot find the optimal q-gram dynamic programming formulation to select such a G′. set G′ because the optimal substructure property of our dy- namic programming formulation is not satisfied. Suppose

• Let Lgi be the set of the ids of strings containing the q- {σ[i−q+1, i]} ∪ Q is set to θ[i, t] in Equation (6) where Q gram gi∈G. is the optimal set with (t − 1) non-overlapping q-grams for ∗ ′ • For a q-gram set Q ⊆ G, let L(Q) = g ∈Q Lgi , which is σ[1, j ]. Let Q be a non-optimal q-gram set with (t−1) non- i ∗ the union of Lgi s for all q-grams gi ∈ Q, and we will use overlapping q-grams in σ[1, j ] including the q-gram ending |L(Q)| to represent the number of elementsS in L(Q). at the position j∗. Even though Q′ is not an optimal q- ∗ ′ • Let Q(i, t) be the set of all possible subsets Q of q-grams gram set for σ[1, j ], if Lσ[i−q+1,i] and L(Q ) share many ′ appearing in σ[1, i] such that (1) |Q|=t, (2) the q-gram common string ids so that |Lσ[i−q+1,i] ∪ L(Q )| becomes

392 g2 t=1 t=2 g1 - Jac ack cks i θ[i,1] u[i,1] θ[i,2] u[i,2] • TopK-SPLIT: This is the algorithm which improves TopK- Jac 100 - - - 3 {Jac} 100 - - LB further by skipping even the computation of ℓo(dsub(s, σ)) ack 32 - - - 4 {ack} 32 - - cks 16 - - - 5 {cks} 16 - - using the lower bound of substring edit distance between kso 10 110 - - 6 {kso} 10 {Jac,kso} 110 a query and a set of strings in Lemma 3.9. son 120 130 125 - 7 {son} 120 {ack,son} 125 onv 40 120 43 50 8 {onv} 40 {ack,onv} 43 • TopK-INDEX: This is the implementation of TopK-

(a) |Lg1 U Lg2| (=|L({g1,g2})|) (b) θ[i,t] and u[i,t] INDEX that utilizes the inverted q-gram indexes available to speed up TopK-SPLIT. Figure 8: Computations in BEST-G′(ρ) • TopK-NGPP: This is the modified version of NGPP in ′ smaller than |Lσ[i−q+1,i] ∪ L(Q)|, we have |L(Q )| > |L(Q)| [27] to obtain the top-k approximate substring matches. and thus the computed θ[i, t] is not an optimal q-gram set The algorithm NGPP is the state-of-the-art algorithm to for σ[1, i]. Thus, BEST-G′(ρ) finds an approximate q-gram find all substrings of each string in D whose edit distances set for G′. However, our performance study confirms that to a query string are at most a given maximum thresh- BEST-G′(ρ) obtains a good G′ close to the optimal sets. old τ. To adapt NGPP to top-k approximate substring When BEST-G′(ρ) is invoked in TopK-SPLIT or TopK- matching, we first read the first k strings in D and set the INDEX, we need to know the actual size of L(θ[i, t]) which k-th smallest one among their substring edit distances as is the union of the sets of string ids whose strings contain a the initial threshold τ. Then, as examining every string q-gram in θ[i, t]. We will use the MinHash technique [5, 7] in D, if the k-th smallest substring edit distance becomes to estimate the sizes of unions of sets in BEST-G′(ρ). smaller than τ, we update the threshold τ to the k-th Assume that for every q-gram gi appearing in D, there is smallest substring edit distance. a MinHash signature of the set of string ids in which the q- • TopK-FSS: Similar to TopK-NGPP, we also extended ′ gram gi occurs. In BEST-G (ρ), we maintain the MinHash FSS in [4] to compute the top-k approximate substring signature of L(θ[i, t]) to compute Lσ[i−q+1,i] ∪ L(θ[j, t − 1]) matches in the same manner. in Equation (7) with constant time. Note that we used our own buffer management for reading ′ Time complexity: In BEST-G (ρ), we have to compute the data strings in disk and accessing inverted indexes in all θ[i, t] and u[i, t] for every 1≤t≤ ⌊|σ|/q⌋ and every tq≤i≤|σ|. of our implementations without utilizing OS buffers to see To compute each θ[i, t], it takes O(|σ|) time. For each u[i, t], the buffering effects for the tested algorithms more carefully. it takes constant time since u[i, t] is simply |L(θ[i, t])|. Thus, We report the native execution times (i.e., wall clock times) ′ 3 the time complexity of BEST-G (ρ) is O(|σ| /q). of the tested algorithms in this section. Example 5.1.: Consider the query string σ=‘Jacksonv’ 6.2 Data Sets with |σ| = 8. Suppose that we should select the non-overlapping 3-gram set G′ with size 2. For all possible non-overlapping To study the performance our proposed algorithms, we 3-gram sets Q whose sizes are 1 or 2, |L(Q)|s are shown in utilize the following two real-life data sets. Figure 8(a). We show all θ[i, t]s and u[i, t]s computed by DBLP: For a short string data set, we used DBLP titles col- ′ BEST-G (ρ) in Figure 8(b). lected from http://dblp.uni-trier.de/xml/. However, since Let us assume that we want to compute θ[8, 2] that always the original data is small, we increased its size by duplicating includes the 3-gram σ[6, 8]=‘onv’. We enumerate the fol- the original data 5 times. While we duplicate each string, we lowing three cases of |Lonv ∪ Lcks|(=50), |Lonv ∪ Lack|(=43) randomly performed an edit operation such as insert, delete and |Lonv ∪ LJac|(=120) by Equation (7), and select θ[8, 2] and substitute on each position of the string with the prob- = {‘ack’,‘onv’} with u[8, 2] = 43. Finally, to select the best ability of 0.1. The size of generated data is 635 MB with G′ with size 2 by Equation (5), we choose θ[8, 2] = {‘ack’, 13,966,030 strings whose average size is 48 bytes. ‘onv’} for G′ since u[8, 2] is the minimum among u[6, 2], Wikipedia: This is the data set consisting of 106,185 web u[7, 2] and u[8, 2]. pages obtained from http://en.wikipedia.org/wiki/Wikipedia: 6. EXPERIMENTS Database_download for a long string data set. The data size We empirically compared the performance of our proposed is 1.1 GB and the average size is 11,027 bytes. algorithms. All experiments reported in this section were 6.3 Queries Used performed on the machines with Intel Core2 Duo 2.66GHz of processor and 2GB of main memory running Linux operating For DBLP, we randomly selected between 1 and 4 adjacent systems. All algorithms were implemented using C++ and words appearing in DBLP titles to generate test queries. compiled with GCC Compiler of version 4.1.3. The range of query lengths is from 6 to 25 with average length 13.2. For Wikipedia, we sampled 50 entities from 6.1 Implemented Algorithms CoNLL data, which is a name entity data collected for Name We implemented the following algorithms for our perfor- Entity Recognition available for download at http://www. mance study. cnts.ua.ac.be/conll2003/ner/, as test queries. The lengths of selected queries are from 5 to 27 with average length 12.3. • TopK-NAIVE: This represents the brute-force algorithm Similarly, for our experiments with varying lengths of query which computes the substring edit distance with every strings, we also selected the query strings of lengths from 5 string in D to find the top-k approximate substring matches. to 25 with DBLP titles and CoNLL data respectively. For • TopK-LB: It is the implementation of TopK-LB which each query length, 50 query strings were sampled. computes dsub(s, σ) only for the strings s in D whose lower bound ℓo(dsub(s, σ)), obtained by calling DYN-LB, 6.4 Performance Results is smaller than the k-th smallest substring edit distance We measured the performance of the algorithms for both found so far. DBLP and Wikipedia with varying k, the number of strings

393 TopK-FSS TopK-LB(F1) TopK-NAIVE,TopK-LB,TopK-SPLIT 100000 TopK-NGPP TopK-SPLIT(F1) TopK-INDEX TopK-NAIVE TopK-SPLIT(F2) 10000 TopK-LB TopK-SPLIT(F1+F2) 10M TopK-SPLIT 1000 TopK-INDEX 100 1M 80 100 100K 60 10 10K 40 Execution time (sec) 1 1K 20 Number of strings checked Number of filtered strings (%) 0.1 0 1 3 5 7 10 15 20 1 3 5 7 10 15 20 1 3 5 7 10 15 20 k k k (a) Execution times (DBLP) (b) Filtering effects (DBLP) (c) Checked strings (DBLP)

TopK-FSS TopK-LB (F1) TopK-NAIVE,TopK-LB,TopK-SPLIT TopK-NGPP TopK-SPLIT(F1) TopK-INDEX 10000 TopK-NAIVE TopK-SPLIT(F2) TopK-LB TopK-SPLIT(F1+F2) TopK-SPLIT 100K TopK-INDEX 100 1000 80 60 100 40 Execution time (sec) 10K

20 Number of strings checked Number of filtered strings (%) 10 0 1 3 5 7 10 15 20 1 3 5 7 10 15 20 1 3 5 7 10 15 20 k k k (d) Execution times (Wikipedia) (e) Filtering effects (Wikipedia) (f) Checked strings (Wikipedia)

Figure 9: Varying k using DBLP and Wikipedia n and the length of query strings L. Furthermore, we varied filtering effect of TopK-INDEX is exactly the same with that the length of q-grams q, MinHash signature size ℓ and buffer of TopK-SPLIT, we report the result of TopK-SPLIT only. size B used. The default parameters are: k=5, q=3, ℓ=50 With every range of k, TopK-SPLIT(F1+F2) is always and B=512MB. larger than TopK-LB(F1) in both data sets. For k=20 with Varying k: We first report the execution times, the number Wikipedia, Figure 9(e) shows that TopK-SPLIT skips com- of filtered strings and the percentage of strings read from D puting dsub(s, σ) with 17% more strings than TopK-LB does. with varying k from 1 to 20 in Figure 9(a)–(f). Considering that TopK-SPLIT was 2.2 times faster than TopK-LB in Figure 9(d), we can conclude that using the (1) Execution times: We show the execution times for DBLP lower bound by Lemma 3.9 improves the speed of query and Wikipedia in Figure 9(a) and Figure 9(d) respectively. processing more than using the lower bound by DYN-LB The log scale was used on the y-axises. In both data, TopK- does because the lower bound by Lemma 3.9 requires only NGPP and TopK-FSS show even worse performance than to check whether each string contains at least a common q- our naive algorithm TopK-NAIVE. The reason is that both gram with the query string in a constant time. With increas- of TopK-NGPP and TopK-FSS have to enumerate all sub- ing k, both TopK-LB(F1) and TopK-SPLIT(F1) decrease strings of every string in D to compute the substring edit slowly since the k-th smallest distance so far also becomes distances. As expected from the experiments in [27], TopK- larger together with k. However, since the k-th substring NGPP performs better than TopK-FSS. edit distance found so far decreases slower as k grows, the As k is increased, the execution times for TopK-LB, TopK- chances that TopK-SPLIT skips the strings using the lower SPLIT and TopK-INDEX grow gradually. Since the k-th bound by DYN-LB increase on the contrary and thus TopK- smallest substring edit distance so far becomes larger with SPLIT(F2) grows gradually. growing k, less number of strings are skipped for computing substring edit distances. (3) Number of strings read from D: We also plotted the For every range of k, TopK-INDEX shows the best per- number of data strings examined in disk with random I/O formance. TopK-INDEX is faster than TopK-NAIVE by at access by TopK-INDEX for DBLP and Wikipedia in Fig- least 49.4 times and 5.5 times for DBLP and Wikipedia re- ure 9(c) and Figure 9(f) respectively. The y-axises are in a spectively. Furthermore, TopK-SPLIT is also faster than log scale. The graph confirms that TopK-INDEX effectively TopK-NAIVE by at least 11.6 and 3.5 times for DBLP and reduces the cost for reading the data strings in disk. Wikipedia respectively. Varying n: With each of DBLP and Wikipedia, we selected (2) Filtering effects: To show the effectiveness of our filter- the strings from the original data set to produce smaller data ing methods, we plotted the percentage of strings skipped with varying the sampling rate from 6.25% to 100%. With for computing their substring edit distances in Figure 9(b) varying the size of data, we plotted the execution times in and Figure 9(e) with DBLP and Wikipedia respectively. Figure 10(a) for DBLP and in Figure 10(b) for Wikipedia. Note that TopK-LB(F1) in the graphs represents the per- The graphs show that TopK-INDEX is the fastest in every centage of strings filtered with TopK-LB using the lower range of data sizes. Both TopK-NGPP and TopK-FSS were even slower than TopK-NAIVE and their performance de- bounds ℓo(dsub(s, σ)) obtained by DYN-LB. We also use TopK-SPLIT(F1) similarly in the graphs. TopK-SPLIT(F2) grades linearly as data size grows because they have to exam- in the graphs represents the percentage of strings filtered by ine every substring exhaustively. As the data size increases, TopK-SPLIT using the lower bound of substring edit dis- the relative speedup of TopK-INDEX to TopK-NAIVE im- tances between a query string and a set of strings presented proves from 4.6 times to 8.6 times for Wikipedia. With in Lemma 3.9. Furthermore, TopK-SPLIT(F1+F2) is the DBLP, the speedup increases from 9.3 times to 24.5 times. total ratio of strings filtered with TopK-SPLIT. Since the Thus, we conclude that the performance of TopK-INDEX

394 600000 TopK-FSS TopK-FSS TopK-NAIVE TopK-NAIVE 10000 TopK-NGPP TopK-NGPP TopK-LB TopK-LB TopK-NAIVE 10000 TopK-NAIVE 500000 TopK-SPLIT TopK-SPLIT TopK-INDEX TopK-LB TopK-LB 1000 TopK-INDEX TopK-SPLIT TopK-SPLIT 1000 400000 TopK-INDEX 1000 TopK-INDEX 300000 100 100 100 200000 Execution time (sec) Execution time (sec) 10 Execution time (sec) Execution time (sec) 100000 10 1 0 10 6.25 12.5 25 50 100 6.2512.5 25 50 100 2 3 4 5 64M 128M 256M 512M Data set size (%) Data set size (%) The length of q-grams Buffer size (a) Varying n (DBLP) (b) Varying n (Wikipedia) (c) Varying q (d) Varying B Figure 10: Varying n, q and B

1e+06 TopK-FSS TopK-LB (F1) 200000 TopK-NGPP TopK-SPLIT(F1) TopK-SPLIT 100000 TopK-NAIVE TopK-SPLIT(F2) TopK-LB TopK-SPLIT(F1+F2) TopK-INDEX 10000 TopK-SPLIT 150000 TopK-INDEX 100 1000 80 100 100000 60 10 40 Execution time (sec) 1 20 50000 Number of filtered strings (%) Execution time (sec) 0.1 0 5 10 15 20 25 5 10 15 20 25 0 Length of query strings Length of query strings 10 30 50 70 100 (a) Execution times (Wikipedia) (b) Filtering effects (Wikipedia) MinHash signature size Figure 11: Experiments with varying L Figure 12: Varying ℓ (Wikipedia) does not decline linearly with increasing data size and TopK- skipped for computing dsub(s, σ). Furthermore, since the − INDEX scales well to large data. maximum of ℓo(dsub(DG′ , σ)), which is ⌊|σ|/q⌋, due to Corol- Varying L: With varying query lengths L from 5 to 25, we lary 3.8 also becomes smaller with a larger q, the critical point string where we meet the k-th string whose substring plotted the execution times for Wikipedia in Figure 11(a). − The log scale was used on the y-axises. With every range of edit distance is at most ℓo(dsub(DG′ , σ)) appears later. How- L, TopK-INDEX shows the best performance and TopK-FSS ever, TopK-INDEX shows the best performance when q=3 is the worst. When the lengths of query strings are 5 and 25, because the posting lists become very large with 2-grams. TopK-INDEX is 540 times and 8.44 times faster than TopK- Varying B: With varying the size of buffer B from 64 MB NAIVE respectively. As L is increased, the execution times to 512 MB, we show the execution times in Figure 10(d). for all algorithms grow gradually. This is because there is a With increasing B, the speed of TopK-INDEX is improved less chance that the strings in D have similar substrings to gradually since we have more chance to hit the cached pages a longer query string and thus the k-th smallest substring of inverted indexes with larger buffer sizes. edit distance so far becomes larger with growing L. We also plotted the percentage of strings skipped for com- 7. RELATED WORK puting their substring edit distances dsub(s, σ) in Figure 11(b). We present the previous work on approximate string match- With every range of L, TopK-SPLIT skips more strings for ing first and then approximate substring matching. computing dsub(s, σ) than TopK-LB does. Furthermore, Approximate string matching: For approximate string while TopK-LB computes ℓo(dsub(s, σ)) based on dynamic programming algorithm DYN-LB for every string in D, TopK- matching with a maximum distance threshold τ, the count filtering technique proposed in [9] utilizes the fact that if SPLIT skips even computing ℓo(dsub(s, σ)) for many strings. With increasing L, the gaps between the ratios of strings fil- the edit distance between s and σ is at most τ, they must tered by TopK-SPLIT and TopK-LB become larger because share at least (max(|s|, |σ|)−q+1−τ q) q-grams. In [16], an- the query strings have more chance to have very selective q- other algorithm is proposed to speed up approximate string grams resulting that more strings are filtered using the lower matching by reducing the average size of inverted lists using bound by Lemma 3.9. variable length q-grams. However, this algorithm utilizes the length filtering or prefix filtering which cannot be used Varying ℓ: We varied the MinHash signature size ℓ from for approximate substring matching. In [1] and [22], the 10 to 100 and plotted the running times of TopK-SPLIT algorithms to find similar strings probabilistically were pro- and TopK-INDEX in Figure 12(a). Since the other algo- ′ posed, but we focused on the problem of finding the exact rithms do not estimate the computational cost to select G , top-k approximate substring matches in this paper. they are not affected by the MinHash signature size. The In [3] and [15], the algorithms using inverted q-gram in- graph shows that the performances of both algorithms are dexes are presented. However, as mentioned in [27], to uti- the worst when ℓ=10 and do not degrade that much with lize the inverted q-gram indexes, the length of the q-grams ℓ ≤ 50. Thus, we used ℓ=50 as the default value in all our to be used should be smaller than (|σ|+1)/(τ+1). Thus, experiments. depending on the value of τ and the q-grams used for ac- Varying q: Figure 10(c) shows the graph of execution time cessing existing inverted indexes, it is not always possible to as the length q of the q-grams used is varied from 3 to use existing inverted q-gram indexes for approximate string 5. Since TopK-NAIVE is not affected by the length of q- matching [27]. In [26] and [28], the top-k approximate string grams, its execution times are always constant. TopK-LB matching algorithms using inverted q-gram indexes are pre- and TopK-SPLIT show the best performances when q=2. sented. However, as pointed out in [27], we cannot deter- This is because, with a large q, the lower bound ⌈(|σ|−q+1 mine the q-gram size in advance to build indexes and thus −n)/q⌉ in Lemma 3.2 decreases and thus, less strings are these algorithms may not find the top-k approximate string

395 matches correctly when using preexisting inverted q-gram search, Second edition. Pearson Education Ltd., Harlow, indexes. One of our proposed algorithms also utilizes exist- England, 2011. ing inverted indexes, but we still guarantee the correctness. [3] A. Behm, S. Ji, C. Li, and J. Lu. Space-constrained The algorithms in [6, 29] utilize the suffix which index gram-based indexing for efficient approximate string search. In ICDE, 2009. a small portion of suffixes only in the data. However, to [4] T. Bocek, E. Hunt, and B. Stiller. Fast similarity search in use suffix tries for approximate substring matching, we have large dictionaries. In Technical Report, 2007. to index every suffix appearing in the data, resulting large [5] A. Z. Broder. On the resemblance and containment of index sizes. Because of the high space requirements and documents. In Proceedings of Compression and Complexity poor locality of suffix tries, it is mentioned in [21] that the of SEQUENCES, 1997. suffix based approaches are proper only when both of [6] S. Chaudhuri and R. Kaushik. Extending autocompletion the data and index fit in main memory. Furthermore, since to tolerate errors. In SIGMOD, 2009. the approximate string matching algorithm with a maximum [7] Z. Chen, F. Korn, N. Koudas, and S. Muithukrishnan. threshold τ using suffix tries in [8] has exponential space and Selectivity estimation for boolean queries. In PODS, 2000. time complexity to τ, it is hard to be extended for top-k [8] R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don’t cares. In approximate substring matching. STOC, 2004. Approximate substring matching: Approximate sub- [9] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, string matching has been studied in the context of approxi- S. Muthukrishnan, and D. Srivastava. Approximate string mate entity extraction in [4, 17, 27]. With a given threshold joins in a database (almost) for free. In VLDB, 2001. and a entity dictionary, these algorithms are actually join [10] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, 2005. algorithms which find every substring in a database such [11] N. Koudas, A. Marathe, and D. Srivastava. Flexible string that the edit distance between the substring and one of the matching against large databases in practice. In VLDB, strings in the entity dictionary is at most the threshold. The 2004. algorithms in [4, 27] can be extended for top-k approximate [12] H. Lee, R. T. Ng, and K. Shim. Extending q-grams to substring matching and thus we compared our proposed al- estimate selectivity of string matching with low edit gorithms to our extended algorithms of [4, 27]. The algo- distance. In VLDB, 2007. rithm in [17] utilizes inverted q-gram indexes but to use [13] H. Lee, R. T. Ng, and K. Shim. Approximate substring EDBT inverted indexes for top-k approximate substring matching, selectivity estimation. In , 2009. [14] V. I. Levenshtein. Binary codes capable of correcting we have to ensure that the length of the q-grams used must deletions, insertions and reversals. Soviet Phys. Dokl., 1985. be smaller than (|σ| + 1)/(τk + 1) where τk is the k-th small- [15] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering est substring edit distance in D. Since we cannot know τk algorithms for approximate string searches. In ICDE, 2008. in advance to build indexes, we could not adapt the algo- [16] C. Li, B. Wang, and X. Yang. VGRAM: Improving rithm in [17] to our top-k approximate substring matching performance of approximate queries on string collections problem. using variable-length grams. In VLDB, 2007. To the best of our knowledge, no previous work addresses [17] G. Li, D. Deng, and J. Feng. Faerie: efficient filtering the top-k approximate substring matching problem and our algorithms for approximate dictionary-based entity extraction. In SIGMOD, 2011. algorithms presented here are the first work for the problem. [18] C.-C. Liu, J.-L. Hsu, and A. L. P. Chen. An approximate 8. CONCLUSION string matching algorithm for content-based music data retrieval. In ICMCS, 1999. In this paper, we studied the problem of top-k approxi- [19] A. Mazeika, M. H. B¨ohlen, N. Koudas, and D. Srivastava. mate substring matching. We first proposed efficient filter- Estimating the selectivity of approximate string queries. ing methods using q-grams which may enable us to skip ACM Trans. Database Syst., 32(2), 2007. strings without computing the actual substring edit dis- [20] G. Navarro. A guided tour to approximate string matching. tances to the query string. We next presented two algo- ACM Comput. Surv., 33(1):31–88, 2001. rithms TopK-LB and TopK-SPLIT which efficiently find top- [21] G. Navarro, E. Sutinen, and J. Tarhio. Indexing text with approximate q-grams. J. Discrete Algorithms, 2005. k approximate substring matches by utilizing the filtering [22] R. Ostrovsky and Y. Rabani. Low distortion embeddings techniques. Furthermore, we developed an improved algo- for edit distance. J. ACM, 54(5), 2007. rithm TopK-INDEX which utilizes inverted q-gram indexes [23] M. Sahami and T. D. Heilman. A web-based kernel to speed up TopK-SPLIT. By experiment results, we show function for measuring the similarity of short text snippets. the effectiveness and scalability of our algorithms with real- In WWW, 2006. life data sets. [24] P. H. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms, 1(4), 1980. Acknowledgment [25] E. Ukkonen. Finding approximate patterns in strings. J. This work was supported by the National Research Foun- Algorithms, 6(1):132–137, 1985. dation of Korea(NRF) grant funded by the Korea govern- [26] R. Vernica and C. Li. Efficient top-k algorithms for fuzzy ment(MEST) (No. NRF-2009-0078828). It was also sup- search in string collections. In KEYS, 2009. ported by Next-Generation Information Computing Devel- [27] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient opment Program through the National Research Foundation approximate entity extraction with edit distance of Korea(NRF) funded by the Ministry of Education, Sci- constraints. In SIGMOD, 2009. ence and Technology (No. NRF-2012M3C4A7033342). [28] Z. Yang, J. Yu, and M. Kitsuregawa. Fast algorithms for top-k approximate string matching. In AAAI, 2010. 9. REFERENCES [29] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and [1] A. Andoni and K. Onak. Approximating edit distance in D. Srivastava. Bed-: an all-purpose index structure for near-linear time. In STOC, 2009. string similarity search based on edit distance. In [2] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern SIGMOD, 2010. Information Retrieval - the concepts and technology behind

396