Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints

Jiannan Wang, Jianhua Feng, Guoliang Li
Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 10084, China
[email protected]; [email protected]; [email protected]

ABSTRACT

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) they are inefficient for data sets with short strings (the average string length is no larger than 30); (2) they involve large indexes; (3) they are expensive to support dynamic update of data sets. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find the similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.

1. INTRODUCTION

The similarity join is an essential operation in many applications, such as data integration and cleaning, near-duplicate object detection and elimination, and collaborative filtering. Recently it has attracted significant attention in both the academic and industrial communities. For example, SSJoin [4], proposed by Microsoft, has been used in the data debugger project. A similarity join between two sets of objects finds all similar object pairs from each set. For example, given two sets of strings R = {kobe, ebay, ...} and S = {bag, koby, ...}, we want to find all similar pairs ⟨r, s⟩ ∈ R × S, such as ⟨kobe, koby⟩.

Many similarity functions have been proposed to quantify the similarity between two objects, such as Jaccard similarity, cosine similarity, and edit distance. In this paper, we study string similarity joins with edit-distance constraints, which, given two sets of strings, find all similar string pairs from each set, such that the edit distance between each string pair is within a given threshold. The string similarity join has many real applications, such as finding near-duplicate queries in query log mining and correlating two sets of data (e.g., person names, place names, addresses).

Existing studies, such as Part-Enum [1], All-Pairs-Ed [2], and Ed-Join [18], usually employ a filter-and-refine framework. In the filter step, they generate signatures for each string and use the signatures to generate candidate pairs. In the refine step, they verify the candidate pairs and output the final results. However, these approaches have the following disadvantages. Firstly, they are inefficient for data sets with short strings (the average string length is no larger than 30), since they cannot select high-quality signatures for short strings and thus may generate a large number of candidate pairs which need to be further verified. Secondly, they cannot support dynamic update of data sets. For example, Ed-Join and All-Pairs-Ed need to select signatures with higher weights. A dynamic update may change the weights of signatures, so the two methods need to reselect signatures, rebuild indexes, and rerun their algorithms from scratch. Thirdly, they involve large index sizes, as there could be large numbers of signatures.

To address the above-mentioned problems, in this paper we propose a new trie-based framework for efficient string similarity joins with edit-distance constraints. In comparison with the filter-and-refine framework, our approach can efficiently generate all similar string pairs without the refine step. We use a trie structure to index strings, which needs much smaller space than existing methods, as the trie structure can share many common prefixes of strings. To avoid repeated computation, we propose subtrie pruning and dual subtrie pruning to improve performance. We devise efficient trie-join-based algorithms and three pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets.

To summarize, in this paper we make the following contributions: (1) We propose a trie-based framework for efficient string similarity joins with edit-distance constraints. (2) We devise efficient trie-join-based algorithms and develop pruning techniques to achieve high performance. (3) We extend our method to support dynamic update of data sets efficiently. (4) Experimental results show that our method achieves high performance and outperforms existing algorithms by an order of magnitude on data sets with short strings (the average string length is no larger than 30).

The rest of this paper is organized as follows. Section 2 introduces the trie-based framework. We propose efficient trie-join algorithms in Section 3 and develop three pruning techniques in Section 4. Section 5 presents an incremental algorithm for string similarity joins and Section 6 discusses how to extend our algorithm to two different string sets. Experimental results are provided in Section 7. We review related work in Section 8 and conclude in Section 9.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Technical Report of Department of Computer Science and Technology, Tsinghua University, Beijing, China.

2. TRIE-BASED FRAMEWORK

In this section, we first formalize the problem of string similarity joins with edit-distance constraints and then introduce a trie-based framework for efficient similarity joins.

2.1 Problem Formulation

Given two sets of strings, a similarity join finds all similar string pairs from the two sets. In this paper, we use edit distance to quantify the similarity between two strings. Formally, the edit distance between two strings r and s, denoted as ed(r, s), is the minimum number of single-character edit operations (i.e., insertion, deletion, and substitution) needed to transform r to s. For example, ed(koby, ebay) = 3. In this paper, two strings are similar if their edit distance is no larger than a given edit-distance threshold τ. We formalize the problem of string similarity joins as follows.

Definition 1 (String Similarity Joins). Given two sets of strings R and S, and an edit-distance threshold τ, a similarity join finds all similar string pairs ⟨r, s⟩ ∈ R × S such that ed(r, s) ≤ τ.

2.2 Prefix Pruning

One naïve solution to this problem is all-pair verification, which enumerates all string pairs ⟨r, s⟩ ∈ R × S and computes their edit distances. However, this solution is rather expensive. In fact, in most cases, to check whether two strings are similar, we need not compute the edit distance between the two complete strings. Instead, we can terminate the dynamic-programming computation early, as follows [15].

Given two strings r = r_1 r_2 ... r_n and s = s_1 s_2 ... s_m, let D denote a matrix with n + 1 rows and m + 1 columns, and let D(i, j) be the edit distance between the prefix r_1 r_2 ... r_i and the prefix s_1 s_2 ... s_j.

          j    0    1    2    3    4
     i              e    b    a    y
     0         0    1    2    3    4
     1    k    1    1    2    3    4
     2    o    2    2    2    3    4
     3    b    3    3    2    3    4
     4    y    4    4    3    3    3

Figure 1: Prefix pruning. Matrix for computing edit distance of two strings "ebay" and "koby". Shaded cells denote active entries for τ = 1.

2.3 Our Observations

Observation 1 - Subtrie Pruning: As there are a large number of strings in the two sets and many strings share prefixes, we can extend prefix pruning to prune a group of strings. We use a trie structure to index all strings. A trie is a tree structure where each path from the root to a leaf represents a string in the data set and every node on the path is labeled with a character of the string. For instance, Figure 2 shows a trie structure of a sample data set with six strings. String "ebay" has a trie node ID of 12 and its prefix "eb" has a trie node ID of 10. For simplicity, a node is mentioned interchangeably with its corresponding string in later text. For example, both node "ko" and string "ko" refer to node 14, and node 14 also refers to string "ko". Given a trie node n, let |n| denote its depth (the depth of the root node is 0). For example, |"ko"| = 2.

A sample data set:

     SID   String
     s1    bag
     s2    ebay
     s3    bay
     s4    kobe
     s5    koby
     s6    beagy

Trie (node ID, then character label; indentation shows depth):

     0 (root)
       1 b
         2 a
           3 g
           4 y
         5 e
           6 a
             7 g
               8 y
       9 e
         10 b
           11 a
             12 y
       13 k
         14 o
           15 b
             16 e
             17 y

Figure 2: Trie index of a sample data set
