LS-Join: Local Similarity Join on String Collections
Jiaying Wang, Xiaochun Yang, Member, IEEE, Bin Wang, and Chengfei Liu, Member, IEEE
Abstract—String similarity join, as an essential operation in applications including data integration and data cleaning, has attracted significant attention in the research community. Previous studies focus on global similarity join. In this paper, we study local similarity join with edit distance constraints, which finds string pairs from two string collections that have similar substrings. We study two kinds of local similarity join problems: checking local similar pairs and locating local similar pairs. We first consider the case where if two strings are locally similar to each other, they must share a common gram of a certain length. We show how to do efficient local similarity verification based on a matching gram pair. We propose two pruning techniques and an incremental method to further improve the efficiency of finding matching gram pairs. Then we devise a method to locate the longest similar substring pair for two local similar strings. We conducted a comprehensive experimental study to evaluate the efficiency of these techniques.
Index Terms—Local Similarity Join, Edit Distance, Similar Substrings, Filtering.
• Jiaying Wang, Xiaochun Yang and Bin Wang are with School of Computer Science and Engineering, Northeastern University, Liaoning 110819, China. Xiaochun Yang is the corresponding author. E-mail: [email protected];{yangxc,bwang}@mail.neu.edu.cn.
• Chengfei Liu is with Department of Computer Science and Software Engineering, Swinburne University of Technology, Australia. E-mail: [email protected].

1 INTRODUCTION

The problem of similarity join, which is to find similar string pairs from two string collections, is relevant to many data cleaning and data integration applications [11], [24]. Various functions can be used to quantify the similarity between two strings, such as edit distance and Jaccard. Many approaches [2], [15], [17], [19], [22], [28], [31] have been developed to solve this problem.

Existing studies focus on global similarity join. However, in applications such as data integration [13] and bioinformatics [1], it is often important to find similar substring pairs, even if two strings are not similar globally. The following are two motivating examples.

Example 1. In data integration, users often want to match the same entity from different sources. Fig. 1 shows an example of an online shopping mall’s purchase list and a supply list from its supplier. Record r1 in Fig. 1(a) and record s1 in Fig. 1(b) describe the same Samsung camera model. In particular, they have two substrings that have slightly different representations. Finding pairs of this type can help us locate records related to the same product, so that we can do a deeper analysis to remove duplicates and integrate information from different sources.

ID   Strings
r1   Samsung DV150F 16.2MP Smart Camera
r2   Canon EOS ELAN 7E(35mm) SLR Camera
r3   Canon PowerShot SX170 IS MP Digital Camera
r4   Sony W800/B 20 MP Digital Camera
(a) Purchase of goods.

ID   Strings
s1   New Samsung’s DV150FX camera $85.00
s2   Best for Beginners: Canon EOS ELAN 7/7E $449.99
s3   Memory Card for Canon EOS ELAN 7/7E $15.99
s4   New Canon’s PowerShot SX170IS Camera $99.95
(b) Supply of goods.

Fig. 1: Two product tables.

Example 2. A fundamental problem in protein sequence comparison is to decide whether two sequences share common structural and functional features based on similarity observed in their amino acid sequences. The decision can help scientists detect biologically similar living organisms in a large genome bank. Fig. 2 shows two bio-sequences that are globally dissimilar but share similar substrings. In particular, the underlined substrings of s3 and s5 are similar.

ID   Strings
s1   DCCADGGCRAARDCRCDD
s2   AGACAGCRRAARCDRAGG
s3   GCAGTACTCAACGATAGC
s4   GGATTACCTAGGCATTCT
s5   ATCATGCACTACTGAACG
s6   GGATTACCTAAGCATTCT

Fig. 2: Bio-sequences.

In this paper, we study the problem of local similarity join, which finds pairs from two string collections such that they share similar substrings. To evaluate local similarity, we follow [3] and use length and edit distance constraints, since edit distance has been widely used for evaluating string similarity. In [3], the local similarity matching problem is defined as matching any l-length pattern with k errors. It finds all locations in the text where an l-length substring of P occurs, with at most k differences. In real applications, l can be set to be the minimal entity length or phrase length, and k should satisfy k ≪ l. Different from the similarity matching problem in
Digital Object Identifier no. 10.1109/TKDE.2017.2687460
1041-4347 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
[3], we focus on the local similarity join problem. To the best of our knowledge, this is the first study of the local similarity join problem under edit distance constraint. It contains two sub-problems: checking local similar pairs and locating local similar pairs. We develop techniques to solve the problem efficiently. We make the following contributions:

• We develop a local similarity join framework for string collections in Section 3. It consists of three steps: (i) finding and pruning matching gram pairs to generate candidate gram pairs, (ii) verifying candidates by extending candidate gram pairs to substring pairs and calculating edit distance of substring pairs, and (iii) updating index. The framework can support self-join as well as join between two collections of strings.
• We first focus on the sub-problem of “checking local similar pairs”. Local similarity verification based on a matching gram pair is not trivial. Naively it needs to enumerate all the substring pairs. The existing extension-based method, which conducts exact extension in one string and similar extension in another string, cannot solve the problem. How to do efficient verification without enumerating all the substring pairs? We propose techniques in Section 4 to tackle this question.
• Furthermore, there are many matching grams, some of which cannot be extended to the final results, whereas others could cause duplicate verifications. Existing pruning methods do not work for the local similarity problem, thus in Section 5 we propose two new orthogonal pruning techniques to reduce candidates, and an incremental method to boost the process.
• We extend the techniques of “checking local similar pairs” to the sub-problem of “locating local similar pairs” in Section 6. We show that our techniques are general enough to cover these two sub-problems.
• We conduct extensive experiments on real and synthetic datasets to show the efficiency of the proposed techniques in Section 7.

2 PRELIMINARIES

2.1 Problem Description

Let Σ be an alphabet of characters. For a string s consisting of characters in Σ, we use |s| to denote its length, s[i] to denote the i-th character (starting from 1), and s[i, j] to denote the substring from position i to position j (a.k.a., “gram” at position i with length j − i + 1). We use s^{-1} to denote the reversed string of s. For example, if s = abc, s^{-1} = cba. We use ed(r, s) to denote the edit distance between string r and string s, which is the minimum number of single-character edit operations (insertion, deletion, and substitution) to transform r to s.

Definition 1. [Local similar pair] Given an edit distance threshold τ and a length l (τ

We say ⟨α, β⟩ is a longest local similar pair of r and s if min(|α|, |β|) is maximum among all the similar substring pairs of r and s.

Consider the example in Fig. 1. Let l = 10 and τ = 2. Then ⟨r1, s1⟩ is a local similar pair under τ w.r.t. l, since substring r1[1, 14] = Samsung_DV150F and substring s1[5, 20] = Samsung’s_DV150F have an edit distance 2 and their lengths are not less than 10. ⟨r1[1, 14], s1[5, 20]⟩ is also their longest similar substring pair.

Below we define two sub-problems.

Checking local similar pairs (also called LS-JOIN CHECKING problem or the “LJC problem”): Given two string collections R and S, let l be a length threshold and τ be an edit distance threshold. The problem of checking local similar pairs is to find all similar string pairs ⟨r, s⟩ ∈ R × S, such that ed_l(r, s) ≤ τ. We use “R ⋈^C_{l,τ} S” to represent the operation of checking local similar pairs.

Example 3. For the two string sets R and S in Fig. 1, we have R ⋈^C_{l=10,τ=2} S = {⟨r1, s1⟩, ⟨r2, s2⟩, ⟨r2, s3⟩, ⟨r3, s4⟩}. Similarly, for the string set S shown in Fig. 2, we have S ⋈^C_{l=10,τ=2} S = {⟨s3, s5⟩, ⟨s4, s6⟩}.

Locating local similar pairs (also called LS-JOIN LOCATING problem or the “LJL problem”): For each local similar string pair ⟨r, s⟩ ∈ R × S with ed_l(r, s) ≤ τ, locate the longest local similar pair ⟨α, β⟩ of ⟨r, s⟩ under τ w.r.t. l. We use “R ⋈^L_{l,τ} S” to represent the operation of locating local similar pairs.

Example 4. Consider the two string sets R and S in Fig. 1. We have R ⋈^L_{l=10,τ=2} S = {⟨r1[1, 14], s1[5, 20]⟩, ⟨r2[1, 17], s2[21, 39]⟩, ⟨r2[1, 17], s3[17, 35]⟩, ⟨r3[1, 21], s4[5, 27]⟩}. Similarly, for the string set S in Fig. 2, S ⋈^L_{l=10,τ=2} S = {⟨s3[1, 13], s5[6, 18]⟩, ⟨s4[1, 18], s6[1, 18]⟩}. The underlined substrings in Figs. 1 and 2 are answers of the examples.

2.2 Edit Distance Matrix

Edit distance can be computed using a matrix-filling dynamic programming algorithm [23], which reserves a matrix to hold the edit distances between all the prefixes of two strings. We use ⟨i, j⟩ to denote a cell at the i-th row and j-th column in an edit distance matrix, and use D(i, j) to denote the cell value. The edit distance matrix is constructed based on the recurrence relation given in Equation 1, in which c_{ij} = 1 if s1[i] ≠ s2[j]; otherwise c_{ij} = 0. Initially, D(i, 0) = i and D(0, j) = j. For two strings s1 and s2, we have ed(s1[1, i], s2[1, j]) = D(i, j), so ed(s1, s2) = D(|s1|, |s2|).

D(i, j) = \min \begin{cases} D(i, j-1) + 1, \\ D(i-1, j) + 1, \\ D(i-1, j-1) + c_{ij}, \end{cases} \qquad (1)

Fig. 3 shows an example of edit distance computation between strings ARDCRC and ARCDRA. Their edit distance is D(6, 6) = 3.

The computation can be improved if we just want to check whether the edit distance of two strings is within a threshold τ. Since |i − j| ≤ D(i, j) ≤ τ, the computation of