More Fuzzy String Matching!
Total Page:16
File Type:pdf, Size:1020Kb
More fuzzy string matching! Steven Bedrick CS/EE 5/655, 12/1/14 Plan for today: Tries Simple uses of tries Fuzzy search with tries Levenshtein automata A trie is essentially a prefix tree: A: 15 i: 11 in: 5 inn: 9 to: 7 tea: 3 ted: 4 ten: 12 Simple uses of tries: Key lookup in O(m) time, predictably. (Compare to hash table: best-case O(1), worst-case O(n), depending on key) Fast longest-prefix matching IP routing table lookup For an incoming packet, find the closest next hop in a routing table. Simple uses of tries: Fast longest-prefix matching Useful for autocompletion: “All words/names/whatevers that start with XYZ...” The problem with tries: When the space of keys is sparse, the trie is not very compact: (One) Solution: PATRICIA Tries (One) Solution: PATRICIA Tries Key ideas: edges represent more than a single symbol; nodes with only one child get collapsed. One could explicitly represent edges with multiple symbols... ... but that would complicate matching. Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal ↓ e s t i m a t i o n i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for: 1 e s 3 5 s t e i 6 8 sublease 7 a c t e i e i essence essential estimate estimation sublimate sublime subliminal Plan for today: Tries Simple uses of tries Fuzzy search with tries Levenshtein automata Fuzzy search with tries Problem: we want to search a dictionary for words similar to a query. Example: “Smyth” and “Zmith” should retrieve “Smith”, “Levenstien” should retrieve “Levenshtein”, etc. By “similar,” we mean “edit distance less than some threshold δ.” One solution: Compute pairwise edit distance between our query q and every word wi in our dictionary; Match if sim(q, wi) ≤ δ A (slightly) better solution: Speed up pairwise edit distance computation using prefix pruning. Prefix pruning’s key idea: If we only care whether strings r and s have an edit distance less than some threshold... j 0 1 2 3 4 2. TRIE-BASED FRAMEWORK i e b a y In this section, we first formalize the problem of string 0 0 1 2 3 4 similarity joins with edit-distance constraints and then in- troduce a trie-based framework for efficient similarity joins. 1 k 1 1 2 3 4 2 o 2 2 2 3 4 2.1 Problem Formulation 3 b 3 3 2 3 4 Given two sets of strings, a similarity join finds all simi- lar string pairs from the two sets. In this paper, we use edit 4 y 4 4 3 3 3 distance to quantify the similarity between two strings. For- Figure 1: Prefix pruning. Matrix for computing edit mally, the edit distance between two strings r and s,denoted ...wedist ancecan do of twoearly strings termination “ebay” of and our “koby computation”. Shaded as as ed(r, s), is the minimum number of single-character edit sooncells we denote exceedactive that entries threshold.for τ =1. operations (i.e., insertion, deletion, and substitution) needed Wang J, Feng J, Li G. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. VLDB Endowment; 2010. to transform r to s.Forexample,ed(koby, ebay)=3. In Figure 2 shows a trie structure of a sample data set with this paper two strings are similar if their edit distance is no six strings. String “ebay”hasatrienodeIDof12andits larger than a given edit-distance threshold τ.Weformalize prefix “eb”hasatrienodeIDof10.Forsimplicity,anode the problem of string similarity joins as follows. is mentioned interchangeably with its corresponding string Definition 1 (String Similarity Joins). Given two in later text. For example, both node “ko”andstring“ko” sets of strings R and S,andanedit-distancethresholdτ,a refer to node 14, and node 14 also refers to string “ko”. similarity join finds all similar string pairs ⟨r, s⟩∈R × S Given a trie node n,let|n| denote its depth (the depth of such that ed(r, s) ≤ τ. the root node is 0). For example, |“ko”| =2. 2.2 Prefix Pruning 0 One na¨ıve solution to address this problem is all-pair ver- A sample data set 1 9 13 ification, which enumerates all string pairs ⟨r, s⟩∈R × S b e k SID String and computes their edit distances. However, this solution 25 10 14 s is rather expensive. In fact, in most cases to check whether a e b o 1 bag two strings are similar, we need not compute the edit dis- s2 ebay 3 4 6 11 15 tance between the two complete strings. Instead we can do g y a a b s3 bay an early termination in the dynamic-programming compu- s4 kobe 7 12 16 17 tation as follows [15]. g y e y s5 koby Given two strings r = r1r2 ...rn and s = s1s2 ...sm,let 8 s6 beagy D denote a matrix with n +1 rowsand m +1 columns,and y D(i, j)betheeditdistancebetweentheprefixr1r2 ...ri and Figure 2: Trie index of a sample data set the prefix s1s2 ...sj .Weusethedynamic-programmingal- gorithm to compute the matrix: D(0,j)=j for 0 ≤ j ≤ n, Note that many strings with same prefixes share the same and D(i, j)=min(D(i − 1,j)+1,D(i, j − 1) + 1,D(i − 1,j− ancestor nodes on the trie structure. Based on this property, we can extend the idea of prefix pruning to prune a group 1) + θ)whereθ =0ifri = sj ;otherwiseθ =1. D(i, j) is called an active entry if D(i, j) ≤ τ. Figure 1 shows the of strings. Given a trie and a string s,noden in the trie is matrix to compute the edit distance between “ebay”and called an active node of string s if ed(s, n) ≤ τ.Ifn is not “koby”. The shaded cells (e.g., D(1, 1)) denote active en- an active node for every prefix of string s,thenallthestrings tries for τ = 1. (For all running examples in the remainder under n cannot be similar to s.Thereasonisthefollowing. of this paper, we assume τ =1.)Tocheckwhetherr = For any string with prefix n in the trie, say r,inthedynamic- “ebay”ands =“koby” are similar, we first compute the programming algorithm, we can take r as the row and s entries in row D(0, ∗)(onlythoseentriescircledbythebold as the column. As the row D(|n|, ∗) has no active entry, lines). As D(0, 0) and D(0, 1) are active entries, we compute r cannot be similar to s based on prefix pruning.Basedon the entries in row D(1, ∗). Similarly, we compute the entries this observation, we propose a new pruning technique, called in row D(2, ∗). We find that D(2, 1), D(2, 2) and D(2, 3) subtrie pruning:Givenatrieandastrings,tocomputethe are not active entries.