<<

More fuzzy string matching!

Steven Bedrick CS/EE 5/655, 12/1/14 Plan for today:

Tries Simple uses of Fuzzy search with tries Levenshtein automata A is essentially a prefix tree:

A: 15 i: 11 in: 5 inn: 9 to: 7 tea: 3 ted: 4 ten: 12 Simple uses of tries:

Key lookup in O(m) time, predictably. (Compare to hash table: best-case O(1), worst-case O(n), depending on key)

Fast longest-prefix matching

IP routing table lookup For an incoming packet, find the closest next hop in a routing table. Simple uses of tries: Fast longest-prefix matching

Useful for autocompletion: “All words/names/whatevers that start with XYZ...”

The problem with tries: When the space of keys is sparse, the trie is not very compact:

(One) Solution: PATRICIA Tries (One) Solution: PATRICIA Tries

Key ideas: edges represent more than a single symbol; nodes with only one child get collapsed. One could explicitly represent edges with multiple symbols...

... but that would complicate matching. Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal

↓ e s t i m a t i o n

i 1 2 3 4 5 6 7 8 9 10 Instead, each internal node stores the offset for the next difference to look for:

1 e s

3 5

s t e i

6 8 sublease 7 a c t e i e i

essence essential estimate estimation sublimate sublime subliminal Plan for today:

Tries Simple uses of tries Fuzzy search with tries Levenshtein automata Fuzzy search with tries

Problem: we want to search a dictionary for words similar to a query.

Example: “Smyth” and “Zmith” should retrieve “Smith”,

“Levenstien” should retrieve “Levenshtein”, etc.

By “similar,” we mean “edit less than some threshold δ.” One solution: Compute pairwise edit distance between our query q and every word wi in our dictionary;

Match if sim(q, wi) ≤ δ

A (slightly) better solution: Speed up pairwise edit distance computation using prefix pruning. Prefix pruning’s key idea: If we only care whether strings r and s have an edit distance less than some threshold...

j 0 1 2 3 4 2. TRIE-BASED FRAMEWORK i e b a y In this section, we first formalize the problem of string 0 0 1 2 3 4 similarity joins with edit-distance constraints and then in- troduce a trie-based framework for efficient similarity joins. 1 k 1 1 2 3 4 2 o 2 2 2 3 4 2.1 Problem Formulation 3 b 3 3 2 3 4 Given two sets of strings, a similarity join finds all simi- lar string pairs from the two sets. In this paper, we use edit 4 y 4 4 3 3 3 distance to quantify the similarity between two strings. For- Figure 1: Prefix pruning. Matrix for computing edit mally, the edit distance between two strings r and s,denoted ...wedist ancecan do of twoearly strings termination “ebay” of and our “koby computation”. Shaded as as ed(r, s), is the minimum number of single-character edit sooncells we denote exceedactive that entries threshold.for τ =1. operations (i.e., insertion, deletion, and substitution) needed Wang J, Feng J, Li G. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. VLDB Endowment; 2010. to transform r to s.Forexample,ed(koby, ebay)=3. In Figure 2 shows a trie structure of a sample data set with this paper two strings are similar if their edit distance is no six strings. String “ebay”hasatrienodeIDof12andits larger than a given edit-distance threshold τ.Weformalize prefix “eb”hasatrienodeIDof10.Forsimplicity,anode the problem of string similarity joins as follows. is mentioned interchangeably with its corresponding string Definition 1 (String Similarity Joins). Given two in later text. For example, both node “ko”andstring“ko” sets of strings R and S,andanedit-distancethresholdτ,a refer to node 14, and node 14 also refers to string “ko”. similarity join finds all similar string pairs ⟨r, s⟩∈R × S Given a trie node n,let|n| denote its depth (the depth of such that ed(r, s) ≤ τ. the root node is 0). For example, |“ko”| =2. 2.2 Prefix Pruning 0 One na¨ıve solution to address this problem is all-pair ver- A sample data set 1 9 13 ification, which enumerates all string pairs ⟨r, s⟩∈R × S b e k SID String and computes their edit . However, this solution 25 10 14 s is rather expensive. In fact, in most cases to check whether a e b o 1 bag two strings are similar, we need not compute the edit dis- s2 ebay 3 4 6 11 15 tance between the two complete strings. Instead we can do g y a a b s3 bay an early termination in the dynamic-programming compu- s4 kobe 7 12 16 17 tation as follows [15]. g y e y s5 koby Given two strings r = r1r2 ...rn and s = s1s2 ...sm,let 8 s6 beagy D denote a matrix with n +1 rowsand m +1 columns,and y D(i, j)betheeditdistancebetweentheprefixr1r2 ...ri and Figure 2: Trie index of a sample data set the prefix s1s2 ...sj .Weusethedynamic-programmingal- gorithm to compute the matrix: D(0,j)=j for 0 ≤ j ≤ n, Note that many strings with same prefixes share the same and D(i, j)=min(D(i − 1,j)+1,D(i, j − 1) + 1,D(i − 1,j− ancestor nodes on the trie structure. Based on this property, we can extend the idea of prefix pruning to prune a group 1) + θ)whereθ =0ifri = sj ;otherwiseθ =1. D(i, j) is called an active entry if D(i, j) ≤ τ. Figure 1 shows the of strings. Given a trie and a string s,noden in the trie is matrix to compute the edit distance between “ebay”and called an active node of string s if ed(s, n) ≤ τ.Ifn is not “koby”. The shaded cells (e.g., D(1, 1)) denote active en- an active node for every prefix of string s,thenallthestrings tries for τ = 1. (For all running examples in the remainder under n cannot be similar to s.Thereasonisthefollowing. of this paper, we assume τ =1.)Tocheckwhetherr = For any string with prefix n in the trie, say r,inthedynamic- “ebay”ands =“koby” are similar, we first compute the programming algorithm, we can take r as the row and s entries in row D(0, ∗)(onlythoseentriescircledbythebold as the column. As the row D(|n|, ∗) has no active entry, lines). As D(0, 0) and D(0, 1) are active entries, we compute r cannot be similar to s based on prefix pruning.Basedon the entries in row D(1, ∗). Similarly, we compute the entries this observation, we propose a new pruning technique, called in row D(2, ∗). We find that D(2, 1), D(2, 2) and D(2, 3) subtrie pruning:Givenatrieandastrings,tocomputethe are not active entries. Based on the dynamic-programming similar strings of s on the trie, for each trie node n,ifn is algorithm, the following rows D(i>2, ∗)cannothaveactive not an active node of every prefix of s,weneednottraverse entries, thus we can do an early termination. This prun- the subtrie rooted at n. The following Lemma shows the ing technique is called prefix pruning.Howeverthemethod correctness of the subtrie pruning. using prefix pruning for similarity joins also needs to do all- Lemma 1 (Subtrie Pruning). Given a trie T and a pair verification. To improve prefix pruning and increase string s,ifnoden is not an active node for every prefix of performance, we make the following two observations. s,thenn’s descendants will not be similar to s. For example, consider the trie in Figure 2 and suppose 2.3 Our Observations τ =1.Givenastring“ebay”, since node “ko”isnotan Observation 1 - Subtrie Pruning:Astherearealarge active node for every prefix of “ebay”, we can figure out number of strings in the two sets and many strings share that all the strings in the subtree rooted at “ko”cannotbe prefixes, we can extend prefix pruning to prune a group of similar to “ebay”basedonLemma 1, and thus those strings strings. We use a trie structure to index all strings. Trie under “ko”(e.g.,“kobe”and“koby”) can be pruned. is a tree structure where each path from the root to a leaf Using subtrie pruning, we can devise a trie-search-based represents a string in the data set and every node on the method for similarity joins, called Trie-Search. Trie- path has a label of a character in the string. For instance, Search first constructs a trie structure for all strings in R,

1220 One solution: Compute pairwise edit distance between our query q and every word wi in our dictionary;

Match if sim(q, wi) ≤ δ

A (slightly) better solution: Speed up pairwise edit distance computation using prefix pruning. Neither are very good solutions for any kind of “on-line” use case:

Query autocompletion, fuzzy searching, spellchecking, etc.

(our dictionary is large, number of searches is high, etc. etc.) A better solution: use a trie!

1. Build a trie out of our dictionary;

2. Iterate through q; at each point, identify a set of active nodes of the trie.

A node n is “active” with respect to a prefix qi if the edit distance between qi and the prefix represented by n is ≤ δ.

Ji S, Li G, Li C, Feng J. Efficient interactive fuzzy keyword search. WWW ’09 A better solution: use a trie!

1. Build a trie out of our dictionary;

2. Iterate through q; at each point, identify a set of active nodes of the trie. 3. Stop when we reach the end of q or no longer have active nodes.

Active nodes that happen to be leaves represent matches.

Ji S, Li G, Li C, Feng J. Efficient interactive fuzzy keyword search. WWW ’09 A better solution: use a trie!

1. Build a trie out of our dictionary;

2. Iterate through q; at each point, identify a set of active nodes of the trie. 3. Stop when we reach the end of q or no longer have active nodes.

Active nodes that happen to be leaves represent matches.

Ji S, Li G, Li C, Feng J. Efficient interactive fuzzy keyword search. WWW ’09 Intuition: at each symbol i in q, the set of active nodes will be related to the set from qi-1.

WWW 2009So, MADRID! we don’t need to visit every nodeTrack: Search in /the Session: Search UI

trie! ED = 0 ED = 1 ED = 2 0 0 0 0 0

10 10 10 10 10 l l l l l 11 14 11 14 11 14 11 14 11 14 i u 4 i u 4 i u 4 i u 4 i u 4

15 15 15 15 15 1 n u i 1 n u i 1 n u i 1 n u i 1 n u i 12 13 12 13 12 13 12 13 12 13 16 16 16 16 16 s s s s s 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 (a)Initialize (b)query“n”(c)query“nl”(d)query“nli”(e)query“nlis” WWW 2009 MADRID! Track: Search / Session: Search UI Figure 4: Fuzzy search of prefix queries of “nlis”(threshold δ =2).

ED = 0 ED = 1 ED = 2 nodes”), and we need to compute them efficiently. The leaf for “matching.” In this case, ed(nm,px+1) ed(n, px)= ≤ 0 descendantsJi S, Li G, of Li theC, Feng active0 J. Efficient nodes interactive arefuzzy calledkeyword search.0 the WWWpredicted ’09 key- 0 ξn δ.Thus,nm0is an active node of the new string, so we ≤ words of the prefix. For example, consider the trie in Fig- add nm, ξn into Φp .Thiscasecorrespondstothematch ⟨ ⟩ x+1 10 ure 4. Suppose the10 edit-distance threshold10δ =2,andauser 10 between the character10 cx+1 and the label of nm.Onesub- l l l l l types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for the node nm is smaller 11 and “14luis”areallsimilarto11 14 p,sincetheireditdistancesto11 14 11 than14 δ,i.e.,ξ11n < δ,thefollowingoperationisalsorequired:14 i u 4 i u 4 i u 4 i u 4 i “nlisu ”arewithin4 δ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set 15 15 15 15 15 ⟨ ⟩ 1 n u prefixi are “1li”,n “linu”, “liui ”, an d1 “luisn ”. u i 1 n u fori the new1 stringn u px+1,wherei ξd = ξn + d nm .This 12 13 We develop12 a caching-based13 algorithm12 for13 incrementally12 13 operation corresponds12 13 to inserting several letters| | − | after| node 16 16 16 163 16 computings active nodes fors a keyword as the users types it nms. s in letter by letter. Given an input prefix p, we compute and 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 store the set of active nodes Φp = n, ξn ,inwhichn is c {⟨ ⟩} 1 (a)Initializean active node, and (b)query“ξn = edn(”(p, n) δ.Theideabehindourc)query“nl”(d)query“nli”(e)query“nlis” c2 ≤ algorithmFigure is 4: to Fuzzy use prefix-filtering: search of prefix when queries the user of types “nlis in”(threshold δ =2). c3 . one more letter after p,theactivenodesofp can be used to px . nodes”), and we needcompute to compute the active them e nodesfficiently. of the The new leaf query. for “matching.” In this case, ed(n ,p ) ed(n, p )= px+1 m x+1 n, n x . descendants of the activeThe nodes algorithm are called works the aspredicted follows. key- First, an active-nodeξ δ.Thus, setn is an active node of the new≤ string, so we n m c is initialized for the empty query ϵ,i.e.,Φϵ =≤ n, ξn n , +1 x words of the prefix. For example, consider the trie in Fig- add n{m⟨ , ξn ⟩into| Φpx+1 .Thiscasecorrespondstothematchn ξn = n δ .Thatis,itincludesalltrienodes⟨ n whose⟩ ure 4. Suppose the edit-distance| | ≤ threshold} δ =2,andauser between the character cx+1 and the label of nm.Onesub- cx+1 corresponding string has a length n within the edit-distance nm , n , types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for then node nm iss smallern+1 threshold δ.Thesenodesareactivenodesfor| | ϵ since their and “luis”areallsimilartop,sincetheireditdistancesto than δ,i.e.,ξn < δ,thefollowingoperationisalsorequired:cx+1 edit distances to ϵ are within δ. “nlis”arewithinδ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters As the user types in a query string px = c1c2 ...cx letter − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set by letter, the active-node set Φp is computed and cached ⟨ ⟩ prefix are “li”, “lin”, “liu”, an d “luis”. x for the new string px+1,wheren ξd = ξn + d nm .This for px.Whentheusertypesinanewcharactercx+1 and | | − | | We develop a caching-based algorithm for incrementally operation corresponds to insertingd , n+|d|-|n severalm| letters after node submits a new query px+1,theservercomputestheactive-3 computing active nodes for a keyword as the user types it nm. node set Φp for px+1 by using Φp .Foreachn, ξn in letter by letter. Given an inputx+1 prefix p, we compute and x ⟨ ⟩ Deletion Match Substitution Insersion in Φpx ,thedescendantsofn are examined as active-node store the set of active nodes Φp = n, ξn ,inwhichn is c1 candidates for p{x⟨+1,asillustratedinFigure5.Forthenode⟩} an active node, and ξn = ed(p, n) δ.Theideabehindour c2 n,ifξn +1 ≤ δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node algorithm is to use prefix-filtering:≤ when the user types in c3 n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node. set Φ .Weconsider one more letter after⟨p,theactivenodesof⟩ p can be used to px+1 px px deleting the last character cx+1 from the new query string . compute the active nodes of the new query. an active node n,pξxn+1 in Φpx . px+1.Noticeevenifξn +1 δ is not true, node n could still n, ⟨ ⟩ . The algorithm works as follows. First, an active-node≤ set n potentially become an active node of the new string, due to cx is initialized for the empty query ϵ,i.e.,Φϵ = n, ξn During then , computation+1 of set Φpx+1 ,itispossibletoadd operations described below on{⟨ other⟩ active| nodes in Φpx . n ξn = n δ .Thatis,itincludesalltrienodesn whose multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. cx+1 | | ≤ } For each child nc of node n,therearetwopossiblecases.n , ⟨ ⟩ ⟨ ⟩ corresponding string has a length n within the edit-distance m n In this case,ns , onlyn+1 the one with the smallest edit operation is Case 1:thechildnode| | nc has a character different from threshold δ.Thesenodesareactivenodesforϵ since their kept in Φp .Thereasonisthat,bydefinition,forthesame cx+1 x+1 edit distances to ϵ arecx within+1.Figure5showsanodeδ. ns for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- As the user types in a query string px = c1c2 ...cx letter node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= by letter, the active-node set Φpx is computed and cached ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active noden for the for px.Whentheusertypesinanewcharactercx+1 and transform the string of v to the string of p . ≤ d , +|d|-|n | x+1 new string, and ns, ξn +1 is added into Φpx+1 .Thiscasen m submits a new query px+1,theservercomputestheactive-⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for node set Φpx+1 for px+1 by using Φpx .Foreachn, ξn cx+1. Case 2:thechildnodenc⟨ has a⟩ label cx+1.Figure5Deletion insertions,Match becauseSubstitution if theseInsersion descendants are active nodes of in Φpx ,thedescendantsofn are examined as active-node shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe candidates for px+1,asillustratedinFigure5.Forthenode processed by substitutions on their parents. n,ifξn +1 δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node ≤ n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node set Φ .Weconsider ⟨ ⟩ px+1 px deleting the last character cx+1 from the new query string an active node n, ξn in Φp . ⟨ ⟩ x px+1.Noticeevenifξn +1 δ is not true, node n could still 374 potentially become an active≤ node of the new string, due to During the computation of set Φpx+1 ,itispossibletoadd operations described below on other active nodes in Φpx . multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. For each child nc of node n,therearetwopossiblecases. In this case, only⟨ the⟩ one⟨ with⟩ the smallest edit operation is Case 1:thechildnodenc has a character different from kept in Φpx+1 .Thereasonisthat,bydefinition,forthesame cx+1.Figure5showsanodens for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active node for the ≤ transform the string of v to the string of px+1. new string, and ns, ξn +1 is added into Φpx+1 .Thiscase ⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for cx+1. Case 2:thechildnodenc has a label cx+1.Figure5 insertions, because if these descendants are active nodes of shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe processed by substitutions on their parents.

374 Intuition: at each symbol i in q, the set of active nodes will be related to the set from qi-1.

WWW 2009So, MADRID! we don’t need to visit every nodeTrack: Search in /the Session: Search UI

trie! ED = 0 ED = 1 ED = 2 0 0 0 0 0

10 10 10 10 10 l l l l l 11 14 11 14 11 14 11 14 11 14 i u 4 i u 4 i u 4 i u 4 i u 4

15 15 15 15 15 1 n u i 1 n u i 1 n u i 1 n u i 1 n u i 12 13 12 13 12 13 12 13 12 13 16 16 16 16 16 s s s s s 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 (a)Initialize (b)query“n”(c)query“nl”(d)query“nli”(e)query“nlis” WWW 2009 MADRID! Track: Search / Session: Search UI Figure 4: Fuzzy search of prefix queries of “nlis”(threshold δ =2).

ED = 0 ED = 1 ED = 2 nodes”), and we need to compute them efficiently. The leaf for “matching.” In this case, ed(nm,px+1) ed(n, px)= ≤ 0 descendantsJi S, Li G, of Li theC, Feng active0 J. Efficient nodes interactive arefuzzy calledkeyword search.0 the WWWpredicted ’09 key- 0 ξn δ.Thus,nm0is an active node of the new string, so we ≤ words of the prefix. For example, consider the trie in Fig- add nm, ξn into Φp .Thiscasecorrespondstothematch ⟨ ⟩ x+1 10 ure 4. Suppose the10 edit-distance threshold10δ =2,andauser 10 between the character10 cx+1 and the label of nm.Onesub- l l l l l types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for the node nm is smaller 11 and “14luis”areallsimilarto11 14 p,sincetheireditdistancesto11 14 11 than14 δ,i.e.,ξ11n < δ,thefollowingoperationisalsorequired:14 i u 4 i u 4 i u 4 i u 4 i “nlisu ”arewithin4 δ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set 15 15 15 15 15 ⟨ ⟩ 1 n u prefixi are “1li”,n “linu”, “liui ”, an d1 “luisn ”. u i 1 n u fori the new1 stringn u px+1,wherei ξd = ξn + d nm .This 12 13 We develop12 a caching-based13 algorithm12 for13 incrementally12 13 operation corresponds12 13 to inserting several letters| | − | after| node 16 16 16 163 16 computings active nodes fors a keyword as the users types it nms. s in letter by letter. Given an input prefix p, we compute and 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 store the set of active nodes Φp = n, ξn ,inwhichn is c {⟨ ⟩} 1 (a)Initializean active node, and (b)query“ξn = edn(”(p, n) δ.Theideabehindourc)query“nl”(d)query“nli”(e)query“nlis” c2 ≤ algorithmFigure is 4: to Fuzzy use prefix-filtering: search of prefix when queries the user of types “nlis in”(threshold δ =2). c3 . one more letter after p,theactivenodesofp can be used to px . nodes”), and we needcompute to compute the active them e nodesfficiently. of the The new leaf query. for “matching.” In this case, ed(n ,p ) ed(n, p )= px+1 m x+1 n, n x . descendants of the activeThe nodes algorithm are called works the aspredicted follows. key- First, an active-nodeξ δ.Thus, setn is an active node of the new≤ string, so we n m c is initialized for the empty query ϵ,i.e.,Φϵ =≤ n, ξn n , +1 x words of the prefix. For example, consider the trie in Fig- add n{m⟨ , ξn ⟩into| Φpx+1 .Thiscasecorrespondstothematchn ξn = n δ .Thatis,itincludesalltrienodes⟨ n whose⟩ ure 4. Suppose the edit-distance| | ≤ threshold} δ =2,andauser between the character cx+1 and the label of nm.Onesub- cx+1 corresponding string has a length n within the edit-distance nm , n , types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for then node nm iss smallern+1 threshold δ.Thesenodesareactivenodesfor| | ϵ since their and “luis”areallsimilartop,sincetheireditdistancesto than δ,i.e.,ξn < δ,thefollowingoperationisalsorequired:cx+1 edit distances to ϵ are within δ. “nlis”arewithinδ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters As the user types in a query string px = c1c2 ...cx letter − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set by letter, the active-node set Φp is computed and cached ⟨ ⟩ prefix are “li”, “lin”, “liu”, an d “luis”. x for the new string px+1,wheren ξd = ξn + d nm .This for px.Whentheusertypesinanewcharactercx+1 and | | − | | We develop a caching-based algorithm for incrementally operation corresponds to insertingd , n+|d|-|n severalm| letters after node submits a new query px+1,theservercomputestheactive-3 computing active nodes for a keyword as the user types it nm. node set Φp for px+1 by using Φp .Foreachn, ξn in letter by letter. Given an inputx+1 prefix p, we compute and x ⟨ ⟩ Deletion Match Substitution Insersion in Φpx ,thedescendantsofn are examined as active-node store the set of active nodes Φp = n, ξn ,inwhichn is c1 candidates for p{x⟨+1,asillustratedinFigure5.Forthenode⟩} an active node, and ξn = ed(p, n) δ.Theideabehindour c2 n,ifξn +1 ≤ δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node algorithm is to use prefix-filtering:≤ when the user types in c3 n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node. set Φ .Weconsider one more letter after⟨p,theactivenodesof⟩ p can be used to px+1 px px deleting the last character cx+1 from the new query string . compute the active nodes of the new query. an active node n,pξxn+1 in Φpx . px+1.Noticeevenifξn +1 δ is not true, node n could still n, ⟨ ⟩ . The algorithm works as follows. First, an active-node≤ set n potentially become an active node of the new string, due to cx is initialized for the empty query ϵ,i.e.,Φϵ = n, ξn During then , computation+1 of set Φpx+1 ,itispossibletoadd operations described below on{⟨ other⟩ active| nodes in Φpx . n ξn = n δ .Thatis,itincludesalltrienodesn whose multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. cx+1 | | ≤ } For each child nc of node n,therearetwopossiblecases.n , ⟨ ⟩ ⟨ ⟩ corresponding string has a length n within the edit-distance m n In this case,ns , onlyn+1 the one with the smallest edit operation is Case 1:thechildnode| | nc has a character different from threshold δ.Thesenodesareactivenodesforϵ since their kept in Φp .Thereasonisthat,bydefinition,forthesame cx+1 x+1 edit distances to ϵ arecx within+1.Figure5showsanodeδ. ns for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- As the user types in a query string px = c1c2 ...cx letter node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= by letter, the active-node set Φpx is computed and cached ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active noden for the for px.Whentheusertypesinanewcharactercx+1 and transform the string of v to the string of p . ≤ d , +|d|-|n | x+1 new string, and ns, ξn +1 is added into Φpx+1 .Thiscasen m submits a new query px+1,theservercomputestheactive-⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for node set Φpx+1 for px+1 by using Φpx .Foreachn, ξn cx+1. Case 2:thechildnodenc⟨ has a⟩ label cx+1.Figure5Deletion insertions,Match becauseSubstitution if theseInsersion descendants are active nodes of in Φpx ,thedescendantsofn are examined as active-node shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe candidates for px+1,asillustratedinFigure5.Forthenode processed by substitutions on their parents. n,ifξn +1 δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node ≤ n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node set Φ .Weconsider ⟨ ⟩ px+1 px deleting the last character cx+1 from the new query string an active node n, ξn in Φp . ⟨ ⟩ x px+1.Noticeevenifξn +1 δ is not true, node n could still 374 potentially become an active≤ node of the new string, due to During the computation of set Φpx+1 ,itispossibletoadd operations described below on other active nodes in Φpx . multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. For each child nc of node n,therearetwopossiblecases. In this case, only⟨ the⟩ one⟨ with⟩ the smallest edit operation is Case 1:thechildnodenc has a character different from kept in Φpx+1 .Thereasonisthat,bydefinition,forthesame cx+1.Figure5showsanodens for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active node for the ≤ transform the string of v to the string of px+1. new string, and ns, ξn +1 is added into Φpx+1 .Thiscase ⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for cx+1. Case 2:thechildnodenc has a label cx+1.Figure5 insertions, because if these descendants are active nodes of shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe processed by substitutions on their parents.

374 Intuition: at each symbol i in q, the set of active nodes will be related to the set from qi-1.

WWW 2009So, MADRID! we don’t need to visit every nodeTrack: Search in /the Session: Search UI

trie! ED = 0 ED = 1 ED = 2 0 0 0 0 0

10 10 10 10 10 l l l l l 11 14 11 14 11 14 11 14 11 14 i u 4 i u 4 i u 4 i u 4 i u 4

15 15 15 15 15 1 n u i 1 n u i 1 n u i 1 n u i 1 n u i 12 13 12 13 12 13 12 13 12 13 16 16 16 16 16 s s s s s 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 (a)Initialize (b)query“n”(c)query“nl”(d)query“nli”(e)query“nlis” WWW 2009 MADRID! Track: Search / Session: Search UI Figure 4: Fuzzy search of prefix queries of “nlis”(threshold δ =2).

ED = 0 ED = 1 ED = 2 nodes”), and we need to compute them efficiently. The leaf for “matching.” In this case, ed(nm,px+1) ed(n, px)= ≤ 0 descendantsJi S, Li G, of Li theC, Feng active0 J. Efficient nodes interactive arefuzzy calledkeyword search.0 the WWWpredicted ’09 key- 0 ξn δ.Thus,nm0is an active node of the new string, so we ≤ words of the prefix. For example, consider the trie in Fig- add nm, ξn into Φp .Thiscasecorrespondstothematch ⟨ ⟩ x+1 10 ure 4. Suppose the10 edit-distance threshold10δ =2,andauser 10 between the character10 cx+1 and the label of nm.Onesub- l l l l l types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for the node nm is smaller 11 and “14luis”areallsimilarto11 14 p,sincetheireditdistancesto11 14 11 than14 δ,i.e.,ξ11n < δ,thefollowingoperationisalsorequired:14 i u 4 i u 4 i u 4 i u 4 i “nlisu ”arewithin4 δ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set 15 15 15 15 15 ⟨ ⟩ 1 n u prefixi are “1li”,n “linu”, “liui ”, an d1 “luisn ”. u i 1 n u fori the new1 stringn u px+1,wherei ξd = ξn + d nm .This 12 13 We develop12 a caching-based13 algorithm12 for13 incrementally12 13 operation corresponds12 13 to inserting several letters| | − | after| node 16 16 16 163 16 computings active nodes fors a keyword as the users types it nms. s in letter by letter. Given an input prefix p, we compute and 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 store the set of active nodes Φp = n, ξn ,inwhichn is c {⟨ ⟩} 1 (a)Initializean active node, and (b)query“ξn = edn(”(p, n) δ.Theideabehindourc)query“nl”(d)query“nli”(e)query“nlis” c2 ≤ algorithmFigure is 4: to Fuzzy use prefix-filtering: search of prefix when queries the user of types “nlis in”(threshold δ =2). c3 . one more letter after p,theactivenodesofp can be used to px . nodes”), and we needcompute to compute the active them e nodesfficiently. of the The new leaf query. for “matching.” In this case, ed(n ,p ) ed(n, p )= px+1 m x+1 n, n x . descendants of the activeThe nodes algorithm are called works the aspredicted follows. key- First, an active-nodeξ δ.Thus, setn is an active node of the new≤ string, so we n m c is initialized for the empty query ϵ,i.e.,Φϵ =≤ n, ξn n , +1 x words of the prefix. For example, consider the trie in Fig- add n{m⟨ , ξn ⟩into| Φpx+1 .Thiscasecorrespondstothematchn ξn = n δ .Thatis,itincludesalltrienodes⟨ n whose⟩ ure 4. Suppose the edit-distance| | ≤ threshold} δ =2,andauser between the character cx+1 and the label of nm.Onesub- cx+1 corresponding string has a length n within the edit-distance nm , n , types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for then node nm iss smallern+1 threshold δ.Thesenodesareactivenodesfor| | ϵ since their and “luis”areallsimilartop,sincetheireditdistancesto than δ,i.e.,ξn < δ,thefollowingoperationisalsorequired:cx+1 edit distances to ϵ are within δ. “nlis”arewithinδ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters As the user types in a query string px = c1c2 ...cx letter − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set by letter, the active-node set Φp is computed and cached ⟨ ⟩ prefix are “li”, “lin”, “liu”, an d “luis”. x for the new string px+1,wheren ξd = ξn + d nm .This for px.Whentheusertypesinanewcharactercx+1 and | | − | | We develop a caching-based algorithm for incrementally operation corresponds to insertingd , n+|d|-|n severalm| letters after node submits a new query px+1,theservercomputestheactive-3 computing active nodes for a keyword as the user types it nm. node set Φp for px+1 by using Φp .Foreachn, ξn in letter by letter. Given an inputx+1 prefix p, we compute and x ⟨ ⟩ Deletion Match Substitution Insersion in Φpx ,thedescendantsofn are examined as active-node store the set of active nodes Φp = n, ξn ,inwhichn is c1 candidates for p{x⟨+1,asillustratedinFigure5.Forthenode⟩} an active node, and ξn = ed(p, n) δ.Theideabehindour c2 n,ifξn +1 ≤ δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node algorithm is to use prefix-filtering:≤ when the user types in c3 n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node. set Φ .Weconsider one more letter after⟨p,theactivenodesof⟩ p can be used to px+1 px px deleting the last character cx+1 from the new query string . compute the active nodes of the new query. an active node n,pξxn+1 in Φpx . px+1.Noticeevenifξn +1 δ is not true, node n could still n, ⟨ ⟩ . The algorithm works as follows. First, an active-node≤ set n potentially become an active node of the new string, due to cx is initialized for the empty query ϵ,i.e.,Φϵ = n, ξn During then , computation+1 of set Φpx+1 ,itispossibletoadd operations described below on{⟨ other⟩ active| nodes in Φpx . n ξn = n δ .Thatis,itincludesalltrienodesn whose multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. cx+1 | | ≤ } For each child nc of node n,therearetwopossiblecases.n , ⟨ ⟩ ⟨ ⟩ corresponding string has a length n within the edit-distance m n In this case,ns , onlyn+1 the one with the smallest edit operation is Case 1:thechildnode| | nc has a character different from threshold δ.Thesenodesareactivenodesforϵ since their kept in Φp .Thereasonisthat,bydefinition,forthesame cx+1 x+1 edit distances to ϵ arecx within+1.Figure5showsanodeδ. ns for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- As the user types in a query string px = c1c2 ...cx letter node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= by letter, the active-node set Φpx is computed and cached ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active noden for the for px.Whentheusertypesinanewcharactercx+1 and transform the string of v to the string of p . ≤ d , +|d|-|n | x+1 new string, and ns, ξn +1 is added into Φpx+1 .Thiscasen m submits a new query px+1,theservercomputestheactive-⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for node set Φpx+1 for px+1 by using Φpx .Foreachn, ξn cx+1. Case 2:thechildnodenc⟨ has a⟩ label cx+1.Figure5Deletion insertions,Match becauseSubstitution if theseInsersion descendants are active nodes of in Φpx ,thedescendantsofn are examined as active-node shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe candidates for px+1,asillustratedinFigure5.Forthenode processed by substitutions on their parents. n,ifξn +1 δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node ≤ n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node set Φ .Weconsider ⟨ ⟩ px+1 px deleting the last character cx+1 from the new query string an active node n, ξn in Φp . ⟨ ⟩ x px+1.Noticeevenifξn +1 δ is not true, node n could still 374 potentially become an active≤ node of the new string, due to During the computation of set Φpx+1 ,itispossibletoadd operations described below on other active nodes in Φpx . multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. For each child nc of node n,therearetwopossiblecases. In this case, only⟨ the⟩ one⟨ with⟩ the smallest edit operation is Case 1:thechildnodenc has a character different from kept in Φpx+1 .Thereasonisthat,bydefinition,forthesame cx+1.Figure5showsanodens for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active node for the ≤ transform the string of v to the string of px+1. new string, and ns, ξn +1 is added into Φpx+1 .Thiscase ⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for cx+1. Case 2:thechildnodenc has a label cx+1.Figure5 insertions, because if these descendants are active nodes of shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe processed by substitutions on their parents.

374 Intuition: at each symbol i in q, the set of active nodes will be related to the set from qi-1.

WWW 2009So, MADRID! we don’t need to visit every nodeTrack: Search in /the Session: Search UI

trie! ED = 0 ED = 1 ED = 2 0 0 0 0 0

10 10 10 10 10 l l l l l 11 14 11 14 11 14 11 14 11 14 i u 4 i u 4 i u 4 i u 4 i u 4

15 15 15 15 15 1 n u i 1 n u i 1 n u i 1 n u i 1 n u i 12 13 12 13 12 13 12 13 12 13 16 16 16 16 16 s s s s s 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 (a)Initialize (b)query“n”(c)query“nl”(d)query“nli”(e)query“nlis” WWW 2009 MADRID! Track: Search / Session: Search UI Figure 4: Fuzzy search of prefix queries of “nlis”(threshold δ =2).

ED = 0 ED = 1 ED = 2 nodes”), and we need to compute them efficiently. The leaf for “matching.” In this case, ed(nm,px+1) ed(n, px)= ≤ 0 descendantsJi S, Li G, of Li theC, Feng active0 J. Efficient nodes interactive arefuzzy calledkeyword search.0 the WWWpredicted ’09 key- 0 ξn δ.Thus,nm0is an active node of the new string, so we ≤ words of the prefix. For example, consider the trie in Fig- add nm, ξn into Φp .Thiscasecorrespondstothematch ⟨ ⟩ x+1 10 ure 4. Suppose the10 edit-distance threshold10δ =2,andauser 10 between the character10 cx+1 and the label of nm.Onesub- l l l l l types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for the node nm is smaller 11 and “14luis”areallsimilarto11 14 p,sincetheireditdistancesto11 14 11 than14 δ,i.e.,ξ11n < δ,thefollowingoperationisalsorequired:14 i u 4 i u 4 i u 4 i u 4 i “nlisu ”arewithin4 δ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set 15 15 15 15 15 ⟨ ⟩ 1 n u prefixi are “1li”,n “linu”, “liui ”, an d1 “luisn ”. u i 1 n u fori the new1 stringn u px+1,wherei ξd = ξn + d nm .This 12 13 We develop12 a caching-based13 algorithm12 for13 incrementally12 13 operation corresponds12 13 to inserting several letters| | − | after| node 16 16 16 163 16 computings active nodes fors a keyword as the users types it nms. s in letter by letter. Given an input prefix p, we compute and 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 store the set of active nodes Φp = n, ξn ,inwhichn is c {⟨ ⟩} 1 (a)Initializean active node, and (b)query“ξn = edn(”(p, n) δ.Theideabehindourc)query“nl”(d)query“nli”(e)query“nlis” c2 ≤ algorithmFigure is 4: to Fuzzy use prefix-filtering: search of prefix when queries the user of types “nlis in”(threshold δ =2). c3 . one more letter after p,theactivenodesofp can be used to px . nodes”), and we needcompute to compute the active them e nodesfficiently. of the The new leaf query. for “matching.” In this case, ed(n ,p ) ed(n, p )= px+1 m x+1 n, n x . descendants of the activeThe nodes algorithm are called works the aspredicted follows. key- First, an active-nodeξ δ.Thus, setn is an active node of the new≤ string, so we n m c is initialized for the empty query ϵ,i.e.,Φϵ =≤ n, ξn n , +1 x words of the prefix. For example, consider the trie in Fig- add n{m⟨ , ξn ⟩into| Φpx+1 .Thiscasecorrespondstothematchn ξn = n δ .Thatis,itincludesalltrienodes⟨ n whose⟩ ure 4. Suppose the edit-distance| | ≤ threshold} δ =2,andauser between the character cx+1 and the label of nm.Onesub- cx+1 corresponding string has a length n within the edit-distance nm , n , types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for then node nm iss smallern+1 threshold δ.Thesenodesareactivenodesfor| | ϵ since their and “luis”areallsimilartop,sincetheireditdistancesto than δ,i.e.,ξn < δ,thefollowingoperationisalsorequired:cx+1 edit distances to ϵ are within δ. “nlis”arewithinδ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters As the user types in a query string px = c1c2 ...cx letter − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set by letter, the active-node set Φp is computed and cached ⟨ ⟩ prefix are “li”, “lin”, “liu”, an d “luis”. x for the new string px+1,wheren ξd = ξn + d nm .This for px.Whentheusertypesinanewcharactercx+1 and | | − | | We develop a caching-based algorithm for incrementally operation corresponds to insertingd , n+|d|-|n severalm| letters after node submits a new query px+1,theservercomputestheactive-3 computing active nodes for a keyword as the user types it nm. node set Φp for px+1 by using Φp .Foreachn, ξn in letter by letter. Given an inputx+1 prefix p, we compute and x ⟨ ⟩ Deletion Match Substitution Insersion in Φpx ,thedescendantsofn are examined as active-node store the set of active nodes Φp = n, ξn ,inwhichn is c1 candidates for p{x⟨+1,asillustratedinFigure5.Forthenode⟩} an active node, and ξn = ed(p, n) δ.Theideabehindour c2 n,ifξn +1 ≤ δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node algorithm is to use prefix-filtering:≤ when the user types in c3 n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node. set Φ .Weconsider one more letter after⟨p,theactivenodesof⟩ p can be used to px+1 px px deleting the last character cx+1 from the new query string . compute the active nodes of the new query. an active node n,pξxn+1 in Φpx . px+1.Noticeevenifξn +1 δ is not true, node n could still n, ⟨ ⟩ . The algorithm works as follows. First, an active-node≤ set n potentially become an active node of the new string, due to cx is initialized for the empty query ϵ,i.e.,Φϵ = n, ξn During then , computation+1 of set Φpx+1 ,itispossibletoadd operations described below on{⟨ other⟩ active| nodes in Φpx . n ξn = n δ .Thatis,itincludesalltrienodesn whose multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. cx+1 | | ≤ } For each child nc of node n,therearetwopossiblecases.n , ⟨ ⟩ ⟨ ⟩ corresponding string has a length n within the edit-distance m n In this case,ns , onlyn+1 the one with the smallest edit operation is Case 1:thechildnode| | nc has a character different from threshold δ.Thesenodesareactivenodesforϵ since their kept in Φp .Thereasonisthat,bydefinition,forthesame cx+1 x+1 edit distances to ϵ arecx within+1.Figure5showsanodeδ. ns for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- As the user types in a query string px = c1c2 ...cx letter node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= by letter, the active-node set Φpx is computed and cached ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active noden for the for px.Whentheusertypesinanewcharactercx+1 and transform the string of v to the string of p . ≤ d , +|d|-|n | x+1 new string, and ns, ξn +1 is added into Φpx+1 .Thiscasen m submits a new query px+1,theservercomputestheactive-⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for node set Φpx+1 for px+1 by using Φpx .Foreachn, ξn cx+1. Case 2:thechildnodenc⟨ has a⟩ label cx+1.Figure5Deletion insertions,Match becauseSubstitution if theseInsersion descendants are active nodes of in Φpx ,thedescendantsofn are examined as active-node shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe candidates for px+1,asillustratedinFigure5.Forthenode processed by substitutions on their parents. n,ifξn +1 δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node ≤ n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node set Φ .Weconsider ⟨ ⟩ px+1 px deleting the last character cx+1 from the new query string an active node n, ξn in Φp . ⟨ ⟩ x px+1.Noticeevenifξn +1 δ is not true, node n could still 374 potentially become an active≤ node of the new string, due to During the computation of set Φpx+1 ,itispossibletoadd operations described below on other active nodes in Φpx . multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. For each child nc of node n,therearetwopossiblecases. In this case, only⟨ the⟩ one⟨ with⟩ the smallest edit operation is Case 1:thechildnodenc has a character different from kept in Φpx+1 .Thereasonisthat,bydefinition,forthesame cx+1.Figure5showsanodens for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active node for the ≤ transform the string of v to the string of px+1. new string, and ns, ξn +1 is added into Φpx+1 .Thiscase ⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for cx+1. Case 2:thechildnodenc has a label cx+1.Figure5 insertions, because if these descendants are active nodes of shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe processed by substitutions on their parents.

374 Intuition: at each symbol i in q, the set of active nodes will be related to the set from qi-1.

WWW 2009So, MADRID! we don’t need to visit every nodeTrack: Search in /the Session: Search UI

trie! ED = 0 ED = 1 ED = 2 0 0 0 0 0

10 10 10 10 10 l l l l l 11 14 11 14 11 14 11 14 11 14 i u 4 i u 4 i u 4 i u 4 i u 4

15 15 15 15 15 1 n u i 1 n u i 1 n u i 1 n u i 1 n u i 12 13 12 13 12 13 12 13 12 13 16 16 16 16 16 s s s s s 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 (a)Initialize (b)query“n”(c)query“nl”(d)query“nli”(e)query“nlis” WWW 2009 MADRID! Track: Search / Session: Search UI Figure 4: Fuzzy search of prefix queries of “nlis”(threshold δ =2).

ED = 0 ED = 1 ED = 2 nodes”), and we need to compute them efficiently. The leaf for “matching.” In this case, ed(nm,px+1) ed(n, px)= ≤ 0 descendantsJi S, Li G, of Li theC, Feng active0 J. Efficient nodes interactive arefuzzy calledkeyword search.0 the WWWpredicted ’09 key- 0 ξn δ.Thus,nm0is an active node of the new string, so we ≤ words of the prefix. For example, consider the trie in Fig- add nm, ξn into Φp .Thiscasecorrespondstothematch ⟨ ⟩ x+1 10 ure 4. Suppose the10 edit-distance threshold10δ =2,andauser 10 between the character10 cx+1 and the label of nm.Onesub- l l l l l types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for the node nm is smaller 11 and “14luis”areallsimilarto11 14 p,sincetheireditdistancesto11 14 11 than14 δ,i.e.,ξ11n < δ,thefollowingoperationisalsorequired:14 i u 4 i u 4 i u 4 i u 4 i “nlisu ”arewithin4 δ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set 15 15 15 15 15 ⟨ ⟩ 1 n u prefixi are “1li”,n “linu”, “liui ”, an d1 “luisn ”. u i 1 n u fori the new1 stringn u px+1,wherei ξd = ξn + d nm .This 12 13 We develop12 a caching-based13 algorithm12 for13 incrementally12 13 operation corresponds12 13 to inserting several letters| | − | after| node 16 16 16 163 16 computings active nodes fors a keyword as the users types it nms. s in letter by letter. Given an input prefix p, we compute and 34 5 7 34 5 7 34 5 7 34 5 7 34 5 7 store the set of active nodes Φp = n, ξn ,inwhichn is c {⟨ ⟩} 1 (a)Initializean active node, and (b)query“ξn = edn(”(p, n) δ.Theideabehindourc)query“nl”(d)query“nli”(e)query“nlis” c2 ≤ algorithmFigure is 4: to Fuzzy use prefix-filtering: search of prefix when queries the user of types “nlis in”(threshold δ =2). c3 . one more letter after p,theactivenodesofp can be used to px . nodes”), and we needcompute to compute the active them e nodesfficiently. of the The new leaf query. for “matching.” In this case, ed(n ,p ) ed(n, p )= px+1 m x+1 n, n x . descendants of the activeThe nodes algorithm are called works the aspredicted follows. key- First, an active-nodeξ δ.Thus, setn is an active node of the new≤ string, so we n m c is initialized for the empty query ϵ,i.e.,Φϵ =≤ n, ξn n , +1 x words of the prefix. For example, consider the trie in Fig- add n{m⟨ , ξn ⟩into| Φpx+1 .Thiscasecorrespondstothematchn ξn = n δ .Thatis,itincludesalltrienodes⟨ n whose⟩ ure 4. Suppose the edit-distance| | ≤ threshold} δ =2,andauser between the character cx+1 and the label of nm.Onesub- cx+1 corresponding string has a length n within the edit-distance nm , n , types in a prefix p =“nlis”. Prefi x es “li”, “lin”, “liu”, tlety here is that, if the distance for then node nm iss smallern+1 threshold δ.Thesenodesareactivenodesfor| | ϵ since their and “luis”areallsimilartop,sincetheireditdistancesto than δ,i.e.,ξn < δ,thefollowingoperationisalsorequired:cx+1 edit distances to ϵ are within δ. “nlis”arewithinδ.Thusnodes11,12,13,and16areactive for each nm’s descendant d that is at most δ ξn letters As the user types in a query string px = c1c2 ...cx letter − nodes for p (Figure 4 (e)). The predicted keywords for the away from nm,weneedtoadd d, ξd to the active-node set by letter, the active-node set Φp is computed and cached ⟨ ⟩ prefix are “li”, “lin”, “liu”, an d “luis”. x for the new string px+1,wheren ξd = ξn + d nm .This for px.Whentheusertypesinanewcharactercx+1 and | | − | | We develop a caching-based algorithm for incrementally operation corresponds to insertingd , n+|d|-|n severalm| letters after node submits a new query px+1,theservercomputestheactive-3 computing active nodes for a keyword as the user types it nm. node set Φp for px+1 by using Φp .Foreachn, ξn in letter by letter. Given an inputx+1 prefix p, we compute and x ⟨ ⟩ Deletion Match Substitution Insersion in Φpx ,thedescendantsofn are examined as active-node store the set of active nodes Φp = n, ξn ,inwhichn is c1 candidates for p{x⟨+1,asillustratedinFigure5.Forthenode⟩} an active node, and ξn = ed(p, n) δ.Theideabehindour c2 n,ifξn +1 ≤ δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node algorithm is to use prefix-filtering:≤ when the user types in c3 n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node. set Φ .Weconsider one more letter after⟨p,theactivenodesof⟩ p can be used to px+1 px px deleting the last character cx+1 from the new query string . compute the active nodes of the new query. an active node n,pξxn+1 in Φpx . px+1.Noticeevenifξn +1 δ is not true, node n could still n, ⟨ ⟩ . The algorithm works as follows. First, an active-node≤ set n potentially become an active node of the new string, due to cx is initialized for the empty query ϵ,i.e.,Φϵ = n, ξn During then , computation+1 of set Φpx+1 ,itispossibletoadd operations described below on{⟨ other⟩ active| nodes in Φpx . n ξn = n δ .Thatis,itincludesalltrienodesn whose multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. cx+1 | | ≤ } For each child nc of node n,therearetwopossiblecases.n , ⟨ ⟩ ⟨ ⟩ corresponding string has a length n within the edit-distance m n In this case,ns , onlyn+1 the one with the smallest edit operation is Case 1:thechildnode| | nc has a character different from threshold δ.Thesenodesareactivenodesforϵ since their kept in Φp .Thereasonisthat,bydefinition,forthesame cx+1 x+1 edit distances to ϵ arecx within+1.Figure5showsanodeδ. ns for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- As the user types in a query string px = c1c2 ...cx letter node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= by letter, the active-node set Φpx is computed and cached ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active noden for the for px.Whentheusertypesinanewcharactercx+1 and transform the string of v to the string of p . ≤ d , +|d|-|n | x+1 new string, and ns, ξn +1 is added into Φpx+1 .Thiscasen m submits a new query px+1,theservercomputestheactive-⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for node set Φpx+1 for px+1 by using Φpx .Foreachn, ξn cx+1. Case 2:thechildnodenc⟨ has a⟩ label cx+1.Figure5Deletion insertions,Match becauseSubstitution if theseInsersion descendants are active nodes of in Φpx ,thedescendantsofn are examined as active-node shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe candidates for px+1,asillustratedinFigure5.Forthenode processed by substitutions on their parents. n,ifξn +1 δ,thenn is an active node for px+1,and Figure 5: Incrementally computing the active-node ≤ n, ξn +1 is added into Φpx+1 .Thiscasecorrespondsto set Φ from the active-node set Φ .Weconsider ⟨ ⟩ px+1 px deleting the last character cx+1 from the new query string an active node n, ξn in Φp . ⟨ ⟩ x px+1.Noticeevenifξn +1 δ is not true, node n could still 374 potentially become an active≤ node of the new string, due to During the computation of set Φpx+1 ,itispossibletoadd operations described below on other active nodes in Φpx . multiple pairs v, ξ1 , v, ξ2 , ... for the same trie node v. For each child nc of node n,therearetwopossiblecases. In this case, only⟨ the⟩ one⟨ with⟩ the smallest edit operation is Case 1:thechildnodenc has a character different from kept in Φpx+1 .Thereasonisthat,bydefinition,forthesame cx+1.Figure5showsanodens for such a child node, where trie node v,onlythepairwiththeeditdistancebetweenthe “s”standsfor“substitution,”themeaningofwhichwillbe- node v and the query string px+1 should be kept in Φpx+1 , come clear shortly. We have ed(ns,px+1) ed(n, px)+1= ≤ which means the minimum number of edit operations to ξn +1. If ξn +1 δ,thenns is an active node for the ≤ transform the string of v to the string of px+1. new string, and ns, ξn +1 is added into Φpx+1 .Thiscase ⟨ ⟩ 3 corresponds to substituting the label of ns for the letter Descendants of node ns do not need to be considered for cx+1. Case 2:thechildnodenc has a label cx+1.Figure5 insertions, because if these descendants are active nodes of shows the node nm for such a child node, where “m”stands Φpx+1 ,theirparentsmustappearinΦpx ,andtheycanbe processed by substitutions on their parents.

374 Related problem: the “similarity join”

We have two bags of words, R and S.

Goal: identify pairs of similar words.

Example: R = { kobe, ebay, ...} S = { bag, koby, ...}

We would want to identify pairs such as

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). Again, one solution is pairwise edit distance calculation...

... but if R and S are very large, that will be incredibly time consuming, even with prefix pruning!

One solution: use the trie search method!

Build a trie representing R;

For every string s in S, identify the active nodes As of R’s trie; for each leaf node r in As, produce .

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). Another solution: use sub-trie pruning

Intuition: given the set of active nodes An for a particular trie node n...

... we can say that only children of nodes in An could possibly be similar to children of node n.

We can use this fact to speed up extraction of similar pairs. Let us consider the case where our two sets are actually one set (R = S), and we simply want to identify similar pairs.

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). Algorithm:

1. Build a trie for our set of words; 2. Traverse the trie in preorder. At each node, compute its set of active nodes A.

3. At each leaf node n, identify any leaf nodes in An; these are similar pairs.

As we traverse, we must keep the current node’s ancestor’s set of active nodes in memory; total is O(δ|AT|).

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). Trie-join: a trie-based method for efficient string similarity joins 441

Fig. 5 An example to use Trie-Traverse algorithm to find all similar pairs (τ 1) = Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4).

strings (Line 2), computes the active-node set of the root node (Line 4), and then calls its subroutine findSimilarPair to find all similar string pairs recursively (Lines 5–6). find- SimilarPair first calculates the active-node set c of A node c based on its parent’s active-note set p (Line 2), using A the above-mentioned incremental algorithm. If c is a leaf node, it calls a subroutine outputSimilarPair to output all the similar string pairs of c (Line 3). Finally, findSim- ilarPair calls itself to compute the similar string pairs of c’s descendants (Lines 6–7). In the worst case, the time complexity of computing c A from its parent’s active-note set p is (τ c ), since each A O ·|A | active node only can be computed from its ancestors within τ steps. Therefore, the time complexity of Trie-Traverse is (τ T )where T is the sum of the numbers of the O ·|A | |A | active-node sets of all the trie nodes in the trie T . When tra- versing the trie nodes, we need to maintain the trie and the Fig. 4 Trie-Traverse algorithm active nodes of ancestors of the current node. Given a leaf nodel, let C(l) denote the sum of the active nodes of ancestors 3.1 Trie-traverse algorithm of node l and Cmax is the maximal value of C(l) among all leaf nodes. The space complexity is ( T Cmax),where T O | |+ | | In this section, we propose a trie-traversal-based method, is the size of trie T .Example1 shows how Trie-Traverse called Trie-Traverse,toimprovetheperformanceofTrie- works. Search. Intuitively, Trie-Traverse utilizes dual subtrie pruning to avoid duplicated computation in Trie-Search. Example 1 Consider the string set and the corresponding trie Trie-Traverse constructs a trie for the strings in both structure in Fig. 5.Initially,weconstructatrieindexforall R and and computes the active-node set for a node in the trie strings. We compute the active-node set of the root node S exactly once even though the node is a prefix of a potentially 0 0, 1, 9, 13 ,whichiscomposedofthenodeswith A ={ } large number of strings. depths within τ 1, since their edit distances to the root node = For ease of presentation, in the following, we focus on (an empty string) are within τ.Then,wecomputeactive- self-join, that is .Ourapproachcanbeeasilyextended node sets of every node using preorder traversal (following R = S to (Sect. 4.2). Trie-Traverse first constructs a trie the dashed lines). This traversal can guarantee that, for each R ̸= S index for all strings in and then traverses the trie in preorder. node, we always compute its parent’s active-node set before S For each trie node, Trie-Traverse computes its active-node its own active-node set. Consider node 2, we use its par- set. When reaching a leaf node l,fors l ,ifs is a leaf ent’s active-node set 1 to compute its active-node set 2. ∈ A A A node (i.e., s ), Trie-Traverse outputs l, s as a simi- Similarly, we compute 3 using 2.Asnode3isaleafnode, ∈ S ⟨ ⟩ A A lar string pair. Figure 4 gives the pseudo-code of the Trie- and node 4 is a leaf node in 3 2, 3, 4, 7 ;thus,weoutput A ={ } Traverse algorithm. It first constructs a trie index for all the similar pair 3, 4 . ⟨ ⟩ ⊓&

123 We can do better!

Intuition: given the set of active nodes An for a particular trie node n...

... we can say that only children of nodes in An could possibly be similar to children of node n.

Also: if node u has node v in its active set, v must also have u in its set!

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). Trie-join: a trie-based method for efficient string similarity joins 441

Fig. 5 An example to use Trie-Traverse algorithm to find all similar pairs (τ 1) = Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4).

strings (Line 2), computes the active-node set of the root node (Line 4), and then calls its subroutine findSimilarPair to find all similar string pairs recursively (Lines 5–6). find- SimilarPair first calculates the active-node set c of A node c based on its parent’s active-note set p (Line 2), using A the above-mentioned incremental algorithm. If c is a leaf node, it calls a subroutine outputSimilarPair to output all the similar string pairs of c (Line 3). Finally, findSim- ilarPair calls itself to compute the similar string pairs of c’s descendants (Lines 6–7). In the worst case, the time complexity of computing c A from its parent’s active-note set p is (τ c ), since each A O ·|A | active node only can be computed from its ancestors within τ steps. Therefore, the time complexity of Trie-Traverse is (τ T )where T is the sum of the numbers of the O ·|A | |A | active-node sets of all the trie nodes in the trie T . When tra- versing the trie nodes, we need to maintain the trie and the Fig. 4 Trie-Traverse algorithm active nodes of ancestors of the current node. Given a leaf nodel, let C(l) denote the sum of the active nodes of ancestors 3.1 Trie-traverse algorithm of node l and Cmax is the maximal value of C(l) among all leaf nodes. The space complexity is ( T Cmax),where T O | |+ | | In this section, we propose a trie-traversal-based method, is the size of trie T .Example1 shows how Trie-Traverse called Trie-Traverse,toimprovetheperformanceofTrie- works. Search. Intuitively, Trie-Traverse utilizes dual subtrie pruning to avoid duplicated computation in Trie-Search. Example 1 Consider the string set and the corresponding trie Trie-Traverse constructs a trie for the strings in both structure in Fig. 5.Initially,weconstructatrieindexforall R and and computes the active-node set for a node in the trie strings. We compute the active-node set of the root node S exactly once even though the node is a prefix of a potentially 0 0, 1, 9, 13 ,whichiscomposedofthenodeswith A ={ } large number of strings. depths within τ 1, since their edit distances to the root node = For ease of presentation, in the following, we focus on (an empty string) are within τ.Then,wecomputeactive- self-join, that is .Ourapproachcanbeeasilyextended node sets of every node using preorder traversal (following R = S to (Sect. 4.2). Trie-Traverse first constructs a trie the dashed lines). This traversal can guarantee that, for each R ̸= S index for all strings in and then traverses the trie in preorder. node, we always compute its parent’s active-node set before S For each trie node, Trie-Traverse computes its active-node its own active-node set. Consider node 2, we use its par- set. When reaching a leaf node l,fors l ,ifs is a leaf ent’s active-node set 1 to compute its active-node set 2. ∈ A A A node (i.e., s ), Trie-Traverse outputs l, s as a simi- Similarly, we compute 3 using 2.Asnode3isaleafnode, ∈ S ⟨ ⟩ A A lar string pair. Figure 4 gives the pseudo-code of the Trie- and node 4 is a leaf node in 3 2, 3, 4, 7 ;thus,weoutput A ={ } Traverse algorithm. It first constructs a trie index for all the similar pair 3, 4 . ⟨ ⟩ ⊓&

123 We can compute active nodes as we build the trie, and eliminate duplicate calculations. Trie-join: a trie-based method for efficient string similarity joins 443

Fig. 7 An example to use Trie-Dynamic algorithm to find all similar pairs (τ 1) = (a)

(b) (c) (d)

Fig. 7c, we find that 2, 3, 7 are different. Because after and only if both the nodes s and t are leaf nodes of t ,ands A A A By increasing our space complexity, we canT reduce we insert node 8 and compute 2, 3, 7, 8 ,weupdate t 8 is in tT . Based on the proved claim, after adding the node A ={ } At the active-node sets of nodes in 8(nodesthe 2,time 3, 7). For complexity each t, T is correctly to O computed.( A T ) Also. note that for each string s A At 2| | node n in 8,weaddnode8ton’s active-node set based on in S and P(s) P(t),itscorrespondingnodein t must be A Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins.≤ The VLDB Journal — The International Journal on Very Large DataT Bases. 2012 Aug;21(4). the symmetry property. aleafnode.Therefore,thetheoremisproved. ⊓" ⊓" Theorem 2 Given a set of strings and an edit-distance S threshold τ, Trie-Dynamic can compute all similar string 3.3 Trie-PathStack algorithm pairs s , t such that ed(s, t) τ. ⟨ ∈ S ∈ S⟩ ≤ When inserting a new string, Trie-Dynamic may gener- Proof As Trie-Dynamic constructs the trie structure ate some new nodes and append them as children of any dynamically, let v denote the trie index constructed from T existing node.Thus,Trie-Dynamic may use active-node all the nodes inserted before v (including itself). For a node sets of any existing node to compute the active-node sets v in v,letP(v) denote the number of nodes in v. v’s node T T T of newly added nodes. For example, in Fig. 7d, when insert- set V ( v) u V ( ) P(u) P(v) and its edge set T ={ ∈ T | ≤ } ing a string “bay”, Trie-Dynamic generates a new node 8, E( v) (ui , u j ) E( ) ui V ( v), u j V ( v) .In appends it as a child of existing node 2, and uses the active- T ={ v ∈ T | ∈ T ∈ T } addition, let uT t V ( v) ed(u, t) τ denote the node set of node 2 to compute the active-node set of the A ={ ∈ T | ≤ } active-node set of u w.r.t v. newly inserted node 8. Thus, although Trie-Dynamic avoids T We first prove that after adding a new node v in the trie, unnecessary active-node computation introduced by Trie- v for each node u of v, Trie-Dynamic can compute uT cor- Traverse, Trie-Dynamic involve large memory space to T A rectly. We prove this claim by induction. The base case for maintain the active-node sets of all trie nodes.2 Recall Trie- the root node clearly holds since after adding a root node Traverse,itfirstconstructsatrieindexforallstringsand r r, r has only one node and rT r is correct. Assume then gets similar string pairs by traversing the trie in preorder. T A ={} this claim holds for a node v. We want to prove that it also Throughout the algorithm, the maximal number of active- holds for the next added node v (i.e., P(v ) P(v) 1). node sets that Trie-Traverse needs to maintain is the same ′ ′ = + For each node u in v ,ifu v ,sincetheactive-nodesetof as the maximal depth of trie leaf nodes. To summarize, Trie- T ′ = ′ u’s parent w.r.t v is correctly computed (as u’s parent is in Traverse uses little memory space but involves unnecessary T v v v), Trie-Dynamic can correctly compute uT ′ (i.e., T ′ ); active-node computation; on the contrary, Trie-Dynamic T A Av′ v avoids such repeated computation but involves large memory if u v′,thenu must be a node of v,so uT is correctly ̸= v T A v space. computed. Trie-Dynamic uses uT to compute uT ′ .Ifu is v A A To address this problem, we propose a new algorithm, in T ′ ,thened(v′, u) τ;thus,Trie-Dynamic can obtain Av′ ≤ called Trie-PathStack,whichnotonlyrequireslittlemem- v v v uT ′ correctly by adding v′ to uT ;ifu is not in T ′ ,then ory space but also achieves much higher performance. A A Av′ v ed(v , u)>τ;thus,Trie-Dynamic can obtain uT ′ cor- Intuitively, Trie-PathStack can integrate the ideas of Trie- ′ A rectly by setting it as v .Therefore,foreachnodeu of , Dynamic and Trie-Traverse together. To achieve higher uT v′ A v T Trie-Dynamic can compute uT ′ correctly. Thus, our claim A 2 If we first sort the strings and then dynamically insert them into the is true. trie, Trie-Dynamic need not maintain all active-node sets. However, Obviously for each string pair s , t (Without ⟨ ∈ S ∈ S⟩ it has two problems: (1) it involves an additional sorting step; (2) it is loss of generality, suppose P(s) P(t).), ed(s, t) τ if still expensive to update the active-node sets (the symmetry property). ≤ ≤ 123 There are extensions to the idea that allow for different sets of strings, more space-efficient construction, etc.

See the Feng, et al. article (cited below) for more details!

Feng J, Wang J, Li G. Trie-join: a trie-based method for efficient string similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases. 2012 Aug;21(4). WWW 2009 MADRID! Track: Search / Session: Search UI

Figure 2: Interactive fuzzy search on the UC Irvine people directory (http://psearch.ics.uci.edu).

Interactive:Thesystemsearchesforthebestanswers“onthe ciency on a large amount of data is especially challenging fly” as the user types in a keyword query; (2) Fuzzy:When because of two reasons. First, we allow the query keywords searching for relevant records, the system also tries to find to appear in different attributes with an arbitrary order, those records that include words similar to the keywords in and the “on-the-fly join” nature of the problem can be com- the query, even if they do nothttp://psearch.ics.uci.edu match exactly. putationally expensive. Second, we want to support fuzzy We have developed several prototypes using this paradigm. search by finding records with keywords that match query The first one supports search on the UC Irvine people direc- keywords approximately. tory. A screenshot is shown in Figure 2. In the figure, a user We develop novel solutions to these problems. We present has typed in a query string “professor smyt”. Even th ou gh several incremental-search algorithms for answering a query the user has not typed in the second keyword completely, by using cached results of earlier queries. In this way, the the system can already find person records that might be computation of the answers to a query can spread across of interest to the user. Notice that the two keywords in multiple keystrokes of the user, thus we can achieve a high the query string (including a partial keyword “smyt”) can speed. Specifically, we make the following contributions. appear in different attributes of the records. In particular, (1) We first study the case of queries with a single keyword, in the first record, the keyword “professor”appearsinthe and present an incremental algorithm for computing key- “title” attribute, and the partial keyword “smyt”appearsin word prefixes similar to a prefix keyword (Section 3). (2) the “name” attribute. The matched prefixes are highlighted. For queries with multiple keywords, we study various tech- The system can also find records with words that are sim- niques for computing the intersection of the inverted lists ilar to the query keywords, such as a person name “smith”. of query keywords, and develop a novel algorithm for com- The feature of supporting fuzzy search is especially impor- puting the results efficiently (Section 4.1). Its main idea tant when the user has limited knowledge about the under- is to use forward lists of keyword IDs for checking whether lying data or the entities he or she is looking for. As the user arecordmatchesquerykeywordconditions(evenapproxi- types in more letters, the system interactively searches on mately). (3) We develop a novel on-demand caching tech- the data, and updates the list of relevant records. The sys- nique for incremental search. Its idea is to cache only part tem also utilizes a-priori knowledge such as synonyms. For of the results of a query. For subsequent queries, unfin- instance, given the fact that “william”and“bill”aresyn- ished computation will be resumed if the previously cached onyms, the system can find a person called “William Kropp” results are not sufficient. In this way, we can efficiently com- when the user has typed in “bill crop”. Th is search p roto- pute and cache a small amount of results (Section 4.2). (4) type has been used regularly by many people at UCI, and We study various features in this paradigm, such as how to received positive feedback due to the friendly user inter- rank results properly, how to highlight keywords in the re- face and high efficiency. Another prototype, available at sults, and how to utilize domain-specific information such http://dblp.ics.uci.edu, supports search on a DBLP dataset as synonyms to improve search (Section 5). (5) In addition (http//www.informatik.uni-trier.de/ ley/db/) with about to deploying several real prototypes, we conducted a thor- 1millionpublicationrecords.Athirdprototype,availableat∼ ough experimental evaluation of the developed techniques http://pubmed.ics.uci.edu, supports search on 3.95 million on real data sets, and show the practicality of this new com- MEDLINE records (http://www.ncbi.nlm.nih.gov/pubmed). puting paradigm. All experiments were done using a single In this paper we study research challenges that arise nat- desktop machine, which can still achieve response times of urally in this computing paradigm. The main challenge is milliseconds on millions of records. the requirement of a high efficiency. To make search re- ally interactive, for each keystroke on the client browser, 1.1 Related Work from the time the user presses the key to the time the re- Prediction and Autocomplete:1 There have been many sults computed from the server are displayed on the browser, studies on predicting queries and user actions [17, 14, 9, 19, the delay should be as small as possible. An interactive 18]. With these techniques, a system predicts a word or speed requires this delay be within milliseconds. Notice that this time includes the network transfer delay, execution time 1The word “autocomplete” could have different meanings. on the server, and the time for the browser to execute its Here we use it to refer to the case where a query (possibly javascript (which tends to be slow). Providing a high effi- with multiple keywords) is treated as a single prefix.

372 Plan for today:

Tries Simple uses of tries Fuzzy search with tries Levenshtein automata Levenshtein automata

A different approach to solving the fuzzy matching problem uses finite-state automata.

The basic idea: construct an acceptor that will recognize an input string with up to δ edits.

Then, walk through the acceptor and our dictionary, emitting any final states we visit.

Schulz KU, Mihov S. Fast string correction with Levenshtein automata. International Journal on Document Analysis and Recognition. 2002 Nov 1;5(1):67–85.

Mihov S, Schulz KU. Fast Approximate Search in Large Dictionaries. Computational Linguistics. 2004 Dec;30(4). Levenshtein automata The basic idea: construct an acceptor that will recognize an input string with up to δ edits. We have seen something not entirely dissimilar:

Levenshtein automata

In a sense, our old friend the edit-distance transducer is a step along the path towards a Levenshtein transducer.

The difference: the edit-distance transducer will allow infinite insertions or deletions...

... and we need to limit the total number of such events. Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Character2#5 3#5 4#5 position5#5 6#5 7 #5(mantissa)8#5 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic Levenshtein automaton of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Number2#5 3#5 4of#5 edits5#5 6#5 thus7#5 8far#5 (exponent)0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic Levenshtein automaton of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Match2#5 3#5 4#5 5#5 6#5 7#5 8#5 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic Levenshtein automaton of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Substitution2#5 3#5 4#5 5#5 6#5 7#5 8#5 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic Levenshtein automaton of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Deletion2#5 3#5 4#5 5#5 6#5 7#5 8#5 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

72 K.U. Schulz, S. Mihov

0#2 1#2 2#2 3#2 44#2 5#2 w1 w2 w3 w4 w5 ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#1 1#1 2#1 3#1 44#1 5#1 w1 w2 w3 w4 w5

ε ε ε ε ε ¬w1 ¬w2 ¬w3 ¬w4 ¬w5

0#0 1#0 2#0 3#0 44#0 5#0 w1 w2 w3 w4 w5

Fig. 3 Non-deterministic Levenshtein automaton of degree 2 for arbitrary input W = w1w2w3w4w5 of length 5

• • • • • • • • • • • • • • • • • • #5 #5 #5 #5 #5 #5 #5 #5 #5 0#5 1#5Insertion2#5 3#5 4#5 5#5 6#5 7#5 8#5 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #4 #4 #4 #4 #4 #4 #4 #4 #4 0#4 1#4 2#4 3#4 4#4 5#4 6#4 7#4 8#4 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #3 #3 #3 #3 #3 #3 #3 #3 #3 0#3 1#3 2#3 3#3 4#3 5#3 6#3 7#3 8#3 0 1 2 3 4 5 6 7 8 • • • • • • • • • Schulz KU, Mihov• S. Fast •string correction• with• Levenshtein• automata.• International• Journal• on Document• Analysis and#2 Recognition.#2 2002#2 Nov 1;5(1):67–85.#2 #2 #2 #2 #2 #2 0#2 1#2 2#2 3#2 4#2 5#2 6#2 7#2 8#2 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #1 #1 #1 #1 #1 #1 #1 #1 #1 0#1 1#1 2#1 3#1 4#1 5#1 6#1 7#1 8#1 0 1 2 3 4 5 6 7 8 • • • • • • • • • • • • • • • • • • #0 #0 #0 #0 #0 #0 #0 #0 #0 0#0 1#0 2#0 3#0 4#0 5#0 6#0 7#0 8#0 0 1 2 3 4 5 6 7 8

Fig. 4 Positions and accepting positions for n = 5 and w =8 Fig. 5 Subsumption triangles (cf. Example 4)

Accepting positions can be considered as final states of heavily depends on identities between letters w ,... ,w . 1 5 the non-deterministic Levenshtein automata described For this reason, a uniform deterministic Levenshtein au- above. tomaton for arbitrary input of length 5 cannot be given. Two examples of deterministic Levenshtein automata for Example 3 For n = 5 and w = 8, the set of all positions input of length 5 (of degree 1) can be found in Figs. 8 is depicted in Fig. 4. Accepting positions are marked. and 9. Definition 9 A position i♯e subsumes a position j♯f iff Positions and states. In order to characterize the states e0, otherwise it is called a base position. language

♯e Base positions can be considered as “error-free” posi- Φ(i ) := Lev(n e, xi+1 xw). tions. L − ··· ♯e ♯f Let π := i and π′ := j be two distinct positions. If π ♯e Definition 8 A position i is accepting iff w i n e. subsumes π′, then Φ(π′) is a subset of Φ(π). − ≤ − Levenshtein automata

foof doof

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof dora

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof dora

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof dora foods

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof dora foods feed

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

foof doof dora foods feed

http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

This is a non-deterministic representation; actually using NFAs in practice is often tricky.

Luckily, NFAs can be determinized, which is generally how Levenshtein automata are actually used. http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata Levenshtein automata

On its own, having a Levenshtein automaton of a query word improves even the naïve approach (pairwise comparison):

Instead of a large set of O(nm) computations, we have a large set of O(n) computations! Levenshtein automata

We can do better, however.

Represent our dictionary as a trie, DAWG, etc....

... and walk through it and our determinized automaton together in tandem. At each state we encounter, follow edges that both have in common. Any time both are in final states, we’ve got a match! f g

o o

o r o r

e d d l d p fore ford fool food good gorp