Dictionary Look-Up Within Small Edit Distance
Abdullah N. Arslan and Ömer Eğecioğlu
Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA, USA
{arslan,omer}@cs.ucsb.edu

Abstract. Let $W$ be a dictionary consisting of $n$ binary strings of length $m$ each, represented as a trie. The usual $d$-query asks if there exists a string in $W$ within Hamming distance $d$ of a given binary query string $q$. We present an algorithm to determine if there is a member in $W$ within edit distance $d$ of a given query string $q$ of length $m$. The method takes time $O(dm^{d+1})$ in the RAM model, independent of $n$, and requires $O(dm)$ additional space.

Introduction

Let $W$ be a dictionary consisting of $n$ binary strings of length $m$ each. A $d$-query asks if there exists a string in $W$ within Hamming distance $d$ of a given binary query string $q$. Algorithms for answering $d$-queries efficiently have been a topic of interest for some time, and have also been studied as the approximate query and the approximate query retrieval problems in the literature. The problem was originally posed by Minsky and Papert, who asked whether there is a data structure that supports fast $d$-queries.

The cases of small $d$ and large $d$ for this problem seem to require different techniques for their solutions. The case when $d$ is small was studied by Yao and Yao. Dolev et al. and Greene et al. have made some progress when $d$ is relatively large. There are efficient algorithms only when $d = 1$, proposed by Brodal and Venkatesh, Yao and Yao, and Brodal and Gasieniec. The small $d$ case has applications in password security. Searching biological sequence databases may also use the methods of answering $d$-queries.

Previous studies of the $d$-query problem have focused on minimizing the number of memory accesses for a $d$-query, assuming other computations are free, and used cell or bit probe models to express complexity. We assume a RAM model with constant memory access time and take into account all computations in the complexity analysis. Dolev et
al. presented bounds for the space and time complexity of the $d$-query problem under certain assumptions, using various notions of proximity. In their model, $W$ is stored in buckets and preprocessing of $W$ is allowed.

Supported in part by NSF Grant No. EIA.

In this paper we consider answering $d$-queries efficiently without limiting ourselves to the construction of a new data structure parametrized by $d$. The variant of the original $d$-query problem that we consider is the one in which the string-to-string edit distance is used as the distance measure instead of the ordinary case of Hamming distance. We assume that $W$ is stored as a trie $T_m$ and propose two algorithms for the $d$-query problem in this case. Our algorithms use the hybrid tree/dynamic programming approach. The first one, Algorithm LOOKUP$_{ed}$ (shown in a later figure), requires $O(dm^{d+2})$ time in the worst case and $O(dm^{d+1})$ space in addition to the space requirements of the trie $T_m$. This complexity is of interest for the small values of $d$ under investigation. The second algorithm, Algorithm DFT-LOOKUP$_{ed}$ (also shown in a later figure), has time complexity $O(dm^{d+1})$ and additional space complexity of only $O(dm)$. There is reason to believe that the average performance of both algorithms is much better when $W$ is sparse.

Motivation: Hamming Distance Based Methods

The Hamming distance between two binary strings is the number of positions in which they differ. A $d$-query asks if there is a member in a dictionary $W$ whose Hamming distance is at most $d$ from a given binary query string $q$. We assume a trie representation $T_m$ for $W$, and assume for simplicity that $W$ consists of binary words of length $m$ each. A trie is a tree whose arcs are labeled by the symbols of an alphabet $\Sigma$, in this case $\Sigma = \{0, 1\}$. The leaf nodes of $T_m$ correspond to the words in $W$, and, when concatenated, the labels of the arcs on a path from the root to a given intermediate node give a prefix of at least one word in $W$. Clearly, in the RAM model assumed, accessing a word in $W$ takes $O(m)$ time. Figure 1(a) shows an example trie $T_5$ representing a dictionary $W = \{\ldots\}$.
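The trie representation and the Hamming distance described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' data structure: the names `TrieNode`, `build_trie`, `contains` and `hamming` are hypothetical, and the three example words are made up for the demonstration rather than taken from Figure 1.

```python
from typing import Dict, Iterable


class TrieNode:
    """One trie node; children are keyed by the arc label '0' or '1'."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert equal-length binary words; each root-to-leaf path spells a word."""
    root = TrieNode()
    for word in words:
        node = root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
    return root


def contains(root: TrieNode, query: str) -> bool:
    """Exact look-up in O(m) steps by following one arc per query symbol.

    Assumes len(query) equals the common word length m, so a path of
    length m that exists in the trie ends at a leaf, i.e. at a word of W.
    """
    node = root
    for symbol in query:
        child = node.children.get(symbol)
        if child is None:
            return False
        node = child
    return True


def hamming(p: str, q: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical dictionary of binary words with m = 5.
root = build_trie(["00101", "01011", "10110"])
print(contains(root, "01011"))      # exact member of the dictionary
print(hamming("00101", "01011"))    # positions 2, 3 and 4 differ
```

Storing children in a dictionary keyed by the arc label keeps each step of the look-up constant time, which matches the $O(m)$ access cost assumed in the text for a word of length $m$.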
Fig. 1. (a) An example trie with binary words, and (b) the numbers in italic are the node weights computed with respect to a query string.

A naive method for answering a $d$-query is to generate the whole set of $\sum_{k=0}^{d} \binom{m}{k}$ strings differing from $q$ in at most $d$ positions, and with every string generated perform a dictionary lookup in $T_m$ for an exact member in $W$. This naive generate-and-test algorithm takes $O(m^{d+2})$ time and $O(m)$ additional space to store a generated string at a time.

Another naive method is to add all strings within Hamming distance $d$ from any member in $W$ to obtain a bigger dictionary $W'$. Then any $d$-query can be answered in $O(m)$ time using the corresponding trie $T'_m$ for an exact member. This latter method significantly increases the size of $W$, by a number of roughly $O(nm^d)$ $m$-bit members. The cost of constructing and maintaining $T'_m$ may be extremely high.

For Hamming distance we can improve the first naive algorithm above as follows. Let $s(v)$ denote the prefix corresponding to trie node $v$. Given a query string $q$, suppose that we assign weight $w_h$ to each trie node $v$ in $T_m$ as

$$w_h(v) = h(s(v), q_{1..|s(v)|}) \qquad (1)$$

where $h$ denotes Hamming distance. As an example, in Figure 1(b) the weights of the nodes have been computed with respect to a query string $q$. The idea is that we can prune the trie in our search for $q$ at the nodes in $T_m$ with $w_h(v) > d$.

Lemma 1. Let $N$ be the number of nodes in $T_m$ with weight $\le d$ as defined in (1). Then $N = O(m^{d+1})$.

Proof. It is easy to see that $N$ is maximized over all tries $T_m$ when $T_m$ is a complete trie over $\Sigma$, i.e. $T_m$ contains all binary strings of length $m$. Figure 2 shows node weights of a complete trie with respect to a query string $q$ up to level 4, starting with the root at level 0. The root has weight 0. For any other vertex $v$ at level $l$, if the arc from its parent to $v$ has label $q_l$, the $l$-th symbol of query string $q$, then $v$ and its parent have the same weight; otherwise the weight of $v$ is 1 more than that of its
parent. Let $L(l, w)$ denote the number of vertices with weight $w$ at level $l$ of the complete trie $T_m$. At any level $l$ the largest weight is $l$. Using these observations we see that

$$L(l, w) = L(l-1, w) + L(l-1, w-1) \quad \text{with } 1 \le w \le l, \text{ and } L(l, 0) = 1 \text{ for } l \ge 0.$$

Therefore $L(l, w)$ is the binomial coefficient $\binom{l}{w}$. Furthermore, since the smallest level at which weight $w$ appears in $T_m$ is $l = w$, the total number of vertices with weight $w$ in $T_m$ is

$$\binom{w}{w} + \binom{w+1}{w} + \cdots + \binom{m}{w} = \binom{m+1}{w+1}.$$

Hence

$$N = \sum_{w=0}^{d} \binom{m+1}{w+1} = O(m^{d+1}).$$

Based on the above lemma, Figure 3 outlines Algorithm LOOKUP$_h$ for dictionary lookup within Hamming distance $d$. The algorithm explores all nodes $v$ in $T_m$ with weight $w_h(v) \le d$, i.e. $s(v)$ is a prefix of a word in $W$ whose Hamming distance from $q$ is potentially within $d$. $S_k$ stores the set of node-weight pairs $(v, w_h(v))$ for all nodes $v$ at level $k$ with weight $w_h(v) \le d$. The algorithm iteratively computes $S_k$ from $S_{k-1}$ by collecting all pairs $(va, w_h(v) + h(a, q_k))$ in $S_k$, where $(v, w_h(v)) \in S_{k-1}$, $w_h(v) + h(a, q_k) \le d$ and $a \in \{0, 1\}$. Here $va$ denotes the child of $v$ reached by the arc labeled $a$.

Fig. 2. In a complete binary trie of height 4, weights with respect to a given binary query string are shown in italic.

Clearly, if there is a member in $W$ within Hamming distance $d$ then it will be captured in $S_m$, in which case the algorithm returns YES; otherwise it returns NO.

$S_k$ contains $O(m^{d+1})$ node-weight pairs by Lemma 1. Therefore the time complexity in the assumed model is $O(m^{d+2})$. It also requires additional space to store $O(m^{d+1})$ trie nodes. The time complexity is no better in the worst case than that of the naive algorithm, which generates and tries all possible strings within Hamming distance $d$ from $q$. However, for a sparse dictionary $W$, Algorithm LOOKUP$_h$ is bound to be much faster on the average.

Algorithm LOOKUP$_h$($d$)
  $S_0 = \{(v_0, 0)\}$
  for $k = 1$ to $m$ do
    $S_k = \{(va,\ w_h(v) + h(a, q_k)) \mid (v, w_h(v)) \in S_{k-1},\ w_h(v) + h(a, q_k) \le d \text{ and } a \in \Sigma\}$
  Return YES if the minimum weight in $S_m$ is $\le d$, NO otherwise

Fig. 3. Algorithm LOOKUP$_h$ for dictionary lookup within Hamming distance $d$.

Edit Distance
Based $d$-queries

Algorithm LOOKUP$_h$ essentially computes the Hamming distance between $q$ and a selected part of dictionary $W$. Next we investigate the possibility of using a similar idea for $d$-queries defined with respect to the edit distance.

For the purposes of this paper we use a simple type of edit distance. Given two strings $p = p_1 \cdots p_m$ and $q = q_1 \cdots q_m$, the edit distance $ed(p, q)$ is the minimum number of edit operations which transforms $p$ into $q$. The edit operations are of three types: insert, delete and substitute. Substituting a symbol by itself is called a match. A match operation is the only operation that does not contribute to the number of steps of the transformation. In terms of costs, all edit operations have cost 1 except for the match, whose cost is 0.

The usual framework for the analysis of edit distance is the edit graph. The edit graph $G_{p,q}$ is a directed acyclic graph having $(m+1)^2$ lattice points $(u, v)$ as vertices for $0 \le u, v \le m$. Horizontal and vertical arcs correspond to insert and delete operations, respectively. The diagonal arcs correspond to substitutions. Each arc has a cost corresponding to the edit operation it represents. If we trace the arcs of a path from node $(0, 0)$ to an intermediate node $(i, j)$ and perform the indicated edit operations in the given order on $p_1 \cdots p_i$, then we obtain $q_1 \cdots q_j$. The edit distance between prefixes $p_1 \cdots p_i$ and $q_1 \cdots q_j$ is the cost of the minimum-cost