Approximate Dictionary Queries

Approximate Dictionary Queries 1 2 Gerth Stlting Bro dal and Leszek Gasieniec BRICS Computer Science Department Aarhus University Ny Munkegade DK Arhus C Denmark MaxPlanck Institut fur Informatik Im Stadtwald Saarbruc ken D Germany Abstract Given a set of n binary strings of length m each We consider the problem of answering dqueries Given a binary query string of length m a dquery is to rep ort if there exists a string in the set within Hamming distance d of We present a data structure of size O nm supp orting queries in time O m and the rep orting of all strings within Hamming distance of in time O m The data structure can b e constructed in time O nm A slightly mo died version of the data structure supp orts the insertion of new strings in amortized time O m Intro duction Let W fw w g b e a set of n binary strings of length m each ie w 1 n i m f g The set W is called the dictionary We are interested in answering d m queries ie for any query string f g to decide if there is a string w in i W with at most Hamming distance d of Minsky and Pap ert originally raised this problem in Recently a se quence of pap ers have considered how to solve this problem eciently Manb er and Wu considered the application of approximate dic tionary queries to password security and sp elling correction of bibliographic les Their metho d is based on Blo om lters and uses hashing techniques Dolev et al and Greene Parnas and Yao considered approximate dictionary queries for the case where d is large The initial eort towards a theoretical study of the small d case was given by Yao and Yao in They present for the case d a data structure supp ort ing queries in time O m log log n with space requirement O nm log m Their solution was describ ed in the cellprob e mo del of Yao with word size equal to In this pap er we adopt the standard unit cost RAM mo del Supp orted by the Danish Natural Science Research Council Grant No This research was done while visiting the MaxPlanck Institut fur Informatik Saabruc ken Germany Email gerthdaimiaaudk On leave from Institute of Informatics Warsaw University ul Banacha Warszawa Poland WWW httpzaamimuwedupllechulechuhtml Email leszekmpisbmpgde Basic Research in Computer Science a Centre of the Danish National Research Foundation For the general case where d dqueries can b e answered in optimal P d m space O nm doing exact queries each requiring time O m by using i=0 i the data structure of Fredman Komlos and Szemeredi On the other hand dqueries can b e answered in time O m when the size of the data structure can P d m b e O n We present the corresp onding data structure of size O nm i=0 i for the query case We present a simple data structure based on tries which has optimal size O nm and supp orts queries in time O m Unfortunately we do not know how to construct the data structure in time O nm and we leave this as an op en problem However we give a more involved data structure of size O nm based on two tries supp orting queries in time O m and which can b e constructed in time O nm Both data structures supp ort the rep orting of all strings with Hamming distance at most one of the query string in time O m For general P d1 m d b oth data structures supp ort dqueries in time O m The second i=0 i data structure can b e made semidynamic in terms of allowing insertions in amortized time O m when starting with an initially empty dictionary Both data structures work as well for larger alphab ets j j when the query time is slowed down by a log j j factor The pap er is organized as follows In Sect we give a simple O nm size data structure supp orting queries in time O m In Sect we present an O nm size data structure constructible in time O nm which also supp orts queries in time O m In Sect we present a semidynamic version of the second data structure allowing insertions Finally in Sect we give concluding remarks and mention op en problems A trie based data structure We assume that all strings considered are over a binary alphab et f g We R let jw j denote the length of w w i denote the ith symb ol of w and w denote w reversed The strings in the dictionary W are called dictionary strings We let dist u v denote the Hamming distance b etween the two strings u and v H The basic comp onent of our data structure is a trie A trie also called a digital search tree is a tree representation of a set of strings In a trie all edges are lab eled by symb ols such that every string corresp onds to a path in the trie A trie is a prex tree ie two strings have a common path from the ro ot as long as they have the same prex Since we consider strings over a binary alphab et the maximum degree of a trie is at most two Assume that all strings w W are stored in a dimensional array A of size i W n m ie of n rows and m columns such that the ith string is stored in the ith row of the array A Notice that A i j is the j th symb ol w For every string W W i m w W we dene a set of associated strings A fv f g jdist v w g i i H i where jA j m for i n The main data structure is a trie T containing i all strings w W and all strings from A for all i n ie every path i i from the ro ot to a leaf in the trie represents one of the strings The leaves of T are lab eled by indices of dictionary strings such that a leaf representing a string s and lab eled by index i satises that s w or s A i i Given a query string an query can b e answered as follows The query is answered p ositively if there is an exact match ie w W or A i j for some j n Thus the query is answered p ositively if and only if there is a leaf in the trie T representing the query string This can b e checked in time O m by a topdown traverse in T If the leaf exists then the index stored at the leaf is an index of a matched dictionary string Notice that T has at most O nm leaves b ecause it contains at most O nm dierent strings Thus T has at most O nm internal vertices with degree greater than one If we compress all chains in T into single edges we get a compressed 0 trie T of size O nm Edges which corresp ond to compressed chains are lab eled by prop er intervals of rows in the array A If a compressed chain is a substring W of a string in the a A then the information ab out the corresp onding substring j of w is extended by the p osition of the changed bit Since every entry in A j W can b e accessed in constant time every query can still b e answered in time O m 0 A slight mo dication of the trie T allows all dictionary strings which match 0 the query string to b e rep orted At every leaf s representing a string u in T instead of one index we store all indices i of dictionary strings satisfying s w i or s A Notice that the total size of the trie is still O nm since every index i i for i n is stored at exactly m leaves The rep orting algorithm rst nds the leaf representing the query string and then rep orts all indices stored at that leaf There are at most m rep orted string thus the rep orting algorithm works in time O m Thus the following theorem holds Theorem There exists a data structure of size O nm which supports the reporting of al l matched dictionary strings to an query in time O m The data structure ab ove is quite simple o ccupies optimally space O nm and allows queries to b e answered optimally in time O m But we do not know how to construct it in time O nm The straight forward approach gives a 2 construction time of O nm this is the total size of the strings in W and the asso ciated strings from all A sets i In the next section we give another data structure of size O nm supp orting queries in time O m and constructible in optimal time O nm A doubletrie data structure In the following we assume that all strings in W are enumerated according to their lexicographical order We can satisfy this assumption by sorting the strings in W for example by radix sort in time O nm Let I f ng denote the set of the indices of the enumerated strings from W We denote a set of consecutive indices consecutive integers an interval The new data structure is comp osed of two tries The trie T contains the W set of stings W whereas the trie T W where contains all strings from the set W R jw W g W fw i i Since T is a prex trie every path from the ro ot to a vertex u represents a W prex p of a string w W Denote by W the set fw W jw has prex p g u i u i i u Since strings in W are enumerated according to their lexicographical order those indices form an interval I ie w W if and only if i I Notice that an u i u u interval of a vertex in the trie T is the concatenation of the intervals of its W children For each vertex u in T we compute the corresp onding interval I W u storing at u the rst and last index of I u Similarly every path from the ro ot to a vertex v in T represents a reversed W R v sux s of a string w W Denote by W the set fw W jw has sux s g j i i v v v and by S I the set of indices of strings in W We organize the indices of v every set S in sorted lists L in increasing order At the ro ot r of the trie v v the list L is supp orted by a search tree maintaining the indices of all the T r W dictionary strings For an index in a list L the neighb or with the smaller value v is called left neighb or and the one with greater value is called right neighb or If then S and S are identical If a vertex x is the only child of vertex v T x v W vertex v T has two children x and y there are at most two children since W T is a binary trie the sets S and S form

Load more