
Approximate Dictionary Queries

Gerth Stølting Brodal (1) and Leszek Gąsieniec (2)

(1) BRICS, Computer Science Department, Aarhus University, Ny Munkegade, Århus C, Denmark

(2) Max-Planck Institut für Informatik, Im Stadtwald, Saarbrücken, Germany

Abstract. Given a set of $n$ binary strings of length $m$ each, we consider the problem of answering $d$-queries. Given a binary query string $\alpha$ of length $m$, a $d$-query is to report if there exists a string in the set within Hamming distance $d$ of $\alpha$.

We present a data structure of size $O(nm)$ supporting 1-queries in time $O(m)$ and the reporting of all strings within Hamming distance 1 of $\alpha$ in time $O(m)$. The data structure can be constructed in time $O(nm)$. A slightly modified version of the data structure supports the insertion of new strings in amortized time $O(m)$.

1 Introduction

Let $W = \{w_1, \ldots, w_n\}$ be a set of $n$ binary strings of length $m$ each, i.e. $w_i \in \{0,1\}^m$. The set $W$ is called the dictionary. We are interested in answering $d$-queries, i.e. for any query string $\alpha \in \{0,1\}^m$ to decide if there is a string $w_i$ in $W$ with Hamming distance at most $d$ from $\alpha$.

Minsky and Papert originally raised this problem in their book Perceptrons. Recently a sequence of papers has considered how to solve this problem efficiently. Manber and Wu considered the application of approximate dictionary queries to password security and spelling correction of bibliographic files. Their method is based on Bloom filters and uses hashing techniques. Dolev et al. and Greene, Parnas and Yao considered approximate dictionary queries for the case where $d$ is large.

The initial effort towards a theoretical study of the small $d$ case was given by Yao and Yao. They present, for the case $d = 1$, a data structure supporting queries in time $O(m \log\log n)$ with space requirement $O(nm \log m)$. Their solution was described in the cell-probe model of Yao with word size equal to 1. In this paper we adopt the standard unit cost RAM model.

Gerth Stølting Brodal: Supported by the Danish Natural Science Research Council. This research was done while visiting the Max-Planck Institut für Informatik, Saarbrücken, Germany. Email: gerth@daimi.aau.dk.

Leszek Gąsieniec: On leave from the Institute of Informatics, Warsaw University, ul. Banacha, Warszawa, Poland. WWW: http://zaa.mimuw.edu.pl/~lechu/lechu.html, Email: leszek@mpi-sb.mpg.de.

BRICS: Basic Research in Computer Science, a Centre of the Danish National Research Foundation.

For the general case where $d > 1$, $d$-queries can be answered in optimal space $O(nm)$ by doing $\sum_{i=0}^{d} \binom{m}{i}$ exact queries, each requiring time $O(m)$, using the data structure of Fredman, Komlós and Szemerédi. On the other hand, $d$-queries can be answered in time $O(m)$ when the size of the data structure can be $O(n \sum_{i=0}^{d} \binom{m}{i})$. We present the corresponding data structure of size $O(nm)$ for the 1-query case.

We present a simple data structure based on tries which has optimal size $O(nm)$ and supports 1-queries in time $O(m)$. Unfortunately we do not know how to construct the data structure in time $O(nm)$ and we leave this as an open problem. However, we give a more involved data structure of size $O(nm)$, based on two tries, supporting 1-queries in time $O(m)$ and which can be constructed in time $O(nm)$. Both data structures support the reporting of all strings with Hamming distance at most one from the query string $\alpha$ in time $O(m)$. For general $d$ both data structures support $d$-queries in time $O(m \sum_{i=0}^{d-1} \binom{m}{i})$. The second data structure can be made semi-dynamic in terms of allowing insertions in amortized time $O(m)$, when starting with an initially empty dictionary. Both data structures work as well for larger alphabets $|\Sigma| > 2$, when the query time is slowed down by a $\log |\Sigma|$ factor.
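For illustration, instantiating the two bounds stated above with $d = 2$ (plain arithmetic on the formulas of this section, nothing beyond them) gives

$$ m \sum_{i=0}^{1} \binom{m}{i} = m\left(\binom{m}{0} + \binom{m}{1}\right) = m(1 + m) = O(m^2) $$

for answering a 2-query through 1-queries, whereas the exact-query approach performs $\sum_{i=0}^{2} \binom{m}{i} = 1 + m + \binom{m}{2}$ exact queries of cost $O(m)$ each, i.e. $O(m^3)$ time in total.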

The paper is organized as follows. In Sect. 2 we give a simple $O(nm)$ size data structure supporting 1-queries in time $O(m)$. In Sect. 3 we present an $O(nm)$ size data structure constructible in time $O(nm)$ which also supports 1-queries in time $O(m)$. In Sect. 4 we present a semi-dynamic version of the second data structure allowing insertions. Finally, in Sect. 5 we give concluding remarks and mention open problems.

2 A trie based data structure

We assume that all strings considered are over a binary alphabet $\Sigma = \{0,1\}$. We let $|w|$ denote the length of $w$, $w[i]$ denote the $i$th symbol of $w$, and $w^R$ denote $w$ reversed. The strings in the dictionary $W$ are called dictionary strings. We let $dist_H(u, v)$ denote the Hamming distance between the two strings $u$ and $v$.

The basic component of our data structure is a trie. A trie, also called a digital search tree, is a tree representation of a set of strings. In a trie all edges are labeled by symbols such that every string corresponds to a path in the trie. A trie is a prefix tree, i.e. two strings have a common path from the root as long as they have the same prefix. Since we consider strings over a binary alphabet, the maximum degree of a trie is at most two.

Assume that all strings $w_i \in W$ are stored in a 2-dimensional array $A_W$ of size $n \times m$, i.e. of $n$ rows and $m$ columns, such that the $i$th string is stored in the $i$th row of the array $A_W$. Notice that $A_W[i, j]$ is the $j$th symbol of $w_i$. For every string $w_i \in W$ we define a set of associated strings $A_i = \{v \in \{0,1\}^m \mid dist_H(v, w_i) = 1\}$, where $|A_i| = m$, for $i = 1, \ldots, n$. The main data structure is a trie $T$ containing all strings $w_i \in W$ and all strings from $A_i$, for all $i = 1, \ldots, n$, i.e. every path from the root to a leaf in the trie represents one of the strings. The leaves of $T$ are labeled by indices of dictionary strings such that a leaf representing a string $s$ and labeled by index $i$ satisfies $s = w_i$ or $s \in A_i$.

Given a query string $\alpha$, a 1-query can be answered as follows. The 1-query is answered positively if there is an exact match, i.e. $\alpha = w_i \in W$, or $\alpha \in A_j$ for some $1 \le j \le n$. Thus the 1-query is answered positively if and only if there is a leaf in the trie $T$ representing the query string $\alpha$. This can be checked in time $O(m)$ by a top-down traversal of $T$. If the leaf exists then the index stored at the leaf is an index of a matched dictionary string.
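The trie $T$ can be sketched in a few lines of Python. This is a minimal, uncompressed illustration under our own assumptions, not the compressed $O(nm)$ structure: the names `build_trie` and `one_query` are hypothetical, vertices are nested dictionaries, every leaf keeps a list of all matching indices (0-based), and the construction takes $O(nm^2)$ time and space.

```python
def build_trie(words):
    """Insert every dictionary string and each of its m one-bit variants.

    Assumption: vertices are nested dicts keyed by '0'/'1'; a leaf stores
    the 0-based indices of all dictionary strings it represents.
    """
    root = {}
    for idx, w in enumerate(words):
        flipped = [w[:j] + ("1" if w[j] == "0" else "0") + w[j + 1:]
                   for j in range(len(w))]
        for s in [w] + flipped:          # w itself plus the set A_i
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node.setdefault("indices", []).append(idx)
    return root


def one_query(root, alpha):
    """Top-down traversal in O(m); returns the indices stored at the leaf."""
    node = root
    for ch in alpha:
        node = node.get(ch)
        if node is None:
            return []                    # no leaf represents alpha
    return node.get("indices", [])


root = build_trie(["0000", "0110", "1111"])
print(one_query(root, "0100"))           # indices of strings within distance 1
```

An empty result means that no dictionary string lies within Hamming distance 1 of the query.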

Notice that $T$ has at most $O(nm)$ leaves because it contains at most $O(nm)$ different strings. Thus $T$ has at most $O(nm)$ internal vertices with degree greater than one. If we compress all chains in $T$ into single edges we get a compressed trie $T'$ of size $O(nm)$. Edges which correspond to compressed chains are labeled by proper intervals of rows in the array $A_W$. If a compressed chain is a substring of a string in a set $A_j$, then the information about the corresponding substring of $w_j$ is extended by the position of the changed bit. Since every entry in $A_W$ can be accessed in constant time, every 1-query can still be answered in time $O(m)$.

A slight modification of the trie $T'$ allows all dictionary strings which match the query string $\alpha$ to be reported. At every leaf representing a string $s$ in $T'$, instead of one index we store all indices $i$ of dictionary strings satisfying $s = w_i$ or $s \in A_i$. Notice that the total size of the trie is still $O(nm)$, since every index $i$, for $i = 1, \ldots, n$, is stored at exactly $m + 1$ leaves. The reporting algorithm first finds the leaf representing the query string $\alpha$ and then reports all indices stored at that leaf. There are at most $m + 1$ reported strings, thus the reporting algorithm works in time $O(m)$. Thus the following theorem holds.

Theorem 1. There exists a data structure of size $O(nm)$ which supports the reporting of all matched dictionary strings to a 1-query in time $O(m)$.

The data structure above is quite simple, occupies optimal space $O(nm)$, and allows 1-queries to be answered optimally in time $O(m)$. But we do not know how to construct it in time $O(nm)$. The straightforward approach gives a construction time of $O(nm^2)$ (this is the total size of the strings in $W$ and the associated strings from all $A_i$ sets).

In the next section we give another data structure of size $O(nm)$, supporting 1-queries in time $O(m)$ and constructible in optimal time $O(nm)$.

3 A double-trie data structure

In the following we assume that all strings in $W$ are enumerated according to their lexicographical order. We can satisfy this assumption by sorting the strings in $W$, for example by radix sort, in time $O(nm)$. Let $I = \{1, \ldots, n\}$ denote the set of the indices of the enumerated strings from $W$. We call a set of consecutive indices (consecutive integers) an interval.

The new data structure is composed of two tries. The trie $T_W$ contains the set of strings $W$, whereas the trie $T_{\overline{W}}$ contains all strings from the set $\overline{W}$, where $\overline{W} = \{w_i^R \mid w_i \in W\}$.

Since $T_W$ is a prefix trie, every path from the root to a vertex $u$ represents a prefix $p_u$ of a string $w_i \in W$. Denote by $W_u$ the set $\{w_i \in W \mid w_i \text{ has prefix } p_u\}$. Since strings in $W$ are enumerated according to their lexicographical order, the indices of the strings in $W_u$ form an interval $I_u$, i.e. $w_i \in W_u$ if and only if $i \in I_u$. Notice that the interval of a vertex in the trie $T_W$ is the concatenation of the intervals of its children. For each vertex $u$ in $T_W$ we compute the corresponding interval $I_u$, storing at $u$ the first and last index of $I_u$.
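The intervals $I_u$ can be illustrated with a short Python sketch. It is a simplification under our own assumptions: the hypothetical function `prefix_intervals` identifies each vertex $u$ with its prefix string $p_u$ (a dictionary key) instead of building the trie explicitly, so it spends $O(nm^2)$ time on string slicing rather than the $O(nm)$ of a postorder traversal.

```python
def prefix_intervals(sorted_words):
    """Map every prefix p_u to (first, last), the 1-based ranks in I_u.

    Assumption: sorted_words is already in lexicographical order, so the
    strings sharing a prefix occupy consecutive ranks.
    """
    intervals = {}
    for rank, w in enumerate(sorted_words, start=1):
        for k in range(len(w) + 1):
            p = w[:k]
            first, _ = intervals.get(p, (rank, rank))
            intervals[p] = (first, rank)   # extend the interval to this rank
    return intervals


words = sorted(["0110", "0000", "1111", "0101"])
iv = prefix_intervals(words)
print(iv["0"], iv["01"], iv["1"])
```

The interval of the prefix "0" is the concatenation of the intervals of "00" and "01", as noted above.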

Similarly, every path from the root to a vertex $v$ in $T_{\overline{W}}$ represents a reversed suffix $s_v^R$ of a string $w_j \in W$. Denote by $W^v$ the set $\{w_i \in W \mid w_i \text{ has suffix } s_v\}$ and by $S_v \subseteq I$ the set of indices of strings in $W^v$. We organize the indices of every set $S_v$ in sorted lists $L_v$ (in increasing order). At the root $r$ of the trie $T_{\overline{W}}$ the list $L_r$ is supported by a search tree maintaining the indices of all the dictionary strings. For an index in a list $L_v$, the neighbor with the smaller value is called the left neighbor and the one with the greater value is called the right neighbor.

If a vertex $x$ is the only child of vertex $v \in T_{\overline{W}}$ then $S_x$ and $S_v$ are identical. If vertex $v \in T_{\overline{W}}$ has two children $x$ and $y$ (there are at most two children since $T_{\overline{W}}$ is a binary trie), the sets $S_x$ and $S_y$ form a partition of the set $S_v$. Since indices in the set $S_v$ are not consecutive ($S_v$ is usually not an interval), we use additional links to keep a fast connection between the set $S_v$ and its partition into $S_x$ and $S_y$. Each element $e$ in the list $L_v$ has one additional link to the closest element in the list $L_x$, i.e. to the smallest element $e_r$ in the list $L_x$ such that $e \le e_r$, or the greatest element $e_l$ in the list $L_x$ such that $e_l \le e$. Moreover, in case vertex $v$ has two children, element $e$ also has one additional link to the analogously defined element $e_l \in L_y$ or $e_r \in L_y$.

Lemma 1. The tries $T_W$ and $T_{\overline{W}}$ can be stored in $O(nm)$ space and they can be constructed in time $O(nm)$.

Proof. The trie $T_W$ has at most $O(nm)$ edges and vertices, i.e. the number of symbols in all strings in $W$. Every vertex $u \in T_W$ keeps only information about the two ends of its interval $I_u = [l..r]$. For all $u \in T_W$ both indices $l$ and $r$ can be easily computed by a postorder traversal of $T_W$ in time $O(nm)$.

The number of vertices in $T_{\overline{W}}$ is similarly bounded by $O(nm)$. Moreover, for any level $i = 1, \ldots, m$ in $T_{\overline{W}}$ the sum $\sum |S_v|$ over all vertices $v$ at this level is exactly $n$, since the sets of indices stored at the children form a partition of the set kept by their parent. Since $T_{\overline{W}}$ has exactly $m$ levels and every index in an $L_v$ list has at most two additional links, the size of $T_{\overline{W}}$ does not exceed $O(nm)$ either.

The $L_v$ lists are constructed by a postorder traversal of $T_{\overline{W}}$. A leaf representing the string $w_i^R$ has $L_v = (i)$, and the $L_v$ list of an internal vertex of $T_{\overline{W}}$ can be constructed by merging the corresponding disjoint lists at its children. The additional links are created along with the merging. Thus the trie $T_{\overline{W}}$ can be constructed in time $O(nm)$. □
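The merging step of the proof, including the creation of the additional links, can be sketched as follows. This is one possible array-based rendering under our own assumptions: the hypothetical function `merge_with_links` stores, for each element $e$ of the merged list, the position in each child list of the greatest element not exceeding $e$ (that is, $e_l$), with $-1$ meaning that no such element exists; the element $e_r$ is then at that position or at the next one.

```python
def merge_with_links(lx, ly):
    """Merge two disjoint sorted lists L_x and L_y into L_v.

    Returns (merged, link_x, link_y), where link_x[k] is the index in lx
    of the greatest element <= merged[k] (or -1), and likewise link_y.
    Runs in time linear in len(lx) + len(ly).
    """
    merged, link_x, link_y = [], [], []
    i = j = 0
    while i < len(lx) or j < len(ly):
        take_x = j == len(ly) or (i < len(lx) and lx[i] < ly[j])
        if take_x:
            merged.append(lx[i])
            i += 1
        else:
            merged.append(ly[j])
            j += 1
        # Everything consumed so far is <= the element just appended.
        link_x.append(i - 1)
        link_y.append(j - 1)
    return merged, link_x, link_y


merged, link_x, link_y = merge_with_links([1, 4, 6], [2, 3, 7])
print(merged, link_x, link_y)
```

Following a link costs one array access, which corresponds to the constant-time step from a list $L_v$ to the list of a child.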

Answering queries. In this section we show how to answer 1-queries in time $O(m)$, assuming that both tries $T_W$ and $T_{\overline{W}}$ are already constructed. We present a sequence of three 1-query algorithms, all based on the double-trie structure. The first algorithm, Query1, outlines how to use the presented data structure to answer 1-queries. The second algorithm, Query2, reports the index of a matched dictionary string. The third algorithm, Query3, reports all matched dictionary strings.

Let $pref_\alpha$ be the longest prefix of the string $\alpha$ that is also a prefix of a string in $W$. The prefix $pref_\alpha$ is represented by a path from the root to a vertex $u_\alpha$ in the trie $T_W$, i.e. $p_{u_\alpha} = pref_\alpha$, but for the only child $x$ of vertex $u_\alpha$ the string $p_x$ is not a prefix of $\alpha$. We call the vertex $u_\alpha$ the kernel vertex for the string $\alpha$, and the path from the root of $T_W$ to the kernel vertex $u_\alpha$ the leading path in $T_W$. The interval $I_{u_\alpha}$ associated with the kernel vertex $u_\alpha$ is called the kernel interval for the string $\alpha$, and the smallest element $\mu_\alpha \in I_{u_\alpha}$ is called the key for the query string $\alpha$. Notice that the key $\mu_\alpha \in I_w$ for every vertex $w$ on the leading path in $T_W$.

Similarly, in the trie $T_{\overline{W}}$ we define the kernel set $S_{\hat{v}}$, which is associated with the vertex $\hat{v}$, where $\hat{v}$ corresponds to the longest prefix of the string $\alpha^R$ in $T_{\overline{W}}$. The vertex $\hat{v}$ is called the kernel vertex for the string $\alpha^R$, and the path from the root of $T_{\overline{W}}$ to $\hat{v}$ is called the leading path in $T_{\overline{W}}$.

The general idea of the algorithm is as follows. If the query string $\alpha$ has an exact match in the set $W$, then there is a leaf in $T_W$ which represents the query string $\alpha$. The proper leaf can be found in time $O(m)$ by a top-down traversal of $T_W$, starting from its root.

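If the query string $\alpha$ has no exact match in $W$ but it has a match within distance one, we know that there is a string $w_i \in W$ which has a factorization $\pi_\alpha b \tau_\alpha$, satisfying:

(1) $\pi_\alpha$ is a prefix of $\alpha$ of length $l_\alpha$,
(2) $\tau_\alpha$ is a suffix of $\alpha$ of length $r_\alpha$,
(3) $b \ne \alpha[l_\alpha + 1]$, and
(4) $l_\alpha + r_\alpha + 1 = m$.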

Notice that the prefix $\pi_\alpha$ must be represented by a vertex $u$ on the leading path in $T_W$, and the suffix $\tau_\alpha$ must be represented by a vertex $v$ on the leading path in $T_{\overline{W}}$. We call such a pair $(u, v)$ a feasible pair. To find the string $w_i$ within distance 1 of the query string $\alpha$ we have to search all feasible pairs $(u, v)$. Every feasible pair $(u, v)$ for which $I_u \cap S_v \ne \emptyset$ represents at least one string within distance 1 of the query string $\alpha$. The algorithm Query1 generates consecutive feasible pairs $(u, v)$, starting with $u = u_\alpha$, the kernel vertex in $T_W$. The algorithm Query1 stops with a positive answer just after the first pair $(u, v)$ with $I_u \cap S_v \ne \emptyset$ is found. It stops with a negative answer if all feasible pairs $(u, v)$ have $I_u \cap S_v = \emptyset$.

Notice that the steps before the while loop in the algorithm Query1 can be performed in time $O(m)$. The algorithm looks for the kernel vertex in $T_W$, going from the root along the leading path (representing the prefix $pref_\alpha$) as long as possible. The last reached vertex $u$ is the kernel vertex $u_\alpha$. Then the corresponding vertex $v$ on the leading path in $T_{\overline{W}}$ is found, if such a vertex exists. Recall that the pair $(u, v)$ must be a feasible pair. At this point the following problem arises: how to perform the test $I_u \cap S_v \ne \emptyset$ efficiently?

ALGORITHM Query1
begin
    u := u_α, the kernel vertex in T_W
    Find on the leading path in T_{\overline{W}} vertex v such that (u, v) is a feasible pair
    while vertex v exists do
        if I_u ∩ S_v ≠ ∅ then return "There is a match"
        u := Parent(u)
        v := Child-on-Leading-Path(v)
    od
    return "No match"
end

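Recall that the smallest index $\mu_\alpha$ in the kernel interval $I_{u_\alpha}$ is called the key for the query string $\alpha$, and recall also that the key $\mu_\alpha \in I_w$ for every vertex $w$ on the leading path in the trie $T_W$. During the first test $I_u \cap S_v \ne \emptyset$, the position of the key $\mu_\alpha$ in $S_v$ is found in time $\log |S_v| \le \log n \le m$ (since $W$ only contains binary strings we have $\log n \le m$). Let $I_u = [l..r]$, and let $a$ be the left ($a < \mu_\alpha$) and $b$ the right ($b \ge \mu_\alpha$) neighbor of $\mu_\alpha$ in the set $S_v$. Since $l \le \mu_\alpha \le r$, the test $I_u \cap S_v \ne \emptyset$ can be stated as:

$$ I_u \cap S_v \ne \emptyset \iff l \le a \;\lor\; b \le r. $$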

If the above test is positive, the algorithm Query2 reports the proper index among $a$ and $b$ and stops. Otherwise, in the next round of the while loop the new neighbors $a$ and $b$ of the key $\mu_\alpha$ in the new list $L_v$ are computed in constant time by using the additional links between the elements of the old and the new list $L_v$.

Theorem 2. 1-queries to a dictionary $W$ of $n$ strings of length $m$ can be answered in time $O(m)$ and space $O(nm)$.

Proof. The initial steps of the algorithm (preceding the while loop) are performed in time $O(m + \log n) = O(m)$. The feasible pair $(u, v)$ (if such exists) is simply found in time $O(m)$. Then the algorithm finds in time $O(\log n)$ the neighbors of $\mu_\alpha$ in the list $L_r$ which is held at the root of $T_{\overline{W}}$. This is possible since the list $L_r$ is supported by a search tree. Now the algorithm traverses the leading path in $T_{\overline{W}}$, recovering at each level the neighbors of $\mu_\alpha$ in constant time using the additional links. There are at most $m$ iterations of the while loop, since there are exactly $m$ levels in both tries $T_W$ and $T_{\overline{W}}$. Every iteration of the while loop is done in constant time, since both neighbors $a$ and $b$ of the key $\mu_\alpha$ in the new, more sparse set $S_v$ are found in constant time. Thus the total running time of the algorithm is $O(m)$. □

ALGORITHM Query2
begin
    u := u_α, the kernel vertex in T_W
    Find on the leading path in T_{\overline{W}} vertex v such that (u, v) is a feasible pair
    Find the neighbors a and b of the key μ_α in S_v
    while vertex v exists do
        if l ≤ a then return "String a is matched"
        if b ≤ r then return "String b is matched"
        u := Parent(u); Set l and r according to the new interval I_u
        v := Child-on-Leading-Path(v)
        Find new neighbors of a and b in the new list L_v
    od
    return "No match"
end
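The walk over feasible pairs can be made concrete with a simplified Python sketch. It is not the $O(m)$ algorithm itself, and the following are our own assumptions: the hypothetical class `DoubleTrieSketch` keys the intervals $I_u$ and the sets $S_v$ by prefix and suffix strings instead of trie vertices (costing $O(nm^2)$ space), and it tests $I_u \cap S_v \ne \emptyset$ by binary search in $S_v$, giving $O(m \log n)$ per query plus string slicing, instead of following the additional links in constant time.

```python
from bisect import bisect_left


class DoubleTrieSketch:
    def __init__(self, words):
        self.words = sorted(set(words))          # ranks are 1-based positions
        self.m = len(self.words[0]) if self.words else 0
        self.interval = {}      # prefix p_u -> (l, r), the interval I_u
        self.suffix_set = {}    # suffix s_v -> sorted list of ranks, S_v
        for rank, w in enumerate(self.words, start=1):
            for k in range(self.m + 1):
                l, _ = self.interval.get(w[:k], (rank, rank))
                self.interval[w[:k]] = (l, rank)
                self.suffix_set.setdefault(w[k:], []).append(rank)

    def query(self, alpha):
        """Return the rank of a string within distance 1 of alpha, or None."""
        if not self.words:
            return None
        h = 0                   # depth of the kernel vertex u_alpha
        while h < self.m and alpha[:h + 1] in self.interval:
            h += 1
        if h == self.m:
            return self.interval[alpha][0]       # exact match
        k = h                   # 0-based position of the changed symbol
        while k >= 0 and alpha[k + 1:] in self.suffix_set:
            l, r = self.interval[alpha[:k]]      # I_u
            s_v = self.suffix_set[alpha[k + 1:]]
            pos = bisect_left(s_v, l)            # first element of S_v >= l
            if pos < len(s_v) and s_v[pos] <= r:
                return s_v[pos]
            k -= 1              # u := Parent(u), v moves one level down
        return None


ds = DoubleTrieSketch(["0000", "0110", "1111", "0101"])
print(ds.query("0100"), ds.query("1001"))
```

Since an exact match is excluded before the loop, any string found for a feasible pair must differ from the query exactly at position `k`.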

We now explain how to modify the algorithm Query2 into an algorithm reporting all matches to a query string. The main idea of the new algorithm is as follows. At any iteration of the while loop, instead of looking only for the left and the right neighbor of the key index $\mu_\alpha$, the algorithm Query3 searches one by one all indices to the left and right of $\mu_\alpha$ which belong to the list $L_v$ and to the interval $I_u$. To avoid multiple reporting of the same index, the algorithm searches only that part of the new interval $I_u$ which is an extension of the previous one. The variables $a$ and $b$ store the leftmost and the rightmost searched indices in the list $L_v$.

Theorem 3. There exists a data structure of size $O(nm)$ and constructible in time $O(nm)$ which supports the reporting of all matched dictionary strings to a 1-query in time $O(m)$.

Proof. The algorithm Query3 works in time $O(m + \#matched)$, where $\#matched$ is the number of all reported strings. Since there are at most $m + 1$ reported strings (one exact match and at most $m$ matches with one error), the total time of the reporting algorithm is $O(m)$. □
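The idea of scanning only the extension of the previous interval can be sketched in Python as follows. As before this is a simplification under our own assumptions: the hypothetical function `report_matches` rebuilds string-keyed tables on every call and locates the scanned ranges by binary search, so it illustrates the duplicate-free reporting order, not the $O(m)$ time bound.

```python
from bisect import bisect_left, bisect_right


def report_matches(words, alpha):
    """Return, in sorted order, all dictionary strings within distance 1."""
    words = sorted(set(words))
    m = len(alpha)
    interval, suffix_set = {}, {}
    for rank, w in enumerate(words, start=1):
        for k in range(m + 1):
            first, _ = interval.get(w[:k], (rank, rank))
            interval[w[:k]] = (first, rank)               # I_u
            suffix_set.setdefault(w[k:], []).append(rank)  # S_v
    h = 0
    while h < m and alpha[:h + 1] in interval:
        h += 1
    matched, prev = [], None
    if h == m:                         # exact match: report it once
        prev = interval[alpha]
        matched.append(prev[0])
        h = m - 1
    k = h
    while k >= 0 and alpha[k + 1:] in suffix_set:
        l, r = interval[alpha[:k]]
        s_v = suffix_set[alpha[k + 1:]]
        # Scan only the part of [l, r] not covered by the previous interval.
        if prev is None:
            ranges = [(l, r)]
        else:
            ranges = [(l, prev[0] - 1), (prev[1] + 1, r)]
        for lo, hi in ranges:
            matched.extend(s_v[bisect_left(s_v, lo):bisect_right(s_v, hi)])
        prev = (l, r)
        k -= 1
    return [words[rank - 1] for rank in sorted(matched)]


print(report_matches(["0000", "0110", "1111", "0101", "0100"], "0100"))
```

Because the intervals on the leading path are nested and the sets $S_v$ only shrink, every index inside the previous interval has already been examined, so no string is reported twice.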

ALGORITHM Query3
begin
    u := u_α, the kernel vertex in T_W
    Find on the leading path in T_{\overline{W}} vertex v such that (u, v) is a feasible pair
    Find the neighbors a and b of the key μ_α in S_v
    while vertex v exists do
        while l ≤ a do
            report "String a is matched"
            a := left neighbor of a in L_v
        od
        while b ≤ r do
            report "String b is matched"
            b := right neighbor of b in L_v
        od
        u := Parent(u); Set l and r according to the new I_u
        v := Child-on-Leading-Path(v)
        Find a or the left neighbor of a in the new list L_v
        Find b or the right neighbor of b in the new list L_v
    od
end

4 A semi-dynamic data structure

In this section we describe how the data structure presented in Sect. 3 can be made semi-dynamic, such that new binary strings can be inserted into $W$ in amortized time $O(m)$. In the following, $w'$ denotes a string to be inserted into $W$.

The data structure described in Sect. 3 requires that the strings $w_i$ are lexicographically sorted and that each string has been assigned its rank with respect to the lexicographical ordering of the strings. If we want to add $w'$ to $W$ we can use $T_W$ to locate the position of $w'$ in the sorted list of the $w_i$'s in time $O(m)$. If we continue to maintain the ranks explicitly assigned to the strings, we have to reassign new ranks to all strings larger than $w'$. This would require time $\Omega(n)$.

To avoid this problem, observe that the indices are used to store the endpoints of the intervals $I_u$ and to store the sets $S_v$, and that the only operation performed on the indices is the comparison of two indices, to decide in constant time if one string is lexicographically less than another string.

Essentially, what we need to know is, given the handles of two strings from $W$, which one of the two strings is the lexicographically smaller. A solution to this problem was given by Dietz and Sleator. They presented a data structure that allows a new element to be inserted into a linked list in constant time if the new element's position is known, and that can answer order queries in constant time.
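To make the interface concrete, here is a deliberately naive order-maintenance list in Python. It is not the Dietz and Sleator structure: the hypothetical classes `Handle` and `OrderList` give every element an integer tag halfway between its neighbors and relabel the whole list when a gap is exhausted, so an insertion can cost $\Theta(n)$ in the worst case. Only the two operations used in this section are shown: inserting after a known position and comparing two handles in constant time.

```python
class Handle:
    """A list element; tags increase strictly along the list."""

    def __init__(self, value):
        self.value, self.tag = value, 0
        self.prev = self.next = None


class OrderList:
    GAP = 1 << 20   # assumed initial spacing between consecutive tags

    def __init__(self):
        self.head = Handle(None)   # sentinel with tag 0, precedes everything

    def insert_after(self, node, value):
        new = Handle(value)
        new.prev, new.next = node, node.next
        if node.next is not None:
            node.next.prev = new
        node.next = new
        hi = new.next.tag if new.next is not None else node.tag + 2 * self.GAP
        if hi - node.tag < 2:
            self._relabel()        # no free tag left between the neighbors
        else:
            new.tag = (node.tag + hi) // 2
        return new

    def _relabel(self):
        tag, cur = 0, self.head
        while cur is not None:     # spread the tags out evenly again
            cur.tag = tag
            tag += self.GAP
            cur = cur.next

    @staticmethod
    def precedes(x, y):
        """Order query: does x come before y in the list?"""
        return x.tag < y.tag


order = OrderList()
a = order.insert_after(order.head, "a")
c = order.insert_after(a, "c")
b = order.insert_after(a, "b")
print(OrderList.precedes(a, b), OrderList.precedes(b, c))
```

In the semi-dynamic dictionary such handles would take the place of the explicit ranks stored in the intervals $I_u$ and the sets $S_v$.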

By applying the data structure of Dietz and Sleator to maintain the ordering between the strings, an insertion can now be implemented as follows. First insert $w'$ into $T_W$. This requires time $O(m)$. The position of $w'$ in $T_W$ also determines its location in the lexicographical order, implying that the data structure of Dietz and Sleator can be updated too. By traversing the path from the new leaf representing $w'$ in $T_W$ to the root of $T_W$, the endpoints of the intervals $I_u$ can be updated in time $O(m)$.

The insertion of $w'^R$ into $T_{\overline{W}}$, without updating the associated fields, can be done in time $O(m)$. Analogously to the query algorithm in Sect. 3, the positions in the $S_v$ sets along the insertion path of $w'^R$ in $T_{\overline{W}}$ where the handle of $w'$ is to be inserted can be found in time $O(m)$.

The remaining problem is to update the additional links between the elements in the $L_v$ lists. For this purpose we change our representation to the following. Let $v$ be a node with sons $x$ and $y$. In the following we only consider how to handle the links between $L_v$ and $L_x$. The links between $L_v$ and $L_y$ are handled analogously. For each element $e \in L_v \cap L_x$ we maintain a pointer from the position of $e$ in $L_v$ to the position of $e$ in $L_x$. For each element $e \in L_v \setminus L_x$ the pointer is null. Let $e \in L_v$. We can now find the closest element to $e$ in $L_x$ by finding the closest element in $L_v$ that has a non-null pointer. We call such an element marked.

For this purpose we use the find-split-add data structure of Imai and Asano, an extension of a data structure by Gabow and Tarjan. The data structure supports the following operations: given a pointer to an element in a list, to find the closest marked element (find), to mark an unmarked list element (split), and to insert a new unmarked element into the list adjacent to an element in the list (add). The operations split and add can be performed in amortized constant time, and find in worst case constant time, on a RAM.

Going from $e$ in $L_v$ to $e$'s closest neighbor in $L_x$ can still be performed in worst case constant time, because this only requires one find operation to be performed. When a new element $e$ is added to $L_v$ we just perform add once, and in case $e$ is added to $L_x$ too, we also perform split on $e$. This requires amortized constant time. In total we can therefore update all the links between the $L_v$ lists in amortized time $O(m)$ when inserting a new string into the dictionary.

Theorem 4. There exists a data structure which supports the reporting of all matched dictionary strings to a 1-query in worst case time $O(m)$ and that allows new dictionary strings to be inserted in amortized time $O(m)$.

5 Conclusion

We have presented a data structure for the approximate dictionary query problem that can be constructed in time $O(nm)$, stored in $O(nm)$ space, and that can answer 1-queries in time $O(m)$. We have also shown that the data structure can be made semi-dynamic by allowing insertions in amortized time $O(m)$, when we start with an initially empty dictionary. For the general $d$ case the presented data structure allows $d$-queries to be answered in time $O(m \sum_{i=0}^{d-1} \binom{m}{i})$ by asking 1-queries for all strings within Hamming distance $d - 1$ of the query string $\alpha$. This improves the query time of a naive algorithm by a factor of $m$. We leave as an open problem whether the above query time for the general $d$ case can be improved when the size of the data structure is $O(nm)$. For example, is there any $o(m^2)$ 2-query algorithm?
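The reduction from $d$-queries to 1-queries can be sketched as follows. The names are hypothetical and the 1-query here is a brute-force stand-in, used only to keep the example self-contained; in the setting above it would be replaced by the $O(m)$ 1-query of the double-trie structure.

```python
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def one_query(words, beta):
    # Stand-in (assumption): brute force instead of the O(m) data structure.
    return any(hamming(w, beta) <= 1 for w in words)


def flip(s, positions):
    chars = list(s)
    for p in positions:
        chars[p] = "1" if chars[p] == "0" else "0"
    return "".join(chars)


def d_query(words, alpha, d):
    """Answer a d-query (d >= 1) by 1-queries on all strings within
    Hamming distance d - 1 of alpha."""
    for i in range(d):                                  # i = 0, ..., d - 1
        for positions in combinations(range(len(alpha)), i):
            if one_query(words, flip(alpha, positions)):
                return True
    return False


print(d_query(["000000"], "111000", 3), d_query(["000000"], "111000", 2))
```

The double loop issues exactly $\sum_{i=0}^{d-1} \binom{m}{i}$ 1-queries, which matches the bound stated above once each 1-query costs $O(m)$.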

Another interesting problem, which is related to the approximate query problem and the approximate string matching problem, can be stated as follows. Given a binary string $t$ of length $n$, is it possible to create a data structure for $t$ of size $O(n)$ which allows 1-queries (i.e. queries for occurrences of a query string with at most one mismatch) in time $O(m)$, where $m$ is the size of the query string? By creating a compressed suffix tree of size $O(n)$ for the string $t$, 1-queries can be answered in time $O(m^2)$ by an exhaustive search.

Acknowledgment

The authors thank Dany Breslauer for pointing out the relation to the find-split-add problem.

References

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. Data Structures and Algorithms. Addison-Wesley, Reading, MA.

B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM.

Paul F. Dietz and Daniel D. Sleator. Two algorithms for maintaining order in a list. In Proc. Ann. ACM Symp. on Theory of Computing (STOC).

Danny Dolev, Yuval Harari, Nathan Linial, Noam Nisan, and Michal Parnas. Neighborhood preserving hashing and approximate queries. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA).

Danny Dolev, Yuval Harari, and Michal Parnas. Finding the neighborhood of a query in a dictionary. In Proc. Israel Symposium on Theory of Computing and Systems.

E. Fredkin. Trie memory. Communications of the ACM.

Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM.

Harold N. Gabow and Robert Endre Tarjan. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences.

Dan Greene, Michal Parnas, and Frances Yao. Multi-index hashing for information retrieval. In Proc. Ann. Symp. on Foundations of Computer Science (FOCS).

Hiroshi Imai and Taka Asano. Dynamic orthogonal segment intersection search. Journal of Algorithms.

Udi Manber and Sun Wu. An algorithm for approximate membership checking with application to password security. Information Processing Letters.

M. Minsky and S. Papert. Perceptrons. MIT Press, Cambridge, Mass.

P. van Emde Boas. Machine models and simulations. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity. MIT Press/Elsevier.

Andrew C. Yao. Should tables be sorted? Journal of the ACM.

Andrew C. Yao and Frances F. Yao. Dictionary look-up with small errors. In Proc. Combinatorial Pattern Matching, Lecture Notes in Computer Science. Springer Verlag, Berlin.
