A Metric Index for Approximate String Matching *

Edgar Chávez (1) and Gonzalo Navarro (2)

(1) Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio B, Ciudad Universitaria, Morelia, Mich., México. elchavez@fismat.umich.mx. On leave of absence at Universidad de Chile.

(2) Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada, Santiago, Chile. gnavarro@dcc.uchile.cl

Abstract. We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This permits finding the $R$ occurrences of a pattern of length $m$ in a text of length $n$ in average time $O(m \log^2 n + m^2 + R)$, using $O(n \log n)$ space and $O(n \log^2 n)$ index construction time. This complexity improves by far over all other previous methods. We also show a simpler scheme needing $O(n)$ space.

Introduction and Related Work

Indexing text to permit efficient approximate searching on it is one of the main open problems in combinatorial pattern matching. The approximate string matching problem is: given a long text $T$ of length $n$, a (comparatively short) pattern $P$ of length $m$, and a threshold $r$, retrieve all the pattern occurrences, that is, the text substrings whose edit distance to the pattern is at most $r$. The edit distance between two strings is defined as the minimum number of character insertions, deletions and substitutions needed to make them equal. This distance is used in many applications, but several other distances are of interest.

In the on-line version of the problem, the pattern can be preprocessed but the text cannot. There are numerous solutions to this problem, but none is acceptable when the text is too long, since the search time is proportional to the text length. Indexing text for approximate string matching has received attention only recently. Despite some progress in the last decade, the indexing schemes for this problem are still rather immature.

There exist some indexing schemes specialized to word-wise searching on natural language text. These indexes perform quite well in that case, but they cannot be extended to handle the general case. Extremely important applications such as DNA, proteins, music or oriental languages fall outside this case.

* Supported by the CYTED VII RIBIDI project (both authors), a CONACyT grant (first author) and a Fondecyt grant (second author).

The indexes that solve the general problem can be divided into three classes.

Backtracking uses the suffix tree, suffix array or DAWG of the text in order to factor out its repetitions. A sequential algorithm on the text is simulated by backtracking on the data structure. These algorithms take time exponential on $m$ or $r$, but in many cases independent of $n$, the text size. This makes them attractive when searching for very short patterns.

Partitioning partitions the pattern into pieces to ensure that some of the pieces must appear without alterations inside every occurrence. An index capable of exact searching is used to detect the pieces, and the text areas that have enough evidence of containing an occurrence are checked with a sequential algorithm. These algorithms work well only when $r/m$ is small.

The third class is a hybrid between the other two. The pattern is divided into large pieces that can still contain (fewer) errors. These pieces are searched for using backtracking, and the potential text occurrences are checked as in the partitioning methods. The hybrid algorithms are more effective because they can find the right balance between the length of the pieces to search for and the error level permitted. Using the appropriate partition of the pattern, these methods achieve on average $O(n^{\lambda})$ search time, for some $0 < \lambda < 1$ that depends on $r$. They tolerate moderate error ratios $r/m$.

We propose in this paper a brand new approach to the problem. We take into account that the edit distance satisfies the triangle inequality and hence defines a metric space on the set of text substrings. We can re-express the approximate search problem as a range search problem on this metric space. This approach has been attempted before, but in those cases the particularities of the problem made it possible to index $O(n)$ elements. In the general case we have $O(n^2)$ text substrings.

The main contribution of this paper is to devise a method (based on the suffix tree of the text) to meaningfully collapse the $O(n^2)$ text substrings into $O(n)$ sets, and to find a way to build a metric space out of those sets. The result is an indexing method that, at the cost of requiring on average $O(n \log n)$ space and $O(n \log^2 n)$ construction time, permits finding the $R$ approximate occurrences of the pattern in $O(m \log^2 n + m^2 + R)$ average time. This is a complexity breakthrough over previous work, and it is easier than in other approaches to extend the idea to other distance functions such as reversals. Moreover, it represents an original approach to the problem that opens a vast number of possibilities for improvement. We also consider a simpler version of the index that needs $O(n)$ space and that, despite not involving a complexity breakthrough, promises to be better in practice.

We use the following notation in the paper. Given a string $s$, we denote its length as $|s|$. We also denote by $s_i$ the $i$-th character of $s$, for an integer $i \in \{1..|s|\}$. We denote $s_{i..j} = s_i s_{i+1} \ldots s_j$ (which is the empty string if $i > j$) and $s_{i..} = s_{i..|s|}$. The empty string is denoted as $\varepsilon$. A string $x$ is said to be a prefix of $xy$, a suffix of $yx$, and a substring of $yxz$.

Metric Spaces

We describe in this section some concepts related to searching metric spaces. We have concentrated only on the part that is relevant for this paper. There exist recent surveys if more complete information is desired.

A metric space is, informally, a set of black-box objects and a distance function defined among them, which satisfies the triangle inequality. The problem of proximity searching in metric spaces consists of indexing the set such that later, given a query, all the elements of the set that are close enough to the query can be quickly found. This has applications in a vast number of fields, such as: non-traditional databases, where the concept of exact search is of no use and we search for similar objects (e.g., databases storing images, fingerprints or audio clips); machine learning and classification, where a new element must be classified according to its closest existing element; image quantization and compression, where only some vectors can be represented and those that cannot must be coded as their closest representable point; text retrieval, where we look for documents that are similar to a given query or document; computational biology, where we want to find a DNA or protein sequence in a database allowing some errors due to typical variations; function prediction, where we want to search for the most similar behavior of a function in the past so as to predict its probable future behavior; etc.

Formally, a metric space is a pair $(X, d)$, where $X$ is a universe of objects and $d : X \times X \rightarrow \mathbb{R}^+$ is a distance function defined on it that returns nonnegative values. This distance satisfies the properties of reflexivity ($d(x,x) = 0$), strict positiveness ($x \neq y \Rightarrow d(x,y) > 0$), symmetry ($d(x,y) = d(y,x)$) and triangle inequality ($d(x,y) \le d(x,z) + d(z,y)$).

A finite subset $U$ of $X$, of size $n = |U|$, is the set of objects we search. Among the many queries of interest on a metric space, we are interested in the so-called range queries: given a query $q \in X$ and a tolerance radius $r$, find the set of all elements in $U$ that are at distance at most $r$ from $q$. Formally, the outcome of the query is $(q, r)_d = \{u \in U,\ d(q,u) \le r\}$. The goal is to preprocess the set so as to minimize the computational cost of producing the answer $(q, r)_d$.

From the plethora of existing algorithms to index metric spaces, we focus on the so-called pivot-based ones, which are built on a single general idea: select $k$ elements $\{p_1, \ldots, p_k\}$ from $U$ (called pivots), and identify each element $u \in U$ with a $k$-dimensional point $(d(u,p_1), \ldots, d(u,p_k))$, i.e., its distances to the pivots. The index is basically the set of $kn$ coordinates. At query time, map $q$ to the $k$-dimensional point $(d(q,p_1), \ldots, d(q,p_k))$. With this information at hand, we can filter out, using the triangle inequality, any element $u$ such that $|d(q,p_i) - d(u,p_i)| > r$ for some pivot $p_i$, since in that case we know that $d(q,u) > r$ without need to evaluate $d(u,q)$. Those elements that cannot be filtered out using this rule are directly compared against $q$.
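The filtering rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`build_pivot_table`, `range_query`), the toy object set and the choice of pivots are hypothetical and not taken from the paper, and the index is scanned linearly rather than through a multidimensional structure.

```python
def edit_distance(x, y):
    """Levenshtein distance, used here as the metric d."""
    row = list(range(len(y) + 1))
    for i, xc in enumerate(x, 1):
        diag, row[0] = row[0], i
        for j, yc in enumerate(y, 1):
            diag, row[j] = row[j], (diag if xc == yc
                                    else 1 + min(row[j], row[j - 1], diag))
    return row[-1]


def build_pivot_table(objects, pivots, dist):
    """Index: one k-dimensional point (distances to the pivots) per object."""
    return [[dist(u, p) for p in pivots] for u in objects]


def range_query(q, r, objects, pivots, table, dist):
    """Return (answers, number of external distance evaluations)."""
    dq = [dist(q, p) for p in pivots]          # k internal evaluations
    answers, external = [], 0
    for u, du in zip(objects, table):
        # Triangle inequality: |d(q,p_i) - d(u,p_i)| > r implies d(q,u) > r.
        if any(abs(a - b) > r for a, b in zip(dq, du)):
            continue
        external += 1                          # u could not be filtered out
        if dist(q, u) <= r:
            answers.append(u)
    return answers, external


objects = ["abra", "bra", "cadabra", "dabra", "racadabra", "a"]
pivots = ["a", "cad"]
table = build_pivot_table(objects, pivots, edit_distance)
answers, external = range_query("abr", 1, objects, pivots, table, edit_distance)
```

The filter never loses an answer, because it only discards elements that the triangle inequality proves to be farther than $r$; the pivots only affect how many external evaluations remain.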

An interesting feature of pivot-based algorithms is that they can reduce the number of final distance evaluations by increasing the number of pivots. Define $D_k(x,y) = \max_{1 \le j \le k} |d(x,p_j) - d(y,p_j)|$. Using the pivots $p_1, \ldots, p_k$ is equivalent to discarding elements $u$ such that $D_k(q,u) > r$. As more pivots are added, we need to perform more distance evaluations (exactly $k$) to compute $D_k(q, \cdot)$ (these are called internal evaluations), but on the other hand $D_k(q, \cdot)$ increases its value and hence has a higher chance of filtering out more elements (those comparisons against elements that cannot be filtered out are called external). It follows that there exists an optimum $k$.

If one is not only interested in the number of distance evaluations performed but also in the total CPU time required, then scanning all the $n$ elements to filter out some of them may be unacceptable. In that case, one needs multidimensional range search methods, which include data structures such as the kd-tree, R-tree, X-tree, etc. Those structures permit indexing a set of objects in $k$-dimensional space in order to process range queries.

In this paper we are interested in a metric space where the universe is the set of strings over some alphabet $\Sigma$, i.e., $X = \Sigma^*$, and the distance function is the so-called edit distance or Levenshtein distance. This is defined as the minimum number of character insertions, deletions and substitutions necessary to make two strings equal. The edit distance, and in fact any other distance defined as the best way to convert one element into the other, is reflexive, strictly positive (as long as there are no zero-cost operations), symmetric (as long as the operations allowed are symmetric), and satisfies the triangle inequality.

The algorithm to compute the edit distance $ed()$ is based on dynamic programming. Imagine that we need to compute $ed(x,y)$. A matrix $C_{0..|x|,\,0..|y|}$ is filled, where $C_{i,j} = ed(x_{1..i}, y_{1..j})$, so $C_{|x|,|y|} = ed(x,y)$. This is computed as

$$C_{i,0} = i, \qquad C_{0,j} = j,$$
$$C_{i,j} = \text{if } (x_i = y_j) \text{ then } C_{i-1,j-1} \text{ else } 1 + \min(C_{i-1,j},\ C_{i,j-1},\ C_{i-1,j-1}).$$

The algorithm takes $O(|x||y|)$ time. The matrix can be filled column-wise or row-wise (there are more sophisticated ways as well). For reasons that will be made clear later, we prefer the row-wise filling. The space required is only $O(|y|)$, since only the previous row must be stored in order to compute the new one, and therefore we just keep one row and update it.
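A minimal Python sketch of the row-wise computation follows. The function name `edit_distance` and the variable names are illustrative choices; the recurrence and the single-row update are the ones described above.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance ed(x, y), filling the matrix C row by row in O(|y|) space."""
    row = list(range(len(y) + 1))              # row 0: C[0][j] = j
    for i, xc in enumerate(x, start=1):
        prev_diag = row[0]                     # C[i-1][0]
        row[0] = i                             # C[i][0] = i
        for j, yc in enumerate(y, start=1):
            prev_up = row[j]                   # C[i-1][j], about to be overwritten
            if xc == yc:
                row[j] = prev_diag             # characters match: copy the diagonal
            else:
                row[j] = 1 + min(prev_up, row[j - 1], prev_diag)
            prev_diag = prev_up                # becomes C[i-1][j] for the next column
    return row[-1]                             # C[|x|][|y|]
```

Because each row depends only on the previous one, the row obtained for a string $x$ can be reused as the starting point for any extension $xc$ of $x$, which is what makes the row-wise order convenient when strings share prefixes.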

Text Indexing

Suffix trees are widely used data structures for text processing. Any position $i$ in a text $T$ defines a suffix of $T$, namely $T_{i..}$. A suffix trie is a trie data structure built over all the suffixes of $T$. At the leaf nodes the pointers to the suffixes are stored. Every substring of $T$ can be found by traversing a path from the root. Roughly speaking, each suffix trie leaf represents a suffix and each internal node represents a different substring of $T$.

To improve space utilization, this trie is compacted into a Patricia tree by compressing unary paths. The edges that replace a compressed path store the whole string that they represent (via two pointers to their initial and final text position). Once unary paths are not present, the trie, now called suffix tree, has $O(n)$ nodes instead of the worst-case $O(n^2)$ of the trie. The suffix tree can be directly built in $O(n)$ time. Any algorithm on a suffix trie can be simulated at the same cost in the suffix tree.

We call explicit those suffix trie nodes that survive in the suffix tree, and implicit those that are collapsed. Figure 1 shows the suffix trie and tree of the text abracadabra. Note that a special endmarker "$", smaller than any other character, is appended to the text so that all the suffixes are external nodes.

[Figure 1: graphic not recoverable. Panels: the text abracadabra with its positions 1 to 11, its suffix trie, its suffix tree, and its suffix array.]

Fig. 1. The suffix trie, suffix tree and suffix array of the text abracadabra.

The figure shows the internal nodes of the trie (numbered 0 to 9, in italics inside circles), which represent text substrings that appear more than once, and the external nodes (numbered 1 to 11, inside squares), which represent text substrings that appear just once. Those leaves do not only represent the unique substrings but all their extensions until the full suffix. In the suffix tree, only some internal nodes are left, and they represent the same substring as before plus the prefixes that may have been collapsed. For example, the internal node 7 of the suffix tree now represents the compressed nodes 5 and 6, and hence the strings b, br and bra. The external node 1 represents abrac, but also abraca, abracad, etc., until the full suffix abracadabra.

Finally, the suffix array is a more compact version of the suffix tree, which requires much less space and poses a small penalty over the search time. If the leaves of the suffix tree are traversed in left-to-right order, all the suffixes of the text are retrieved in lexicographical order. A suffix array is simply an array containing all the pointers to the text suffixes listed in lexicographical order, as shown in Figure 1. The suffix array stores one pointer per text position.

The suffix array can be directly built (without building the suffix tree) in $O(n \log n)$ worst-case time and $O(n \log \log n)$ average time. While suffix trees are searched as tries, suffix arrays are binary searched. However, almost every algorithm on suffix trees can be adapted to work on suffix arrays at an $O(\log n)$ penalty factor in the time cost. This is because each subtree of the suffix tree corresponds to an interval in the suffix array, namely the one containing all the leaves of the subtree. To follow an edge of the suffix trie, we use binary search to find the new limits in the suffix array. For example, the internal node 7 (bra) in the suffix tree corresponds to the interval $\langle 6, 7 \rangle$ in the suffix array. Note that implicit nodes have the same interval as their representing explicit node.
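The correspondence between a suffix tree node and a suffix array interval can be illustrated with a short Python sketch. It is not the construction the paper refers to: the array is built by naive sorting (far slower than the bounds quoted above), positions are 0-based, intervals are half-open, and the endmarker "$" is stored as part of the text, so the array has one more entry than the one in Figure 1. The names `build_suffix_array` and `suffix_interval` are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def suffix_interval(text, sa, s):
    """Half-open interval [lo, hi) of sa holding the suffixes that start with s."""
    def first_index(strictly_greater):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(s)]
            if prefix < s or (strictly_greater and prefix == s):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return first_index(False), first_index(True)


text = "abracadabra$"                 # "$" sorts before every letter
sa = build_suffix_array(text)
lo, hi = suffix_interval(text, sa, "bra")
occurrences = sorted(sa[lo:hi])       # 0-based starting positions of "bra"
```

Here `occurrences` lists the leaves below the node for bra, that is, the text positions where bra occurs. Descending one more character in the tree amounts to running the same binary search with a longer string inside the current interval.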

Our Algorithm

Indexing

A straightforward approach to text indexing for approximate string matching using metric space techniques has the problem that, in principle, there are $O(n^2)$ different substrings in a text, and therefore we should index $O(n^2)$ objects, which is unacceptable.

The suffix tree provides a concise representation of all the substrings of a text in $O(n)$ space. So instead of indexing all the text substrings, we index only the (explicit) suffix tree nodes. Therefore, we have $O(n)$ objects to be indexed as a metric space under the edit distance.

Now, each explicit internal node represents itself and the nodes that descend to it by a unary path. Hence, each explicit node that corresponds to a string $xy$, and whose parent corresponds to the string $x$, represents the following set of strings:

$$x[y] = \{x y_1,\ x y_1 y_2,\ \ldots,\ xy\}$$

where $x[y]$ is a notation we have just introduced. For example, the internal node i(4) in Figure 1 represents the strings a[bra] = {ab, abr, abra}.

The leaves of the suffix tree represent a unique text substring and all its extensions until the full text suffix is obtained. Hence, if $T = zxy$ and $x$ is a unique text substring whose proper prefixes are not unique, then the corresponding suffix tree node is an explicit leaf, which for us represents the set $\{x\} \cup x[y]$. Table 1 shows the substrings represented by each node in our running example. Note that the external nodes that descend by the terminator character "$", i.e., e(8) to e(11), represent a substring that is also represented at their parent and hence can be disregarded.

Node    Suffix trie    Suffix tree        Node     Suffix trie/tree
i(0)    ε              ε                  e(1)     abra[cadabra]
i(1)    a              [a]                e(2)     bra[cadabra]
i(2)    ab                                e(3)     ra[cadabra]
i(3)    abr                               e(4)     a[cadabra]
i(4)    abra           a[bra]             e(5)     [cadabra]
i(5)    b                                 e(6)     a[dabra]
i(6)    br                                e(7)     [dabra]
i(7)    bra            [bra]              e(8)     abra
i(8)    r                                 e(9)     bra
i(9)    ra             [ra]               e(10)    ra
                                          e(11)    a

Table 1. The text substrings represented by each node of the suffix trie and tree of Figure 1. Internal nodes are represented as i(x) and externals as e(x).

Hence, instead of indexing all the $O(n^2)$ text substrings, we index $O(n)$ sets of strings, which are the sets represented by the explicit internal and the external nodes of the suffix tree. In our example, this set is U = {ε, [a], a[bra], [bra], [ra], abra[cadabra], bra[cadabra], ra[cadabra], a[cadabra], [cadabra], a[dabra], [dabra]}.

We have now to decide how to index this metric space formed by $O(n)$ sets of strings. Many options are possible, but we have concentrated on a pivot-based approach. We select at random $k$ different text substrings that will be our pivots. For reasons that are made clear later, we choose to select pivots of lengths $0, 1, 2, \ldots, k-1$. For each explicit suffix tree node $x[y]$ and each pivot $p_i$, we compute the distance between $p_i$ and all the strings represented by $x[y]$. From the set of distances from a node $x[y]$ to $p_i$, we store the minimum and maximum ones. Since all these strings are of the form $\{x y_1 \ldots y_j,\ 1 \le j \le |y|\}$, all the edit distances can be computed in $O(|p_i||xy|)$ time.

Following our example, let us assume that we have selected $k = 5$ pivots: $p_0 = \varepsilon$, $p_1 =$ a, $p_2 =$ br, $p_3 =$ cad and $p_4 =$ raca. Figure 2 (left) shows the computation of the edit distances between i(4) = a[bra] and $p_3 =$ cad. The result shows that the minimum and maximum values of this node with respect to this pivot are 2 and 4, respectively.
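The computation for an internal node can be sketched as follows. This is an illustrative Python fragment, not the authors' code: `next_row` and `node_min_max` are hypothetical names, and the node is described by the dynamic programming row of its parent string $x$ plus the edge label $y$. Each character of $y$ adds one row, and the last cell of each row is the distance from the pivot to one of the strings represented by the node.

```python
def next_row(prev_row, ch, pivot):
    """One more row of the matrix: the string so far is extended by ch."""
    row = [prev_row[0] + 1]
    for j, pc in enumerate(pivot, start=1):
        if ch == pc:
            row.append(prev_row[j - 1])
        else:
            row.append(1 + min(prev_row[j], row[j - 1], prev_row[j - 1]))
    return row


def node_min_max(parent_row, y, pivot):
    """Min and max distance from pivot to x y_1, x y_1 y_2, ..., xy.

    parent_row is the row already computed for the parent string x.
    The final row is returned too, so that children can start from it.
    """
    row, dists = parent_row, []
    for ch in y:
        row = next_row(row, ch, pivot)
        dists.append(row[-1])
    return min(dists), max(dists), row


pivots = ["", "a", "br", "cad", "raca"]        # the pivots of the running example
ranges = []
for p in pivots:
    root_row = list(range(len(p) + 1))         # row for the empty string
    _, _, row_a = node_min_max(root_row, "a", p)   # explicit node for "a"
    lo, hi, _ = node_min_max(row_a, "bra", p)      # its child, the node for "abra"
    ranges.append((lo, hi))
```

Starting each node from the row of its parent is what a depth-first traversal of the suffix tree provides for free, so no prefix is processed twice for the same pivot.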

In the case of external suffix tree nodes $x[y]$, the string $y$ tends to be quite long ($O(n)$ length on average), which yields a very high computation time for all the edit distances and anyway a very large value for the maximum edit distance (note that $ed(p_i, xy) \ge |xy| - |p_i|$). We solve this by pessimistically assuming that the maximum distance is $n$ when the suffix tree node is external. The minimum edit distance can be found in $O(|p_i| \max(|p_i|, |x|))$ time, because it is not necessary to consider arbitrarily long strings $x y_1 \ldots y_j$: if we compute the matrix row by row, then after having processed $x$ we have a minimum value seen up to now, $v$. Then there is no point in considering rows $j$ such that $|x| + j - |p_i| > v$. Hence we work until row $j = v + |p_i| - |x|$, which is at most $2|p_i|$.

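[Figure 2: graphic not recoverable. Two dynamic programming matrices with columns c, a, d: on the left, rows a, b, r, a; on the right, rows a, b, r, a, c, a, d, a, b, r, a.]

Fig. 2. The dynamic programming matrix to compute the edit distance between cad and a[bra] (left) or abra[cadabra] (right). The emphasized area is where the minima and maxima are taken from.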

Figure 2 (right) illustrates this case with e(1) = abra[cadabra] and the same $p_3 =$ cad. Note that to compute the new set of edit distances we have started from i(4), which is the parent node of e(1) in the suffix tree. This can always be done in a depth-first traversal of the suffix tree and saves construction time. Note also that it is not necessary to compute the last 4 rows, since they measure the edit distance between strings of length 8 or more against one of length 3. The distance cannot be smaller than 5 and we have found at that point a minimum equal to 4. In fact we just assume that the maximum is 11, so the minimum and maximum value for this external node and this pivot are 4 and 11. In particular, since when indexing external nodes $x[y]$ we always have $ed(p_i, x)$ already computed, they can be indexed in $O(|p_i|^2)$ time.
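The early stop for external nodes can be sketched as below. The code is a hedged illustration: `leaf_min` and `row_for` are hypothetical names, the leaf is described by the first (shortest) string it represents plus the rest of the suffix, and the cutoff uses the running minimum rather than the value right after the first string, which is at least as tight. In a real index the starting row would come from the parent node instead of being recomputed.

```python
def next_row(prev_row, ch, pivot):
    """One more row of the matrix: the string so far is extended by ch."""
    row = [prev_row[0] + 1]
    for j, pc in enumerate(pivot, start=1):
        if ch == pc:
            row.append(prev_row[j - 1])
        else:
            row.append(1 + min(prev_row[j], row[j - 1], prev_row[j - 1]))
    return row


def row_for(s, pivot):
    """Dynamic programming row after processing the whole string s."""
    row = list(range(len(pivot) + 1))
    for ch in s:
        row = next_row(row, ch, pivot)
    return row


def leaf_min(first, rest, pivot):
    """Minimum distance from pivot to first, first+rest[:1], ..., first+rest."""
    row = row_for(first, pivot)
    best = row[-1]
    for j, ch in enumerate(rest, start=1):
        # A string of length |first| + j is at distance >= |first| + j - |pivot|,
        # so once that bound exceeds the best value no later row can improve it.
        if len(first) + j - len(pivot) > best:
            break
        row = next_row(row, ch, pivot)
        best = min(best, row[-1])
    return best


pivots = ["", "a", "br", "cad", "raca"]
minima = [leaf_min("abrac", "adabra", p) for p in pivots]   # leaf for abracadabra
```

For the pivot cad the loop stops after two extra rows, matching the observation that strings of length 8 or more need not be examined.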

Once this is done for all the suffix tree nodes and all the pivots, we have a set of $k$ minimum and $k$ maximum values for each explicit suffix tree node. This can be regarded as a hyperrectangle in $k$ dimensions:

$$x[y] \rightarrow \langle \min(ed(x[y], p_0)), \ldots, \min(ed(x[y], p_{k-1})),\ \max(ed(x[y], p_0)), \ldots, \max(ed(x[y], p_{k-1})) \rangle$$

where we are sure that all the strings in $x[y]$ lie inside the rectangle. In our example, the minima and maxima for i(4) with respect to $p_0$ to $p_4$ are ⟨2,4⟩, ⟨1,3⟩, ⟨1,2⟩, ⟨2,4⟩ and ⟨3,3⟩. Therefore i(4) is represented by the hyperrectangle ⟨2,1,1,2,3, 4,3,2,4,3⟩. On the other hand, the ranges for e(1) are ⟨5,11⟩, ⟨4,11⟩, ⟨3,11⟩, ⟨4,11⟩ and ⟨2,11⟩, and its hyperrectangle is therefore ⟨5,4,3,4,2, 11,11,11,11,11⟩.

Searching

Let us now consider a given query pattern $P$ searched for with at most $r$ errors. This is a range query with radius $r$ in the metric space of the suffix tree nodes. As for pivot-based algorithms, we compare the pattern $P$ against the $k$ pivots and obtain a $k$-dimensional coordinate $(ed(P,p_1), \ldots, ed(P,p_k))$.

Let $p_i$ be a given pivot and $x[y]$ a given node. If it holds that

$$ed(P,p_i) + r < \min(ed(x[y], p_i)) \quad \vee \quad ed(P,p_i) - r > \max(ed(x[y], p_i))$$

then, by the triangle inequality, we know that $ed(P, xy') > r$ for any $xy' \in x[y]$. The elimination can be done using any pivot $p_i$. In fact, the nodes that are not eliminated are those whose rectangle has nonempty intersection with the rectangle $\langle ed(P,p_1) - r, \ldots, ed(P,p_k) - r,\ ed(P,p_1) + r, \ldots, ed(P,p_k) + r \rangle$.

Figure 3 illustrates. The node contains a set of points and we store its minimum and maximum distance to two pivots. These define a 2-dimensional rectangle where all the distances from any substring of the node to the pivots lie. The query is a pattern $P$ and a tolerance $r$, which defines a circle around $P$. After taking the distances from $P$ to the pivots, we create a hypercube (a square in this case) of width $2r+1$. If the square does not intersect the rectangle, then no substring in the node can be close enough to $P$.

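[Figure 3: graphic not recoverable. Labels: the pattern P, the node, the pivots p1 and p2, the distances d1 and d2 from P to the pivots, the axes ed(node,p1) and ed(node,p2), the node rectangle bounded by min1, max1, min2 and max2, and the query square of side 2r+1.]

Fig. 3. The elimination rule using two pivots.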
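The elimination rule is a per-coordinate interval overlap test, which the following Python sketch makes explicit. The function name `survives`, the pattern abr and the tolerance $r = 1$ are illustrative assumptions; the two node rectangles are written out by hand for the running example (internal node for abra, leaf for abracadabra) with pivots ε, a, br, cad and raca.

```python
def edit_distance(x, y):
    """Levenshtein distance with a single row of the matrix."""
    row = list(range(len(y) + 1))
    for i, xc in enumerate(x, 1):
        diag, row[0] = row[0], i
        for j, yc in enumerate(y, 1):
            diag, row[j] = row[j], (diag if xc == yc
                                    else 1 + min(row[j], row[j - 1], diag))
    return row[-1]


def survives(pattern_dists, r, node_min, node_max):
    """True if the query hypercube intersects the node hyperrectangle.

    The node is eliminated as soon as one pivot satisfies
    ed(P,p_i) + r < min_i  or  ed(P,p_i) - r > max_i.
    """
    return all(d + r >= lo and d - r <= hi
               for d, lo, hi in zip(pattern_dists, node_min, node_max))


pivots = ["", "a", "br", "cad", "raca"]
pattern, r = "abr", 1
pattern_dists = [edit_distance(pattern, p) for p in pivots]

internal_box = ([2, 1, 1, 2, 3], [4, 3, 2, 4, 3])      # node for abra
leaf_box = ([5, 4, 3, 4, 2], [11, 11, 11, 11, 11])     # leaf for abracadabra

keep_internal = survives(pattern_dists, r, *internal_box)
keep_leaf = survives(pattern_dists, r, *leaf_box)
```

In this example the internal node survives and must be verified, while the leaf is eliminated by the empty pivot alone, since every string it represents has length at least 5 and the pattern has length 3.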

We have to solve the problem of finding all the $k$-dimensional rectangles that intersect a given query rectangle. This is a classical multidimensional range search problem. We could, for example, use some variant of R-trees, which would also yield a good data structure to work on secondary memory.

Those nodes $x[y]$ that cannot be eliminated using any pivot must be directly compared against $P$. For those whose minimum distance to $P$ is at most $r$, we report all their occurrences, whose starting points are written in the leaves of the subtree rooted by the node that has matched. In our running example, if we are searching for abr with a small tolerance $r$, then node i(4) qualifies (it represents abr itself), so we report the text positions in the corresponding tree leaves: 1 and 8.

Observe that, in order to compare $P$ against a given suffix tree node $x[y]$, the edit distance algorithm forces us to compare it against every prefix of $x$ as well. Those prefixes correspond to suffix tree nodes in the path from the root to $x[y]$. In order not to repeat work, we mark in the suffix tree the nodes that we have to compare explicitly against $P$, and also mark every node in their path to the root. Then, we backtrack on the suffix tree, entering every marked node and keeping track of the edit distance between $P$ and the node. The new row is computed using the row of the parent, just as done with the pivots. This avoids recomputing the same prefixes for different suffix tree nodes, and incidentally is similar to the simplest backtracking approach, except that in this case we only follow marked paths. In this respect, our algorithm can be thought of as a preprocessing to a backtracking algorithm, which filters out some paths.

As a practical matter, note that this is the only step where the suffix tree is required. We can even print the text substrings that match the pattern without the help of the suffix tree, but we need it in order to report all their text positions. For this purpose, a suffix array is much cheaper and does a better job, because all the text positions are listed in a contiguous interval. In fact, the suffix array can also replace the suffix tree at indexing time.

Analysis

Let us first consider the construction cost. The maximum length of a repeated text substring or, which is the same, the average height of the suffix trie, is $O(\log n)$. Recall also that our pivots are of length $O(k)$. Hence computing the minima and maxima for all the $kn$ node-pivot pairs takes time $O(kn \cdot k \log n) = O(k^2 n \log n)$, where $O(k \log n)$ is the time to compute each distance. However, we start the computation of the edit distances of a node from the last row of its parent. This reduces the average construction cost to $O(k^2 n)$, since for each pivot we compute one dynamic programming row per suffix trie node, and on average the number of nodes in the suffix trie is $O(n)$. The $n$ leaves are also computed in $O(|p_i|^2 n) = O(k^2 n)$ time.

The total space required by the data structure is $O(kn)$, since we need to store, for each explicit node, a pointer to the suffix tree and its $k$ coordinates. The suffix tree itself takes $O(n)$ space.

It remains to determine the average search time. A key element of the analysis is a constant $0 < \alpha < 1$, which is the probability that, for a random hyperrectangle of the set, along some fixed coordinate, the corresponding segment of the query hypercube intersects with the corresponding segment of the hyperrectangle. Another way to put it is that, along that coordinate, the query point falls inside the hyperrectangle projection onto that coordinate after it is enlarged in $r$ units along each dimension. In operational terms, $\alpha$ is the probability that some given pivot (that corresponding to the selected coordinate) does not permit discarding a given element. Note that $\alpha$ does not depend on $k$, only on $r$.

The first part of the search is the computation of the edit distances between the $k$ pivots and the pattern $P$ of length $m$. This takes $O(k^2 m)$ time.

The second part is the search for the rectangles that intersect the query rectangle. Many analyses of the performance of R-trees exist in the literature. Although most of them deal with the exact number of disk accesses, their abstract result is that the expected amount of work on the R-tree (and variants such as the KDB-tree) is $O(n \alpha^k \log n)$.

The third part, finally, is the direct check of the pattern against the suffix tree nodes whose rectangles intersect the query rectangle. Since we discard using any of $k$ random pivots, the probability of not discarding a node is $\alpha^k$. As there are $O(n)$ suffix tree nodes, we check on average $n \alpha^k$ nodes, with a total cost of $O(n \alpha^k m^2)$. The $m^2$ is the cost to compute the edit distance between a pattern of length $m$ and a candidate whose length must be between $m - r$ and $m + r$. This is because the pivot $p_0$ removes all shorter or longer candidates.

At the end, we report the $R$ results in $O(R)$ time using a suffix tree traversal.

Hence our total average cost is bounded by $k^2 m + n \alpha^k \log n + n \alpha^k m^2 + R$. This is optimized for a $k$ that is within an additive $O(\log \log n)$ term of $\log_{1/\alpha} n$. If we use $\log_{1/\alpha} n$ pivots, the search cost becomes $O(m \log^2 n + m^2 + R)$ on average. Note that the influence of the search radius $r$ is embedded in $\alpha$. This is a much better complexity than that of all previous work, which obtains $O(m n^{\lambda})$ time for some $0 < \lambda < 1$. Moreover, much of previous work requires $m$ to be at least of the order of $\log n$ to obtain sublinearity, while our approach does not.
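The effect of choosing $\log_{1/\alpha} n$ pivots can be checked by direct substitution into the cost bound. The short derivation below is a worked illustration of that step, treating $\alpha$ as a constant; it is not taken verbatim from the paper.

```latex
\begin{align*}
\alpha^{k} &= \alpha^{\log_{1/\alpha} n} = \frac{1}{n}
  && \text{for } k = \log_{1/\alpha} n, \\
k^{2} m + n\alpha^{k}\log n + n\alpha^{k} m^{2} + R
  &= m \log_{1/\alpha}^{2} n + \log n + m^{2} + R \\
  &= O\!\left(m \log^{2} n + m^{2} + R\right)
  && \text{for constant } \alpha .
\end{align*}
```

With this number of pivots the expected number of surviving nodes $n\alpha^k$ becomes constant, so the cost is dominated by comparing the pattern against the pivots and by the verification of those few candidates.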

The price is in the construction time and space, which become $O(n \log^2 n)$ and $O(n \log n)$, respectively. Especially the latter can be prohibitive, and we may have to content ourselves with a smaller $k$. There seems to be no good tradeoff between space and time; e.g., to obtain $O(n^{\lambda})$ time we also need a number of pivots of the order of $\log n$. Most other indexes require $O(n)$ space and construction time.

Finally, it is worth mentioning that, since we automatically discard any internal node not in the length range $m - r \ldots m + r$ thanks to the pivot $p_0$, there is a worst-case limit $O(\sigma^{m+r})$, where $\sigma$ is the alphabet size, on the number of suffix tree nodes to consider for the last phase. Although this limit is exponential on $m$ and $r$, it is independent of $n$. Other indexing schemes based on the suffix tree share the same property.

Towards a Practical Implementation

Although we have obtained an important reduction in time complexity with respect to $n$ and $m$, our result hides a multiplying factor that depends on the search radius. It is possible that the constant $\alpha$ is too large (that is, too close to 1) and makes the whole approach useless. Also, the extra space requirement, which also increases as $\alpha$ tends to 1, can be unmanageable. In this section we consider an alternative approach that is simpler and likely to obtain better results in practice, despite not involving a complexity breakthrough.

Indexing only Suffixes

A simpler index that derives from the same ideas of the paper considers only the $n$ text suffixes and no internal nodes. Each suffix $[T_{j..}]$ represents all the text substrings starting at $j$, and it is indexed according to the minimum distance between those substrings and each pivot.

The good point of the approach is reduced space. Not only can the set $U$ have up to half the elements of the original approach, but also only $k$ values (not $2k$) are stored for each element, since all the maximum values are the same. This permits using up to four times the number of pivots of the previous approach at the same memory requirement. Note that we do not even need to build or store the suffix array: we just read the suffixes from the text and index them. Our only storage need is that of the metric index.

The bad point is that the selectivity of the pivots is reduced and some redundant work is done. The first is a consequence of storing only minimum values, while the second is a consequence of not factoring out repeated text substrings. That is, if some substring $P'$ of $T$ is close enough to $P$ and it appears many times in $T$, we will have to check all its occurrences one by one.

Without using a suffix tree structure, the construction of the index can be done in time $O(k|p_i|n)$ as follows. The algorithm to compute the edit distance depicted in the Metric Spaces section can be modified so as to make $C_{0,j} = 0$, in which case $C_{i,j}$ becomes the minimum edit distance between $x_{1..i}$ and a suffix of $y_{1..j}$. If $x$ is the reverse of $p_i$ and $y$ the reverse of $T$, then $C_{|p_i|,j}$ will be the minimum edit distance between $p_i$ and a prefix of $T_{n-j+1..}$, which is precisely $\min(ed(p_i, [T_{n-j+1..}]))$. So we need $O(|p_i|n)$ time per pivot. The space to compute this is just $O(|p_i|)$, by doing the computation column-wise.
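A minimal Python sketch of this construction step follows, for one pivot. The function name `prefix_min_dists` is hypothetical and positions are 0-based; the empty prefix is allowed, so no value exceeds the pivot length. The text is scanned right to left, which is the same as scanning its reverse left to right, and only one column of the matrix is kept.

```python
def prefix_min_dists(pivot, text):
    """For every text position j, the minimum edit distance between
    pivot and any prefix of text[j:], in O(|pivot| * |text|) time."""
    x = pivot[::-1]                        # reversed pivot
    col = list(range(len(x) + 1))          # column 0: C[i][0] = i
    out = [0] * len(text)
    for j in range(len(text) - 1, -1, -1): # next character of the reversed text
        yc = text[j]
        new = [0]                          # C[0][j] = 0: a suffix may start anywhere
        for i, xc in enumerate(x, start=1):
            if xc == yc:
                new.append(col[i - 1])
            else:
                new.append(1 + min(col[i], new[i - 1], col[i - 1]))
        col = new
        out[j] = col[-1]                   # value for the suffix starting at j
    return out


dists = prefix_min_dists("cad", "abracadabra")
```

Running this once per pivot yields the $k$ coordinates of every suffix, with working space proportional to the pivot length.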

Using an Index for High Dimensions

The space of strings has a distance distribution that is rather concentrated around its mean $\mu$. The same happens to the distances between a pivot $p_i$ and suffixes $[T_{j..}]$ or the pattern $P$. Since we can only discard suffixes $[T_{j..}]$ such that $ed(p_i, P) + r < \min(ed(p_i, [T_{j..}]))$, only the suffixes with a large $\min(ed(p_i, [T_{j..}]))$ value are likely to be discarded using $p_i$. Storing all the other $O(n)$ distances to $p_i$ is likely to be a waste of space. Moreover, we can use that memory to introduce more pivots. Figure 4 illustrates.

The idea is to fix a number $s$ and, for each pivot $p_i$, store only the $s$ largest $\min(ed(p_i, [T_{j..}]))$ values. Only those suffixes can be discarded using pivot $p_i$. The space of this index is $O(ks)$ and its construction time is unchanged. We can still use an R-tree for the search, although the rectangles will cover all the space except on $s$ coordinates. The selectivity is likely to be similar, since we have discarded uninteresting coordinates, and we can tune the number $k$ versus the selectivity $s$ of the pivots for the same space usage $O(ks)$.

One can go further to obtain $O(n)$ space, as follows. Choose the first pivot and determine its $s$ farthest suffixes. Store a list (in increasing distance order) of those suffixes and their distance to the first pivot, and remove them from further consideration. Then choose a second pivot and find its $s$ farthest suffixes from the remaining set. Continue until every suffix has been included in the list of some pivot. Note that every suffix appears in exactly one list. At search time, compare $P$ against each pivot $p_i$, and if $ed(P, p_i) + r$ is smaller than the smallest (first) distance in the list of $p_i$, skip the whole list. Otherwise, traverse the list until its end or until $ed(P, p_i) + r$ is smaller than the next element. Each traversed suffix must be directly compared against $P$. A variant of this idea has proven extremely useful to deal with concentrated histograms. It also permits an efficient secondary storage implementation, by packing the pivots in disk pages and storing the lists consecutively in the same order of the pivots.

[Figure 4: graphic not recoverable. Horizontal axis: ed(p, [T_{j..}]); marked values: ed(p, P) and ed(p, P) + r.]

Fig. 4. The distance distribution to a pivot p, including that of pattern P. The grayed area represents the suffixes that can be discarded using p.
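The linear-space variant can be sketched in Python as follows. Everything here is an illustrative assumption rather than the authors' implementation: the names `build_lists`, `search` and `prefix_min_dists`, the fixed pivot list (the paper chooses pivots until every suffix is covered), the requirement $s \ge 1$, and the direct verification of each candidate against a window of $m + r$ characters.

```python
def edit_distance(x, y):
    row = list(range(len(y) + 1))
    for i, xc in enumerate(x, 1):
        diag, row[0] = row[0], i
        for j, yc in enumerate(y, 1):
            diag, row[j] = row[j], (diag if xc == yc
                                    else 1 + min(row[j], row[j - 1], diag))
    return row[-1]


def prefix_min_dists(p, text):
    """out[j] = min edit distance between p and any prefix of text[j:]."""
    x, col, out = p[::-1], list(range(len(p) + 1)), [0] * len(text)
    for j in range(len(text) - 1, -1, -1):
        new = [0]
        for i, xc in enumerate(x, 1):
            new.append(col[i - 1] if xc == text[j]
                       else 1 + min(col[i], new[i - 1], col[i - 1]))
        col, out[j] = new, new[-1]
    return out


def build_lists(text, pivots, s):
    """Each pivot keeps its s farthest remaining suffixes, sorted by distance."""
    remaining, lists = list(range(len(text))), []
    for p in pivots:
        if not remaining:
            break
        d = prefix_min_dists(p, text)
        remaining.sort(key=lambda j: d[j])
        far = remaining[-s:]
        del remaining[-s:]
        lists.append((p, [(d[j], j) for j in far]))
    return lists, remaining            # leftovers exist only if pivots ran out


def search(text, lists, leftover, pattern, r):
    candidates = list(leftover)
    for p, lst in lists:
        bound = edit_distance(pattern, p) + r
        for d, j in lst:               # increasing distance order
            if d > bound:              # this suffix and all later ones are discarded
                break
            candidates.append(j)
    m = len(pattern)
    return sorted(j for j in candidates
                  if prefix_min_dists(pattern, text[j:j + m + r])[0] <= r)


text = "abracadabra"
lists, leftover = build_lists(text, ["cadab", "racad", "abrac", "dabra"], s=3)
found = search(text, lists, leftover, "abr", 1)
```

Since each suffix is stored in exactly one list together with one distance, the index holds $n$ entries in total, and a whole list is skipped as soon as its first distance exceeds $ed(P, p_i) + r$.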

Since we choose $k = n/s$ pivots, the construction time is high, namely $O(n^2 |p_i| / s)$. However, the space is $O(n)$, with a low constant in practice that makes it competitive against the most economical structures for the problem. The search time is $O(|p_i| \, m \, n / s)$ to compare $P$ against the pivots, while the time to traverse the lists is difficult to analyze.
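One plausible reading of where these bounds come from is sketched below. It assumes that comparing one pivot against all the text suffixes costs $O(n\,|p_i|)$ and that comparing one pivot against $P$ costs $O(|p_i|\,m)$ by dynamic programming; both per-pivot costs are assumptions made for illustration.

```latex
\begin{align*}
\text{construction} &= k \cdot O(n\,|p_i|)
  = \frac{n}{s} \cdot O(n\,|p_i|)
  = O\!\left(\frac{n^2\,|p_i|}{s}\right), \\
\text{pivot comparisons at search time} &= k \cdot O(|p_i|\,m)
  = \frac{n}{s} \cdot O(|p_i|\,m)
  = O\!\left(\frac{|p_i|\,m\,n}{s}\right).
\end{align*}
```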

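The pivots chosen must not be very short, because their minimum distance to any $[T_{j\ldots}]$ is at most $|p_i|$. In fact, any pivot not longer than $(m + r)/2$ is useless: in that case $ed(p_i, P) + r \ge m - |p_i| + r \ge |p_i|$, so the discarding condition can never hold.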

Using Specific String Properties

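We can complement the information given by the metric index with knowledge of the properties of the strings we are indexing. For example, if suffix $[T_{j\ldots}]$ is proven to be at distance $r + t$ from $P$, then we can also discard the suffixes starting in the range $j - t + 1 \ldots j + t - 1$, since moving the starting position by one character changes the distance by at most one.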
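The neighbor-discarding rule can be checked with a small Python sketch. It assumes the range is read as $j - t + 1 \ldots j + t - 1$ and that the distance of a suffix to $P$ means the minimum edit distance between $P$ and any prefix of that suffix; the helper name `ruled_out_range` is hypothetical.

```python
def prefix_distances(p, t):
    """Return [ed(p, t[:b]) for b = 0..len(t)] (unit-cost edit distance)."""
    row = list(range(len(t) + 1))
    for a, pc in enumerate(p, 1):
        diag, row[0] = row[0], a
        for b, tc in enumerate(t, 1):
            best = min(row[b] + 1, row[b - 1] + 1, diag + (pc != tc))
            diag, row[b] = row[b], best
    return row


def mined(p, suffix):
    # Assumed reading: minimum over all prefixes of the suffix (capped at 2|p|).
    return min(prefix_distances(p, suffix[:2 * len(p)]))


def ruled_out_range(j, t, n):
    """Suffix starts j' with |j' - j| < t, clipped to a text of length n."""
    return range(max(0, j - t + 1), min(n, j + t))


text, P, r = "abracadabra", "cad", 0
for j in range(len(text)):
    t = mined(P, text[j:]) - r               # suffix j is at distance r + t
    if t > 0:
        # Every neighbor in the range is still farther than r from P.
        assert all(mined(P, text[k:]) > r
                   for k in ruled_out_range(j, t, len(text)))
```

In an index, the distance $r + t$ would typically be a lower bound obtained from a pivot rather than an exact value; the same range applies, because a larger true distance only widens what could be discarded.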

Another idea is to compute the edit distance between the reverse pivot and the reverse pattern. Although the result is the same, we also learn the distances between the pivot and the suffixes of the pattern. This can also be useful to discard suffixes at verification time. If $d'$ is the edit distance between the pattern prefix $P_{1 \ldots i}$ and the text matched so far, and we know from the index that the pattern suffix $P_{i+1 \ldots}$ is at distance greater than $r - d'$ from the text that follows, then a match is not possible.
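The reversal trick can be illustrated with a short Python sketch. Running the usual dynamic programming on the reversed strings leaves, in its last row, the distance from the whole pivot to every suffix of the pattern. The function name `pivot_to_pattern_suffixes` is hypothetical, and the sketch only shows how the extra distances fall out of a single computation.

```python
def prefix_distances(p, t):
    """Return [ed(p, t[:b]) for b = 0..len(t)] (unit-cost edit distance)."""
    row = list(range(len(t) + 1))
    for a, pc in enumerate(p, 1):
        diag, row[0] = row[0], a
        for b, tc in enumerate(t, 1):
            best = min(row[b] + 1, row[b - 1] + 1, diag + (pc != tc))
            diag, row[b] = row[b], best
    return row


def edit_distance(a, b):
    return prefix_distances(a, b)[-1]


def pivot_to_pattern_suffixes(pivot, pattern):
    """Return out with out[i] = ed(pivot, pattern[i:]) for i = 0..len(pattern)."""
    m = len(pattern)
    # A prefix of the reversed pattern is a reversed suffix of the pattern,
    # and reversing both strings does not change their edit distance.
    last_row = prefix_distances(pivot[::-1], pattern[::-1])
    return [last_row[m - i] for i in range(m + 1)]


out = pivot_to_pattern_suffixes("kitten", "sitting")
print(out[0])    # 3, the same value as ed("kitten", "sitting")
```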

Other ideas, such as hybrid algorithms that partition the pattern and search for the pieces permitting fewer errors, can be implemented over our metric index instead of over a suffix tree or array. Indeed, our data structure should compete in the area of backtracking algorithms, as the others are orthogonal.

Conclusions

We have presented a novel approach to the approximate string matching problem. The idea is to give the set of text substrings the structure of a metric space and then use an algorithm for range queries on metric spaces. The suffix tree is used as a conceptual device to map the $O(n^2)$ text substrings to $O(n)$ sets of strings. We show analytically that the search complexity is better than that obtained in previous work, at the price of slightly higher space usage. More precisely, we can search at an average cost of $O(m \log^2 n + m^2 + R)$ time using $O(n \log n)$ space, while by using a suffix tree (the best known technique) one can search in $O(m n^{\lambda})$ time, for $0 < \lambda < 1$, using $O(n)$ space. Moreover, our technique can be extended to any other distance function among strings, some of which, like the reversals distance, are problematic to handle with the previous approaches.

The proposal opens a number of possibilities for future work. We plan to explore other methods to reduce the number of substrings (we have used the suffix tree nodes and the suffixes), other metric space indexing methods (we have used pivots), other multidimensional range search techniques (we have used R-trees), other pivot selection techniques (we took them at random), etc. A more practical setup needing $O(n)$ space has been described in an earlier section.

Finally, the method promises an efficient implementation on secondary memory (e.g., with R-trees), which is a weak point in most current approaches.
