A Metric Index for Approximate String Matching *

Edgar Chávez$^1$ and Gonzalo Navarro$^2$

$^1$ Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana. Edificio B, Ciudad Universitaria, Morelia, Mich., México. elchavez@fismat.umich.mx. On leave of absence at Universidad de Chile.
$^2$ Depto. de Ciencias de la Computación, Universidad de Chile. Blanco Encalada, Santiago, Chile. gnavarro@dcc.uchile.cl

Abstract. We present a radically new indexing approach for approximate string matching. The scheme uses the metric properties of the edit distance and can be applied to any other metric between strings. We build a metric space where the sites are the nodes of the suffix tree of the text, and the approximate query is seen as a proximity query on that metric space. This lets us find the $R$ occurrences of a pattern of length $m$ in a text of length $n$ in average time $O(m \log^2 n + m^2 + R)$, using $O(n \log n)$ space and $O(n \log^2 n)$ index construction time. This complexity improves by far over all previous methods. We also show a simpler scheme needing $O(n)$ space.

Introduction and Related Work

Indexing text to permit efficient approximate searching on it is one of the main open problems in combinatorial pattern matching. The approximate string matching problem is: given a long text $T$ of length $n$, a (comparatively short) pattern $P$ of length $m$, and a threshold $r$, retrieve all the pattern occurrences, that is, text substrings whose edit distance to the pattern is at most $r$. The edit distance between two strings is defined as the minimum number of character insertions, deletions and substitutions needed to make them equal. This distance is used in many applications, but several other distances are of interest.

In the online version of the problem, the pattern can be preprocessed but the text cannot. There are numerous solutions to this problem, but none is acceptable when the text is too long, since the search time is proportional to the text length. Indexing text for approximate string matching has received attention only
recently. Despite some progress in the last decade, the indexing schemes for this problem are still rather immature. There exist some indexing schemes specialized to word-wise searching on natural language text. These indexes perform quite well in that case, but they cannot be extended to handle the general case. Extremely important applications such as DNA, proteins, music or oriental languages fall outside this case.

* Supported by CYTED VII RIBIDI project (both authors), CONACyT grant (first author) and Fondecyt grant (second author).

The indexes that solve the general problem can be divided into three classes. Backtracking uses the suffix tree, suffix array or DAWG of the text in order to factor out its repetitions. A sequential algorithm on the text is simulated by backtracking on the data structure. These algorithms take time exponential in $m$ or $r$ but in many cases independent of $n$, the text size. This makes them attractive when searching for very short patterns. Partitioning splits the pattern into pieces to ensure that some of the pieces must appear without alterations inside every occurrence. An index capable of exact searching is used to detect the pieces, and the text areas that have enough evidence of containing an occurrence are checked with a sequential algorithm. These algorithms work well only when $r/m$ is small. The third class is a hybrid between the other two. The pattern is divided into large pieces that can still contain (fewer) errors, they are searched for using backtracking, and the potential text occurrences are checked as in the partitioning methods. The hybrid algorithms are more effective because they can find the right balance between the length of the pieces to search for and the error level permitted. Using the appropriate partition of the pattern, these methods achieve on average $O(n^{\lambda})$ search time, for some $0 < \lambda < 1$ that depends on $r$. They tolerate moderate error ratios $r/m$.

We propose in this paper a brand new approach to the problem. We take into account that the edit
distance satisfies the triangle inequality and hence it defines a metric space on the set of text substrings. We can re-express the approximate search problem as a range search problem on this metric space. This approach has been attempted before, but in those cases the particularities of the problem made it possible to index $O(n)$ elements. In the general case we have $O(n^2)$ text substrings.

The main contribution of this paper is to devise a method (based on the suffix tree of the text) to meaningfully collapse the $O(n^2)$ text substrings into $O(n)$ sets, and to find a way to build a metric space out of those sets. The result is an indexing method that, at the cost of requiring on average $O(n \log n)$ space and $O(n \log^2 n)$ construction time, permits finding the $R$ approximate occurrences of the pattern in $O(m \log^2 n + m^2 + R)$ average time. This is a complexity breakthrough over previous work, and it is easier than in other approaches to extend the idea to other distance functions such as reversals. Moreover, it represents an original approach to the problem that opens a vast number of possibilities for improvement. We also consider a simpler version of the index that needs $O(n)$ space and that, despite not involving a complexity breakthrough, promises to be better in practice.

We use the following notation in the paper. Given a string $s$, we denote its length as $|s|$. We also denote by $s_i$ the $i$-th character of $s$, for an integer $i \in \{1..|s|\}$. We denote $s_{i..j} = s_i s_{i+1} \ldots s_j$ (which is the empty string if $i > j$) and $s_{i..} = s_{i..|s|}$. The empty string is denoted as $\varepsilon$. A string $x$ is said to be a prefix of $xy$, a suffix of $yx$ and a substring of $yxz$.

Metric Spaces

We describe in this section some concepts related to searching metric spaces. We have concentrated only on the part that is relevant for this paper. There exist recent surveys if more complete information is desired.

A metric space is, informally, a set of black-box objects and a distance function defined among them, which satisfies the triangle inequality. The problem of proximity
searching in metric spaces consists of indexing the set such that later, given a query, all the elements of the set that are close enough to the query can be quickly found. This has applications in a vast number of fields, such as non-traditional databases (where the concept of exact search is of no use and we search for similar objects, e.g. databases storing images, fingerprints or audio clips); machine learning and classification (where a new element must be classified according to its closest existing element); image quantization and compression (where only some vectors can be represented and those that cannot must be coded as their closest representable point); text retrieval (where we look for documents that are similar to a given query or document); computational biology (where we want to find a DNA or protein sequence in a database allowing some errors due to typical variations); function prediction (where we want to search for the most similar behavior of a function in the past so as to predict its probable future behavior); etc.

Formally, a metric space is a pair $(X, d)$, where $X$ is a universe of objects and $d : X \times X \to \mathbb{R}^+$ is a distance function defined on it that returns nonnegative values. This distance satisfies the properties of reflexivity ($d(x,x) = 0$), strict positiveness ($x \neq y \Rightarrow d(x,y) > 0$), symmetry ($d(x,y) = d(y,x)$) and triangle inequality ($d(x,y) \le d(x,z) + d(z,y)$). A finite subset $U$ of $X$, of size $n = |U|$, is the set of objects we search.

Among the many queries of interest on a metric space, we are interested in the so-called range queries: given a query $q \in X$ and a tolerance radius $r$, find the set of all elements in $U$ that are at distance at most $r$ to $q$. Formally, the outcome of the query is $(q, r)_d = \{u \in U : d(q,u) \le r\}$. The goal is to preprocess the set so as to minimize the computational cost of producing the answer $(q, r)_d$.

From the plethora of existing algorithms to index metric spaces, we focus on the so-called pivot-based ones, which are built on a single general idea: select $k$ elements $\{p_1, \ldots, p_k\}$ from $U$ (called pivots), and identify each element $u \in U$
with a $k$-dimensional point $(d(u,p_1), \ldots, d(u,p_k))$ (i.e. its distances to the pivots). The index is basically the set of $kn$ coordinates. At query time, map $q$ to the $k$-dimensional point $(d(q,p_1), \ldots, d(q,p_k))$. With this information at hand, we can filter out, using the triangle inequality, any element $u$ such that $|d(q,p_i) - d(u,p_i)| > r$ for some pivot $p_i$, since in that case we know that $d(q,u) > r$ without need to evaluate $d(u,q)$. Those elements that cannot be filtered out using this rule are directly compared against $q$.

An interesting feature of pivot-based algorithms is that they can reduce the number of final distance evaluations by increasing the number of pivots. Define $D_k(x,y) = \max_{1 \le j \le k} |d(x,p_j) - d(y,p_j)|$. Using the pivots $p_1, \ldots, p_k$ is equivalent to discarding elements $u$ such that $D_k(q,u) > r$. As more pivots are added we need to perform more distance evaluations (exactly $k$) to compute $D_k(q, \cdot)$ (these are called internal evaluations), but on the other hand $D_k(q, \cdot)$ increases its value and hence it has a higher chance of filtering out more elements (those comparisons against elements that cannot be filtered out are called external). It follows that there exists an optimum $k$.

If one is not only interested in the number of distance evaluations performed but also in the total CPU time required, then scanning all the $n$ elements to filter out some of them may be unacceptable. In that case, one needs multidimensional range search methods, which include data structures such as the kd-tree, R-tree, X-tree, etc. Those structures permit indexing

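The pivot-based range query described in the Metric Spaces section can be illustrated with a minimal sketch. It assumes the edit distance as the metric $d$ and a small in-memory list of strings as the set $U$. The names `edit_distance`, `PivotIndex` and `range_query`, the sample words and the choice of pivots are all illustrative assumptions, not taken from the paper. The sketch scans all $n$ elements linearly and is not the suffix-tree-based index the paper goes on to build.

```python
def edit_distance(a, b):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


class PivotIndex:
    """Hypothetical pivot table: stores d(u, p_j) for every element u and pivot p_j."""

    def __init__(self, elements, pivots, dist=edit_distance):
        self.elements = list(elements)
        self.pivots = list(pivots)
        self.dist = dist
        # The index is the set of k*n coordinates.
        self.table = [[dist(u, p) for p in self.pivots] for u in self.elements]

    def range_query(self, q, r):
        """Return (matches, external), where matches are the elements at
        distance at most r from q and external counts the direct comparisons."""
        # k internal evaluations: map q to its k-dimensional point.
        q_coords = [self.dist(q, p) for p in self.pivots]
        matches, external = [], 0
        for u, u_coords in zip(self.elements, self.table):
            # Discard u if |d(q, p_j) - d(u, p_j)| > r for some pivot,
            # i.e. if D_k(q, u) > r; the triangle inequality then gives d(q, u) > r.
            if any(abs(dq - du) > r for dq, du in zip(q_coords, u_coords)):
                continue
            external += 1  # u survived the filter: compare it directly
            if self.dist(q, u) <= r:
                matches.append(u)
        return matches, external


words = ["survey", "surgery", "sunday", "saturday", "metric", "matrix", "index"]
index = PivotIndex(words, pivots=["survey", "metric"])
matches, external = index.range_query("surgey", 1)
print(matches, external)
```

Because the filter only discards elements that the triangle inequality proves to be farther than $r$, the result is always the same as a brute-force scan; the pivots only change how many external evaluations are needed.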