Procesamiento del Lenguaje Natural, núm. 43 (2009), pp. 57-64 recibido 1-05-2009; aceptado 5-06-2009

Fast approximate string matching with finite automata

Rápida búsqueda aproximada con autómatas de estado finito

Mans Hulden
University of Arizona
[email protected]

Resumen: En este artículo se presenta un algoritmo eficiente para, dada una cadena de caracteres, extraer las cadenas más cercanas de un autómata de estados finitos según alguna métrica de distancia. El algoritmo puede ser adaptado con el fin de beneficiarse de una variedad de métricas para determinar la similitud entre palabras.

Palabras clave: búsqueda aproximada, autómatas

Abstract: We present a fast algorithm for finding approximate matches of a string in a finite-state automaton, given some metric of similarity. The algorithm can be adapted to use a variety of metrics for determining the distance between two words.

Keywords: approximate search, finite automata

1 Introduction

In this paper we shall present a promising approach to a classic search problem: given a single word w and a large set of words W, quickly deciding which of the words in W most closely resembles w, measured by some metric of similarity, such as minimum edit distance.

Performing this type of search quickly is important for many applications, not least in natural language processing. Spelling correctors, Optical Character Recognition applications, and syntactic parsers, among others, all rely on quick approximate matching of some string against a pattern of strings.

The standard edit distance algorithm given in most textbooks on the subject does not scale well to large search problems. Certainly, calculating the edit distance between two individual words can be done quickly, even with different costs for different character substitutions and insertions. The ubiquitous textbook algorithm for finding the minimum edit distance (MED) between two words, based on dynamic programming, takes quadratic time in the length of the longer word.

However, finding the closest match between an input word w and a list of, say, 1,000,000 words is a much more demanding task. The strategy of calculating the edit distance between every word on the list and the input word is likely to take too much time. Naturally, if we perform the search by calculating the MED between each word on a list and our target word w, there are a number of possible optimizations that can be made, such as aborting the calculation for a word pair as soon as the MED comparison exceeds the lowest MED encountered so far. Even so, the search space remains too large for this strategy to be of practical use.

1.1 A more general problem

In addressing this problem, we shall investigate a much more general problem: that of finding the closest approximate match between a word w and the words encoded in a finite-state automaton A. The problem is more general because a finite automaton can not only encode a finite number of words (as a deterministic acyclic automaton), but an infinite number of words, or a 'pattern.'

Considering the problem of finding an approximate match to word w against some automaton A instead of a word list L has many advantages. First, a finite word list can always be converted into an acyclic deterministic finite automaton. This means that a good solution to the problem of matching against an automaton is also a good solution to the word list case. Second, a finite automaton can represent an infinite number of strings—i.e. a pattern of strings—by virtue of the fact that it can contain cycles.
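As background, the quadratic dynamic-programming MED computation mentioned above can be sketched in a few lines of Python. This is a minimal illustration with adjustable operation costs; the function name is ours, and this is not the paper's implementation:

    def med(a, b, inscost=1, delcost=1, substcost=1):
        """Textbook minimum edit distance: fills a (len(a)+1) x (len(b)+1)
        table, hence quadratic time and space in the word lengths."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * delcost
        for j in range(1, len(b) + 1):
            d[0][j] = j * inscost
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = 0 if a[i - 1] == b[j - 1] else substcost
                d[i][j] = min(d[i - 1][j] + delcost,      # delete a[i-1]
                              d[i][j - 1] + inscost,      # insert b[j-1]
                              d[i - 1][j - 1] + sub)      # match or substitute
        return d[len(a)][len(b)]

    print(med("dat", "cat"))    # 1: a single substitution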


Some natural languages, for instance, allow very long compound words, the patterns of which can be compactly modeled as a cyclic automaton. Third, natural language morphological analyzers are often implemented as finite-state transducers (Beesley and Karttunen, 2003). Creating a deterministic minimal finite-state automaton by extracting the domain or range from a finite-state transducer is trivial: one simply ignores the input or the output labels, and determinizes and minimizes the resulting automaton. This means, for instance, that if one has access to a morphology of a language represented as a transducer, that morphology can easily be used as a spelling corrector.

2 Solution based on informed search

Most previous solutions to this problem have been based on simple brute-force breadth-first or depth-first search through the automaton A, comparing paths in the automaton with the word at hand and aborting searches on a path as soon as the required number of changes in the word at a point in the search reaches some specified cutoff.¹

Our observation, however, has been that finite-state automata contain useful information that can be extracted cheaply and used profitably as a reliable guide in the search process, avoiding a brute-force strategy and the exponential growth of the search complexity it entails. In particular, as we shall show, one piece of information that can be extracted from an automaton with little effort is knowledge about what kinds of symbols can be encountered in the future. That is, for each state in an automaton A, we can extract information about all the possible future symbols that can be encountered for the next n steps, given any n.

¹The most prominent ones given in the literature are Oflazer (1996), who presents a depth-first search algorithm, and Schulz and Mihov (2002), which is essentially the same algorithm.

2.1 A*-search

When performing searches with some type of additional information to guide the search process, the preferred family of algorithms to use is usually some variant of the A*-algorithm (Hart, Nilsson, and Raphael, 1968). This was an obvious choice for us as well. The additional information we use—the possible symbols that can be seen in future paths that extend out from some state in the automaton—can function as a guess about profitable paths to explore, and thus fits well in the A*-paradigm.

The A*-algorithm essentially requires that we have access to two types of costs during our search: the cost of the path in a partial exploration (g), and a guess about the future cost (h), which may not be an overestimate if the heuristic is to be admissible. At every step of choosing which partial path to expand next, we take into account the combined actual cost so far (g) and the guess (h), yielding f = g + h.

Searching an automaton for matches against a word with edit distance naturally yields g, which is the number of changes made so far in reaching a state s in the automaton while comparing against the word w. The guess h is based on the heuristic we already briefly introduced.
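In code, this agenda discipline is nothing more than a priority queue keyed on f; a minimal sketch of the bookkeeping (the tuple layout is our own choice, not the paper's):

    import heapq

    agenda = []                   # the A* agenda: a min-heap ordered on f
    g, h = 1, 2                   # cost accrued so far, admissible guess
    heapq.heappush(agenda, (g + h, g, "state-and-position payload"))
    f, g, payload = heapq.heappop(agenda)   # cheapest f = g + h pops first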
2.2 The search algorithm

For our initial experiments, we have considered the restricted problem of finding the shortest edit distance between w and the paths in an automaton A. This is the case of MED where insertion, substitution, and deletion all cost 1 unit. However, the algorithm can easily be extended to capture varying insertion, substitution, and deletion costs through so-called confusion matrices, and even to context-dependent costs.

Essentially, the search begins with a single node represented by the start state and the word at position 0 (the starting position). We then consider each possible edge in the automaton from the state v. If the edge matches the symbol at the current word position, we create a new node by advancing to the target state v′, advancing the word position counter pos by 1, recalculating the costs f = g + h, and storing this as a new node on the agenda marked as x:x (which indicates no change was made). We also consider the cases where we insert a symbol (0:x), delete a symbol (x:0), and (if the edge currently inspected does not match the symbol in the word at position pos) substitute a symbol (x:y) with another one. When we are done with the node, we find the node with the lowest score so far, expand that node, and keep going until we find a solution. See figure 1 for a partially expanded search of the word dat against the automaton depicted in figure 3.
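For concreteness, the following self-contained Python sketch illustrates this expansion loop with unit costs and a zero heuristic. The toy automaton and all names are our own illustration, not the paper's C implementation; the per-state heuristic of section 3 would replace the trivial h below, and no duplicate detection is done, which is adequate for the acyclic toy:

    import heapq

    # A hypothetical toy automaton: state -> [(symbol, target state)].
    # It accepts the words "dat" and "cat".
    EDGES = {0: [("d", 1), ("c", 2)], 1: [("a", 3)], 2: [("a", 3)], 3: [("t", 4)]}
    FINAL = {4}

    def h(state, word, pos):
        return 0          # placeholder; section 3 supplies a real estimate

    def push(agenda, g, state, word, pos):
        heapq.heappush(agenda, (g + h(state, word, pos), g, state, pos))

    def search(word):
        """Expand cheapest-f nodes until a final state consumes the word."""
        agenda = [(h(0, word, 0), 0, 0, 0)]          # (f, g, state, pos)
        while agenda:
            f, g, state, pos = heapq.heappop(agenda)
            if state in FINAL and pos == len(word):
                return g                             # cost of closest match
            for sym, target in EDGES.get(state, []):
                if pos < len(word) and word[pos] == sym:
                    push(agenda, g, target, word, pos + 1)      # match (x:x)
                elif pos < len(word):
                    push(agenda, g + 1, target, word, pos + 1)  # subst (x:y)
                push(agenda, g + 1, target, word, pos)          # insert (0:x)
            if pos < len(word):
                push(agenda, g + 1, state, word, pos + 1)       # delete (x:0)
        return None

    print(search("dag"))   # 1: one substitution away from "dat"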


[Figure 1: Initial steps in searching for the approximate match against the word dat and the automaton depicted in figure 3. The node with the outgoing arrow will be expanded next. To reach that node from the initial node, we moved from state 0 to state 3 in the automaton, and thus had to substitute a d in the input for a c in the automaton, costing one unit, reflected in the g-value. The h-value for that node is 0 since all the subsequent letters in the remainder of the word (a and t) are in the state's symbol table. For this illustration, we assume the heuristic h(∞), seen in figure 3.]

[Figure 2: An automaton where information about the possible symbols of future paths of length n (for the case n = 2) is stored in each state.]

[Figure 3: An automaton where information about the possible symbols of future paths of length n (for the case n = ∞) is stored in each state. Calculating this for the set of states is accomplished in linear time by Algorithm 2.1.]

3 The heuristics

There are two obvious criteria for creating a useful heuristic h for the approximate search through a graph: the heuristic must be fast to calculate, either beforehand or on the fly, and, if calculated prior to the search, it must be compact—i.e. take up a linear amount of space proportional to the number of states in the A we want to search.

As already mentioned, we have chosen as our heuristic function to store, for each state, a list of symbols that can be seen somewhere along a path of length n starting from that state. Figures 2 and 3 illustrate this with two acyclic finite automata that encode a number of words, and where every state has been marked with information about all possible future symbols n steps ahead, where n is 2 (figure 2) and ∞ (figure 3).
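In an implementation, each per-state symbol list can be stored as a set, or packed into a bit vector of |Σ| bits per state; a hypothetical sketch (the state numbers and symbol sets below are illustrative, not read off the actual figures):

    # SYMS[q] = every symbol that can occur within the next n steps from q.
    SYMS = {
        0: frozenset("acdfgorst"),   # near the start, most symbols can follow
        3: frozenset("at"),          # deeper in, few symbols remain possible
        4: frozenset(),              # nothing follows a final sink state
    }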


During the search of the graph this list is consulted and compared against the symbols in the current word that we have yet to match. The discrepancies between the symbols stored in the state and the symbols yet to be matched in the current word are used as our cost heuristic. That is, for each of a maximum of n symbols remaining to be matched in the word w at the current position of the search, for any symbol not found stored in the current state, we add a cost of 1 to the heuristic function h. This reflects an estimate (which is correct) that at some time in the future any symbols not present in the path we are exploring will have to be considered either substitutions or deletions with respect to the input word.

For example, suppose we were matching the word cagr at position 1, i.e. the symbols remaining to be matched are agr, and suppose we are in the state marked with an asterisk in figures 2 and 3. Using our heuristic function where n is 2 (figure 2) gives us an estimated h-value of 1, since the two following symbols to be matched are a and g, and the state marked with an asterisk does not contain g. For the same position, using the heuristic where n is ∞ (figure 3), the h-value becomes 2, since neither g nor r is found in subsequent paths, as is seen from the list of symbols stored in the state.
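In code, this estimate is a single pass over the next n unmatched symbols. The function name is ours; the symbol sets passed in below are the ones the worked example reads off figures 2 and 3:

    def h_estimate(syms, word, pos, n=None):
        """Count the next n unmatched symbols of word that cannot occur on
        any path of length n out of the current state; each such symbol
        will eventually cost one substitution or deletion."""
        remaining = word[pos:] if n is None else word[pos:pos + n]
        return sum(1 for c in remaining if c not in syms)

    # Matching 'cagr' at position 1 against the asterisk-marked state:
    print(h_estimate(frozenset("at"), "cagr", 1, n=2))      # 1 (figure 2)
    print(h_estimate(frozenset("ast"), "cagr", 1, n=None))  # 2 (figure 3)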
3.1 Consistency of h

It is easy to see that h for any n is an admissible heuristic for A*-search in that it never overestimates the cost to reach the goal.² Certainly, if there is a discrepancy of i symbols between future paths and the remainder of the word for some number of steps, those symbols will have to be produced by insertion or substitution (each of which costs 1 in our model), and hence h cannot overestimate the cost to the goal.³

This means that goals in the search will be found in order of cost, which is of course desirable, since we know that the first solution we find is the shortest one. Naturally, we can keep exploring and find more solutions if desired, in increasing order of cost. For a spell checking application, for instance, we would keep the search going until either we reach some cutoff where the cost has gotten too high, or we find some desired number of approximate matches to suggest as corrected spellings.

²See e.g. Pearl (1984) for extensive discussions about what kinds of heuristics are suitable for A*.

³The estimate h is also consistent in that it fulfils the triangle inequality h(p) ≤ cost(p, a, p′) + h(p′), i.e. the estimated cost to the goal is never larger than the combined actual cost to any point p′ reachable from p and the new estimated cost from p′ to the goal.

4 Precalculating h

For any serious application we will want to precalculate, for each state, the symbols that can be subsequently matched along some path of length n.

For any finite n this is a straightforward task: for each state we simply perform a depth-first search with a depth threshold of n states, and store all symbols we encounter on an edge in the state from which we started the search. This procedure is given in Algorithm 2.2.
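The finite-n precalculation is thus a bounded DFS from every state; here is a Python sketch under our toy encoding from section 2.2 (names are ours; no memoisation, since the depth bound n is small in practice):

    def mark_finite(edges, n):
        """Store, at each state, every symbol on some path of length <= n."""
        states = set(edges) | {t for out in edges.values() for _, t in out}
        syms = {q: set() for q in states}

        def dfs(head, q, depth):
            if depth == n:
                return
            for sym, target in edges.get(q, []):
                syms[head].add(sym)       # record at the state we started from
                dfs(head, target, depth + 1)

        for q in states:
            dfs(q, q, 0)
        return syms

    toy = {0: [("d", 1), ("c", 2)], 1: [("a", 3)], 2: [("a", 3)], 3: [("t", 4)]}
    print(mark_finite(toy, 2)[0])         # state 0 sees d, c, a in two steps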


The case where n = ∞ is more difficult. Here we want to know, for each state, all the symbols that can possibly be encountered in the future, no matter how long the path. If we knew that the automaton we are dealing with were acyclic, this could be solved—although at the cost of some computation time—by the same algorithm as for finite n, by limiting searches to the number of states in A. Since we do not want to limit ourselves to searching acyclic graphs only, we have developed a separate algorithm for marking future symbols on the states in the case where n = ∞.

4.1 The special case n = ∞

The solution for this case is based on the observation that we can divide an automaton into its strongly connected components—groups of states where each state has a path to each other state in the group. Naturally, all states in the same strongly connected component (SCC) share the same future symbols for n = ∞.

Interestingly, we do not need to split the graph into its SCCs first and then mark the states. We can in fact do both at the same time—calculate the SCCs of the graph and mark each state with the possible symbols that can follow. We do so by an adaptation of the well-known algorithm by Tarjan (1972) for dividing a graph into SCCs. The core of Tarjan's original algorithm is to perform, by recursion, a depth-first search (DFS) on the graph, while keeping a separate stack to store the vertices of the search performed so far. In addition, each vertex is given an index (low) to mark when it was discovered.⁴

⁴A very thorough analysis of this remarkably simple algorithm is given in Aho, Hopcroft, and Ullman (1974).

We now turn to the question of adapting this algorithm to mark each state with all its possible future symbols, presented in Algorithm 2.1. The crux of marking the states at the same time as performing the SCC DFS is that a) for each edge between v and v′ discovered, the parent vertex v inherits the symbols at v′ as well as the edge symbol e.sym used to get from v to v′, and b) after we encounter the root of an SCC, we copy the properties of the root vertex v to all the child vertices v′.

5 Choosing n

Having various possible n at our disposal to use as the guiding heuristic in the search for solutions, the question of which n, or which combination of n-values, to use as a heuristic is largely an empirical one. Our tests indicate that using a combination of n = 2 and n = ∞ is far superior to any other combination of n-values. When used in combination, the algorithm always chooses the larger of the two guesses, Max(h(2), h(∞)), as the h-value.

Naturally, using two heuristics requires two symbol vectors to be stored for each state, where each vector takes up |Σ| bits of space, for a sum total of 2|Σ||A| bits, where |A| is the number of states in the automaton and |Σ| the alphabet size of the automaton. Given that word lists can be compressed into a very small number of states, the combined h is probably the best choice even though it takes up twice as much storage as a single heuristic. If a single heuristic is needed because of space concerns, our results indicate that h(∞) is the best overall choice, although for applications where a small number of errors is expected, such as spelling correction, n = 2 or n = 3 may be slightly better.

Table 1 shows the number of nodes expanded as well as the number of nodes inserted on the agenda of the search graph (by Algorithm 2.3) for 100 randomly chosen misspelled words (all with shortest edit distance 3 or less), using various combinations of h, against a dictionary of 556,368 words. We used these results (because they most likely reflect average use of the algorithm) to settle on the combination Max(h(2), h(∞)) as our best heuristic h.

         h0      h1      h2      h3      h4      h5
    NI   6092    1892    1548    1772    1904    622
    NE   3295    1143    909     1049    1193    89

Table 1: Average number of nodes inserted on the agenda (NI) and nodes expanded (NE) when performing a search of 100 randomly perturbed words against a dictionary of Spanish containing 556,368 words, encoded as a finite-state automaton. Each search terminated after finding the 5 closest matches. The average length of the words was 6.6 letters and the average edit distance to each match was 2.18. The different heuristics used were h0: no heuristic, i.e. h is always zero; h1: only consider n = ∞; h2: only consider n = 2; h3: only consider n = 3; h4: only consider n = 4; and h5: the heuristic was max(n = 2, n = ∞) and ties were resolved in the priority queue by giving priority to the highest word position. The priority queue strategy for all except h5 was purposely adversarial to the search: i.e. nodes were extracted LILO in the case of f-score ties. Note that the nodes expanded measure is much more important, since the time to insert a node in a priority queue is generally constant, while the time taken to extract the minimum is on the order of log(n), with n being the number of elements in the queue.

5.1 Priority queue strategy

The number of nodes explored before finding solutions is affected not only by the particular heuristic used, but also by the choice of tiebreaking in the search strategy. In many cases, we will find unexpanded fringe nodes in the priority queue with the exact same f-value. How a node is chosen for expansion in the case of ties has great bearing on the speed of the search as well, as is seen in table 1.
In our experiments, the strategy of breaking ties so that the node that has advanced farthest in the word to be matched is given priority has proven to be far superior to any of the other tiebreaking strategies we explored. The reason for this seems natural: if we are exploring possible approximate matches, then, other things being equal (i.e. the f-score), one should follow the path that has moved the farthest toward the end of the word.
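With tuple-keyed heaps this tiebreak is essentially free: order nodes on (f, -pos), so that among equal-f nodes the one farthest along the word is extracted first (a sketch in our tuple layout, not the paper's data structure):

    import heapq

    agenda = []
    heapq.heappush(agenda, (3, -2, "node at word position 2"))
    heapq.heappush(agenda, (3, -5, "node at word position 5"))
    f, negpos, node = heapq.heappop(agenda)
    print(node)    # the f=3 node at position 5 is expanded first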


    foma[0]: regex Spanish2;
    47432 states, 126535 arcs, 556368 paths.
    foma[1]: apply med
    apply med> wuighuiwrwegfwfw
    -sig-uiér-e--mos
    wuighuiwrwegfwfw
    Cost[f]: 10
    elapsed time: 0.100000s
    Nodes inserted: 56293
    Nodes expanded: 20549
    apply med>

Figure 4: Example search against a dictionary in our implementation.

[Figure 5: Average wall clock times in our implementation of the new algorithm for finding minimum edit distances for 1000 words, compared to running a Schulz and Mihov (2002) search on the same words. The 1000 words consisted of 100 words of each edit distance, i.e. 100 words in each group 1 through 10. The automaton which the randomly generated words were matched against was the FreeLing Spanish dictionary of 556,368 words, which we first converted into a deterministic acyclic automaton.]

6 Results

We have implemented the algorithms described as an extension to the freely available finite-state toolkit foma (Hulden, 2009). The algorithms were written in C. We tested them against various automata, including a lexicon of English (represented as a cyclic finite-state automaton) and several acyclic lexicons of Spanish of between 500,000 and 1,000,000 words. The timing tests in figure 5 were achieved by generating 100 random words of each minimum edit distance 1 through 10 from one of our lexica, and then searching for the closest match with the algorithm. Although discovery of closest matches of edit distance more than 4 is probably not useful for spelling correction applications, other applications (such as Optical Character Recognition) may need to find closest matches that take into account far more perturbations than spelling correction applications need. For instance, figure 4 depicts our interface in the search of a word that results in a match of edit distance 10, where clearly the match would not be useful for spelling correction, since the input word is mangled beyond recognition from its closest match in the dictionary.

7 Discussion and further work

The main result that the algorithm points to is that an informed search strategy such as we have outlined above is clearly superior to any uninformed strategy when searching word graphs such as finite automata for approximate matches. In particular, the heuristic of the strategy we have used is fast to compute ahead of time—taking only linear time and space in the size of the automaton—and often results in a dramatic reduction of the search space that needs to be explored in order to find approximate matches. For applications such as spelling correction, the method is very space- and time-efficient, since one can encode large finite lists of words quite compactly as a finite automaton. Finding matches for minimum edit distance 5 in all our experiments against large dictionaries never took more than 70 milliseconds. For edit distances 1–3, the granularity of the system clock (one millisecond) was such that the timing results were always 0. As such, our implementation is already practically usable for many applications.

It also appears that the new strategy scales well to much larger problems. The seemingly exponential growth in the search time seen toward the end of the graph in figure 5 is probably partly an artifact resulting from the large number of errors in the word and the fact that most of the words in the automaton we matched against were much shorter than the input words. That is, since almost every character in the words we tested that were of MED 9 or 10 was an error, there seems to be no way to avoid exploring a very large part of the entire search space. However, in preliminary tests against lexica that contain much longer words, this growth does not occur as quickly.


7.1 Comparison

The fastest previously known method (to us) for performing the same type of approximate search, that of Schulz and Mihov (2002), which we had used before and which was easily available in foma, shows much poorer performance, as seen in the graph in figure 5. Implementations may of course vary in practical details, and so an exact comparison is difficult. However, in comparing our algorithm to Schulz & Mihov we tried to minimize this disturbance in two ways: 1) Schulz & Mihov perform a preliminary step of constructing what they call a Levenshtein automaton for some edit distance n and the input word beforehand; we have not included this construction time in our figures; 2) Schulz & Mihov is essentially a simultaneous depth-first search of two graphs, the Levenshtein automaton and the lexicon automaton A, and this search is performed on the exact same data structure as in the new algorithm (the internal automaton data structure of foma). These two points, we believe, make the algorithm comparison more meaningful than it would be under other circumstances.

7.2 Complex distance metrics

All our experiments with approximate matching were done with Levenshtein edit distance. But the algorithm can easily be adapted to a variety of distance metrics. One of the more useful metrics is to calculate costs from a confusion matrix where each individual substitution, insertion, or deletion can have a different cost. For instance, for Spanish spelling correction, one would want to specify lower costs than ordinarily for substituting c for s, and b for v, because of their phonetic proximity or equality for speakers.

One can also concoct context-dependent cost schemes. Again, drawing on an example from Spanish, it may be beneficial to give a low cost to substituting l as y, but only if preceded by a deletion of an l. This would represent the replacement of ll with y, as these are phonetically similar for many speakers. For all such extensions, the only modification the basic algorithm needs is to keep track of the h-score estimate, given that the costs may differ for different operations of word perturbation.
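As a sketch of how such costs slot into the search, consider a substitution-cost lookup with a default; all specific costs and names here are invented for illustration, and only the cost lookups in the expansion step of the search would change:

    # Hypothetical confusion costs for Spanish: phonetically close pairs
    # are cheaper to substitute than arbitrary ones.
    CONFUSION = {("s", "c"): 0.5, ("c", "s"): 0.5,
                 ("v", "b"): 0.5, ("b", "v"): 0.5}

    def substcost(word_sym, automaton_sym):
        if word_sym == automaton_sym:
            return 0.0
        return CONFUSION.get((word_sym, automaton_sym), 1.0)

Note that with non-uniform costs, the heuristic h must charge each discrepant symbol no more than the cheapest operation that could repair it, or admissibility is lost.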
8 Conclusion

We have presented an algorithm for finding approximate matches of a word to paths in a finite automaton. The algorithm is a variant of the popular A* search algorithm where the main heuristic to guide the search is to record, for each state in the automaton, every symbol that can be encountered in paths extending from that state for n steps. Precalculating this for the special case where n = ∞ proved to be an interesting problem in its own right, which we solved by adapting Tarjan's (1972) algorithm for finding strongly connected components in a graph in linear time.

We expect the algorithm to be useful for a number of purposes, and have shown that it performs very efficiently in suggesting spelling corrections for misspelled words, even against very large dictionaries.

References

Aho, Alfred V., John E. Hopcroft, and Jeffrey D. Ullman. 1974. The Design and Analysis of Computer Algorithms. Addison-Wesley.

Beesley, Kenneth and Lauri Karttunen. 2003. Finite-State Morphology. CSLI, Stanford.

Hart, P. E., N. J. Nilsson, and B. Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107.

Hulden, Mans. 2009. Foma: a finite-state compiler and library. In EACL 2009 Proceedings, pages 29–32.

Oflazer, Kemal. 1996. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1):73–89.

Pearl, Judea. 1984. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.

Schulz, Klaus and Stoyan Mihov. 2002. Fast string correction with Levenshtein automata. International Journal on Document Analysis and Recognition, 5(1):67–85.

Tarjan, Robert E. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160.


Algorithm 2.1: MarkVinf(v)

(Initially, index ← 1 and v.index ← 0 for every state, so that index 0 marks an unvisited state.)

    v.index ← index
    v.low ← index
    index ← index + 1
    Push(v)
    for each edge (v → v′)
        do  if v′.index = 0                      (v′ not yet visited)
                then MarkVinf(v′)
                     v.low ← Min(v.low, v′.low)
            else if v′ is on stack
                then v.low ← Min(v.low, v′.index)
            v.syms ← v.syms ∪ v′.syms ∪ {edge.sym}
    if v.low = v.index                           (v is the root of an SCC)
        then repeat
                 v′ ← Pop()
                 v′.syms ← v.syms                (all states in the SCC share the root's symbols)
             until v′ = v
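For reference, a runnable Python rendering of the same procedure (recursive, like Tarjan's original; all names are ours, and this is a sketch rather than the paper's C implementation):

    def mark_vinf_all(edges):
        """Tarjan-style SCC pass that also marks each state with every
        symbol occurring on any path out of it (the n = infinity tables)."""
        states = set(edges) | {t for out in edges.values() for _, t in out}
        index = {q: 0 for q in states}       # 0 = not yet visited
        low, syms = {}, {q: set() for q in states}
        stack, onstack, counter = [], set(), [1]

        def visit(v):
            index[v] = low[v] = counter[0]; counter[0] += 1
            stack.append(v); onstack.add(v)
            for sym, w in edges.get(v, []):
                if index[w] == 0:
                    visit(w)
                    low[v] = min(low[v], low[w])
                elif w in onstack:
                    low[v] = min(low[v], index[w])
                syms[v] |= syms[w] | {sym}   # inherit w's symbols + edge symbol
            if low[v] == index[v]:           # v is the root of an SCC:
                while True:                  # the whole SCC shares v's symbols
                    w = stack.pop(); onstack.discard(w)
                    syms[w] = syms[v]
                    if w == v:
                        break

        for q in states:
            if index[q] == 0:
                visit(q)
        return syms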

Algorithm 2.2: MarkVfin(V,n)

    for each vhead ∈ V
        do  d ← 0
            LimitDFS(vhead)

    LimitDFS(v)
        d ← d + 1
        if d > n
            then d ← d − 1
                 Return
        for each edge (v → v′)
            do  vhead.syms ← vhead.syms ∪ {edge.sym}
                LimitDFS(v′)
        d ← d − 1

Algorithm 2.3: Search(word, A)

    agenda ← {(v = start state, pos = 0, g = 0, h = calculate h())}
    while agenda not empty
        do  (v, pos, cost) ← remove-cheapest(agenda)
            if v is final and pos is end of word
                then return Solution(node)
            for each edge (v → v′) with sym e
                do  if word(pos) = e
                        then Add(v′, pos + 1, cost)                             (match, x:x)
                        else Add(v′, pos + 1, cost + substcost, calculate h())  (substitution, x:y)
                    Add(v′, pos, cost + inscost, calculate h())                 (insertion, 0:x)
            Add(v, pos + 1, cost + delcost, calculate h())                      (deletion, x:0)
