Fast Approximate String Matching with Suffix Arrays and A*

Philipp Koehn
University of Edinburgh
10 Crichton Street
Edinburgh, EH8 9AB
Scotland, United Kingdom
[email protected]

Jean Senellart
Systran
La Grande Arche
1, Parvis de la Défense
92044 Paris, France
[email protected]

Abstract

We present a novel exact solution to the approximate string matching problem in the context of translation memories, where a text segment has to be matched against a large corpus, while allowing for errors. We use suffix arrays to detect exact n-gram matches, A* search heuristics to discard matches, and A* parsing to validate candidate segments. The method outperforms the canonical baseline by a factor of 100, with average lookup times of 4.3–247ms per segment in a realistic scenario.

1 Introduction

Approximate string matching is a pervasive problem in natural language processing applications, such as spelling correction, information retrieval, evaluation, and translation memory systems. Often we cannot find an exact match for a query string in a large corpus, but we are still interested in an approximate match that is similar according to some metric, typically string edit distance.

The problem is of great concern in genetic sequence retrieval. In fact, as we discuss below, most of the work on approximate string matching addresses this task.

We present a new method that was developed in the context of translation memory systems, which are used by human translators. When translating an input sentence (or segment), the human translator may be interested in the translation of a similar segment, which is stored in a translation memory. The term translation memory is roughly equivalent to a parallel corpus: a collection of segments and their translations. The approximate string matching problem for translation memories is to find the source language segment that is most similar to the input.

Note that translation memories are rarely used in recent machine translation research, since that work is driven by open domain translation tasks such as news translation. In the practical commercial world of human translation, however, translation tasks often involve material that is very similar to prior translations. Consider, for instance, the translation of a manual for an updated product.

This paper defines the problem of approximate string matching (Section 2) and reviews related work (Section 3). Our method uses a suffix array to find n-gram matches (Section 4), principles of A* search to filter the matches, and an A* parsing method to identify the most similar segment match (Section 5.3). It retrieves the most similar fuzzy match in an average time of 4.3–247 milliseconds (Section 6).

2 Approximate String Matching

Let us first formally define approximate string matching. We follow the definition by Navarro (2001). The problem involves:

• a finite alphabet Σ of size |Σ| = σ
• a corpus T ∈ Σ* of length |T| = n
• a pattern P ∈ Σ* of length |P| = m
• a distance function d: Σ* × Σ* → R
• a maximum error allowed k ∈ R

The problem is typically defined as: given T, P, d(), and k, retrieve all text positions j so that there exists an interval Ti..j with d(P, Ti..j) ≤ k.

In this paper, we modify the definition of the problem as follows:

Segments: The corpus T tiles into a sequence of segments {S1, ..., Ss} so that
• start(S1) = 1,
• end(Ss) = |T|, and
• ∀ 1 < i ≤ s: start(Si) = end(Si−1) + 1.

3 Related Work

Figure 1 illustrates the canonical dynamic programming solution to this problem, where the pattern may be matched against any of the segments of the corpus. The complexity of the algorithm is O(nm) — linear both with respect to the corpus size and the pattern size. Our method has sub-linear complexity with respect to corpus size.

           A  B  C  D  A  B  E
        0  1  2  3  4  5  6  7
     E  1  1  2  3  4  5  6  6
     A  2  1  2  3  4  4  5  6
     B  3  2  1  2  3  4  5  6
     E  4  3  2  2  3  4  5  5
     C  5  4  3  2  3  4  5  6
     D  6  5  4  3  2  3  4  5
     E  7  6  5  4  3  3  4  4

Figure 1: Canonical dynamic programming solution to the approximate string matching problem under a unit cost string edit distance. Each alignment point between the two strings ABCDABE and EABECDE is filled with the minimal cost and a backpointer to its prior alignment point.

Most of the work in this area deals with genetic sequence retrieval. Biologists search for genes or other DNA sequences consisting of the letters A, C, G, and T in the genetic code of a living being. Our problem differs from this application in various points:

• the lexicon is much larger: 10,000s or more distinct words found in a text vs. four bases
• the allowable error is larger: in translation memories, we may accept up to 30% error
• sequences are shorter, typically 10–50 words
• there are natural starting and end points, i.e., segment boundaries
• we search for the best matches, not all matches

The first two differences make our problem harder, while the last three make it easier, and we exploit them in our method.
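To make the baseline concrete, here is a minimal Python sketch of the dynamic programming recurrence of Figure 1; the function name and the use of lists of words are our illustrative choices, not the authors' implementation:

    # Unit-cost string edit distance between a pattern and one corpus
    # segment; fills the table of Figure 1 row by row in O(nm) time
    # and O(n) space.
    def edit_distance(pattern, segment):
        m, n = len(pattern), len(segment)
        prev = list(range(n + 1))  # cost row against the empty pattern prefix
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                substitute = 0 if pattern[i - 1] == segment[j - 1] else 1
                curr[j] = min(prev[j - 1] + substitute,  # match / substitute
                              prev[j] + 1,               # delete a pattern word
                              curr[j - 1] + 1)           # insert a segment word
            prev = curr
        return prev[n]

    edit_distance(list("ABCDABE"), list("EABECDE"))  # 4, as in Figure 1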

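The canonical baseline evaluated in Section 6 then amounts to applying this distance to every corpus segment and keeping the best matches within the allowed error; a sketch under the same assumptions (edit_distance as above):

    import math

    def baseline_search(pattern, segments, max_error=0.3):
        # Compare the pattern against every segment and keep all
        # segments that reach the best cost within k = ceil(0.3 * m).
        best_cost = math.ceil(max_error * len(pattern))
        best = []
        for segment in segments:
            cost = edit_distance(pattern, segment)
            if cost < best_cost:
                best, best_cost = [segment], cost
            elif cost == best_cost:
                best.append(segment)
        return best, best_cost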
4 Suffix Arrays

Our method uses n-gram matches between the input segment (the pattern) and the translation memory (the corpus) to identify potential candidate corpus segments.

We store the corpus in a suffix array to enable quick lookup. The suffix array is an index of the starting positions of all suffixes in the corpus, sorted alphabetically (see Figure 2). This allows us to use binary search to find a particular suffix in the corpus.

We sort the index using quick sort, which is O(n log n). Our implementation takes a few seconds even for corpora with tens of millions of words. We identify words with integer word ids, so we actually sort the index based on these word ids, not alphabetically. The suffix array takes up O(n) space, doubling the space requirements for storing the corpus, which is not a problem with modern computers and customary corpora. Searching the array for an n-gram of size g takes O(g log n) time.

Finding all occurrences of an n-gram in the corpus builds on the binary search method for finding an existing suffix in the corpus. Matching an n-gram requires that the g words of the n-gram match the first g words of the suffix. First, we attempt to find any match in the corpus (and terminate if we fail to do so). Then we perform two binary searches to find the first and the last occurrence of the n-gram. This gives a range of corpus positions where the n-gram occurs. This range is an interval in the index.

A pattern of m words contains O(m²) n-grams. Since the binary search for an n-gram of size g takes O(m log n) time (with g ≤ m), looking up all n-grams takes O(m³ log n) in the worst case. However, when looking up, say, the trigram starting at the beginning of the pattern, we can re-use the results of the previous search for the bigram starting at the beginning of the pattern: the range of matches for the trigram is a subset of the matches for the bigram. Since we cannot be sure that the ranges get smaller with longer n-grams, this does not reduce the worst case cost of O(m³ log n), but it does dramatically reduce the cost in the average case to O(m log n).
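As an illustration of the lookup just described, here is a minimal Python sketch of the suffix array and the two binary searches; this is our own formulation, and materializing corpus[i:] as a sort key is wasteful but keeps the sketch short:

    def build_suffix_array(corpus):
        # corpus: list of integer word ids; the index holds the starting
        # positions of all suffixes, sorted by comparing the suffixes.
        return sorted(range(len(corpus)), key=lambda i: corpus[i:])

    def ngram_occurrences(corpus, sa, ngram):
        # Two binary searches locate the first and the last suffix whose
        # first g words equal the n-gram; the result is an interval in
        # the index, i.e., all corpus positions where the n-gram occurs.
        g = len(ngram)
        def bound(strict):
            lo, hi = 0, len(sa)
            while lo < hi:
                mid = (lo + hi) // 2
                prefix = corpus[sa[mid]:sa[mid] + g]
                if prefix < ngram or (strict and prefix == ngram):
                    lo = mid + 1
                else:
                    hi = mid
            return lo
        return sa[bound(False):bound(True)]

Each lookup here is independent; the incremental re-use of the previous n-gram's range, as described above, would restrict lo and hi to that range instead of the whole index.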
5 A* Search

A* search uses a heuristic to estimate the cost of a partial solution to a search problem. A* search requires that the heuristic is admissible, i.e., it may under-estimate the cost of a partial solution, but never over-estimate it.

If we find a ceiling cost — for instance, by finding a (typically sub-optimal) complete solution to the problem, or by other means — A* search can safely discard all partial solutions that have a worse heuristic cost than the ceiling cost.

The strategy to employ A* search is two-fold: we would like to explore the most likely part of the search space (as indicated by the heuristic cost of partial solutions), and we would like to drive down the ceiling cost to safely discard much of the space.

In our method, we use A* search in two stages: first as a filtering method to discard n-gram matches and candidate segments, and then as a validation method to compare a pattern against a corpus segment. During both stages we are keenly interested in reducing the ceiling cost.

We initialize the ceiling cost to the maximum cost allowed when matching the pattern against the corpus. In a translation memory application, we are only interested in segments that match at least 70%, i.e., the maximum allowable string edit distance is k = ⌈0.3m⌉.

See Figure 3 for the pseudo code of the algorithm. The rest of this section describes it in detail.

5.1 Match Filtering

Recall that we use the suffix array to generate all n-gram matches between the pattern and the corpus. When we identify such an n-gram match, we record its
• start and end position in the pattern,
• start and end position in the corpus segment, and
• the corpus segment id.

We also compute two cost estimates: its minimum and its maximum cost. The minimum cost is the lowest possible cost for matching the pattern against the corpus segment in which the match was found, when everything else goes perfectly fine. The maximum cost assumes that this n-gram is the only match between the pattern and the corpus segment.

The minimum cost is composed of:
• the difference between the starting position of the n-gram in the pattern and the starting position of the n-gram in the corpus segment, or 1 if they are the same (except if both are at the segment start), and
• the difference between the number of words after the n-gram in the pattern and the number of words after the n-gram in the corpus segment, or 1 if they are the same (except if both are at the segment end).

The maximum cost is composed of:
• the maximum of the starting position of the n-gram in the pattern and the starting position of the n-gram in the corpus segment, and
• the maximum of the number of words after the n-gram in the pattern and the number of words after the n-gram in the corpus segment.

If a newly found match has a lower maximum cost than the ceiling cost, we can update the ceiling cost to this value. If a match has a higher minimum cost than the ceiling cost, we can safely discard it. Surviving matches are stored in a hash indexed by the corpus segment id.

Note that if we find a corpus segment that is identical to the pattern, it has a maximum cost of 0, invalidating all matches that are not also such full matches.

Also note that if we find, say, a trigram match, we will also find two underlying bigram and three underlying unigram matches. We remove such sub-matches from our set of matches, since they cannot be used with a lower cost than the trigram match.
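Both estimates reduce to a few comparisons of word counts; the following sketch mirrors lines 7–14 of FIND-MATCHES in Figure 3, with positions expressed as the number of words before (start) and after (remain) the n-gram (our formulation, not the authors' code):

    def match_cost_bounds(p_start, p_remain, c_start, c_remain):
        # p_*: words before/after the n-gram in the pattern;
        # c_*: words before/after the n-gram in the corpus segment.
        left_min = abs(c_start - p_start)
        if left_min == 0 and p_start > 0:
            left_min = 1    # same offset, but not at the segment start
        right_min = abs(c_remain - p_remain)
        if right_min == 0 and p_remain > 0:
            right_min = 1   # same remainder, but not at the segment end
        # Maximum cost: this n-gram is the only match, so everything
        # around it has to be edited.
        max_cost = max(c_start, p_start) + max(c_remain, p_remain)
        return left_min + right_min, max_cost

A match is discarded as soon as its minimum cost exceeds the current ceiling, and the ceiling drops to the maximum cost whenever that is lower.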

MAIN:
Input: pattern p = p_1..p_n
Output: best matching segments S
 1: ceiling-cost = ⌈0.3 × p.length⌉
 2: M = find-matches(p)
 3: S = find-segments(p, M)

FIND-MATCHES:
Input: pattern p = p_1..p_n
Output: matches M
 1: for start = 1 .. p.length do
 2:   for end = start .. p.length do
 3:     remain = p.length − end
 4:     M_start,end = find-in-suffix-array(p_start..p_end)
 5:     break if M_start,end == ∅
 6:     for all m ∈ M_start,end do
 7:       m.leftmin = |m.start − start|
 8:       m.leftmin = 1 if m.leftmin == 0 & start > 0
 9:       m.rightmin = |m.remain − remain|
10:       m.rightmin = 1 if m.rightmin == 0 & remain > 0
11:       min-cost = m.leftmin + m.rightmin
12:       break if min-cost > ceiling-cost
13:       m.leftmax = max(m.start, start)
14:       m.rightmax = max(m.remain, remain)
15:       m.pstart = start; m.pend = end
16:       M = M ∪ {m}
17:     end for
18:   end for
19: end for

FIND-IN-SUFFIX-ARRAY:
Input: string
Output: matches N
 1: first-match = find first occurrence of string in array
 2: last-match = find last occurrence of string in array
 3: for index i = first-match .. last-match do
 4:   m = new match()
 5:   m.start = i.segment-start
 6:   m.end = i.segment-end
 7:   m.length = i.segment.length
 8:   m.remain = m.length − m.end
 9:   m.segment-id = i.segment.id
10:   N = N ∪ {m}
11: end for

FIND-SEGMENTS:
Input: pattern p, matches M
Output: best matching segments S
 1: for all s : ∃ m ∈ M : m.segment-id = s.id do
 2:   a = new agenda-item()
 3:   a.M = {m ∈ M : m.segment-id = s.id}
 4:   a.sumlength = Σ_{m ∈ a.M} m.length
 5:   a.priority = −a.sumlength
 6:   a.s = s
 7:   A = A ∪ {a}
 8: end for
 9: while a = pop(A) do
10:   break if a.s.length − p.length > ceiling-cost
11:   break if max(a.s.length, p.length) − a.sumlength > ceiling-cost
12:   cost = parse-validate(a.s, a.M)
13:   if cost < ceiling-cost then
14:     ceiling-cost = cost
15:     S = ∅
16:   end if
17:   S = S ∪ {a.s} if cost == ceiling-cost
18: end while

PARSE-VALIDATE:
Input: string, matches M
Output: cost
 1: for all m1 ∈ M, m2 ∈ M do
 2:   A = A ∪ {a} if a = combinable(m1, m2)
 3: end for
 4: cost = min { m.leftmax + m.rightmax | m ∈ M }
 5: while a = pop(A) do
 6:   break if a.mincost > ceiling-cost
 7:   mm = new match()
 8:   mm.leftmin = a.m1.leftmin
 9:   mm.leftmax = a.m1.leftmax
10:   mm.rightmin = a.m2.rightmin
11:   mm.rightmax = a.m2.rightmax
12:   mm.start = a.m1.start; mm.end = a.m2.end
13:   mm.pstart = a.m1.pstart; mm.pend = a.m2.pend
14:   mm.internal = a.m1.internal + a.m2.internal + a.internal
15:   cost = min(cost, mm.leftmax + mm.rightmax + mm.internal)
16:   for all m ∈ M do
17:     A = A ∪ {a} if a = combinable(mm, m)
18:   end for
19: end while

COMBINABLE:
Input: matches m1, m2
Output: agenda item a
 1: return null unless m1.end < m2.start
 2: return null unless m1.pend < m2.pstart
 3: a.m1 = m1; a.m2 = m2
 4: delete = m2.start − m1.end − 1
 5: insert = m2.pstart − m1.pend − 1
 6: internal = max(insert, delete)
 7: a.internal = internal
 8: a.mincost = m1.leftmin + m2.rightmin + internal
 9: a.priority = a.mincost

Figure 3: Pseudo code of the algorithm

5.2 Length Filtering

The collection of a set of n-gram matches yields a set of candidate corpus segments. We perform an additional filtering step before full validation, i.e., before computing the string edit distance between each corpus segment and the pattern.

Note that a large number of corpus segments is already excluded implicitly by our n-gram match requirements: if the difference in length between a corpus segment and the pattern is larger than the ceiling cost, then none of its n-gram matches has a minimum cost that is lower than the ceiling cost.

We pose another requirement for a corpus segment: the larger of segment length and pattern length, minus the sum of the lengths of all its n-gram matches, has to be smaller than the ceiling cost. In other words, in addition to the cost due to length differences between the pattern and the corpus segment, the remainder has to be made up by sufficient n-gram matches to not exceed the ceiling cost (optimistically, all of them are used).
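The two length tests amount to two comparisons per candidate segment, as in lines 10–11 of FIND-SEGMENTS; a sketch with our own naming:

    def passes_length_filter(seg_length, pat_length, sum_match_length,
                             ceiling_cost):
        # Test 1: the length difference alone already incurs this cost.
        if abs(seg_length - pat_length) > ceiling_cost:
            return False
        # Test 2: even if every n-gram match is used optimally, the
        # words covered by no match must still fit under the ceiling.
        if max(seg_length, pat_length) - sum_match_length > ceiling_cost:
            return False
        return True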

5.3 Validation with A* Parsing

At this point, we have a set of candidate corpus segments. We could simply compute the string edit distance to validate them and to find the segments with the lowest cost.

However, we also have all n-gram matches for each candidate corpus segment, so we can use a more efficient method for validation. This method draws from ideas in A* parsing: we combine the n-gram matches to build a binary tree that matches the pattern against the corpus segment.

See Figure 4 for an illustration of the process. We first arrange the matches in a chart covering the corpus segment, as is commonly done in chart parsing. For each match, we record the minimum and maximum cost. The illustration also shows how these costs are composed of left (with respect to segment start) and right (with respect to segment end) matching costs.

Of all the matches in the example, the lowest maximum cost is 6, which hence constitutes the ceiling cost. Note that we may have external information that gives us a lower ceiling cost. Due to the ceiling cost of 6, we can safely discard the match of the first word in the candidate against the last word of the pattern (this match has a cost of 13).

We then proceed to pairwise combine matches into multi-matches. In the example, we first combine the match that covers candidate segment positions 2–3 with the match covering positions 5–6. The resulting multi-match inherits the left cost estimate from the first match and the right cost estimate from the second match. It also incurs an internal cost of 1 due to the insertion of word E (position 4) in the candidate.

Overall, it has a maximum cost estimate of 5. Since the maximum cost estimate of the multi-match is lower than the ceiling cost, we update the ceiling cost. This also triggers the removal of two matches from the chart: the match covering candidate word positions 2–4 and the match covering position 4 each have a minimum cost of 6, and hence cannot possibly be used in an optimal edit path.

In the example, the final step is the composition of the multi-match covering positions 2–7, which has a minimum and maximum cost of 4. No other combination of (multi-)matches has a lower minimum cost, so we have found an optimal edit path and its string edit distance.

Figure 4: A* parsing to match a translation memory segment (candidate, EABECDE) to the input segment (pattern, ABCDABE): for each n-gram match, the minimum and maximum cost is recorded (left min–max, internal min–max, right min–max). New matches are generated by pairwise combination. The ceiling cost is updated to the lowest maximum cost, which allows the removal of matches with higher minimum cost (indicated by grey boxes).

In the description of the example above, we glossed over the exact strategy of combining matches.
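The pairwise combination step corresponds to the COMBINABLE procedure of Figure 3; here is a self-contained Python sketch of one such combination, where the Match class and all names are ours, not the authors' implementation:

    from dataclasses import dataclass

    @dataclass
    class Match:
        start: int; end: int          # span in the corpus segment
        pstart: int; pend: int        # span in the pattern
        leftmin: int; leftmax: int    # cost bounds left of the match
        rightmin: int; rightmax: int  # cost bounds right of the match
        internal: int = 0             # cost inside a multi-match

    def combine(m1, m2):
        # m1 must precede m2 both in the segment and in the pattern.
        if not (m1.end < m2.start and m1.pend < m2.pstart):
            return None
        deleted = m2.start - m1.end - 1     # segment words in the gap
        inserted = m2.pstart - m1.pend - 1  # pattern words in the gap
        internal = m1.internal + m2.internal + max(inserted, deleted)
        return Match(m1.start, m2.end, m1.pstart, m2.pend,
                     m1.leftmin, m1.leftmax, m2.rightmin, m2.rightmax,
                     internal)

The multi-match's minimum cost is then leftmin + rightmin + internal and its maximum cost leftmax + rightmax + internal; when the maximum drops below the ceiling, the ceiling is updated and dominated matches are removed, exactly as in the worked example.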
There are many possible strategies for such an A* parsing method. We implemented a simple bottom-up parser that incrementally adds valid multi-matches composed of an increasing number of matches. The parser terminates when no new surviving matches are generated in an iteration (corresponding to a certain number of matches). The canonical A* parsing strategy would be best-first parsing with a priority queue.

The theoretical cost of the described A* parsing method is worse than that of the standard dynamic programming method, but in practice it is faster, except for degenerate cases. The fewer initial matches are found, the higher the speed-up. For more on this, refer to the experimental section.

5.4 Refinement

The number of unigram matches between a pattern and the corpus is very large. Consider that in a text corpus most segments contain a period. This results in costly generation of matches, and a large number of candidate segments before filtering.

However, given a maximum error of, say, 30%, a candidate segment must have a matching n-gram larger than a unigram, except for very short patterns. Hence, we do not need to generate unigram matches for length filtering.

We gain significant speed-ups by postponing the generation of unigram matches until after length filtering, when a much smaller number of segments are left to be considered. By hashing the unigrams in the pattern, we loop through all words in a candidate segment to detect unigram matches, and add them to the set of matches (unless they are subsumed by larger matches). While this implies a cost that is linear in the total length of all surviving candidate segments, it is still much cheaper than generating all unigram matches earlier on.

Note that when using lower maximum error rates, or when lower ceiling costs are detected, we may also postpone the generation of other small n-gram matches (bigrams, trigrams).
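A sketch of the postponed unigram generation (function name ours): the pattern's words are hashed once, and each candidate segment that survives length filtering is scanned in a single linear pass.

    def late_unigram_matches(pattern, segment):
        # Hash the pattern's unigrams: word id -> pattern positions.
        positions = {}
        for j, word in enumerate(pattern):
            positions.setdefault(word, []).append(j)
        # One linear pass over the surviving candidate segment.
        matches = []
        for i, word in enumerate(segment):
            for j in positions.get(word, ()):
                matches.append((j, i))  # (pattern position, segment position)
        return matches

Unigram matches subsumed by larger n-gram matches would still have to be dropped afterwards, as described above.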

6 Experiments

We carried out experiments using two English data sets: the publicly available JRC-Acquis corpus(1) (Acquis) and a commercial product manual corpus (Product). We use the same test set as Koehn et al. (2009). See Table 1 for basic corpus statistics.

(1) http://wt.jrc.it/lt/Acquis/ (Steinberger et al., 2006)

             Acquis                  Product
             Corpus      Test        Corpus      Test
 segments    1,169,695   4,107       83,461      2,000
 words       23,566,078  128,005     1,038,762   24,643
 words/seg.  20          31          12          12

Table 1: Statistics of the corpora used in the experiments

The Acquis corpus is a collection of laws and regulations that apply to all member countries of the European Union. It has more repetitive content than the parallel corpora that are more commonly used in machine translation research. Still, the commercial Product corpus is more representative of the type of data used in translation memory systems. It is much smaller (around a million words), with shorter segments (on average 12 words per segment).

See Table 2 for a quantitative analysis of the ratio of approximate matches in the corpus. For the Acquis corpus the ratio of matches ranges from 45% to 60%, and for the Product corpus from 47% to 73%, when varying the error threshold for acceptable segments from 10% to 40%.

 Max.   Acquis             Product
 error  matches  speed     matches  speed
 40%    60%      462ms     73%      5.1ms
 30%    54%      247ms     66%      4.3ms
 20%    50%      163ms     58%      3.8ms
 10%    45%      85ms      47%      3.2ms

Table 2: Number of matches and speed of lookup

Timing experiments were run on a machine with a 64-bit Intel Xeon E5430 2.66GHz CPU. Speed is measured as average time per segment lookup. The speed of our method varies with the maximum error threshold — in the given range roughly by a factor of 5 for Acquis and 2 for Product. With a lower maximum error, more segments can be discarded early in the process.

The main results are shown in Table 3. When using a 30% threshold for accepting approximate matches, our method takes 247 milliseconds per input segment for the Acquis corpus, and 4.3ms for the Product corpus. This compares very favorably against the canonical baseline method, the well-known dynamic programming solution to the string edit distance problem (33,776ms and 371.2ms), even if we filter out segments that disqualify due to length differences (2,965ms and 77.1ms).

 Method                  Acquis     Product
 Baseline                33,776ms   371.2ms
  with length filtering  2,965ms    77.1ms
 Our method              247ms      4.3ms
  w/o refinement         2,831ms    40.3ms
  w/o length filtering   4,414ms    46.2ms
  w/o A* parsing         6,079ms    113.0ms

Table 3: Main results: performance of the method and contribution of its steps

The table also gives an indication of the contribution of the individual steps of the method. The refinement of delaying the generation of unigram matches (see Section 5.4) is responsible for reducing the time needed by roughly a factor of ten — without it, the time cost increases to 2,831ms and 40.3ms, respectively. If we leave out length filtering (see Section 5.2), time per segment increases to 4,414ms and 46.2ms. Finally, if we carry out the comparison of candidate segments against the input segment without A* parsing, time increases further to 6,079ms and 113.0ms.

Another way to inspect the time requirements of the method is to look at the time spent in each stage. See Table 4 for details. For the Product corpus, most time is spent on range finding and filtering, while for the Acquis corpus match generation dominates.

 Step              Acquis    Product
 range finding     29.7ms    2.06ms
 match generation  119.0ms   0.69ms
 filtering         94.4ms    1.49ms
 validation        3.7ms     0.01ms

Table 4: Time per segment for each step of the method

The validation step takes very little time in both cases. The A* parsing method is twice as fast as the canonical dynamic programming method (taking about 3.7ms and 0.016ms per input segment for the two corpora), but this is clearly not the bottleneck.

Note that there is also the stage of creating the suffix array, which takes about 40 seconds for the Acquis corpus and 2 seconds for the Product corpus. Since this is a one-time pre-processing cost, which can be incurred offline, we did not consider it in this analysis.

It is worth pointing out that for the Product corpus, one degenerate case out of the 2,000 input segments is responsible for a quarter of the average cost: it takes over 1,000ms to complete. The segment consists of a sequence of 120 periods, which are treated as tokens; this generates a very large number of n-gram matches, which in turn requires a large amount of match filtering. Such outliers are a concern, but they can be easily detected and properly addressed.

Table 5 gives the average number of segments considered by the method. The number of segments with n-gram matches is 24,258.3 for the Acquis and 580.1 for the Product corpus. Segment filtering reduces these numbers to 77.8 and 6.8. Finally, 25.9 and 3.3 segments on average are considered best matches, since they all have the optimal edit cost.
 Segments                Acquis     Product
 with matches            24,258.3   580.1
 after length filtering  77.8       6.8
 best                    25.9       3.3

Table 5: Number of corpus segments considered at different steps

In our implementation we use string edit distance based on letter matches as the tie-breaker, which adds negligible computational cost.

7 Conclusion and Future Work

We presented a novel method for approximate string matching and applied it to the translation memory search problem. The method outperforms the canonical dynamic programming baseline by a factor of 100 in our experiments. It is an exact solution — it always finds the optimal match.

The method involves n-gram lookup in a suffix array over the corpus, match and length filtering based on A* search principles, and an A* parsing method for validation.

The largest gains in performance are due to the filtering techniques, which leave very few segments for validation. The method may be improved by integrating the A* principle of maintaining and driving down the ceiling cost more tightly. For instance, if large n-gram matches with low maximum error are found, this may eliminate the need to look up smaller n-grams. Or, if we compute the actual scores for the most promising segments early on (for instance, when the longest matching n-gram is found), we may find a lower ceiling, leading to a reduction in n-gram lookups and finer match filtering.

While the method is very fast for our application (4.3–247ms) and further improvements are not an urgent matter, we would like to extend the method to similar approximate string matching problems, for instance using edit distances that allow moves, or matching against word graphs. Such problems arise in word alignment, machine translation evaluation, and interactive machine translation.

Acknowledgement

This work was partly supported by the EuroMatrixPlus project funded by the European Commission (7th Framework Programme).

References

Callison-Burch, C., Bannard, C., and Schroeder, J. (2005). Scaling phrase-based statistical machine translation to larger corpora and longer phrases. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 255–262, Ann Arbor, Michigan. Association for Computational Linguistics.

Kärkkäinen, J. and Na, J. C. (2007). Faster filters for approximate string matching. In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 84–90.

Koehn, P., Birch, A., and Steinberger, R. (2009). 462 machine translation systems for Europe. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII). International Association for Machine Translation.

Lopez, A. (2007). Hierarchical phrase-based translation with suffix arrays. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 976–985.

Manber, U. and Myers, G. (1990). Suffix arrays: A new method for on-line string searches. In First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319–327.

Mandreoli, F., Martoglia, R., and Tiberio, P. (2002). Searching similar (sub)sentences for example-based machine translation. In Italian Symposium on Advanced Database Systems (SEBD).

McNamee, P. and Mayfield, J. (2006). Translation of multiword expressions using parallel suffix arrays. In 5th Conference of the Association for Machine Translation in the Americas (AMTA), Boston, Massachusetts.

Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., and Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In LREC.

Zhang, Y. and Vogel, S. (2005). An efficient phrase-to-phrase alignment model for arbitrarily long phrases and large corpora. In Proceedings of the 10th Conference of the European Association for Machine Translation (EAMT), Budapest.