How Hard Is Computing the Edit Distance?1
Total Page:16
File Type:pdf, Size:1020Kb
Information and Computation 165, 1–13 (2001) doi:10.1006/inco.2000.2914, available online at http://www.idealibrary.com on How Hard Is Computing the Edit Distance?1 Giovanni Pighizzini Dipartimento di Scienze dell’Informazione, Universita` degli Studi di Milano, via Comelico 39, 20135 Milan, Italy E-mail: [email protected] Received October 26, 1995 The notion of edit distance arises in very different fields such as self-correcting codes, parsing theory, speech recognition, and molecular biology. The edit distance between an input string and a language L is the minimum cost of a sequence of edit operations (substitution of a symbol in another incorrect symbol, insertion of an extraneous symbol, deletion of a symbol) needed to change the input string into a sentence of L. In this paper we study the complexity of computing the edit distance, discovering sharp boundaries between classes of languages for which this function can be efficiently evaluated and classes of languages for which it seems to be difficult to compute. Our main result is a parallel algorithm for computing the edit distance for the class of languages accepted by one-way nondeterministic auxiliary pushdown automata working in polynomial time, a class that strictly contains context–free languages. Moreover, we show that this algorithm can be extended in order to find a sentence of the language from which the input string has minimum distance. C 2001 Academic Press Key Words: formal languages; computational complexity; string correction; error correction; edit distance; dynamic programming. 1 INTRODUCTION Consider a language L 6 and consider receiving as input, strings that represent instances of L, but possibly containing errors (due, for example, to faulty data entry or to noise in the communication channel). We want to find the original sentences of L from which it is most likely the strings we have received were originated. In order to find such sentences, we adopt a minimum distance criterion; i.e., given an input string x, we are interested in determining a sentence y ∈ L from which x has minimal edit distance. The edit distance between two strings is the minimum cost of a sequence of edit operations (i.e., the substitution of a symbol by an incorrect symbol, the insertion of an extraneous symbol, and the deletion of a symbol) that transform one string into another. Thus, the edit distance of a given language L is the function which associates with any input string x the minimal distance between strings belonging to L and x. The notion of edit distance is due to Levenshtein [20]. However, it is not easy to exactly date the concept of string comparison. Among former works, we recall Damerau [12], Alberga [3], Morgan [23], Wagner [29]. Concerning applications of this notion, we briefly recall that edit distances were initially studied in code theory (Levenshtein [20] and Ullman [27]), parsing theory (Aho and Peterson [2], Lyon [21], and Pighizzini [25]), and speech recognition (Okuda et al. [24] and Ackroyd [1]). Interesting combinatorial problems on strings, as for example the computation of longest common subsequences (Apostolico and Guerra [7]), can be reduced to edit distances. Very recently they regained a certain popularity due to their connections with problems in molecular biology (Eppstein et al. [13] and Karp [18]). (For an extended overview of the history and applications of edit distances, we refer the reader to Kruskal [19].) In this paper, we are interested in establishing general results about the complexity of computing edit distances between languages and strings. In particular, we want to delineate the boundaries among classes of languages for which at least a strong evidence can be given on the hardness of edit distances and classes for which edit distances can be efficiently computed. The results we obtain are quite tight. More precisely, in Section 3, we show that P = PSPACE if and only if the edit distance of each 1 Partially supported by Ministero dell’Universit`ae della Ricerca Scientifica e Tecnologica, under the project “Modelli di calcolo innovativi: metodi sintattici e combinatori.” A preliminary version of this work appeared in Fundamentals of Computation Theory (FCT’95), Proceedings, Lecture Notes in Computer Science, Vol. 965, pp. 383–392, Springer-Verlag, Berlin/NewYork, 1995. 1 0890-5401/01 $35.00 Copyright C 2001 by Academic Press All rights of reproduction in any form reserved. 2 GIOVANNI PIGHIZZINI context-sensitive language is computable in polynomial time, and P = NP if and only if the edit distance of each language belonging to the class co-NTIME(log n) (a small subclass of AC0) is computable in polynomial time. On the other hand, if we restrict our attention to one-way devices, edit distance becomes feasible. In fact, our main result, proved in Section 4, states that the edit distance of languages accepted by one-way nondeterministic auxiliary pushdown automata (i.e., logspace bounded Turing machines augmented with a pushdown store) working in polynomial time can be efficiently computed. (This class properly contains the context-free languages.) In general people are not really interested in computing edit distances, but rather in finding a sentence with minimal edit distance from the input string (correction problem). In Section 3, we consider the correction problem and we state hardness results similar to those proved for the edit distance. On the other hand, in Section 5, we show that in the case of languages accepted by one-way nondeterministic auxiliary pushdown automata working in polynomial time, the correction problem can be solved in the same bounds given in Section 4 for computing the edit distance. This paper is a continuation of a stream of research whose aim is that of identifying properties other than the membership which are easily computable for certain classes of languages (see, e.g., Huynh [15, 16] and Allender et al. [4–6]). 2 PRELIMINARY DEFINITIONS AND RESULTS Given a finite alphabet 6, we denote by 6 the free monoid of strings over 6 and by the empty string. Given a string x ∈ 6 , we will denote by |x| its length and by xi the ith symbol of x, i = 1,...,n. Now, we briefly recall the notion of edit distance between two strings and, subsequently, that of edit distance between a language and a string. An edit operation is a pair (a, b) 6= (, ), where |a|, |b|1, and is usually denoted as a → b. A string y can be derived from a string x applying the edit operation → a → b, in symbols x =⇒a b y ,ifx =vawand y = vbw, for suitable strings v,w.Fora,b ∈6,a 6= b, the operations a → b, a → , and → a are called substitution, deletion, and insertion, respectively. Given an edit sequence S = (s1,...,sm), namely a sequence of edit operations, and two strings x, y ∈ 6 , we say that y can be derived from x applying the sequence S, in symbols x =⇒S y , if there exist strings si z0, z1,...,zm such that z0 = x, zm = y, and, for i = 1,...,m,zi1 =⇒ z i . With each edit operation we associate a nonnegative cost. More precisely, we consider a cost function where (a → b) is the cost of the edit operation a → b. Then, the cost of an edit sequence S can be defined as the sum of costs of operations in S. DEFINITION 2.1. Given two strings x, y ∈ 6 , the edit distance d(x, y) from x to y is the minimum of the costs of edit sequences which transform x into y. Given a string x ∈ 6 and a language L 6, the edit distance from L to x is the minimum among the costs of edit sequences which transform strings of L into x, i.e., d(L, x) = min{d(y, x) | y ∈ L}, with the convention that d(∅, x) =∞. For a fixed language L 6, the function d(L, x), in the variable x, is called the edit distance of L. We assume that, for a, b, c ∈ 6 ∪{},(a→b)=0 if and only if a = b, and (a → b) + (b → c) (a → c). Note that if is symmetric, then d is a metric on 6. EXAMPLE. Let 6 ={0,1},(a→b)=1 for a, b ∈ 6 ∪{},a 6= b, x = 0011, y = 010, and z = 100. String x can be transformed into y using the sequence S = (0 → 1, 1 → , 1 → 0). In fact: → → → x = 0011 =⇒0 1 0111 =⇒1 011 =⇒1 0 010 = y. The cost of S is 3. However, this is not optimal. In fact, it can be shown that d(x, y) = 2, considering the following sequence: → → x = 0011 =⇒0 011 =⇒1 0 010 = y. Consider now the language L ={0n1n |n 0}. It can be shown that d(L, 100) = 2. A string belonging to L from which 100 has minimal distance is 01. COMPUTING THE EDIT DISTANCE 3 The edit distance between two strings can be easily computed from edit distances between substrings. In particular, it is immediate to prove the following relationships: ∀u,v,w,z ∈6:d(uv,wz) d(u,w)+d(v, z); ∀x0, x00, y ∈ 6: d(x0x00, y) = min{d(x0, y0) + d(x00, y00) | y = y0 y00}. Now, we show two properties of edit distances useful to state our results. The first property gives an upper bound for the edit distance from a language L to a string x. The second property states an upper bound for the length of strings belonging to L which minimize the distance. LEMMA 2.1. Given a nonempty language L 6 and a cost function , there exist constants , , 0,0 such that d(L, x) |x|+and |y|0|x|+0 for each x ∈ 6 and for each y ∈ L such that d(L, x) = d(y, x).