Pattern Selector Grammars and Several Parsing Algorithms in the Context-Free Style
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 30, 249-273 (1985)

J. GONCZAROWSKI AND E. SHAMIR*

Institute of Mathematics and Computer Science, The Hebrew University of Jerusalem, Jerusalem 91904, Israel

Received March 30, 1982; accepted March 5, 1985

Pattern selector grammars are defined in general. We concentrate on the study of special grammars, the pattern selectors of which contain precisely k "ones" ($0^*(10^*)^k$) or k adjacent "ones" ($0^*1^k0^*$). This means that precisely k symbols (resp. k adjacent symbols) in each sentential form are rewritten. The main results concern parsing algorithms and the complexity of the membership problem. We first obtain a polynomial bound on the shortest derivation and hence an NP time bound for parsing. In the case k = 2, we generalize the well-known context-free dynamic programming type algorithms, which run in polynomial time. It is shown that the generated languages, for k = 2, are log-space reducible to the context-free languages. The membership problem is thus solvable in $\log^2$ space. © 1985 Academic Press, Inc.

1. INTRODUCTION

The parsing and membership testing algorithms for context-free (CF) grammars occupy a peculiar border position in the complexity hierarchy. The dynamic programming algorithm runs in $O(n^3)$ steps [21], but more refined methods reduce the problem to matrix multiplication with $O(n^{2+a})$ run-time, where $a < 1$ [20]. Earley's algorithm runs in time $O(n^2)$ for grammars with bounded ambiguity [4]. This is too much for compiler applications, so restricted grammars such as LL(k), with linear complexity, are used for the syntactic analysis of programming languages. On the other hand, even modest attempts to extend the model of context-free grammars run the risk of escalating the parsing complexity to the NP-hard zone.
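For orientation, the cubic dynamic programming bound quoted above for context-free membership corresponds to the familiar table-filling scheme usually called CYK. The following is a minimal sketch, not taken from the paper, assuming a grammar in Chomsky normal form; the toy grammar for $\{a^n b^n : n \ge 1\}$ and the names `UNARY`, `BINARY`, and `cyk` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical grammar in Chomsky normal form (illustration only):
#   S -> A B | A T,   T -> S B,   A -> a,   B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}


def cyk(word, start="S"):
    """Return True if `word` is derivable from `start` in the grammar above."""
    n = len(word)
    if n == 0:
        return False
    # table[i][l] holds the nonterminals that derive word[i:i+l].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][1] = set(UNARY.get(ch, ()))
    # Three nested loops over ranges of size O(n): cubic time for a fixed grammar.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for pair in product(left, right):
                    table[i][length] |= BINARY.get(pair, set())
    return start in table[0][n]
```

Each table entry depends only on entries for shorter substrings, which is the property the paper's algorithms for k = 2 have to recover once two symbols are rewritten in the same step.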
In the popular model of EOL systems (we use [15] as a standard reference), the parsing can still be done in $O(n^4)$ (actually, $O(n^{3+a})$) runtime [14]. EDTOL membership is still in nondeterministic log space, and thus in P [9]. For ETOL systems, however, it is NP-complete; the reduction proving this was first given in [2] and later in [12]. For an overview of the time and space complexity of the membership testing problem for various L families, see [15] and [10]. In particular, the tape complexity for CF and EOL is $O(\log^2 n)$. Moreover, the parsing problem for EOL, as well as for various other language families, was shown to be log-space reducible to the corresponding problem for CF [lS, 19].

* This work was partially supported by the U.S.-Israel Binational Science Foundation, Grant 3432183.

In the present article, we study extensions in the spirit of EOL systems. In the successive sentential forms that constitute a derivation, one insists on parallel (or synchronized) application of productions to all the symbols in the EOL case, or to selected subsequences in our case here. A subsequence to be rewritten is specified by assigning "1" to symbols in it and "0" to symbols outside. The entire sequence of 0's and 1's is called a pattern or a mask. A grammar is now defined by context-free-like productions plus a language P of patterns over $\{0, 1\}$, which is called the pattern selector. For context-free grammars, the pattern selector is $0^*10^*$ or its star closure $\{0, 1\}^*$. For EOL systems, it is $1^*$.

Even for very simple regular pattern selectors, one can obtain families (models) which are very hard to classify (i.e., to say what their generative power is in comparison to EOL or ETOL). On the other hand, regular pattern grammars are extremely powerful.
In [11] it was, for instance, shown that there is a regular pattern selector grammar that generates all context-sensitive (and thus all RE) languages through weak identity. Our emphasis in this article is, however, not on generative power, but on parsing algorithms. We concentrate on $P = 0^*1^k0^*$, i.e., rewriting of k adjacent symbols at every step, on $P = 0^*(10^*)^k$, i.e., rewriting any k symbols, and on their star closures. In both cases, we can obtain all the context-free languages, for any $k \ge 1$. It is easy to see that, in the $0^*(10^*)^k$ case, one can generate non-context-free languages (cf. Example 2.1), for all $k \ge 2$. It was recently shown [3] that there are non-context-free (in fact, also non-EOL) languages that can be generated by rewriting k = 2 adjacent symbols together. However, nothing is known about a hierarchy in k.

For the parsing problems, we find polynomial time (or $\log^2 n$ space) algorithms for the families with pattern selectors $0^*110^*$ and $0^*10^*10^*$. These algorithms are nontrivial extensions of the classical algorithms for context-free grammars. We shall also see that the language families with pattern selectors $0^*1^k0^*$ and $0^*(10^*)^k$ are parsable in nondeterministic polynomial time (NP) and thus form proper subfamilies of the context-sensitive languages. Further results about the complexity of these problems, in particular, a polynomial time parsing algorithm for the pattern selector $0^*(10^*)^k$, were found in [6] during the revision process of this paper.

The plan of this paper is as follows: The formal definition of pattern selector grammars is given in Section 2. In Section 3 we give combinatorial results that bound the length of the shortest derivations for a word w by a linear function of its length. Bounds of this kind are essential in analyzing the complexity of parsing, as they limit the height of the derivation trees that have to be considered. Bounds of this nature, with different techniques and difficulties, were also used, e.g., in [1, 13, 19, 4].
In Section 4, we present dynamic programming algorithms for the languages defined with the pattern selectors $0^*110^*$ and $0^*10^*10^*$. In the last section, we show that the membership problem for these languages is log-space reducible to context-free membership. Its space complexity is thus $\log^2 n$. Our results are summarized in table form at the end of this paper. We thank the referee for suggesting this, and for many other useful remarks incorporated in the revised version.

2. BASIC CONCEPTS AND DEFINITIONS

We assume the reader to be familiar with formal language theory as, e.g., in the scope of [16]. An overview of L system theory is given in [15]. Some notations need, perhaps, an additional explanation.

For a word w, $|w|$ denotes its length. $\lambda$ denotes the empty word. For a finite set X, #X denotes the cardinality of X. We shall usually identify a singleton set with its element. Alphabets are finite sets of symbols.

Let $L_1$ and $L_2$ be languages. Then $L_1$ and $L_2$ are considered equal if $L_1 \cup \{\lambda\} = L_2 \cup \{\lambda\}$. Let G be a rewriting system. Then L(G) denotes the language of G. Two rewriting systems are equivalent if the languages they generate are equal.

Let $\Sigma$ and $\Phi$ be alphabets. We denote the family of total finite homomorphisms from $\Sigma^*$ into $\Phi^*$ by HOM($\Sigma$, $\Phi$).

In context-free grammars only non-terminal symbols can be rewritten. Very often it is convenient to permit the rewriting of terminal symbols as well. Thus we arrive at EOS systems (see, e.g., [11, 5]).

DEFINITION 2.1. An EOS system F is a quadruple $(\Sigma, h, S, \Delta)$, where $\Sigma$ is the alphabet of F, h is a total substitution from $\Sigma^*$ into (nonempty subsets of) $\Sigma^*$, called the substitution of F, with $h(a) \neq \emptyset$ for all $a \in \Sigma$, $S \in \Sigma - \Delta$ is the start symbol of F, and $\Delta \subseteq \Sigma$ is the set of terminal symbols of F.
As customary, if $a \in \Sigma$ and $w \in h(a)$ then (a, w) is called a production in F. Prod(F) denotes the set of all productions in F, Pprod(F) = Prod(F) $\cap$ $\{(a, b) : a \in \Sigma - S \text{ and } b \in h(a)\}$ (see the remark below), and Maxr(F) = $\max\{|w| : (a, w) \in \mathrm{Prod}(F)\}$. Whenever an EOS system is propagating (i.e., it does not contain productions of the form $(a, \lambda)$, called erasing productions), we call it an EPOS system.

Remark. (1) Throughout this paper, we will assume that the start symbol of an EOS system does not occur in any right-hand side of a production rule. We have introduced the set Pprod(F) to allow us to refer only to the "proper productions" of F, i.e., all productions but those that have S as their left-hand side.

(2) Note that, unlike in context-free grammars, it is required that the substitution of an EOS system is a total mapping. However, a finite substitution h' on $\Sigma^*$ that is not total can be "completed" to a total finite substitution h as follows. Let f be a "new" non-terminal symbol, called the failure symbol, for which h(f) = f. Then, we let h(a) = f for all those symbols a for which h' is not defined.

The traditional way of defining derivations from production systems is rewriting a single symbol, or alternatively, all symbols in parallel, in each step.
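As an informal illustration of the masks described in the Introduction, a single rewriting step under a pattern selector can be sketched as follows. This is a minimal sketch and not the paper's formal definition: the production table, the function `step`, and the two regular expressions are hypothetical names, the productions are a toy example (not Example 2.1 of the paper), and the handling of the start symbol is left out entirely.

```python
import re

# Hypothetical productions: symbol -> set of right-hand sides (illustration only).
PRODUCTIONS = {
    "A": {"aAb", "ab"},
    "C": {"cC", "c"},
}

# Pattern selectors for k = 2, written as regular expressions over {0, 1}.
TWO_SCATTERED = r"0*(10*){2}"   # any two symbols are rewritten
TWO_ADJACENT = r"0*1{2}0*"      # two adjacent symbols are rewritten


def step(form, mask, choices, selector):
    """Rewrite exactly the symbols of `form` that are marked '1' in `mask`.

    `choices` lists one right-hand side per '1', from left to right.
    `selector` is the regular expression describing the admissible masks.
    """
    if len(mask) != len(form) or not re.fullmatch(selector, mask):
        raise ValueError("mask is not admitted by the pattern selector")
    if mask.count("1") != len(choices):
        raise ValueError("need one right-hand side per selected symbol")
    picked = iter(choices)
    out = []
    for symbol, bit in zip(form, mask):
        if bit == "0":
            out.append(symbol)  # unselected symbols are copied unchanged
            continue
        rhs = next(picked)
        if rhs not in PRODUCTIONS.get(symbol, set()):
            raise ValueError(f"no production ({symbol!r}, {rhs!r})")
        out.append(rhs)
    return "".join(out)


# Starting from the (assumed) sentential form "AC", rewrite A and C together.
form = "AC"
form = step(form, "11", ["aAb", "cC"], TWO_SCATTERED)
form = step(form, "01001", ["aAb", "cC"], TWO_SCATTERED)
form = step(form, "00100001", ["ab", "c"], TWO_SCATTERED)
```

Because A and C are always rewritten in the same step, the numbers of a's, b's, and c's grow together, which suggests how a selector with two scattered ones can reach words outside the context-free languages. The second mask, `01001`, is rejected under `TWO_ADJACENT`, since its two ones are not neighbours.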