Multiple Context-Free Tree Grammars: Lexicalization and Characterization

Joost Engelfriet (a), Andreas Maletti (b), Sebastian Maneth (c)

(a) LIACS, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands
(b) Institute of Computer Science, Universität Leipzig, P.O. Box 100 920, 04009 Leipzig, Germany
(c) Department of Mathematics and Informatics, Universität Bremen, P.O. Box 330 440, 28334 Bremen, Germany

arXiv:1707.03457v1 [cs.FL] 11 Jul 2017
Preprint submitted to [To be determined] July 13, 2017

Abstract

Multiple (simple) context-free tree grammars are investigated, where “simple” means “linear and nondeleting”. Every multiple context-free tree grammar that is finitely ambiguous can be lexicalized; i.e., it can be transformed into an equivalent one (generating the same tree language) in which each rule of the grammar contains a lexical symbol. Due to this transformation, the rank of the nonterminals increases at most by 1, and the multiplicity (or fan-out) of the grammar increases at most by the maximal rank of the lexical symbols; in particular, the multiplicity does not increase when all lexical symbols have rank 0. Multiple context-free tree grammars have the same tree generating power as multi-component tree adjoining grammars (provided the latter can use a root-marker). Moreover, every multi-component tree adjoining grammar that is finitely ambiguous can be lexicalized. Multiple context-free tree grammars have the same string generating power as multiple context-free (string) grammars and polynomial time parsing algorithms. A tree language can be generated by a multiple context-free tree grammar if and only if it is the image of a regular tree language under a deterministic finite-copying macro tree transducer. Multiple context-free tree grammars can be used as a synchronous translation device.

Contents

1 Introduction
2 Preliminaries
  2.1 Sequences and strings
  2.2 Trees and forests
  2.3 Substitution
3 Multiple context-free tree grammars
  3.1 Syntax and least fixed point semantics
  3.2 Derivation trees
  3.3 Derivations
4 Normal forms
  4.1 Basic normal forms
  4.2 Lexical normal forms
5 Lexicalization
6 MCFTG and MC-TAG
  6.1 Footed MCFTGs
  6.2 MC-TAL almost equals MCFT
  6.3 Monadic MCFTGs
7 Multiple context-free grammars
  7.1 String generating power of MCFTGs
  7.2 Parsing of MCFTGs
8 Characterization
9 Translation
10 Parallel and general MCFTG
11 Conclusion

1. Introduction

Multiple context-free (string) grammars (MCFG) were introduced in [87] and, independently, in [92] where they are called (string-based) linear context-free rewriting systems (LCFRS). They are of interest to computational linguists because they can model cross-serial dependencies, whereas they can still be parsed in polynomial time and generate semi-linear languages. Multiple context-free tree grammars were introduced in [57], in the sense that it is suggested in [57, Section 5] that they are the hyperedge-replacement context-free graph grammars in tree generating normal form, as defined in [27]. Such graph grammars generate the same string languages as MCFGs [21, 94]. It is shown in [57] that they generate the same tree languages as second-order abstract categorial grammars (2ACG), generalizing the fact that MCFGs generate the same string languages as 2ACGs [82].
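As a rough illustration of how an MCFG can express cross-serial dependencies, the following minimal Python sketch enumerates part of the copy language {ww | w in {a, b}*} using a nonterminal that derives pairs of strings. The grammar, the names `pairs` and `language`, and the bounded enumeration are assumptions made for this example; they are not taken from [87] or [92].

```python
# Hypothetical MCFG of multiplicity (fan-out) 2 for the copy language:
#   A -> (eps, eps) | (a x1, a x2) | (b x1, b x2)   where (x1, x2) comes from A
#   S -> x1 x2                                       where (x1, x2) comes from A
# The two components of A grow in lockstep, which is what links the i-th
# symbol of the first half to the i-th symbol of the second half.

def pairs(depth):
    """Pairs derivable from A using at most `depth` recursive rule applications."""
    result = {("", "")}                      # rule A -> (eps, eps)
    for _ in range(depth):
        step = set()
        for x1, x2 in result:
            for symbol in "ab":              # rules A -> (c x1, c x2), c in {a, b}
                step.add((symbol + x1, symbol + x2))
        result |= step
    return result


def language(depth):
    """Strings derivable from S, i.e. the concatenation x1 x2 of each pair."""
    return {x1 + x2 for x1, x2 in pairs(depth)}
```

With `depth = 2` the sketch yields exactly the strings ww with w of length at most 2, for instance `abab` but not `abba`.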
It is also observed in [57] that the set-local multi-component tree adjoining grammar (MC-TAG, see [53, 93]), well-known to computational linguists, is roughly the monadic restriction of the multiple context-free tree grammar, just as the tree adjoining grammar (TAG, see [49, 51]) is roughly the monadic restriction of the (linear and nondeleting) context-free tree grammar, see [37, 61, 71]. We note that the multiple context-free tree grammar could also be called the tree-based LCFRS; such tree grammars were implicitly envisioned already in [92].

In this paper we define the multiple context-free tree grammars (MCFTG) in terms of familiar concepts from tree language theory (see, e.g., [41, 42]), and we base our proofs on elementary properties of trees and tree homomorphisms. Thus, we do not use other formalisms such as graph grammars, λ-calculus, or logic programs. Since the relationship between MCFTGs and the above type of graph grammars is quite straightforward, it follows from the results of [27] that the tree languages generated by MCFTGs can be characterized as the images of the regular tree languages under deterministic finite-copying macro tree transducers (see [26, 34, 39]). However, since no full version of [27] ever appeared in a journal, we present that characterization here (Theorem 76). It generalizes the well-known fact that the string languages generated by MCFGs can be characterized as the yields of the images of the regular tree languages under deterministic finite-copying top-down tree transducers, cf. [94]. These two characterizations imply (by a result from [26]) that the MCFTGs have the same string generating power as MCFGs, through the yields of their tree languages. We also give a direct proof of this fact (Corollary 70), and show how it leads to polynomial time parsing algorithms for MCFTGs (Theorem 72). All trees that have a given string as yield can be viewed as “syntactic trees” of that string.
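To make the notion of yield concrete, here is a small Python sketch under simplifying assumptions: a tree is a hypothetical `(label, children)` pair, every leaf label is a symbol of the string alphabet, and the yield is the left-to-right concatenation of the leaf labels. The paper's formal definition (Section 2) may treat some leaf symbols specially; this sketch does not.

```python
def tree_yield(tree):
    """Concatenate the leaf labels of `tree` from left to right."""
    label, children = tree
    if not children:                 # a leaf contributes its own label
        return label
    return "".join(tree_yield(child) for child in children)


# Two different (made-up) trees with the same yield "ab"; in the sense used
# above, both are syntactic trees of the string "ab".
flat = ("S", [("a", []), ("b", [])])
nested = ("S", [("X", [("a", [])]), ("b", [])])
```

Here `tree_yield(flat)` and `tree_yield(nested)` both return `"ab"`, although the two trees differ in shape.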
A parsing algorithm computes, for a given string, one syntactic tree (or all syntactic trees) of that string in the tree language generated by the grammar. It should be noted that, due to its context-free nature, an MCFTG, like a TAG, also has derivation trees (or parse trees), which show the way in which a tree is generated by the rules of the grammar. A derivation tree can be viewed as a meta level tree and the derived syntactic tree as an object level tree, cf. [51]. In fact, the parsing algorithm computes a derivation tree (or all derivation trees) for the given string, and then computes the corresponding syntactic tree(s).

We define the MCFTG as a straightforward generalization of the MCFG, based on tree substitution rather than string substitution, where a (second-order) tree substitution is a tree homomorphism. However, our formal syntactic definition of the MCFTG is closer to the one of the context-free tree grammar (CFTG) as in, e.g., [31, 37, 42, 58, 61, 81, 90]. Just as for the MCFG, the semantics of the MCFTG is a least fixed point semantics, which can easily be viewed as a semantics based on parse trees (Theorem 9). Moreover, we provide a rewriting semantics for MCFTGs (similar to the one for CFTGs and similar to the one in [78] for MCFGs) leading to a usual notion of derivation, for which the derivation trees then equal the parse trees (Theorem 19).

Intuitively, an MCFTG G is a simple (i.e., linear and nondeleting) context-free tree grammar (spCFTG) in which several nonterminals are rewritten in one derivation step. Thus every rule of G is a sequence of rules of an spCFTG, and the left-hand side nonterminals of these rules are rewritten simultaneously. However, a sequence of nonterminals can only be rewritten if (earlier in the derivation) they were introduced explicitly as such by the application of a rule of G.
Therefore, each rule of G must also specify the sequences of (occurrences of) nonterminals in its right-hand side that may later be rewritten. This restriction is called “locality” in [53, 78, 93].

Apart from the above-mentioned results (and some related results), our main result is that MCFTGs can be lexicalized (Theorem 44). Let us consider an MCFTG G that generates a tree language L(G) over the ranked alphabet Σ, and let ∆ ⊆ Σ be a given set of lexical items. We say that G is lexicalized (with respect to ∆) if every rule of G contains at least one lexical item (or anchor). Lexicalized grammars are of importance for several reasons. First, a lexicalized grammar is often more understandable, because the rules of the grammar can be grouped around the lexical items. Each rule can then be viewed as lexical information on its anchor, demonstrating a syntactical construction in which the anchor can occur. Second, a lexicalized grammar defines a so-called dependency structure on the lexical items of each generated object, which makes it possible to investigate certain aspects of the grammatical structure of that object, see [64]. Third, certain parsing methods can take significant advantage of the fact that the grammar is lexicalized, see, e.g., [86]. In the case where each lexical item is a symbol of the string alphabet (i.e., has rank 0), each rule of a lexicalized grammar produces at least one symbol of the generated string. Consequently, the number of rule applications (i.e., derivation steps) is clearly bounded by the length of the input string. In addition, the lexical items in the rules guide the rule selection in a derivation, which works especially well in scenarios with large alphabets (cf. the detailed account in [10]).

We say that G is finitely ambiguous (with respect to ∆) if, for every n ≥ 0, L(G) contains only finitely many trees with n occurrences of lexical items. For simplicity, let us also assume here that every tree in L(G) contains at least one lexical item.
Obviously, if G is lexicalized, then it is finitely ambiguous. Our main result is that for a given MCFTG G it is decidable whether or not G is finitely ambiguous, and if so, a lexicalized MCFTG G′ can be constructed that is (strongly) equivalent to G, i.e., L(G′) = L(G). Moreover, we show that G′ is grammatically similar to G, in the sense that their derivation trees are closely related: every derivation tree of G′ can be translated by a finite-state tree transducer into a derivation tree of G for the same syntactic tree, and vice versa.
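The definition of “lexicalized” can be checked mechanically. The following Python sketch does so under stated assumptions: trees are hypothetical `(label, children)` pairs, each rule is reduced to its right-hand side (a list of trees, one per nonterminal rewritten by the rule), and the two toy grammars and all names are made up for illustration. The sketch only tests the definition; it is not the lexicalization construction of Theorem 44.

```python
def count_lexical(tree, delta):
    """Number of occurrences of symbols from `delta` in `tree`."""
    label, children = tree
    own = 1 if label in delta else 0
    return own + sum(count_lexical(child, delta) for child in children)


def is_lexicalized(rules, delta):
    """True if every right-hand side contains at least one lexical item."""
    return all(
        sum(count_lexical(tree, delta) for tree in rhs) >= 1
        for rhs in rules.values()
    )


def leaf(label):
    return (label, [])


DELTA = {"a", "b"}  # assumed set of lexical items

# Toy grammar for the trees s(a^n(e), b^n(e)), n >= 0; two rules lack an anchor.
plain = {
    "S -> s(A, B)": [("s", [leaf("A"), leaf("B")])],
    "(A, B) -> (a(A), b(B))": [("a", [leaf("A")]), ("b", [leaf("B")])],
    "(A, B) -> (e, e)": [leaf("e"), leaf("e")],
}

# Hand-written variant for n >= 1 in which every rule carries a lexical item.
anchored = {
    "S -> s(a(e), b(e))": [("s", [("a", [leaf("e")]), ("b", [leaf("e")])])],
    "S -> s(a(A), b(B))": [("s", [("a", [leaf("A")]), ("b", [leaf("B")])])],
    "(A, B) -> (a(A), b(B))": [("a", [leaf("A")]), ("b", [leaf("B")])],
    "(A, B) -> (a(e), b(e))": [("a", [leaf("e")]), ("b", [leaf("e")])],
}
```

In this sketch `is_lexicalized(plain, DELTA)` is false and `is_lexicalized(anchored, DELTA)` is true. The variant drops only the tree s(e, e), which contains no lexical item, in line with the simplifying assumption that every generated tree contains at least one lexical item.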