Efficient techniques for parsing with tree automata

Jonas Groschwitz and Alexander Koller
Department of Linguistics, University of Potsdam, Germany
groschwi|[email protected]

Mark Johnson
Department of Computing, Macquarie University, Australia
[email protected]

Abstract

Parsing for a wide variety of grammar formalisms can be performed by intersecting finite tree automata. However, naive implementations of parsing by intersection are very inefficient. We present techniques that speed up tree-automata-based parsing, to the point that it becomes practically feasible on realistic data when applied to context-free, TAG, and graph parsing. For graph parsing, we obtain the best runtimes in the literature.

1 Introduction

Grammar formalisms that go beyond context-free grammars have recently enjoyed renewed attention throughout computational linguistics. Classical grammar formalisms such as TAG (Joshi and Schabes, 1997) and CCG (Steedman, 2001) have been equipped with expressive statistical models, and high-performance parsers have become available (Clark and Curran, 2007; Lewis and Steedman, 2014; Kallmeyer and Maier, 2013). Synchronous grammar formalisms such as synchronous context-free grammars (Chiang, 2007) and tree-to-string transducers (Galley et al., 2004; Graehl et al., 2008; Seemann et al., 2015) are being used as models that incorporate syntactic information in statistical machine translation. Synchronous string-to-tree (Wong and Mooney, 2006) and string-to-graph grammars (Chiang et al., 2013) have been applied to semantic parsing; and so forth.

Each of these grammar formalisms requires its users to develop new algorithms for parsing and training. This comes with challenges that are both practical and theoretical. From a theoretical perspective, many of these algorithms are basically the same, in that they rest upon a CKY-style parsing algorithm which recursively explores substructures of the input object and assigns them nonterminal symbols, but their exact relationship is rarely made explicit. On the practical side, this parsing algorithm and its extensions (e.g. to EM training) have to be implemented and optimized from scratch for each new grammar formalism. Thus, development time is spent on reinventing wheels that are slightly different from previous ones, and the resulting implementations still tend to underperform.

Koller and Kuhlmann (2011) introduced Interpreted Regular Tree Grammars (IRTGs) in order to address this situation. An IRTG represents a language by describing a regular tree language of derivation trees, each of which is mapped to a term over some algebra and evaluated there. Grammars from a wide range of monolingual and synchronous formalisms can be mapped into IRTGs by using different algebras: context-free and tree-adjoining grammars use string algebras of different kinds, graph grammars can be captured by using graph algebras, and so on. In addition, IRTGs come with a universal parsing algorithm based on closure results for tree automata. Implementing and optimizing this parsing algorithm once, one could apply it to all grammar formalisms that can be mapped to IRTG. However, while Koller and Kuhlmann show that asymptotically optimal parsing is possible in theory, it is non-trivial to implement their algorithm optimally.

In this paper, we introduce practical algorithms for the two key operations underlying IRTG parsing: computing the intersection of two tree automata and applying an inverse tree homomorphism to a tree automaton. After defining IRTGs (Section 2), we will first illustrate that a naive bottom-up implementation of the intersection algorithm yields asymptotic parsing complexities that are too high (Section 3). We will then show how the parsing complexity can be improved by combining algebra-specific index data structures with a generic parsing algorithm (Section 4), and by replacing bottom-up with top-down queries (Section 5). In contrast to the naive algorithm, both of these methods achieve the expected asymptotic complexities, e.g. O(n³) for context-free parsing, O(n⁶) for TAG parsing, etc. Furthermore, an evaluation with realistic grammars shows that our algorithms improve practical parsing times with IRTG grammars encoding context-free grammars, tree-adjoining grammars, and graph grammars by orders of magnitude (Section 6). Thus our algorithms make IRTG parsing practically feasible for the first time; for graph parsing, we obtain the fastest reported runtimes.

  automaton rule        homomorphism
  S  → r1(NP, VP)       ∗(x1, x2)
  NP → r2               John
  VP → r3               walks
  VP → r4(VP, NP)       ∗(x1, ∗(on, x2))
  NP → r5               Mars

Figure 1: An example IRTG.

2 Interpreted Regular Tree Grammars

We will first define IRTGs and explain how the universal parsing algorithm for IRTGs works.

2.1 Formal foundations

First, we introduce some fundamental theoretical concepts and notation.

A signature Σ is a finite set of symbols r, f, ..., each of which has an arity ar(r) ≥ 0. A tree t over the signature Σ is a term of the form r(t1, ..., tn), where the ti are trees and r ∈ Σ has arity n. We identify the nodes of t by their Gorn addresses, i.e. paths π ∈ N∗ from the root to the node, and write t(π) for the label of π. We write TΣ for the set of all trees over Σ, and TΣ(Xk) for the trees in which each node either has a label from Σ, or is a leaf labeled with one of the variables {x1, ..., xk}.

A (linear, nondeleting) tree homomorphism h from a signature Σ to a signature ∆ is a mapping h : TΣ → T∆. It is defined by specifying, for each symbol r ∈ Σ of arity k, a term h(r) ∈ T∆(Xk) in which each variable occurs exactly once. This symbol-wise mapping is lifted to entire trees by letting h(r(t1, ..., tk)) = h(r)[h(t1), ..., h(tk)], i.e. by replacing the variable xi in h(r) by the recursively computed value h(ti).

Let ∆ be a signature. A ∆-algebra A consists of a nonempty set A, called the domain, and for each symbol f ∈ ∆ with arity k, a function fA : Aᵏ → A, the operation associated with f. We can evaluate any term t ∈ T∆ to a value ⟦t⟧A ∈ A, by evaluating the operation symbols bottom-up. In this paper, we will be particularly interested in the string algebra E∗ over the finite alphabet E. Its domain is the set of all strings over E. For each symbol a ∈ E, it has a nullary operation symbol a with ⟦a⟧E∗ = a. It also has a single binary operation symbol ∗, such that ∗E∗(w1, w2) is the concatenation of the strings w1 and w2. Thus the term ∗(John, ∗(walks, ∗(on, Mars))) in Fig. 2b evaluates to the string "John walks on Mars".

A finite tree automaton M over the signature Σ is a structure M = (Σ, Q, R, XF), where Q is a finite set of states and XF ∈ Q is a final state. R is a finite set of transition rules of the form X → r(X1, ..., Xk), where the terminal symbol r ∈ Σ is of arity k and X, X1, ..., Xk ∈ Q. A tree automaton can run non-deterministically on a tree t ∈ TΣ by assigning states to the nodes of t bottom-up. If we have t = r(t1, ..., tn), a rule X → r(X1, ..., Xn) in R, and M can assign the state Xi to each ti, written Xi →∗ ti, then we also have X →∗ t. We say that M accepts t if XF →∗ t, and define the language L(M) ⊆ TΣ of M as the (possibly infinite) set of all trees that M accepts. An example of a tree automaton (with states S, NP, etc.) is shown in the "automaton rule" column of Fig. 1. It accepts, among others, the tree τ1 in Fig. 2a.

Tree automata can be defined top-down or bottom-up, and are equivalent to regular tree grammars. The languages that can be accepted by finite tree automata are called the regular tree languages. See e.g. Comon et al. (2008) for details.
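To make these definitions concrete, here is a minimal sketch in Java (the language of the Alto implementation discussed in Section 6; all names are our own illustrations, not Alto's API) of trees over a signature, the lifting of h to entire trees, and evaluation in the string algebra E∗, applied to the example of Fig. 1. Representing strings as space-joined words is a simplifying assumption of this sketch.

    import java.util.*;

    public class HomDemo {
        // A tree r(t1, ..., tn); a leaf "x1", "x2", ... stands for a variable.
        record Tree(String label, List<Tree> children) {
            static Tree node(String label, Tree... ts) { return new Tree(label, List.of(ts)); }
            public String toString() {
                return children.isEmpty() ? label
                    : label + "(" + String.join(", ", children.stream().map(Tree::toString).toList()) + ")";
            }
        }

        // h(r(t1,...,tk)) = h(r)[h(t1),...,h(tk)]: replace variable xi in h(r)
        // by the recursively computed image h(ti).
        static Tree applyHom(Map<String, Tree> h, Tree t) {
            List<Tree> images = t.children().stream().map(c -> applyHom(h, c)).toList();
            return substitute(h.get(t.label()), images);
        }

        static Tree substitute(Tree pattern, List<Tree> images) {
            if (pattern.children().isEmpty() && pattern.label().matches("x\\d+"))
                return images.get(Integer.parseInt(pattern.label().substring(1)) - 1);
            return new Tree(pattern.label(),
                    pattern.children().stream().map(c -> substitute(c, images)).toList());
        }

        // Evaluation in E*: a constant denotes itself, *(w1, w2) concatenates.
        static String evaluate(Tree t) {
            return t.children().isEmpty() ? t.label()
                : evaluate(t.children().get(0)) + " " + evaluate(t.children().get(1));
        }

        public static void main(String[] args) {
            // The homomorphism of Fig. 1: h(r1) = *(x1, x2), h(r2) = John, etc.
            Map<String, Tree> h = Map.of(
                "r1", Tree.node("*", Tree.node("x1"), Tree.node("x2")),
                "r2", Tree.node("John"),
                "r3", Tree.node("walks"),
                "r4", Tree.node("*", Tree.node("x1"), Tree.node("*", Tree.node("on"), Tree.node("x2"))),
                "r5", Tree.node("Mars"));
            // tau1 = r1(r2, r4(r3, r5)), cf. Fig. 2a.
            Tree tau1 = Tree.node("r1", Tree.node("r2"), Tree.node("r4", Tree.node("r3"), Tree.node("r5")));
            Tree term = applyHom(h, tau1);
            System.out.println(term);            // *(John, *(walks, *(on, Mars)))
            System.out.println(evaluate(term));  // John walks on Mars
        }
    }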

[Figure 2 shows the derivation tree τ1 = r1(r2, r4(r3, r5)), the term h(τ1) = ∗(John, ∗(walks, ∗(on, Mars))), and its evaluation to the string "John walks on Mars".]

(a) Tree τ1. (b) Term h(τ1). (c) h(τ1) evaluated in E∗.

Figure 2: The tree τ1, evaluated by the homomorphism h and the algebra E∗.

2.2 Interpreted regular tree grammars

We can combine tree automata, homomorphisms, and algebras into grammars that can describe languages of arbitrary objects, as well as relations between such objects – in a way that inherits many technical properties from context-free grammars, while extending the expressive capacity.

An interpreted regular tree grammar (IRTG, Koller and Kuhlmann (2011)) G = (M, (h1, A1), ..., (hn, An)) consists of a tree automaton M over some signature Σ, together with an arbitrary number n of interpretations (hi, Ai), where each Ai is an algebra over some signature ∆i and each hi is a tree homomorphism from Σ to ∆i. The automaton M describes a language L(M) of derivation trees, which represent abstract syntactic structures. Each derivation tree τ is then interpreted n ways: we map it to a term hi(τ) ∈ T∆i, and then we evaluate hi(τ) to a value ai = ⟦hi(τ)⟧Ai ∈ Ai of the algebra Ai. Thus, the IRTG G defines a language L(G) = {(⟦h1(τ)⟧A1, ..., ⟦hn(τ)⟧An) | τ ∈ L(M)}, which is an n-place relation between the domains of the algebras.

Consider the IRTG shown in Fig. 1. The "automaton rule" column indicates the five rules of M; the final state is S. We already saw the derivation tree τ1 ∈ L(M). G has a single interpretation, into a string algebra E∗, and with a homomorphism that is specified by the "homomorphism" column; for instance, h(r1) = ∗(x1, x2) and h(r2) = John. Applying this homomorphism to τ1, we obtain the term h(τ1) in Fig. 2b. As we saw earlier, this term evaluates in the string algebra to the string "John walks on Mars" (Fig. 2c). Thus this string is an element of L(G).

We assume that no two rules of M use the same terminal symbol; this is generally not required in tree automata, but every IRTG can be brought into this convenient form. Furthermore, we focus (but only for simplicity of presentation) on IRTGs that use a single string-algebra interpretation, as in Fig. 1. Such grammars capture context-free grammars. However, IRTGs can capture a wide variety of grammar formalisms by using different algebras. For instance, an interpretation that uses a TAG string algebra (or TAG derived-tree algebra) models a tree-adjoining grammar (Koller and Kuhlmann, 2012), and an interpretation into an s-graph algebra models a hyperedge replacement graph grammar (HRG, Groschwitz et al. (2015)). By using multiple algebras, IRTGs can also represent synchronous grammars and (bottom-up) tree-to-tree and tree-to-string transducers. In general, any grammar formalism whose grammars describe derivations in terms of a finite set of states can typically be converted into IRTG.

2.3 Parsing IRTGs

Koller and Kuhlmann (2011) present a uniform parsing algorithm for IRTGs based on tree automata. The (monolingual) parsing problem of IRTG consists in determining, for an IRTG G and an input object a ∈ A, a representation of the set parses(a) = {τ ∈ L(M) | ⟦h(τ)⟧A = a}, i.e. of the derivation trees that are grammatically correct and are mapped to a by the interpretation. In the example, we have parses("John walks on Mars") = {τ1}, where τ1 is as above. In general, parses(a) may be infinite, and thus we aim to represent it using a tree automaton Cha with L(Cha) = parses(a), the parse chart of a.

We can compute Cha as follows. First, observe that parses(a) = L(M) ∩ h⁻¹(terms(a)), where h⁻¹(L) = {τ ∈ TΣ | h(τ) ∈ L} (the inverse homomorphic image, or invhom, of L) and terms(a) = {t ∈ T∆ | ⟦t⟧A = a}, i.e. the set of all terms that evaluate to a. Now assume that the algebra A is regularly decomposable, which means that every a ∈ A has a decomposition automaton Da, i.e. there is a tree automaton Da such that L(Da) = terms(a). Because regular tree languages are closed under invhom and intersection, we can then compute a tree automaton Cha by intersecting M with the invhom of Da.

To illustrate the IRTG parsing algorithm, let us compute a chart for the sentence s = "John walks on Mars" with the example grammar of Fig. 1. The states of the decomposition automaton Ds are spans [i, k] of s; the final state is XF = [1, 5]. The automaton contains fourteen rules, including the ones shown in Fig. 3a.
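As an illustration of regular decomposability for the string algebra, the following minimal sketch (our own illustrative code, not the Alto API) enumerates the rules of the decomposition automaton Ds: one nullary rule per word and one ∗-rule per span and split point. For "John walks on Mars" it prints exactly the fourteen rules mentioned above.

    import java.util.*;

    public class Decomposition {
        public static void main(String[] args) {
            String[] words = {"John", "walks", "on", "Mars"};
            int n = words.length;
            List<String> rules = new ArrayList<>();
            // nullary rules for the words: [i, i+1] -> w_i
            for (int i = 1; i <= n; i++)
                rules.add(span(i, i + 1) + " -> " + words[i - 1]);
            // binary rules [i,k] -> *([i,j], [j,k]) for every split point j
            for (int width = 2; width <= n; width++)
                for (int i = 1; i + width <= n + 1; i++) {
                    int k = i + width;
                    for (int j = i + 1; j < k; j++)
                        rules.add(span(i, k) + " -> *(" + span(i, j) + ", " + span(j, k) + ")");
                }
            rules.forEach(System.out::println);  // 14 rules; the final state is [1, 5]
        }

        static String span(int i, int k) { return "[" + i + ", " + k + "]"; }
    }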

(a) Some rules of Ds:
  [1, 5] → ∗([1, 2], [2, 5])
  [2, 5] → ∗([2, 3], [3, 5])
  [3, 5] → ∗([3, 4], [4, 5])
  [3, 4] → on
  [4, 5] → Mars

(b) Some rules of I = h⁻¹(Ds):
  [1, 5] → r1([1, 2], [2, 5])
  [2, 5] → r4([2, 3], [4, 5])
  [2, 4] → r1([2, 3], [3, 4])
  [1, 2] → r2
  [2, 3] → r3

(c) The parse chart Chs:
  S[1, 5] → r1(NP[1, 2], VP[2, 5])
  VP[2, 5] → r4(VP[2, 3], NP[4, 5])
  NP[1, 2] → r2
  VP[2, 3] → r3
  NP[4, 5] → r5

Figure 3: Example rules for the sentence s = “John walks on Mars”

Algorithm 1 Naive bottom-up intersection
1: initialize agenda with state pairs for constants
2: initialize P as empty
3: while agenda is not empty do
4:   T′X′ ← pop(agenda)
5:   add T′X′ to P
6:   for T″X″ ∈ P do
7:     for {T1X1, T2X2} = {T′X′, T″X″} do
8:       for T → r(T1, T2) in ML do
9:         for X → r(X1, X2) in MR do
10:          store TX → r(T1X1, T2X2)
11:          add TX to agenda if new

We can then compute the invhom automaton I, such that L(I) = h⁻¹(L(Ds)). I uses the same states as Ds, but uses terminal symbols from Σ instead of ∆. Some rules of the invhom automaton I in the example are shown in Fig. 3b. Notice that I also contains rules that are not consistent with M, i.e. that would not occur in a grammatical parse of the sentence, such as [2, 4] → r1([2, 3], [3, 4]). Finally, the chart Chs is computed by intersecting M with I (see Fig. 3c). The states of Chs are pairs of states from M and states from I. It accepts τ1, because τ1 ∈ parses(s). Observe the similarity to a traditional context-free parse chart.

3 Bottom-up intersection

Both the practical efficiency of this algorithm and its asymptotic complexity depend crucially on how we compute intersection and invhom. We illustrate this using an overly naive intersection algorithm as a strawman, and then analyze the problem to lay the foundations for the improved algorithms in Sections 4 and 5.

Let's say that we want to compute a tree automaton C for the intersection of a "left" automaton ML and a "right" automaton MR, both over the same signature Σ (we assume binary symbols for simplicity; all algorithms generalize to arbitrary arities). In the application to IRTG parsing, ML is typically the derivation tree automaton (called M above) and MR is the invhom of a decomposition automaton. As in the product construction for finite string automata, the states of C will be pairs TX of states T of ML and states X of MR, and the rules of C will all have the form TX → r(T1X1, ..., TnXn), where T → r(T1, ..., Tn) is a rule in ML, and X → r(X1, ..., Xn) is a rule in MR.

3.1 Naive intersection

A naive bottom-up algorithm is shown in Alg. 1. This algorithm maintains an agenda of state pairs that have been discovered, but not yet explored as children of bottom-up rule applications; and a chart-like set P of all state pairs that have ever been popped off the agenda. The algorithm maintains the invariant that if TX is on the agenda or in P, then T and X are partners (written T ≈ X), i.e. there is a tree t ∈ TΣ such that T →∗ t in ML and X →∗ t in MR.

The agenda is initialized with all state pairs TX for which ML has a rule T → r and MR has a rule X → r for some nullary symbol r ∈ Σ. Then, while there are state pairs left on the agenda, Alg. 1 pops a state pair T′X′ off the agenda and adds it to P; iterates over all state pairs T″X″ in P; and queries ML and MR bottom-up for rules in which these states appear as children (for the invhom automaton, this can be done by substituting the variables in the homomorphic image h(r) with the corresponding states X′ and X″, and running the decomposition automaton on the resulting tree). The iteration in line 7 allows T′ and X′ to be either left or right children in these rules. For each pair of left and right rules, the rules are combined into a rule of C, and the pair of the parent states T and X is added to the agenda.
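To make the discussion concrete, here is a direct transcription of Alg. 1 as a minimal Java sketch, with automata given as plain rule lists (our own illustrative types, not the Alto API).

    import java.util.*;

    public class NaiveIntersection {
        record Rule(String parent, String label, List<String> children) {}
        record Pair(String t, String x) {}   // a state pair TX

        static Set<Rule> intersect(List<Rule> left, List<Rule> right) {
            Set<Rule> result = new LinkedHashSet<>();
            Deque<Pair> agenda = new ArrayDeque<>();
            Set<Pair> seen = new HashSet<>();
            // line 1: initialize the agenda with state pairs for constants
            for (Rule l : left)
                for (Rule r : right)
                    if (l.children().isEmpty() && r.children().isEmpty()
                            && l.label().equals(r.label())) {
                        Pair p = new Pair(l.parent(), r.parent());
                        result.add(new Rule(code(p), l.label(), List.of()));
                        if (seen.add(p)) agenda.push(p);
                    }
            List<Pair> explored = new ArrayList<>();                      // the set P
            while (!agenda.isEmpty()) {                                   // line 3
                Pair p1 = agenda.pop();                                   // line 4
                explored.add(p1);                                         // line 5
                for (Pair p2 : explored)                                  // line 6
                    for (Pair[] kids : new Pair[][] {{p1, p2}, {p2, p1}}) // line 7: both orders
                        for (Rule l : left)                               // line 8: query M_L
                            if (l.children().equals(List.of(kids[0].t(), kids[1].t())))
                                for (Rule r : right)                      // line 9: query M_R
                                    if (r.label().equals(l.label())
                                            && r.children().equals(List.of(kids[0].x(), kids[1].x()))) {
                                        Pair parent = new Pair(l.parent(), r.parent());
                                        result.add(new Rule(code(parent), l.label(),
                                                List.of(code(kids[0]), code(kids[1]))));  // line 10
                                        if (seen.add(parent)) agenda.push(parent);        // line 11
                                    }
            }
            return result;
        }

        static String code(Pair p) { return p.t() + p.x(); }

        public static void main(String[] args) {
            List<Rule> ml = List.of(new Rule("S", "r1", List.of("NP", "VP")),
                    new Rule("NP", "r2", List.of()), new Rule("VP", "r3", List.of()));
            List<Rule> mr = List.of(new Rule("[1,3]", "r1", List.of("[1,2]", "[2,3]")),
                    new Rule("[1,2]", "r2", List.of()), new Rule("[2,3]", "r3", List.of()));
            intersect(ml, mr).forEach(System.out::println);  // chart for "John walks"
        }
    }

The nested loops in lines 6-9 of the sketch are exactly where the complexity problem discussed next arises.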

This naive intersection algorithm yields an asymptotic complexity for IRTG parsing that is higher than expected. Assume, for example, that we are parsing with an IRTG encoding of a context-free grammar, i.e. with a string algebra (as in Fig. 1). Then the states of MR are spans [i, k], i.e. MR has O(n²) states. Once line 4 has picked a span X′ = [i, j], line 6 iterates over all spans X″ = [k, l] that have been discovered so far – including ones in which j ≠ k and i ≠ l. Thus the bottom-up lookup in line 9 is executed O(n⁴) times, most of which will yield no rules. The overall runtime of Alg. 1 is therefore higher than the asymptotic runtime of O(n³) expected for context-free parsing. Similar problems arise for other algebras; for instance, the runtime of Alg. 1 for TAG parsing is O(n⁸) rather than O(n⁶).

3.2 Indexing

In context-free parsing algorithms, such as CKY or Earley, this issue is addressed through appropriate index data structures, which organize P such that the lookup in line 5 only returns state pairs where X″ is of the form [j, k] or [k, i]. This reduces the runtime to cubic.

The idea of obtaining optimal asymptotic complexities in IRTG parsing through appropriate indexing was already mentioned from a theoretical perspective by Koller and Kuhlmann (2011). However, they assumed an optimal indexing data structure as given. In practice, indexing requires algebra-specific knowledge about X″: a CKY-style index only works if we assume that the states of the decomposition automaton are spans (this is not the case in other algebras), and that the only binary operation in the string algebra is ∗, which composes spans in a certain way. Furthermore, in IRTG parsing the rules of the invhom automaton do not directly correspond to algebra operations, but to terms of operations, which further complicates indexing.

In this paper, we incorporate indexing into the intersection algorithm through sibling-finders. A sibling-finder S = S(M, r) for an automaton M and a label r in M's signature is a data structure that supports a single operation, BU(S, i, X′). We require that a call to BU(S, i, X′) returns the set of rules X → r(X1, ..., Xn) of M such that X′ is the i-th child state, and for every j ≠ i, Xj is a state for which we previously called BU(S, j, Xj). Thus a sibling-finder performs a bottom-up rule lookup, changing its state after each call by caching the state and position.

Algorithm 2 Bottom-up intersection with BU
1: initialize agenda with state pairs for constants
2: generate new Sr = S(MR, r) for every r ∈ Σ
3: while agenda is not empty do
4:   T′X′ ← pop(agenda)
5:   for T → r(T1, T2) in ML s.t. Ti = T′ do
6:     for X → r(X1, X2) ∈ BU(Sr, i, X′) do
7:       store rule TX → r(T1X1, T2X2)
8:       add TX to agenda if new

Assume that we have sibling-finders for MR. Then we can modify the naive Alg. 1 to the closely related algorithm shown as Alg. 2. This algorithm maintains the same agenda as Alg. 1, but instead of iterating over all explored partner states T″X″, it first iterates over all rules in ML that have T′ as a child (line 5). In line 6, Alg. 2 then queries the MR sibling-finders – we maintain one for each rule label – for right rules with matching rule label and child positions. Note that because there is only one rule with label r in ML, the sibling-finders implicitly keep track of the partners of T2 we have seen so far. Thus they play the role of a more structured variant of P.

There are a number of ways in which sibling-finders can be implemented. First, they could simply maintain sets chi(Sr, i), where a call to BU(Sr, i, X′) first adds X′ to chi(Sr, i). The query can then iterate over the set chi(Sr, 3−i), to check for each state X″ in that set whether MR actually contains a rule with terminal symbol r and children X′ and X″ (in the right order). This essentially reimplements the behavior of Alg. 1, and comes with the same complexity issues.

Second, we could theoretically iterate over all rules of MR to implement the sibling-finders via a bottom-up index (e.g., a trie) that supports efficient BU queries. However, in IRTG parsing MR is the invhom of a decomposition automaton. Because the decomposition automaton represents all the ways in which the input object can be built recursively out of smaller structures, including ones which will later be rejected by the grammar, such automata can be very large in practice. Thus we would like to work with a lazy representation of MR and avoid iterating over all rules.

4 Efficient bottom-up lookup

Finally, we can exploit the fact that in IRTG parsing, MR is the invhom of a decomposition automaton. Below, we first show how to define algebra-specific sibling-finders for decomposition automata. Then we develop an algebra-independent way to generate invhom sibling-finders out of those for the decomposition automata. These can be plugged into Alg. 2 to achieve the expected parsing complexity.
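Before turning to the algebra-specific versions, the following minimal sketch fixes the sibling-finder contract as a Java interface, together with the naive set-based implementation described in Section 3.2 (illustrative names and types, not the Alto API).

    import java.util.*;
    import java.util.function.BiFunction;

    // The sibling-finder contract: BU(S, i, X') returns the rules in which X'
    // is the i-th child and the sibling was seen in an earlier BU call.
    interface SiblingFinder<S, R> {
        List<R> bu(int i, S state);   // i is 1 or 2 for binary rules
    }

    // The naive implementation: remember every state seen at each child
    // position and re-check all candidate pairs against the automaton. This is
    // correct, but reintroduces the complexity problem of Alg. 1.
    class NaiveSiblingFinder<S, R> implements SiblingFinder<S, R> {
        private final List<Set<S>> chi = List.of(new HashSet<>(), new HashSet<>());
        private final BiFunction<S, S, List<R>> ruleLookup;  // rules with children (x1, x2)

        NaiveSiblingFinder(BiFunction<S, S, List<R>> ruleLookup) { this.ruleLookup = ruleLookup; }

        public List<R> bu(int i, S state) {
            chi.get(i - 1).add(state);            // cache the state and position
            List<R> rules = new ArrayList<>();
            for (S sibling : chi.get(2 - i))      // the states seen at position 3 - i
                rules.addAll(i == 1 ? ruleLookup.apply(state, sibling)
                                    : ruleLookup.apply(sibling, state));
            return rules;
        }
    }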

4.1 Sibling-finders for decomposition automata

First, consider the special case of sibling-finders for a decomposition automaton D. The terminal symbols f of D are the operation symbols of an algebra. If we have information about the operations of this algebra, and how they operate on the states of D, a sibling-finder S = S(D, f) can use indexing specific to the operation f to look up potential siblings, and only for them query D to answer BU(S, i, X).

For instance, a sibling-finder for the '∗' operation of the string algebra may store all states [k, l] for i = 1 under the index l. Thus a lookup BU(S, 2, [l, m]) can directly retrieve siblings from the l-bin, just as a traditional parse chart would. Spans which do not end at l are never considered. Different algebras require different index structures. For instance, sibling-finders for the string-wrapping operation in the TAG string algebra might retrieve all pairs of substrings [k, l, m, o] that wrap around [l, m] instead. Analogous data structures can be defined for the s-graph algebra.

4.2 Sibling-finders for invhom

We can build upon the D-sibling-finders to construct sibling-finders for the invhom I of D. The basic idea is as follows. Consider the term h(r1) = ∗(x1, x2) from Fig. 1. It contains a single operation symbol ∗ (plus variables); the homomorphism only replaces one symbol with another. Thus a sibling-finder S(D, ∗) from the decomposition automaton can directly serve as a sibling-finder S(I, r1). We only need to replace the label ∗ on the returned rules with r1.

In general, the situation is more complicated, because t = h(r) may be a complex term consisting of many algebra operations. In such a case, we construct a separate sibling-finder Sr,π = S(D, t(π)) for each node π with at least two children. For instance, consider the term t = h(r4) in Fig. 1. It contains three nodes which are labeled by algebra operations, two of which are the concatenation ∗. We decorate these with the sibling-finders Sr4,ε and Sr4,2. Each of these is a sibling-finder for the algebra's concatenation operation; but they may have different state because they received different queries.

Algorithm 3 passUpwards(Y, π, i, r)
1: rules ← BU(Sr,π, i, Y)
2: if π = π′k ≠ ε then
3:   for X → f(X1, ..., Xn) ∈ rules do
4:     passUpwards(X, π′, k, r)

We can then construct an invhom sibling-finder Sr = S(I, r), which answers a query BU(Sr, i, X′) in two steps. First, we substitute the variable xi by the state X′ and percolate it upward through t using the D-sibling-finders on the path from xi to the root. If π = π′k is the path to xi, we do this by calling passUpwards(X′, π′, k, r), as defined in Alg. 3. If the local sibling-finder returns rules and we are not at the root yet, we recursively call passUpwards at the parent node π′ with each parent state of these rules.

As we do this, we let each sibling-finder maintain the set of rules it found, indexed by their parent state. This allows us to perform the second step: we traverse t top-down from the root to extract the rules of the invhom automaton that answer the BU query. Recall that BU(Sr, i, X′) should return only rules X → r(X1, X2) where BU(Sr, 3−i, X₃₋ᵢ) was called before. Here, this is guaranteed by having distinct D-sibling-finders Sr,π for every node π of every tree h(r). A final detail is that before the first query for r, we initialize the sibling-finders by calling passUpwards for all the leaves that are labeled by constants.

This process is illustrated in Fig. 4, on the sibling-finder S = S(I, r4) and the input string "John walks on a hill on Mars", parsed with a suitable extension of the IRTG in Fig. 1. The decomposition automaton can accept the word "on" from states [3, 4] and [6, 7], which during initialization are entered into position 1 of the lower D-sibling-finder Sr4,2, indexed by their end positions (a). Alg. 2 may then generate the query BU(S, 2, [4, 6]). This enters [4, 6] into the lower D-sibling-finder, and because there is a state with end position 4 on the left side of this sibling-finder, BU(Sr4,2, 2, [4, 6]) returns a rule with parent [3, 6]. The parent is subsequently entered into the upper sibling-finder (b). Finally, the query BU(S, 1, [2, 3]) enters [2, 3] into the upper D-sibling-finder and discovers its sibling [3, 6] (c). This yields a state X = [2, 6] for the whole phrase "walks on a hill".
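The following minimal sketch implements the end-position index for the string algebra's concatenation described in Section 4.1; the two BU calls in main mirror stages (a) and (b) of Fig. 4 below (illustrative code, not the Alto API).

    import java.util.*;

    public class ConcatSiblingFinder {
        record Span(int from, int to) {}
        record Rule(Span parent, Span left, Span right) {}

        // Left children [k, l] are binned under their end position l, right
        // children [l, m] under their start position l, so a query only ever
        // touches spans that actually meet at the split point.
        private final Map<Integer, List<Span>> leftByEnd = new HashMap<>();
        private final Map<Integer, List<Span>> rightByStart = new HashMap<>();

        public List<Rule> bu(int i, Span span) {
            List<Rule> rules = new ArrayList<>();
            if (i == 1) {
                leftByEnd.computeIfAbsent(span.to(), k -> new ArrayList<>()).add(span);
                for (Span right : rightByStart.getOrDefault(span.to(), List.of()))
                    rules.add(new Rule(new Span(span.from(), right.to()), span, right));
            } else {
                rightByStart.computeIfAbsent(span.from(), k -> new ArrayList<>()).add(span);
                for (Span left : leftByEnd.getOrDefault(span.from(), List.of()))
                    rules.add(new Rule(new Span(left.from(), span.to()), left, span));
            }
            return rules;
        }

        public static void main(String[] args) {
            ConcatSiblingFinder s = new ConcatSiblingFinder();
            s.bu(1, new Span(3, 4));                      // init with one "on" occurrence, cf. Fig. 4a
            System.out.println(s.bu(2, new Span(4, 6)));  // finds the parent [3, 6], cf. Fig. 4b
        }
    }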

[Figure 4 shows the two D-sibling-finders for t = h(r4) = ∗(x1, ∗(on, x2)) – an upper one at the root ∗ and a lower one at the inner ∗ – at three stages. (a) After initialization, the lower sibling-finder holds the states [3, 4] and [6, 7] for "on" at position 1, indexed under their end positions 4 and 7; all other bins are empty. (b) After BU(S, 2, [4, 6]), the lower sibling-finder has combined [3, 4] with [4, 6] and entered the resulting parent [3, 6] at position 2 of the upper sibling-finder, under index 3. (c) After BU(S, 1, [2, 3]), the upper sibling-finder holds [2, 3] at position 1 and finds its sibling [3, 6].]

Figure 4: Three stages of BU on S(I, r4) for the sentence “John walks on a hill on Mars”.

The top-down traversal of the sibling-finders reveals that this state is reached by combining x1 = [2, 3], for which this BU query asked, with x2 = [4, 6], and thus the BU query yields the rule [2, 6] → r4([2, 3], [4, 6]). A subsequent query BU(S, 2, [4, 8]) would yield the rule [2, 8] → r4([2, 3], [4, 8]), and so on.

The overall construction allows us to answer BU queries on invhom automata while making use of algebra-specific index structures. Given suitable index structures, the asymptotic complexity drops down to the expected levels, e.g. O(n³) for IRTGs using the string algebra, O(n⁶) for the TAG string algebra, and so on. This yields a practical algorithm that can be flexibly adapted to new algebras by implementing their sibling-finders.

5 Top-down intersection

Instead of investing into efficient bottom-up queries, we can also explore the use of top-down queries. These ask for all rules with parent state X and terminal symbol r. Such queries completely avoid the problem of finding siblings in MR. An invhom automaton can answer top-down queries for r efficiently by running the decomposition automaton top-down on h(r), collecting child states at the variable nodes. For instance, if we query I from Section 2 top-down for rules with the parent [1, 5] and symbol r1, it will enumerate the rules [1, 5] → r1([1, 2], [2, 5]), [1, 5] → r1([1, 3], [3, 5]), and [1, 5] → r1([1, 4], [4, 5]), without ever considering any other combination of child states.

Algorithm 4 Top-down intersection
1: function expand(X):
2:   if X ∉ visited then
3:     add X to visited
4:     for X → r(X1, X2) in MR do
5:       call expand(Xi) for i = 1, 2
6:       for T → r(T1, T2) s.t. Ti ∈ prt(Xi) do
7:         store rule TX → r(T1X1, T2X2)
8:         add T to prt(X)

This is the idea underlying the intersection algorithm in Alg. 4. It recursively visits states X of MR, collecting for each X a set prt(X) of states T of ML such that T ≈ X. Line 5 ensures that the prt sets have been computed for both child states of the rule X → r(X1, X2). Line 6 then does a bottom-up lookup of ML rules with the terminal symbol r and with child states that are partners of X1 and X2. Applied to our running example, Alg. 4 parses "John walks on Mars" by recursive calls on expand([1, 5]) and expand([2, 5]), following the rules of I top-down. Recursive calls for [2, 3] and [4, 5] establish VP ∈ prt([2, 3]) and NP ∈ prt([4, 5]), which enables the recursive call for [2, 5] to apply r4 in line 6 and consequently add VP to prt([2, 5]) in line 8.

The algorithm mixes top-down queries to MR with bottom-up queries to ML. Line 6 implements the core idea of the CKY parser, in that it performs bottom-up queries on sets of nonterminals that are partners of adjacent spans – but generalized to arbitrary IRTGs instead of just the string algebra. The top-down query to MR in line 4 is bounded by the number of rules that actually exist in MR, which is O(n³) for the string algebra, O(n⁶) in the TAG string algebra, and O(nˢ · 3^{ds} dˢ) for graphs of degree d and treewidth s − 1 in the graph algebra. Thus Alg. 4 achieves the same asymptotic complexity as native parsing algorithms.
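The following minimal sketch transcribes Alg. 4 for automata given as plain rule lists (illustrative Java, our own types rather than the Alto API; a real implementation would index the rules of MR by parent state instead of scanning a list).

    import java.util.*;

    public class TopDownIntersection {
        record Rule(String parent, String label, List<String> children) {}

        List<Rule> left, right;                        // rules of M_L and M_R
        Set<String> visited = new HashSet<>();
        Map<String, Set<String>> prt = new HashMap<>();
        List<Rule> chart = new ArrayList<>();

        TopDownIntersection(List<Rule> left, List<Rule> right) {
            this.left = left; this.right = right;
        }

        void expand(String x) {
            if (!visited.add(x)) return;               // lines 2-3
            for (Rule r : right) {                     // line 4: rules with parent x
                if (!r.parent().equals(x)) continue;
                for (String child : r.children()) expand(child);   // line 5
                for (Rule l : left) {                  // line 6: bottom-up lookup on M_L
                    if (!l.label().equals(r.label())
                            || l.children().size() != r.children().size()) continue;
                    boolean ok = true;
                    for (int i = 0; i < l.children().size(); i++)
                        ok &= partners(r.children().get(i)).contains(l.children().get(i));
                    if (ok) {
                        List<String> kids = new ArrayList<>();
                        for (int i = 0; i < l.children().size(); i++)
                            kids.add(l.children().get(i) + r.children().get(i));
                        chart.add(new Rule(l.parent() + x, l.label(), kids));  // line 7
                        partners(x).add(l.parent());                           // line 8
                    }
                }
            }
        }

        Set<String> partners(String x) { return prt.computeIfAbsent(x, k -> new HashSet<>()); }

        public static void main(String[] args) {
            List<Rule> ml = List.of(new Rule("S", "r1", List.of("NP", "VP")),
                    new Rule("NP", "r2", List.of()), new Rule("VP", "r3", List.of()));
            List<Rule> mr = List.of(new Rule("[1,3]", "r1", List.of("[1,2]", "[2,3]")),
                    new Rule("[1,2]", "r2", List.of()), new Rule("[2,3]", "r3", List.of()));
            TopDownIntersection td = new TopDownIntersection(ml, mr);
            td.expand("[1,3]");                     // start from M_R's final state
            td.chart.forEach(System.out::println);  // chart rules for "John walks"
        }
    }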

[Figures 5 and 6 plot the runtime in ms for computing the complete chart (log scale, 10¹ to 10⁷) against sentence length (0 to 50), for the bottom-up, top-down, condensed top-down, and sibling-finder algorithms.]

Figure 5: Runtimes for context-free parsing.

Figure 6: Runtimes for TAG parsing.

Condensed top-down intersection. One weakness of Alg. 4 is that it iterates over all rules X → r(X1, X2) of MR individually. This can be extremely wasteful when MR is the invhom of a decomposition automaton, because it may contain a great number of rules that have the same states and only differ in the terminal symbol r. For instance, when we encode a context-free grammar as an IRTG, for every rule r of the form A → B C we have h(r) = ∗(x1, x2). The rules of the invhom automaton are the same for all terminal symbols r with the same term h(r). But Alg. 4 iterates over rules r and not over different terms h(r), repeating the exact same computation for every binary rule of the context-free grammar.

To solve this, we define condensed tree automata, which have rules of the form X → ρ(X1, ..., Xn), where ρ ⊆ Σ is a nonempty set of symbols with arity n. A condensed automaton represents the tree automaton which for all condensed rules X → ρ(X1, ..., Xn) and all r ∈ ρ has the rule X → r(X1, ..., Xn). It is straightforward to represent an invhom automaton as a condensed automaton, by determining for each distinct homomorphic image t the set ρt = {r1, ..., rk} of symbols with h(ri) = t.

We can modify Alg. 4 to iterate over condensed rules in line 4, and to iterate in line 6 over the rules T → r(T1, T2) for which Ti ∈ prt(Xi) and r ∈ ρ. This bottom-up query to ML can be answered efficiently from an appropriate index on the rules of ML. Altogether, this condensed intersection algorithm can be dramatically faster than the original version, if the grammar contains many symbols with the same homomorphic image.
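A minimal sketch of this condensation step: group the terminal symbols by their homomorphic image, so that the invhom rules for each distinct term are computed only once. This is illustrative Java, not the Alto API; the extra symbol r6 and the string rendering of terms are our own assumptions, not part of the grammar in Fig. 1.

    import java.util.*;

    public class Condenser {
        public static void main(String[] args) {
            // h as a map from symbols to (a rendering of) their image h(r).
            Map<String, String> h = Map.of(
                    "r1", "*(x1, x2)", "r6", "*(x1, x2)",  // two binary rules, same image
                    "r2", "John", "r5", "Mars");
            Map<String, Set<String>> rho = new TreeMap<>();
            for (Map.Entry<String, String> e : h.entrySet())
                rho.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
            // {*(x1, x2)=[r1, r6], John=[r2], Mars=[r5]}: the invhom rules for
            // *(x1, x2) are computed once and shared by r1 and r6.
            System.out.println(rho);
        }
    }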

6 Evaluation

We compare the runtime performance of the proposed algorithms on practical grammars and inputs, from three very different grammar formalisms: context-free grammars, TAG, and HRG graph grammars. In each setting, we measure the runtime of four algorithms: the naive bottom-up baseline of Section 3; the sibling-finder algorithm from Section 4; and the non-condensed and the condensed version of the top-down algorithm from Section 5. The results are shown in Figures 5, 6 and 7. We measure the runtimes for computing the complete chart, and plot the geometric mean of runtimes for each input size on a log scale.

We measured all runtimes on an Intel Xeon E7-8857 CPU at 3 GHz using Java 8. The JVM was warmed up before the measurements. The parser filtered each grammar automatically, removing all rules whose homomorphic image contained a constant that could not be used for a given input (e.g., a word that did not occur in the sentence).

PCFG. We extracted a binarized context-free grammar with 6929 rules from Section 00 of the Penn Treebank, and parsed the sentences of Section 00 with it. The homomorphism in the corresponding IRTG assigns every terminal symbol a constant or the term ∗(x1, x2), as in Fig. 1. As a consequence, the condensed automaton optimization from Section 5 outperforms all other algorithms, achieving a 100x speedup over the naive bottom-up algorithm at the point where the latter was cancelled.

TAG. We also extracted a tree-adjoining grammar from Section 00 of the PTB as described by Chen and Vijay-Shanker (2000), converted it to an IRTG as described by Koller and Kuhlmann (2012), and binarized it, yielding an IRTG with 26652 rules. Each term h(r) in this grammar represents an entire TAG elementary tree, which means the terms are much more complex than for the PCFG, and there are far fewer terminal symbols with the same homomorphic term. As a consequence, condensing the invhom is much less helpful. However, the sibling-finder algorithm excels at maintaining state information within each elementary tree, yielding a 1000x speedup over the naive bottom-up algorithm at the point where the latter was cancelled.

[Figure 7: Runtimes for graph parsing – runtime in ms (log scale, 10¹ to 10⁷) against node count (1 to 10) for the bottom-up, top-down, condensed top-down, sibling-finder, and GKT 15 parsers.]

Graphs. Finally, we parsed a corpus of graphs instead of strings, using the 13681-rule graph grammar of Groschwitz et al. (2015) to parse the 1258 graphs with up to 10 nodes from the "Little Prince" AMR-Bank (Banarescu et al., 2013). The top-down algorithms are slow in this experiment, confirming Groschwitz et al.'s findings. Again, the sibling-finder algorithm outperforms all other algorithms. Note that Groschwitz et al.'s parser ("GKT 15" in Fig. 7) shares much code with our system. It uses the same decomposition automata, but a less mature version of the sibling-finder method, which fully computes the invhom automaton. Our new system achieves a 9x speedup for parsing the whole corpus, compared to GKT 15.

7 Related Work

Describing parsing algorithms at a high level of abstraction has a long tradition in computational linguistics, e.g. in deductive parsing with parsing schemata (Shieber et al., 1995). A key challenge under this view is to index chart entries so they can be retrieved efficiently, which parallels the situation in automata intersection discussed here. Gómez-Rodríguez et al. (2009) present an algorithm that automatically establishes index structures guaranteeing optimal asymptotic runtime, but that also requires algebra-specific extensions for grammar formalisms going beyond context-free string grammars.

Efficient parsing has also been studied in other generalized grammar formalisms beyond IRTG. Kanazawa (to appear) shows how the parsing problem of Abstract Categorial Grammars (de Groote, 2001) can be translated into Datalog, which enables the use of generic indexing strategies for Datalog to achieve optimal asymptotic complexity. Ranta (2004) discusses parsing for his Grammatical Framework formalism in terms of partial evaluation techniques from functional programming, which are related to the step-by-step evaluation of sibling-finders in Fig. 4. Like the approach of Gómez-Rodríguez et al., these methods have not been evaluated on large-scale grammars and realistic evaluation data, which makes it hard to judge their relative practical merits.

Most work in the tree automata community has a theoretical slant, and there is less research on the efficient implementation of algorithms for tree automata than one would expect; Cleophas (2009) and Lengal et al. (2012) are notable exceptions. Even these tend to be motivated by applications such as specification and verification, where the tree automata are much smaller and much less ambiguous than in computational linguistics. This makes these systems hard to apply directly.

8 Conclusion

We have presented novel algorithms for computing the intersection and the inverse homomorphic image of finite tree automata. These can be used to implement a generic algorithm for IRTG parsing, and apply directly to any grammar formalism that can be represented as an IRTG. An evaluation on practical data from three different grammar formalisms shows consistent speed improvements of several orders of magnitude, and our graph parser has the fastest published runtimes. A Java implementation of our algorithms is available as part of the Alto parser, http://bitbucket.org/tclup/alto.

We focused here purely on symbolic parsing, and on computing complete parse charts. In the presence of a probability model (e.g. for IRTG encodings of PCFGs), our algorithms could be made faster through the use of appropriate pruning techniques. It would also be interesting to combine the strengths of the condensed and sibling-finder algorithms for further efficiency gains.

Acknowledgments. We thank the anonymous reviewers for their comments. We are grateful to Johannes Gontrum for an early implementation of Alg. 4, and to Christoph Teichmann for many fruitful discussions. This work was supported by the DFG grant KO 2916/2-1.

References

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the Linguistic Annotation Workshop (LAW VII-ID).
John Chen and K. Vijay-Shanker. 2000. Automated extraction of TAGs from the Penn Treebank. In Proceedings of IWPT.
David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.
David Chiang, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, Bevan Jones, and Kevin Knight. 2013. Parsing graphs with hyperedge replacement grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL).
Stephen Clark and James Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.
Loek Cleophas. 2009. Forest FIRE and FIRE Wood: Tools for tree automata and tree algorithms. In Proceedings of the Conference on Finite-State Methods and Natural Language Processing (FSMNLP).
Hubert Comon, Max Dauchet, Rémi Gilleron, Florent Jacquemard, Denis Lugiez, Christof Löding, Sophie Tison, and Marc Tommasi. 2008. Tree automata techniques and applications. http://tata.gforge.inria.fr/.
Philippe de Groote. 2001. Towards abstract categorial grammars. In Proceedings of the 39th ACL/10th EACL.
Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. 2004. What's in a translation rule? In Proceedings of HLT/NAACL.
Carlos Gómez-Rodríguez, Jesús Vilares, and Miguel A. Alonso. 2009. A compiler for parsing schemata. Software: Practice and Experience, 39(5):441–470.
Jonathan Graehl, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3).
Jonas Groschwitz, Alexander Koller, and Christoph Teichmann. 2015. Graph parsing with s-graph grammars. In Proceedings of the 53rd ACL and 7th IJCNLP.
Aravind K. Joshi and Yves Schabes. 1997. Tree-Adjoining Grammars. In G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, volume 3, pages 69–123. Springer-Verlag, Berlin.
Laura Kallmeyer and Wolfgang Maier. 2013. Data-driven parsing using probabilistic linear context-free rewriting systems. Computational Linguistics, 39(1):87–119.
Makoto Kanazawa. To appear. Parsing and generation as Datalog query evaluation. IfCoLog Journal of Logics and Their Applications.
Alexander Koller and Marco Kuhlmann. 2011. A generalized view on parsing and translation. In Proceedings of the 12th International Conference on Parsing Technologies (IWPT).
Alexander Koller and Marco Kuhlmann. 2012. Decomposing TAG algorithms using simple algebraizations. In Proceedings of the 11th TAG+ Workshop.
Ondrej Lengal, Jiri Simacek, and Tomas Vojnar. 2012. VATA: A library for efficient manipulation of non-deterministic tree automata. In C. Flanagan and B. König, editors, Tools and Algorithms for the Construction and Analysis of Systems: 18th International Conference, TACAS 2012. Springer.
Mike Lewis and Mark Steedman. 2014. A* CCG parsing with a supertag-factored model. In Proceedings of EMNLP.
Aarne Ranta. 2004. Grammatical Framework: A type-theoretical grammar formalism. Journal of Functional Programming, 14(2):145–189.
Nina Seemann, Fabienne Braune, and Andreas Maletti. 2015. String-to-tree multi bottom-up tree transducers. In Proceedings of the 53rd ACL and 7th IJCNLP.
Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. The Journal of Logic Programming, 24(1):3–36.
Mark Steedman. 2001. The Syntactic Process. MIT Press, Cambridge, MA.
Yuk Wah Wong and Raymond J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-2006).
