A Generalized View on Parsing and Translation

A Generalized View on Parsing and Translation Alexander Koller Marco Kuhlmann Dept. of Linguistics Dept. of Linguistics and Philology University of Potsdam, Germany Uppsala University, Sweden [email protected] [email protected] Abstract In this paper, we make two contributions. First, we provide a formal framework – interpreted reg- We present a formal framework that generalizes a variety of monolingual and syn- ular tree grammars – which generalizes both synchronous grammar formalisms for parsing chronous grammars, tree transducers, and LCFRS- and translation. Our framework is based on style monolingual grammars. A grammar of this regular tree grammars that describe deriva- formalism consists of a regular tree grammar (RTG, tion trees, which are interpreted in arbitrary Comon et al. (2007)) defining a language of deriva- algebras. We obtain generic parsing algo- tion trees and an arbitrary number of interpreta- rithms by exploiting closure properties of tions which map these trees into objects of arbi- regular tree languages. trary algebras. This allows us to capture a wide variety of (synchronous and monolingual) gram- 1 Introduction mar formalisms. We can also model heterogeneous Over the past years, grammar formalisms that relate synchronous languages, which relate e.g. trees with pairs of grammatical structures have received much strings; this is necessary for applications in ma- attention. These formalisms include synchronous chine translation (Graehl et al., 2008) and in pars- grammars (Lewis and Stearns, 1968; Shieber and ing strings with synchronous tree grammars. Schabes, 1990; Shieber, 1994; Rambow and Satta, Second, we also provide parsing and decoding al- 1996; Eisner, 2003) and tree transducers (Comon gorithms for our framework. The key concept that et al., 2007; Graehl et al., 2008). Weighted vari- we introduce is that of a regularly decomposable ants of both of families of formalisms have been algebra, where the set of all terms that evaluate to used for machine translation (Graehl et al., 2008; a given object form a regular tree language. Once Chiang, 2007), where one tree represents a parse an algorithm that computes a compact representa- of a sentence in one language and the other a parse tion of this language is known, parsing algorithms in the other language. Synchronous grammars and follow from a generic construction. All important tree transducers are also useful as models of the algebras in natural language processing that we are syntax-semantics interface; here one tree represents aware of – in particular the standard algebras of the syntactic analysis of a sentence and the other strings and trees – are regularly decomposable. the semantic analysis (Shieber and Schabes, 1990; In summary, we obtain a formalism that pulls Nesson and Shieber, 2006). together much existing research under a common When such a variety of formalisms are avail- formal framework, and makes it possible to ob- able, it is useful to take a step back and look for tain parsers for existing and new formalisms in a a generalized model that explains the precise for- modular, universal fashion. mal relationship between them. There is a long Plan of the paper. The paper is structured as tradition of such research on monolingual grammar follows. We start by laying the formal foundations formalisms, where e.g. linear context-free rewriting in Section 2. We then introduce the framework of systems (LCFRS, Vijay-Shanker et al. (1987)) gen- interpreted RTGs and illustrate it with some simple eralize various mildly context-sensitive formalisms. examples in Section 3. The generic parsing and However, few such results exist for synchronous decoding algorithms are described in Section 4. formalisms. A notable exception is the work by Section 5 discusses the role of binarization in our Shieber (2004), who unified synchronous tree- framework. Section 6 shows how interpreted RTGs adjoining grammars with tree transducers. can be applied to existing grammar formalisms. 2 Proceedings of the 12th International Conference on Parsing Technologies, pages 2–13, October 5-7, 2011, Dublin City University. c 2011 Association for Computational Linguistics 2 Formal Foundations where ti/xi i [n] represents the substitu- { | ∈ } tion that replaces all occurrences of x with the For n 0, we define [n] = i 1 i n . i ≥ { | ≤ ≤ } respective ti. A homomorphism is called linear A signature is a finite set Σ of function sym- if every term h(f) contains each variable at most bols f, each of which has been assigned a non- once; and a delabeling if every term h(f) is of the negative integer called its rank. Given a signature form g(x , . , x ) where n is the rank of f Σ, we can define a (finite constructor) tree over Σ π(1) π(n) and π a permutation of 1, . , n . as a finite tree whose nodes are labeled with sym- { } bols from Σ such that a node with a label of rank 3 Interpreted Regular Tree Grammars n has exactly n children. We write TΣ for the set of all trees over Σ. Trees can be written as terms; We will now present a generalized framework for f(t1, . , tn) stands for the tree with root label f synchronous and monolingual grammars in terms and subtrees t1, . , tn. The nodes of a tree can be of regular tree grammars, tree homomorphisms, identified by paths π N∗ from the root: The root and algebras. We will illustrate the framework with ∈ has address , and the i-th child of the node at path two simple examples here, but many other grammar π has the address πi. We write t(π) for the symbol formalisms can be seen as special cases too, as we at path π in the tree t. will show in Section 6. A Σ-algebra consists of a non-empty set A A called the domain and, for each symbol f Σ 3.1 An Introductory Example ∈ with rank n, a total function f : An A, the A → The derivation process of context-free grammar is operation associated with f. We can evaluate a usually seen as a string-rewriting process in which term t TΣ to an object t A by executing ∈ A ∈ nonterminals are successively replaced by the right- the operations: hand sides of production rules. The actual parse J K tree is explained as a post-hoc description of the f(t1, . , tn) = f A( t1 ,..., tn ) . A A A rules that were applied in the derivation. Sets of trees can be specified by regular tree However, we can alternatively view this as a two- J K J K J K grammars (RTGs) (Gécseg and Steinby, 1997; step process which first computes a derivation tree Comon et al., 2007). Formally, such a grammar is and then interprets it as a string. Say we have the a structure = (N, Σ, P, S), where N is a signa- G G CNF grammar in Fig. 2, and we want to derive ture of nonterminal symbols, all of which are taken the string w = “Sue watches the man with the to have rank 0, Σ is a signature of terminal sym- telescope”. In the first step, we use G to generate a bols, S N is a distinguished start symbol, and P ∈ derivation tree like the one in Fig. 2a. The nodes of is a finite set of productions of the form B t, → this tree are labeled with names of the production where B is a nonterminal symbol, and t TN Σ. rules in G; nodes with labels r and r are licensed ∈ ∪ 7 3 The productions of a regular tree grammar are used by G to be the two children of r1 because r1 has as rewriting rules on terms. More specifically, the the two nonterminals NP and VP in its right-hand derivation relation of is defined as follows. Let r r G side, and the left-hand sides of 7 and 3 are NP t1, t2 TN Σ be terms. Then derives t2 from t1 and VP, respectively. In a second step, we can then ∈ ∪ G in one step, denoted by t1 t2, if there exists a interpret the derivation tree into w by interpreting ⇒G production of the form B t and t can be ob- → 2 each leaf labeled with a terminal production (say, tained by replacing an occurrence of B in t1 by t. r7) as the string on its right-hand side (“Sue”), and The (regular) language L( ) generated by is the G G each internal node as a string concatenation opera- set of all terms t T that can be derived, in zero ∈ Σ tion which arranges the string yields of its subtrees or more steps, from the term S. in the order given by the right-hand side of the A (tree) homomorphism is a total function production rule. h: T T which expands symbols of Σ into Σ → ∆ This view differs from the traditional perspec- trees over ∆ while following the structure of the in- tive on context-free grammars in that it makes the put tree. Formally, h is specified by pairs (f, h(f)), derivation tree the primary participant in the deriva- where f Σ is a symbol with some rank n, and ∈ tion process. The string is only one particular in- h(f) T∆ x1,...,xn is a term with variables. terpretation of the derivation tree, and instead of ∈ ∪{ } Given t T , the value of t under h is defined as ∈ Σ a string we could also have interpreted it as some h(f(t , . , t )) = h(f) h(t )/x i [n] , other kind of object. For instance, if we had inter- 1 n { i i | ∈ } 3 string transducer); thesestring differences transducer); simply these amount differencesterms.

A Generalized View on Parsing and Translation

Regular Description of Context-Free Graph Languages

Weighted Regular Tree Grammars with Storage

Tree Automata Techniques and Applications

Taxonomy of XML Schema Languages Using Formal Language Theory

Tree Automata

Language, Automata and Logic for Finite Trees

A Model-Theoretic Description of Tree Adjoining Grammars 1

Uniform Vs. Nonuniform Membership for Mildly Context-Sensitive Languages: a Brief Survey

Context-Free Graph Grammars and Concatenation of Graphs

Hybrid Grammars for Parsing of Discontinuous Phrase Structures and Non-Projective Dependency Structures

Efficient Techniques for Parsing with Tree Automata

Regular Rooted Graph Grammars a Web Type and Schema Language