An Earley Parsing Algorithm for Range Concatenation Grammars

An Earley Parsing Algorithm for Range Concatenation Grammars Laura Kallmeyer Wolfgang Maier Yannick Parmentier SFB 441 SFB 441 CNRS - LORIA Universitat¨ Tubingen¨ Universitat¨ Tubingen¨ Nancy Universite´ 72074 Tubingen,¨ Germany 72074 Tubingen,¨ Germany 54506 Vandœuvre, France [email protected] [email protected] [email protected] Abstract class of LCFRS has received more attention con- cerning parsing (Villemonte de la Clergerie, 2002; We present a CYK and an Earley-style Burden and Ljunglof,¨ 2005). This article proposes algorithm for parsing Range Concatena- new CYK and Earley parsers for RCG, formulat- tion Grammar (RCG), using the deduc- ing them in the framework of parsing as deduction tive parsing framework. The characteris- (Shieber et al., 1995). The second section intro- tic property of the Earley parser is that we duces necessary definitions. Section 3 presents a use a technique of range boundary con- CYK-style algorithm and Section 4 extends this straint propagation to compute the yields with an Earley-style prediction. of non-terminals as late as possible. Ex- periments show that, compared to previ- 2 Preliminaries ous approaches, the constraint propagation The rules (clauses) of RCGs1 rewrite predicates helps to considerably decrease the number ranging over parts of the input by other predicates. of items in the chart. E.g., a clause S(aXb) S(X) signifies that S is → 1 Introduction true for a part of the input if this part starts with an a, ends with a b, and if, furthermore, S is also true RCGs (Boullier, 2000) have recently received a for the part between a and b. growing interest in natural language processing Definition 1. A RCG G = N, T, V, P, S con- h i (Søgaard, 2008; Sagot, 2005; Kallmeyer et al., sists of a) a finite set of predicates N with an arity 2008; Maier and Søgaard, 2008). RCGs gener- function dim: N N 0 where S N is → \{ } ∈ ate exactly the class of languages parsable in de- the start predicate with dim(S) = 1, b) disjoint fi- terministic polynomial time (Bertsch and Neder- nite sets of terminals T and variables V , c) a finite hof, 2001). They are in particular more pow- set P of clauses ψ0 ψ1 . ψm, where m 0 erful than linear context-free rewriting systems → ≥ and each of the ψi, 0 i m, is a predicate of (LCFRS) (Vijay-Shanker et al., 1987). LCFRS is ≤ ≤ the form Ai(α1, . , αdim(A )) with Ai N and unable to describe certain natural language phe- i ∈ αj (T V )∗ for 1 j dim(Ai). nomena that RCGs actually can deal with. One ∈ ∪ ≤ ≤ Central to RCGs is the notion of ranges on example are long-distance scrambling phenom- strings. ena (Becker et al., 1991; Becker et al., 1992). Other examples are non-semilinear constructions Definition 2. For every w = w1 . wn with such as case stacking in Old Georgian (Michaelis wi T (1 i n), we define a) P os(w) = ∈ ≤ ≤ and Kracht, 1996) and Chinese number names 0, . , n . b) l, r P os(w) P os(w) with { } h i ∈ × (Radzinski, 1991). Boullier (1999) shows that l r is a range in w. Its yield l, r (w) is the ≤ h i RCGs can describe the permutations occurring substring wl+1 . wr. c) For two ranges ρ1 = with scrambling and the construction of Chinese l1, r1 , ρ2 = l2, r2 : if r1 = l2, then ρ1 ρ2 = h i h i · number names. l1, r2 ; otherwise ρ1 ρ2 is undefined. d) A vec- h i · tor φ = ( x , y ,..., x , y ) is a range vector Parsing algorithms for RCG have been intro- h 1 1i h k ki of dimension k in w if x , y is a range in w for duced by Boullier (2000), who presents a di- h i ii 1 i k. φ(i).l (resp. φ(i).r) denotes then the rectional top-down parsing algorithm using pseu- ≤ ≤ docode, and Barthelemy´ et al. (2001), who add an 1In this paper, by RCG, we always mean positive RCG, oracle to Boullier’s algorithm. The more restricted see Boullier (2000) for details. 9 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 9–12, Suntec, Singapore, 4 August 2009. c 2009 ACL and AFNLP first (resp. second) component of the ith element in c with i = Υ(c, x): ρ(i).l+1 = ρ(i).r C. For ∈ of φ, that is xi (resp. yi). all x, y that are variables or occurrences of termi- In order to instantiate a clause of the grammar, nals in c such that xy is a substring of one of the arguments in c: ρ(Υ(c, x)).r = ρ(Υ(c, y)).l C. we need to find ranges for all variables in the ∈ clause and for all occurrences of terminals. For These are all constraints in C. convenience, we assume the variables in a clause The range constraint vector of a clause c cap- and the occurrences of terminals to be equipped tures all information about boundaries forming a with distinct subscript indices, starting with 1 and range, ranges containing only a single terminal, ordered from left to right (where for variables, and adjacent variables/terminal occurrences in c. only the first occurrence is relevant for this order). An RCG derivation consists of rewriting in- We introduce a function Υ: P N that gives the → stantiated predicates applying instantiated clauses, maximal index in a clause, and we define Υ(c, x) i.e. in every derivation step Γ Γ , we re- 1 ⇒w 2 for a given clause c and x a variable or an occur- place the lefthand side of an instantiated clause rence of a terminal as the index of x in c. with its righthand side (w.r.t. a word w). The lan- Definition 3. An instantiation of a c P with guage of an RCG G is the set of strings that can ∈ Υ(c) = j w.r.t. to some string w is given by a be reduced to the empty word: L(G) = w + { | range vector φ of dimension j. Applying φ to S( 0, w ) ε . h | |i ⇒G,w } a predicate A(~α) in c maps all occurrences of The expressive power of RCG lies beyond mild x (T V ) with Υ(c, x) = i in ~α to φ(i). If ∈ ∪ context-sensitivity. As an example, consider the the result is defined (i.e., the images of adjacent RCG from Fig. 3 that generates a language that is variables can be concatenated), it is called an in- not semilinear. stantiated predicate and the result of applying φ to For simplicity, we assume in the following with- all predicates in c, if defined, is called an instanti- out loss of generality that empty arguments (ε) ated clause. occur only in clauses whose righthand sides are We also introduce range constraint vectors, vec- empty.2 tors of pairs of range boundary variables together with a set of constraints on these variables. 3 Directional Bottom-Up Chart Parsing Definition 4. Let Vr = r1, r2,... be a set { } In our directional CYK algorithm, we move a dot of range boundary variables. A range constraint through the righthand side of a clause. We there- vector of dimension k is a pair ~ρ,C where a) h i fore have passive items [A, φ] where A is a pred- ~ρ (V 2)k; we define V (~ρ) as the set of range ∈ r r icate and φ a range vector of dimension dim(A) boundary variables occurring in ~ρ. b) C is a set and active items. In the latter, while traversing of constraints c that have one of the following r the righthand side of the clause, we keep a record forms: r = r , k = r , r + k = r , 1 2 1 1 2 of the left and right boundaries already found k r , r k, r r or r + k r ≤ 1 1 ≤ 1 ≤ 2 1 ≤ 2 for variables and terminal occurrences. This is for r1, r2 Vr(~ρ) and k N. ∈ ∈ achieved by subsequently enriching the range con- We say that a range vector φ satisfies a range straint vector of the clause. Active items have the constraint vector ρ, C iff φ and ρ are of the same form [A(~x) Φ Ψ, ρ, C ] with A(~x) ΦΨ a h i → • h i → dimension k and there is a function f : Vr N clause, ΦΨ = ε, Υ(A(~x ΦΨ)) = j and ρ, C → 6 → h i that maps ρ(i).l to φ(i).l and ρ(i).r to φ(i).r for a range constraint vector of dimension j. We re- all 1 i k such that all constraints in C are sat- quire that ρ, C be satisfiable.3 ≤ ≤ isfied. Furthermore, we say that a range constraint h i vector ρ, C is satisfiable iff there exists a range 2Any RCG can be easily transformed into an RCG satis- h i φ fying this condition: Introduce a new unary predicate Eps vector that satisfies it. with a clause Eps(ε) ε. Then, for every clause c with → Definition 5. For every clause c, we define its righthand side not ε, replace every argument ε that occurs in c with a new variable X (each time a distinct one) and add range constraint vector ρ, C w.r.t. a w with w = h i | | the predicate Eps(X) to the righthand side of c. n as follows: a) ρ has dimension Υ(c) and all 3Items that are distinguished from each other only by a bi- range boundary variables in ρ are pairwise differ- jection of the range variables are considered equivalent.

An Earley Parsing Algorithm for Range Concatenation Grammars

The Earley Algorithm Is to Avoid This, by Only Building Constituents That Are Compatible with the Input Read So Far

LATE Ain't Earley: a Faster Parallel Earley Parser

Lecture 10: CYK and Earley Parsers Alvin Cheung Building Parse Trees Maaz Ahmad CYK and Earley Algorithms Talia Ringer More Disambiguation

A Mobile App for Teaching Formal Languages and Automata

Principles and Implementation of Deductive Parsing

Analysis of Cocke-Younger-Kasami and Earley Algorithms on Probabilistic Context-Free Grammar

Syntactic Parsing: Introduction, CYK Algorithm

Parallel Parsing of Context-Free Grammars

Parsing of Context-Free Grammars

An Efficient Context-Free Parsing Algorithm for Natural Languages1

14.5 Supplemental: Context Free Grammars: the CYK Algorithm

Compiler Construction