Parsing Algorithms Based on Tree Automata

Andreas Maletti
Departament de Filologies Romàniques, Universitat Rovira i Virgili, Tarragona, Spain

Giorgio Satta
Department of Information Engineering, University of Padua, Italy

Proceedings of the 11th International Conference on Parsing Technologies (IWPT), pages 1–12, Paris, October 2009. © 2009 Association for Computational Linguistics

Abstract

We investigate several algorithms related to the parsing problem for weighted automata, under the assumption that the input is a string rather than a tree. This assumption is motivated by several natural language processing applications. We provide algorithms for the computation of parse-forests, best tree probability, inside probability (called partition function), and prefix probability. Our algorithms are obtained by extending to weighted tree automata the Bar-Hillel technique, as defined for context-free grammars.

1 Introduction

Tree automata are finite-state devices that recognize tree languages, that is, sets of trees. There is a growing interest nowadays in the natural language parsing community, and especially in the area of syntax-based machine translation, for probabilistic tree automata (PTA) viewed as suitable representations of grammar models. In fact, probabilistic tree automata are generatively more powerful than probabilistic context-free grammars (PCFGs), when we consider the latter as devices that generate tree languages. This difference can be intuitively understood if we consider that a computation by a PTA uses hidden states, drawn from a finite set, that can be used to transfer information within the tree structure being recognized.

As an example, in written English we can empirically observe different distributions in the expansion of so-called noun phrase (NP) nodes, in the contexts of subject and direct-object positions, respectively. This can be easily captured using some states of a PTA that keep a record of the different contexts. In contrast, PCFGs are unable to model these effects, because NP node expansion should be independent of the context in the derivation. This problem for PCFGs is usually solved by resorting to so-called parental annotations (Johnson, 1998), but this, of course, results in a different tree language, since these annotations will appear in the derived tree.

Most of the theoretical work on parsing and estimation based on PTA has assumed that the input is a tree (Graehl et al., 2008), in accordance with the very definition of these devices. However, both in parsing as well as in machine translation, the input is most often represented as a string rather than a tree. When the input is a string, some trick is applied to map the problem back to the case of an input tree. As an example in the context of machine translation, assume a probabilistic tree transducer $T$ as a translation model, and an input string $w$ to be translated. One can then intermediately construct a tree automaton $M_w$ that recognizes the set of all possible trees that have $w$ as yield, with internal nodes from the input alphabet of $T$. This automaton $M_w$ is further transformed into a tree transducer implementing a partial identity translation, and such a transducer is composed with $T$ (relation composition). This is usually called the ‘cascaded’ approach. Such an approach can be easily applied also to parsing problems.

In contrast with the cascaded approach above, which may be rather inefficient, in this paper we investigate a more direct technique for parsing strings based on weighted and probabilistic tree automata. We do this by extending to weighted tree automata the well-known Bar-Hillel construction defined for context-free grammars (Bar-Hillel et al., 1964) and for weighted context-free grammars (Nederhof and Satta, 2003). This provides an abstract framework under which several parsing algorithms can be directly derived, based on weighted tree automata. We discuss several applications of our results, including algorithms for the computation of parse-forests, best tree probability, inside probability (called partition function), and prefix probability.

2 Preliminary definitions

Let $S$ be a nonempty set and $\cdot$ be an associative binary operation on $S$. If $S$ contains an element $1$ such that $1 \cdot s = s = s \cdot 1$ for every $s \in S$, then $(S, \cdot, 1)$ is a monoid. A monoid $(S, \cdot, 1)$ is commutative if the equation $s_1 \cdot s_2 = s_2 \cdot s_1$ holds for every $s_1, s_2 \in S$. A commutative semiring $(S, +, \cdot, 0, 1)$ is a nonempty set $S$ on which a binary addition $+$ and a binary multiplication $\cdot$ have been defined such that the following conditions are satisfied:
  • $(S, +, 0)$ and $(S, \cdot, 1)$ are commutative monoids,
  • $\cdot$ distributes over $+$ from both sides, and
  • $s \cdot 0 = 0 = 0 \cdot s$ for every $s \in S$.

A weighted string automaton, abbreviated WSA, (Schützenberger, 1961; Eilenberg, 1974) is a system $M = (Q, \Sigma, \mathcal{S}, I, \nu, F)$ where
  • $Q$ is a finite alphabet of states,
  • $\Sigma$ is a finite alphabet of input symbols,
  • $\mathcal{S} = (S, +, \cdot, 0, 1)$ is a semiring,
  • $I \colon Q \to S$ assigns initial weights,
  • $\nu \colon Q \times \Sigma \times Q \to S$ assigns a weight to each transition, and
  • $F \colon Q \to S$ assigns final weights.

We now proceed with the semantics of $M$. Let $w \in \Sigma^*$ be an input string of length $n$. For each integer $i$ with $1 \le i \le n$, we write $w(i)$ to denote the $i$-th character of $w$. The set $\mathrm{Pos}(w)$ of positions of $w$ is $\{ i \mid 0 \le i \le n \}$. A run of $M$ on $w$ is a mapping $r \colon \mathrm{Pos}(w) \to Q$. We denote the set of all such runs by $\mathrm{Run}_M(w)$. The weight of a run $r \in \mathrm{Run}_M(w)$ is

\[ \mathrm{wt}_M(r) = \prod_{i=1}^{n} \nu(r(i-1), w(i), r(i)) . \]

We assume the right-hand side of the above equation evaluates to $1$ in case $n = 0$. The WSA $M$ recognizes the mapping $M \colon \Sigma^* \to S$,¹ which is defined for every $w \in \Sigma^*$ of length $n$ by

\[ M(w) = \sum_{r \in \mathrm{Run}_M(w)} I(r(0)) \cdot \mathrm{wt}_M(r) \cdot F(r(n)) . \]

¹ We overload the symbol $M$ to denote both an automaton and its recognized mapping. However, the intended meaning will always be clear from the context.

In order to define weighted tree automata (Berstel and Reutenauer, 1982; Ésik and Kuich, 2003; Borchardt, 2005), we need to introduce some additional notation. Let $\Sigma$ be a ranked alphabet, that is, an alphabet whose symbols have an associated arity. We write $\Sigma_k$ to denote the set of all $k$-ary symbols in $\Sigma$. We use a special symbol $e \in \Sigma_0$ to syntactically represent the empty string $\varepsilon$. The set of $\Sigma$-trees, denoted by $T_\Sigma$, is the smallest set satisfying both of the following conditions:
  • for every $\alpha \in \Sigma_0$, the single node labeled $\alpha$, written $\alpha()$, is a tree of $T_\Sigma$,
  • for every $\sigma \in \Sigma_k$ with $k \ge 1$ and for every $t_1, \dots, t_k \in T_\Sigma$, the tree with a root node labeled $\sigma$ and trees $t_1, \dots, t_k$ as its $k$ children, written $\sigma(t_1, \dots, t_k)$, belongs to $T_\Sigma$.

As a convention, throughout this paper we assume that $\sigma(t_1, \dots, t_k)$ denotes $\sigma()$ if $k = 0$. The size of the tree $t \in T_\Sigma$, written $|t|$, is defined as the number of occurrences of symbols from $\Sigma$ in $t$.

Let $t = \sigma(t_1, \dots, t_k)$. The yield of $t$ is recursively defined by

\[ \mathrm{yd}(t) = \begin{cases} \sigma & \text{if } \sigma \in \Sigma_0 \setminus \{e\} \\ \varepsilon & \text{if } \sigma = e \\ \mathrm{yd}(t_1) \cdots \mathrm{yd}(t_k) & \text{otherwise.} \end{cases} \]

The set of positions of $t$, denoted by $\mathrm{Pos}(t)$, is recursively defined by

\[ \mathrm{Pos}(\sigma(t_1, \dots, t_k)) = \{\varepsilon\} \cup \{ iw \mid 1 \le i \le k,\ w \in \mathrm{Pos}(t_i) \} . \]

Note that $|t| = |\mathrm{Pos}(t)|$ and, according to our convention, when $k = 0$ the above definition provides $\mathrm{Pos}(\sigma()) = \{\varepsilon\}$. We denote the symbol of $t$ at position $w$ by $t(w)$ and its rank by $\mathrm{rk}_t(w)$.

A weighted tree automaton (WTA) is a system $M = (Q, \Sigma, \mathcal{S}, \mu, F)$ where
  • $Q$ is a finite alphabet of states,
  • $\Sigma$ is a finite ranked alphabet of input symbols,
  • $\mathcal{S} = (S, +, \cdot, 0, 1)$ is a semiring,
  • $\mu$ is an indexed family $(\mu_k)_{k \in \mathbb{N}}$ of mappings $\mu_k \colon \Sigma_k \to S^{Q \times Q^k}$, and
  • $F \colon Q \to S$ assigns final weights.

In the above definition, $Q^k$ is the set of all strings over $Q$ having length $k$, with $Q^0 = \{\varepsilon\}$. Further note that $S^{Q \times Q^k}$ is the set of all matrices with elements in $S$, row index set $Q$, and column index set $Q^k$. Correspondingly, we will use the common matrix notation and write instances of $\mu$ in the form $\mu_k(\sigma)_{q_0, q_1 \cdots q_k}$. Finally, we assume that $q_1 \cdots q_k = \varepsilon$ if $k = 0$.

We define the semantics also in terms of runs. Let $t \in T_\Sigma$. A run of $M$ on $t$ is a mapping $r \colon \mathrm{Pos}(t) \to Q$. We denote the set of all such runs by $\mathrm{Run}_M(t)$. The weight of a run $r \in \mathrm{Run}_M(t)$ is

\[ \mathrm{wt}_M(r) = \prod_{\substack{w \in \mathrm{Pos}(t) \\ \mathrm{rk}_t(w) = k}} \mu_k(t(w))_{r(w),\, r(w1) \cdots r(wk)} . \]

Note that, according to our convention, the string $r(w1) \cdots r(wk)$ denotes $\varepsilon$ when $k = 0$. The WTA $M$ recognizes the mapping $M \colon T_\Sigma \to S$, which is defined by

\[ M(t) = \sum_{r \in \mathrm{Run}_M(t)} \mathrm{wt}_M(r) \cdot F(r(\varepsilon)) \]

for every $t \in T_\Sigma$. We say that $t$ is recognized by $M$ if $M(t) \neq 0$.

[Figure 1: Input tree $t$ and encoded tree $\mathrm{enc}(t)$.]
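The run-based semantics of both automaton models can be made concrete with a small brute-force sketch. It is only an illustration of the definitions in this section, not one of the parsing algorithms of the paper. It assumes the real-number semiring (floats with ordinary addition and multiplication), whereas the definitions allow any commutative semiring. All function names, the dictionary encodings of $I$, $\nu$, $\mu$ and $F$, the representation of trees as (symbol, children) pairs, and the toy automata are hypothetical choices made for this example. Entries missing from a dictionary are read as weight 0.

```python
from itertools import product

def tree(symbol, *children):
    """Build sigma(t1, ..., tk) as a (symbol, children) pair; leaves have no children."""
    return (symbol, tuple(children))

def yd(t, empty="e"):
    """Yield of t as a list of leaf symbols; the special leaf `empty` contributes nothing."""
    symbol, children = t
    if not children:
        return [] if symbol == empty else [symbol]
    return [a for child in children for a in yd(child, empty)]

def positions(t):
    """Pos(t) as tuples of child indices; the empty tuple stands for the root position."""
    pos = [()]
    for i, child in enumerate(t[1], start=1):
        pos.extend((i,) + w for w in positions(child))
    return pos

def subtree(t, w):
    """Subtree of t rooted at position w."""
    for i in w:
        t = t[1][i - 1]
    return t

def wsa_recognize(w, states, initial, nu, final):
    """M(w): sum over all runs r of I(r(0)) * wt_M(r) * F(r(n))."""
    total = 0.0
    for r in product(states, repeat=len(w) + 1):
        weight = initial.get(r[0], 0.0)
        for i in range(1, len(w) + 1):
            weight *= nu.get((r[i - 1], w[i - 1], r[i]), 0.0)
        total += weight * final.get(r[-1], 0.0)
    return total

def wta_recognize(t, states, mu, final):
    """M(t): sum over all runs r of wt_M(r) * F(r(root))."""
    pos = positions(t)
    total = 0.0
    for assignment in product(states, repeat=len(pos)):
        r = dict(zip(pos, assignment))
        weight = 1.0
        for w in pos:
            symbol, children = subtree(t, w)
            child_states = tuple(r[w + (i,)] for i in range(1, len(children) + 1))
            # mu[(sigma, q0, (q1, ..., qk))] encodes the matrix entry mu_k(sigma)_{q0, q1...qk}.
            weight *= mu.get((symbol, r[w], child_states), 0.0)
        total += weight * final.get(r[()], 0.0)
    return total

# Toy WTA (hypothetical): state "n" labels leaves, state "s" labels the root.
t = tree("sigma", tree("alpha"), tree("beta"))
mu = {
    ("alpha", "n", ()): 0.5,
    ("beta", "n", ()): 0.25,
    ("sigma", "s", ("n", "n")): 0.5,
}
print(yd(t), positions(t), wta_recognize(t, ["s", "n"], mu, {"s": 1.0}))
```

Enumerating every run mirrors the definitions directly but takes time exponential in the size of the input, so the sketch is only suitable for checking small examples by hand.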
