Proceedings of the 20th Annual Meeting of the Association for Natural Language Processing (March 2014)

Parsing Japanese with a PCFG grammar

Tsaiwei FANG*   Alastair BUTLER†‡   Kei YOSHIMOTO*‡
*Graduate School of International Cultural Studies, Tohoku University
†PRESTO, Japan Science and Technology Agency
‡Center for the Advancement of Higher Education, Tohoku University

Abstract

This paper describes constituent parsing of Japanese using a probabilistic context-free grammar treebank grammar that is enhanced with Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. We evaluate the quality of the resulting parsing.

Table 1: Keyaki Treebank content

Domain            Number of trees
blog posts                    217
Japanese Law                  484
newspaper                    1600
telephone calls              1177
textbooks                    7733
Wikipedia                    2464
Total                       13675

1 Introduction

The focus of this paper is on the task of obtaining a constituent parser for Japanese using a treebank (Keyaki Treebank) as training data and a Probabilistic Context-Free Grammar (PCFG) as parsing method. We report results for a vanilla PCFG model that is directly read off the training data and for an enhanced PCFG model obtained with transformations of the training data that aim to tune the treebank representations to the specific needs of probabilistic context-free parsers, while allowing for the original annotation to be restored. Specifically, we follow best practices from Johnson (1998), Klein and Manning (2003) and Fraser et al. (2013), among others, of Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. PCFGs so derived can be used by a parser to construct maximal probability (Viterbi) parses. We evaluate the quality of the resulting parsing using standard PARSEVAL constituency measures. Our fully labelled bracketing score for a held-out portion of the Keyaki Treebank (1,300 trees) is 79.97 (recall), 80.61 (precision) and 80.29 (F-score). We show a learning curve suggesting that parser performance will continue to improve strongly with access to more training data.

2 The parser

The parsing of this paper is made possible by the unlexicalised statistical parser BitPar (Schmid, 2004), which allows any grammar rule files in the proper format to be used for parsing. BitPar uses a fast bitvector-based implementation of the Cocke-Younger-Kasami algorithm, storing the parse chart as a large bit vector. This enables full parsing (without search space pruning) with large treebank grammars. BitPar can extract from the parse chart the most likely parse (Viterbi parse), the full set of parses in the form of a parse forest, or the n-best parse trees.
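To make the bitvector idea concrete, the following minimal sketch (our own Python illustration, not BitPar's code; the toy grammar and sentence are invented for the example) stores each CKY chart cell as a single integer whose bits flag the categories derivable over the corresponding span, so that testing whether a rule applies reduces to bitwise operations:

def cky_recognize(words, lexicon, rules, categories, start="S"):
    """CKY recognition with each chart cell stored as a bit mask.
    lexicon: word -> set of categories
    rules:   (B, C) -> set of A, for binary rules A -> B C"""
    bit = {c: 1 << i for i, c in enumerate(categories)}
    n = len(words)
    chart = [[0] * n for _ in range(n)]  # chart[i][j]: mask for words[i..j]
    for i, w in enumerate(words):
        for c in lexicon.get(w, ()):
            chart[i][i] |= bit[c]
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):        # split between positions k and k+1
                left, right = chart[i][k], chart[k + 1][j]
                for (b, c), parents in rules.items():
                    # two bit tests decide whether A -> B C applies here
                    if left & bit[b] and right & bit[c]:
                        for a in parents:
                            chart[i][j] |= bit[a]
    return bool(chart[0][n - 1] & bit[start])

# Toy usage: a single rule S -> NP V over a two-word sentence.
categories = ["S", "NP", "V"]
print(cky_recognize(["弟", "寝た"], {"弟": {"NP"}, "寝た": {"V"}},
                    {("NP", "V"): {"S"}}, categories))  # True

A real bitvector parser packs whole rule sets into machine words so that many categories are tested at once; the sketch only shows the chart-as-bits representation.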

3 The treebank

The grammar and lexicon used by the BitPar parser are extracted from the Keyaki Treebank (Butler et al. 2012). The current composition of the Keyaki Treebank is detailed in Table 1. The treebank uses an annotation scheme that follows, with adaptations for Japanese, the general scheme proposed in the annotation manual for the Penn Historical Corpora and the PCEEC (Santorini 2010). Constituent structure is represented with labelled bracketing and augmented with grammatical functions and notation for recovering discontinuous constituents. Primary motivation for the annotation has been to facilitate automated searches for linguistic research (e.g., via CorpusSearch [1]), and to provide a syntactic base that is sufficiently rich to enable an automatic generation of (higher-order) predicate logic based meaning representations [2]. A typical parse, shown here in its labelled bracketing form, looks like:

(IP-MAT (PP (NP (N 弟)) (P は))
        (NP-SBJ *)
        (PP (NP (IP-EMB (PP (NP (N テレビ)) (P を))
                        (NP-OB1 *を*)
                        (VB つけ)
                        (AXD た))
                (N まま))
            (P で))
        (VB 寝) (P て) (VB2 しまい) (AX まし) (AXD た)
        (PU 。))

Every word has a word level part-of-speech label. Phrasal nodes (NP, PP, ADJP, etc.) immediately dominate the phrase head (N, P, ADJ, etc.), so that the phrase head has as sisters both modifiers and complements. Modifiers and complements are distinguished because there are extended phrase labels to mark function (e.g., -EMB encodes that the clause テレビをつけた is a complement of the phrase head まま). All noun phrases immediately dominated by IP are marked for function (NP-SBJ=subject, NP-OB1=direct object, NP-TMP=temporal NP, etc.). The PP label is never extended with function marking. However, the immediately following sibling of a PP may be present in the annotation to provide disambiguation information for the PP. Thus, (NP-OB1 *を*) indicates that the immediately preceding PP (with case particle を) is the object, while (NP-SBJ *) indicates that the immediately preceding PP (without case particle) is the subject. All clauses have extended labels to mark function (IP-MAT=matrix clause, IP-ADV=adverbial clause, IP-REL=relative clause, etc.).

[1] http://corpussearch.sourceforge.net
[2] http://www.compling.jp/ts
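Annotation in this format is straightforward to load programmatically. As a minimal sketch (our own code, not tooling shipped with the treebank), the following reader turns a labelled bracketing into nested (label, children) pairs, the representation assumed by the other code sketches in this paper:

import re

def read_tree(text):
    """Read one bracketed tree, e.g. '(NP (N 弟))', into nested
    (label, children) pairs; children are subtrees or word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:                        # a terminal word (or a * trace)
                children.append(tokens[pos]); pos += 1
        pos += 1                         # consume the closing ')'
        return (label, children)
    return parse()

tree = read_tree("(IP-MAT (PP (NP (N 弟)) (P は)) (NP-SBJ *) (VB 寝))")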

Figure 1: growth of phrase structure rules. [Plot: number of phrase structure rules (0–14000) against number of training sentences (0–14000), with curves for the vanilla and enhanced models counting all rules and rules with frequency f > 1, f > 2 and f > 4.]

4 Extracted grammars

Figure 1 shows the growth of extracted phrase structure rules for a vanilla grammar model directly read off the Keyaki Treebank and also for an enhanced model. The enhanced model is obtained after reversible changes are made to the treebank aimed at improving parsing quality. This section focuses on the changes made for the enhanced model. Specifically, the enhanced model was created with techniques from Johnson (1998), Klein and Manning (2003) and Fraser et al. (2013), among others, of:

1. eliminating discontinuous constituents (Section 4.1),

2. transforming and augmenting treebank annotations (Section 4.2), and

3. following the extraction of phrase structure rules, lexical rules, and their frequencies from the annotated parse trees (see the sketch below), markovising the grammar (Section 4.3).

Note that all curves of Figure 1 remain steep at the maximum training set size of 12,375 trees, suggesting more data would lead to more significant growth. As a comparison, the treebank grammar of Schmid (2006) extracted from the Penn Treebank of English has 52,297 phrase structure rules (enabling a labelled bracketing F-score of 86.6%).
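The extraction step in point 3 amounts to counting local trees and normalising the counts. The following is a minimal relative-frequency sketch over the (label, children) trees of Section 3 (our own code; BitPar consumes the resulting rules through its own grammar file format):

from collections import Counter

def count_rules(tree, phrasal, lexical):
    """Count one rule per node: POS -> word, or A -> B C ..."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        lexical[(label, tuple(children))] += 1
    else:
        phrasal[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            count_rules(c, phrasal, lexical)

def estimate_pcfg(trees):
    """Vanilla PCFG: P(rule) = count(rule) / count(rule's LHS)."""
    phrasal, lexical = Counter(), Counter()
    for t in trees:
        count_rules(t, phrasal, lexical)
    rules = phrasal + lexical
    lhs_total = Counter()
    for (lhs, _), n in rules.items():
        lhs_total[lhs] += n
    return {rule: n / lhs_total[rule[0]] for rule, n in rules.items()}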

4.1 Discontinuous constituents

The Keyaki Treebank annotates trace nodes, for example with relative clauses. But unlike the Penn Treebank (Bies et al. 1995), trace nodes are not indexed and typically appear clause initially, with precise attachment points unspecified, since it is enough to assume that constituents at the IP level are dependents of the main predicate of the clause. For trace nodes within embedded contexts (cases of long distance dependency), it is the phrase level of attachment for the trace node that is the relevant indicator of the dependency. For the current work we assume parse trees from which trace nodes are removed and aim to recognize discontinuous constituents in a post-processing step, following for example Johnson (2001).

4.2 Transforming and augmenting annotations

In addition to removing trace nodes, transformations and augmentations of the trees are performed. Specifically, interjections, punctuation or parenthetical material occurring at the left or right periphery of a constituent are moved to a new projection of the constituent. There is also Parent Encoding, following Johnson (1998), which copies the syntactic label of a parent node (minus the functional information) onto the labels of its children. Finally, there are refinements to the POS tags.

Refinements to the POS tags include the P (particle) tag becoming either P-CASE, P-CONJ, P-COORD, P-CP-THT, P-FINAL, P-NOUN or P-OPTR, depending on the functional role of the particle and/or the syntactic context in which the particle occurs. Verb tags are also split to encode information about the clause of occurrence: VB-THT (verb with CP-THT (clausal) complement), VB-DITRNS (triggered by NP-OB2), VB-TRNS (triggered by NP-OB1) and VB-INTRNS (default encoding).

In addition, the POS tags of the most frequent particles, の, は, に, を, が, て, と, で, も, か, から, etc., are marked with a feature to identify the specific particle. For example, と can be tagged as either P-CASE-と, P-CONJ-と, P-COORD-と or P-CP-THT-と. This can be seen as a restricted form of lexicalisation. In the same way, the auxiliary verbs あっ, あり, ある, あれ, あろ, で and な are "lexicalised", as is the negation marker ず and certain punctuation marks, e.g., '。'. However, other similar enrichments of the POS information were found to hurt overall parsing performance.

Applying the above changes to the tree of Section 3 results in the following tree representation:

(IP-MATˆTOP (IP-MATˆTOP (PPˆIP (NPˆPP (N 弟)) (P-OPTR-は は))
                        (PPˆIP (NPˆPP (IP-EMBˆNP (PPˆIP (NPˆPP (N テレビ))
                                                        (P-CASE-を を))
                                                 (VB-TRNS つけ)
                                                 (AXD た))
                                      (N-STRUCTURAL まま))
                               (P-CASE-で で))
                        (VB-INTRNS 寝)
                        (P-CONJ-て て)
                        (VB2 しまい)
                        (AX2 まし)
                        (AXD た))
            (PU-。 。))
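Of these transformations, Parent Encoding is the most mechanical to state. The following sketch over the (label, children) trees of Section 3 is our own code; the 'ˆ' separator (written '^' below) and the TOP label for the root follow the labels visible in the tree above:

def strip_function(label):
    return label.split("-")[0]           # e.g. 'IP-MAT' -> 'IP'

def parent_encode(tree, parent="TOP"):
    """Append the parent's function-stripped label to each phrasal
    node, e.g. the IP-MAT root becomes IP-MAT^TOP and a PP child
    of an IP becomes PP^IP."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return tree                      # POS-level nodes are unchanged
    return (label + "^" + parent,
            [c if isinstance(c, str) else
             parent_encode(c, strip_function(label))
             for c in children])

Undoing the encoding for evaluation (Section 5) is then just a matter of cutting each label at the separator.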

4.3 Markovisation

The Keyaki Treebank uses rather flat structures, particularly at the clause level, with nodes having up to 31 child nodes. As Fraser et al. (2013) note, this causes problems because only some rules of that length appear in the training data. This sparse data problem is solved by markovisation (Collins 1997), which splits long rules into a set of shorter rules. Applied to our running example, markovisation transforms the rule IP-MATˆTOP → PPˆIP PPˆIP VB-INTRNS P-CONJ-て VB2 AX2 AXD into a chain of binary rules, peeling off one child at a time from the right (AXD, then AX2, VB2, P-CONJ-て and VB-INTRNS) under newly created auxiliary symbols. The auxiliary symbols that are created encode information about the parent category, the head child, the child that is generated next and the previously generated child. Because all auxiliary symbols encode the head category, the head is selected by the first rule, while being generated later. The markovisation strategy is currently set to transform rules that occur fewer than 60 times in the training data.
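The splitting itself can be sketched as follows. This is our own code and the auxiliary-symbol naming is illustrative only (the exact scheme, including the next-child component, is not reproduced here); children are peeled off from the right, and each auxiliary symbol records the parent, the head child and the previously generated child:

def markovise(lhs, rhs, head_index):
    """Split the flat rule lhs -> rhs into a chain of binary rules
    via auxiliary symbols of the form parent|head|previous-child."""
    head = rhs[head_index]
    rules, current, remaining = [], lhs, list(rhs)
    while len(remaining) > 2:
        child = remaining.pop()          # generate the rightmost child
        aux = lhs + "|" + head + "|" + child
        rules.append((current, [aux, child]))
        current = aux
    rules.append((current, remaining))   # the last rule is binary
    return rules

# The running example: the head VB-INTRNS is recorded in every
# auxiliary symbol, so it is selected by the first rule even though
# it is generated late in the chain.
for rule in markovise("IP-MAT^TOP",
                      ["PP^IP", "PP^IP", "VB-INTRNS", "P-CONJ-て",
                       "VB2", "AX2", "AXD"], head_index=2):
    print(rule)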

5 Parsing experiments

The treebank was randomised because of its diverse nature (see Table 1); a dataset of 1,300 trees was held out for testing and the remaining 12,375 trees were used for training. In all evaluations, we used the original Keyaki parse trees as gold standard and converted the parse trees generated by our parser to the same format by undoing the transformations and removing the augmentations. Perfect segmentation was used for all evaluations, this being necessary to obtain a PARSEVAL score.

The results of parsing using the vanilla and enhanced PCFG models on the test data are given in Table 2, using the standard PARSEVAL measures (Black et al., 1991), i.e., values for bracketing precision, recall and F-score, but for fully labelled evaluation only.

Table 2: PARSEVAL results (results for sentence lengths ≤ 40 in brackets)

Model      precision       recall          F-score
Vanilla    65.73 (68.10)   67.92 (70.27)   66.81 (69.17)
Enhanced   79.97 (81.84)   80.61 (82.58)   80.29 (82.21)

Figure 2 shows learning curves for all sentences and for sentence lengths ≤ 40. In contrast to the two curves of the vanilla model, the two curves of the enhanced model remain steep at the maximum training set size of 12,375 trees. Figure 3 shows coverage results for the models with differing amounts of training data. The enhanced model offers high coverage early on, a consequence of the markovisation. This also explains some of the apparent lack of growth or even loss in F-score, as F-score is only calculated from valid parsings.

Figure 2: learning curves for all sentences and sentence lengths ≤ 40. [Plot: labelled brackets F-score (64–84) against number of training sentences (0–14000), for the vanilla and enhanced models.]

Figure 3: coverage results. [Plot: number of valid parsings (600–1300) against number of training sentences (0–14000), for the vanilla and enhanced models, all data and lengths ≤ 40.]
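For reference, fully labelled PARSEVAL scores of the kind reported in Table 2 can be computed as in the following sketch (our own code over the (label, children) trees used earlier, not the standard evalb implementation):

from collections import Counter

def labelled_spans(tree, start=0, spans=None):
    """Collect (label, start, end) for every phrasal constituent;
    POS-level nodes are not evaluated, as is standard."""
    if spans is None:
        spans = []
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return start + 1, spans          # a preterminal has width one
    pos = start
    for c in children:
        pos, _ = labelled_spans(c, pos, spans)
    spans.append((label, start, pos))
    return pos, spans

def parseval(gold, parsed):
    gold_spans = Counter(labelled_spans(gold)[1])
    test_spans = Counter(labelled_spans(parsed)[1])
    matched = sum((gold_spans & test_spans).values())
    total_gold, total_test = sum(gold_spans.values()), sum(test_spans.values())
    precision = matched / total_test if total_test else 0.0
    recall = matched / total_gold if total_gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f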

6 Conclusion

This paper has presented a PCFG treebank grammar for Japanese trained on the Keyaki Treebank. Parsing performance was enhanced with Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. This establishes a significant parsing baseline for Japanese that appears to be competitive with other attempts at constituency parsing of Japanese, notably Tanaka and Nagata (2013). Our results strongly suggest that the enhanced parsing model would benefit considerably from the availability of more training data. More training data is also expected to enable improvements from a yet more fine-grained label set.

References

Butler, Alastair, Zhu Hong, Tomoko Hotta, Ruriko Otomo, Kei Yoshimoto and Zhen Zhou. 2012. Keyaki Treebank: phrase structure with functional information for Japanese. In Proceedings of Text Annotation Workshop.

Fraser, Alexander, Helmut Schmid, Richárd Farkas, Renjing Wang and Hinrich Schütze. 2013. Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language. Computational Linguistics, 39(1):57–85.

Johnson, Mark. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.

Johnson, Mark. 2001. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In ACL, pages 136–143.

Klein, Dan and Chris Manning. 2003. Accurate unlexicalized parsing. In ACL-03.

Schmid, Helmut. 2004. Efficient parsing of highly ambiguous context-free grammars with bit vectors. In COLING, pages 162–168.

Schmid, Helmut. 2006. Trace prediction and recovery with unlexicalized PCFGs and slash features. In COLING-ACL, pages 177–184.

Tanaka, Takaaki and Masaaki Nagata. 2013. Constructing a Practical Constituent Parser from a Japanese Treebank with Function Labels. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages, pages 108–118.
