Proceedings of the 20th Annual Meeting of the Association for Natural Language Processing (March 2014)

Parsing Japanese with a PCFG grammar

Tsaiwei FANG*   Alastair BUTLER†‡   Kei YOSHIMOTO*‡
*Graduate School of International Cultural Studies, Tohoku University
†PRESTO, Japan Science and Technology Agency
‡Center for the Advancement of Higher Education, Tohoku University

Abstract

This paper describes constituent parsing of Japanese using a probabilistic context-free grammar treebank grammar that is enhanced with Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. We evaluate the quality of the resulting parsing.

Table 1: Keyaki Treebank content

Domain            Number of trees
blog posts                    217
Japanese Law                  484
newspaper                    1600
telephone calls              1177
textbooks                    7733
Wikipedia                    2464
Total                       13675

1 Introduction

The focus of this paper is on the task of obtaining a constituent parser for Japanese using a treebank (Keyaki Treebank) as training data and a Probabilistic Context-Free Grammar (PCFG) as parsing method. We report results for a vanilla PCFG model that is directly read off the training data and for an enhanced PCFG model obtained with transformations of the training data that aim to tune the treebank representations to the specific needs of probabilistic context-free parsers, while allowing for the original annotation to be restored. Specifically, we follow best practices from Johnson (1998), Klein and Manning (2003) and Fraser et al. (2013), among others, of Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. PCFGs so derived can be used by a parser to construct maximal probability (Viterbi) parses. We evaluate the quality of the resulting parsing using standard PARSEVAL constituency measures. Our fully labelled bracketing score for a held-out portion of the Keyaki Treebank (1,300 trees) is 79.97 (recall), 80.61 (precision) and 80.29 (F-score). We show a learning curve suggesting that parser performance will continue to improve strongly with access to more training data.

2 The parser

The parsing of this paper is made possible by the unlexicalised statistical parser BitPar (Schmid, 2004), which allows any grammar rule files in the proper format to be used for parsing. BitPar uses a fast bitvector-based implementation of the Cocke-Younger-Kasami algorithm, storing the parse chart as a large bit vector. This enables full parsing (without search space pruning) with large treebank grammars. BitPar can extract from the parse chart the most likely parse (Viterbi parse), the full set of parses in the form of a parse forest, or the n-best parse trees.
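To make the bitvector idea concrete, the following minimal sketch (our own Python illustration, not BitPar's code; the toy grammar and sentence are invented for the example) stores each CKY chart cell as a single integer whose bits flag the categories derivable over the corresponding span, so that testing whether a rule applies reduces to bitwise operations:

def cky_recognize(words, lexicon, rules, categories, start="S"):
    """CKY recognition with each chart cell stored as a bit mask.
    lexicon: word -> set of categories
    rules:   (B, C) -> set of A, for binary rules A -> B C"""
    bit = {c: 1 << i for i, c in enumerate(categories)}
    n = len(words)
    chart = [[0] * n for _ in range(n)]  # chart[i][j]: mask for words[i..j]
    for i, w in enumerate(words):
        for c in lexicon.get(w, ()):
            chart[i][i] |= bit[c]
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):        # split between positions k and k+1
                left, right = chart[i][k], chart[k + 1][j]
                for (b, c), parents in rules.items():
                    # two bit tests decide whether A -> B C applies here
                    if left & bit[b] and right & bit[c]:
                        for a in parents:
                            chart[i][j] |= bit[a]
    return bool(chart[0][n - 1] & bit[start])

# Toy usage: a single rule S -> NP V over a two-word sentence.
categories = ["S", "NP", "V"]
print(cky_recognize(["弟", "寝た"], {"弟": {"NP"}, "寝た": {"V"}},
                    {("NP", "V"): {"S"}}, categories))  # True

A real bitvector parser packs whole rule sets into machine words so that many categories are tested at once; the sketch only shows the chart-as-bits representation.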

3 The treebank

The grammar and lexicon used by the BitPar parser are extracted from the Keyaki Treebank (Butler et al. 2012). The current composition of the Keyaki Treebank is detailed in Table 1. The treebank uses an annotation scheme that follows, with adaptations for Japanese, the general scheme proposed in the annotation manual for the Penn Historical Corpora and the PCEEC (Santorini 2010). Constituent structure is represented with labelled bracketing and augmented with grammatical functions and notation for recovering discontinuous constituents. Primary motivation for the annotation has been to facilitate automated searches for linguistic research (e.g., via CorpusSearch [1]), and to provide a syntactic base that is sufficiently rich to enable an automatic generation of (higher-order) predicate logic based meaning representations [2]. A typical parse, shown here in its labelled bracketing form, looks like:

(IP-MAT (PP (NP (N 弟)) (P は))
        (NP-SBJ *)
        (PP (NP (IP-EMB (PP (NP (N テレビ)) (P を))
                        (NP-OB1 *を*)
                        (VB つけ)
                        (AXD た))
                (N まま))
            (P で))
        (VB 寝) (P て) (VB2 しまい) (AX まし) (AXD た)
        (PU 。))

Every word has a word level part-of-speech label. Phrasal nodes (NP, PP, ADJP, etc.) immediately dominate the phrase head (N, P, ADJ, etc.), so that the phrase head has as sisters both modifiers and complements. Modifiers and complements are distinguished because there are extended phrase labels to mark function (e.g., -EMB encodes that the clause テレビをつけた is a complement of the phrase head まま). All noun phrases immediately dominated by IP are marked for function (NP-SBJ=subject, NP-OB1=direct object, NP-TMP=temporal NP, etc.). The PP label is never extended with function marking. However, the immediately following sibling of a PP may be present in the annotation to provide disambiguation information for the PP. Thus, (NP-OB1 *を*) indicates that the immediately preceding PP (with case particle を) is the object, while (NP-SBJ *) indicates that the immediately preceding PP (without case particle) is the subject. All clauses have extended labels to mark function (IP-MAT=matrix clause, IP-ADV=adverbial clause, IP-REL=relative clause, etc.).

[1] http://corpussearch.sourceforge.net
[2] http://www.compling.jp/ts
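Annotation in this format is straightforward to load programmatically. As a minimal sketch (our own code, not tooling shipped with the treebank), the following reader turns a labelled bracketing into nested (label, children) pairs, the representation assumed by the other code sketches in this paper:

import re

def read_tree(text):
    """Read one bracketed tree, e.g. '(NP (N 弟))', into nested
    (label, children) pairs; children are subtrees or word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0
    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:                        # a terminal word (or a * trace)
                children.append(tokens[pos]); pos += 1
        pos += 1                         # consume the closing ')'
        return (label, children)
    return parse()

tree = read_tree("(IP-MAT (PP (NP (N 弟)) (P は)) (NP-SBJ *) (VB 寝))")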

Figure 1: growth of phrase structure rules. [Plot: number of phrase structure rules (0–14000) against number of training sentences (0–14000), with curves for the vanilla and enhanced models counting all rules and rules with frequency f > 1, f > 2 and f > 4.]

4 Extracted grammars

Figure 1 shows the growth of extracted phrase structure rules for a vanilla grammar model directly read off the Keyaki Treebank and also for an enhanced model. The enhanced model is obtained after reversible changes are made to the treebank aimed at improving parsing quality. This section focuses on the changes made for the enhanced model. Specifically, the enhanced model was created with techniques from Johnson (1998), Klein and Manning (2003) and Fraser et al. (2013), among others, of:

1. eliminating discontinuous constituents (Section 4.1),

2. transforming and augmenting treebank annotations (Section 4.2), and

3. following the extraction of phrase structure rules, lexical rules, and their frequencies from the annotated parse trees (see the sketch below), markovising the grammar (Section 4.3).

Note that all curves of Figure 1 remain steep at the maximum training set size of 12,375 trees, suggesting more data would lead to more significant growth. As a comparison, the treebank grammar of Schmid (2006) extracted from the Penn Treebank of English has 52,297 phrase structure rules (enabling a labelled bracketing F-score of 86.6%).
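The extraction step in point 3 amounts to counting local trees and normalising the counts. The following is a minimal relative-frequency sketch over the (label, children) trees of Section 3 (our own code; BitPar consumes the resulting rules through its own grammar file format):

from collections import Counter

def count_rules(tree, phrasal, lexical):
    """Count one rule per node: POS -> word, or A -> B C ..."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        lexical[(label, tuple(children))] += 1
    else:
        phrasal[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            count_rules(c, phrasal, lexical)

def estimate_pcfg(trees):
    """Vanilla PCFG: P(rule) = count(rule) / count(rule's LHS)."""
    phrasal, lexical = Counter(), Counter()
    for t in trees:
        count_rules(t, phrasal, lexical)
    rules = phrasal + lexical
    lhs_total = Counter()
    for (lhs, _), n in rules.items():
        lhs_total[lhs] += n
    return {rule: n / lhs_total[rule[0]] for rule, n in rules.items()}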

4.1 Discontinuous constituents

The Keyaki Treebank annotates trace nodes, for example with relative clauses. But unlike the Penn Treebank (Bies et al. 1995), trace nodes are not indexed and typically appear clause initially, with precise attachment points unspecified, since it is enough to assume that constituents at the IP level are dependents of the main predicate of the clause. For trace nodes within embedded contexts (cases of long distance dependency), it is the phrase level of attachment for the trace node that is the relevant indicator of the dependency. For the current work we assume parse trees from which trace nodes are removed and aim to recognize discontinuous constituents in a post-processing step, following for example Johnson (2001).

4.2 Transforming and augmenting annotations

In addition to removing trace nodes, transformations and augmentations of the trees are performed. Specifically, interjections, punctuation or parenthetical material occurring at the left or right periphery of a constituent are moved to a new projection of the constituent. There is also Parent Encoding, following Johnson (1998), which copies the syntactic label of a parent node (minus the functional information) onto the labels of its children. Finally, there are refinements to the POS tags.

Refinements to the POS tags include the P (particle) tag becoming either P-CASE, P-CONJ, P-COORD, P-CP-THT, P-FINAL, P-NOUN or P-OPTR, depending on the functional role of the particle and/or the syntactic context in which the particle occurs. Verb tags are also split to encode information about the clause of occurrence: VB-THT (verb with CP-THT (clausal) complement), VB-DITRNS (triggered by NP-OB2), VB-TRNS (triggered by NP-OB1) and VB-INTRNS (default encoding).

In addition, the POS tags of the most frequent particles, の, は, に, を, が, て, と, で, も, か, から, etc., are marked with a feature to identify the specific particle. For example, と can be tagged as either P-CASE-と, P-CONJ-と, P-COORD-と or P-CP-THT-と. This can be seen as a restricted form of lexicalisation. In the same way, the auxiliary verbs あっ, あり, ある, あれ, あろ, で and な are "lexicalised", as is the negation marker ず and certain punctuation marks, e.g., '。'. However, other similar enrichments of the POS information were found to hurt overall parsing performance.

Applying the above changes to the tree of Section 3 results in the following tree representation:

(IP-MATˆTOP (IP-MATˆTOP (PPˆIP (NPˆPP (N 弟)) (P-OPTR-は は))
                        (PPˆIP (NPˆPP (IP-EMBˆNP (PPˆIP (NPˆPP (N テレビ))
                                                        (P-CASE-を を))
                                                 (VB-TRNS つけ)
                                                 (AXD た))
                                      (N-STRUCTURAL まま))
                               (P-CASE-で で))
                        (VB-INTRNS 寝)
                        (P-CONJ-て て)
                        (VB2 しまい)
                        (AX2 まし)
                        (AXD た))
            (PU-。 。))
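Of these transformations, Parent Encoding is the most mechanical to state. The following sketch over the (label, children) trees of Section 3 is our own code; the 'ˆ' separator (written '^' below) and the TOP label for the root follow the labels visible in the tree above:

def strip_function(label):
    return label.split("-")[0]           # e.g. 'IP-MAT' -> 'IP'

def parent_encode(tree, parent="TOP"):
    """Append the parent's function-stripped label to each phrasal
    node, e.g. the IP-MAT root becomes IP-MAT^TOP and a PP child
    of an IP becomes PP^IP."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return tree                      # POS-level nodes are unchanged
    return (label + "^" + parent,
            [c if isinstance(c, str) else
             parent_encode(c, strip_function(label))
             for c in children])

Undoing the encoding for evaluation (Section 5) is then just a matter of cutting each label at the separator.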

4.3 Markovisation

The Keyaki Treebank uses rather flat structures, particularly at the clause level, with nodes having up to 31 child nodes. As Fraser et al. (2013) note, this causes problems because only some rules of that length appear in the training data. This sparse data problem is solved by markovisation (Collins 1997), which splits long rules into a set of shorter rules. Applied to our running example, markovisation transforms the rule IP-MATˆTOP → PPˆIP PPˆIP VB-INTRNS P-CONJ-て VB2 AX2 AXD into a chain of binary rules, peeling off one child at a time from the right (AXD, then AX2, VB2, P-CONJ-て and VB-INTRNS) under newly created auxiliary symbols. The auxiliary symbols that are created encode information about the parent category, the head child, the child that is generated next and the previously generated child. Because all auxiliary symbols encode the head category, the head is selected by the first rule, while being generated later. The markovisation strategy is currently set to transform rules that occur fewer than 60 times in the training data.
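The splitting itself can be sketched as follows. This is our own code and the auxiliary-symbol naming is illustrative only (the exact scheme, including the next-child component, is not reproduced here); children are peeled off from the right, and each auxiliary symbol records the parent, the head child and the previously generated child:

def markovise(lhs, rhs, head_index):
    """Split the flat rule lhs -> rhs into a chain of binary rules
    via auxiliary symbols of the form parent|head|previous-child."""
    head = rhs[head_index]
    rules, current, remaining = [], lhs, list(rhs)
    while len(remaining) > 2:
        child = remaining.pop()          # generate the rightmost child
        aux = lhs + "|" + head + "|" + child
        rules.append((current, [aux, child]))
        current = aux
    rules.append((current, remaining))   # the last rule is binary
    return rules

# The running example: the head VB-INTRNS is recorded in every
# auxiliary symbol, so it is selected by the first rule even though
# it is generated late in the chain.
for rule in markovise("IP-MAT^TOP",
                      ["PP^IP", "PP^IP", "VB-INTRNS", "P-CONJ-て",
                       "VB2", "AX2", "AXD"], head_index=2):
    print(rule)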

5 Parsing experiments

The treebank was randomised because of its diverse nature (see Table 1); a dataset of 1,300 trees was held out for testing and the remaining 12,375 trees were used for training. In all evaluations, we used the original Keyaki parse trees as gold standard and converted the parse trees generated by our parser to the same format by undoing the transformations and removing the augmentations. Perfect segmentation was used for all evaluations, this being necessary to obtain a PARSEVAL score.

The results of parsing using the vanilla and enhanced PCFG models on the test data are given in Table 2, using the standard PARSEVAL measures (Black et al., 1991), i.e., values for bracketing precision, recall and F-score, but for fully labelled evaluation only.

Table 2: PARSEVAL results (results for sentence lengths ≤ 40 in brackets)

Model      precision       recall          F-score
Vanilla    65.73 (68.10)   67.92 (70.27)   66.81 (69.17)
Enhanced   79.97 (81.84)   80.61 (82.58)   80.29 (82.21)

Figure 2 shows learning curves for all sentences and for sentence lengths ≤ 40. In contrast to the two curves of the vanilla model, the two curves of the enhanced model remain steep at the maximum training set size of 12,375 trees. Figure 3 shows coverage results for the models with differing amounts of training data. The enhanced model offers high coverage early on, a consequence of the markovisation. This also explains some of the apparent lack of growth or even loss in F-score, as F-score is only calculated from valid parsings.

Figure 2: learning curves for all sentences and sentence lengths ≤ 40. [Plot: labelled brackets F-score (64–84) against number of training sentences (0–14000), for the vanilla and enhanced models.]

Figure 3: coverage results. [Plot: number of valid parsings (600–1300) against number of training sentences (0–14000), for the vanilla and enhanced models, all data and lengths ≤ 40.]
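For reference, fully labelled PARSEVAL scores of the kind reported in Table 2 can be computed as in the following sketch (our own code over the (label, children) trees used earlier, not the standard evalb implementation):

from collections import Counter

def labelled_spans(tree, start=0, spans=None):
    """Collect (label, start, end) for every phrasal constituent;
    POS-level nodes are not evaluated, as is standard."""
    if spans is None:
        spans = []
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return start + 1, spans          # a preterminal has width one
    pos = start
    for c in children:
        pos, _ = labelled_spans(c, pos, spans)
    spans.append((label, start, pos))
    return pos, spans

def parseval(gold, parsed):
    gold_spans = Counter(labelled_spans(gold)[1])
    test_spans = Counter(labelled_spans(parsed)[1])
    matched = sum((gold_spans & test_spans).values())
    total_gold, total_test = sum(gold_spans.values()), sum(test_spans.values())
    precision = matched / total_test if total_test else 0.0
    recall = matched / total_gold if total_gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f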

6 Conclusion

This paper has presented a PCFG treebank grammar for Japanese trained on the Keyaki Treebank. Parsing performance was enhanced with Parent Encoding, reversible tree transformations, refinement of treebank labels and Markovisation. This establishes a significant parsing baseline for Japanese that appears to be competitive with other attempts at constituency parsing of Japanese, notably Tanaka and Nagata (2013). Our results strongly suggest that the enhanced parsing model would benefit considerably from the availability of more training data. More training data is also expected to enable improvements from a yet more fine-grained label set.

References

Butler, Alastair, Zhu Hong, Tomoko Hotta, Ruriko Otomo, Kei Yoshimoto and Zhen Zhou. 2012. Keyaki Treebank: phrase structure with functional information for Japanese. In Proceedings of Text Annotation Workshop.

Fraser, Alexander, Helmut Schmid, Richárd Farkas, Renjing Wang and Hinrich Schütze. 2013. Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language. Computational Linguistics, 39(1):57–85.

Johnson, Mark. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24(4):613–632.

Johnson, Mark. 2001. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In ACL, pages 136–143.

Klein, Dan and Chris Manning. 2003. Accurate unlexicalized parsing. In ACL-03.

Schmid, Helmut. 2004. Efficient parsing of highly ambiguous context-free grammars with bit vectors. In COLING, pages 162–168.

Schmid, Helmut. 2006. Trace prediction and recovery with unlexicalized PCFGs and slash features. In COLING-ACL, pages 177–184.

Tanaka, Takaaki and Masaaki Nagata. 2013. Constructing a Practical Constituent Parser from a Japanese Treebank with Function Labels. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages, pages 108–118.
