
Constructing a Practical Constituent Parser from a Japanese Treebank with Function Labels

Takaaki Tanaka and Masaaki Nagata
NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation

{tanaka.takaaki, nagata.masaaki}@lab.ntt.co.jp

Abstract

We present an empirical study on constructing a Japanese constituent parser, which can output function labels to deal with more detailed syntactic information. Japanese syntactic parse trees are usually represented as unlabeled dependency structures between bunsetsu chunks; however, such a representation is insufficient to uncover syntactic information such as the distinction between complements and adjuncts and coordination structure, which is required for practical applications such as syntactic reordering in machine translation. We describe a preliminary effort on constructing a Japanese constituent parser from a Penn Treebank style treebank semi-automatically made from a dependency-based corpus. The evaluations show that the parser trained on the treebank has bracketing accuracy comparable to conventional bunsetsu-based parsers, and can output such function labels as the grammatical role of the argument and the type of adnominal structure.

1 Introduction

In Japanese NLP, syntactic structures are usually represented as dependencies between grammatical chunks called bunsetsus. A bunsetsu is a grammatical and phonological unit in Japanese, which consists of an independent word such as a noun, verb or adverb followed by a sequence of zero or more dependent words such as auxiliary verbs, postpositional particles or sentence-final particles. It is one of the main features of Japanese that bunsetsu order is much less constrained than phrase order in English.

Since dependency between bunsetsus can handle this flexible bunsetsu order, most publicly available Japanese parsers, including CaboCha (Kudo et al., 2002) and KNP (Kawahara et al., 2006), return bunsetsu-based dependencies as syntactic structure. Such bunsetsu-based parsers generally perform with high accuracy and have been widely used for various NLP applications.

However, bunsetsu-based representations also have serious shortcomings for dealing with Japanese sentence hierarchy. The internal structure of a bunsetsu has strong morphotactic constraints in contrast to flexible bunsetsu order. A Japanese verbal bunsetsu consists of a main verb followed by a sequence of auxiliary verbs and sentence-final particles. There is an almost one-dimensional order in the verbal constituents, which reflects the basic hierarchy of the Japanese sentence structure, including voice, tense, aspect and modality. Bunsetsu-based representation cannot provide the linguistic structure that reflects this basic sentence hierarchy.

Moreover, bunsetsu-based structures are unsuitable for representing nesting structures such as coordinating conjunctions. For instance, the bunsetsu representation of the noun phrase "技術-の (technology-GEN) / 向上-と (improvement-CONJ) / 経済-の (economy-GEN) / 発展 (growth)" (technology improvement and economic growth) does not allow us to easily interpret whether it means ((technology improvement) and (economic growth)) or (technology (improvement and economic growth)), because bunsetsu-based dependencies do not convey information about the left boundary of each noun phrase (Asahara, 2013). This drawback complicates


Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages, pages 108–118, Seattle, Washington, USA, 18 October 2013. © 2013 Association for Computational Linguistics

operating syntactically meaningful units in such applications as statistical machine translation, which needs to recognize syntactic units in building a translation model (e.g. tree-to-string and tree-to-tree) and in preordering source language sentences.

Semantic analysis, such as predicate-argument structure analysis, is usually done as a pipeline process after syntactic analysis (Iida et al., 2011; Hayashibe et al., 2011); but in Japanese, the discrepancy between syntactic and semantic units causes difficulties in integrating semantic analysis with syntactic analysis.

Our goal is to construct a practical constituent parser that can deal with appropriate grammatical units and output grammatical functions as semi-semantic information, e.g., grammatical or semantic roles of arguments and gapping types of relative clauses. We take the approach of deriving a grammar from manually annotated corpora by training probabilistic models, like the current de facto standard statistical constituent parsers (Petrov et al., 2006; Klein et al., 2003; Charniak, 2000; Bikel, 2004). We used a constituent-based treebank that Uematsu et al. (2013) converted from an existing bunsetsu-based corpus as a base treebank, and retagged the nonterminals and transformed the tree structures as described in Section 3. We present the results of evaluations of the parser trained with the treebank in Section 4, and show some analyses in Section 5.

2 Related work

Research on Japanese constituent-based parsers is quite scarce compared to that on bunsetsu-dependency-based parsers. Most of it has been conducted under lexicalized grammatical formalisms. HPSG (Head-driven Phrase Structure Grammar) (Sag et al., 2003) is a representative one. Gunji (1987) proposed JPSG (Japanese Phrase Structure Grammar), which is theoretically precise enough to handle the free word order problem of Japanese. Nagata et al. (1993) built a spoken-style Japanese grammar and a parser running on it. Siegel et al. (2002) constructed a broad-coverage, linguistically precise grammar JACY, which integrates semantics via MRS (Minimal Recursion Semantics) (Copestake, 2005). Bond et al. (2008) built a large-scale Japanese treebank, Hinoki, based on JACY and used it for parser training.

Masuichi et al. (2003) developed a Japanese LFG (Lexical-Functional Grammar) (Kaplan et al., 1982) parser whose grammar shares its design with six languages. Uematsu et al. (2013) constructed a CCG (Combinatory Categorial Grammar) bank based on the scheme proposed by Bekki (2010), by integrating several corpora including a constituent-based treebank converted from a dependency-based corpus.

These approaches use a unification-based parser, which offers rich information integrating syntax, semantics and pragmatics, but generally requires a high computational cost. We aim at constructing a more lightweight and practical constituent parser, e.g. a PCFG parser, from a Penn Treebank style treebank with function labels. Gabbard et al. (2006) introduced function tags, by modifying those in the Penn Treebank, to their parser. Even though Noro et al. (2005) built a Japanese corpus for deriving a Japanese CFG and evaluated its grammar, they did not treat the predicate-argument structure or the distinction of adnominal phrases.

This paper is also closely related to the work on Korean treebank transformations (Choi et al., 2012). Most of the Korean corpus was built using grammatical chunks called eojeols, which resemble Japanese bunsetsus and consist of content words and morphemes that represent grammatical functions. Choi et al. transformed the eojeol-based structure of Korean treebanks into an entity-based structure to make them more suitable for parser training. We converted an existing bunsetsu-based corpus into a constituent-based one and integrated other information into it for training a parser.

3 Treebank for parser training

In this section, we describe the overview of our treebank for training a parser.

3.1 Construction of a base treebank

Our base treebank is built from a bunsetsu-dependency-based corpus, the Kyoto Corpus (Kurohashi et al., 2003), a collection of newspaper articles that is widely used as training data for Japanese parsers and other applications. We

Figure 1: Binary (left) and n-ary flattened (right) trees, with subcategorization and voice information (e.g. VB[nad], IP-MAT[nad]:A), for the sentence 生徒に本を与えた (student-DAT book-ACC give-PAST) "(I) gave the student a book."

automatically converted from dependency structure to phrase structure by the previously described method (Uematsu et al., 2013), and conversion errors of structures and tags were manually corrected. We adopted the annotation schema of the Japanese Keyaki treebank (Butler et al., 2012) and the Annotation Manual for the Penn Historical Corpora and the PCEEC (Santorini, 2010) as references for retagging the nonterminals and transforming the tree structures.

The original Kyoto Corpus has fine-grained part-of-speech tags, which we converted into the simpler preterminal tags shown in Table 1 for training, using lookup tables. First, the treebank's phrase tags except function tags are assigned by simple CFG rule sets; then, function tags are added by integrating information from other resources or by manual annotation. We integrate predicate-argument information from the NAIST Text Corpus (NTC) (Iida et al., 2007) into the treebank by automatically converting and adding tag suffixes (e.g. -SBJ and -ARG0, described in Section 3.3) to the original tags of the argument phrases. The structural information about coordination and apposition is manually annotated.

  NN     General noun
  NNP    Proper noun
  NPR    Pronoun
  NV     Verbal noun
  NADJ   Adjective noun
  NADV   Adverbial noun (incl. temporal noun)
  NNF    Formal noun (general)
  NNFV   Formal noun (adverbial)
  PX     Prefix
  SX     Suffix
  NUM    Numeral
  CL     Classifier
  VB     Verb
  ADJ    Adjective
  ADNOM  Adnominal adjective
  ADV    Adverb
  PCS    Case particle
  PBD    Binding particle
  PADN   Adnominal particle
  PCO    Parallel particle
  PCJ    Conjunctive particle
  PEND   Sentence-ending particle
  P      Particle (others)
  AUX    Auxiliary verb
  CONJ   Conjunction
  PNC    Punctuation
  PAR    Parenthesis
  SYM    Symbol
  FIL    Filler

Table 1: Preterminal tags

3.2 Complementary information

We selectively added the following information as tag suffixes and tested their effectiveness.
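The tag conversion itself is a straightforward table lookup from fine-grained (POS, sub-POS) pairs to the preterminals of Table 1. A minimal sketch of such a lookup is shown below; the source tag names are illustrative placeholders, not the actual Kyoto Corpus inventory, and the fallback behavior is our assumption:

```python
# Hypothetical sketch of the lookup-table conversion from fine-grained
# corpus POS tags to the simpler preterminal tags of Table 1.
# The (POS, sub-POS) keys are illustrative, not the real tag names.
LOOKUP = {
    ("noun", "common noun"): "NN",
    ("noun", "proper noun"): "NNP",
    ("noun", "verbal noun"): "NV",
    ("verb", "*"): "VB",           # "*" = coarse fallback entry
    ("particle", "case particle"): "PCS",
}

def to_preterminal(pos, subpos):
    """Map a fine-grained (POS, sub-POS) pair to a preterminal tag,
    falling back to the coarse POS entry when no exact match exists."""
    tag = LOOKUP.get((pos, subpos)) or LOOKUP.get((pos, "*"))
    return tag or "SYM"  # assumed catch-all for unlisted categories

print(to_preterminal("noun", "proper noun"))  # NNP
```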
Inflection. We introduced tag suffixes for inflection as clues to identify the attachment position of verb and adjective phrases, because Japanese verbs and adjectives have inflections, which depend on the words and phrases they modify (e.g. noun and verb phrases). The symbols in Table 2 are attached to the tags VB, ADJ and AUX based on the inflection form. The inflection information is propagated to the phrases governing the inflected word as a head. We adopted these symbols from the notation of Japanese CCG described in (Bekki, 2010).

  (no label)  base form
  cont        continuative form
  attr        attributive form
  neg         negative form
  hyp         hypothetical form
  imp         imperative form
  stem        stem

Table 2: Inflection tag suffixes

Subcategorization and voice. Each verb has a subcategorization frame, which is useful for building verb phrase structure. For instance, 掴む tsukamu "grasp" takes two arguments, nominative and accusative cases, while 与える ataeru "give" takes three arguments: nominative, accusative and dative cases. We added suffixes to verb tags to denote which arguments they require (n: nominative, a: accusative, d: dative). For instance, since the verb 与える "give" takes three arguments (nominative, accusative and dative cases), it is tagged VB[nad]. We retrieve this information from a Japanese case frame dictionary, Nihongo Goitaikei (Ikehara et al., 1997), which has 14,000 frames for 6,000 verbs and adjectives. As an option, we also added voice information (A: active, P: passive, C: causative) to the verb phrases, because it effectively helps to discriminate cases.

3.3 Annotation schema

We introduce the phrase and function tags in Table 3 and use them selectively based on the options described below.

  Phrase tags
  NP       Noun phrase
  PP       Postposition phrase
  VP       Verb phrase
  ADJP     Adjective phrase
  ADVP     Adverbial phrase
  CONJP    Conjunction phrase
  S        Sentence (= root)
  IP       Inflectional phrase
  IP-MAT   Matrix clause
  IP-ADV   Adverb clause
  IP-REL   Gapping relative clause
  IP-ADN   Non-gapping adnominal clause
  CP       Complementizer phrase
  CP-THT   Sentential complement

  Function tags
  Semantic role for mandatory argument (gap notation):
  -ARG0 (_arg0), -ARG1 (_arg1), -ARG2 (_arg2)
  Grammatical role for mandatory argument (gap notation):
  -SBJ (_sbj)   Subjective case
  -OBJ (_obj)   Objective case
  -OB2 (_ob2)   Indirect object case
  Arbitrary argument:
  -TMP      Temporal case
  -LOC      Locative case
  Others:
  -COORD    Coordination (for n-ary)
  -NCOORD   Left branch of NP coord. (for binary)
  -VCOORD   Left branch of VP coord. (for binary)
  -APPOS    Apposition
  -QUE      Question

Table 3: Phrase tags

Tree structure. We first built a treebank with binary tree structure (except the root and terminal nodes), because it is comparatively easy to convert the existing Japanese dependency-based corpus to it. We converted the dependency-based corpus by the previously described method in (Uematsu et al., 2013). The binary tree structure has the following characteristics regarding verb phrases (VP) and postposition phrases (PP): a VP from the same bunsetsu is a left-branching subtree, and the PP-VP structure (roughly corresponding to the argument-predicate structure) is a right-branching subtree. Pure binary trees tend to be very deep and difficult for humans to annotate and interpret, so we also built an n-ary tree version by flattening these structures.

The predicate-argument structure, which is usually represented by PPs and a VP in the treebank, tends to be particularly deep in binary trees, in proportion to the number of arguments. To flatten the structure, we remove the internal VP nodes by re-attaching all of the argument PPs to the VP that dominates the predicate. Figure 1 shows an example of flattening the PP-VP structure. For noun phrases, since compound nouns and numerals cause deep hierarchies, the structure that includes them is flattened under the parent NP. The coordinating structure is preserved, and each NP element of the coordination is flattened.

Figure 2: The leftmost tree shows the annotation of grammatical roles in a basic inflectional phrase: 犬が猫を追いかけた "The dog chased the cat." The right two trees show examples of adnominal phrases: 猫を追いかけた犬 "the dog that chased the cat" (IP-REL_sbj) and 犬が猫を追いかけている写真 "the photo of a dog chasing a cat" (IP-ADN).

Predicates and arguments. A predicate's argument is basically marked with particles, which represent cases in Japanese; thus it is represented as a postpositional phrase composed of a noun phrase and particles. The leftmost tree in Figure 2 is an example of the parse of the following sentence: 犬-が inu-ga "dog-NOM" 猫-を neko-o "cat-ACC" 追いかけた oikaketa "chased" (The dog chased the cat.)

We annotated predicate arguments by two different schemes (different tag sets) in our treebank: grammatical roles and semantic roles. When using the tag set based on grammatical roles, arguments are assigned suffixes based on their syntactic roles in the sentence, as in the Penn Treebank: SBJ (subject), OBJ (direct object), and OB2 (indirect object). Figure 2 is annotated by this scheme. Alternatively, arguments are labeled based on their semantic roles from the case frames of predicates, as in PropBank (Palmer et al., 2005): ARG0, ARG1 and ARG2. These arguments are annotated by converting the semantic roles defined in the case frame dictionary Goitaikei into simple labels; these labels are not influenced by case alternation. In both annotation schemes, we also annotated two types of arbitrary arguments with semantic role labels: LOC (locative) and TMP (temporal), which can be assigned consistently and are useful for various applications.

Adnominal clauses. Clauses modifying noun phrases are divided into two types: (gapping) relative clauses and non-gapping adnominal clauses. Relative clauses are denoted by adding the function tag -REL to the phrase tag IP. The gap is directly attached to the IP-REL tag as a suffix consisting of an underscore and small letters in our treebank, e.g., IP-REL_sbj for a subject-gap relative clause, so that the parser can learn the type of gap simultaneously, unlike the Penn Treebank style, where gaps are marked as the trace '*T*'. For instance, consider the structure of the following noun phrase, shown in the middle tree of Figure 2: 猫-を neko-o "cat-ACC" 追いかけた oikake-ta "chased" 犬 inu "dog" (the dog that chased the cat). We also adopt another type of gap notation that resembles the predicate-argument structure: semantic role notation. In the example above, the tag IP-REL_arg0 is attached to the relative clause instead.

We attach the tag IP-ADN to the other type of adnominal clause, which has no gap: the modified noun phrase is not an argument of the predicate in the adnominal clause. The rightmost tree in Figure 2 is an example of a non-gapping clause: 犬-が inu-ga "dog-NOM" 猫-を neko-o "cat-ACC" 追いかけている oikake-teiru "chasing" 写真 shashin "photo" (a photo of a dog chasing a cat), where there is no predicate-argument relation between the verb 追いかける "chase" and the noun 写真 "photo".
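In bracket notation, the three trees of Figure 2 under the grammatical-role scheme look roughly as follows; the bracketing is our reconstruction from the figure, and the preterminals of the auxiliary elements are approximate:

```
(IP-MAT (PP-SBJ (NP (NN 犬)) (PCS が))
        (PP-OBJ (NP (NN 猫)) (PCS を))
        (VP (VB 追いかけ) (AUX た)))              The dog chased the cat.

(NP (IP-REL_sbj (PP-OBJ (NP (NN 猫)) (PCS を))
                (VP (VB 追いかけ) (AUX た)))
    (NN 犬))                                     the dog that chased the cat

(NP (IP-ADN (PP-SBJ (NP (NN 犬)) (PCS が))
            (PP-OBJ (NP (NN 猫)) (PCS を))
            (VP (VB 追いかけ) (P て) (AUX いる)))
    (NN 写真))                                   the photo of a dog chasing a cat
```

Under the semantic-role scheme, the middle tree would carry IP-REL_arg0 instead of IP-REL_sbj.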

Coordination and apposition. The notation of parallel structures such as coordination and apposition differs based on the type of tree structure. In binary trees, coordination is represented by a left-branching tree, in which a conjunction or conjunctive particle first joins the left-hand constituent; that phrase is marked as a modifier constituting the coordination (-NCOORD and -VCOORD for NP and VP coordination), as shown on the left side of Figure 3. In n-ary trees, on the other hand, all the coordination elements and conjunctions are aligned flatly under the parent phrase with the suffix -COORD. Apposition is represented in the same way, using the tag -APPOS instead.

Phrase and sentential elements. Since predicate arguments are often omitted in Japanese, the discrimination between fragments of larger phrases and sentential elements is not clear. In our treebank, we employ the IP and CP tags for inflectional and complementizer phrases, assuming that the tags with function tag suffixes correspond to the maximal projection of the predicate (verb or adjective). The matrix phrase and the adverbial phrase have the tags IP-MAT and IP-ADV, respectively. This annotation schema is adopted from the Penn Historical Corpora (Santorini, 2010) and the Japanese Keyaki treebank (Butler et al., 2012), as previously described, although IP in our treebank is not as flat as in them. Sentential complements, such as that-clauses in English, are tagged with CP-THT. In other words, the sentential elements that are annotated with SBAR, S, and the trace *T* in the Penn Treebank are tagged with CP or IP in our treebank.

4 Evaluation

The original Kyoto Corpus has 38,400 sentences, which were automatically converted to constituent structures. Function tags were also added to the corpus by integrating the predicate-argument information in the NAIST Text Corpus. Since the conversion contains errors of structures and tags, about half of the sentences were manually checked to avoid the effects of conversion errors. We evaluated our treebank's effectiveness for parser training with 18,640 sentences, which were divided into three sets: 14,895 sentences for a training set, 1,860 sentences for a test set, and the remainder for a development set.

Figure 3: Noun phrase coordination in binary (left) and n-ary (right) form: A社とB社の利益 (A Company-CONJ B Company-GEN interest) "the interests of A Company and B Company".

The basic evaluations were conducted with the original tag sets: the basic set Base, which contains all the preterminal tags in Table 1 and the phrase tags in Table 3 except the IP and CP tags, and the full set Full, which has Base plus the IP and CP tags and all the function tags. The basic set Base is provided to evaluate constituent parser performance for the case where better performance is needed at the cost of limiting the information. We used two types of function tag sets: Full_sr for semantic roles and Full_gr for grammatical roles.

We added the following complementary information to the tags, naming the new tag sets Base or Full plus a suffix:

inf: add inflection information to the POS tags (verbs, adjectives, and auxiliary verbs) and the phrase tags (Table 2).
lex: lexicalize the closed words, i.e., auxiliary verbs and particles.
vsub: add verb subcategorization to the verb and verb phrase tags.
vsub_alt: add verb subcategorization and case alternation to the verb and verb phrase tags.

In comparing the system output with the gold standard, we remove the complementary information to ignore the differing levels of annotation; thus we do not discriminate between, for example, VB[na] and VB[nad].

We used the Berkeley parser (Petrov et al., 2006) for our evaluation and trained with six iterations for latent annotations. In training the n-ary trees, we used the default Markovization parameters (h = 0, v = 1), because the parser performed best with them on the development set.

Table 4 shows the parsing results on the test set.

  Tag set            LF1    Comp   UF1    Comp
  binary tree
  Base               88.4   34.0   89.6   37.9
  Base_inf           88.5⋆  33.5   90.0⋆  39.3
  Full_sr            80.7   13.6   88.4   35.9
  Full_sr_inf        81.1⋆  15.5⋆  88.7⋆  36.9
  Full_sr_lex        79.8⋆  13.1   87.7⋆  34.3
  Full_sr_vsub       80.3⋆  12.5   87.9⋆  35.1
  Full_sr_vsub_alt   78.6⋆  13.3   86.7⋆  32.5⋆
  Full_gr            81.0   15.6   88.5   37.3
  Full_gr_inf        81.3⋆  15.3   88.8   37.2
  Full_gr_lex        80.3⋆  14.2   87.9⋆  33.6⋆
  Full_gr_vsub       81.2   15.5   88.5   35.2
  Full_gr_vsub_alt   77.9⋆  11.7⋆  86.0⋆  29.9⋆
  n-ary tree
  Full_sr            76.7   11.4   85.3   28.0
  Full_sr_inf        76.9   11.6   85.4   28.7
  Full_sr_lex        76.5   11.1   84.7⋆  27.9
  Full_sr_vsub       76.5   10.8   84.9⋆  26.2
  Full_sr_vsub_alt   76.6   11.0   84.8⋆  27.2
  Full_gr            77.2   13.2   85.3   29.2
  Full_gr_inf        77.4   12.0⋆  85.5   28.3
  Full_gr_lex        77.6   12.2⋆  85.0   28.5
  Full_gr_vsub       77.1   12.7⋆  84.8⋆  28.8
  Full_gr_vsub_alt   76.9   12.2⋆  84.7⋆  26.3⋆

Table 4: Parse results displayed by labeled (LF1) and unlabeled (UF1) F1 metrics and the proportion of sentences completely matching the gold standard (Comp). Base contains only basic tags, not grammatical function tags. Figures with '⋆' indicate statistically significant differences (α = 0.05) from the results without complementary information, i.e., Full_sr or Full_gr.

  Tag set       UAS
  binary tree
  Base          89.1
  Base_inf      89.4
  Full_sr       87.9
  Full_sr_inf   88.3
  Full_gr       88.0
  Full_gr_inf   88.5⋆
  n-ary (flattened) tree
  Full_sr       82.8
  Full_sr_inf   83.3
  Full_gr       82.9
  Full_gr_inf   83.0

Table 5: Dependency accuracies of the results converted into bunsetsu dependencies.

On the whole, the binary trees outperformed the n-ary trees. This indicates that the binary tree structure converted from bunsetsu-based dependencies, whose characteristics are described in Section 3.3, is better for parser training than the partially flattened structure.
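Since the complementary suffixes are attached by plain string concatenation, stripping them at evaluation time is a one-line normalization. A sketch, assuming the bracket and colon notation shown in Figure 1 (the regex itself is ours):

```python
import re

def strip_complementary(tag):
    """Remove complementary suffixes from a tag so that, e.g., VB[na]
    and VB[nad] compare equal at evaluation time. Assumes the notation
    of Figure 1: subcategorization in [...], voice after ':'."""
    return re.sub(r"\[[^\]]*\]|:[APC]$", "", tag)

print(strip_complementary("VB[nad]"))        # VB
print(strip_complementary("IP-MAT[nad]:A"))  # IP-MAT
```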

As for the additional information, the inflection suffixes slightly improved the F1 metrics. This is mainly because the inflection information gives the category of the attached phrase (e.g., the attributive form for noun phrases). The others did not provide any improvement, even though we expected the subcategorization and case alternation information to help the parser detect and discriminate the grammatical roles. This is probably because we simply introduced the information by concatenating the suffixes to the base tags, in order to adapt an off-the-shelf parser in our evaluation. For instance, VB[n] and VB[na] are recognized as entirely independent categories; a more sophisticated model that can treat them hierarchically would improve the performance.

For comparison with a bunsetsu-based dependency parser, we convert the parser output into unlabeled bunsetsu dependencies in the following simple way. We first extract all bunsetsu chunks in a sentence and find the minimum phrase including each bunsetsu chunk in the constituent structure. For each pair of bunsetsus having a common parent phrase, we add a dependency from the left bunsetsu to the right one, since Japanese is a head-final language. The unlabeled attachment scores of the converted dependencies are shown in Table 5, since most bunsetsu-based dependency parsers output only unlabeled structure.

The Base_inf results are comparable with the bunsetsu-dependency results (90.46%) over the same corpus (Kudo et al., 2002),¹ which has only the same level of information. Constituent parsing with our treebank almost matched current bunsetsu parsing.

¹ The division into training and test sets is different.

5 Analysis

In this section, we analyze the errors in the parse results from the point of view of the discrimination of grammatical and semantic roles, adnominal clauses and coordination.

  tag       P     R     F1
  PP-ARG0   64.9  75.0  69.6
  PP-ARG1   70.6  80.1  75.1
  PP-ARG2   60.3  68.5  64.1
  PP-TMP    40.1  43.6  41.8
  PP-LOC    23.8  17.2  20.0

  tag       P     R     F1
  PP-SBJ    69.6  81.5  75.1
  PP-OBJ    72.6  83.5  77.7
  PP-OB2    63.6  71.4  67.3
  PP-TMP    45.0  48.0  46.5
  PP-LOC    21.3  15.9  18.2

Table 6: Discrimination of semantic role and grammatical role labels (upper: semantic roles, lower: grammatical roles).

  system \ gold  PP-SBJ  PP-OBJ  PP-OB2
  PP-SBJ         *74.9   6.5     2.3
  PP-OBJ         5.8     *80.1   0.5
  PP-OB2         1.7     0.3     *68.5
  PP-TMP         0.2     0.0     0.5
  PP-LOC         0.2     0.0     0.4
  PP             6.5     2.0     16.8
  other labels   0.5     0.2     0.3
  no span        10.2    10.9    11.0

  system \ gold  PP-TMP  PP-LOC
  PP-SBJ         4.7     4.1
  PP-OBJ         0.0     0.0
  PP-OB2         6.0     13.8
  PP-TMP         *43.6   2.8
  PP-LOC         2.0     *17.2
  PP             37.6    49.7
  other labels   1.4     5.0
  no span        4.7     7.4

Table 7: Confusion matrix for grammatical role labels. Figures with '*' indicate recall. (binary tree, Full_gr)

  tag         P     R     F1
  IP-REL_sbj  48.4  54.3  51.1
  IP-REL_obj  27.8  22.7  24.9
  IP-REL_ob2  17.2  29.4  21.7
  IP-ADN      50.9  55.4  53.1
  CP-THT      66.1  66.6  66.3

Table 8: Results for adnominal phrases and sentential elements (binary tree, Full_gr).

Grammatical and semantic roles. Predicate arguments usually appear as PPs, which are composed of noun phrases and particles. We focus on PPs with function labels. Table 6 shows the PP results with

grammatical and semantic labels under the Full_sr and Full_gr conditions, respectively.

The precision and recall of mandatory arguments did not reach a high level. These results are related to predicate-argument structure analysis in Japanese, but they cannot be directly compared with it, because the parser in this evaluation must output a correct target phrase and select it as an argument, whereas most studies select a word using a gold standard parse tree. Hayashibe et al. (2011) reported the best precision of ARG0 discrimination to be 88.42%,² obtained by selecting from the candidate nouns using the gold standard parse trees of NTC. If the cases where the correct candidates did not appear in the parser results are excluded (10.8%), our precision is 72.7%. The main remaining error is labeling non-argument PPs with the suffix -ARG0 (17.0%); thus we must restrain this overlabeling to improve the precision.

The discrimination of grammatical roles is better than that of semantic roles, since grammatical roles are more directly estimated from the case particles following the noun phrases. The confusion matrix for recall in Table 7 shows that the main problem is parse error, where no correct phrase span exists (no span), accounting for 10-11%. The second major error is confusion with bare PPs (PPs without suffixes), mainly because the treebank lacks the clues to judge whether arguments are mandatory or arbitrary. Since even mandatory arguments are often omitted in Japanese, it is not easy to identify the arguments of predicates using only syntactic information.

Adnominal phrases. We need to discriminate between the two types of adnominal phrases described in Section 3.3: IP-REL and IP-ADN. Table 8 shows the discrimination results for the adnominal phrase types. The difference between IP-REL (gapped relative clauses) and IP-ADN is closely related to the discrimination of the grammatical role: whether the antecedent is an argument of the head predicate of the relative clause.

The results indicate the difficulty of discriminating the type of gap of the relative clause IP-REL. The confusion matrix in Table 9 shows that the discrimination between gaps and non-gaps, i.e., IP-REL and IP-ADN, is moderate for IP-REL_sbj and IP-REL_obj. However, IP-REL_ob2 is hardly recognized, because it is difficult to determine whether an antecedent marked with the particle ni is a mandatory argument (IP-REL_ob2) or not (IP-ADN). Increasing the training samples would improve the discrimination, since there are only 290 IP-REL_ob2 tags against 8,100 IP-ADN tags in the training set. Naturally, discrimination by syntactic information alone has its limits; this baseline can be improved by incorporating semantic information.

Coordination. Table 10 shows the coordination results, which we consider a baseline for using only syntactic information. Improvement is possible by incorporating semantic information, since the disambiguation of coordination structure essentially requires semantic information.

6 Conclusion

We constructed a Japanese constituent-based parser to remove the constraints of bunsetsu-based analysis and thereby simplify the integration of syntactic and semantic analysis. Our evaluation results indicate that the basic performance of the parser trained on the treebank almost equals that of bunsetsu-based parsers, and that it has the potential to supply detailed syntactic information through grammatical function labels for semantic analysis, such as predicate-argument structure analysis. Future work will be to refine the annotation scheme to improve parser performance and to evaluate the parser results by applying them to NLP applications such as machine translation.

Acknowledgments

We would like to thank Associate Professor Yusuke Miyao and Associate Professor Takuya Matsuzaki of the National Institute of Informatics and Sumire Uematsu of the University of Tokyo for providing us with the language resources and giving us valuable suggestions.

² The figure is calculated only for the arguments that appear as dependents of predicates, excluding the omitted arguments.
The results indicate the diffi- the language resources and giving us valuable sug- culties of discriminating the type of gaps of rela- gestions. 2The figure is calculated only for the arguments that appear as the dependents of predicates, excluding the omitted argu- ments.

system \ gold   IP-REL sbj   IP-REL obj
IP-REL sbj      *55.0        30.3
IP-REL obj      8.5          *33.3
IP-REL ob2      0.5          0.0
IP-ADN          10.0         9.0
IP-ADV          0.3          0.0
VP              8.5          7.6
other labels    1.2          6.2
no span         16.0         13.6

system \ gold   IP-REL ob2   IP-ADN
IP-REL sbj      29.4         7.5
IP-REL obj      0.0          0.6
IP-REL ob2      *11.8        0.0
IP-ADN          23.5         *57.3
IP-ADV          0.5          0.3
VP              17.6         9.3
other labels    5.4          3.0
no span         11.8         22.0

Table 9: Confusion matrix for adnominal phrases. Figures with '*' indicate recall. (binary tree, Fullgr)

tag         P      R      F1
NP-COORD    62.6   60.7   61.6
VP-COORD    57.6   50.0   53.5
NP-APPOS    46.0   40.0   42.8

Table 10: Results of coordination and apposition (binary tree, Fullgr)

References

Masayuki Asahara. 2013. Comparison of syntactic dependency annotation schemata. In Proceedings of the 3rd Japanese Corpus Linguistics Workshop. In Japanese.

Daisuke Bekki. 2010. Formal Theory of Japanese Syntax. Kuroshio Shuppan. In Japanese.

Daniel M. Bikel. 2004. A distributional analysis of a lexicalized statistical parsing model. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2004), Vol. 4, pp. 182–189.

Francis Bond, Sanae Fujita and Takaaki Tanaka. 2008. The Hinoki syntactic and semantic treebank of Japanese. Journal of Language Resources and Evaluation, Vol. 42, No. 2, pp. 243–251.

Alastair Butler, Zhu Hong, Tomoko Hotta, Ruriko Otomo, Kei Yoshimoto and Zhen Zhou. 2012. Keyaki Treebank: phrase structure with functional information for Japanese. In Proceedings of the Text Annotation Workshop.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (NAACL 2000), pp. 132–139.

DongHyun Choi, Jungyeul Park and Key-Sun Choi. 2012. Korean treebank transformation for parser training. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pp. 78–88.

Ann Copestake, Dan Flickinger, Carl Pollard and Ivan A. Sag. 2005. Minimal recursion semantics: an introduction. Research on Language and Computation, Vol. 3, No. 4, pp. 281–332.

Ryan Gabbard, Mitchell Marcus and Seth Kulick. 2006. Fully parsing the Penn Treebank. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 184–191.

Takao Gunji. 1987. Japanese Phrase Structure Grammar: A Unification-Based Approach. D. Reidel.

Yuta Hayashibe, Mamoru Komachi and Yuji Matsumoto. 2011. Japanese predicate argument structure analysis exploiting argument position and type. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), pp. 201–209.

Ryu Iida, Mamoru Komachi, Kentaro Inui and Yuji Matsumoto. 2007. Annotating a Japanese text corpus with predicate-argument and coreference relations. In Proceedings of the Linguistic Annotation Workshop, pp. 132–139.

Ryu Iida and Massimo Poesio. 2011. A cross-lingual ILP solution to zero anaphora resolution. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), pp. 804–813.

Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Kentaro Ogura, Yoshifumi Ooyama and Yoshihiko Hayashi. 1998. Nihongo Goitaikei. Iwanami Shoten. In Japanese.

Ronald M. Kaplan and Joan Bresnan. 1982. Lexical-Functional Grammar: a formal system for grammatical representation. In The Mental Representation of Grammatical Relations (Joan Bresnan, ed.), pp. 173–281. The MIT Press.

Daisuke Kawahara and Sadao Kurohashi. 2006. A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2006), pp. 176–183.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language processing. Advances in Neural Information Processing Systems, 15:3–10.

Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), Volume 20, pp. 1–7.

Sadao Kurohashi and Makoto Nagao. 2003. Building a Japanese parsed corpus – while improving the parsing system. In Abeillé (ed.), Treebanks: Building and Using Parsed Corpora, Chap. 14, pp. 249–260. Kluwer Academic Publishers.

Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, Vol. 19, No. 2, pp. 313–330.

Hiroshi Masuichi, Tomoko Okuma, Hiroki Yoshimura and Yasunari Harada. 2003. Japanese parser on the basis of the Lexical-Functional Grammar formalism and its evaluation. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation (PACLIC 17), pp. 298–309.

Masaaki Nagata and Tsuyoshi Morimoto. 1993. A unification-based Japanese parser for speech-to-speech translation. IEICE Transactions on Information and Systems, Vol. E76-D, No. 1, pp. 51–61.

Tomoya Noro, Taiichi Hashimoto, Takenobu Tokunaga and Hozumi Tanaka. 2005. Building a large-scale Japanese syntactically annotated corpus for deriving a CFG. In Proceedings of the Symposium on Large-Scale Knowledge Resources (LKR2005), pp. 159–162.

Martha Palmer, Daniel Gildea and Paul Kingsbury. 2005. The Proposition Bank: an annotated corpus of semantic roles. Computational Linguistics, Vol. 31, No. 1, pp. 71–106.

Slav Petrov, Leon Barrett, Romain Thibaux and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006), pp. 433–440.

Ivan A. Sag, Thomas Wasow and Emily M. Bender. 2003. Syntactic Theory: A Formal Introduction. 2nd Edition, CSLI Publications.

Beatrice Santorini. 2010. Annotation manual for the Penn Historical Corpora and the PCEEC (Release 2). Department of Linguistics, University of Pennsylvania.

Melanie Siegel and Emily M. Bender. 2002. Efficient deep processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization at the 19th International Conference on Computational Linguistics, Vol. 12, pp. 1–8.

Sumire Uematsu, Takuya Matsuzaki, Hiroaki Hanaoka, Yusuke Miyao and Hideki Mima. 2013. Integrating multiple dependency corpora for inducing wide-coverage Japanese CCG resources. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pp. 1042–1051.
