
A Metagrammar for Vietnamese LTAG

Lê Hồng Phương
LORIA/INRIA Lorraine, Nancy, France
[email protected]

Nguyễn Thị Minh Huyền
Hanoi University of Science, Hanoi, Vietnam
[email protected]

Azim Roussanaly
LORIA/INRIA Lorraine, Nancy, France
[email protected]

Abstract

We present in this paper an initial investigation into the use of a metagrammar for explicitly sharing abstract grammatical specifications for the Vietnamese language. We first introduce the essential syntactic mechanisms of the Vietnamese language. We then show that the basic subcategorization frames of Vietnamese can be compactly represented by classes using the XMG formalism (eXtensible MetaGrammar). Finally, we report on the implementation of the first metagrammar producing verbal elementary trees that recognize basic Vietnamese sentences.

1 Introduction

Metagrammars (MG) have recently emerged as a means to develop wide-coverage LTAGs for well-studied languages like English, French and Italian (Candito, 1999; Kinyon, 2003). MGs help avoid redundancy and reduce the effort of grammar development by making use of common properties of LTAG elementary trees.

We present in this paper an initial investigation into the use of a metagrammar for explicitly sharing abstract grammatical specifications for the Vietnamese language. We use the eXtensible MetaGrammar (XMG) tool, which was developed by Crabbé (Crabbé, 2005; Parmentier and L. Roux, 2005), to compile a TAG for Vietnamese. The resulting grammar is called vnMG and is freely available online.¹

Only in recent years have Vietnamese researchers begun to be involved in the domain of natural language processing in general and in the task of parsing Vietnamese in particular. No work on formalizing Vietnamese was reported before (Nguyen et al., 2004). In (Lê et al., 2006), basic declarative structures and complement clauses of Vietnamese sentences were modeled using about thirty elementary trees, representing as many subcategorization frames. We show in this paper that these basic subcategorization frames can be compactly represented by classes in the XMG formalism.

We first introduce the essential syntactic mechanisms of the Vietnamese language. We then show that the basic subcategorization frames of Vietnamese can be compactly represented by classes using the XMG formalism. We then report on the implementation of the first metagrammar producing verbal elementary trees that recognize basic Vietnamese sentences, before concluding.

¹ http://www.loria.fr/~lehong/tools/vnMG.php

2 Vietnamese Subcategorizations

As in other isolating languages, the most important source of syntactic information in Vietnamese is word order. The basic word order is Subject – Verb – Object. A verb is always placed after the subject in both predicative and question forms. In a noun phrase, the main noun precedes the adjectives and the genitive follows the governing noun. The other syntactic means are function words, […], and, in the case of spoken language, prosody (Nguyễn et al., 2006).

From the point of view of functional grammar, the syntactic structure of Vietnamese follows a topic-oriented structure. It belongs to the topic-prominent languages as described by Li and Thompson (1976). In those languages, topics are

Proceedings of The Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms, Tübingen, Germany. June 6-8, 2008.

coded in the surface structure and they tend to control co-referentiality. The topic-oriented “double subject” construction is a basic sentence type. For example, “Cậu ấy khoẻ mạnh, là sinh viên y khoa / He strong, be student medicine”, which means “He is strong, he is a medical student”. In Vietnamese, passive voice and cleft subject sentences are rare or non-existent.

In general, Vietnamese predicates may be classified into three types depending on whether a copula is needed to connect them with their subjects in the declarative and negative forms (Nguyễn, 2004). Complex predicates can be constructed to form coordinated predicative structures starting from these basic types of predicates. We briefly present these three types of Vietnamese predicates in the following subsections.

2.1 First Type Predicates

The first type predicates are predicates which connect directly to their subjects without the need of a copula in both the declarative and negative forms. For example:

• Declarative form: Tôi đọc sách. / I am reading books.

• Negative form: Tôi không đọc sách. / I am not reading books.

These predicates are realized by verbal or adjectival phrases. The fact that an adjective can be a predicate is a particularity of Vietnamese in comparison with predicates of Western languages. In English or French, for instance, only verbal phrases can be predicates; adjectives in these languages always signify properties of subjects and they always follow the verb “to be” in English or “être” in French.

2.2 Second Type Predicates

The second type predicates are predicates which are connected to their subjects by the copula “là” in the declarative form and by the copulas “không là”, “không phải” or “không phải là” in the negative form. Predicates of this type are rather rich. They can be:

• Nouns or noun phrases: Tôi là sinh viên. / I am a student.

• Verbs, adjectives, verbal phrases or adjectival phrases: Van xin là yếu đuối. / Begging is feeble. Học cũng là làm việc. / To study is to work.

2.3 Third Type Predicates

The third type predicates are predicates which connect directly to their subjects in the declarative form; in the negative form, however, they are connected to their subjects by a copula. Predicates of this type are usually:

• A clause: Nó vẫn tên là Quýt. / His name is still Quýt.

• A composition of a numeral and a noun: Lê này mười ngàn đồng. / This pear costs ten thousand dongs.

• A composition of a preposition and a noun: Lúa này của chị Hoa. / This is the rice of Ms. Hoa.

• An expression: Thằng ấy đầu bò đầu bướu lắm. / That guy is very stubborn.

2.4 Subcategorizations

In the first LTAG for Vietnamese, presented in (Lê et al., 2006), each subcategorization is represented by the same structure of elementary trees associated with a considered […]. We consider that the subject is subcategorized in the same way as the other arguments. Verbs thus anchor elementary trees composed of a node for the subject and one or more nodes for each of its essential complements.

We follow the de facto standard in TAG, in which each subcategorization is represented by a family of elementary trees. We define families of verbal elementary trees in Table 1.

We present in the next section a metagrammar that generates this set of elementary trees.

3 A Metagrammar for Verbal Trees

The subcategorizations of elementary trees describe only “canonical” constructions of predicative elements, without taking into account relative or question structures. For the purpose of this investigation, we restrict ourselves at this first stage to developing only the verb spines and argument realizations shown in the subcategorizations presented in the previous section.

We have developed an XMG metagrammar that consists of 11 classes (or tree fragments). The


current metagrammar is able to produce the set of elementary trees described in Table 1, including the intransitive, transitive and ditransitive families, with or without optional complements.

Subcategorizations                              Families       Examples
Intransitive                                    N0V            ngủ/sleep
With a nominal complement                       N0VN1          đọc/to read
With a clausal complement                       N0VS1          tin/to believe
With modal complement                           N0V0V1         mong/to wish
Ditransitive                                    N0VN1N2        cho/to give
Ditransitive with a preposition                 N0VN1ON2       vay/to borrow
Ditransitive with a verbal complement           N0V0N1V1       lãnh đạo/to lead
Ditransitive with an adjectival complement      N0VN1A         làm/to make
Movement verbs with a nominal complement        N0V0V1N1       ra/to go out
Movement verbs with an adjectival complement    N0V0AV1        trở nên/to become
Movement ditransitive                           N0V0N1V1N2     chuyển/to transfer

Table 1: Subcategorizations of Vietnamese verbs

As an illustration, the declarative transitive structure in Figure 1 can be defined by combining a canonical subject fragment with an active verb fragment and a canonical object fragment.

          S
        /   \
     N0↓     PredP
     tôi    /     \
          V⋄       N1↓
          đọc      sách

Figure 1: Declarative transitive structure αn0Vn1

[Figure: three tree fragments over the nodes S, PredP, V and N↓, joined by “+”: canonical subject + active verb + canonical object]

This combination is conveniently expressed by a statement in the XMG language:

TransitiveVerb = Subject ∧ ActiveVerb ∧ Object.

4 Conclusion and Future Work

This paper presents an initial investigation into the use of the XMG formalism for developing a first metagrammar that produces an LTAG for Vietnamese recognizing basic verbal constructions. We have shown that the essential subcategorization frames of Vietnamese predicates can be effectively encoded by means of XMG classes while retaining basic properties of the realized verbal trees. This confirms that various syntactic phenomena of Vietnamese can be covered in a Vietnamese MG.

The first evaluation of the MG for Vietnamese is promising, but the lexical coverage has to be improved further. Moreover, the grammar coverage needs to be revised by refining the constraints that rule out ungrammatical syntactic constructions. Although there are not many tree fragments in the current metagrammar, we find that the current MG overgenerates some undesired structures. The MG will also be extended to deal with constructions not yet covered, like adjectival and […] constructions. We also intend to generate a test suite to document the grammars and perform realistic evaluations.

There is existing work on the development of metagrammars for less frequently studied languages like Korean and Yiddish and their relations to a […] (Kinyon, 2006). The authors showed that cross-linguistic generalizations, for example the verb-second phenomenon, can be incorporated into a multilingual MG. We think that a comparison of the Vietnamese MG with this work would be useful. In particular, it would be beneficial to study the relative position of verbs and arguments in Vietnamese and relate it to this work.
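As a rough illustration of the statement TransitiveVerb = Subject ∧ ActiveVerb ∧ Object from Section 3, the following Python sketch treats each class as a small tree description (parent-child edges plus left-to-right precedence pairs) and treats conjunction as accumulating descriptions. It is not XMG code and does not reproduce XMG's variable unification or constraint solving: here nodes are simply identified by equal names. The contents of the three fragments, and the helper names Fragment, frag and bracket, are assumptions made for this example, chosen so that the combination yields the tree of Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A tree fragment: parent->child edges plus sibling precedence pairs."""
    edges: frozenset
    precedes: frozenset = frozenset()

    def __and__(self, other):
        # Conjunction accumulates both descriptions.
        # Simplifying assumption: equal node names denote the same node.
        return Fragment(self.edges | other.edges,
                        self.precedes | other.precedes)


def frag(edges, precedes=()):
    """Convenience constructor (hypothetical helper)."""
    return Fragment(frozenset(edges), frozenset(precedes))


# Hypothetical fragment contents; the paper's own fragments may differ.
Subject = frag({("S", "N0↓"), ("S", "PredP")}, {("N0↓", "PredP")})
ActiveVerb = frag({("S", "PredP"), ("PredP", "V⋄")})
Object = frag({("PredP", "V⋄"), ("PredP", "N1↓")}, {("V⋄", "N1↓")})

TransitiveVerb = Subject & ActiveVerb & Object


def bracket(fragment, root="S"):
    """Render a combined description as a bracketed tree."""
    def children(node):
        kids = [c for (p, c) in fragment.edges if p == node]
        # Rank each sibling by how many siblings must precede it.
        return sorted(
            kids,
            key=lambda k: sum((o, k) in fragment.precedes for o in kids),
        )

    def render(node):
        kids = children(node)
        if not kids:
            return node
        return f"{node}({' '.join(render(k) for k in kids)})"

    return render(root)


print(bracket(TransitiveVerb))  # S(N0↓ PredP(V⋄ N1↓))
```

Under these assumptions, the other families of Table 1 could be sketched the same way, for instance by conjoining further object fragments for the ditransitive family. A real metagrammar compiler additionally has to solve the accumulated description for its minimal tree models, which this sketch sidesteps by fully ordering every set of siblings.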


References

Marie-Hélène Candito. 1999. Représentation modulaire et paramétrable de grammaires électroniques lexicalisées : application au français et à l’italien. Doctoral Dissertation, Université Paris 7.

Benoit Crabbé. 2005. Représentation informatique de grammaires fortement lexicalisées. Doctoral Dissertation, Université Nancy 2.

Nguyễn Thị Minh Huyền, Laurent Romary, Mathias Rossignol and Vũ Xuân Lương. 2006. A Lexicon for Vietnamese Language Processing. Language Resources and Evaluation, Vol. 40, No. 3–4.

Alexandra Kinyon and Owen Rambow. 2003. Using the MetaGrammar to generate cross-language and cross-framework annotated test-suites. In Proc. LINC-EACL, Budapest.

Alexandra Kinyon and Carlos A. Prolo. 2002. A Classification of Grammar Development Strategies. Proceedings of the Workshop on Grammar Engineering, Taipei, Taiwan.

Alexandra Kinyon, Owen Rambow, Tatjana Scheffler, SinWon Yoon and Aravind K. Joshi. 2006. The Metagrammar Goes Multilingual: A Cross-Linguistic Look at the V2-Phenomenon. Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms, Sydney, Australia.

Lê Hồng Phương, Nguyễn Thị Minh Huyền, Laurent Romary and Azim Roussanaly. 2006. A Lexicalized Tree-Adjoining Grammar for Vietnamese. Proceedings of LREC 2006, Genoa, Italy.

Thanh Bon Nguyen, Thi Minh Huyen Nguyen, Laurent Romary and Xuan Luong Vu. 2004. Developing Tools and Building Linguistic Resources for Vietnamese Morpho-Syntactic Processing. Proceedings of LREC 2004, Lisbon, Portugal.

Charles N. Li and Sandra A. Thompson. 1976. Subject and topic: a new typology of language. In Charles N. Li (ed.), Subject and Topic. London/New York: Academic Press, pp. 457-489.

Yannick Parmentier and Joseph L. Roux. 2005. XMG: a Multi-formalism Metagrammar Framework. Proceedings of the Tenth ESSLLI Student Session.

Nguyễn Minh Thuyết and Nguyễn Văn Hiệp. 2004. Thành phần câu tiếng Việt. NXB Giáo dục, Hà Nội, Vietnam.
