
COLING 2010

23rd International Conference on Computational Linguistics

Proceedings of the 8th Workshop on Asian Language Resources

21-22 August 2010
Beijing, China

Produced by the Chinese Information Processing Society of China
All rights reserved for Coling 2010 CD production.

To order the CD of Coling 2010 and its Workshop Proceedings, please contact:

Chinese Information Processing Society of China
No.4, Southern Fourth Street
Haidian District, Beijing, 100190
China
Tel: +86-010-62562916
Fax: +86-010-62562916
[email protected]

Preface

Language resources play a central role in statistical and learning-based approaches to natural language processing. Recent research has therefore placed great emphasis on building these resources for target languages. Parallel resources across various languages, including lexica and corpora with multiple levels of annotation, are also being developed for multilingual processing. Though significant progress has been achieved in modeling a few of the Asian languages, the wider spread of ICT use across the region has created growing interest in this field from other linguistic communities. As research in the field matures across Asia, there is a growing need for developing language resources. However, the region is not only short of linguistic resources for the more than 2,200 languages spoken there; its researchers also lack experience in developing such resources. As efforts to develop linguistic resources increase, there is also a need to coordinate them around common frameworks and processes so that the resulting resources can be used equally effectively by various groups of researchers.

The workshop is organised by the Asian Language Resources Committee (ALRC) of the Asian Federation of Natural Language Processing (AFNLP). Its aims are to chart and catalogue the status of Asian language resources; to investigate and discuss problems related to standards and specifications for creating and sharing language resources at various levels; to promote a dialogue between developers and users of language resources in order to address gaps between resources and practical applications, and to nurture collaboration in their development and use; and to provide opportunities for researchers from Asia to collaborate with researchers in other regions.

This is the eighth workshop in the series, and it has been representative, with 35 submissions covering Asian languages including Bahasa Indonesia, Chinese, Dzongkha, Hindi, Japanese, Khmer, Sindhi, Sinhala, Thai, Turkish and Urdu, of which 22 were accepted for presentation. We would like to thank the authors for their submissions and the Program Committee for their timely reviews. We hope that the ALR workshops will continue to encourage researchers to focus on developing and sharing resources for Asian languages, an essential requirement for research in NLP.

ALR8 Workshop Organizers

Organizers:
Sarmad Hussain, CLE-KICS, UET, Pakistan
Virach Sornlertlamvanich, NECTEC, Thailand
Hammam Riza, BPPT, Indonesia
(on behalf of ALRC, AFNLP)

Program Committee: Mirna Adriani Pushpak Bhatacharyya Francis Bond Miriam Butt Thatsanee Charoenporn Key-Sun Choi Ananlada Chotimongkol Jennifer Cole Li Haizhou Choochart Haruechaiyasak Hitoshi Isahara Alisa Kongthon Krit Kosawat Yoshiki Mikami Cholwich Nattee Rachel Roxas Dipti Sharma Kiyoaki Shirai Thepchai Supnithi Thanaruk Theeramunkong Takenobu Tokunaga Ruvan Weerasinghe Chai Wutiwiwatchai Yogendra Yadava

Table of Contents

A Thesaurus of Predicate-Argument Structure for Japanese Verbs to Deal with Granularity of Verb Meanings Koichi Takeuchi, Kentaro Inui, Nao Takeuchi and Atsushi Fujita ...... 1

Collaborative Work on Indonesian WordNet through Asian WordNet (AWN) Chairil Hakim, Budiono Budiono and Hammam Riza ...... 9

Considerations on Automatic Mapping Large-Scale Heterogeneous Language Resources: Sejong Semantic Classes and KorLex Heum Park, Aesun Yoon, Woo Chul Park and Hyuk-Chul Kwon ...... 14

Sequential Tagging of Semantic Roles on Chinese FrameNet Jihong LI, Ruibo WANG and Yahui GAO ...... 22

Augmenting a Bilingual Lexicon with Information for Word Translation Disambiguation Takashi Tsunakawa and Hiroyuki Kaji ...... 30

Construction of bilingual multimodal corpora of referring expressions in collaborative problem solving Takenobu Tokunaga, Ryu Iida, Masaaki Yasuhara, Asuka Terai, David Morris and Anja Belz ...... 38

Labeling Emotion in Bengali Blog Corpus A Fine Grained Tagging at Sentence Level Dipankar Das and Sivaji Bandyopadhyay ...... 47

SentiWordNet for Indian Languages Amitava Das and Sivaji Bandyopadhyay ...... 56

Constructing Thai Opinion Mining Resource: A Case Study on Hotel Reviews Choochart Haruechaiyasak, Alisa Kongthon, Pornpimon Palingoon and Chatchawal Sangkeettrakarn ...... 64

The Annotation of Event Schema in Chinese Hongjian Zou, Erhong Yang, Yan Gao and Qingqing Zeng ...... 72

Query Expansion for Khmer Information Retrieval Channa Van and Wataru Kameyama ...... 80

Word Segmentation for Urdu OCR System Misbah Akram and Sarmad Hussain ...... 88

Dzongkha Word Segmentation Sithar Norbu, Pema Choejey, Tenzin Dendup, Sarmad Hussain and Ahmed Muaz ...... 95

Building NLP resources for Dzongkha: A Tagset and A Tagged Corpus Chungku Chungku, Jurmey Rabgay and Gertrud Faaß ...... 103

Unaccusative/Unergative Distinction in Turkish: A Connectionist Approach Cengiz Acarturk and Deniz Zeyrek ...... 111

A Preliminary Work on Hindi Causatives Rafiya Begum and Dipti Misra Sharma ...... 120

A Supervised Learning based Chunking in Thai using Categorial Grammar Thepchai Supnithi, Chanon Onman, Peerachet Porkaew, Taneth Ruangrajitpakorn, Kanokorn Trakultaweekoon and Asanee Kawtrakul ...... 129

A hybrid approach to Urdu verb phrase chunking Wajid Ali and Sarmad Hussain...... 137

Development of the Korean Resource Grammar: Towards Grammar Customization Sanghoun Song, Jong-Bok Kim, Francis Bond and Jaehyung Yang ...... 144

An Open Source Urdu Resource Grammar Shafqat Mumtaz Virk, Muhammad Humayoun and Aarne Ranta ...... 153

A Current Status of Thai Categorial Grammars and Their Applications Taneth Ruangrajitpakorn and Thepchai Supnithi ...... 161

Chained Machine Translation Using Morphemes as Pivot Language Li Wen, Chen Lei, Han Wudabala and Li Miao ...... 169

Conference Program

Saturday August 21, 2010

Semantics

9:00–9:25 A Thesaurus of Predicate-Argument Structure for Japanese Verbs to Deal with Granularity of Verb Meanings Koichi Takeuchi, Kentaro Inui, Nao Takeuchi and Atsushi Fujita

9:25–9:50 Collaborative Work on Indonesian WordNet through Asian WordNet (AWN) Chairil Hakim, Budiono Budiono and Hammam Riza

9:50–10:15 Considerations on Automatic Mapping Large-Scale Heterogeneous Language Resources: Sejong Semantic Classes and KorLex Heum Park, Aesun Yoon, Woo Chul Park and Hyuk-Chul Kwon

10:15–10:40 Sequential Tagging of Semantic Roles on Chinese FrameNet Jihong LI, Ruibo WANG and Yahui GAO

10:40–11:00 Coffee Break

Semantics, Sentiment and Opinion

11:00–11:25 Augmenting a Bilingual Lexicon with Information for Word Translation Disambiguation Takashi Tsunakawa and Hiroyuki Kaji

11:25–11:50 Construction of bilingual multimodal corpora of referring expressions in collaborative problem solving Takenobu Tokunaga, Ryu Iida, Masaaki Yasuhara, Asuka Terai, David Morris and Anja Belz

11:50–12:15 Labeling Emotion in Bengali Blog Corpus A Fine Grained Tagging at Sentence Level Dipankar Das and Sivaji Bandyopadhyay

12:15–12:40 SentiWordNet for Indian Languages Amitava Das and Sivaji Bandyopadhyay

12:40–14:10 Lunch Break

Saturday August 21, 2010 (continued)

Opinion and Information Retrieval

14:10–14:35 Constructing Thai Opinion Mining Resource: A Case Study on Hotel Reviews Choochart Haruechaiyasak, Alisa Kongthon, Pornpimon Palingoon and Chatchawal Sangkeettrakarn

14:35–15:00 The Annotation of Event Schema in Chinese Hongjian Zou, Erhong Yang, Yan Gao and Qingqing Zeng

15:00–15:25 Query Expansion for Khmer Information Retrieval Channa Van and Wataru Kameyama

15:25–16:00 Coffee Break

Text Corpus

16:00–16:25 Word Segmentation for Urdu OCR System Misbah Akram and Sarmad Hussain

16:25–16:50 Dzongkha Word Segmentation Sithar Norbu, Pema Choejey, Tenzin Dendup, Sarmad Hussain and Ahmed Muaz

16:50–17:15 Building NLP resources for Dzongkha: A Tagset and A Tagged Corpus Chungku Chungku, Jurmey Rabgay and Gertrud Faaß

17:15–17:40 Unaccusative/Unergative Distinction in Turkish: A Connectionist Approach Cengiz Acarturk and Deniz Zeyrek

Sunday August 22, 2010

Grammars and Parsing

9:00–9:25 A Preliminary Work on Hindi Causatives Rafiya Begum and Dipti Misra Sharma

9:25–9:50 A Supervised Learning based Chunking in Thai using Categorial Grammar Thepchai Supnithi, Chanon Onman, Peerachet Porkaew, Taneth Ruangrajitpakorn, Kanokorn Trakultaweekoon and Asanee Kawtrakul

9:50–10:15 A hybrid approach to Urdu verb phrase chunking Wajid Ali and Sarmad Hussain

10:15–10:40 Development of the Korean Resource Grammar: Towards Grammar Customization Sanghoun Song, Jong-Bok Kim, Francis Bond and Jaehyung Yang

10:40–11:00 Coffee Break

Grammars and Applications

11:00–11:25 An Open Source Urdu Resource Grammar Shafqat Mumtaz Virk, Muhammad Humayoun and Aarne Ranta

11:25–11:50 A Current Status of Thai Categorial Grammars and Their Applications Taneth Ruangrajitpakorn and Thepchai Supnithi

11:50–12:15 Chained Machine Translation Using Morphemes as Pivot Language Li Wen, Chen Lei, Han Wudabala and Li Miao

12:15 Concluding Remarks and Discussion

A Thesaurus of Predicate-Argument Structure for Japanese Verbs to Deal with Granularity of Verb Meanings

Koichi Takeuchi (Okayama University, [email protected])
Kentaro Inui (Tohoku University, [email protected])
Nao Takeuchi (Free Language Analyst)
Atsushi Fujita (Future University Hakodate, [email protected])

Abstract

In this paper we propose a framework of verb semantic description in order to organize different granularities of similarity between verbs. Since verb meanings highly depend on their arguments, we propose a verb thesaurus on the basis of possible shared meanings with predicate-argument structure. The motivations of this work are to (1) construct a practical lexicon for dealing with alternations, paraphrases and entailment relations between predicates, and (2) provide a basic database for statistical learning systems as well as for theoretical lexicon studies such as Generative Lexicon and Lexical Conceptual Structure. One of the characteristics of our description is that we assume several granularities of semantic classes to characterize verb meanings. The thesaurus form allows us to provide several granularities of shared meanings; thus, this gives us further room for applying more detailed analyses of verb meanings.

1 Introduction

In natural language processing, dealing with similarities/differences between verbs is essential not only for paraphrase but also for textual entailment and QA systems, which are expected to extract more valuable facts from massively large texts such as the Web. For example, in a QA system, assuming that the body text says "He lent her a bicycle", the answer to the question "He gave her a bicycle?" should be "No"; however, the answer to "She rented the bicycle?" should be "Yes". Thus, constructing a database of verb similarities/differences enables us to deal with detailed paraphrase/non-paraphrase relations in NLP.

From the view of current language resources, how can the shared/different meanings of "He lent her a bicycle" and "He gave her a bicycle" be described? The shared meaning of lend and give in the above sentences is that they are categorized as Giving Verbs, as in Levin's English Verb Classes and Alternations (EVCA) (Levin, 1993), while the different meaning will be that lend does not imply ownership of the theme, i.e., a bicycle. One of the problematic issues with describing shared meaning among verbs is that semantic classes such as Giving Verbs should be dependent on the granularity of meanings we assume. For example, the meaning of lend and give in the above sentences is not categorized into the same Frame in FrameNet (Baker et al., 1998). The reason for this different categorization can be considered to be that the granularity of the semantic class of Giving Verbs is larger than that of the Giving Frame in FrameNet.¹ From the view of natural language processing, especially in dealing with the propositional meaning of verbs, all of the above classes, i.e., the wider class of Giving Verbs containing lend and give as well as the narrower class of Giving Frame containing give and donate, are needed. Therefore, in this work, in order to describe verb meanings with several granularities of semantic classes, a thesaurus form is adopted for our verb dictionary.

Based on this background, this paper presents a thesaurus of predicate-argument structure for verbs on the basis of a lexical decompositional framework such as Lexical Conceptual Structure (Jackendoff, 1990).

¹We agree with the concept of Frame and FrameElements in FrameNet, but what we propose in this paper is the necessity for granularities of Frames and FrameElements.
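The lend/give/rent contrast can be made concrete with a toy lookup: whether two verbs count as paraphrases depends on the level of the class hierarchy at which they are compared. The following Python sketch is illustrative only and is not part of the paper; the class paths and the helper name shared_depth are hypothetical.

    # Toy illustration: paraphrase judgments depend on the granularity
    # at which two verbs share a semantic class (hypothetical data,
    # loosely following the paper's "Moving of One's Possession" example).
    CLASS_PATH = {
        "give": ["change_of_state", "change_of_position", "moving_of_possession"],
        "lend": ["change_of_state", "change_of_position", "moving_of_possession", "lending"],
        "rent": ["change_of_state", "change_of_position", "moving_of_possession", "renting"],
    }

    def shared_depth(v1: str, v2: str) -> int:
        """Depth of the deepest verb class shared by both verbs."""
        depth = 0
        for a, b in zip(CLASS_PATH[v1], CLASS_PATH[v2]):
            if a != b:
                break
            depth += 1
        return depth

    # At a coarse granularity lend/give share a possession-transfer class,
    # but at the finest granularity lend and rent diverge, so
    # "He lent her a bicycle" does not license "He gave her a bicycle".
    print(shared_depth("lend", "give"))   # 3: shared up to moving_of_possession
    print(shared_depth("lend", "rent"))   # 3: diverge at lending vs. renting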

Thus our proposed thesaurus can deal with argument-structure-level alternations such as causative, transitive/intransitive, and stative. Besides, taking a thesaurus form enables us to deal with shared/differentiated meanings of verbs with consistency; e.g., a verb class node for "lend" and "rent" can be described in a more detailed layer under the node "give".

We constructed this thesaurus for Japanese verbs, and the current status of the verb thesaurus is this: we have analyzed 7,473 verb meanings (4,425 verbs) and organized the semantic classes in a five-layer thesaurus with 71 semantic role types. Below, we describe background issues, basic design issues, what kinds of problems remain, limitations, and perspectives on applications.

2 Existing Lexical Resources and Drawbacks

2.1 Lexical Resources in English

Among previous lexical databases in English, several well-considered lexical databases are available, e.g., EVCA, Dorr's LCS (Dorr, 1997), FrameNet, WordNet (Fellbaum, 1998), VerbNet (Kipper-Schuler, 2005) and PropBank (Palmer et al., 2005). Besides, there is a research project (Pustejovsky and Meyers, 2005) to find a general descriptional framework of predicate argument structure by merging several lexical databases such as PropBank, NomBank, TimeBank and Penn Discourse Treebank.

Our approach corresponds partly to each lexical database (i.e., FrameNet's Frames and FrameElements correspond to our verb classes and semantic role labels, and the way we organize verb similarity classes with a thesaurus corresponds with WordNet's synsets), but is not exactly the same; namely, there is no lexical database describing several granularities of semantic classes between verbs with arguments. Of course, since the above English lexical databases have links with each other, it is possible to produce a verb dictionary with several granularities of semantic classes with arguments. However, the basic categories for classifying verbs would be a little different due to the different background theory of each English lexical database; it cannot be easy to add another level of semantic granularity while keeping consistency across all the lexical databases; thus, a thesaurus form needs to be the core form for describing verb meanings.²

2.2 Lexical Resources in Japanese

In previous studies, several Japanese lexicons were published. IPAL (IPA, 1986) focuses on morpho-syntactic classes, but IPAL is small.³ EDR (Jap, 1995) consists of a large-scale lexicon and corpus (see Section 3.4). EDR is a well-considered and wide-coverage dictionary focusing on translation between Japanese and English, but EDR's semantic classes were not designed with linguistically motivated lexical relations between verbs, e.g., alternation, causative, transitive, and detransitive relations between verbs. We believe these relations must be key for dealing with paraphrase in NLP. Recently, Japanese FrameNet (Ohara et al., 2006) and Japanese WordNet (Bond et al., 2008) have been proposed. Japanese FrameNet has currently published fewer than 100 verbs.⁴ Besides, Japanese WordNet contains 87,000 words and 46,000 synsets; however, there are three major difficulties in dealing with paraphrase relations between verbs: (1) there is no argument information; (2) the existence of many similar synsets forces us to solve fine disambiguation between verbs when we map a verb in a sentence to WordNet; (3) the basic verbs of Japanese (i.e., highly ambiguous verbs) are wrongly assigned to unrelated synsets because they were constructed by translation from English to Japanese.

²As Kipper (Kipper-Schuler, 2005) showed in their examples mapping between VerbNet and WordNet verb senses, most of the mappings are many-to-many relations; this indicates that two verbs grouped in the same semantic type in VerbNet can be categorized into different synsets in WordNet. Since WordNet has neither argument structure nor syntactic information, we cannot pursue what the different features between the synsets are.

³It contains 861 verbs and 136 adjectives.

⁴We are supplying our database to the Japanese FrameNet project.

3 Thesaurus of Predicate-Argument Structure

The proposed thesaurus of predicate-argument structure can deal with several levels of verb classes on the basis of the granularity of defined verb meanings. In the thesaurus we incorporate an LCS-based semantic description for each verb class that can provide several argument structures, as in construction grammar (Goldberg, 1995). This is a great advantage for describing the differentiating factors from the view of not only syntactic functions but also internal semantic relations. Thus, this characteristic of the proposed thesaurus can be a powerful framework for calculating similarity and difference between verb senses. In the following sections we explain the overall design of the thesaurus and its details.

3.1 Design of Thesaurus

The proposed thesaurus consists of a hierarchy of verb classes we assumed. A verb class, which is a conceptual class, has verbs with a shared meaning. A parent verb class includes the concepts of its subordinate verb classes; thus a subordinate verb class is a concretization of its parent verb class. A verb class has a semantic description that is a kind of semantic skeleton inspired by lexical conceptual structure (Jackendoff, 1990; Kageyama, 1996; Dorr, 1997). Thus a semantic description in a verb class describes the core semantic relations between arguments and shadow arguments of the shared meaning of the verb class. Since a verb can be polysemous, each verb sense is designated with example sentences. Verb senses with a shared meaning are assigned to a verb class. Every example sentence is analyzed into its arguments and semantic role types; then the arguments are linked to variables in the semantic description of the verb class. This indicates that one semantic description in a verb class can provide several argument structures on the basis of syntactic structure. This architecture is related to construction grammar.

Here we explain this structure using verbs such as rent, lend, give, hire, borrow, and lease. We assume that each verb sense we focus on here is designated by example sentences, e.g., "Mother gives a book to her child", "Kazuko rents a bicycle from her friend", and "Taro lends a car to his friend". Figure 1 shows that all of the above verb senses are involved in the verb class Moving of One's Possession.⁵ The semantic description, which expresses the core meaning of the verb class Moving of One's Possession, is

([Agent] CAUSE) BECOME [Theme] BE AT [Goal].

Here the brackets [] denote variables that can be filled with arguments in example sentences; likewise, parentheses () denote an occasional factor. "Agent" and "Theme" are semantic role labels that can be annotated to all example sentences.

Figure 1 shows that the children of the verb class Moving of One's Possession are the two verb classes Moving of One's Possession/Renting and Moving of One's Possession/Lending. In the Renting class are rent, hire and borrow, while in the Lending class, lend and lease exist. Both of the semantic descriptions in the children verb classes are more detailed than the parent's description.

[Figure 1: Example of verb classes and their semantic descriptions in a parent-children relation: the parent class Moving of One's Possession (give, rent, lend, hire, borrow, lease) carries the description ([Agent] CAUSE) BECOME [Theme] BE AT [Goal]; its children Renting (rent, hire, borrow) and Lending (lend, lease) carry more detailed descriptions.]

⁵The name of a verb class consists of its hierarchy in the thesaurus; Figure 1 shows abbreviated verb class names. The full verb class name is Change of State/Change of Position (Physical)/Moving of One's Possession.
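To make the data model concrete, the sketch below shows one way such a verb-class node could be represented in code. This is not the authors' implementation; the class name VerbClass, its fields, and the miniature hierarchy are hypothetical, mirroring Figure 1.

    from dataclasses import dataclass, field

    @dataclass
    class VerbClass:
        """A node in the verb thesaurus: a conceptual class of verb senses."""
        name: str                  # e.g. ".../Moving of One's Possession"
        description: str           # LCS-style semantic skeleton
        verbs: list[str]           # verb senses assigned to this class
        children: list["VerbClass"] = field(default_factory=list)

    # Miniature version of the hierarchy in Figure 1 (abbreviated names).
    possession = VerbClass(
        name="Moving of One's Possession",
        description="([Agent] CAUSE) BECOME [Theme] BE AT [Goal]",
        verbs=["give"],
        children=[
            VerbClass(
                name="Moving of One's Possession/Renting",
                description=("([Agent] CAUSE) (BY MEANS OF [Agent] renting "
                             "[Theme]) BECOME [Theme] BE AT [Agent]"),
                verbs=["rent", "hire", "borrow"],
            ),
            VerbClass(
                name="Moving of One's Possession/Lending",
                # The paper refines this description further; placeholder here.
                description="([Agent] CAUSE) BECOME [Theme] BE AT [Goal]",
                verbs=["lend", "lease"],
            ),
        ],
    )

    def classes_of(verb: str, node: VerbClass, path=()):
        """Yield every class path (coarse to fine) that contains the verb."""
        path = path + (node.name,)
        if verb in node.verbs:
            yield path
        for child in node.children:
            yield from classes_of(verb, child, path)

    print(list(classes_of("rent", possession)))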

A semantic description in the Renting class, i.e.,

([Agent] CAUSE) (BY MEANS OF [Agent] renting [Theme]) BECOME [Theme] BE AT [Agent],

describes the semantic relations between "Agent" and "Theme". Since semantic role labels are annotated to all of the example sentences, the variables in the semantic description can be linked to actual arguments in example sentences via the semantic role labels (see Figure 2).

[Figure 2: Linking between the semantic description of Moving of One's Possession/Renting and example sentences.]

3.2 Construction of Verb Class Hierarchy

To organize the hierarchical semantic verb classes, we take both a top-down and a bottom-up approach. For the bottom-up approach, we use verb senses defined by a dictionary as the most fine-grained meanings, and then we group verbs that can be considered to share some meaning. As the dictionary, we use the Lexeed database (Fujita et al., 2006), which consists of more than 20,000 verbs with explanations of word senses and example sentences.

For the top-down approach, we take three semantic classes, State, Change of State, and Activity, as the top-level semantic classes of the thesaurus, according to Vendler's aspectual analysis (Vendler, 1967) (see Figure 4). This is because these three classes can be useful for dealing with the propositional, and especially the resultative, aspect of verbs. For example, "He threw a ball" can be an Activity and have no special result, but "He broke the door" can be a Change of State, and then we can imagine a result, i.e., a broken door. When other verb senses can express the same results, e.g., "He destroyed the door," we would like to regard them as having the same meaning.

We define verb classes in the intermediate hierarchy by grouping verb senses on the basis of aspectual category (i.e., action, state, change of state), argument type (i.e., physical, mental, information), and more detailed aspects depending on the aspectual category. For example, walk the country, travel all over Europe and get up the stairs can be considered to be in the Move on Path class.

The verb class is as essential for dealing with verb meanings as synsets are in WordNet. Even if we had given an incorrect class name, the thesaurus will work well as long as the whole hierarchy keeps the is-a relation, namely, the hierarchy does not contain any multiple inheritance.

The most fine-grained verb class above the individual verb sense is a little wider than alternations. Currently, for the fine-grained verb classes, we are organizing what kinds of differentiating features can be assumed (e.g., manner, background, presupposition, etc.).

3.3 Semantic Role Labels

The aim of describing the arguments of a target verb sense is (1) to link the same role arguments in related verb senses and (2) to provide disambiguated information for mapping a surface expression to a verb sense. The Lexeed database provides a representative sentence for each word sense. The sentence is simple, without adjunctive elements such as unessential time, location or method. Thus, a sentence is broken down into subject and object, and semantic role labels are annotated to them (Figure 3).

[Figure 3: An example of semantic role labeling. ex.: nihon-ga shigen-wo yunyuu-suru; trans.: Japan (NOM) imports resources (ACC); argument structure: Agent, Theme.]

Of course, only one representative sentence would miss some essential arguments; also, we

[Figure 4: Thesaurus and corresponding lexical decomposition: hierarchical verb classes (Activity; Change of State; Change of Position (physical); Move to Goal; State/Exist) are paired with CAUSE/ACT ON/BECOME/BE AT super-event and sub-event structures and antonymy links, illustrated with the Japanese verbs oku 'put on', idou-suru 'move to', aru 'be', and sonzai-suru 'exist', as in "Ken-ga hon-wo tana-ni oku" (Ken puts a book on a shelf), "hon-ga tana-ni idou-suru" (A book moves to a shelf), and "hon-ga tana-ni aru" (A book is on a shelf).]

do not know how many arguments are enough. This could be solved by adding examples⁶; however, we consider the semantic role labels of each representative sentence in a verb class to be an example of the argument structure assumed for that verb class. That is to say, we regard a verb class as the concept of an event and suppose it to be a fixed argument frame for each verb class. The argument frame is described as compositional relations.

The principal function of the semantic role label name is to link arguments within a verb class. One exception is the Agent label, which can be a marker discriminating transitive and intransitive verbs. Since the semantic class Change of State of the thesaurus focuses on transitive alternation, cases such as "The president expands the business" and "The business expands" can be categorized into the same verb class. These two examples are then differentiated by the Agent label.

3.4 Compositional Semantic Description

As described in Section 3.1, we incorporate a compositional semantic structure into each verb class to describe syntactically motivated lexical semantic relations and entailed meanings, which will expand the thesaurus. The benefit of the compositional style is that entailed meanings can be linked in a compositional manner. As an example of entailment, Figure 5 shows that the verb class Move to Goal entails the Theme being at the Goal, and this partially corresponds to the verb class Exist.

[Figure 5: Compositional semantic description: the description of Move to Goal, ([A] CAUSE) BECOME [T] BE AT [G], partially corresponds to the description of Exist, [T] BE AT [G].]

In this verb thesaurus, differently from previous LCS studies, we try to ensure the compositional semantic description as much as possible by linking each sub-event structure to both a semantic class and example sentences. Therefore, we believe that our verb thesaurus can provide a basic example database for LCS study.

3.5 Intrinsic Evaluation on Coverage

We performed a manual evaluation of how well the proposed verb thesaurus covers verb meanings in news articles. The results on a Japanese news corpus show that the coverage of verbs is 84.32% (1,825/2,195) in 1,000 sentences randomly sampled from Japanese news articles.⁷ Besides, we took 200 sentences and checked whether the verb meanings in those sentences correspond to verb meanings in our thesaurus. The result shows that our thesaurus covers 99.5% (199/200) of the verb meanings of these 200 verbs.⁸

⁶We are currently constructing an SRL annotated corpus.

⁷Mainichi news articles in 2003.
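The sub-event linking in Section 3.4 lends itself to a simple mechanical reading: strip the causative super-event from a description and look up which class the remaining sub-event matches. The sketch below is our assumed formalization, not the authors' code; the string conventions and the helper entailed_classes are hypothetical.

    import re

    # Hypothetical string encoding of the LCS-style descriptions in Figure 5.
    DESCRIPTIONS = {
        "Move to Goal": "([A] CAUSE) BECOME [T] BE AT [G]",
        "Exist": "[T] BE AT [G]",
    }

    def sub_event(description: str) -> str:
        """Drop the optional causative super-event and the BECOME operator,
        leaving the resulting state sub-event."""
        s = re.sub(r"^\(\[A\] CAUSE\)\s*", "", description)  # optional CAUSE
        s = re.sub(r"^BECOME\s*", "", s)
        return s

    def entailed_classes(cls: str) -> list[str]:
        """Classes whose description matches the sub-event of `cls`."""
        state = sub_event(DESCRIPTIONS[cls])
        return [name for name, d in DESCRIPTIONS.items()
                if name != cls and d == state]

    # Move to Goal entails its result state, which is Exist's description.
    print(entailed_classes("Move to Goal"))  # ['Exist']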

4 Discussions

4.1 Comparison with Existing Resources

Table 1 and Table 2 show a comparison of statistical characteristics with existing resources. In the tables, WM and Exp denote the numbers of word meanings and example sentences, respectively; SRL denotes the number of semantic role labels.

Table 1: Comparison to English resources

                FrameNet       WordNet        VerbNet
    Concepts    825 (Frame)    N/A (Synset)   237 (class)
    Version     1.3 (2007)     3.0 (2006)     2.2 (2006)
    Words       6,100          155,287        3,819
    WM          N/A            117,659        5,257
    Exp         13,500         48,349         N/A
    SRL         746            N/A            23
    POS         V,N,A,Ad       V,N,A,Ad       V
    Lang        E,O            E,O            E

Table 2: Comparison to Japanese resources

                EDR            JWordNet       Our Thesaurus
    Concepts    430,000        N/A            709 (class)
    Version     3.0 (2003)     0.92 (2009)    (2008)
    Words       410,000        92,241         4,425
    WM          270,000        56,741         7,473
    Exp         200,000        48,276         7,473
    SRL         28             N/A            71
    POS         all            V,N,A,Ad       V
    Lang        E,J            E,J            J

Looking at the number of concepts, our thesaurus has 709 types of concepts (verb classes), which is similar to FrameNet and more than VerbNet. This seems natural, because FrameNet is also a language resource constructed on argument structure. Thanks to our thesaurus format, if we need more fine-grained concepts, we can expand the thesaurus by adding concepts as new nodes at the lowest layer of the hierarchy. As for the number of SRLs, FrameNet has many more types than our thesaurus, while the other resources, VerbNet and EDR, have fewer SRLs than ours. This comes from different design decisions about semantic role labels: FrameNet differentiates argument types on the basis of the assumed concept, i.e., the Frame, whereas we try to merge the same types of meaning in arguments. VerbNet and EDR also defined abstracted SRLs; the difference between those resources and our thesaurus is that our SRLs are defined taking into account the kinds of roles in the core concept, i.e., the verb class, while the SRLs in VerbNet and EDR do not depend on the verb's class.

Table 2 shows that our thesaurus does not have large numbers of registered words and examples compared to EDR and JWordNet. As stated in Section 3.5, the coverage of our verb classes on newspaper articles is high, but we are adding examples by constructing a Japanese corpus annotated with SRLs and verb classes.

4.2 Limitations of the Developed Thesaurus

One of the difficulties of annotating the semantic class of a word sense is that a word sense can be considered as belonging to several semantic classes. The proposed verb thesaurus can deal with multiple semantic classes for a verb sense by adding it to several nodes in the thesaurus. However, this does not seem to be the correct approach. For example, what kind of Change of State semantic class can be considered in the following sentence?

a. He took on a passenger.

Assuming that passenger is the Theme, Move to Goal could be possible when we regard the vehicle⁹ as the Goal. As another semantic class, Change State of Container could be possible when we regard the vehicle as a container. Currently, each verb sense is linked to only the one semantic class that can be considered the most closely related.

⁸This evaluation was done by one person. Of course, we need to have several people check this and take inter-annotator agreement.

⁹The vehicle does not appear in the surface expression, but a vehicle can exist. We currently describe the shadow argument in the compositional description, but it would be hard to prove the existence of a shadow argument.

From the user side, i.e., dealing with the propositional meaning of the event in sentence (a.), various meanings should be estimated. Consider the following sentence:

b. Thus, we were packed.

The semantic class Change State of Container of sentence (a.) could better explain why they are packed in sentence (b.).

The other related issue is how we describe the scope, e.g.,

c. He is hospitalized.

If we take the meaning in a simple manner, Move to Goal can be the semantic class. This can be correct from the view of annotation, but we can also guess that he cannot work or that he will have a tough time as following events. FrameNet seems to be able to deal with this by means of a special type of linking between Frames.

Consequently, we think the above issues of semantic class should depend on the demands of the application side. Since we do not currently know all of the requirements of NLP applications, it must be sufficient to provide an expandable descriptional framework of linguistically motivated lexical semantics.

4.3 Remaining and Conceivable Ways of Extension

One of the aims of the proposed dictionary is to identify sentences that have the same meaning among different expressions. One of the challenging paraphrase relations is between sentences expressed from different points of view. Given the buying and selling in Figure 6, a human can understand that both sentences denote almost the same event from different points of view. This indicates that sentences made by humans usually contain the point of view of the speaker. This is similar to a camera, and we need to normalize the expressions as to their original meaning.

[Figure 6: Requirement of normalization to deal with different expressions from different views: the same transfer of C between A and B can be expressed as "I sold C to B" (give/present) or "I bought C from A"; the camera (= person) metaphor shows that the difference of standing point needs to be normalized by estimation.]

We consider that NLP application researchers need to relate these expressions. Logically, if we know that "buy" and "sell" have the shared meaning of giving and taking things, we can describe their relations with "or" in logical form. Therefore, finding and describing these verb relations will be essential for dealing with the propositional meanings of a sentence.

As a further application, event matching to find a similar situation in Web documents is supposed to be a practical and useful application. Assuming that a user is confronted with the fact that the wireless LAN in the user's PC does not work, and the user wants to search for documents that provide a solution, the problem is that expressions of the situation will differ with the views of individual writers, e.g., "wireless LAN did not work" or "wireless LAN was disconnected". How can we find the same meaning in these expressions, and how can we extract the answers by finding the same situation in FAQ documents? To solve this, a lexical database describing verb relations between "go wrong" and "disconnect" must be the basis for estimating how similar the expressions can be. Therefore, constructing such a lexicon can be worthwhile for developing NLP applications.

5 Conclusion

In this paper, we presented a framework for a verb dictionary in order to describe shared meanings as well as to differentiate meanings between verbs, from the viewpoint of relating eventual expressions in NLP. One of its characteristics is that we describe verb relations on the basis of several

semantic granularities, using a thesaurus form with argument structure. Semantic granularity is the basis for how we categorize (or recognize) which semantic class relates to a verb meaning. Also, we examined the functions and limitations of semantic classes and argument structure from the viewpoint of dealing with paraphrases. That is, the required semantic classes will be highly dependent on applications; thus, the framework of the verb-sense dictionary should have expandability. The proposed verb thesaurus can take several semantic granularities; therefore, we hope the verb thesaurus will be applicable to NLP tasks.¹⁰

In future work, we will continue to organize differentiated semantic classes between verbs and to develop a system to identify descriptions of the same event.

Acknowledgments

We would like to thank NTT Communication Research Laboratory for allowing us to use the Lexeed database, and Professor Koyama for many useful comments.

¹⁰The proposed verb thesaurus is available at: http://cl.cs.okayama-u.ac.jp/rsc/data/ (in Japanese).

References

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 86-90.

Bond, Francis, Hitoshi Isahara, Kyoko Kanzaki, and Kiyotaka Uchimoto. 2008. Construction of Japanese WordNet from multi-lingual WordNet. In Proceedings of the 14th Annual Meeting of Japanese Natural Language Processing, pages 853-856.

Dorr, Bonnie J. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):271-325.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Fujita, Sanae, Takaaki Tanaka, Francis Bond, and Hiromi Nakaiwa. 2006. An implemented description of Japanese: The Lexeed dictionary and the Hinoki treebank. In COLING/ACL06 Interactive Presentation Sessions, pages 65-68.

Goldberg, Adele E. 1995. Constructions. The University of Chicago Press.

IPA: Information-Technology Promotion Agency, Japan. 1986. IPA Lexicon of the Japanese Language for Computers.

Jackendoff, Ray. 1990. Semantic Structures. MIT Press.

Japan Electronic Dictionary Research Institute, Ltd. 1995. EDR: Electronic Dictionary, the Second Edition.

Kageyama, Taro. 1996. Verb Semantics. Kurosio Publishers. (In Japanese).

Kipper-Schuler, K. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. thesis, University of Pennsylvania.

Levin, Beth. 1993. English Verb Classes and Alternations. University of Chicago Press.

Ohara, Kyoko Hirose, Seiko Fujii, Toshio Ohori, Ryoko Suzuki, Hiroaki Saito, and Shun Ishizaki. 2006. Frame-based contrastive lexical semantics and Japanese FrameNet: The case of risk and kakeru. In Proceedings of the Fourth International Conference on Construction Grammar. http://jfn.st.hc.keio.ac.jp/ja/publications.html.

Palmer, Martha, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71-105.

Pustejovsky, J., Martha Palmer, and A. Meyers. 2005. Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference. In Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 5-12.

Vendler, Zeno. 1967. Linguistics in Philosophy. Cornell University Press.

Collaborative Work on Indonesian WordNet through Asian WordNet (AWN)

Hammam Riza, Budiono, and Chairil Hakim

Agency for the Assessment and Application of Technology (BPPT), Indonesia
[email protected], [email protected], [email protected]

Abstract

This paper describes collaborative work on developing the Indonesian WordNet in the Asian WordNet (AWN). We describe the collaborative editing method developed to review and complete the translation of synsets. This work aims to create linkage among Asian languages by adopting the concepts of semantic relations and synsets expressed in WordNet.

The goal of the Indonesian AWN database management system is to share a multilingual lexical database of the Indonesian language that is structured along the same lines as the AWN. AWN is the result of a collaborative effort to create interconnected WordNets for Asian languages. AWN provides a free and public platform for building and sharing among its members; a distributed database system and user-friendly tools have been developed for users, making AWN easy to build and share.

This paper describes a manual interpretation method for Indonesian in AWN, based on a web services architecture focusing on cross-lingual distribution. We use a collective intelligence approach to build the English equivalents. In the sequel, Section 2 describes how the collaborating builders work on the web interface at www.asianwordnet.org. Section 3, Interpretation of Indonesian AWN, gives a short description of the progress of English-Indonesian translation and the obstacles to translation.

1 Introduction

Multilingual lexicons are of foremost importance for intercultural collaboration to take place, as multilingual lexicons underlie several multilingual applications such as machine translation, terminology, and multilingual computing. WordNet is the resource used to identify shallow semantic features that can be attached to lexical units. The original WordNet is the English WordNet proposed and developed at Princeton University, the Princeton WordNet (PWN), which we localize by using a bilingual dictionary. WordNet covers the vast majority of nouns, verbs, adjectives and adverbs of the English language. The words are organized in synonym sets called synsets; each synset represents a concept and includes an impressive number of semantic relations defined across concepts.

The information encoded in WordNet is used in several stages of the parsing process. For instance, attribute relations, adjective/adverb classifications, and others are semantic features extracted from WordNet and stored together with the words, so that they can be directly used by a semantic parser.

2 Collaborative AWN

In the era of globalization, communication among languages becomes much more important, and people have been hoping that natural language processing and speech processing can assist in smoothing communication among people with different languages. However, for the Indonesian language especially, there was only little research in the past.

The Princeton WordNet is a semantic English lexical bank containing semantic relationships between words. Concept mapping is a process of organizing words and forming meaningful relationships between them.


To build a language WordNet, there are two main approaches under discussion: the merge approach and the expand approach. The merge approach is to build the taxonomies of the language (synsets) using English equivalent words from bilingual dictionaries; a sketch of this lookup appears after the member list below. The expand approach is to translate local words via the bilingual dictionaries; this approach shows the relations between senses. The system manages synset assignments according to the preferred score obtained from the revision process. As a result, the community's work is accumulated into the original form of the WordNet database, and the synsets can generate cross-language results.

AWN also introduces a web-based collaborative workbench for revising the results of synset assignment, and provides a framework to create the AWN via linkage through PWN synsets. AWN enables individual contributors to connect and collaborate in order to accomplish the text files.

At present, there are ten Asian languages in the community, the number of translated synsets has been increasing, and many organizations collaborate in AWN:

• Agency for the Assessment and Application of Technology (BPPT), Indonesia
• National Institute of Information and Communications Technology (NICT), Japan
• Thai Computational Linguistics Laboratory (TCL), Thailand
• National Electronics and Computer

Technology Center (NECTEC), Thailand
• National University of Mongolia (NUM), Mongolia
• Myanmar Computer Federation (MCF), Myanmar
• National Authority of Science and Technology (NAST), Lao PDR
• Madan Puraskar Pustakalaya (MPP), Nepal
• University of Colombo School of Computing (UCSC), Sri Lanka
• Vietnamese Academy of Science and Technology (VAST), Vietnam

[Fig 1. Collaboration on Asian WordNet: the AsianWordNet hub connects NUM (Mongolia), MCF (Myanmar), BPPT (Indonesia), NAST (Lao PDR), NICT (Japan), MPP (Nepal), VAST (Vietnam), UCSC (Sri Lanka), and TCL/NECTEC (Thailand).]

3 Interpretation of Indonesian AWN

The Indonesian WordNet has been used for general-purpose translation. Our approach was to generate the query for the web services engine in English and then to translate every key element of the query (topic, focus, keywords) into Indonesian without modifying the query. The dictionary is distinguished by its set of entry-word characteristics, clear definitions, and guidance on usage. All dictionary information for entries is structured, covering entry words, multiple-word entries, notes, contemporary definitions, derivations, example sentences, idioms, etc. All dictionaries are implemented as text files and as linguistic databases connected to the Indonesian AWN. The set of language tags consists of part of speech, case, gender, number, tense, person, voice, aspect, mood, form, type, reflexivity, and animacy.

3.1 Progress of English-Indonesian Translation

The Indonesian WordNet uses the WordNet Management System (WNMS) tools developed by AsianWordNet to create web services among Asian languages based on Princeton WordNet® version 3.0; the cooperation between TCL and BPPT was established in October 2007.

As presented above, we follow the merge approach to create and share the Indonesian WordNet by translating each synonym set, expanding an appropriate synset to a lexical entry by considering its English equivalent.

We plan a reliable process to create and share the Indonesian WordNet in AWN. We assigned this work to four AWN translators participating in the Indonesian AWN project.
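As a rough illustration of the merge approach described above, the sketch below walks a PWN synset's English members through a bilingual dictionary to propose Indonesian translation candidates for later review. This is not the WNMS implementation; the dictionary contents and the function name candidate_translations are hypothetical.

    # Hypothetical bilingual English -> Indonesian dictionary fragment.
    EN_ID = {
        "car": ["mobil"],
        "automobile": ["mobil"],
        "empty": ["kosong", "hampa", "penat"],
    }

    def candidate_translations(synset_members: list[str]) -> dict[str, int]:
        """Propose Indonesian candidates for a PWN synset by dictionary
        lookup; candidates supported by more English members rank higher."""
        votes: dict[str, int] = {}
        for word in synset_members:
            for translation in EN_ID.get(word, []):
                votes[translation] = votes.get(translation, 0) + 1
        return dict(sorted(votes.items(), key=lambda kv: -kv[1]))

    # The synset {car, automobile} converges on a single candidate,
    # which a human reviewer would then approve or revise in the workbench.
    print(candidate_translations(["car", "automobile"]))  # {'mobil': 2}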

Each translator was given a monthly target of at least 3,000 senses, so that the total achievement would be 12,000 senses per month. With 117,659 senses in total, the work is expected to be completed within 10 months. In the process of mapping, a unique word is generated for every lexical entry. The grammatical dictionaries contain normalized entry words with hyphenation paradigms plus grammatical tags.

Table 1. Statistics of synsets

    Assignment   March    April    May      TOTAL
    Noun         10,560   14,199   16,832    82,115
    Verb          6,444    6,444    6,499    13,767
    Adjective     1,392    1,392    1,936    18,156
    Adverb          481      481      488     3,621
    Total        18,877   22,516   25,755   117,659

In the evaluation of our approach for synset assignment, we randomly selected senses from the result of synset assignment to the English-Indonesian dictionary for manual checking. The random set covers all types of part of speech, with the best information of English equivalents marked with CS=5. Each word entry must be translated into the appropriate words according to the meaning explanation.

Table 1 presents the total assignment of words translated into Indonesian for the second and third months. Following the manual procedure of translating the English AWN into Indonesian, this is the resulting progress of AWN at this time. We start to translate or edit from some groups of base types in "By Category"; these base types are based on categories from PWN. Only 21.89% (25,755 approved of 117,659 senses) of the total number of synsets could so far be assigned to a lexical entry in the Indonesian-English dictionary.

An automatic compilation of the dictionary in AWN has translational issues. There are many problematic cases in the sense explanations: one word in English may be translated into many Indonesian words, and a gloss can be expressed by more than one Indonesian word (Ex. 1). One of the main obstacles in studying the absorption of English words into Indonesian is the fact that the original forms of some words have been removed due to the Indonesian reform process, in which some words have gone through an artificial process. There is no special character in an Indonesian word, especially in technical words, whose meaning is essentially the same as the English word's (Ex. 2).

Ex. 1. time frame
    POS         noun time
    synset      time_frame
    gloss       a time period during which something occurs or is expected
                to occur; "an agreement can be reached in a reasonably
                short time frame"
    Indonesian  jangka waktu, selang waktu

Ex. 2. resolution
    POS         noun phenomenon
    synset      resolution
    gloss       (computer science) the number of pixels per square inch on
                a computer-generated display; the greater the resolution,
                the better the picture
    Indonesian  resolusi
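The reported completion rate can be reproduced directly from Table 1, and the CS=5 marking suggests a simple filter over assignments. The snippet below is our illustration, not part of the paper; the record format is invented.

    # Reproduce the completion rate reported in the paper from Table 1.
    approved, total = 25_755, 117_659
    print(f"{100 * approved / total:.2f}%")  # 21.89%

    # Hypothetical assignment records: keep only those whose confidence
    # score (CS) is 5, which the paper treats as the best equivalents.
    assignments = [
        {"synset": "time_frame", "indonesian": "jangka waktu", "cs": 5},
        {"synset": "resolution", "indonesian": "resolusi", "cs": 3},
    ]
    approved_entries = [a for a in assignments if a["cs"] == 5]
    print(approved_entries)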

3.2 Obstacles of Indonesian Translation

WordNet has unique meanings of words, which are presented in synonym sets. Each synset has a gloss which defines the concept it represents. For example, the words car, auto, automobile, and motorcar share one synset. Using definitions from the WordNet electronic lexical database raises a major problem of natural language processing, that of lexical ambiguity, be it syntactic or semantic: each single word must be a container for some part of the linguistic knowledge needed to resolve an ambiguous WordNet sense.
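The car/auto/automobile/motorcar example can be checked against WordNet programmatically. The snippet below uses NLTK's WordNet interface, which is a real library; its use here is our illustration and not part of the paper.

    from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

    # The first noun synset of "car" groups the lemmas the paper lists.
    car = wn.synsets("car", pos=wn.NOUN)[0]
    print(car.name())         # e.g. 'car.n.01'
    print(car.lemma_names())  # e.g. ['car', 'auto', 'automobile', ...]
    print(car.definition())   # the gloss defining the shared concept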

Therefore, no single heuristic alone translates Indonesian words. WordNet defines several semantic relations; these categories, using the lexicographer files and glossary definitions, are assigned weights in a range. In the WordNet hierarchy for the first sense of the word "empty" there are 10 synset words; three of the ten that are related to the meaning are given in Ex. 3. Three concepts recur in the WordNet literature that entail a certain amount of ambiguity: terminological distance, semantic distance and conceptual distance. Terminological distance often appears to refer to the suitability of the word selected to express a given concept; semantic distance is understood to mean the contextual factor of precision in meaning; and the conceptual distance between words is where relations have proved useful.

Ex. 3. empty
    POS         noun art
    synset      empty
    gloss       a container that has been emptied; "return all empties to
                the store"
    Indonesian  hampa

empty
    POS         verb change
    synset      empty, discharge
    gloss       become empty or void of its content; "The room emptied"
    Indonesian  mengosongkan

empty
    POS         adjective
    synset      empty
    gloss       emptied of emotion; "after the violent argument he felt
                so empty"
    Indonesian  kosong, penat

Disambiguation is unquestionably the most abundant and varied application, given the precision and relevance required in response to a query and its inconsistencies. Schematically, in semantic disambiguation, senses are selected from the glossaries of each noun, verb, and adjective and their subordinates.

WordNet information, whose objective is to build designs for the association between sentences and coherence relations as well as to find lexical characteristics in coherence categories, became an ancillary tool for semantic ontology design geared to high-quality information extraction from web services. A comparative analysis of trends in WordNet use shows:

1. Support for the design of grammatical categories designed to classify information by aspects and traits, and in particular to design and classify semantic ontologies.
2. A basis for the development of audio-visual and multimedia information retrieval systems.

4 Internet Viewer

The pilot internet service based on WordNet 3.0 is published at http://id.asianwordnet.org.

5 Discussion and Conclusion

Any multilingual process such as cross-lingual information retrieval must involve resources for the language pair, and language-specific processing can be applied in parallel to achieve the best result.

In this paper we described the manual sharing of Indonesian in the AWN by using dictionaries. AWN provides a free and public platform for building and sharing among its members. We want to continue the work on learning the service matching system; our future work on AWN will focus on the WordNet development platform and language technology web services.

Although the AWN application is proceeding steadily, its limitations are:

1. AWN is designed for manual work, so authenticity cannot be taken as a reference.
2. Classification was performed manually, which means that the reasons for and depth of classification may not be consistent.

References

Balkova, Valentina, Andrey Suhonogov, and Sergey Yablonsky. 2004. Russian WordNet: From UML-notation to Internet/Intranet Database Implementation. In Proceedings of the Second International WordNet Conference (GWC 2004).

Riza, H., Budiono, P. Adiansya, and M. Henky. 2008. I/ETS: Indonesian-English Machine Translation System using Collaborative P2P Corpus. Agency for the Assessment and Application of Technology (BPPT), Indonesia; University of North Texas.

Shi, Lei, and Rada Mihalcea. 2005. Putting Pieces Together: Combining FrameNet, VerbNet, and WordNet for Robust Semantic Parsing.

Thoongsup, S., Kergrit Robkop, Chumpol Mokarat, and Tan Sinthurahat. 2009. Thai WordNet Construction. Thai Computational Linguistics Lab., Thailand.

Sornlertlamvanich, Virach. 2010. The 5th International Conference of the Global WordNet Association (GWC-2010), 31st Jan. - 4th Feb., 2010.

Fragos, Kostas, Yannis Maistros, and Christos Skourlas. 2004. Word Sense Disambiguation using WordNet Relations. Dept. of Computer Engineering, NTUA, Greece.

www.asianwordnet.org

Considerations on Automatic Mapping Large-Scale Heterogeneous Language Resources: Sejong Semantic Classes and KorLex

Heum Park (Center for U-Port IT Research and Education, Pusan National University, [email protected])
Aesun Yoon (LI Lab., Dept. of French, Pusan National University, [email protected])
Woo Chul Park and Hyuk-Chul Kwon* (AI Lab., Dept. of Computer Science, Pusan National University, [email protected])

Abstract

This paper presents an automatic mapping method among large-scale heterogeneous language resources: Sejong Semantic Classes (SJSC) and KorLex. KorLex is a large-scale Korean WordNet, but it lacks specific syntactic & semantic information. The Sejong Electronic Dictionary (SJD), whose semantic segmentation depends on SJSC, has much lower lexical coverage than KorLex, but shows refined syntactic & semantic information. The goal of this study is to build a rich language resource for improving Korean semantico-syntactic parsing technology. Therefore, we consider integration of the two and propose an automatic mapping method with three approaches: 1) Information of Monosemy/Polysemy of Word senses (IMPW), 2) Instances between Nouns of SJD and Word senses of KorLex (INW), and 3) Semantically Related words between Nouns of SJD and Synsets of KorLex (SRNS). We obtain good performance using the three approaches combined: recall 0.837, precision 0.717, and F1 0.773.

1 Introduction

While remarkable progress has been made in Korean language engineering on the morphological level during the last two decades, syntactic and semantic processing has progressed more slowly. Syntactic and semantic processing requires 1) linguistically and formally well-defined argument structures with the selectional restrictions of each argument, 2) large and semantically well-segmented lexica, and 3) most importantly, interrelationship between the argument structures and the lexica. A couple of language resources have been developed or can be used for this end. The Sejong Electronic Dictionaries (SJD) for nouns and predicates (verbs and adjectives), along with semantic classes (Hong 2007), were developed for syntactic and semantic analysis, but the current versions do not contain enough entries for concrete applications, and they show inconsistency problems. A Korean WordNet named KorLex (Yoon & al, 2009), which was built on Princeton WordNet 2.0 (PWN) as its reference model, can provide means for shallow semantic processing but does not contain refined syntactic and semantic information specific to the Korean language. The Korean Standard Dictionary (STD) provides a large number of entries, but it lacks systematic description and formal representation of word senses, like other traditional dictionaries for humans. Given these resources, which were developed through long-term projects (5-10 years), integrating them should result in significant benefits to Korean syntactic and semantic processing.

The primary goal of our recent work, including the work reported in this paper, is to build a language resource which will improve Korean semantico-syntactic parsing technology. We proceed by integrating the argument structures as provided by SJD and the lexical-semantic hierarchy as provided by KorLex. SJD is a language resource in which all word senses are labeled according to Sejong semantic classes (SJSC), and in which selectional restrictions are represented in SJSC for the argument structures of predicates.
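The reported scores are consistent with the standard F1 definition, the harmonic mean of precision and recall. The quick check below uses that standard formula; the formula itself is not spelled out in the portion of the paper reproduced here.

    # Standard F1: harmonic mean of precision (P) and recall (R).
    def f1(p: float, r: float) -> float:
        return 2 * p * r / (p + r)

    print(round(f1(0.717, 0.837), 3))  # 0.772; within rounding of the reported 0.773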

1 Introduction semantico-syntactic parsing technology. We While remarkable progress has been made in proceed by integrating the argument structures Korean language engineering on morphologi- as provided by SJD, and the lexical-semantic cal level during last two decades, syntactic and hierarchy as provided by KorLex. SJD is a semantic processing has progressed more language resource, of which all word senses slowly. The syntactic and semantic processing are labeled according to Sejong semantic requires 1) linguistically and formally well classes (SJSC), and in which selectional re-

* Corresponding Author

14 Proceedings of the 8th Workshop on Asian Language Resources, pages 14–21, Beijing, China, 21-22 August 2010. c 2010 Asian Federation for Natural Language Processing strictions are represented in SJSC as for the propose an automatic mapping method with argument structures of predicates. KorLex is a three approaches from semantic classes of SJD large scale language resource, of which the to synsets of KorLex. In Section 5, we com- lexical-semantic hierarchies and other lan- pare the results of automatic mapping with guage–independent semantic relations between those of manual mapping. In Section 6, we synsets (synonym sets) share with those of draw conclusions and future works. PWN, and of which Korean language specific information comes from STD. The secondary 2 Related Works goal is the improvement of three resources as a Most existing mappings of heterogeneous lan- result of comparing and integrating them. guage resources were conducted manually by In this paper, we report on one of the operat- language experts. The Suggested Upper ing steps toward to our goals. We linked each Merged Ontology (SUMO) had been fully word sense of KorLex to that of STD by hand, linked to PWN. For manual mapping of be- when the former was built in our previous tween PWN and SUMO, it was considered work (Yoon & al. 2009). All predicates in SJD synonymy, hypernymy and instantiation be- were mapped to those of STD on word sense tween synsets of PWN and concepts of SUMO, level by semi-automatic mapping (Yoon, and found the nearest instances of SUMO for 2010). Thus KorLexVerb and KorLexAdj have synsets of PWN. Because the concept items of syntactico-semantic information on argument SUMO are much larger than those of PWN, it structures via this SJD - STD mapping. How- could be mapped between high level concepts ever, the selectional restrictions provided by of PWN and synonymy concepts of SUMO SJD are not useful, if SJSC which represents easily. (Ian Niles et al 2003). Dennis Spohr the selectional restrictions in SJD is not linked (2008) presented a general methodology to to KorLex. We thus conduct two mapping me- mapping EuroWordNet to the SUMO for ex- thods between SJSC and upper nodes of Kor- traction of selectional preferences for French. LexNoun: 1) manual mapping by a PH.D in Jan Scheffczyk et al. (2006) introduced the computational semantics (Bae & al. 2010), and connection of FrameNet to SUMO. They pre- 2) automatic mapping. This paper reports the sented general-domain links between Frame- latter. Reliable automatic mapping methods Net Semantic Types and SUMO classes in among heterogeneous language resources SUOKIF and developed a semi-automatic, should be considered, since the manual map- domain-specific approach for linking Frame- ping among large-scale resources is a very Net Frame Elements to SUMO classes time and labor consuming job, and might lack (Scheffczyk & al. 2006). Sara Tonelli et al. consistency. Less clean resources are, much (2009) presented a supervised learning frame- harder and more confusing manual mapping is. work for the mapping of FrameNet lexical In this paper, we propose an automatic map- units onto PWN synsets to solve limited cover- ping method of those two resources with three age of semantic phenomena for NLP applica- approaches to determine mapping candidate tions. Their best results were recall 0.613, pre- synsets of KorLex to a terminal node of SJSC: cision 0.761 and F1 measure 0.679. 
1) using information of monosemy/polysemy Considerations on automatic mapping me- of word senses, 2) using instances between thods among language resources were always nouns of SJD and word senses of KorLex, and attempted for the sake of efficiency, using si- 3) using semantically related words between milarity measuring and evaluating methods. nouns of SJD and word senses of KorLex. We Typical traditional evaluating methods be- compared the results of automatic mapping tween concepts of heterogeneous language re- method with three approaches with those of sources were the dictionary-based approaches manual mapping aforementioned. (Kozima & al 1993), the semantic distance In the following Section 2, we discuss re- algorithm using PWN (Hirst & al 1998), the lated studies concerning language resources scaling method by semantic distance between and automatic mapping methods of heteroge- concepts (Sussna 1997), conceptual similarity neous language resources. In Section 3, we between concepts (Wu & al 1994), the scaled introduce KorLex and SJD. In Section 4, we

the scaled semantic similarity between concepts (Leacock 1998), the semantic similarity between concepts using the IS-A relation (Resnik 1995), the measure of similarity between concepts (Lin 1998), and Jiang and Conrath's (1997) similarity computation synthesizing edge- and node-based techniques.

Satanjeev et al. (2003) presented a new measure of semantic relatedness between concepts for word sense disambiguation, based on the number of shared words (overlaps) in their definitions (glosses). The performance of their extended gloss overlap measure with a 3-word window was recall 0.342, precision 0.351, and F1 0.346. Siddharth et al. (2003) applied the Adapted Lesk Algorithm to a method of word sense disambiguation based on semantic relatedness. In addition, Alexander et al. (2006) reviewed the five existing methods for evaluating PWN-based measures of lexical semantic relatedness, and compared the performance of five typical measures of semantic relatedness for NLP applications and information retrieval; among them, Jiang-Conrath's method showed the best performance: precision 0.247, recall 0.231, and F1 0.211 for detection.

Many studies have presented a variety of adapted evaluating algorithms. Among them, Jiang-Conrath's method, Lin's measure of similarity, and Resnik's semantic similarity show good performance (Alexander & al. 2006; Daniele 2009).

3 Language Resources to Be Mapped

3.1 KorLex 1.5

KorLex 1.5 was constructed from 2004 to 2007. Different from its previous version (KorLex 1.0), which preserves all semantic relations among synsets of PWN, KorLex 1.5 modifies them by deleting/correcting existing synsets, adding new synsets, and converting the hierarchical structure. Currently, KorLex includes nouns, verbs, adjectives, adverbs, and classifiers: KorLexNoun, KorLexVerb, KorLexAdj, KorLexAdv, and KorLexClas, respectively. Table 1 shows the size of KorLex 1.5, in which 'Trans' is the number of synsets translated from PWN 2.0 and 'Total' is the number of synsets including both translated and manually added ones.

             Word Forms   Synsets              Word Senses
                          Trans      Total
KorLexNoun   89,125       79,689     90,134    102,358
KorLexVerb   17,956       13,508     16,923    20,133
KorLexAdj    19,698       18,563     18,563    20,905
KorLexAdv    3,032        3,664      3,664     3,123
KorLexClas   1,181        -          1,377     1,377
Total        130,992      115,424    130,661   147,896
Table 1. Product of KorLex 1.5

KorLexNoun includes 25 semantic domains with 11 unique beginners and a maximum depth of 17 levels, and KorLexVerb includes 15 semantic domains with 11 unique beginners and a maximum depth of 12 levels. Basically, KorLex synsets inherit the semantic information of the PWN synsets mapped to them. The synset information of PWN consists of synset ID, semantic domain, POS, word senses, semantic relations, frame information, and so on. We linked each word sense of KorLex 1.5 to that of STD by hand when the former was built in our previous work (Yoon & al. 2009). STD includes 509,076 word entries with about 590,000 word senses. It has wide coverage of general words and a variety of example sentences for each meaning. More than 60% of the word senses in KorLex 1.5 are linked to those of STD. KorLex 1.5 thus inherits the lexical relations described in STD, but both resources lack refined semantico-syntactic information.

3.2 Sejong Electronic Dictionary

SJD was developed manually by linguists during 1998-2007 as a general-purpose machine-readable dictionary for a variety of Korean NLP applications. Based on the Sejong semantic classes (SJSC), approximately 25,000 nouns and 20,000 predicates (verbs and adjectives, SJPD) contain refined syntactic and semantic information. SJSC is a set of hierarchical meta-languages classifying word senses; it includes 474 terminal nodes, 139 non-terminal nodes, and 6 unique beginners. Each unique beginner has a depth of from 2 to 7 levels. The Sejong Noun Dictionary (SJND) contains 25,458 entries and 35,854 word senses, with lexical information for each entry: semantic classes of SJSC, argument structures, selectional restrictions, semantically related words, derivational relations/words, etc.

[Figure 1. Correlation of lexical information among SJND, SJPD, and SJSC]

Figure 1 shows the correlation of lexical information among SJND, SJPD, and SJSC. Certainly, this information of SJD should be applied to a variety of NLP applications: information retrieval, text analysis/generation, machine translation, and various studies and educational uses. However, SJD has much lower lexical coverage than KorLex. A more serious problem is that SJND and SJPD are still noisy: the internal consistency inside each dictionary and the external interrelationship between SJND, SJPD, and SJSC need to be improved, as indicated by the dotted lines in Fig. 1.

4 Automatic Mapping from Semantic Classes of SJSC to Synsets of KorLex

KorLex and SJSC have different hierarchical structures, grain sizes, and lexical information, as mentioned above. For example, the semantic classes of SJSC are much bigger concepts in grain size than the synsets of KorLex: 623 concepts in SJSC vs. 130,000 synsets in KorLex. Determining their semantic equivalence thus needs to be firmly based on linguistic clues. Using the following three linguistic clues that we found, we propose an automatic mapping method from semantic classes of SJSC to synsets of KorLex, with three approaches to determining the mapping candidate synsets: 1) Information on Monosemy/Polysemy of Word senses (IMPW), 2) Instances shared between Nouns of SJD and Word senses of KorLex (INW), and 3) Semantically Related words shared between Nouns of SJD and Synsets of KorLex (SRNS).

For the automatic mapping method, the following processes were conducted: first, find the word senses of synsets that match the nouns of SJND for each semantic class; second, select mapping candidate synsets among them with the three approaches above; third, determine the least upper bound (LUB) synsets and the mapping synsets among the candidates; finally, link each semantic class of SJSC to all lower-level synsets of its LUB synsets.

4.1 Finding Matched Word Senses between Synsets and Nouns of SJND

For a semantic class of SJSC, we first find the word senses and synsets of KorLex that match the nouns of SJND classified into that semantic class. Figure 2 shows the matched word senses and synsets between nouns of SJND and synsets of KorLex for a semantic class: the left side shows nodes of semantic classes with their hierarchical structure, the center box shows the words of SJND matched (in bold) with word senses of synsets in KorLex, and the right side shows the matched word senses and synsets in KorLex's hierarchical structure.

[Figure 2. Matched word senses and synsets with nouns of SJND for a semantic class]

For example, the semantic class 'Atmospheric Phenomena' (the rectangle on the left) has nouns of SJND (the words in the center); the bold words are those matched with word senses of synsets from KorLex, and the underlined synsets on the right are the matched synsets with their synset IDs in KorLex. The notation for the automatic mapping process between semantic classes of SJSC and synsets of KorLex is as follows: a noun of SJND is ns, a matched noun is ns_m, and an un-matched noun is ns_u; a semantic class of SJSC is sc; a synset of KorLex is ss and a word sense of a synset is ws; a monosemous word is w_mono and a polysemous word is w_poly. A semantic class sc has nouns ns of SJND, divided into matched nouns ns_m and un-matched nouns ns_u by comparison with the word senses ws of a synset ss in KorLex. Thus a synset has word senses ss = {ws_1, ws_2, ..., ws_n} = {ns_m1, ns_m2, ..., ns_mk, ns_u(k+1), ns_u(k+2), ...}, and the nouns of SJND for a semantic class sc_1 are presented as ns(sc_1) = {ns_m1, ns_m2, ..., ns_mk, ns_u(k+1), ...}. Therefore, we can find the matched word senses ns_m1 ~ ns_mk for a semantic class sc from the nouns of SJND and the word senses of a synset ss in KorLex.
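As a concrete illustration of this matching step, here is a minimal runnable sketch in Python, assuming simple in-memory dictionaries; all names (sjnd_nouns, korlex_synsets) are illustrative, not the authors' actual data structures.

    def find_matched_senses(semantic_class, sjnd_nouns, korlex_synsets):
        """For one semantic class sc, return per synset the matched senses ns_m."""
        nouns = sjnd_nouns[semantic_class]            # nouns ns classified to sc
        matched = {}
        for synset_id, word_senses in korlex_synsets.items():
            ns_m = [ws for ws in word_senses if ws in nouns]
            if ns_m:
                matched[synset_id] = ns_m
        return matched

    # Toy data for the semantic class 'Atmospheric Phenomena'.
    sjnd = {'Atmospheric Phenomena': {'rain', 'snow', 'fog'}}
    korlex = {'10101': ['rain', 'drizzle'], '10102': ['chair', 'seat']}
    print(find_matched_senses('Atmospheric Phenomena', sjnd, korlex))
    # -> {'10101': ['rain']}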

4.2 Selecting Mapping Candidate Synsets

Using those matched synsets and word senses, we select mapping candidate synsets with three different approaches.

4.2.1 Using Information on Monosemy and Polysemy in KorLex

The first approach evaluates mapping candidate synsets using information on the monosemy/polysemy of the word senses of a synset. The candidate synsets are sorted into three categories: mc(A) is a most relevant candidate synset, mc(B) is a relevant candidate synset, and mc(C) is a reserved synset. Evaluation proceeds from the lowest-level synsets up to the top-level beginners. The process of the first approach is as follows.

1) For a synset that contains a single word sense, ss = {ws_1}: if the word sense is monosemous, the synset is categorized as a candidate synset mc(A); if it is polysemous, categorization is postponed until relatedness among siblings is evaluated: candidate mc(C).
2) For a synset having more than one word sense, ss = {ws_1, ws_2, ...}: if the matched words ns_m make up over 60% of the word senses of the synset, P_ss(ws) = count(ns_m)/count(ws) >= 0.6, we evaluate in the next step whether that synset is a mapping candidate.
3) If all matched words ns_m of a synset are monosemous, we categorize it as a candidate synset mc(A). If monosemous words make up over 50% of the matched words, P_ss(w_mono|ns_m) >= 0.5, it is evaluated as mc(B). For a synset whose matched words contain over 50% polysemous words, P_ss(w_poly|ns_m) >= 0.5, categorization is postponed until relatedness among siblings is evaluated: candidate synset mc(C).
4) Steps 1) to 3) are repeated for all synsets in order to evaluate the mapping candidate synsets, and the hierarchical structure is then constructed for all those synsets.

4.2.2 Using Instances between Nouns of SJND and Word Senses of KorLex

The second approach evaluates mapping candidate synsets by comparing instances between nouns of SJND and word senses of a synset. On the KorLex side, we used the examples of STD linked to the word senses of KorLex. Figure 3 shows instances of STD and SJND for the word sense 'Apple'.

[Figure 3. Instances of STD and SJND for the word sense 'Apple']

We reformulated the Lesk algorithm (Lesk 1987; Banerjee and Pedersen 2002) for comparing instances and evaluating mapping candidate synsets. The process of evaluating mapping candidate synsets is as follows.

1) Compare the instances of a noun ns of SJND with the examples of a word of STD linked to a word sense ws of a synset ss, and compute Relatedness-A(ns, ws) = score(instance(ns), example(ws)).
2) Compare all nouns ns of SJND for a semantic class with all nouns in the instances of STD linked to the word senses ws, and compute Relatedness-B(ns, ws) = score(for all ns, nouns(example(ws))).
3) If Relatedness-A(ns, ws) >= λ1 and Relatedness-B(ns, ws) >= λ2, the synset is evaluated as a candidate synset mc(A). If either Relatedness-A(ns, ws) >= λ1 or Relatedness-B(ns, ws) >= λ2, it is evaluated as a candidate synset mc(B). We obtained good performance when the thresholds λ1 and λ2 were between 1 and 4.
4) Steps 1) to 3) are repeated for all synsets in order to determine the mapping candidate synsets, and the hierarchical structure is then constructed for all those synsets.
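A minimal sketch of this instance comparison, assuming that the score() of the reformulated Lesk algorithm counts the words shared by two bags of words; the function names, toy data, and default thresholds are illustrative only.

    def overlap_score(words_a, words_b):
        """Lesk-style relatedness: number of word types shared by two bags."""
        return len(set(words_a) & set(words_b))

    def evaluate_candidate(ns_instances, ws_examples, class_nouns,
                           lambda1=1, lambda2=1):
        rel_a = overlap_score(ns_instances, ws_examples)   # Relatedness-A
        rel_b = overlap_score(class_nouns, ws_examples)    # Relatedness-B
        if rel_a >= lambda1 and rel_b >= lambda2:
            return 'mc(A)'
        if rel_a >= lambda1 or rel_b >= lambda2:
            return 'mc(B)'
        return 'mc(C)'                                     # postponed

    # Toy instance words around the word sense 'Apple' of Figure 3.
    print(evaluate_candidate(['fruit', 'tree', 'red'],
                             ['fruit', 'red', 'pie'],
                             ['fruit', 'orchard']))        # -> mc(A)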

4.2.3 Using Semantic Relatedness between Nouns of SJND and Synsets of KorLex

The third approach evaluates mapping candidate synsets by comparing the semantic relations, and the semantically related words, of a noun of SJND and of the word senses of a synset. To compute the relatedness between them, we again reformulated the computational formula of relatedness based on the Lesk algorithm (Lesk 1987; Banerjee & al. 2002). The process of evaluating mapping candidate synsets is as follows.

1) Compare the semantically related words (synonyms, hypernyms, hyponyms, and antonyms) of a noun of SJND with those of a synset of KorLex, and compute Relatedness-C(ns, ss) = score(relations(ns), relations(ss)).
2) Compare all nouns ns of SJND for a semantic class with the synonyms, hypernyms, and hyponyms of a synset of KorLex, and compute Relatedness-D(ns, ss) = score(for all ns, relations(ss)).
3) If Relatedness-C(ns, ss) >= λ3 and Relatedness-D(ns, ss) >= λ4, the synset is evaluated as a candidate synset mc(A). If either Relatedness-C(ns, ss) >= λ3 or Relatedness-D(ns, ss) >= λ4, it is evaluated as a candidate synset mc(B). We obtained good performance when the thresholds λ3 and λ4 were between 1 and 4.
4) Steps 1) to 3) are repeated for all synsets in order to determine the mapping synsets, and the hierarchical structure is then constructed for all those synsets.

4.3 Determining Least Upper Bound (LUB) Synsets and Mapping Synsets

Next, we determine the LUB synsets using the mapping candidate synsets and the hierarchical structure with its semantic relations: parent, child, and sibling. To determine the LUB and mapping synsets, we evaluate in a bottom-up direction. Using the relatedness among child-sibling candidate synsets, we evaluate whether their parent synset is a LUB synset or not. If the parent is a LUB synset, we evaluate its parent (the grandparent of the candidates) using the relatedness among its sibling synsets. If the parent is not a LUB, the candidate synsets mc(A) or mc(B) are determined to be mapping synsets (or LUBs), and the search for a LUB stops. For all semantic classes, we determine the LUB and mapping synsets. Finally, we link the LUB and mapping synsets to each semantic class of SJSC. The process of determining the LUB and mapping synsets is as follows.

1) Using the candidate synsets and their siblings, for all candidate synsets mc(A), mc(B), or mc(C) selected by the processes of Section 4.2, "Selecting Mapping Candidate Synsets", determine whether each is a LUB or not, and determine the final mapping synsets.
2) Among sibling synsets, if the ratio of count(mc(A)) to count(mc(A)+mc(B)+mc(C)) is over 60%, the parent synset of the siblings is evaluated as a candidate synset mc(A) and as a LUB.
3) If the ratio of count(mc(A)+mc(B)) to count(mc(A)+mc(B)+mc(C)) is over 70%, the parent of the siblings is evaluated as a candidate synset mc(A) and as a LUB. If the ratio of count(mc(A)+mc(B)) to count(mc(A)+mc(B)+mc(C)) is between 50% and 69%, the parent of the siblings is evaluated as a candidate synset mc(B) and as a LUB.
4) Otherwise, stop searching for a LUB for that synset and determine the final mapping synsets at the candidates' own level.
5) Repeat steps 1) to 4) until the LUB synsets and final mapping synsets are found.

[Figure 4. Hierarchical structure of mapping candidate synsets for a semantic class]

Figure 4 shows the hierarchical structure of mapping candidate synsets for the semantic class 'Furniture'. When the candidate synsets are '04004316' (Chair & Seat): mc(B), '04209815' (Table & Desk): mc(B), '14441331' (Table): mc(C), and '14436072' (Shoe shelf & Shoe rack): mc(A), we determine whether their parent synset '03281101' (Furniture) is a LUB or not, and evaluate it as a candidate synset mc(A) or mc(B). In this case, synset '03281101' (Furniture) is a candidate mc(A) and a LUB synset. For all semantic classes, we find the LUB and mapping synsets using the information of the hierarchical structure and the candidate synsets. Finally, we link each semantic class of SJSC to all lower-level synsets of the matched LUB synsets.
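The ratio rules of steps 2) to 4) can be made concrete as follows; this is a sketch of the per-parent decision only, with illustrative names, not of the full bottom-up traversal.

    def evaluate_parent(sibling_categories):
        """sibling_categories: e.g. ['mc(A)', 'mc(B)', 'mc(C)'].
        Returns (parent category or None, whether the parent is a LUB)."""
        total = len(sibling_categories)
        n_a = sibling_categories.count('mc(A)')
        n_ab = n_a + sibling_categories.count('mc(B)')
        if n_a / total > 0.6:                # step 2): mc(A) ratio over 60%
            return 'mc(A)', True
        if n_ab / total > 0.7:               # step 3): mc(A)+mc(B) over 70%
            return 'mc(A)', True
        if 0.5 <= n_ab / total <= 0.69:      # step 3): between 50% and 69%
            return 'mc(B)', True
        return None, False                   # step 4): stop searching upward

    # The 'Furniture' example of Figure 4: mc(B), mc(B), mc(C), mc(A).
    print(evaluate_parent(['mc(B)', 'mc(B)', 'mc(C)', 'mc(A)']))
    # -> ('mc(A)', True), matching the paper's conclusion for '03281101'.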

5 Experiments and Results

We experimented with automatic mapping between the 623 semantic classes of SJSC and the 90,134 noun synsets of KorLex, using the proposed automatic mapping method with its three approaches. To evaluate the performance, we used the results of manual mapping as correct answers; the manual mapping linked 474 semantic classes (terminal nodes) of SJSC to 65,820 synsets (73%, including 6,487 LUBs) among the 90,134 noun synsets of KorLex. We compared the results of automatic mapping with those of manual mapping. For the evaluation, we employed Recall, Precision, and the F1 measure: F1 = (2 * Recall * Precision) / (Recall + Precision).

Approaches   Recall   Precision   F1
1)           0.904    0.502       0.645
2)           0.774    0.732       0.752
3)           0.670    0.802       0.730
1)+2)        0.805    0.731       0.766
1)+3)        0.761    0.758       0.759
2)+3)        0.636    0.823       0.718
1)+2)+3)     0.838    0.718       0.774
Table 2. Performance of automatic mapping with the three approaches

Table 2 shows the performance of automatic mapping with the three approaches: 1) IMPW, 2) INW, and 3) SRNS. Rows '1)', '2)', and '3)' give the results of each single approach; '1)+2)', '1)+3)', and '2)+3)' give those of combining two approaches; and '1)+2)+3)' gives those of combining all three approaches, which yields the best performance: recall 0.838, precision 0.718, and F1 0.774. The first approach shows high recall but low precision, while the third approach shows low recall and high precision; '1)+3)' and '2)+3)' show good performance overall. Thus, the combined approaches give good performance.

Second, we compared the numbers of semantic classes, noun entries of SJND, and noun synsets and word senses of KorLex for each approach, after the mapping processes.

             SJD                         KorLex
Approaches   SC (SJSC)   Nouns (SJND)    Synsets        Word Senses
1)           473         18,575          54,943         69,970
2)           445         18,402          52,109         66,936
3)           413         18,047          49,768         64,003
1)+2)        463         18,521          52,563         67,109
1)+3)        457         18,460          51,786         66,157
2)+3)        383         17,651          48,398         62,063
1)+2)+3)     466 (98.3%) 18,542 (72.8%)  54,083 (60%)   69,259 (46.8%)
Table 3. Numbers of semantic classes, nouns of SJD, and synsets and word senses of KorLex

As shown in Table 3, approach '1)' maps the largest number of synsets. '1)+2)+3)' shows results similar to '1)' but has the best performance (see Table 2). The percentages in round brackets give the ratio of the results of automatic mapping to the original lexical data of Sejong and KorLex: 474 semantic classes of SJSC, 25,245 nouns of SJND, and 90,134 noun synsets and 147,896 word senses of KorLex.

In manual mapping, we mapped 73% (65,820) of the synsets of KorLex to the 474 semantic classes of SJSC; 24,314 of the 90,134 noun synsets were excluded. The reasons for exclusion were 1) inconsistency in the inheritance of lexical relations between parent and child in SJSC or KorLex, 2) inconsistency between the criteria for SJSC and the candidate synsets, 3) candidate synsets belonging to more than two semantic classes, 4) specific proper nouns (e.g., chemical compound names), and 5) polysemous abstract synsets (Bae & al. 2010).

In automatic mapping, we could map 60% (54,083) of the total noun synsets (90,134) of KorLex, which is 82.2% of the result of manual mapping. Compared with manual mapping, 11,737 synsets were excluded from automatic mapping. Most of them were 1) fine-grained synsets found at the lowest levels, 2) synsets having no word senses matched with those of SJND, 3) synsets with polysemous word senses, 4) word senses having poor instances in KorLex and in SJND, and 5) word senses in SJND having poor semantic relations.

Level   LUB   Ratio    Level   LUB   Ratio
1       18    0.6%     9       230   7.3%
2       18    0.6%     10      98    3.1%
3       174   5.5%     11      32    1.0%
4       452   14.3%    12      20    0.6%
5       616   19.5%    13      4     0.1%
6       570   18.0%    14      4     0.1%
7       486   15.4%    15      2     0.1%
8       442   14.0%    16-17   0     0%
Table 4. Numbers and ratio of LUB synsets excluded in automatic mapping
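For reference, the evaluation itself reduces to set comparison; a minimal sketch, assuming each mapping is represented as a set of (semantic class, synset) pairs, with illustrative names and toy data:

    def evaluate(auto_pairs, gold_pairs):
        """Recall, precision, and F1 of automatic vs. manual mapping."""
        tp = len(auto_pairs & gold_pairs)
        recall = tp / len(gold_pairs)
        precision = tp / len(auto_pairs)
        f1 = 2 * recall * precision / (recall + precision)
        return recall, precision, f1

    auto = {('Furniture', '03281101'), ('Furniture', '04004316')}
    gold = {('Furniture', '03281101'), ('Furniture', '14436072')}
    print(evaluate(auto, gold))   # -> (0.5, 0.5, 0.5)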

Table 4 shows the numbers and ratio of the LUB synsets excluded in automatic mapping at each level of depth. Most are synsets at levels 4-8 of the 17 levels of depth.

6 Conclusions

We proposed a novel automatic mapping method with three approaches to link the Sejong Semantic Classes and KorLex, using 1) information on the monosemy/polysemy of word senses, 2) instances of nouns of SJD and word senses of KorLex, and 3) semantically related words of nouns of SJD and synsets of KorLex. Finding common clues in the lexical information of the language resources involved is an important process in an automatic mapping method. Our proposed automatic mapping method with three approaches shows notable performance in comparison with other studies on automatic mapping among language resources: recall 0.838, precision 0.718, and F1 0.774. Therefore, building on these results, we can improve Korean semantico-syntactic parsing technology by integrating the argument structures provided by SJD and the lexical-semantic hierarchy provided by KorLex. In addition, we can enrich three resources (KorLex, SJD, and STD) as a result of comparing and integrating them. We expect this study to improve automatic mapping technology for other Korean language resources as well.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2007-0054887).

References

Sun-Mee Bae, Kyoungup Im, and Aesun Yoon. 2010. Mapping Heterogeneous Ontologies for the HLT Applications: Sejong Semantic Classes and KorLexNoun 1.5. Korean Journal of Cognitive Science, Vol. 21, Issue 1: 95-126.

Satanjeev Banerjee and Ted Pedersen. 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. Proceedings of CICLing 2002, LNCS 2276: 136-145.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, Vol. 32, Issue 1: 13-47.

C. Hong. 2007. The Research Report of Development of the 21st Century Sejong Dictionary. Ministry of Culture, Sports and Tourism, The National Institute of the Korean Language.

Soonhee Hwang, A. Yoon, and H. Kwon. 2010. KorLex 1.5: A Lexical Semantic Network for Korean Numeral Classifiers. Journal of KIISE: Software and Applications, Vol. 37, Issue 1: 60-73.

KorLex. 2007. Korean WordNet. Korean Language Processing Lab, Pusan National University. Available at http://korlex.cs.pusan.ac.kr

Ian Niles and Adam Pease. 2003. Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology. In Proceedings of the 2003 International Conference on Information and Knowledge Engineering (IKE 03).

Siddharth Patwardhan, Satanjeev Banerjee, and Ted Pedersen. 2003. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLing 2003, LNCS 2588: 241-257.

Jan Scheffczyk, Adam Pease, and Michael Ellsworth. 2006. Linking FrameNet to the Suggested Upper Merged Ontology. Proceedings of the 2006 Conference on Formal Ontology in Information Systems (FOIS 2006): 289-300.

Hyopil Shin. 2010. KOLON: Mapping Korean Words onto the Microkosmos Ontology and Combining Lexical Resources. Journal of the Linguistic Society of Korea, No. 56: 159-196.

Dennis Spohr. 2008. A General Methodology for Mapping EuroWordNets to the Suggested Upper Merged Ontology. Proceedings of the 6th LREC 2008: 1-5.

Sara Tonelli and Daniele Pighin. 2009. New Features for FrameNet-WordNet Mapping. Proceedings of the 13th Conference on Computational Natural Language Learning: 219-227.

Aesun Yoon, Soonhee Hwang, E. Lee, and Hyuk-Chul Kwon. 2009. Construction of Korean WordNet 'KorLex 1.5'. Journal of KIISE: Software and Applications, Vol. 36, Issue 1: 92-108.

Aesun Yoon. 2010. Mapping Word Senses of Korean Predicates between STD (STandard Dictionary) and SJD (SeJong Electronic Dictionary) for the HLT Applications. Journal of the Linguistic Society of Korea, No. 56: 197-235.

Sequential Tagging of Semantic Roles on Chinese FrameNet

Jihong LI, Ruibo WANG, Yahui GAO
Computer Center, Shanxi University
[email protected], {wangruibo,gaoyahui}@sxu.edu.cn

Abstract

In this paper, semantic role labeling (SRL) on Chinese FrameNet is divided into the subtasks of boundary identification (BI) and semantic role classification (SRC). Each subtask is treated as a sequential tagging problem at the word level, and the conditional random fields (CRFs) model is used to train and test on two-fold cross-validation data sets. The extracted features include 11 word-level and 15 shallow syntactic features derived from automatic base chunk parsing. We use an orthogonal array from statistics to arrange the experiments so that the best feature template is selected. The experimental results show that, given the target word within a sentence, the best F-measure of SRL reaches 60.42%. For the BI and SRC subtasks, the best F-measures are 70.55% and 81%, respectively. A statistical t-test shows that the improvement of our SRL model after appending the base chunk features is not significant.

1 Introduction

Semantic parsing is important in natural language processing, and it has attracted an increasing number of studies in recent years. Currently, its most important aspect is the formalization of the propositional meaning of a sentence through semantic role labeling. Therefore, many large human-annotated corpora have been constructed to support related research, such as FrameNet (Baker et al., 1998), PropBank (Kingsbury and Palmer, 2002), NomBank (Meyers et al., 2004), and so on. On this basis, several international semantic evaluations have been organized, including Senseval 3 (Litkowski, 2004), SemEval 2007 (Baker et al., 2007), CoNLL 2008 (Surdeanu et al., 2008), and CoNLL 2009 (Hajic et al., 2009).

The first SRL model on FrameNet was proposed by Gildea and Jurafsky (2002). The model consists of the two subtasks of boundary identification (BI) and semantic role classification (SRC). Both subtasks were implemented on preprocessed results of the full parse tree, and many lexical and syntactic features were extracted to improve the accuracy of the model. On the FrameNet test data, the system achieved 65% precision and 61% recall.

Most work on SRL has followed Gildea's framework for processing the SRL task on English FrameNet and PropBank: building the model on the full parse tree and selecting features with various machine learning methods to improve the accuracy of the SRL model. Many attempts have made significant progress, such as the works of Pradhan et al. (2005), Surdeanu et al. (2007), and so on. Other researchers have regarded the SRL task as a sequential tagging problem and employed shallow chunking techniques to solve it, as described by Marquez et al. (2008).

Although SRL models based on a full parse tree perform well in English, this processing method is not as effective in other languages, especially in Chinese. A systematic study of Chinese SRL was done by Xue et al. (2008). Like the English SRL procedure, he removed many

uncorrelated constituents of the parse tree and relied on the remainder to identify the semantic roles using a maximum entropy model. When a human-corrected parse is used, the F-measures on the PropBank and NomBank reach 92.0% and 69.6%, respectively. However, when an automatic full parse is used, the F-measures only reach 71.9% and 60.4%, respectively. This significant decrease prompted us to analyze its causes and to find a potential solution.

First, the Chinese human-annotated resources for semantic roles are relatively small. Sun and Gildea studied the SRL of only 10 Chinese verbs and extracted 1,138 sentences from the Chinese TreeBank. The sizes of the Chinese PropBank and Chinese NomBank used in Xue's paper are significantly smaller than those used in English-language studies. Moreover, more verbs exist in Chinese than in English, which increases the sparsity of Chinese semantic role data resources. The same problem also exists in our experiment: the current corpus of Chinese FrameNet includes about 18,322 human-annotated sentences for 1,671 target words, on average fewer than 10 sentences per target word. To reduce the influence of data sparsity, we adopt a two-fold cross-validation technique for training and testing.

Second, because of the lack of morphological clues in Chinese, the accuracy of a state-of-the-art parsing system decreases significantly in a realistic scenario. In the preliminary stage of building an SRL model for CFN, we employed the Stanford full parser to parse all sentences in the corpus and adopted the traditional SRL technique on our data set. However, the result was unsatisfactory: only 76.48% of the semantic roles in the data set have a constituent with the same text span in the parse tree, and the F-measure of BI only reaches 54%. Therefore, we attempted another processing technique for SRL on CFN. We formalized SRL on CFN as a sequential tagging problem at the word level: we first extracted 11 word features for the baseline model, and then added 15 base chunk features to the SRL model. The conditional random fields (CRFs) model is employed to train the model and predict the results for unlabeled sentences. To improve the accuracy of the model, base chunk features are introduced and a feature selection method involving an orthogonal array is adopted. The experimental results illustrate that the F-measure of our SRL model reaches 60.42%; this is the best SRL result on CFN so far.

The paper is organized as follows. In Section 2, we describe the situation of CFN and introduce SRL on CFN. In Section 3, we propose our SRL model in detail. In Section 4, the candidate feature set is proposed and the orthogonal-array-based feature selection method is introduced. In Section 5, we describe the experimental setup used throughout this paper. In Section 6, we list our experimental results and provide a detailed analysis. Conclusions and several further directions are given at the end of the paper.

2 CFN and Its SRL Task

Chinese FrameNet (CFN) (You et al., 2005) is a research project developed by Shanxi University, creating an FN-styled lexicon for Chinese, based on the theory of Frame Semantics (Fillmore, 1982) and supported by corpus evidence. The results of the CFN project include a lexical resource, called the CFN database, and associated software tools. Many natural language processing (NLP) applications, such as Information Retrieval and Machine Translation, will benefit from this resource. In FN, the semantic roles of a predicate are called the frame elements of a frame. A frame has different frame elements, and a group of lexical units (LUs) that evoke the same frame share the same set of frame element names. The CFN project currently contains more than 1,671 LUs and more than 219 semantic frames, and has exemplified more than 18,322 annotated sentences. In addition to correct segmentation and parts of speech, every sentence in the database is marked up to exemplify the semantic and syntactic information of the target word. Each annotated sentence contains only one target word, for example:

(a). 第1章 介绍 算法 与 数据 结构 ;/w

The CFN corpus is currently at an early stage and the available CFN resource is relatively limited, so the SRL task on CFN in this paper is described as follows: given a Chinese sentence, a target word, and its frame, we identify the boundaries of the frame elements within the sentence and label them with the appropriate frame element names. This is the same as the task in Senseval-3.

3 Shallow SRL Models

This section proposes our SRL model architecture and describes the stages of our model in detail.

3.1 SRL Model Architecture

A family of SRL models can be constructed using only shallow syntactic information as the input. The main differences among the models in this family concern the following two aspects:
i) model strategy: whether to combine the subtasks of BI and SRC;
ii) tagging unit: whether the word or the chunk is used as the tagging unit.

The one-stage and two-stage models are two popular strategies used in SRL tasks, as described by Sui et al. (2009). The word and the chunk are the two different tagging units of the SRL task. In our SRL model, we treat BI and SRC as two stages, and the word is always used as the tagging unit. The detailed formalization is given in the following subsections.

3.2 BI

The aim of the BI stage is to identify all word spans of the semantic roles in a Chinese sentence. It can be regarded as a sequential tagging problem. Using the IOB2 strategy (Erik et al., 1999), we use the tag set {B, I, O} to tag all words, where tag 'B' marks the beginning word of a chunk, 'I' marks the other tokens in the chunk, and 'O' is the tag of all tokens outside any chunk. The example sentence (a) can therefore be represented as follows:

(b). 第1/B 章/I 介绍/O 算法/B 与/I 数据/I 结构/I ;/O

To avoid the problem of data sparsity, we use all sentences in our training data set to train the model of BI.

3.3 SRC

After the boundaries of the semantic role chunks in a sentence have been predicted, the proper semantic role types should be assigned in the SRC step. Although this can easily be modeled as a classification problem, we regard it as a sequential tagging problem at the word level. An additional constraint is employed in this step: the boundary tags of the predicted sequence of this stage should be consistent with the output of the BI stage. One intuitive reason for this modeling strategy is that the SRC step can use the same feature set as BI, which further demonstrates the rationality of our feature optimization method.

3.4 Postprocessing

Not all predicted IOB2 sequences can be transformed back to the original sentence correctly; they should satisfy the following compulsory constraints:
(1) The tagging sequence should be regular. "I...", "...OI...", "I-X...", "...O-I-X...", "...B-X-I-Y...", and "B-I-X-I-X-I-Y..." are not regular IOB2 sequences.
(2) The tag of the target word must be 'O'.

We use Algorithm 1 below to judge whether an IOB2 sequence is regular. Moreover, at the SRC stage, the boundary tags of the IOB2 sequence must be consistent with the given boundary tags. For the BI stage, we first add an additional chunk type tag X to all 'B' and 'I' tags in the IOB2 sequences, and then use Algorithm 1 to judge the regularity of the sequences. In the testing stage of the SRL model, we use the regular sequence with the maximum probability as the optimal output.
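The two constraints, which Algorithm 1 checks in the paper's pseudocode, amount to a single linear scan; a minimal runnable sketch, with illustrative names and chunk-typed tags of the form 'B-X'/'I-X'/'O':

    def is_regular(tags, target_pos):
        """Check constraints (1) and (2) for a chunk-typed IOB2 sequence;
        target_pos is the 0-based index of the target word."""
        if tags[target_pos] != 'O':          # constraint (2)
            return False
        current_type = None                  # type of the currently open chunk
        for tag in tags:
            if tag == 'O':
                current_type = None
            elif tag.startswith('B-'):
                current_type = tag[2:]
            elif tag.startswith('I-'):
                if current_type != tag[2:]:  # 'I' not continuing a chunk
                    return False             # constraint (1)
            else:
                return False                 # malformed tag
        return True

    print(is_regular(['B-X', 'I-X', 'O', 'B-Y', 'I-Y'], 2))  # True
    print(is_regular(['I-X', 'O', 'B-X'], 1))                # False ("I...")
    print(is_regular(['B-X', 'I-Y', 'O'], 2))                # False ("B-X-I-Y")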

Algorithm 1. Judging whether an IOB2 sequence is regular
Input: (1) an IOB2 sequence S = (s1, ..., sn), where si is in {B-X, I-X, O} and 1 <= i <= n; (2) the position pt of the target word in the sentence.
1. Initialization: (1) current chunk type: ct = NULL; (2) regularity of the sequence: state = 'REG'.
2. Check the tag of the target word, s_pt: (1) if s_pt == 'O', go to Step 3; (2) if s_pt <> 'O', set state = 'IRR' and go to Step 4.
3. For (i = 1; i <= n; i++): (1) if si == 'B-X', set ct = 'X'; (2) if si == 'I-X' and ct <> 'X', set state = 'IRR' and go to Step 4; (3) if si == 'O', set ct = NULL.
4. Stop.
Output: the variable state.

3.5 Why Word-by-Word?

We tried to use constituent-by-constituent and chunk-by-chunk methods to solve our SRL task on CFN, but the experimental results illustrate that they are not suitable for our task.

We used the Stanford Chinese full parser to parse all sentences in the CFN corpus and applied the SRL model proposed by Xue et al. (2008) to our task. However, the results were unsatisfactory: only 66.72% of the semantic roles align with constituents of the full parse tree, and the F-measure of BI only reaches 52.43%. The accuracy of the state-of-the-art Chinese full parser is not high enough, so it is not suitable for our SRL task.

Chunk-by-chunk is another choice for our task. When we use the base chunk as the tagging unit of our model, about 15% of the semantic roles do not align well with the boundaries of automatically generated base chunks, and the F-measure is significantly lower than with the word-by-word method, as described by Wang et al. (2009). Therefore, words are chosen as the tagging unit of our SRL model, which showed significant results in the experiments.

4 Feature Selection and Optimization

Word-level features and base chunk features are used in our SRL research. The base chunk is a Chinese shallow parsing scheme proposed by Professor Zhou, who constructed a high-accuracy rule-based Chinese base chunk parser (Zhou, 2009) whose F-measure reaches 89%. We use this parser to generate all base chunks of the sentences in our corpus and to extract several types of features from them. The automatically generated base chunks of example sentence (a) are given as follows:

(c). [mp-ZX 第1/m 章/q ] [vp-SG 介绍/v ] [np-SG 算法/n ] 与/c [np-AM 数据/n 结构/n ] ;/w

4.1 Candidate Feature Set

Three types of features are used, as follows.
Features at the word level:
Word: the current token itself;
Part-of-Speech: the part of speech of the current token;
Position: the position of the current word relative to the target word (before, after, or on);
Target word: the target word in the sentence.
Features at the base chunk level:
Syntactic label: the syntactic label of the current token, such as B-np, I-vp, etc.;
Structural label: the structural label of the current token, such as B-SG, I-ZX, etc.;
Head word and its Part of Speech: the head word of the base chunk and its part of speech;
Shallow syntactic path: the combination of the syntactic tags from the source base chunk, which contains the current word, to the target base chunk, which contains the target word of the sentence;
Subcategory: the combination of the syntactic tags of the base chunks around the target word.
Other features:
Named entity: three types of named entities are considered, person, location, and time; they can be mapped directly from the part of speech of the current word;
Simplified sentence: a Boolean feature; we use the punctuation count of the sentence to estimate whether the sentence is a simplified sentence.
Aside from the basic features described above, we also use combinations of these features, such as the word/POS combination, etc.
4.2 Feature Optimization Method

In the baseline model, we only introduce the features at the word level. Table 1 shows the candidate features of our baseline model and the optional sizes of their sliding windows. For Table 1, we use the orthogonal array L32(4^9 x 2^4) to construct 32 different templates.
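To make the notion of a sliding window concrete, the sketch below extracts word, POS, position, and target-word features for one token; the particular window choices here ([-2,2] unigrams, [-1,1] word bigrams) are just one of the combinations listed in Table 1, not the template finally selected by the orthogonal array.

    def extract_features(words, pos_tags, i, target_idx):
        feats = {}
        n = len(words)
        for off in range(-2, 3):                       # unigrams, window [-2,2]
            j = i + off
            if 0 <= j < n:
                feats['word[%d]' % off] = words[j]
                feats['pos[%d]' % off] = pos_tags[j]
        for off in (-1, 0):                            # word bigrams, window [-1,1]
            j = i + off
            if 0 <= j and j + 1 < n:
                feats['word2[%d]' % off] = words[j] + '/' + words[j + 1]
        feats['position'] = ('on' if i == target_idx   # position feature
                             else 'before' if i < target_idx else 'after')
        feats['target'] = words[target_idx]            # target-word feature
        return feats

    # Example sentence (a)/(c), extracting features for the token 算法.
    words = ['第1', '章', '介绍', '算法', '与', '数据', '结构', ';']
    pos = ['m', 'q', 'v', 'n', 'c', 'n', 'n', 'w']
    print(extract_features(words, pos, 3, 2))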

The best template is chosen as the one with the highest F-measure when testing the 32 templates. The detailed orthogonal-array-based feature selection method was proposed by Li et al. (2010).

Table 1. Candidate features of the baseline model
Feature type          Window size
word                  [0,0] [-1,1] [-2,2] [-3,3]
bigram of word        -     [-1,1] [-2,2] [-3,3]
POS                   [0,0] [-1,1] [-2,2] [-3,3]
bigram of POS         -     [-1,1] [-2,2] [-3,3]
position              [0,0] [-1,1] [-2,2] [-3,3]
bigram of position    -     [-1,1] [-2,2] [-3,3]
word/POS              -     [0,0]  [-1,1] [-2,2]
word/position         -     [0,0]  [-1,1] [-2,2]
POS/position          -     [0,0]  [-1,1] [-2,2]
trigram of position   -     [-2,0] [-1,1] [0,2]
word/target word      -     [0,0]
target word           [0,0]

Compared with the baseline model, the features at both the word and base chunk levels are all considered in Table 2.

Table 2. Candidate features of the base chunk-based model
Feature type              Window size
word                      [0,0] [-1,1] [-2,2]
bigram of word            -     [-1,1] [-2,2]
POS                       [0,0] [-1,1] [-2,2]
bigram of POS             -     [-1,1] [-2,2]
position                  [0,0] [-1,1] [-2,2]
bigram of position        -     [-1,1] [-2,2]
word/POS                  -     [0,0]  [-1,1]
word/position             -     [0,0]  [-1,1]
POS/position              -     [0,0]  [-1,1]
trigram of position       -     [-2,0] [-1,1]
syntactic label           [0,0] [-1,1] [-2,2]
syn-bigram                -     [-1,1] [-2,2]
syn-trigram               -     [-1,1] [-2,2]
head word                 [0,0] [-1,1] [-2,2]
head word-bigram          -     [-1,1] [-2,2]
POS of head               [0,0] [-1,1] [-2,2]
POS-bigram of head        -     [-1,1] [-2,2]
syn/head word             [0,0] [-1,1] [-2,2]
stru/head word            [0,0] [-1,1] [-2,2]
shallow path              -     [0,0]  [-1,1]
subcategory               -     [0,0]
named entity              -     [0,0]
simplified sentence       -     [0,0]
target word (compulsory)  [0,0]

The orthogonal array L54(2^1 x 3^25) is employed to select the best feature template from all candidate feature templates in Table 2. To distinguish it from the baseline model, we call the model based on Table 2 the "base chunk-based SRL model". For both feature sets described above, the target word is a compulsory feature in every template, and the boundary tags are introduced as features during the SRC stage. The feature templates in Table 2 cannot contain the best feature template selected from Table 1; this is a disadvantage of our feature selection method.

5 Experimental Setup and Evaluation Metrics

5.1 Data Set

The experimental data set consists of all sentences of 25 frames selected from the CFN corpus. These sentences have correct POS tags and CFN semantic information, and they were all automatically parsed by the rule-based Chinese base chunk parser. Table 3 shows some statistics on these 25 frames.

Table 3. Summary of the experimental data set
Frame       FEs  Sents    Frame       FEs  Sents
感受         6    569      因果         7    140
知觉特征     5    345      陈述         10   1,603
思想         3    141      拥有         4    170
联想         5    185      适宜性       4    70
自主感知     14   499      发明         12   198
查看         9    320      计划         6    90
思考         8    283      代表         7    80
非自主感知   13   379      范畴化       11   125
获知         9    258      证明         9    101
相信         8    218      鲜明性       9    260
记忆         12   298      外观         10   106
包含         6    126      属于某类     8    74
宗教信仰     5    54       Totals      200  6,692

5.2 Cross-Validation Technique

In all our experiments, three groups of two-fold cross-validation sets are used to estimate the performance of our SRL model. The sentences of each frame are cut into four folds of roughly equal size; two folds are merged as training data, and the other two folds are used as test data. In this way, we obtain three groups of two-fold cross-validation data sets.

Estimating the fold number parameter is one of the most difficult problems in the cross-validation technique. We believe that in the task of SRL, the two-fold cross-validation set is a reasonable choice, especially when the data set is relatively small: with a small data set, a half-half split of the data set is the best approximation of the real-world distribution of semantic roles and the sparse word tokens.
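The construction of the three groups can be sketched as follows; that four folds admit exactly three distinct 2+2 pairings is the basis of the three groups, while the fold-assignment scheme itself is illustrative.

    def four_folds(sentences):
        """Cut a frame's sentences into four folds of (nearly) equal size."""
        return [sentences[i::4] for i in range(4)]

    def two_fold_groups(sentences):
        """The three distinct ways of merging four folds into 2 + 2."""
        f = four_folds(sentences)
        groups = []
        for a, b in ((0, 1), (0, 2), (0, 3)):
            rest = [k for k in range(4) if k not in (a, b)]
            groups.append((f[a] + f[b], f[rest[0]] + f[rest[1]]))
        return groups

    for train, test in two_fold_groups(list(range(12))):
        print(sorted(train), sorted(test))   # prints the 3 train/test splits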

5.3 Classifiers

The CRFs model is used as the learning algorithm in our experiments. Previous SRL research has demonstrated that the CRFs model is one of the best statistical algorithms for SRL, as in the works of Cohn et al. (2005) and Yu et al. (2007). The crfpp toolkit (http://crfpp.sourceforge.net/) is a good implementation of the CRF classifier; it contains three different training algorithms: CRFL1, CRFL2, and MIRA. We only use CRFL2, with Gaussian prior regularization and the variance parameter C = 1.0.

5.4 Evaluation Metrics

As in other SRL research, precision, recall, and F-measure are used as our evaluation metrics. In addition, the standard deviation of the F-measure is adopted as an important metric of our SRL model. These metrics are computed as follows.

Let P_j^i, R_j^i, and F_j^i be the precision, recall, and F-measure of the j-th group of the i-th cross-validation set, where j = 1, 2, 3 and i = 1, 2. The final precision (P), recall (R), and F-measure (F) of our SRL model are the expectations of the P_j^i, R_j^i, and F_j^i, respectively.

Estimating the variance of cross-validation results is another difficult problem in the cross-validation technique. Although it has been proven that no uniform and unbiased estimator of the variance of cross-validation exists (Yoshua et al., 2004), we adopted the method proposed by Nadeau et al. (2003) to estimate the variance of the F-measure over the cross-validation sets, as follows.

Let Fbar_j be the average F-measure of the j-th group experiment, that is, Fbar_j = (1/2)(F_j^1 + F_j^2), where j = 1, 2, 3. The estimator of the variance of F_j proposed by Nadeau et al. (2003) is

  Vhat(F_j) = (1/K + n2/n1) * sum_{i=1..2} (F_j^i - Fbar_j)^2 = (1/2 + 1) * sum_{i=1..2} (F_j^i - Fbar_j)^2,

where K is the fold number of cross-validation, and n1 and n2 are the counts of training and testing examples. In our experimental setting, K = 2 and n2/n1 is approximately 1. Moreover, the variance of the total F-measure is estimated as

  Var(F) = Var((1/3)(F_1 + F_2 + F_3)) = (1/9) * sum_{j=1..3} Var(F_j).

Using Vhat(F_j) to estimate Var(F_j), we obtain

  Vhat(F) = (1/9) * sum_{j=1..3} Vhat(F_j) = (1/6) * sum_{j=1..3} sum_{i=1..2} (F_j^i - Fbar_j)^2.

Finally, we can derive the standard deviation of the F-measure: std(F) = sqrt(Vhat(F)).

5.5 Significance Test of Two SRL Models

To test the significance of SRL models A and B, we use the following statistic S:

  S = (F(A) - F(B)) / sqrt(Var(F(A)) + Var(F(B))) ~ t(n),

where F(A) and F(B) are the F-measures of models A and B, and n, the degree of freedom of the t-distribution, is the integer nearest to

  n' = 3(Var(F(A)) + Var(F(B)))^2 / (Var(F(A))^2 + Var(F(B))^2).

We use the p-value to test the significance of models A and B:

  p-value(F(A), F(B)) = P(S >= t_{1-alpha/2}(n)).

If p-value(F(A), F(B)) <= 0.05, the difference between the F-measures of models A and B is significant at the 95% level.
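A minimal numeric sketch of these estimators (K = 2, n2/n1 approximately 1), with illustrative names and toy F-measures:

    import math

    def var_fj(fj):
        """Nadeau-Bengio corrected variance of one group: (1/K + n2/n1) = 1.5."""
        mean = sum(fj) / len(fj)
        return 1.5 * sum((x - mean) ** 2 for x in fj)

    def var_f(groups):
        """Var(F) = (1/9) * sum over the three groups of Var(Fj)."""
        return sum(var_fj(fj) for fj in groups) / 9.0

    def t_stat(f_a, var_a, f_b, var_b):
        """The statistic S of Section 5.5."""
        return (f_a - f_b) / math.sqrt(var_a + var_b)

    a = [[0.703, 0.705], [0.702, 0.706], [0.704, 0.704]]   # toy F-measures, model A
    b = [[0.694, 0.696], [0.693, 0.697], [0.695, 0.695]]   # toy F-measures, model B
    fa = sum(sum(g) for g in a) / 6.0
    fb = sum(sum(g) for g in b) / 6.0
    print(t_stat(fa, var_f(a), fb, var_f(b)))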

6 Experimental Results and Discussion

We summarize the experimental results of every stage of our SRL model, that is, BI, SRC, and the combination of the two steps (BI+SRC).

6.1 Baseline SRL Model

The results of the baseline model, which uses only the features in Table 1, are given in Table 4.

Table 4. Results of the baseline model
          P(%)     R(%)     F(%)     std(F)
BI        74.42    66.80    70.40    0.0031
SRC       -        -        80.32    0.0032
BI+SRC    62.87    56.44    59.48    0.0050

In Table 4, because the results of the SRC stage are based on human-corrected boundary information, the precision, recall, and F-measure of this stage are the same; therefore, we only give the F-measure and its deviation for the SRC stage. In the baseline model, the BI stage is the bottleneck of our SRL model: its F-measure only reaches 70.4%, and the recall is lower than the precision. Moreover, the F-measure of the final model only reaches 59.48%, and its standard deviation is larger than that of either stage.

6.2 Base Chunk-Based SRL Model

When the base chunk features proposed in Table 2 are employed in the SRL model, we obtain the results summarized in Table 5.

Table 5. Results of the base chunk-based model
          P(%)     R(%)     F(%)     std(F)
BI        74.69    66.85    70.55    0.0038
SRC       -        -        81.00    0.0029
BI+SRC    63.97    57.25    60.42    0.0049

A comparison of Table 4 and Table 5 yields the following two conclusions.
(1) When base chunk features are used, P, R, and F at every stage increase slightly (< 1%).
(2) The significance-test values between the baseline model and the base chunk-based model are given in Table 6. For every stage, the performance boost after introducing the base chunk features is not significant at the 95% level. However, the impact of the base chunk features at the SRC stage is larger than that at the BI stage.

Table 6. Test values between the two SRL models
          BI      SRC      BI+SRC
p-value   0.77    0.166    0.228

7 Conclusions and Further Directions

The SRL of Chinese predicates is a challenging task. In this paper, we studied the task of SRL on CFN. We proposed a two-stage model and exploited the CRFs classifier to implement automatic SRL systems. Moreover, we introduced base chunk features and the OA-based feature selection method to improve the performance of our model. The experimental results show that the F-measure of our best model reaches 60.42%, and that the base chunk features cannot improve the SRL model significantly.

In the future, we plan to introduce unlabeled data into the training phase and use EM-style semi-supervised learning algorithms to boost the accuracy of our SRL model.

Acknowledgement

The authors would like to thank Prof. Kaiying LIU for his comments and Prof. Qiang ZHOU for the base chunk parser.

References

Baker, C., Fillmore, C., and John B. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL, 86-90, Montreal, Canada.

Baker, C., Ellsworth, M., and Erk, K. 2007. SemEval'07 Task 19: Frame semantic structure extraction. Proceedings of the 4th International Workshop on Semantic Evaluations, 99-104, Prague, Czech Republic.

Cohn, T., and Blunsom, P. 2005. Semantic role labeling with tree conditional random fields. Proceedings of CoNLL 2005, ACL, 169-172.

Erik F., and John V. 1999. Representing text chunks. In Proceedings of EACL'99, 173-179.

Fillmore, C. 1982. Frame Semantics. In The Linguistic Society of Korea, Seoul: Hanshin.

Gildea, D., and Jurafsky, D. 2002. Automatic labeling for semantic roles. Computational Linguistics, 28(3):245-288.

Hajic, J., Ciaramita, M., Johansson, R., Kawahara, D., Marti, M., Marquez, L., Meyers, A., Nivre, J., Pado, S., Stepanek, J., Stranak, P., Surdeanu, M., Nianwen X., and Zhang, Y. 2009. The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL 2009, 1-18, Boulder, CO, USA.

Jihong, L., Ruibo, W., Weilin, W., and Guochen, L. 2010. Automatic Labeling of Semantic Roles on Chinese FrameNet. Journal of Software, 21(4):597-611.

Jiangde, Y., Xiaozhong, F., Wenbo, P., and Zhengtao, Y. 2007. Semantic role labeling based on conditional random fields. Journal of Southeast University (English edition), 23(2):361-364.

Liping, Y., and Kaiying, L. 2005. Building a Chinese FrameNet database. In Proceedings of IEEE NLP-KE'05, 301-306.

Litkowski, K. 2004. Senseval-3 task: automatic labeling of semantic roles. Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 9-12, Barcelona, Spain.

Marquez, L., Carreras, X., Litkowski, K., and Stevenson, S. 2008. Semantic Role Labeling: An Introduction to the Special Issue. Computational Linguistics, 34(2):145-159.

Meyers, A., Reeves, R., Macleod, C., Szekely, R., Zielinska, V., Young, B., and Grishman, R. 2004. The NomBank Project: An interim report. In Proceedings of the NAACL/HLT Workshop on Frontiers in Corpus Annotation, 24-31, Boston, MA, USA.

Nadeau, C., and Bengio, Y. 2003. Inference for the generalization error. Machine Learning, 52:239-281.

Nianwen, X. 2008. Labeling Chinese Predicates with Semantic Roles. Computational Linguistics, 34(2):225-255.

Paul, K., and Martha, P. 2002. From TreeBank to PropBank. In Proceedings of LREC-2002, Canary Islands, Spain.

Pradhan, S., Hacioglu, K., Krugler, V., Ward, W., Martin, J., and Jurafsky, D. 2005. Support vector learning for semantic argument classification. Machine Learning, 60(1):11-39.

Qiang, Z. 2007. A rule-based Chinese base chunk parser. In Proceedings of the 7th International Conference on Chinese Computation (ICCC-2007), 137-142, Wuhan, China.

Ruibo, W. 2009. Automatic Semantic Role Labeling of Chinese FrameNet Based on the Conditional Random Fields Model. Master's thesis, Shanxi University, Taiyuan, Shanxi, China.

Surdeanu, M., Johansson, R., Meyers, A., Marquez, L., and Nivre, J. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL 2008, 159-177, Manchester, England, UK.

Surdeanu, M., Marquez, L., Carreras, X., and Comas, P. 2007. Combination strategies for semantic role labeling. Journal of Artificial Intelligence Research, 29:105-151.

Weiwei, S., Zhifang, S., Meng, W., and Xing, W. 2009. Chinese semantic role labeling with shallow parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), ACL, 1475-1483.

Yoshua, B., and Yves, G. 2004. No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research, 5:1089-1105.

Augmenting a Bilingual Lexicon with Information for Word Translation Disambiguation

Takashi Tsunakawa, Hiroyuki Kaji
Faculty of Informatics, Shizuoka University
[email protected], [email protected]

Abstract

We describe a method for augmenting a bilingual lexicon with additional information for selecting an appropriate translation word. For each word in the source language, we calculate a correlation matrix of its association words versus its translation candidates. We estimate the degree of correlation by using comparable corpora, based on two assumptions: "parallel word associations" and "one sense per word association." In our word translation disambiguation experiment, the results show that our method achieved 42% recall and 49% precision for Japanese-English newspaper texts, and 45% recall and 76% precision for Chinese-Japanese technical documents.

1 Introduction

The bilingual lexicon, or bilingual dictionary, is a fundamental linguistic resource for multilingual natural language processing (NLP). For each word, multiword unit, or expression in the source language, the bilingual lexicon provides translation candidates representing the original meaning in the target language.

Selecting the right words for a translation is a serious problem in almost all multilingual NLP. A word in the source language almost always has two or more translation candidates in the target language when looked up in the bilingual lexicon. Because each translation candidate has a distinct meaning and properties, we must be careful to select the appropriate translation candidate that has the same sense as the word inputted. This task is often called word translation disambiguation.

In this paper, we describe a method for adding information for word translation disambiguation to the bilingual lexicon. Comparable corpora can be used to determine which word associations suggest which translations of a word (Kaji and Morimoto, 2002). First, we extract word associations from each language's corpus and align them by using a bilingual dictionary. Then, we construct a word correlation matrix for each word in the source language. This correlation matrix serves as information for word translation disambiguation.

We carried out word translation experiments in two settings: English-to-Japanese and Chinese-to-Japanese. In the experiments, we tested the Dice/Jaccard coefficients, pointwise mutual information, the log-likelihood ratio, and Student's t-score as the association measures for extracting word associations.

2 Constructing Word Correlation Matrices for Word Translation Disambiguation

2.1 Outline of our method

In this section, we describe the method for calculating a word correlation matrix for each word in the source language. The correlation matrix for a word f relates its association words to its translation candidates. Among the translation candidates, we choose the most acceptable one, that is, the one most strongly suggested by the association words occurring around f.

We use two assumptions in this framework:
(i) Parallel word associations: translations of words associated with each other in one language are also associated with each other in the other language

(Rapp, 1995). For example, the two English words "tank" and "soldier" are associated with each other, and their Japanese translations 戦車 (sensha) and 兵士 (heishi) are also associated with each other.
(ii) One sense per word association: a polysemous word exhibits only one of its senses per word association (Yarowsky, 1993). For example, the polysemous word "tank" exhibits its "military vehicle" sense when it is associated with "soldier," while it exhibits its "container for liquid or gas" sense when it is associated with "gasoline."

Under these assumptions, we determine which of the words associated with an input word suggests which of its translations, by aligning word associations using a bilingual dictionary. Consider the associated English words (tank, soldier) and their Japanese translations (戦車 (sensha), 兵士 (heishi)). When we translate the word "tank" into Japanese, the associated word "soldier" helps us to translate it into 戦車 (sensha) rather than into タンク (tanku), which means a storage tank.

This naive method suffers from the following difficulties:
- a disparity in topical coverage between the two corpora in the two languages;
- gaps in the bilingual dictionary;
- the existence of polysemous associated words that cannot determine the correct sense of the input word.

Against these difficulties, we use the tendency that two words associated with a third word are likely to suggest the same sense of the third word when they are also associated with each other. For example, consider the English associated word pair (tank, troop). The word "troop" alone cannot distinguish the different meanings, because it can co-occur with the word "tank" in both of its senses. The third word "soldier," which is associated with both "tank" and "troop," can suggest the translation 戦車 (sensha).

[Figure 1. Overview of our method.]

The overview of our method is shown in Figure 1. We first extract associated word pairs in the source and target languages from comparable corpora. Using a bilingual dictionary, we obtain alignments of these word associations. Then, we iteratively calculate a correlation matrix for each word in the source language. Finally, we select the translation with the highest correlation among the translation candidates of the input word, given the co-occurring words.

For each input word in the source language, we calculate correlation values between its translation candidates and its association words. The algorithm is shown in Figure 2. In Algorithm 1, the initialization of the correlation values is based on word associations, where D is the set of word pairs in the bilingual dictionary, and Af and Ae are the sets of associated word pairs. First, we retain an associated word f'(i) when a translation e' of it exists and e' is associated with e. In the iteration, the correlation values of the associated words f'(i) that suggest e(j) increase relatively, driven by the association scores
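The exact update rule belongs to Figure 2, whose formulas did not survive extraction cleanly; the following sketch therefore shows only the general shape (dictionary-based initialization followed by Nr rounds of relative reinforcement), with an assumed additive update and renormalization, and with all names illustrative.

    def init_matrix(f, assoc_f, assoc_e, dictionary):
        """C(f'(i), e(j)) starts at 1.0 when some dictionary translation of
        f'(i) is associated with e(j) in the target corpus, else 0.0."""
        c = {}
        for fp in assoc_f.get(f, ()):
            for e in dictionary.get(f, ()):
                hit = any(ep in assoc_e.get(e, ())
                          for ep in dictionary.get(fp, ()))
                c[(fp, e)] = 1.0 if hit else 0.0
        return c

    def iterate_matrix(c, assoc_f, score_f, n_r=10):
        """Assumed reinforcement: a pair (f', e) gains support from words f''
        associated with f' that already suggest e; then renormalize."""
        for _ in range(n_r):
            new_c = {}
            for (fp, e), v in c.items():
                support = sum(score_f.get((fp, fpp), 0.0) * c.get((fpp, e), 0.0)
                              for fpp in assoc_f.get(fp, ()))
                new_c[(fp, e)] = v + support
            z = sum(new_c.values()) or 1.0
            c = {k: v / z for k, v in new_c.items()}
        return c

    # Toy run: 'soldier' and 'troop' jointly reinforce 戦車 for input 'tank'.
    assoc_f = {'tank': ['soldier', 'troop', 'gasoline'],
               'soldier': ['troop'], 'troop': ['soldier'], 'gasoline': []}
    assoc_e = {'戦車': ['兵士', '部隊'], 'タンク': ['ガソリン']}
    dictionary = {'tank': ['戦車', 'タンク'], 'soldier': ['兵士'],
                  'troop': ['部隊'], 'gasoline': ['ガソリン']}
    score_f = {('soldier', 'troop'): 1.0, ('troop', 'soldier'): 1.0}
    print(iterate_matrix(init_matrix('tank', assoc_f, assoc_e, dictionary),
                         assoc_f, score_f))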

2.2 Alternative association measures for extracting word associations

We extract co-occurring word pairs and calculate their association scores. In this paper, we focus on some frequently used metrics for finding word associations based on their occurrence/co-occurrence frequencies.

Suppose that words x and y frequently co-occur. Let n1 and n2 be the occurrence frequencies of x and y respectively, and let m be the frequency with which x and y co-occur within w content words. The parameter w is a window size that adjusts the range of co-occurrences. Let N and M be the sums of the occurrences/co-occurrences of all words/word pairs, respectively. The frequencies are summarized in Table 1.

Table 1. Contingency matrix of occurrence frequencies.

                      x occurs   x does not occur   Total
  y occurs               m           n2 - m          n2
  y does not occur    n1 - m     N - n1 - n2 + m    N - n2
  Total                 n1          N - n1            N

The word association scores A(x, y) are defined as follows:

- Dice coefficient (Smadja, 1993):
      Dice(x, y) = 2m / (n1 + n2)                                  (1)
- Jaccard coefficient (Smadja et al., 1996):
      Jaccard(x, y) = m / (n1 + n2 - m)                            (2)
- Pointwise mutual information (pMI) (Church and Hanks, 1990):
      pMI(x, y) = log[ (m / M) / ((n1 / N)(n2 / N)) ]              (3)
- Log-likelihood ratio (LLR) (Dunning, 1993):
      LLR(x, y) = 2[ logL(m, n1, p1) + logL(n2 - m, N - n1, p2)
                     - logL(m, n1, p) - logL(n2 - m, N - n1, p) ]  (4)
      logL(k, n, p) = k log p + (n - k) log(1 - p)                 (5)
      p1 = m / n1,  p2 = (n2 - m) / (N - n1),  p = n2 / N          (6)
- Student's t-score (TScore) (Church et al., 1991):
      TScore(x, y) = (m - n1 n2 / N) / sqrt(m)                     (7)

Algorithm 1:
  Input:
    f: an input word
    f'(1), ..., f'(I): associated words of f
    e(1), ..., e(J): translation candidates of f
    Nr: number of iterations
    a bilingual lexicon D
    word association scores Af and Ae for both languages
  Output:
    Cf = [Cf(f'(i), e(j))]: a correlation matrix for f

  1:  if δ(f'(i), e(j)) ≠ 0 then
  2:    Cf(f'(i), e(j)) ← Ae(e'(i), e(j)) / max_j' Ae(e'(i), e(j'))
  3:  else
  4:    Cf(f'(i), e(j)) ← 0
  5:  end
  6:  r ← 0
  7:  while r < Nr
  8:    r ← r + 1
  9:    Cf(f'(i), e(j)) ← Af(f, f'(i)) × Σ_{f'' ∈ K} Af(f'(i), f'') Cf(f'', e(j))
                          / max_j' Σ_{f'' ∈ K} Af(f'(i), f'') Cf(f'', e(j'))
  10: end

  where δ(f'(i), e(j)) = 1 if there exists e' such that (f'(i), e') ∈ D and
  (e', e(j)) ∈ Ae, and 0 otherwise; e'(i) denotes such a translation of f'(i);
  and K = {f'' | (f, f'') ∈ Af, (f'(i), f'') ∈ Af}.

Figure 2. Algorithm for calculating correlation matrices.

We calculate association scores for all pairs of words whose occurrence frequencies are not less than a threshold Tf and whose co-occurrence frequencies are not less than a threshold Tc. Code transcriptions of these measures are given below.
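The five scores translate directly into code. Below is a small, self-contained Python transcription of formulas (1)-(7), assuming the counts of Table 1 (m, n1, n2, and the totals N and M); the function names and the guard against degenerate probabilities in logL are ours.

    import math

    def log_l(k, n, p):
        # logL(k, n, p) = k log p + (n - k) log(1 - p), formula (5),
        # with p clamped away from 0 and 1 to keep the logs finite.
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        return k * math.log(p) + (n - k) * math.log(1.0 - p)

    def dice(m, n1, n2):                    # formula (1)
        return 2.0 * m / (n1 + n2)

    def jaccard(m, n1, n2):                 # formula (2)
        return m / (n1 + n2 - m)

    def pmi(m, n1, n2, M, N):               # formula (3)
        return math.log((m / M) / ((n1 / N) * (n2 / N)))

    def llr(m, n1, n2, N):                  # formulas (4)-(6)
        p1 = m / n1
        p2 = (n2 - m) / (N - n1)
        p = n2 / N
        return 2.0 * (log_l(m, n1, p1) + log_l(n2 - m, N - n1, p2)
                      - log_l(m, n1, p) - log_l(n2 - m, N - n1, p))

    def t_score(m, n1, n2, N):              # formula (7)
        return (m - n1 * n2 / N) / math.sqrt(m)

    # e.g. dice(m=30, n1=200, n2=100) -> 0.2

Dice and Jaccard depend only on the three pairwise counts, whereas pMI, LLR and the t-score also need the corpus-wide totals, which is one reason they behave differently on low-frequency pairs.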

We handle word pairs whose association scores are not less than a predefined value TA; several values of this threshold were evaluated in our experiments. The associated word pair sets Af and Ae in Algorithm 1 include only word pairs whose scores are not less than TA in the source and target language, respectively.

Table 2. Word correlation matrix for the word "home," as information for word translation disambiguation.

           国 (kuni)  本塁 (honrui)  家 (ie)   自宅 (jitaku)  家庭 (katei)  施設 (shisetsu)
           [country]  [home base]   [house]   [my home]     [homeplace]   [facilities]
  base     0.009907   0.017649      0.005495  0.006117      0.005186      0.005597
  game     0.043507   0.048358      0.025145  0.028208      0.019987      0.023014
  Kansas   0.010514   0.003786      0.004280  0.007307      0.004320      0.005459
  run      0.023468   0.042035      0.014430  0.015765      0.012061      0.012986
  season   0.044855   0.050952      0.025406  0.028506      0.020716      0.023631

3 Word Translation Disambiguation

Consider that a translator changes a word in an input sentence. Usually, two or more translation candidates are enumerated in the bilingual dictionary for a word. The translator should select a translation word that is grammatically/syntactically correct, semantically equivalent to the input, and pragmatically appropriate.

We assume that the translation word e for an input word f tends to be selected if the words occurring around f are strongly correlated with e. Using the correlation matrices, we select e as a translation if an associated word f' occurs around f and the score Cf(f', e) is large. In addition, we take the distance between f and f' into account.

We define the score of a translation word of an input word f0 as follows. Consider an input word f0 that occurs in the context "... f-2 f-1 f0 f1 f2 ...". The score for a translation word e(j) of f0 is

    Score(e(j)) = (1 / |P|) Σ_{p ∈ P} Cf0(fp, e(j)),  P = {p : 0 < |p| ≤ W},   (8)

where p is the relative position of a word surrounding f0, Cf0(fp, e(j)) is the value of the correlation matrix for f0, and W is the window size for word translation disambiguation.

    A career .284 hitter, Beltran batted .267 in the regular season, split
    between Kansas City and Houston, but came alive in the playoffs. He hit
    .435 in 12 postseason games, with six stolen bases, eight home runs and
    14 runs batted in.

Figure 3. An example of an input word "home".

A simple example is shown in Figure 3. The word "home" in this context has the "home base" sense used in baseball games, not the "house" or "hometown" sense. The surrounding words such as "games," "bases," and "runs" can be clues indicating the correct sense of the word. Using the correlation matrix (Table 2) and formula (8), we calculate a score for each translation candidate and select the best translation, i.e. the one with the largest score. In this case, Score(本塁 (honrui)) = 0.1134 was the best score, and the correct translation was selected. A sketch of this scoring procedure follows.
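As a concrete illustration of formula (8), the following sketch averages the correlations of the words inside the window and returns a score per candidate. The data layout (a dict of matrix rows keyed by associated word) and the silent skipping of context words that have no row in the matrix are our assumptions, not a specification from the paper.

    def score_candidates(corr_row_for, context, position, window=30):
        # corr_row_for -- {associated word: {translation candidate: correlation}},
        #                 i.e. the correlation matrix of the input word f0
        # context      -- the input text as a list of tokens
        # position     -- index of the input word f0 in `context`
        lo = max(0, position - window)
        hi = min(len(context), position + window + 1)
        neighbours = [w for i, w in enumerate(context[lo:hi], start=lo)
                      if i != position]

        totals = {}
        for w in neighbours:
            for cand, c in corr_row_for.get(w, {}).items():
                totals[cand] = totals.get(cand, 0.0) + c
        # Average over the surrounding positions, as in formula (8).
        if not neighbours:
            return {}
        return {cand: s / len(neighbours) for cand, s in totals.items()}

    # With the Table 2 rows for "game", "run", "season", ... the candidate
    # 本塁 (honrui) accumulates the largest average in the Figure 3 context:
    # best = max(scores, key=scores.get)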

4 Experiment

4.1 Experimental settings

We carried out word translation experiments in two different settings. In the first experiment (Experiment A), we used large-scale comparable corpora of English and Japanese newspaper texts. The second experiment (Experiment B) was targeted at translating technical terms using Chinese and Japanese domain-specific corpora. We used the following linguistic resources for the experiments:

- Experiment A
  - Training comparable corpora: The New York Times texts from the English Gigaword Corpus, Fourth Edition (LDC2009T13): 1.6 billion words; The Mainichi Shimbun Corpus (2000-2005): 195 million words
  - Test corpus: a part of The New York Times (January 2005): 157 paragraphs, 1,420 input words
- Experiment B1
  - Training comparable corpus: an in-house Chinese-Japanese parallel corpus in the environment domain: 53,027 sentence pairs (we could prepare only parallel corpora for the Chinese-Japanese language pair as training corpora; for our experiments, we treated them as comparable corpora and did not use the correspondence of sentence pairs)
  - Test corpus: a part of the training data: 1,443 sentences, 668 input words
- Experiment B2
  - Training comparable corpus: an in-house Chinese-Japanese parallel corpus in the medical domain: 123,175 sentence pairs
  - Test corpus: a part of the training data: 940 sentences, 3,582 input words
- Dictionaries
  - Japanese-English bilingual dictionaries (333,656 term pairs in total): the EDR Electronic Dictionary; Eijiro, Third Edition; EDICT (Breen, 1995)
  - Chinese-English bilingual dictionaries: the Chinese-English Translation Lexicon Version 3.0 (LDC2002L27): 54,170 term pairs; the Wanfang Data Chinese-English Science/Technology Bilingual Dictionary: 525,259 term pairs

For the Chinese-Japanese translation, we generated a Chinese-Japanese bilingual dictionary by merging the Chinese-English and Japanese-English dictionaries. The Chinese-Japanese bilingual dictionary includes every Chinese-Japanese term pair (tC, tJ) for which (tC, tE) and (tJ, tE) were present in the dictionaries for one or more English terms tE. This merged dictionary contains about two million term pairs. While these Chinese-Japanese term pairs include wrong translations, this was not a serious problem in our experiments, because wrong translations were excluded in the procedure of our method.

We applied morphological analysis and part-of-speech tagging using TreeTagger (Schmid, 1994) for English, JUMAN for Japanese, and mma (Kruengkrai et al., 2009) for Chinese.

In the test corpora, we manually annotated reference translations for each target word. (We prepared multiple references for several target words; the average numbers of reference translations for an input word are 1.84 (A), 1.50 (B1), and 1.48 (B2), respectively.)

The parameters we used were as follows. Experiment A: Tf = 100 (Japanese), Tf = 1000 (English), Tc = 4, w = 30, W = 30. Experiment B1/B2: Tf = 100, Tc = 4, w = 10, W = 25. Some of the parameters were empirically adjusted.

In the experiments, matrices could be obtained for 9,103 English words (A), 674 Chinese words (B1) and 1,258 Chinese words (B2), respectively. On average, one word had 3.24 (A), 1.15 (B1) and 1.51 (B2) translation candidates under the best setting. (From each matrix, we cut off the columns of translations that do not have the best score for any associated word, because such translations are never selected.) Table 2 shows the resulting matrix for the word "home" in Experiment A.

4.2 Results of English-Japanese word translation

Table 3 shows the results of Experiment A. We classified the translation results for the 1,420 target English words into four categories: True, False, R, and M. When a translation was output, the result was True if the output was included in the reference translations, and False otherwise. The result was R when none of the associated words in the correlation matrix occurred around the input word.

The result was M when no correlation matrix existed for the input word. We did not output a translation in the R and M cases. Recall and precision are shown in parentheses in the tables below:

- Recall = (True) / (number of input words)
- Precision = (True) / (True + False)

Among the settings, we obtained the best results, 42% recall and 49% precision, when we used the Jaccard coefficient for the association scores and TA = 0, which means that all pairs were taken into consideration. Among the other settings, the Dice coefficient achieved a performance comparable with Jaccard.

Table 3. Results of English-Japanese word translation (A); recall/precision in parentheses.

  Score    TA     True  False  R    M
  Dice     0      588   627    90   115   (41%/48%)
  Dice     0.001  586   619    94   121   (41%/49%)
  Dice     0.01   479   507    243  191   (34%/49%)
  Jaccard  0      594   621    90   115   (42%/49%)
  Jaccard  0.001  584   609    105  122   (41%/49%)
  Jaccard  0.01   348   378    374  320   (25%/48%)
  pMI      0      292   309    704  115   (21%/49%)
  pMI      1      293   308    703  116   (21%/49%)
  LLR      10     530   747    28   115   (37%/42%)
  LLR      100    529   744    32   115   (37%/42%)
  TScore   1      486   793    26   115   (34%/38%)
  TScore   4      489   787    26   118   (34%/38%)

4.3 Results of Chinese-Japanese word translation

Tables 4 and 5 show the results of Experiment B. In each domain, we tested only the settings with Dice, pMI, and LLR at TA = 0.

In the environment domain, pointwise mutual information achieved the best performance, 45% recall and 76% precision, whereas the Dice coefficient gave the best recall (55%) in the medical domain. Experiment B1/B2 had higher precision, and more words without a correlation matrix, than Experiment A had.

Table 4. Results of Chinese-Japanese word translation for the environment domain (B1).

  Score  TA  True  False  R   M
  Dice   0   277   124    14  253   (41%/69%)
  pMI    0   303    95    17  253   (45%/76%)
  LLR    0   269   131    15  253   (40%/67%)

Table 5. Results of Chinese-Japanese word translation for the medical domain (B2).

  Score  TA  True   False  R    M
  Dice   0   1984    895   82   621   (55%/69%)
  pMI    0   1886    804   271  621   (53%/70%)
  LLR    0   1652   1246   63   621   (46%/57%)

4.4 Discussion

As a result, we could generate bilingual lexicons with word translation disambiguation information for 9,103 English words and 1,932 Chinese words. Although the number of words might be augmented by changing the settings, this size does not seem to be sufficient for a bilingual dictionary; how to obtain larger output should be investigated.

The experimental results show that our method selected correct translations for at least half of the input words when a correlation matrix existed and the associated words co-occurred. Among all input words, at least 40% could be translated. The bilingual dictionaries included 24.4, 38.6, and 52.0 translation candidates per input word in Experiments A, B1, and B2, respectively; when we simply select the most frequent translation, the precisions were 7%, 1%, and 1%, respectively. A worked check of how the reported recall and precision follow from the raw counts is given below.
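The percentages in Tables 3-5 follow mechanically from the four counts; as a worked example, here is the Jaccard row with TA = 0 in Table 3:

    def recall_precision(true, false, r, m):
        # Recall = True / (number of input words); Precision = True / (True + False)
        n_inputs = true + false + r + m
        return true / n_inputs, true / (true + false)

    rec, prec = recall_precision(594, 621, 90, 115)
    # rec = 594 / 1420 ≈ 0.42 and prec = 594 / 1215 ≈ 0.49,
    # i.e. the 42% recall / 49% precision reported above.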

Meanwhile, the average numbers of translation candidates in the correlation matrices for one input word are shown in Table 6. These figures indicate that our method effectively removed noisy translations from the Chinese-Japanese dictionary merged from the Japanese-English and Chinese-English dictionaries, and that the association scores contributed to word translation disambiguation.

Table 6. Average numbers of translation candidates in the correlation matrices for one input word.

  Score  TA    Exp.A  Exp.B1  Exp.B2
  Dice   0     1.83   0.61    0.91
  pMI    0     1.27   0.50    0.79
  LLR    10/0  1.34   1.48    2.75

Among the settings, the Jaccard/Dice coefficients proved to be effective, although pointwise mutual information (pMI) was also effective for the technical domains and the Chinese-Japanese language pair. Because the Jaccard/Dice coefficients were originally used for measuring the proximity of sets, they might also be effective for collecting related words through the similarity of their sets of co-occurring words. However, pMI tends to emphasize low-frequency words as associated words. A consequence of this tendency might be that low-frequency associated words do not appear around the input word in newspaper text. (We limited the maximum number of association words for one word to 400, in descending order of their association scores, because of restricted computational resources. In future work, we may alleviate this drawback of pMI by enlarging or removing this limitation.)

For most association score metrics, the lowest threshold value TA achieved the best performance. This result indicates that cutting off associated words by a threshold was not effective, although obtaining correlation matrices without a cut-off requires more time and memory. How to optimize the other parameters of our method remains unsolved.

More words without a correlation matrix were present in Experiment B1/B2 than in Experiment A, because the input words were often technical terms that were not in the bilingual dictionary. The better recall and precision of Experiment B1/B2 came from several factors, including the difference of test sets and language pairs. In addition, the fact that word translation disambiguation of technical terms is easier than that of common words might have had an impact on this result.

We handled only nouns as input words and associated words in this study. Considering only co-occurrence in a fixed window would be insufficient for applying this method to the translation of verbs and other parts of speech. In future work, we will consider syntactic co-occurrence, obtained by dependency parsing of the sentences. The correlation between associated words and translation candidates also needs to be re-examined. Similarly, we will handle verbs as associated words of the input nouns by using syntactic co-occurrence.

5 Related Work

Statistical machine translation (Brown et al., 1990) automatically acquires knowledge for word translation disambiguation from parallel corpora. Word translation disambiguation is based on probabilities calculated from the word alignment, phrase pair extraction, and the language model; however, much broad context/domain information is not considered. Carpuat and Wu (2007) proposed context-dependent phrasal translation lexicons by introducing context-dependent features into statistical machine translation.

Unsupervised methods using dictionaries and corpora have been proposed for monolingual WSD (Ide and Veronis, 1998). They used grammatical information, including parts-of-speech, syntactically related words, and co-occurring words, as clues for WSD. Our method uses a part of these clues for bilingual WSD and word translation disambiguation.

Li and Li (2002) constructed a classifier for word translation disambiguation by using a bilingual dictionary with bootstrapping techniques. We also conducted a recursive calculation, treating the bilingual dictionary as the seeds of the iteration.

Vickrey et al. (2005) introduced context as a feature for a statistical MT system and generated word-level translations. How to introduce word-level translation disambiguation into sentence-level translation is a considerable problem.

6 Conclusion

In this paper, we described a method for adding word translation disambiguation information to a bilingual lexicon by considering the associated words that co-occur with the input word. We based our method on two assumptions: "parallel word associations" and "one sense per word association." We aligned word associations using a bilingual dictionary and constructed a correlation matrix for each word in the source language for word translation disambiguation. Experiments showed that our method was applicable both to common newspaper text and to domain-specific text, and to two language pairs. The Jaccard/Dice coefficients proved to be more effective than the other metrics as word association scores. Future work includes extending our method to handle verbs as input words by introducing syntactic co-occurrence. Comparisons with other disambiguation methods and with machine translation systems would strengthen the case for the effectiveness of our method. We also plan evaluations on real NLP tasks, including machine translation.

Acknowledgments

This work was partially supported by the Japanese/Chinese Machine Translation Project in Special Coordination Funds for Promoting Science and Technology (MEXT, Japan).

References

Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.

Carpuat, Marine and Dekai Wu. 2007. Context-dependent phrasal translation lexicons for statistical machine translation. In Proc. of Machine Translation Summit XI, pages 73-80.

Church, Kenneth W., William Gale, Patrick Hanks and Donald Hindle. 1991. Using statistics in lexical analysis. Lexical Acquisition: Using On-line Resources to Build a Lexicon, pages 115-164.

Church, Kenneth W. and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22-29.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74.

Ide, Nancy and Jean Veronis. 1998. Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics, 24(1):1-40.

Kaji, Hiroyuki and Yasutsugu Morimoto. 2002. Unsupervised word sense disambiguation using bilingual comparable corpora. In Proc. of the 19th International Conference on Computational Linguistics, pages 411-417.

Kruengkrai, Canasai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara. 2009. An error-driven word-character hybrid model for joint Chinese word segmentation and POS tagging. In Proc. of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 513-521.

Li, Cong and Hang Li. 2002. Word translation disambiguation using bilingual bootstrapping. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pages 343-351.

Rapp, Reinhard. 1995. Identifying word translations in non-parallel texts. In Proc. of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 320-322.

Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proc. of the 1st International Conference on New Methods in Natural Language Processing.

Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.

Smadja, Frank, Kathleen R. McKeown and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):3-38.

Vickrey, David, Luke Biewald, Marc Teyssier, and Daphne Koller. 2005. Word-sense disambiguation for machine translation. In Proc. of the Conference on HLT/EMNLP, pages 771-778.

Yarowsky, David. 1993. One sense per collocation. In Proc. of the ARPA Human Language Technology Workshop, pages 266-271.

Construction of bilingual multimodal corpora of referring expressions in collaborative problem solving

TOKUNAGA Takenobu, IIDA Ryu, YASUHARA Masaaki, TERAI Asuka
{take,ryu-i,yasuhara}@cl.cs.titech.ac.jp, [email protected]
Tokyo Institute of Technology

David MORRIS, Anja BELZ
[email protected], [email protected]
University of Brighton

Abstract

This paper presents on-going work on constructing bilingual multimodal corpora of referring expressions in collaborative problem solving for English and Japanese. The corpora were collected from dialogues in which two participants collaboratively solved Tangram puzzles with a puzzle simulator. Extra-linguistic information such as operations on puzzle pieces, mouse cursor position and piece positions was recorded in synchronisation with utterances. The speech data was transcribed and time-aligned with the extra-linguistic information. Referring expressions in utterances that refer to puzzle pieces were annotated in terms of their spans, their referents and their other attributes. The Japanese corpus has already been completed, but the English counterpart is still undergoing annotation. We have conducted a preliminary comparative analysis of both corpora, mainly with respect to task completion time, task success rates and attributes of referring expressions. The corpora showed significant differences in task completion time and success rate.

1 Introduction

A referring expression (RE) is a linguistic device that refers to a certain object of interest (e.g. used in describing where the object is located in space). REs have attracted a great deal of attention in both language analysis and language generation research. In language analysis research, reference resolution, particularly anaphora resolution (Mitkov, 2002), has a long research history reaching back to the mid-1970s (Hobbs, 1978). Much research has been conducted from both theoretical and empirical perspectives, mainly concerning the identification of antecedents or entities mentioned within the same text. This trend, targeting reference resolution in written text, is still dominant in language analysis, perhaps because such techniques are intended for use in applications such as information extraction.

In contrast, in language generation research, interest has recently shifted from the generation of one-off references to entities to the generation of REs in discourse context (Belz et al., 2010) and to investigating human referential behaviour in real-world situations, with the aim of using such techniques in applications like human-robot interaction (Piwek, 2007; Foster et al., 2008; Bard et al., 2009).

In both analysis and generation, machine-learning approaches have come to replace rule-based approaches as the predominant research trend since the 1990s. This trend has made annotated corpora an indispensable component of research for training and evaluating proposed methods. In fact, research on reference resolution has developed significantly as a result of large-scale corpora, e.g. those provided by the Message Understanding Conference (MUC; http://www.nlpir.nist.gov/related_projects/muc/) and the Automatic Content Extraction (ACE; http://www.itl.nist.gov/iad/tests/ace/) project. These corpora were constructed primarily for information extraction research, and thus were annotated with co-reference relations within texts. Also in the language generation community, several corpora

have been developed (Di Eugenio et al., 2000; Byron, 2005; van Deemter et al., 2006; Foster and Oberlander, 2007; Foster et al., 2008; Stoia et al., 2008; Spanger et al., 2009a; Belz et al., 2010). Unlike the MUC and ACE corpora, many of these were collected from situated dialogues and therefore include multimodal information (e.g. gestures and eye-gaze) beyond transcribed text alone (Martin et al., 2007). Foster and Oberlander (2007) emphasised that any corpus for language generation should include all possible contextual information at the appropriate granularity. Since constructing a dialogue corpus generally requires experiments for data collection, this kind of corpus tends to be small-scale compared with corpora for reference resolution.

Against this background, we have been developing multimodal corpora of referring expressions in collaborative problem-solving settings. This paper presents on-going work on constructing bilingual (English and Japanese) comparable corpora in this domain. We achieve our goal by replicating, for the English corpus, the same process of data collection and annotation as we used for our existing Japanese corpus (Spanger et al., 2009a). Our aim is to create bilingual multimodal corpora collected from dialogues in dynamic situations. From the point of view of reference analysis, our corpora augment the available multimodal dialogue corpora annotated with reference relations, which have been few in number compared to other types of text corpora. From the point of view of reference generation, our corpora increase the resources that can be used to further research of this kind. In addition, our corpora support comparative studies of human referential behaviour in different languages.

The structure of the paper is as follows. Section 2 describes the experimental set-up for data collection, which was introduced in our previous work (Spanger et al., 2009a); the setting is basically the same for the construction of the English corpus. Section 3 explains the annotation scheme adopted in our corpora, followed by a description of a preliminary analysis of the corpora in Section 4. Section 5 briefly mentions related work to highlight the characteristics of our corpora. Finally, Section 6 concludes the paper and looks at possible future directions.

[Figure 1: Screenshot of the Tangram simulator]

2 Data collection

2.1 Experimental set-up

We recruited subjects in pairs of friends and colleagues. Each pair was instructed to solve Tangram puzzles collaboratively. Tangram puzzles are geometrical puzzles that originated in ancient China. The goal of a Tangram puzzle is to construct a given goal shape by arranging seven simple shapes, as shown in Figure 1. The pieces comprise two large triangles, a medium-sized triangle, two small triangles, a parallelogram and a square.

With the aim of recording the precise position of every piece and every action the participants made during the solving process, we implemented a Tangram simulator in which the pieces can be moved, rotated and flipped with simple mouse operations on a computer display. The simulator displays two areas: a goal shape area and a working area where the pieces can be manipulated and their movements are shown in real time.

We assigned a different role to each participant of a pair: one acted as the solver and the other as the operator. The operator has a mouse for manipulating Tangram pieces, but does not have the goal shape on the screen. The solver has the goal shape on the screen but does not have a mouse. This setting naturally leads to a situation where, given a certain goal shape, the solver thinks of the necessary arrangement of the pieces and gives instructions to the operator on how to move them, while the operator manipulates the pieces with the mouse

according to the solver's instructions.

As we mentioned in our previous study (Spanger et al., 2009a), this interaction produces frequent use of referring expressions intended to distinguish specific pieces of the puzzle. In our Tangram simulator, all pieces are of the same color; thus color is not useful for identifying a specific piece, i.e. only size and shape are discriminative object-intrinsic attributes. Instead, we can expect other attributes, such as spatial relations and deictic reference, to be used more often.

Each pair of participants sat side by side, as shown in Figure 2. Each participant had his/her own computer display showing the shared working area. A room-divider screen was set between the solver (right side) and the operator (left side) to prevent the operator from seeing the goal shape on the solver's screen, and to restrict their interaction to speech only.

[Figure 2: Picture of the experiment setting]

Each participant pair was assigned 4 trials consisting of two symmetric and two asymmetric goal shapes, as shown in Figure 3. In Cognitive Science, a wide variety of puzzles have been employed extensively in the field of insight problem solving. This has been termed the "puzzle-problem approach" (Sternberg and Davidson, 1996; Suzuki et al., 2001) and, in the case of physical puzzles, has relatively often involved puzzle tasks with symmetric shapes, like the so-called T-puzzle, e.g. (Kiyokawa and Nakazawa, 2006). In more recent work, Tangram puzzles have been used as a means to study various new aspects of human problem-solving approaches, including the collection of eye-gaze information (Baran et al., 2007). In order to collect data as broadly as possible in this context, we set up puzzle-problems including both symmetrical and asymmetrical shapes, as shown in Figure 3.

The participants exchanged their roles after two trials, i.e. a participant first solves a symmetric and then an asymmetric puzzle as the solver, and then does the same as the operator, and vice versa. The order of the puzzle trials is the same for all pairs.

Before starting the first trial as the operator, each participant had a short training exercise in order to learn how to manipulate pieces with the mouse. The initial arrangement of the pieces was randomised every time. We set a time limit of 15 minutes for the completion of each trial (i.e. construction of the goal shape). In order to prevent the solver from getting into deep thought and keeping silent, the simulator is designed to give a hint every five minutes by showing a correct piece position in the goal shape area. After 10 minutes have passed, a second hint is provided, while the previous hint disappears. A trial ends when the goal shape is complete or the time is up. Utterances by the participants are recorded separately in stereo through headset microphones, in synchronisation with the positions of the pieces and the mouse operations. Piece positions and mouse actions were automatically recorded by the simulator at intervals of 10 msec.

[Figure 3: The goal shapes (a)-(d) given to the subjects]

Table 1: The ELAN tiers of the corpus. Indentation of a tier denotes parent-child relations.

  Tier       meaning
  OP-UT      utterances by the operator
  SV-UT      utterances by the solver
  OP-REX     referring expressions by the operator
    OP-Ref     referents of OP-REX
    OP-Attr    attributes of OP-REX
  SV-REX     referring expressions by the solver
    SV-Ref     referents of SV-REX
    SV-Attr    attributes of SV-REX
  Action     action on a piece
    Target     the target piece of Action
  Mouse      the piece on which the mouse is hovering

Table 2: Attributes of referring expressions.

  dpr  : demonstrative pronoun, e.g. "the same one", "this", "that", "it"
  dad  : demonstrative adjective, e.g. "that triangle"
  siz  : size, e.g. "the large triangle"
  typ  : type, e.g. "the square"
  dir  : direction of a piece, e.g. "the triangle facing the left"
  prj  : projective spatial relation (including directional prepositions or nouns such as "right", "left", "above", ...), e.g. "the triangle to the left of the square"
  tpl  : topological spatial relation (including non-directional prepositions or nouns such as "near", "middle", ...), e.g. "the triangle near the square"
  ovl  : overlap, e.g. "the small triangle under the large one"
  act  : action on pieces, e.g. "the triangle that you are holding now", "the triangle that you just rotated"
  cmp  : complement, e.g. "the other one"
  sim  : similarity, e.g. "the same one"
  num  : number, e.g. "the two triangles"
  rpr  : repair, e.g. "the big, no, small triangle"
  err  : obvious erroneous expression, e.g. "the square" referring to a triangle
  nest : nested expression; when a referring expression includes another referring expression, only the outermost expression is annotated with this attribute, e.g. "(the triangle to the left of (the small triangle))"
  meta : metaphorical expression, e.g. "the leg", "the head"

2.2 Subjects and collected data

For our Japanese corpus, we recruited 12 Japanese graduate students of a Cognitive Science department, 4 females and 8 males, and split them into 6 pairs. All pairs knew each other previously and were of the same sex and approximately the same age. (In Japan, the relationship of senior to junior, or of socially higher to lower placed, might affect language use. We carefully recruited pairs to avoid the effects of such social relationships, such as the possible use of overly polite and indirect language, reluctance to correct mistakes, etc.) We collected 24 dialogues (4 trials by 6 pairs) of about 4 hours and 16 minutes. The average length of a dialogue was 10 minutes 40 seconds (SD = 3 minutes 18 seconds).

For the comparable English corpus, we recruited 12 native English speakers of various occupations, 6 males and 6 females; their average age was 30. There were 6 pairs, all of whom knew each other beforehand except for one pair. Whereas during the creation of the Japanese corpus we had to give extra attention to ensuring that social relationships did not have an impact on how the subjects communicated with one another, for the English corpus there was no such concern. We collected 24 dialogues (4 trials by 6 pairs) of 5 hours and 7 minutes total length. The average length of a dialogue was 12 minutes 47 seconds (SD = 3 minutes 34 seconds).

3 Annotation

The recorded speech data was transcribed, and the referring expressions were annotated with the web-based multi-purpose annotation tool SLAT (Noguchi et al., 2008). (We did not use SLAT for the English corpus annotation; instead, ELAN was used directly for annotating referring expressions.) Our target expressions in this corpus are referring expressions that refer to a puzzle piece or a set of puzzle pieces. We do not deal with expressions referring to a location, a part of a piece or a constructed shape; these expressions are put aside for future work. The annotation of referring expressions is three-fold: (1) identification of the span of expressions, (2) identification of their referents, and (3) assignment of a set of attributes to each referring expression.

Using the multimodal annotation tool ELAN (http://www.lat-mpi.eu/tools/elan/), the annotations of referring expressions were then merged with the extra-linguistic data recorded by the Tangram simulator. The available extra-linguistic information from the simulator consists of (1) the action on a piece, (2) the coordinates of the mouse cursor and (3) the position of each piece in the working area. Actions and mouse cursor positions are recorded at intervals of 10 msec, and are abstracted into (1) a time span labeled with an action symbol ("move", "rotate" or "flip") and its target piece number (1-7), and (2) a time span labeled with the piece number which is under the mouse cursor during that span. A sketch of this abstraction is given below.
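The following is a minimal sketch of this kind of abstraction, collapsing the 10-msec samples into labeled time spans. It assumes the log is a simple list of (time in msec, label) pairs; the record format and the function name are ours, not the simulator's.

    def to_spans(samples, step=10):
        # Collapse (time_ms, label) samples taken every `step` msec into
        # (start_ms, end_ms, label) spans.  A label is e.g. ("move", 3)
        # for an action on piece 3, a piece number for mouse hovering,
        # or None when nothing is happening.
        spans = []
        start, current = None, None
        for t, label in samples:
            if label != current:
                if current is not None:
                    spans.append((start, t, current))
                start, current = t, label
        if current is not None:
            spans.append((start, samples[-1][0] + step, current))
        return spans

    samples = [(0, None), (10, 3), (20, 3), (30, 3), (40, None), (50, 5)]
    print(to_spans(samples))   # [(10, 40, 3), (50, 60, 5)]

Spans produced this way can be loaded into ELAN as the Action, Target and Mouse tiers, time-aligned with the utterance tiers.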

The positions of the pieces are updated and recorded with a timestamp whenever the position of any piece changes. Information about piece positions is not merged into the ELAN files and is kept in separate files. As a result, we have 11 time-aligned ELAN tiers, as shown in Table 1.

Two annotators (two of the authors) first annotated four Japanese dialogues separately and, based on a discussion of the discrepancies, decided on the following criteria to identify a referring expression:

- The minimum span of a noun phrase including the information necessary to identify a referent is annotated. The span might include repairs with their reparandum and disfluency (Nakatani and Hirschberg, 1993) if needed.
- Demonstrative adjectives are included in expressions.
- Erroneous expressions are annotated with a special attribute.
- An expression without a definite referent (i.e. a group of possible referents, or none) is assigned a referent number sequence consisting of a prefix followed by the sequence of possible referents, if any are present.
- All expressions appearing in muttering to oneself are excluded.

Table 2 shows the list of attributes of referring expressions used in annotating the corpus.

The remaining 20 Japanese dialogues were annotated by two of the authors, and discrepancies were resolved by discussion. Four English dialogues have been annotated so far by one of the authors.

4 Preliminary corpus analysis

We have already completed the Japanese corpus, which is named REX-J (2008-08), but only 4 out of the 24 dialogues have been annotated for the English counterpart (REX-E (2010-03)). Table 3 shows a summary of the trials. The horizontal lines divide the trials by pairs; "o" in the "success" column denotes that the trial was successfully completed within the time limit (15 minutes), and the "OP-REX" and "SV-REX" columns show the numbers of referring expressions used by the operator and the solver, respectively. The following subsections describe a preliminary comparison of the English and Japanese corpora.

Table 3: Summary of trials.

  ID    time     success  OP-REX  SV-REX |  ID    time     success  OP-REX  SV-REX
  E01   15:00                            |  J01    8:40    o         10      48
  E02   15:00                            |  J02   11:49    o          7      55
  E03   15:00                            |  J03   11:36    o          5      26
  E04   15:00                            |  J04    7:31    o          2      21
  E05   15:00                            |  J05   15:00              23      78
  E06   15:00                            |  J06   11:12    o          5      60
  E07   15:00                            |  J07   12:11    o          3      59
  E08   15:00                            |  J08   11:20    o          4      61
  E09   10:39    o                       |  J09   14:59    o         36      84
  E10   15:00                            |  J10    6:20    o          3      47
  E11   15:00                            |  J11    5:21    o          2      14
  E12    8:30    o                       |  J12   13:40    o         37      77
  E13   14:33    o          8      95    |  J13   15:00               8      56
  E14    7:27    o          1      62    |  J14    4:48    o          1      29
  E15   14:02    o         16     127    |  J15    9:30    o         20      39
  E16    3:57    o          1      31    |  J16    5:07    o          3      17
  E17   13:00    o                       |  J17   13:37    o         10      46
  E18    6:40    o                       |  J18    8:57    o          4      51
  E19   15:00                            |  J19    8:02    o          0      37
  E20   12:32    o                       |  J20   11:23    o          1      59
  E21   15:00                            |  J21   10:12    o          7      71
  E22   15:00                            |  J22   10:24    o          9      64
  E23   15:00                            |  J23   15:00               0      69
  E24    5:36    o                       |  J24   14:22    o          0      76
  Ave.  12:47               6.5    78.8  |  Ave.  10:40               8.3    51.8
  SD     3:34               7.14   41.4  |  SD     3:18              10.4    20.1
  Total 5:06:56  10        26     315    |  Total 4:16:01  21       200   1,244

4.1 Task performance

We conducted a two-way ANOVA with the task completion time as the dependent variable, and the goal shape and the language as the independent variables. Only the main effect of the language was significant (F(1, 40) = 5.82, p < 0.05). Table 4 shows the average and the standard deviation of the completion time. Note that we set a time limit (15 minutes) for solving the puzzle; we counted the completion time as 15 minutes even when a puzzle was not actually solved within the time limit. We also conducted a two-way ANOVA using only the successful cases; neither main effect nor the interaction was significant.

Table 4: Task completion time, average (SD) in seconds.

  Lang. \ Shape  (a)            (b)            (c)            (d)
  English        832.0 (105.4)  741.2 (246.5)  890.3 (23.7)   605.8 (287.2)
  Japanese       774.7 (167.3)  535.0 (168.5)  571.7 (242.2)  633.8 (215.2)

We then conducted an ANOVA with the number of puzzles successfully solved by each pair as the dependent variable and the language as the independent variable. The main effect was significant (F(1, 10) = 6.79, p < 0.05). Table 5 shows the average number of solved goals per pair and the success rate, with their standard deviations in parentheses.

Table 5: The numbers of solved trials and success rates, average (SD).

  Lang.     solved trials  success rate [%]
  Japanese  3.50 (0.55)    87.5 (13.7)
  English   1.67 (1.63)    41.7 (40.8)

Finally, we conducted an ANOVA with the number of pairs who succeeded in solving a goal shape as the dependent variable and the goal shape as the independent variable. The main effect was not significant.

In summary, we found a difference in task performance between the languages in terms of both task completion time and success rate, but no difference among the goal shapes. This difference could be explained by the diversity of the subjects rather than by the difference of languages. The Japanese subject group consisted of university graduate students from the same department (Cognitive Science) and of roughly the same age (average = 23.3, SD = 1.5). In contrast, the English subjects had diverse backgrounds (e.g. high school students, university faculty, a writer, a programmer, etc.) and ages (average = 30.8, SD = 11.7). In addition, familiarity with this kind of geometric puzzle might have some effect; however, since we collected puzzle familiarity only from the English subjects, we could not conduct further analysis from this viewpoint. In this respect, the independent variable should have been named "subject group" instead of "language". A sketch of this kind of analysis is given below.
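Such analyses are straightforward to reproduce. A sketch using statsmodels follows, with one row per trial; the column names and the completion times are ours (hypothetical), not the corpus format, and we use two observations per cell only to keep the toy example runnable (the study had six pairs per language).

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # Unsolved trials are capped at 900 s (15 minutes), as in the analysis above.
    df = pd.DataFrame({
        "time":  [900, 764, 900, 582, 890, 891, 900, 312,
                  900, 650, 702, 368, 810, 333, 849, 419],
        "lang":  ["en"] * 8 + ["ja"] * 8,
        "shape": ["a", "a", "b", "b", "c", "c", "d", "d"] * 2,
    })

    model = ols("time ~ C(lang) * C(shape)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))   # F and p for each effect

The type-II ANOVA table reports the main effects of language and shape and their interaction, mirroring the three tests reported in this subsection.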

4.2 Referring expressions

It is important to note that, since we have only completed the annotation of four dialogues, all by one pair of subjects, our analyses of referring expressions are tentative and pending further analysis.

We have 200 and 1,244 referring expressions by the operator and the solver respectively, 1,444 in total, in the 24 Japanese dialogues. On the other hand, we have 26 (operator) and 315 (solver) referring expressions in the 4 English dialogues. The average number of referring expressions per dialogue in Table 3 suggests that the English subjects use more referring expressions than the Japanese subjects. Since we have only the data from a single pair, we cannot say whether this tendency applies to the other pairs; we cannot draw a decisive conclusion until we complete the annotation of the English corpus.

Table 6 shows the total frequencies of the attributes and their frequencies per dialogue. The table gives the impression of a significantly more frequent use of demonstrative pronouns (dpr) by the English subjects. The Japanese subjects use more attributes of projective spatial relations (prj) and of actions on the referent (act); we called such expressions action-mentioning expressions (AMEs) in our previous work. The English subjects use more complement attributes (cmp) as well as more number attributes (num).

Table 6: Comparison of attribute distribution.

  attribute  English (4 dialogues)   Japanese (24 dialogues)
             frq     frq/dlg         frq     frq/dlg
  dpr        226     56.5            678     28.3
  dad         29      7.3            178      7.4
  siz         68     17.0            288     12.0
  typ        103     25.8            655     27.3
  dir          0      0                7      0.3
  prj         10      2.5            141      5.9
  tpl          4      1                9      0.4
  ovl          0      0                2      0.1
  act          5      1.3            103      4.3
  cmp         17      4.3             33      1.4
  sim          0      0                7      0.3
  num         22      5.5             35      1.5
  rpr          0      0                1      0
  err          0      0                1      0
  nest         1      0.3             31      1.3
  meta         1      0.3              6      0.3

5 Related work

Over the last decade, with a growing recognition that referring expressions frequently appear in collaborative task dialogues (Clark and Wilkes-Gibbs, 1986; Heeman and Hirst, 1995), a number of corpora have been constructed to study the nature of their use. This tendency also reflects the recognition that this area yields both challenging research topics and promising applications such as human-robot interaction (Foster et al., 2008; Kruijff et al., 2010).

The COCONUT corpus (Di Eugenio et al., 2000) was collected from keyboard dialogues between two participants who worked together on a simple 2-D design task, buying and arranging furniture for two rooms. The COCONUT corpus is limited to annotations that describe symbolic object information, such as object-intrinsic attributes and location in discrete co-ordinates. As an initial work on constructing a corpus for collaborative tasks, the COCONUT corpus can be characterised as having a rather simple domain as well as limited annotation.

The QUAKE corpus (Byron, 2005) and its successor, the SCARE corpus (Stoia et al., 2008), deal with a more complex domain, in which two participants collaboratively play a treasure-hunting game in a 3-D virtual world. Despite the complexity of the domain, the participants were allowed only limited actions, e.g. moving a step forward, pushing a button, etc.

As a part of the JAST project, the Joint Construction Task (JCT) corpus was created from dialogues in which two participants constructed a puzzle (Foster et al., 2008). The setting of the experiment is quite similar to ours, except that both participants have even roles. Since our main concern is referring expressions, we believe our asymmetric setting elicits more referring expressions than the symmetric setting of the JCT corpus.

In contrast to these previous corpora, our corpora record a wide range of information useful for the analysis of human reference behaviour in situated dialogue. While the domain of our corpora is simple compared to the QUAKE and SCARE corpora, we allowed comparatively large flexibility in the actions necessary for achieving the goal shape (i.e. flipping, turning and moving puzzle pieces to different degrees), relative to the complexity of the domain. Providing this relatively larger freedom of action to the participants, together with the recording of detailed information, allows research into new aspects of referring expressions.

As for the multilingual aspect, all the above corpora are English. There have been several recent attempts at collecting multilingual corpora in situated domains. For instance, Gargett et al. (2010) collected German and English corpora in the same setting; their domain is similar to the QUAKE corpus. Van der Sluis et al. (2009) aim at a comparative study of referring expressions between English and Japanese; their domain is still static at the moment. Our corpora aim at dealing with the dynamic nature of situated dialogues between very different languages, English and Japanese.

Table 7: The REX-J corpus family. "#valid" denotes the number of dialogues with valid eye-gaze data.

  name      puzzle     #pairs  #dialg.  #valid  status
  T2008-08  Tangram    6       24       24      completed
  T2009-03  Tangram    10      40       16      completed
  T2009-11  Tangram    10      36       27      validating
  N2009-11  Tangram    5       20       8       validating
  P2009-11  Polyomino  7       28       24      annotating
  D2009-11  2-Tangram  7       42       24      annotating

6 Conclusion and future work

This paper presented an overview of our English-Japanese bilingual multimodal corpora of referring expressions in a collaborative problem-solving setting. The Japanese corpus has been completed and has already been used for research (Spanger et al., 2009b; Spanger et al., 2010; Iida et al., 2010), but the English counterpart is still undergoing annotation. We have also presented a preliminary comparative analysis of these corpora in terms of task performance and the usage of referring expressions. We found a significant difference in task performance, which could be attributed to the difference in the diversity of the subjects. We have only tentative results on the usage of referring expressions, since only four English dialogues are available at the moment.

The data collection experiments were conducted in August 2008 for Japanese and in March 2010 for English. Between these periods, we conducted various data collections to build different types of Japanese corpora (March 2009 and November 2009). These experiments involve capturing eye-gaze information of the participants during problem solving, and introducing variants of the puzzles: Polyomino, Double Tangram, and Tangram without any hints (N2009-11 in Table 7). They are also under preparation for publication. Table 7 gives an overview of the REX-J corpus family. Eye-gaze data is difficult to capture cleanly throughout a dialogue; we discarded dialogues in which eye-gaze was captured successfully for less than 70% of the total time of the dialogue. That is, we annotated, or will annotate, only dialogues with validated eye-gaze data.

These corpora enable research on utilising eye-gaze information in reference resolution and generation, and also evaluation in different tasks (puzzles). We are planning to distribute the REX-J corpus family through GSK (the Language Resources Association in Japan, http://www.gsk.or.jp/index_e.html), and the REX-E corpus from both the University of Brighton and GSK.

References

Baran, Bahar, Berrin Dogusoy, and Kursat Cagiltay. 2007. How do adults solve digital tangram problems? Analyzing cognitive strategies through eye tracking approach. In HCI International 2007 - 12th International Conference - Part III, pages 555-563.

Bard, Ellen Gurman, Robin Hill, Manabu Arai, and Mary Ellen Foster. 2009. Accessibility and attention in situated dialogue: Roles and regulations. In Proceedings of the Workshop on Production of Referring Expressions Pre-CogSci 2009.

Belz, Anja, Eric Kow, Jette Viethen, and Albert Gatt. 2010. Referring expression generation in context: The GREC shared task evaluation challenges. In Krahmer, Emiel and Mariët Theune, editors, Empirical Methods in Natural Language Generation, volume 5980 of Lecture Notes in Computer Science. Springer-Verlag, Berlin/Heidelberg.

Byron, Donna K. 2005. The OSU Quake 2004 corpus of two-party situated problem-solving dialogs. Technical report, Department of Computer Science and Engineering, The Ohio State University.

Clark, Herbert H. and Deanna Wilkes-Gibbs. 1986. Referring as a collaborative process. Cognition, 22:1-39.

Di Eugenio, Barbara, Pamela W. Jordan, Richmond H. Thomason, and Johanna D. Moore. 2000. The agreement process: An empirical investigation of human-human computer-mediated collaborative dialogues. International Journal of Human-Computer Studies, 53(6):1017-1076.

Foster, Mary Ellen and Jon Oberlander. 2007. Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Language Resources and Evaluation, 41(3-4):305-323, December.

Foster, Mary Ellen, Ellen Gurman Bard, Markus Guhe, Robin L. Hill, Jon Oberlander, and Alois Knoll. 2008. The roles of haptic-ostensive referring expressions in cooperative, task-based human-robot dialogue. In Proceedings of 3rd Human-Robot Interaction, pages 295-302.

Gargett, Andrew, Konstantina Garoufi, Alexander Koller, and Kristina Striegnitz. 2010. The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), pages 2401-2406.

Heeman, Peter A. and Graeme Hirst. 1995. Collaborating on referring expressions. Computational Linguistics, 21:351-382.

Hobbs, Jerry R. 1978. Resolving pronoun references. Lingua, 44:311-338.

Iida, Ryu, Shumpei Kobayashi, and Takenobu Tokunaga. 2010. Incorporating extra-linguistic information into reference resolution in collaborative task dialogue. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1259-1267.

Kiyokawa, Sachiko and Midori Nakazawa. 2006. Effects of reflective verbalization on insight problem solving. In Proceedings of the 5th International Conference of the Cognitive Science, pages 137-139.

Kruijff, Geert-Jan M., Pierre Lison, Trevor Benjamin, Henrik Jacobsson, Hendrik Zender, and Ivana Kruijff-Korbayova. 2010. Situated dialogue processing for human-robot interaction. In Cognitive Systems: Final report of the CoSy project, pages 311-364. Springer-Verlag.

Martin, Jean-Claude, Patrizia Paggio, Peter Kuehnlein, Rainer Stiefelhagen, and Fabio Pianesi. 2007. Special issue on multimodal corpora for modeling human multimodal behaviour. Language Resources and Evaluation, 41(3-4).

Mitkov, Ruslan. 2002. Anaphora Resolution. Longman.

Nakatani, Christine and Julia Hirschberg. 1993. A speech-first model for repair identification and correction. In Proceedings of the 31st Annual Meeting of ACL, pages 200-207.

Noguchi, Masaki, Kenta Miyoshi, Takenobu Tokunaga, Ryu Iida, Mamoru Komachi, and Kentaro Inui. 2008. Multiple purpose annotation using SLAT - Segment and link-based annotation tool. In Proceedings of the 2nd Linguistic Annotation Workshop, pages 61-64.

Piwek, Paul L. A. 2007. Modality choice for generation of referring acts. In Proceedings of the Workshop on Multimodal Output Generation (MOG 2007), pages 129-139.

Spanger, Philipp, Masaaki Yasuhara, Ryu Iida, and Takenobu Tokunaga. 2009a. A Japanese corpus of referring expressions used in a situated collaboration task. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 110-113.

Spanger, Philipp, Masaaki Yasuhara, Ryu Iida, and Takenobu Tokunaga. 2009b. Using extra linguistic information for generating demonstrative pronouns in a situated collaboration task. In Proceedings of PreCogSci 2009: Production of Referring Expressions: Bridging the gap between computational and empirical approaches to reference.

Spanger, Philipp, Ryu Iida, Takenobu Tokunaga, Asuka Terai, and Naoko Kuriyama. 2010. Towards an extrinsic evaluation of referring expressions in situated dialogs. In Kelleher, John, Brian Mac Namee, and Ielka van der Sluis, editors, Proceedings of the Sixth International Natural Language Generation Conference (INLG 2010), pages 135-144.

Sternberg, Robert J. and Janet E. Davidson, editors. 1996. The Nature of Insight. The MIT Press.

Stoia, Laura, Darla Magdalene Shockley, Donna K. Byron, and Eric Fosler-Lussier. 2008. SCARE: A situated corpus with annotated referring expressions. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), pages 28-30.

Suzuki, Hiroaki, Keiga Abe, Kazuo Hiraki, and Michiko Miyazaki. 2001. Cue-readiness in insight problem-solving. In Proceedings of the 23rd Annual Meeting of the Cognitive Science Society, pages 1012-1017.

van Deemter, Kees, Ielka van der Sluis, and Albert Gatt. 2006. Building a semantically transparent corpus for the generation of referring expressions. In Proceedings of the Fourth International Natural Language Generation Conference, pages 130-132.

van der Sluis, Ielka, Junko Nagai, and Saturnino Luz. 2009. Producing referring expressions in dialogue: Insights from a translation exercise. In Proceedings of PreCogSci 2009: Production of Referring Expressions: Bridging the gap between computational and empirical approaches to reference.

Labeling Emotion in Bengali Blog Corpus – A Fine Grained Tagging at Sentence Level

Dipankar Das
Department of Computer Science & Engineering, Jadavpur University
[email protected]

Sivaji Bandyopadhyay
Department of Computer Science & Engineering, Jadavpur University
[email protected]

Abstract

Emotion, the private state of a human entity, is becoming an important topic in Natural Language Processing (NLP) with the increasing use of search engines. The present task aims to manually annotate the sentences in a web-based Bengali blog corpus with emotional components such as the emotional expression (word/phrase), intensity, associated holder and topic(s). Ekman's six emotion classes (anger, disgust, fear, happy, sad and surprise), along with three types of intensity (high, general and low), are considered for the sentence-level annotation. The presence of discourse markers, punctuation marks, negations, conjuncts, reduplication, rhetoric knowledge and especially emoticons plays a contributory role in the annotation process. Different types of fixed and relaxed strategies have been employed to measure the agreement on the sentential emotions, intensities, emotion holders and topics, respectively. Experimental results for each emotion class at the word level on a small set of the whole corpus have been found satisfactory.

1 Introduction

Human emotion described in texts is an important cue for our daily communication, but the identification of emotional state from texts is not an easy task, as emotion is not open to any objective observation or verification (Quirk et al., 1985). Emails, weblogs, chat rooms, online forums and even Twitter are considered affective communication substrates for analyzing the reaction to emotional catalysts. Among these media, the blog is one of the communicative and informative repositories of text-based emotional content on the Web 2.0 (Lin et al., 2007).

Rapidly growing web user communities speaking many languages have focused attention on improving multilingual search engines on the basis of sentiment or emotion. Major studies on opinion mining and sentiment analysis have been attempted from more focused perspectives rather than fine-grained emotions. The analysis of emotion or sentiment requires some basic resources; an emotion-annotated corpus is one of the primary ones to start with.

The proposed annotation task has been carried out at the sentence level. Three annotators have manually annotated Bengali blog sentences, retrieved from a web blog archive (www.amarblog.com), with Ekman's six basic emotion tags (anger (A), disgust (D), fear (F), happy (H), sad (Sa) and surprise (Su)). The emotional sentences are tagged with three types of intensity: high, general and low. Sentences of the non-emotional (neutral) and multiple (mixed) categories are also identified. The identification of emotional words or phrases and the fixing of the scope of emotional expressions in the sentences are also carried out in the present task. Each emoticon is considered an individual emotional expression. The emotion holder and the relevant topics associated with the emotional expressions are annotated considering the punctuation marks, conjuncts, rhetorical structures and other discourse information. The knowledge of rhetorical structure helps in removing subjective discrepancies from the writer's point of view.

The annotation scheme has been used to annotate 123 blog posts containing 4,740 emotional sentences with a single emotion tag and 322 emotional sentences with mixed emotion tags, along with 7,087 neutral sentences in Bengali. Three types of standard agreement measures, namely Cohen's kappa (κ) (Cohen, 1960; Carletta, 1996), Measure of Agreement on Set-valued Items (MASI) (Passonneau, 2004) and agr (Wiebe et al., 2005), are employed for the annotated emotion-related components; sketches of the first two are given below. The relaxed agreement schemes, MASI and agr, are specially considered for fixing the boundaries of emotional expressions and topic spans in the emotional sentences.
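For reference, here are minimal Python sketches of two of these measures: unweighted Cohen's kappa for the sentence-level emotion tags, and MASI for set-valued items such as expression or topic spans. The function names and the toy inputs are illustrative only; the agr metric and any weighted variants used in the paper are not sketched here.

    def cohen_kappa(a, b):
        # kappa = (P_o - P_e) / (1 - P_e) for two annotators' label sequences.
        n = len(a)
        p_o = sum(x == y for x, y in zip(a, b)) / n
        labels = set(a) | set(b)
        p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
        return (p_o - p_e) / (1 - p_e)

    def masi(s1, s2):
        # MASI (Passonneau, 2004) = Jaccard * monotonicity weight, for
        # set-valued items such as the token span of an emotional expression.
        s1, s2 = set(s1), set(s2)
        if not s1 and not s2:
            return 1.0
        j = len(s1 & s2) / len(s1 | s2)
        if s1 == s2:
            w = 1.0
        elif s1 <= s2 or s2 <= s1:
            w = 2 / 3
        elif s1 & s2:
            w = 1 / 3
        else:
            w = 0.0
        return j * w

    print(cohen_kappa(["H", "Sa", "H", "A"], ["H", "Sa", "Su", "A"]))  # ~0.667
    print(masi({"khub", "bhalo"}, {"bhalo"}))  # subset span: 0.5 * 2/3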

The knowledge of rhetorical structure helps in removing the subjective discrepancies from the writer's point of view. The annotation scheme is used to annotate 123 blog posts containing 4,740 emotional sentences bearing a single emotion tag and 322 emotional sentences with mixed emotion tags, along with 7,087 neutral sentences in Bengali. Three types of standard agreement measures, namely Cohen's kappa (κ) (Cohen, 1960; Carletta, 1996), the Measure of Agreement on Set-valued Items (MASI) (Passonneau, 2004) and agr (Wiebe et al., 2005), are employed for the annotated emotion-related components. The relaxed agreement schemes, MASI and agr, are specially considered for fixing the boundaries of emotional expressions and topic spans in the emotional sentences. The inter-annotator agreement for some emotional components such as sentential emotions, holders and topics shows satisfactory performance, but the sentences of mixed emotion and the general and low intensities show disagreement. A preliminary experiment on word-level emotion classification on a small set of the whole corpus yielded satisfactory results.

The rest of the paper is organized as follows. Section 2 describes the related work. The annotation of emotional expressions, sentential emotion and intensities is described in Section 3. In Section 4, the annotation scheme for the emotion holder is described. The issues of emotional topic annotation are discussed in Section 5. Section 6 describes the preliminary experiments carried out on the annotated corpus. Finally, Section 7 concludes the paper.

2 Related Work

One of the most well-known tasks of annotating private states in texts is carried out by Wiebe et al. (2005). They manually annotated the private states, including emotions, opinions and sentiment, in a 10,000-sentence corpus (the MPQA corpus) of news articles. The opinion holder information is also annotated in the MPQA corpus, but the topic annotation task was initiated later by Stoyanov and Cardie (2008a). In contrast, the present annotation strategy includes fine-grained emotion classes and specially handles the emoticons present in the blog posts.

Alm et al. (2005) have considered eight emotion categories (angry, disgusted, fearful, happy, sad, positively surprised, negatively surprised) to accomplish an emotion annotation task at sentence level. They manually annotated 1,580 sentences extracted from 22 Grimms' tales. The present approach discusses the issues of annotating unstructured blog text considering rhetoric knowledge along with attributes such as negation, conjuncts and reduplication.

Mishne (2005) experimented with mood classification in a blog corpus of 815,494 posts from Livejournal (http://www.livejournal.com), a free weblog service with a large community. Mihalcea and Liu (2006) have used the same data source for classifying the blog posts into two particular emotions, happiness and sadness. The blog posts are self-annotated by the blog writers with happy and sad mood labels. In contrast, the present approach includes Ekman's six emotions, emotion holders and topics to accomplish the whole annotation task.

Neviarouskaya et al. (2007) collected 160 sentences labeled with one of nine emotion categories (anger, disgust, fear, guilt, interest, joy, sadness, shame, and surprise) and a corresponding intensity value from a corpus of online diary-like blog posts. On the other hand, Aman and Szpakowicz (2007) prepared an emotion-annotated corpus with a rich set of emotion information such as category, intensity and word- or phrase-based expressions. The present task considers all the above emotion information during annotation, but additionally includes components like the emotion holder and single or multiple topic spans.

Emotion corpora for Japanese were built for recognizing emotions (Tokuhisa et al., 2008). An available emotion corpus in Chinese is Yahoo!'s Chinese news (http://tw.news.yahoo.com), which is used for Chinese emotion classification of news readers (Lin et al., 2007). The manual annotation of eight emotional categories (expect, joy, love, surprise, anxiety, sorrow, angry and hate), along with intensity, holder, word/phrase, degree word, negative word, conjunction, rhetoric, punctuation and other linguistic expressions, is carried out at sentence, paragraph as well as document level on 1,487 Chinese blog documents (Quan and Ren, 2009).

In addition to the above emotion entities, the present approach also includes the annotation of single or multiple emotion topics in a target span.

A recent study shows that non-native English speakers support the growing use of the Internet [2]. This raises the demand for linguistic resources for languages other than English. Bengali is the fifth most popular language in the world, the second in India and the national language of Bangladesh, but it is less computerized compared to English. To the best of our knowledge, at present there is no available corpus that is annotated with detailed linguistic expressions for emotion in Bengali or even for other Indian languages. Thus we believe that this corpus would help the development and evaluation of emotion analysis systems in Bengali.

[2] http://www.internetworldstats.com/stats.htm

3 Emotion Annotation

A random collection of 123 blog posts containing a total of 12,149 sentences was retrieved from a Bengali web blog archive [3] (especially from comics, politics, sports and short stories) to prepare the corpus. No prior training was provided to the annotators, but they were instructed to annotate each sentence of the blog corpus based on some illustrated samples of annotated sentences. Specifically for annotating the emotional expressions and topic(s) in emotional sentences, the annotators are free in selecting the text spans. This annotation scheme is termed the relaxed scheme. For the other emotional components, the annotators are given items with fixed text spans and instructed to annotate the items with definite tags.

[3] www.amarblog.com

3.1 Identifying Emotional Expressions for Sentential Emotion and Intensity

The identification of the emotion or affect affixed in text segments is a puzzle. But the puzzle can be solved partially using some lexical clues (e.g., discourse markers, punctuation marks (sym), negations (NEG), conjuncts (CONJ), reduplication (Redup)), structural clues (e.g., rhetoric and syntactic knowledge) and especially some direct affective clues (e.g., emoticons (emo_icon)). The identification of structural clues indeed requires the identification of lexical clues.

Rhetorical Structure Theory (RST) describes the various parts of a text and how they can be arranged and connected to form a whole text (Azar, 1999). The theory maintains that consecutive discourse elements, termed text spans, which can be in the form of clauses, sentences, or units larger than sentences, are related by a relatively small set (20–25) of rhetorical relations (Mann and Thompson, 1988). RST distinguishes between the part of a text that realizes the primary goal of the writer, termed the nucleus, and the part that provides supplementary material, termed the satellite. The separation of nucleus from satellite is done based on punctuation marks, emoticons, discourse markers (jehetu [as], jemon [e.g.], karon [because], mane [means]), conjuncts (ebong [and], kintu [but], athoba [or]) and causal verbs (ghotay [caused]) if they are explicitly specified in the sentences.

The use of emotion-related words is not the sole means of expressing emotion. Often a sentence which otherwise may not have an emotional word may become emotion-bearing depending on the context or underlying semantic meaning (Aman and Szpakowicz, 2007). An empirical analysis of the blog texts shows two types of emotional expressions. The first category contains explicitly stated emotion words (EW) or phrases (EP) mentioned in the nucleus or in the satellite. The other category contains implicit emotional clues that are identified based on the context or from the metaphoric knowledge of the expressions. Sometimes the emotional expressions contain direct emotion words (EW) (koutuk [joke], ananda [happy], ashcharjyo [surprise]), reduplications (Redup) (sanda sanda [doubt with fear]), question words (EW_Q) (ki [what], kobe [when]), colloquial words (kshyama [pardon]) and foreign words (thanku [thanks], gossya [anger]). On the other hand, the emotional expressions may contain indirect emotion words, e.g., proverbs and idioms (taser ghar [weakly built], grihadaho [family disturbance]), and emoticons.

A large number of emoticons (emo_icon) present in the Bengali blog texts vary according to their emotional categories and slant. Each of the emoticons is treated as an individual emotional expression, and its corresponding intensity is set based on the image denoted by the emoticon. The labeling of the emoticons with Ekman's six emotion classes is verified through the same inter-annotator agreement that is considered for emotional expressions.

The intensifiers (khub [too much/very], anek [huge/large], bhishon [heavy/too much]) associated with the emotional phrases are also acknowledged in annotating sentential intensities. As the intensifiers depend solely on the context, their identification, along with the effects of negations and conjuncts, plays a role in annotating the intensity. Negations (na [no], noy [not]) and conjuncts occur freely in the sentence and change its emotion. For that very reason, a careful analysis of negations and conjuncts is carried out both at intra- and inter-phrase level to obtain the sentential emotions and intensities. An example set of the annotated blog corpus is shown in Figure 1.

[Figure 1. Annotated sample of the corpus; the Bengali example text is not recoverable from the extraction.]

3.2 Agreement of Sentential Emotion and Intensity

Three annotators, identified as A1, A2 and A3, have used an open source graphical tool to carry out the annotation [4]. As Ekman's emotion classes and the intensity types belong to definite categories, the annotation agreement for emotions and intensities is measured using the standard Cohen's kappa coefficient (κ) (Cohen, 1960). The annotation agreement for emoticons is also measured using the kappa metric. Kappa is a statistical measure of inter-rater agreement for qualitative (categorical) items; it measures the agreement between two raters who separately classify items into mutually exclusive categories.

[4] http://gate.ac.uk/gate/doc/releases.html

The agreement in classifying sentential intensities into three classes (high, general and low) is also measured using kappa (κ). The intensities of mixed emotional sentences are considered as well. Agreement results for emotional, non-emotional and mixed sentences and for emoticons, along with results for each emotion class and intensity type, are shown in Table 1. Sentential emotions of the happy, sad or surprise classes produce comparatively higher kappa coefficients than the other emotion classes, as the emotional expressions of these types were explicitly specified in the blog texts. It has been observed that emotion pairs such as "sad-anger" and "anger-disgust" often cause trouble in distinguishing the emotion at sentence level. The mixed emotion category and the general and low intensity types give poor agreement results, as expected. Instead of specifying agreement results of emoticons for each emotion class, the average results for the three annotation sets are shown in Table 1.
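The kappa computation itself is compact. The following is a minimal sketch, assuming the two annotators' sentence-level emotion labels are exported as parallel Python lists; the data layout and the toy labels are hypothetical, since the paper does not specify the annotation tool's output format.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from the two marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example with Ekman tags: A (anger), H (happy), Sa (sad).
a1 = ["H", "Sa", "A", "H", "H", "Sa"]
a2 = ["H", "Sa", "A", "A", "H", "H"]
print(round(cohens_kappa(a1, a2), 2))  # 0.48
```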
3.3 Agreement of Emotional Expressions

The agreement of emotional expressions is carried out between the sets of text spans selected by the two annotators for each of the emotional expressions. As there is no fixed category in this case, we have employed two different strategies instead of kappa (κ) to calculate the agreement between annotators.

Firstly, we chose the measure of agreement on set-valued items (MASI) (Passonneau, 2006), which was used for measuring agreement on coreference annotation (Passonneau, 2004) and in semantic and pragmatic annotation (Passonneau, 2006). MASI is a distance between sets whose value is 1 for identical sets and 0 for disjoint sets. For sets A and B it is defined as MASI = J * M, where the Jaccard metric is

J = |A ∩ B| / |A ∪ B|

and monotonicity (M) is defined as

M = 1 if A = B; 2/3 if A ⊂ B or B ⊂ A; 1/3 if A ∩ B ≠ ∅, A − B ≠ ∅ and B − A ≠ ∅; 0 if A ∩ B = ∅.

Secondly, the annotators annotate the different emotional expressions by identifying the responsible text anchors, and the agreement is measured using the agr metric (Wiebe et al., 2005). If A and B are the sets of anchors annotated by annotators a and b, respectively, agr is a directional measure of agreement that measures what proportion of a was also marked by b. Specifically, we compute the agreement of b to a as

agr(a || b) = |A matching B| / |A|

The agr(a || b) metric corresponds to recall if a is the gold standard and b the system, and to precision if b is the gold standard and a is the system. The results of the two agreement strategies for each emotion class are shown in Table 1. The annotation agreement of emotional expressions produces slightly lower values for both MASI and agr, which points to the fact that the relaxed annotation scheme provided for fixing the boundaries of the expressions causes the disagreements.

Classes (# Sentences or Instances)            A1-A2   A2-A3   A1-A3   Avg.
Emotion / Non-Emotion (5,234/7,087)           0.88    0.83    0.86    0.85
Happy (804)                                   0.79    0.72    0.83    0.78
Sad (826)                                     0.82    0.75    0.72    0.76
Anger (765)                                   0.75    0.71    0.69    0.71
Disgust (766)                                 0.76    0.69    0.77    0.74
Fear (757)                                    0.65    0.61    0.65    0.63
Surprise (822)                                0.84    0.82    0.85    0.83
Mixed (322)                                   0.42    0.21    0.53    0.38
High (2,330)                                  0.66    0.72    0.68    0.68
General (1,765)                               0.42    0.46    0.48    0.45
Low (1,345)                                   0.21    0.34    0.26    0.27
Emoticons w.r.t. six emotion classes (678)    0.85    0.73    0.84    0.80
Emoticons w.r.t. three intensities            0.72    0.66    0.63    0.67
Emotional Expressions (7,588) [MASI]          0.64    0.61    0.66    0.63
Emotional Expressions (7,588) [agr]           0.67    0.63    0.68    0.66

Table 1: Inter-Annotator Agreements for sentence-level Emotions, Intensities, Emoticons and Emotional Expressions
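Both set-based measures follow directly from the definitions above. The sketch below assumes each annotated span is represented as a set of token offsets, which is one reasonable encoding that the paper does not prescribe; the same Jaccard ratio also serves as the IAA measure used for emotion holders in Section 4.1.

```python
def jaccard(a, b):
    """J = |A ∩ B| / |A ∪ B| (also the IAA ratio of Section 4.1)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def monotonicity(a, b):
    if a == b:
        return 1.0
    if a < b or b < a:   # one set strictly contains the other
        return 2 / 3
    if a & b:            # overlap, but neither contains the other
        return 1 / 3
    return 0.0           # disjoint sets

def masi(a, b):
    return jaccard(a, b) * monotonicity(a, b)

def agr(a, b):
    """Directional agreement: proportion of a's anchors also marked by b."""
    return len(a & b) / len(a) if a else 1.0

# Toy spans (token indices) for one emotional expression.
span_a1, span_a2 = {4, 5, 6}, {5, 6}
print(masi(span_a1, span_a2))  # 2/3 (Jaccard) * 2/3 (subset) ≈ 0.44
print(agr(span_a1, span_a2))   # 2/3 of a1's anchors are marked by a2
```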

4 Identifying Emotion Holder

The source or holder of an emotional expression is the speaker, the writer or the experiencer. The main criterion considered for annotating emotion holders is based on the nested-source hypothesis described in Wiebe et al. (2005). The structure of the Bengali blog corpus (as shown in Figure 2) helps in the holder annotation process. Sometimes the comments of one blogger are annotated by other bloggers in the blog posts. Thus the holder annotation task in the user comments sections was less cumbersome than annotating the holders inscribed in the topic section.

[Figure 2. General structure of a blog document; the Bengali field labels are not recoverable from the extraction.]

Prior work in the identification of opinion holders has sometimes identified only a single opinion per sentence (Bethard et al., 2004) and sometimes several (Choi et al., 2005). As the blog corpus has sentence-level emotion annotations, the former category is adopted. But it is observed that long sentences contain more than one emotional expression and hence are associated with multiple emotion holders (EH). All probable emotion holders of a sentence are stored in an anchoring vector successively, according to their order of occurrence.

The annotation of the emotion holder at sentence level requires the knowledge of two basic constraints, implicit and explicit, separately. The explicit constraints qualify the single prominent emotion holder that is directly involved with the emotional expression, whereas the implicit constraints qualify all direct and indirect nested sources as emotion holders. For example, in the following Bengali sentences, the pattern shown in bold face denotes the emotion holder. In the second example, the appositive case (Ramer sukh [Ram's pleasure]) is also identified and placed in the vector by removing the inflectional suffix (-er in this case). Example 2 and Example 3 contain the emotion holders Ram and Nasrin Sultana based on implicit constraints.

Example 1. EH_Vector: <Sayan>
(Sayan) (bhishon) (anondo) (anubhob) (korechilo)
Sayan felt very happy.

Example 2. EH_Vector: <Rashed, Ram>
(Rashed) (anubhob) (korechilo) (je) (Ramer) (sukh) (antohin)
Rashed felt that Ram's pleasure is endless.

Example 3. EH_Vector: <Gedu Chacha, Nasrin Sultana>
(Gedu Chacha) (bole) (ami) (Nasrin Sultanar) (dookher) (kathate) (kende) (feli)
Gedu Chacha says: No my sister, I fall into crying at the sad speech of Nasrin Sultana.

4.1 Agreement of Emotion Holder Annotation

Emotion holders containing multi-word Named Entities (NEs) are assumed to be single emotion holders. As there is no agreement discrepancy in selecting the boundary of the single or multiple emotion holders, we have used the standard metric, Cohen's kappa (κ), for measuring the inter-annotator agreement. Each of the elementary emotion holders in an anchoring vector is treated as a separate emotion holder, and the agreement between two annotators is carried out on each separate entity. It is to be mentioned that the anchoring vectors provided by the two annotators may be disjoint.

To account for this fact, a different technique is employed to measure the annotation agreement. If X is the set of emotion holders selected by the first annotator and Y is the set selected by the second annotator for an emotional sentence containing multiple emotion holders, the inter-annotator agreement IAA for that sentence is the number of emotion holders in the intersection of X and Y divided by the number in their union:

IAA = |X ∩ Y| / |X ∪ Y|

Two types of agreement results per emotion class for annotating emotion holders (EH) are shown in Table 2. Both types of agreement have been found satisfactory, and the difference between the two agreement types is small, indicating the minimal error involved in the annotation process. It is found that the agreement is moderately high in the case of a single emotion holder, but lower in the case of multiple holders. The disagreement occurs mostly in the case of satisfying the implicit constraints, though some issues were resolved by mutual understanding.

Classes [# Sentences, # Holders]   A1-A2          A2-A3          A1-A3          Avg.
Happy [804, 918]                   (0.87) [0.88]  (0.79) [0.81]  (0.76) [0.77]  (0.80) [0.82]
Sad [826, 872]                     (0.82) [0.81]  (0.85) [0.83]  (0.78) [0.80]  (0.81) [0.81]
Anger [765, 780]                   (0.80) [0.79]  (0.75) [0.73]  (0.74) [0.71]  (0.76) [0.74]
Disgust [766, 770]                 (0.70) [0.68]  (0.72) [0.69]  (0.83) [0.84]  (0.75) [0.73]
Fear [757, 764]                    (0.85) [0.82]  (0.78) [0.77]  (0.79) [0.81]  (0.80) [0.80]
Surprise [822, 851]                (0.78) [0.80]  (0.81) [0.79]  (0.85) [0.83]  (0.81) [0.80]

Table 2: Inter-Annotator Agreement for Emotion Holder Annotation; values are kappa (κ) in parentheses with [IAA] in brackets.

5 Topic Annotation and Agreement

The topic is the real-world object, event or abstract entity that is the primary subject of the opinion as intended by the opinion holder (Stoyanov and Cardie, 2008). They mention that topic identification is difficult within the single target span of an opinion, as there are multiple potential topics, each identified with its own topic span, and the topic of an opinion depends on the context in which its associated opinion expression occurs. Hence the actual challenge lies in the identification of the topic spans from the emotional sentences. The writer's emotional intentions in a sentence are reflected in the target span by mentioning one or more topics that are related to the emotional expressions. Topics are generally distributed over different text spans of the writer's text and can be distinguished by capturing the rhetorical structure of the text.

In blog texts, it is observed that an emotion topic can occur in the nucleus as well as in the satellite. Thus the whole sentence is assumed to be the scope for the potential emotion topics. The text spans containing the emotional expression and the emotion holder can also be candidate seeds of the target span. In Example 3 of Section 4, the target span ('sad speech of Nasrin Sultana') contains the emotion holder ('Nasrin Sultana') as well as the emotional expression ('sad speech'). For that reason, the annotators are instructed to consider the whole sentence as their target span and to identify one or more topics related to the emotional expressions in that sentence.

As the topics are multi-word components or strings of words, the scope of the individual topics inside a target span is hard to identify. To accomplish the goal, we have not used the standard metric, Cohen's kappa (κ); we instead employed the MASI and agr metrics (as mentioned in Section 3) for measuring the agreement of the topic span annotation. The emotional sentences containing a single emotion topic have shown less disagreement than the sentences that contain multiple topics. It is observed that the agreement for annotating the target span is high (κ of 0.9), which means that the annotation is almost satisfactory, but disagreement occurs in annotating the boundaries of the topic spans. The inter-annotator agreement for each emotion class is shown in Table 3. The selection of the emotion topic from the other relevant topics causes the disagreement.

Classes [# Sentences, # Topics]    A1-A2          A2-A3          A1-A3          Avg.
Happy [804, 848]                   (0.83) [0.85]  (0.81) [0.83]  (0.79) [0.82]  (0.81) [0.83]
Sad [826, 862]                     (0.84) [0.86]  (0.77) [0.79]  (0.81) [0.83]  (0.80) [0.82]
Anger [765, 723]                   (0.80) [0.78]  (0.81) [0.78]  (0.86) [0.84]  (0.82) [0.80]
Disgust [766, 750]                 (0.77) [0.76]  (0.78) [0.74]  (0.72) [0.70]  (0.75) [0.73]
Fear [757, 784]                    (0.78) [0.79]  (0.77) [0.80]  (0.79) [0.81]  (0.78) [0.80]
Surprise [822, 810]                (0.90) [0.86]  (0.85) [0.82]  (0.82) [0.80]  (0.85) [0.82]

Table 3: Inter-Annotator Agreement for Topic Annotation; values are MASI in parentheses with [agr] in brackets.

6 Experiments on Emotion Classification

A preliminary experiment (Das and Bandyopadhyay, 2009b) was carried out on a small set of 1,200 sentences of the annotated blog corpus using a Conditional Random Field (CRF) (McCallum et al., 2001). We have employed the same corpus and similar features (e.g., POS, punctuation symbols, sentiment words etc.) for classifying the emotion words using a Support Vector Machine (SVM) (Joachims, 1999). The results on 200 test sentences are shown in Table 4. The results of automatic emotion classification at word level show that SVM outperforms CRF significantly. It is observed that both classifiers fail to identify the emotion words that are enriched by morphological inflections. Although SVM outperforms CRF, both CRF and SVM suffer from sequence labeling and label bias problems with the other, non-emotional words of a sentence. (For error analysis and detailed experiments, see Das and Bandyopadhyay, 2009b.)

Emotion Classes (# Words)   CRF     SVM
Happy (106)                 67.67   80.55
Sad (143)                   63.12   78.34
Anger (70)                  51.00   66.15
Disgust (65)                49.75   53.35
Fear (37)                   52.46   64.78
Surprise (204)              68.23   79.37

Table 4: Word-level Emotion tagging Accuracies (in %) using CRF and SVM

Another experiment (Das and Bandyopadhyay, 2009a) was carried out on a small set of 1,300 sentences of the annotated blog corpus, assigning any of Ekman's (1993) six basic emotion tags to the Bengali blog sentences. A Conditional Random Field (CRF) based word-level emotion classifier classifies the emotion words not only into emotion or non-emotion classes but also into Ekman's six emotion classes. Corpus-based and sense-based tag weights, calculated for each of the six emotion tags, are used to identify the sentence-level emotion tag. Sentence-level accuracies for each emotion class were also satisfactory.

Knowledge resources can be leveraged in identifying emotion-related words in text, but the lexical coverage of these resources may be limited given the informal nature of online discourse (Aman and Szpakowicz, 2007). The identification of direct emotion words incorporates the lexicon lookup approach. The recently developed Bengali WordNet Affect lists (Das and Bandyopadhyay, 2010) have been used in determining the directly stated emotion words. But the affect lists cover only 52.79% of the directly stated emotion words. This points not only to the problem of morphological enrichment but also to the problem of identifying emoticons, proverbs, idioms and colloquial or foreign words. In our experiments, the cases of typographical errors and orthographic features that express or emphasize emotion in text are not considered.

7 Conclusion

The present task addresses the issues of identifying emotional expressions in Bengali blog texts along with the annotation of sentences with emotional components such as intensity, holders and topics. Nested sources are considered for annotating the emotion holder information. The major contribution of the task is the identification and fixing of the text spans denoting emotional expressions and multiple topics in a sentence. Although the preliminary experiments carried out on small sets of the corpus show satisfactory performance, the future task is to adopt a corpus-driven approach for building a lexicon of emotion words and phrases and to extend the emotion analysis tasks in Bengali.

References

Alm, Cecilia Ovesdotter, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. HLT/EMNLP, pp. 579-586.

Aman, Saima and Stan Szpakowicz. 2007. Identifying Expressions of Emotion in Text. In V. Matoušek and P. Mautner (Eds.): TSD 2007, LNAI, vol. 4629, pp. 196-205.

Azar, M. 1999. Argumentative Text as Rhetorical Structure: An Application of Rhetorical Structure Theory. Argumentation, vol. 13, pp. 97–114.

Bethard, Steven, H. Yu, A. Thornton, V. Hatzivassiloglou, and D. Jurafsky. 2004. Automatic Extraction of Opinion Propositions and their Holders. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications.

Carletta, Jean. 1996. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, vol. 22(2), pp. 249-254.

Choi, Y., C. Cardie, E. Riloff, and S. Patwardhan. 2005. Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns. HLT/EMNLP.

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, vol. 20, pp. 37–46.

Das, Dipankar and Sivaji Bandyopadhyay. 2009a. Word to Sentence Level Emotion Tagging for Bengali Blogs. ACL-IJCNLP 2009, pp. 149-152, Suntec, Singapore.

Das, Dipankar and Sivaji Bandyopadhyay. 2009b. Emotion Tagging – A Comparative Study on Bengali and English Blogs. 7th International Conference on Natural Language Processing (ICON-09), pp. 177-184, India.

Das, Dipankar and Sivaji Bandyopadhyay. 2010. Developing Bengali WordNet Affect for Analyzing Emotion. ICCPOL/SEKE-2010, USA.

Ekman, P. 1992. An Argument for Basic Emotions. Cognition and Emotion, vol. 6, pp. 169–200.

Joachims, Thorsten. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. European Conference on Machine Learning (ECML), pp. 137-142.

Lin, K. H.-Y., C. Yang and H.-H. Chen. 2007. What Emotions Do News Articles Trigger in Their Readers? Proceedings of SIGIR, pp. 733-734.

Mann, W. C. and S. A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. TEXT 8, pp. 243–281.

McCallum, Andrew, Fernando Pereira and John Lafferty. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, pp. 282–289.

Mihalcea, Rada and Hugo Liu. 2006. A corpus-based approach to finding happiness. AAAI, pp. 139-144.

Mishne, Gilad. 2005. Experiments with mood classification in blog posts. SIGIR'05, pp. 15-19.

Neviarouskaya, Alena, Helmut Prendinger, and Mitsuru Ishizuka. 2007. Textual Affect Sensing for Social and Expressive Online Communication. 2nd International Conference on Affective Computing and Intelligent Interaction, pp. 218-229.

Passonneau, R. 2004. Computing reliability for coreference annotation. LREC, Lisbon.

Passonneau, R. J. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. LREC.

Quan, Changqin and Fuji Ren. 2009. Construction of a Blog Emotion Corpus for Chinese Emotional Expression Analysis. EMNLP, pp. 1446-1454, Singapore.

Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik. 1985. A Comprehensive Grammar of the English Language. Longman, New York.

Stoyanov, V. and C. Cardie. 2008. Annotating topics of opinions. LREC.

Tokuhisa, Ryoko, Kentaro Inui, and Yuji Matsumoto. 2008. Emotion Classification Using Massive Examples Extracted from the Web. COLING 2008, pp. 881-888.

Wiebe, Janyce, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, vol. 39, pp. 164–210.

Yang, C., K. H.-Y. Lin, and H.-H. Chen. 2007. Building Emotion Lexicon from Weblog Corpora. ACL, pp. 133-136.

SentiWordNet for Indian Languages

Amitava Das and Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University
[email protected], [email protected]

Abstract

The discipline in which sentiment, opinion or emotion is identified and classified in human-written text is well known as sentiment analysis. A typical computational approach to sentiment analysis starts with prior polarity lexicons, where entries are tagged with their prior, out-of-context polarity as human beings perceive it using their cognitive knowledge. To date, almost all research efforts found in the sentiment lexicon literature deal with English texts. In this article, we propose multiple computational techniques, such as WordNet based, dictionary based, corpus based or generative approaches, for generating SentiWordNet(s) for Indian languages. Currently, SentiWordNet(s) are being developed for three Indian languages: Bengali, Hindi and Telugu. An online intuitive game has been developed to create and validate the developed SentiWordNet(s) by involving the Internet population. A number of automatic, semi-automatic and manual validation and evaluation methodologies have been adopted to measure the coverage and credibility of the developed SentiWordNet(s).

1 Introduction

Sentiment analysis and classification from electronic text is a hard semantic disambiguation problem. The regulating aspects of the semantic orientation of a text are natural language context information (Pang et al., 2002), language properties (Wiebe and Mihalcea, 2006), domain pragmatic knowledge (Aue and Gamon, 2005) and, most challenging of all, the time dimension (Read, 2005).

The following example shows that the polarity tag associated with a sentiment word depends on the time dimension. During the 90's, mobile phone users generally reported in various online reviews about their color phones, but in recent times a color phone is no longer enough: people are fascinated and influenced by touch screens and the possibility of installing various software on these new-generation gadgets.

In typical computational approaches (Higashinaka et al., 2007; Hatzivassiloglou et al., 2000) to sentiment analysis, researchers consider the problem of learning a dictionary that maps semantic representations to verbalizations, where the data comes from opinionated electronic text. Although the lexicons in these dictionaries are not explicitly marked up with respect to their contextual semantics, they contain explicit polarity ratings and aspect indicators. Lexicon-based approaches can be broadly classified into two categories: firstly, those where the discriminative polarity tag of the lexicons is determined on labeled training data, and secondly, those where the lexicons are manually compiled; the latter constitutes the main effective approach.

It is undoubted that manual compilation is always the best way to create monolingual semantic lexicons, but manual methods are expensive in terms of human resources: they involve a substantial number of human annotators and take a lot of time as well. In this paper we propose several computational techniques to generate sentiment lexicons in Indian languages automatically and semi-automatically.


In the present task, SentiWordNet(s) are being developed for the Bengali, Hindi and Telugu languages.

Several prior-polarity sentiment lexicons are available for English, such as SentiWordNet (Esuli et al., 2006), the Subjectivity Word List (Wilson et al., 2005), the WordNet Affect list (Strapparava et al., 2004) and Taboada's adjective list (Taboada et al., 2006). Among these publicly available sentiment lexicon resources we find that SentiWordNet is the most widely used (its number of citations is higher than that of the other resources [1]) in applications such as sentiment analysis, opinion mining and emotion analysis. The Subjectivity Word List is the most trustable, as the opinion mining system OpinionFinder [2], which uses the subjectivity word list, has reported the highest score for opinion/sentiment subjectivity (Wiebe and Riloff, 2006). SentiWordNet is an automatically constructed lexical resource for English that assigns a positivity score and a negativity score to each WordNet synset. The subjectivity word list is compiled from manually developed resources augmented with entries learned from corpora; its entries have been labeled with part-of-speech (POS) tags as well as either a strong or a weak subjective tag depending on the reliability of the subjective nature of the entry. These two resources have been merged automatically, and the merged resource is used for SentiWordNet(s) generation in the present task.

[1] http://citeseerx.ist.psu.edu/
[2] http://www.cs.pitt.edu/mpqa/

The generated sentiment lexicons or SentiWordNet(s) for the Indian languages mostly contain synsets (approximately 60%) of the respective languages. A synset-based method is robust for any kind of monolingual lexicon creation and is useful to avoid further word sense disambiguation problems in the application domain. Additionally, we have developed an online intuitive game to create and validate the developed SentiWordNet(s) by involving the Internet population.

The proposed approaches in this paper are easy to adopt for any new language. To measure the coverage and credibility of the generated SentiWordNet(s) in Indian languages we have developed several automatic and semi-automatic evaluation methods.

2 Related Works

Various methods have been used in the literature, such as WordNet based, dictionary based, corpus based or generative approaches, for sentiment lexicon generation in a new target language.

Andreevskaia and Bergler (2006) present a method for extracting sentiment-bearing adjectives from WordNet using the Sentiment Tag Extraction Program (STEP). They performed 58 STEP runs on unique non-intersecting seed lists drawn from a manually annotated list of positive and negative adjectives and evaluated the results against other manually annotated lists.

The proposed methods in (Wiebe and Riloff, 2006) automatically generate resources for subjectivity analysis for a new target language from the available resources for English. Two techniques have been proposed for the generation of a target-language lexicon from the English subjectivity lexicon: the first uses a bilingual dictionary, while the second is a parallel-corpus-based approach using existing subjectivity analysis tools for English.

Automatically or manually created lexicons may have limited coverage and do not include most semantically contrasting word pairs, i.e., antonyms. Antonyms are broadly categorized (Mohammad, 2008) into gradable adjectives (hot–cold, good–bad, friend–enemy) and productive adjectives (normal–abnormal, fortune–misfortune, implicit–explicit). The first type contains semantically contrasting word pairs, while the second type includes an orthographic suffix/affix as a clue and is highly productive using a very small number of affixation rules.

The degree of antonymy (Mohammad et al., 2008) is defined to encompass the complete semantic range as a combined measure of the contrast in meaning conveyed by two antonymous words and is identified via the distributional hypothesis. It helps to measure the relative sentiment score of a word and its antonym.

Kumaran et al. (2008) introduced an elegant method for automatic data creation through online intuitive games.

A methodology has been proposed for the community creation of linguistic data through a community collaborative framework known as wikiBABEL [3]. It may be described as a revolutionary approach to automatically create large amounts of credible linguistic data by involving the Internet population in content creation.

[3] http://research.microsoft.com/en-us/projects/wikibabel/

For the present task we prefer to involve all of the available methodologies to automatically and semi-automatically create and validate SentiWordNet(s) for the three Indian languages. Automatic methods involve only computational steps; semi-automatic methods involve human intervention to validate the system's output.

3 Source Lexicon Acquisition

SentiWordNet and the Subjectivity Word List have been identified as the most reliable source lexicons: the first is widely used and the second is robust in terms of performance. A merged sentiment lexicon has been developed from both resources by removing the duplicates. It has been observed that 64% of the single-word entries are common to the Subjectivity Word List and SentiWordNet. The new merged sentiment lexicon consists of 14,135 tokens. Several filtering techniques have been applied to generate the unambiguous new list.

A subset of 8,427 sentiment words has been extracted from the English SentiWordNet by selecting those whose orientation strength is above the heuristically identified threshold value of 0.4. The words whose orientation strength is below 0.4 are ambiguous and may lose their subjectivity in the target language after translation. A total of 2,652 weakly subjective words are discarded from the Subjectivity Word List, as proposed in (Wiebe and Riloff, 2006).

In the next stage, the words whose POS category in the Subjectivity Word List is undefined and tagged as "anypos" are considered, since these words may generate sense ambiguity issues in the later stages of subjectivity detection. These words are checked against the SentiWordNet list for validation: if a match is found with a certain POS category, the word is added to the new merged sentiment lexicon; otherwise the word is discarded to avoid ambiguities later.

Some words in the Subjectivity Word List are inflected, e.g., "memories". These words would be stemmed during the translation process, but some words lose their subjectivity after stemming ("memory" has no subjectivity property). A word may occur in the subjectivity list in many inflected forms. Individual clusters for the words sharing the same root form are created and then checked in the SentiWordNet for validation. If the root word exists in the SentiWordNet, it is assumed that the word remains subjective after stemming and hence is added to the new list; otherwise the cluster is completely discarded to avoid any further ambiguities.

Various statistics of the English SentiWordNet and the Subjectivity Word List are reported in Table 1.

                            SentiWordNet         Subjectivity Word List
                            Single     Multi     Single     Multi
Entries                     115424     79091     5866       990
Unambiguous Words           20789      30000     4745       963
Discarded Ambiguous Words   86944      30000     2652       928

Table 1: English SentiWordNet and Subjectivity Word List Statistics. Ambiguity is decided by the orientation strength threshold for SentiWordNet and by the subjectivity strength and POS tags for the Subjectivity Word List.
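A condensed sketch of the filtering and merging steps follows, assuming simple dictionary layouts for the two resources; the placeholder prior score given to entries arriving only from the Subjectivity Word List is an assumption, as the paper does not state what score such entries receive.

```python
def merge_lexicons(swn, subj, threshold=0.4):
    """swn: {(word, pos): (pos_score, neg_score)}
    subj: {(word, pos): 'strong' | 'weak'}, where pos may be 'anypos'."""
    # Keep only SentiWordNet entries whose orientation strength
    # (the larger of the two scores) reaches the 0.4 threshold.
    merged = {k: v for k, v in swn.items() if max(v) >= threshold}
    validated_words = {w for (w, _) in merged}
    for (word, pos), strength in subj.items():
        if strength == "weak":
            continue  # weakly subjective entries are discarded
        if pos == "anypos" and word not in validated_words:
            continue  # 'anypos' words must be validated against SentiWordNet
        # Placeholder prior score for list-only entries (assumption).
        merged.setdefault((word, pos), (0.5, 0.0))
    return merged

swn = {("good", "a"): (0.75, 0.0), ("gray", "a"): (0.1, 0.1)}
subj = {("excellent", "a"): "strong", ("somewhat", "anypos"): "weak"}
print(sorted(merge_lexicons(swn, subj)))
# [('excellent', 'a'), ('good', 'a')]  -- 'gray' and 'somewhat' filtered out
```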

4 Target Lexicon Generation

4.1 Bilingual Dictionary Based Approach

A word-level translation process followed by an error reduction technique has been adopted for generating the Indian language SentiWordNet(s) from the English sentiment lexicon merged from the English SentiWordNet and the Subjectivity Word List.

English-to-Indian-language synsets are being developed under the project English to Indian Languages Machine Translation Systems (EILMT) [4], a consortium project funded by the Department of Information Technology (DIT), Government of India. These synsets are robust and reliable, as they are created by native speakers as well as linguistic experts of the specific languages. For each language we have approximately 9,966 synsets along with the English WordNet offsets. These bilingual synset dictionaries have been used along with language-specific dictionaries. A word-level synset/lexical transfer technique is applied to each English synset/word in the merged sentiment lexicon; each dictionary search produces a set of Indian-language synsets/words for a particular English synset/word.

4.1.1 Hindi

Two available manually compiled English-Hindi electronic dictionaries have been identified for the present task: SHABDKOSH [5] and Shabdanjali [6]. These two dictionaries have been merged automatically by removing the duplicates. The merged English-Hindi dictionary contains approximately 90,872 unique entries. The positive and negative sentiment scores for the Hindi words are copied from their English SentiWordNet equivalents. The bilingual dictionary based translation process has resulted in 22,708 Hindi entries.

4.1.2 Bengali

An English-Bengali dictionary (approximately 102,119 entries) has been developed using the Samsad Bengali-English dictionary [7]. The positive and negative sentiment scores for the Bengali words are copied from their English SentiWordNet equivalents. The bilingual dictionary based translation process has resulted in 35,805 Bengali entries. A manual check is done to verify the reliability of the words generated by the automatic process; after manual checking only 1,688 words are discarded, i.e., the final list consists of 34,117 words.

4.2 Telugu

The Charles Philip Brown English-Telugu Dictionary [8], the Aksharamala [9] English-Telugu Dictionary and the English-Telugu Dictionary [10] developed by the Language Technology Research Center (LTRC), International Institute of Information Technology, Hyderabad (IIIT-H), have been chosen for the present task. There is no WordNet publicly available for Telugu, and the corpus (Section 4.5) we used is small in size, so the dictionary based approach is the main process for Telugu SentiWordNet generation. These three dictionaries have been merged automatically by removing the duplicates. The merged English-Telugu dictionary contains approximately 112,310 unique entries. The positive and negative sentiment scores for the Telugu words are copied from their English SentiWordNet equivalents. The dictionary based translation process has resulted in 30,889 Telugu entries, about 88% of the final Telugu SentiWordNet synsets. An online intuitive game, described in Section 4.6, has been proposed to validate the developed Telugu SentiWordNet by involving the Internet population.

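The word-level transfer step amounts to copying the English prior scores onto every dictionary translation of an entry. A minimal sketch follows, assuming the merged bilingual dictionary maps an English word to a list of target-language words; the real resources are synset-aligned and richer, and the toy Hindi words are illustrative only.

```python
def project_scores(english_lexicon, bilingual_dict):
    """english_lexicon: {english_word: (pos_score, neg_score)}
    bilingual_dict: {english_word: [target_word, ...]} (merged dictionaries)."""
    target_lexicon = {}
    for en_word, scores in english_lexicon.items():
        for target_word in bilingual_dict.get(en_word, []):
            # Each target translation inherits the English prior scores.
            target_lexicon.setdefault(target_word, scores)
    return target_lexicon

hindi = project_scores({"good": (0.75, 0.0)}, {"good": ["accha", "badhiya"]})
print(hindi)  # {'accha': (0.75, 0.0), 'badhiya': (0.75, 0.0)}
```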
[4] http://www.cdacmumbai.in/e-ilmt
[5] http://www.shabdkosh.com/
[6] http://www.shabdkosh.com/content/category/downloads/
[7] http://dsal.uchicago.edu/dictionaries/biswas_bengali/
[8] http://dsal.uchicago.edu/dictionaries/brown/
[9] https://groups.google.com/group/aksharamala
[10] http://ltrc.iiit.ac.in/onlineServices/Dictionaries/Dict_Frame.html
[11] http://www.cfilt.iitb.ac.in/wordnet/webhwn/
[12] http://bn.asianwordnet.org/

4.3 WordNet Based Approach

WordNet(s) are available for Hindi [11] (Jha et al., 2001) and Bengali [12] (Robkop et al., 2010) but are publicly unavailable for Telugu. A WordNet based lexicon expansion strategy has been adopted to increase the coverage of the SentiWordNet(s) generated through the dictionary based approach. The present algorithm starts with the English SentiWordNet synsets, which are expanded using the synonymy and antonymy relations in the WordNet. For matching synsets we keep the exact score as in the source synset of the English SentiWordNet.
The calculated positivity and negativity scores for any target-language antonym synset are computed as

Tp = 1 - Sp
Tn = 1 - Sn

where Sp and Sn are the positivity and negativity scores for the source language (i.e., English), and Tp and Tn are the positivity and negativity scores for the target languages (i.e., Hindi and Bengali), respectively.

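A sketch of this expansion rule is given below, with the target-language WordNet access abstracted behind two caller-supplied functions, since the Hindi and Bengali WordNet APIs differ; synonyms inherit the source scores unchanged, while antonyms receive the flipped scores defined above. The toy words and relations are hypothetical.

```python
def expand_with_wordnet(seed, synonyms, antonyms):
    """seed: {word: (Sp, Sn)}; synonyms/antonyms: word -> iterable of words.
    The two relation functions stand in for the WordNet lookups."""
    expanded = dict(seed)
    for word, (sp, sn) in seed.items():
        for syn in synonyms(word):
            expanded.setdefault(syn, (sp, sn))          # matching synset: same score
        for ant in antonyms(word):
            expanded.setdefault(ant, (1 - sp, 1 - sn))  # Tp = 1 - Sp, Tn = 1 - Sn
    return expanded

out = expand_with_wordnet({"sundar": (0.8, 0.1)},
                          synonyms=lambda w: ["shobha"] if w == "sundar" else [],
                          antonyms=lambda w: ["kutsit"] if w == "sundar" else [])
print(out["kutsit"])  # approximately (0.2, 0.9): the flipped scores
```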
4.3.1 Hindi

The Hindi WordNet is a well-structured and manually compiled resource and has been updated over the last nine years. An API [13] is available for accessing the Hindi WordNet. Almost 60% of the final SentiWordNet synsets in Hindi are generated by this method.

[13] http://www.cfilt.iitb.ac.in/wordnet/webhwn/API_downloaderInfo.php

4.3.2 Bengali

The Bengali WordNet is being developed by the Asian WordNet (AWN) community. It contains only 1,775 noun synsets, as reported in (Robkop et al., 2010), and a web service [14] is provided for accessing it. As there are only a small number of noun synsets in the Bengali WordNet, and the other POS categories important for a sentiment lexicon, such as adjective, adverb and verb, are absent, only 5% of the new lexicon entries have been generated in this process.

[14] http://bn.asianwordnet.org/services

4.4 Antonym Generation

Automatically or manually created lexicons have limited coverage and do not include most semantically contrasting word pairs. To overcome this limitation and increase the coverage of the SentiWordNet(s), we present an automatic antonym generation technique followed by corpus validation to check that an orthographically generated antonym really exists. Only 16 hand-crafted rules have been used, as reported in Table 2. About 8% of the Bengali, 7% of the Hindi and 11% of the Telugu SentiWordNet entries are generated in this process.

Affix/Suffix         Word         Antonym
ab X                 Normal       Ab-normal
mis X                Fortune      Mis-fortune
im X - ex X          Im-plicit    Ex-plicit
anti X               Clockwise    Anti-clockwise
non X                Aligned      Non-aligned
in X - ex X          In-trovert   Ex-trovert
dis X                Interest     Dis-interest
un X                 Biased       Un-biased
up X - down X        Up-hill      Down-hill
im X                 Possible     Im-possible
il X                 Legal        Il-legal
over X - under X     Over-done    Under-done
in X                 Consistent   In-consistent
X - ir X             Regular      Ir-regular
X-less - X-ful       Harm-less    Harm-ful
mal X                Function     Mal-function

Table 2: Rules for Generating Productive Antonyms
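A few of the 16 rules in Table 2 are rendered below as a sketch; the rule encoding and the toy vocabulary are illustrative, and the corpus validation step is reduced to a membership test against a vocabulary set.

```python
# A subset of the prefix rules from Table 2; each maps a base word to an
# orthographic antonym candidate, which must then be validated in a corpus.
PREFIX_RULES = ["ab", "mis", "anti", "non", "dis", "un", "im", "il", "in", "mal"]

def candidate_antonyms(word):
    cands = [p + word for p in PREFIX_RULES]
    if word.endswith("less"):
        cands.append(word[:-4] + "ful")   # X-less <-> X-ful
    elif word.endswith("ful"):
        cands.append(word[:-3] + "less")
    return cands

def generate_antonyms(word, corpus_vocabulary):
    # Corpus validation: keep only candidates that actually occur.
    return [c for c in candidate_antonyms(word) if c in corpus_vocabulary]

print(generate_antonyms("normal", {"abnormal", "subnormal"}))  # ['abnormal']
print(generate_antonyms("harmless", {"harmful"}))              # ['harmful']
```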
4.5 Corpus Based Approach

Language- or culture-specific words, such as those listed below, should be captured in the developed SentiWordNet(s), but sentiment lexicon generation techniques via cross-lingual projection are unable to capture them. For example:

सहेरा (Sahera: a marriage-wear)
(Durgapujo: a festival of Bengal)

To increase the coverage of the developed SentiWordNet(s) and to capture the language/culture specific words, an automatic corpus based approach has been proposed. At this stage, the developed SentiWordNet(s) for the three Indian languages are used as seed lists. A language-specific corpus is automatically tagged with these seed words using a simple tagset: SWP (Sentiment Word Positive) and SWN (Sentiment Word Negative). Although we have both positivity and negativity scores for the words in the seed list, we prefer a word-level tag of either positive or negative, following the higher of the two sentiment scores.

A Conditional Random Field (CRF) [15] based machine learning model is then trained on the seed-tagged corpus along with multiple linguistic features such as morpheme, part-of-speech and chunk label. These linguistic features have been extracted by the shallow parsers [16] for Indian languages. An n-gram (n = 4) sequence labeling model has been used for the present task.

The monolingual corpora used have been developed under the project English to Indian Languages Machine Translation Systems (EILMT). Each corpus has approximately 10K sentences.

[15] http://crfpp.sourceforge.net
[16] http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
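A sketch of the sequence-labeling step is shown below, using sklearn-crfsuite as a stand-in for the CRF++ toolkit referenced in footnote [15]; the feature template and the seed-tagged toy sentence are hypothetical, and the ±2-token window only approximates the n-gram (n = 4) configuration reported above.

```python
import sklearn_crfsuite  # stand-in for the CRF++ toolkit cited in the paper

def token_features(sent, i):
    word, pos, chunk = sent[i]
    feats = {"word": word, "pos": pos, "chunk": chunk, "suffix2": word[-2:]}
    # Context window approximating the reported n-gram model.
    for d in (-2, -1, 1, 2):
        j = i + d
        if 0 <= j < len(sent):
            feats[f"word{d:+d}"] = sent[j][0]
    return feats

# One hypothetical seed-tagged sentence: (word, POS, chunk) triples with
# SWP / SWN / O labels projected from the SentiWordNet seed list.
sent = [("boi", "NN", "B-NP"), ("khub", "RB", "B-ADVP"), ("bhalo", "JJ", "B-ADJP")]
tags = ["O", "O", "SWP"]

X = [[token_features(sent, i) for i in range(len(sent))]]
y = [tags]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # [['O', 'O', 'SWP']]
```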
4.6 Gaming Methodology neg: 0.0), negative (pos: 0.0, 0.25), extreme There are several motivations behind develop- negative (pos: 0.0, neg: 0.5). ing an intuitive game to automatically create The score of a particular player is calculated multilingual SentiWordNet(s). The assigned on the basis of pre-stored sentiment lexicon polarity scores to each synset may vary in time scores in the generated SentiWordNet(s). dimension. Language specific polarity scores may vary and it should be authenticated by 5 Evaluation numbers of language specific annotators. Andera Esuli and Fabrizio Sebastiani (2006) In the history of Information Retrieval re- 17 have calculated the reliability of the sentiment search there is a milestone when ESP game scores attached to every synsets in the English (Ahn et al., 2004) innovate the concept of a SentiWordNet. They have tagged sentiment game to automatically label images available words in the English WordNet with positive in World Wide Web. Highly motivated by the and negative sentiment scores. In the present historical research we proposed a intuitive task, these sentiment scores from English game to create and validate SentiWordNet(s) WordNet have been directly copied to the In- for Indian languages by involving internet dian language SentiWordNet(s). population. Two extrinsic evaluation strategies have been adopted for the developed Bengali Sen- tiWordNet based on the two main usages of the sentiment lexicon as subjectivity classifier and polarity identifier. The Hindi and Telugu SentiWordNet(s) have not completely been evaluated. 5.1 Coverage NEWS BLOG Total number of documents 100 -
5 Evaluation

Andrea Esuli and Fabrizio Sebastiani (2006) have calculated the reliability of the sentiment scores attached to every synset in the English SentiWordNet; they tagged the sentiment words in the English WordNet with positive and negative sentiment scores. In the present task, these sentiment scores have been directly copied from the English resource to the Indian language SentiWordNet(s).

Two extrinsic evaluation strategies have been adopted for the developed Bengali SentiWordNet, based on the two main usages of a sentiment lexicon: as a subjectivity classifier and as a polarity identifier. The Hindi and Telugu SentiWordNet(s) have not been completely evaluated.

5.1 Coverage

                                           NEWS     BLOG
Total number of documents                  100      -
Total number of sentences                  2234     300
Average number of sentences per document   22       -
Total number of wordforms                  28807    4675
Average number of wordforms per document   288      -
Total number of distinct wordforms         17176    1235

Table 3: Bengali Corpus Statistics

We experimented with the NEWS and BLOG corpora for subjectivity detection. Sentiment lexicons are generally domain independent, but they provide a good baseline while working with sentiment analysis systems.
The coverage of the developed Bengali SentiWordNet is evaluated by using it in a subjectivity classifier (Das and Bandyopadhyay, 2009). The statistics of the NEWS and BLOG corpora are reported in Table 3.

For comparison with the coverage of the English SentiWordNet, the same subjectivity classifier (Das and Bandyopadhyay, 2009) has been applied to the Multi Perspective Question Answering (MPQA) corpus (NEWS) and the IMDB movie review corpus along with the English SentiWordNet. The results of the subjectivity classifier on both corpora show that the coverage of the Bengali SentiWordNet is reasonably good. The subjectivity word list used in the subjectivity classifier was developed from the IMDB corpus, and hence the experiments on the IMDB corpus yielded high precision and recall scores. The developed Bengali SentiWordNet is domain independent, and still its coverage is very good, as shown in Table 4.

Languages   Domain   Precision   Recall
English     MPQA     76.08%      83.33%
English     IMDB     79.90%      86.55%
Bengali     NEWS     72.16%      76.00%
Bengali     BLOG     74.6%       80.4%

Table 4: Subjectivity Classifier using SentiWordNet

5.2 Polarity Scores

This evaluation measures the reliability of the polarity scores associated with the sentiment lexicons. To measure the reliability of the polarity scores in the developed Bengali SentiWordNet, a polarity classifier (Das and Bandyopadhyay, 2010) has been developed using the Bengali SentiWordNet along with some other linguistic features.

Features        Overall Performance Incremented By
SentiWordNet    47.60%

Table 5: Polarity Performance Using Bengali SentiWordNet

A feature ablation study shows that the polarity scores in the developed Bengali SentiWordNet are reliable. Table 5 shows the contribution of the Bengali SentiWordNet to a polarity classifier, and the polarity-wise overall performance of the classifier is reported in Table 6.

Polarity   Precision   Recall
Positive   56.59%      52.89%
Negative   75.57%      65.87%

Table 6: Polarity-wise Performance Using Bengali SentiWordNet

A comparative study with a polarity classifier that works with only a prior polarity lexicon would be desirable, but no such works have been identified in the literature.

An arbitrary set of 100 words has been chosen from the Hindi SentiWordNet for human evaluation. Two persons were asked to check them manually, and the result is reported in Table 7. The coverage of the Hindi SentiWordNet has not been evaluated, as no manually annotated sentiment corpus is available.

Polarity     Positive   Negative
Percentage   88.0%      91.0%

Table 7: Evaluation of the Polarity Scores of the Developed Hindi SentiWordNet

For Telugu we created a version of the game with Telugu words on screen. Only 3 users have played the Telugu-specific game to date. In total 92 arbitrary words have been tagged, and the accuracy of the polarity scores is reported in Table 8. The coverage of the Telugu SentiWordNet has not been evaluated, as no manually annotated sentiment corpus is available.

Polarity     Positive   Negative
Percentage   82.0%      78.0%

Table 8: Evaluation of the Polarity Scores of the Developed Telugu SentiWordNet
6 Conclusion

SentiWordNet(s) for Indian languages are being developed using various approaches. The game based technique may open a new way for the creation of linguistic data, not just for SentiWordNet(s) but in other areas of NLP too. Presently only the Bengali SentiWordNet [19] is downloadable from the authors' web page.

[19] http://www.amitavadas.com/sentiwordnet.php
References

Andreevskaia, Alina and Sabine Bergler. 2007. CLaC and CLaC-NB: Knowledge-based and corpus-based approaches to sentiment tagging. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 117–120, Prague, June 2007.

Aue, A. and M. Gamon. 2005. Customizing sentiment classifiers to new domains: A case study. In Proceedings of Recent Advances in Natural Language Processing (RANLP), 2005.

Das, A. and S. Bandyopadhyay. 2009. Subjectivity Detection in English and Bengali: A CRF-based Approach. In Proceedings of ICON 2009.

Das, A. and S. Bandyopadhyay. 2010. Phrase-level Polarity Identification for Bengali. International Journal of Computational Linguistics and Applications (IJCLA), Vol. 1, No. 1-2, Jan-Dec 2010, ISSN 0976-0962, pages 169-182.

Esuli, Andrea and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of Language Resources and Evaluation (LREC), 2006.

Hatzivassiloglou, Vasileios and Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of COLING-00, 18th International Conference on Computational Linguistics, Saarbrücken, pages 299-305.

Higashinaka, Ryuichiro, Marilyn Walker, and Rashmi Prasad. 2007. Learning to generate naturalistic utterances using reviews in spoken dialogue systems. ACM Transactions on Speech and Language Processing (TSLP), 2007.

Jha, S., D. Narayan, P. Pande and P. Bhattacharyya. 2001. A WordNet for Hindi. International Workshop on Lexical Resources in Natural Language Processing, Hyderabad, India, January 2001.

Kumaran, A., K. Saravanan and Sandor Maurice. 2008. WikiBABEL: Community Creation of Multilingual Data. In the WikiSym 2008 Conference, Porto, Portugal, ACM, September 2008.

Mihalcea, Rada, Carmen Banea and Janyce Wiebe. 2007. Learning multilingual subjective language via cross-lingual projections. In Proceedings of the Association for Computational Linguistics (ACL), pages 976–983, Prague, Czech Republic, June 2007.

Mohammad, Saif, Bonnie Dorr, and Graeme Hirst. 2008. Computing Word-Pair Antonymy. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-2008), October 2008, Waikiki, Hawaii.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

Read, Jonathon. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, 2005.

Robkop, Kergrit, Sareewan Thoongsup, Thatsanee Charoenporn, Virach Sornlertlamvanich and Hitoshi Isahara. 2010. WNMS: Connecting the Distributed WordNet in the Case of Asian WordNet. In Proceedings of the 5th International Conference of the Global WordNet Association (GWC-2010), Mumbai, India, 31 Jan.–4 Feb. 2010.

Wiebe, Janyce and Rada Mihalcea. 2006. Word sense and subjectivity. In Proceedings of COLING/ACL-06, the 21st Conference on Computational Linguistics / Association for Computational Linguistics, Sydney, Australia, pages 1065–1072.

Wiebe, Janyce and Ellen Riloff. 2006. Creating Subjective and Objective Sentence Classifiers from Unannotated Texts. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pages 475–486, 2006.

Wilson, Theresa, Janyce Wiebe and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proceedings of HLT/EMNLP 2005, Vancouver, Canada.

Constructing Thai Opinion Mining Resource: A Case Study on Hotel Reviews

Choochart Haruechaiyasak, Alisa Kongthon, Pornpimon Palingoon and Chatchawal Sangkeettrakarn
Human Language Technology Laboratory (HLT)
National Electronics and Computer Technology Center (NECTEC)
[email protected], [email protected], [email protected], [email protected]

Abstract

Opinion mining and sentiment analysis have recently gained increasing attention in the NLP community. Opinion mining is considered a domain-dependent task, and constructing lexicons for different domains is labor intensive. In this paper, we propose a framework for constructing a Thai language resource for feature-based opinion mining. Feature-based opinion mining essentially relies on the use of two main lexicons: features and polar words. Our approach for extracting features and polar words from opinionated texts is based on syntactic pattern analysis. The evaluation is performed with a case study on hotel reviews. The proposed method has been shown to be very effective in most cases. However, in some cases the extraction is not straightforward, due firstly to the use of conversational language in written opinionated texts and secondly to the language semantics. We provide a discussion with possible solutions on pattern extraction for some of the challenging cases.

1 Introduction

With the popularity of Web 2.0 and social networking websites, the amount of user-generated content has increased exponentially. One interesting type of this user-generated content is text written with opinions and/or sentiments. An in-depth analysis of these opinionated texts could reveal potentially useful information regarding the preferences of people towards many different topics, including news events, social issues and commercial products. Opinion mining and sentiment analysis is such a task for analyzing and summarizing what people think about a certain topic.

Due to its potential and useful applications, opinion mining has gained a lot of interest in the text mining and NLP communities (Ding et al., 2008; Jin et al., 2009). Much work in this area focused on evaluating reviews as being positive or negative, either at the document level (Turney, 2002; Pang et al., 2002; Dave et al., 2003; Beineke et al., 2004) or the sentence level (Kim and Hovy, 2004; Wiebe and Riloff, 2005; Wilson et al., 2009; Yu and Hatzivassiloglou, 2003). For instance, given some reviews of a product, the system classifies them into positive or negative reviews. No specific details or features are identified about what customers like or dislike. To obtain such details, a feature-based opinion mining approach has been proposed (Hu and Liu, 2004; Popescu and Etzioni, 2005). This approach typically consists of the following two steps.

1. Identifying and extracting features of an object, topic or event from each sentence upon which the reviewers expressed their opinion.

2. Determining whether the opinions regarding the features are positive or negative.

Feature-based opinion mining can provide users with insightful information related to opinions on a particular topic. For example, for hotel reviews, feature-based opinion mining allows users to view positive or negative opinions on hotel-related features such as price, service, breakfast, room, facilities and activities. Breaking down opinions to the feature level is essential for decision making, since different customers could have different preferences when selecting hotels for a vacation: some might prefer hotels which provide full facilities, while others might prefer good room service.

The main drawback of feature-based opinion mining is the preparation of the different lexicons, including features and polar words. To make things worse, these lexicons, especially the features, are domain-dependent: for each particular domain, a set of features and polar words must be prepared. The process of language resource construction is generally labor intensive and time consuming. Some previous works have proposed approaches for automatically constructing the lexicons for feature-based opinion mining (Qiu et al., 2009; Riloff and Wiebe, 2003; Sarmento et al., 2009). Most approaches apply machine learning algorithms for learning rules from a corpus; the rules are then used for extracting new features and polar words from an untagged corpus. Reviews of different approaches are given in the related work section.

In this paper, we propose a framework for constructing a Thai language resource for feature-based opinion mining. Our approach is based on syntactic pattern analysis of two lexicon types: domain-dependent and domain-independent. The domain-dependent lexicons include features, sub-features and polar words. The domain-independent lexicons are particles, negative words, degree words, auxiliary verbs, prepositions and stop words. Using these lexicons, we construct a set of syntactic rules based on frequently occurring patterns. The rule set can then be used for extracting more unseen sub-features and polar words from an untagged corpus.

We evaluated the proposed framework on the domain of hotel reviews. The experimental results showed that our proposed method is very effective in most cases, especially for extracting polar words. However, in some cases the extraction is not straightforward due to the use of conversational language, idioms and hidden semantics. We provide some discussion on the challenging cases and suggest some solutions as future work.

The remainder of this paper is organized as follows. In the next section, we review related work on different approaches for constructing language resources for opinion mining and sentiment analysis. In Section 3, we present the proposed framework for constructing a Thai opinion mining resource using the dual pattern extraction method. In Section 4, we apply the proposed framework to a case study of hotel reviews; the performance evaluation is given with the experimental results, and some difficult cases are discussed along with possible solutions. Section 5 concludes the paper with future work.

2 Related work

The problem of developing subjectivity lexicons for training and testing sentiment classifiers has recently attracted some attention. The Multi-perspective Question Answering (MPQA) opinion corpus is a well-known resource for sentiment analysis in English (Wiebe et al., 2005). It is a collection of news articles from a variety of news sources manually annotated at the word and phrase levels for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.). The annotation in this work also took the context into account, which is essential for resolving possible ambiguities and accurately determining polarity.

Although most of the reference corpora have focused on the English language, work on other languages is growing as well. Kanayama and Nasukawa (2006) proposed an unsupervised method to detect sentiment words in Japanese. In this work, they used clause-level context coherency to identify candidate sentiment words from sentences that appear successively with sentences containing seed sentiment words. Their assumption is that unless the context is changed with adversative expressions, sentences appearing together in that context tend to have the same polarities.

Hence, if one of them contains sentiment words, the other successive sentences are likely to contain sentiment words as well. Ku and Chen (2007) proposed the bag-of-characters approach to determine sentiment words in Chinese. This approach first calculates the observation probabilities of characters from a set of seed sentiment words, then dynamically expands the set and adjusts the probabilities. Later, Ku et al. (2009) extended the bag-of-characters approach by including morphological structures and syntactic structures between sentence segments. Their experiments showed better performance in word polarity detection and opinion sentence extraction.

Some other methods to automatically generate resources for subjectivity analysis in a foreign language have leveraged the resources and tools available for English. For example, Banea et al. (2008) applied machine translation and standard Naive Bayes and SVM classifiers to subjectivity classification for Romanian. Their experiments showed promising results for applying automatic translation to construct resources and tools for opinion mining in a foreign language. Wan (2009) also leveraged an available English corpus for Chinese sentiment classification by using the co-training approach to make full use of both English and Chinese features in a unified framework. Jijkoun and Hofmann (2009) described a method for creating a Dutch subjectivity lexicon based on an English lexicon. They applied a PageRank-like algorithm that bootstraps a subjectivity lexicon from the translation of the English lexicon and ranks the words in the thesaurus by polarity using the network of lexical relations (e.g., synonymy, hyponymy) in WordNet.

3 The proposed framework

The performance of feature-based opinion mining relies on the design and completeness of the related lexicons. Our lexicon design distinguishes two types of lexicons: domain-dependent and domain-independent. The design of the domain-dependent lexicons is based on the feature-based opinion mining framework proposed by Liu et al. (2005). The framework starts by setting the domain scope, such as digital cameras. The next step is to design a set of features associated with the given domain. For the domain of digital cameras, features could be, for instance, "price", "screen size" and "picture quality". Features can contain sub-features. For example, picture quality could have sub-features such as "macro mode", "portrait mode" and "night mode". Preparing multiple feature levels could be time-consuming; therefore, we limit the features to two levels: main features and sub-features.

Another domain-dependent lexicon is polar words. Polar words are sentiment words which represent either positive or negative views on features. Some polar words are domain-independent and have explicit meanings, such as "excellent", "beautiful", "expensive" and "terrible". Other polar words are domain-dependent and have implicit meanings depending on the context. For example, the word "large" is generally considered positive for the screen size feature in the digital camera domain; however, for the dimension feature in the mobile phone domain, the word "large" could be considered negative.

On the other hand, the domain-independent lexicons are regular words with different parts of speech (POS) and functions in the sentence. For the opinion mining task, we design six domain-independent lexicons as follows (some examples are shown in Table 1, and a schematic tagging sketch follows the list).

• Particles (PAR): In Thai, these words refer to the sentence endings which are normally used to add politeness to the speaker's utterance (Cooke, 1992).

• Negative words (NEG): As in English, these words are used to invert the opinion polarity. Examples are "not", "unlikely" and "never".

• Degree words (DEG): These words are used as intensifiers of the polar words. Examples are "large", "very" and "enormous".

• Auxiliary verbs (AUX): These words are used to modify verbs. Examples are "should", "must" and "then".

• Prepositions (PRE): As in English, Thai prepositions are used to mark the relations between two words.

• Stop words (STO): These words are used for grammaticalization. Thai is considered an isolating language: to form a noun, the words "karn" and "kwam" are normally placed in front of a verb or a noun, respectively. Therefore, these words can be neglected when analyzing opinionated texts.
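As a rough illustration of how these lexicon types are applied, the sketch below tags a word-segmented review with lexicon-type labels. The tag names follow the paper (FEA, POL, NEG, DEG, PAR), but the word lists and the English/romanized tokens are placeholders rather than the actual Thai lexicons.

    # Illustrative lexicon-type tagger; the word lists are placeholders.
    LEXICONS = {
        "FEA": {"breakfast", "room", "service"},   # main features
        "POL": {"good", "bad", "delicious"},       # polar words
        "NEG": {"not", "never"},                   # negative words
        "DEG": {"very", "extremely"},              # degree words
        "PAR": {"kha", "krub"},                    # politeness particles
    }

    def tag(tokens):
        """Label each token with its lexicon type, or OTHER if unknown."""
        out = []
        for t in tokens:
            label = next((k for k, ws in LEXICONS.items() if t in ws), "OTHER")
            out.append((t, label))
        return out

    # Token order mimics Thai, where the degree word follows the polar word.
    print(tag(["breakfast", "good", "very", "kha"]))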

Table 1: Domain-independent lexicons.

Although some of the above lexicons are similar to English ones, some words are placed in a different position in the sentence. For example, in Thai, a degree word is usually placed after a polar word: "very good" would be written as "good very" in Thai.

Figure 1 shows all the processes and the work flow of the proposed framework. The process starts with a corpus which is tagged based on the two lexicon types. From the tagged corpus, we construct patterns and lexicons. The pattern construction is performed by collecting text segments which contain both features and polar words. All patterns are sorted by frequency of occurrence. The lexicon construction is performed by simply collecting words which are already tagged with the lexicon types. The lexicons are used for performing the feature-based opinion mining task, such as classifying and summarizing the reviews as positive or negative based on different features. The completeness of the lexicons is very important for feature-based opinion mining. To collect more lexicon entries, the patterns are used in the dual pattern extraction process to extract more features and polar words from the untagged corpus.

Figure 1: The proposed opinion resource construction framework based on the dual pattern extraction.
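A minimal sketch of the pattern-construction step in this workflow might look as follows: tagged segments that contain both a feature tag and a polar-word tag are reduced to their tag sequences, which are then counted and ranked by frequency. The data and tag names are illustrative.

    from collections import Counter

    # Each segment is a list of (word, tag) pairs produced by the tagger.
    tagged_segments = [
        [("breakfast", "FEA"), ("delicious", "POL")],
        [("room", "FEA"), ("good", "POL"), ("very", "DEG")],
        [("breakfast", "FEA"), ("good", "POL")],
    ]

    def build_patterns(segments):
        """Count tag sequences that contain both a feature and a polar word."""
        counts = Counter()
        for seg in segments:
            tags = tuple(t for _, t in seg)
            if any(t.startswith("FEA") for t in tags) and "POL" in tags:
                counts[tags] += 1
        return counts.most_common()

    print(build_patterns(tagged_segments))
    # e.g. [(('FEA', 'POL'), 2), (('FEA', 'POL', 'DEG'), 1)]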

4 A case study of hotel reviews

To evaluate the proposed framework, we perform experiments with a case study of hotel reviews. In Thailand, tourism is ranked as one of the top industries. According to statistics provided by the Office of Tourism Development (http://www.tourism.go.th), the number of international tourists visiting Thailand in 2009 was approximately 14 million, and the number of registered hotels in all regions of Thailand is approximately 5,000. Providing an opinion mining system for hotel reviews could therefore be very useful for tourists making decisions on hotel choice when planning a trip.

4.1 Corpus preparation

We collected customer reviews from the Agoda website (http://www.agoda.com). The total number of reviews in the corpus is 8,436. Each review contains the name of the hotel as the title and comments in free-form text format. We designed a set of 13 main features: service, cleanliness, hotel condition, location, food, breakfast, room, facilities, price, comfort, quality, activities and security. The set of main features is designed based on the features used on the Agoda website. Some additional features, such as activities and security, are added to provide users with more dimensions.

In this paper, we focus on two main features: breakfast and service. Table 2 shows the domain-dependent lexicons related to the breakfast feature. For the breakfast main feature (FEA), we include all synonyms which could be used to describe breakfast in Thai. These include English terms with their synonyms, transliterated terms and abbreviations.

The breakfast sub-features (FEA*) are specific concepts of breakfast. Examples include "menu", "taste", "service" and "coffee". It can be observed that some of the sub-features could also act as a main feature; for example, the sub-feature "service" of breakfast is also used as the main feature "service". Providing a sub-feature level can help reveal more insightful dimensions for the users. However, designing multiple feature levels could be time-consuming; therefore, we limit the features to two levels, i.e., main feature and sub-feature. The polar words (POL) are also shown in the table. We denote the positive and negative polar words by placing [+] and [-] after each word. It can be observed that some polar words are dependent on sub-features. For example, the polar word "long line" can only be used for the sub-feature "restaurant".

Table 2: Domain-dependent lexicons for the breakfast feature.

Table 3 shows the domain-dependent lexicons related to the service feature. The main features include synonyms, transliterated and English terms which describe the concept of service. The service sub-features are, for example, "reception", "security guard", "maid", "waiter" and "concierge". Unlike the breakfast feature, the polar words for the service feature are quite general and can mostly be applied to all sub-features. Another observation is that some of the polar words are based on Thai idioms. For example, the phrase "having rigid hands" in Thai means "impolite", since in Thai culture people show politeness by performing the "wai" gesture.

4.2 Experiments and results

Using the tagged corpus and the extracted lexicons, we construct the most frequently occurring patterns. For the two main features, breakfast and service, the numbers of tagged reviews are 301 and 831, respectively. We randomly split the corpus into 80% as a training set and 20% as a test set. We only consider the patterns which contain both features (either main features or sub-features) and polar words. For the breakfast feature, the total number of extracted patterns is 86; for the service feature, it is 192. Tables 4 and 5 show some examples of the most frequently occurring patterns extracted from the corpus.

Table 4: Top-ranked breakfast patterns with examples.

Table 3: Domain-dependent lexicons for the service feature.

The symbols of the tag set are as shown in Tables 1 and 2, with an additional tag denoting any other words. From the tables, two patterns which occur frequently for both features are <FEA><POL> and <FEA*><POL>. These two patterns are very simple and show that opinionated texts in Thai are mostly very simple: users simply write a word describing the feature followed by a polar word (either positive or negative) without using any verb in between. In English, a verb "to be" (is/am/are) is usually required between <FEA> and <POL>.

Using the extracted patterns, we perform the dual pattern extraction process to collect sub-features and polar words from the test data set. Table 6 shows the evaluation results of sub-feature and polar word extraction for both the breakfast and service features. It can be observed that the set of patterns can extract polar words (POL) with higher accuracy than sub-features (FEA*). This could be because the patterns used to describe the polar words are straightforward and not complicated. This is especially true for the breakfast feature, for which the accuracy is approximately 95%.

Table 5: Top-ranked service patterns with examples.

4.3 Discussion

Tables 7 and 8 show some examples of challenging cases for the breakfast and service features, respectively. The polar words shown in both tables are very difficult to extract since the patterns cannot be easily captured.
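The dual pattern extraction step described above can be pictured with the following sketch: a learned tag sequence such as (FEA, POL) is matched against a segment, and if all but one position is covered by words already in the lexicons, the remaining word is harvested as a new lexicon candidate. This is a schematic reading of the method, with invented data.

    def harvest(pattern, words, lexicons):
        """If all but one slot of the pattern is covered by known words,
        return the uncovered word as a new lexicon candidate."""
        if len(pattern) != len(words):
            return None
        unknown = [(tag, w) for tag, w in zip(pattern, words)
                   if w not in lexicons.get(tag, set())]
        if len(unknown) == 1:     # exactly one new word in the segment
            return unknown[0]     # (lexicon type, candidate word)
        return None

    lex = {"FEA": {"breakfast"}, "POL": {"good"}}
    print(harvest(("FEA", "POL"), ["breakfast", "tasty"], lex))
    # -> ('POL', 'tasty'): a candidate new polar word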

Feature     FEA*    POL
Breakfast   80.00   95.74
Service     82.56   89.29

Table 6: Evaluation results of feature and polar word extraction (accuracy, %).

The difficulties are due to many reasons, including the language semantics and the need for world knowledge. For example, in case #5 of the service feature, the whole phrase can be interpreted as "attentive"; it is difficult for the system to generate a pattern based on this phrase. Another example is case #4 of both tables, where the customers express their opinions by comparing to other hotels. To analyze the sentiment correctly would require knowledge of a particular hotel or of hotels in specific locations.

Table 7: Examples of difficult cases of the breakfast feature.

Table 8: Examples of difficult cases of the service feature.

5 Conclusion and future work

We proposed a framework for constructing a Thai opinion mining resource, with a case study on hotel reviews. Two sets of lexicons, domain-dependent and domain-independent, are designed to support the pattern extraction process. The proposed method first constructs a set of patterns from a tagged corpus. The extracted patterns are then used to automatically extract and collect more sub-features and polar words from an untagged corpus. The performance evaluation was done with a collection of hotel reviews obtained from a hotel reservation website. From the experimental results, polar words could be extracted more easily than sub-features. This is because the polar words often appear in specific positions with repeated contexts in the opinionated texts. In some cases, extraction of sub-features and polar words is not straightforward due to the difficulties in generalizing patterns: some subjectivity requires complete phrases to describe the polarity, and in some cases the sub-features are not explicitly expressed in the sentence. For future work, we plan to complete the construction of the corpus by considering the rest of the main features. Another plan is to include semantic analysis in the pattern extraction process; for example, the phrase "forget something" could imply negative polarity for the service feature.

References

Banea, Carmen, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. Proc. of the 2008 Conference on Empirical Methods in Natural Language Processing, 127–135.

Beineke, Philip, Trevor Hastie and Shivakumar Vaithyanathan. 2004. The sentimental factor: improving review classification via human-provided information. Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, 263–270.

Cooke, J.R. 1992. Thai sentence particles: putting the puzzle together. Proc. of the Third International Symposium on Language and Linguistics, 1105–1119.

Dave, Kushal, Steve Lawrence and David M. Pennock. 2003. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. Proc. of the 12th International Conference on World Wide Web, 519–528.

Ding, Xiaowen, Bing Liu and Philip S. Yu. 2008. A holistic lexicon-based approach to opinion mining. Proc. of the International Conference on Web Search and Web Data Mining, 231–240.

Hu, Minqing and Bing Liu. 2004. Mining and summarizing customer reviews. Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–177.

Jijkoun, Valentin and Katja Hofmann. 2009. Generating a non-English subjectivity lexicon: relations that matter. Proc. of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 398–405.

Jin, Wei, Hung Hay Ho and Rohini K. Srihari. 2009. OpinionMiner: a novel machine learning system for web opinion mining and extraction. Proc. of the 15th ACM SIGKDD, 1195–1204.

Kanayama, Hiroshi and Tetsuya Nasukawa. 2006. Fully automatic lexicon expansion for domain-oriented sentiment analysis. Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing, 355–363.

Kim, Soo-Min and Eduard Hovy. 2004. Determining the sentiment of opinions. Proc. of the 20th International Conference on Computational Linguistics, 1367–1373.

Ku, Lun-Wei and Hsin-Hsi Chen. 2007. Mining opinions from the Web: beyond relevance retrieval. Journal of the American Society for Information Science and Technology, 58(12):1838–1850.

Ku, Lun-Wei, Ting-Hao Huang and Hsin-Hsi Chen. 2009. Using morphological and syntactic structures for Chinese opinion analysis. Proc. of the 2009 Conference on Empirical Methods in Natural Language Processing, 1260–1269.

Liu, Bing, Minqing Hu and Junsheng Cheng. 2005. Opinion observer: analyzing and comparing opinions on the Web. Proc. of the 14th International World Wide Web Conference, 342–351.

Pang, Bo, Lillian Lee and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. Proc. of the ACL-02 Conference on Empirical Methods in Natural Language Processing, 79–86.

Popescu, Ana-Maria and Oren Etzioni. 2005. Extracting product features and opinions from reviews. Proc. of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 339–346.

Qiu, Guang, Bing Liu, Jiajun Bu, and Chun Chen. 2009. Expanding domain sentiment lexicon through double propagation. Proc. of the 21st International Joint Conference on Artificial Intelligence, 1199–1204.

Riloff, Ellen and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. Proc. of the 2003 Conference on Empirical Methods in Natural Language Processing, 105–112.

Sarmento, Luís, Paula Carvalho, Mário J. Silva, and Eugénio de Oliveira. 2009. Automatic creation of a reference corpus for political opinion mining in user-generated content. Proc. of the 1st CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, 29–36.

Turney, Peter D. 2002. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. Proc. of the 40th ACL, 417–424.

Wan, Xiaojun. 2009. Co-training for cross-lingual sentiment classification. Proc. of the Joint Conference of ACL and IJCNLP, 235–243.

Wiebe, Janyce and Ellen Riloff. 2005. Creating subjective and objective sentence classifiers from unannotated texts. Proc. of the Conference on Intelligent Text Processing and Computational Linguistics, 486–497.

Wiebe, Janyce, Theresa Wilson and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Wilson, Theresa, Janyce Wiebe and Paul Hoffmann. 2009. Recognizing contextual polarity: an exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.

Yu, Hong and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. Proc. of the Conference on Empirical Methods in Natural Language Processing, 129–136.

The Annotation of Event Schema in Chinese

Hongjian Zou (1), Erhong Yang (1), Yan Gao (2), Qingqing Zeng (1)
(1) Institute of Applied Linguistics, Beijing Language and Culture University
(2) Zhanjiang Normal University
[email protected], [email protected]

Abstract

We present a strategy for revealing event schema in Chinese based on the manual annotation of texts. The overall event information is divided into three levels, and events are chosen as the elementary units of annotation. The event-level annotation content and the obtaining of event patterns are explored in detail. The discourse-level annotation, covering relations between events and the functional attributes of events, provides a simple way to represent event schema.

1 Introduction

When we want to understand a report on occurrences, we need to catch the following information: the categorization of the events, the relationships between them, the participants and the attributes of the events such as polarity and modality, the attitudes towards the events, and the following actions or consequences. However, this information alone cannot precisely describe the report. Furthermore, we need to form a schema which incorporates all of the above, that is, to compile all this information together to get the integral structure of the report.

The available annotated corpora concerning the different types of information mentioned above include the event-annotated corpora such as the ACE corpora, corpora annotating temporal information such as TimeBank, corpora annotating event factuality such as FactBank, and corpora annotating various types of discourse relations such as the RST corpus and the Penn Discourse TreeBank. Meanwhile, we lack the annotation of event schema, which is important for providing the integral meaning of the reports. Currently, for the Chinese language, the annotation of event information corpora is just beginning and is still far from sufficient when compared with English; hence it needs further exploration.

2 Related Work

The work and theories concerning event schema annotation can be divided into three categories. The first kind focuses on annotation of the event argument structure, such as in ACE. The second kind focuses on annotation of temporal information and event factuality. The last focuses on the annotation of the relations among different discourse units, such as the RST corpus and the Penn Discourse TreeBank.

ACE (2005) is an in-depth, research-oriented annotated corpus for the purpose of textual information extraction. The annotation task includes event annotation besides the annotation of entities, values and relations between entities. The event annotation is limited to certain types and subtypes of events, that is, Life, Movement, Transaction, Business, Conflict, Contact, Personnel, and Justice. The argument structure of events, including participants and other components such as time and place, is predefined and tagged. Besides these, four kinds of event attributes (polarity, tense, genericity and modality) are tagged. The expression characters of events, including the extent and the triggers, are also tagged.

TimeML (Pustejovsky et al., 2003; TimeML, 2005) is a system for representing not only events but also temporal information. The events tagged are not limited to certain types as in ACE, but are classified in a different way. Event tokens and event instances are distinguished and tagged respectively.

For each event instance, four kinds of attributes, namely tense, aspect, polarity and modality, are tagged. TimeML defines three kinds of links between events and times. TLINK represents temporal relationships, including simultaneous, before and after. SLINK represents subordinative relationships. ALINK represents relationships between an aspectual event and its argument event. Several TimeML corpora have now been created, including TimeBank and the AQUAINT TimeML Corpus.

FactBank (Roser and Pustejovsky, 2008, 2009; Roser, 2008) is a corpus that adds factuality information to TimeBank. The factual value of events under certain sources is represented by two kinds of attributes, modality and polarity.

Besides the annotation of events and their temporal relationships or factuality information, there are various types of discourse annotation, which can be divided into two trends: one under the guidance of a certain discourse theory (such as RST) and one independent of any specific theory (such as the PDTB).

RST (Mann and Thompson, 1987; Taboada and Mann, 2006) was originally developed as part of studies on computer-based text generation by William Mann and Sandra Thompson in the 1980s. In the RST framework, the discourse structure of a text can be represented as a tree. The leaves of the tree correspond to text fragments that represent the minimal units of the discourse; the internal nodes of the tree correspond to contiguous text spans; each node is characterized by its nuclearity and by a rhetorical relation that holds between two or more non-overlapping, adjacent text spans. RST chooses the clause as the elementary unit of discourse. All units are also spans, and spans may be composed of more than one unit. RST relations are defined in terms of four fields: (1) constraints on the nucleus; (2) constraints on the satellite; (3) constraints on the combination of the nucleus and the satellite; and (4) effects. The number and the types of relations are not fixed; they can be reduced or extended. Carlson et al. (2003) describe the experience of developing a discourse-annotated corpus grounded in the framework of Rhetorical Structure Theory; the resulting corpus contains 385 documents selected from the Penn Treebank.

The Penn Discourse TreeBank (Miltsakaki et al., 2004; Webber et al., 2005) annotates the million-word WSJ corpus in the Penn TreeBank with a layer of discourse information. Although the idea of annotating connectives and their arguments comes from the theoretical work on discourse connectives in the framework of lexicalized grammar, the corpus itself is not tied to any particular theory. Discourse connectives are treated as discourse-level predicates of binary discourse relations that take two abstract objects such as events, states, and propositions. The two arguments of a discourse connective are simply labeled Arg1 and Arg2.

3 The Levels and Elementary Unit of Event Schema Annotation

3.1 The Elementary Unit of Event Schema Annotation

What counts as an elementary unit of event schema annotation in Chinese? It is common to set sentences or clauses as the basic units in discourse annotation, as in the RST corpus. However, there will be certain limitations if we choose sentences or clauses as the elementary units of Chinese event schema annotation.

First, a Chinese sentence is generally defined as a grammatical unit that has pauses before and after it, a certain intonation, and expresses a complete idea. But this definition is not exact or operational: the only notable borders of Chinese sentences in writing are the punctuation marks at the end of the sentences. The same is true of clauses in Chinese.

Second, there is generally more than one event in a sentence or a clause in Chinese. Hence, if we choose sentences or clauses as the basic units of event schema annotation, the relations between the events in one sentence/clause cannot be described in detail. For example:

1. 不到 24 小时,俄罗斯南部黑海边一个老年人之家又燃起熊熊大火,至少 62 人葬身火海。 (In less than 24 hours, a fire swept through an old people's home on the Black Sea coast of southern Russia and killed at least 62 people.)

2. 智利南部艾森大区自 22 日以来频繁发生地震,该地区政府已宣布该地区进入"早期警报"状态。 (Earthquakes have hit the Aysen region in southern Chile frequently since the 22nd. The government has declared the region to be in a state of "early warning".)

In example 1, there are two events (in bold type) in one sentence: the fire and the death. In example 2, there are also two events in a single sentence: the earthquake and the declaration.

The "event" in this paper covers the same meaning as defined by ACE (2005), which refers to "a specific occurrence involving participants". Zou and Yang (2007) show that an average of 2.3 events per sentence are reported in Chinese texts, and hence chose events as the basic discourse unit in their annotation. This consideration also fits the elementary unit of event schema annotation.

3.2 Three Levels of Event Schema Annotation

The overall event information in a report is complex and consists of different levels. In order to simplify the annotation task, we first divide the total event information into three levels, that is, the discourse level, the event level, and the entity level, choosing the event as the elementary unit of the event schema annotation.

The event level is defined as the level relating to atomic events. A report of occurrences always has many related events that are very easy to recognize. The events are atomic, which means the events are divided into small and minimal events. For example, when reading a report about the earthquake that happened in Haiti, the reader will not only know about the earthquake itself, but also about other related happenings, such as the number of casualties or the subsequent search and rescue. These things are divided into different atomic events, though they are still linked closely.

The entity level covers the entities, times, and locations that are involved in events. For example, in "China rescues 115 from a flooded mine", "China" is the agent of the rescue, "115 (miners)" are the recipients, and "a flooded mine" is the location. These three entities are the arguments of the rescue event and should be annotated before being tagged as such.

The discourse level is the level above the event level which creates the integral meaning of the event schema. For example, a report concerning the rescue of miners from a flooded mine involves the rescue, the coalmine accident and possibly injuries. These events are linked together but have different significances within the report. So it is necessary to annotate the different significances of the events, as well as the relations between events.

The following sections discuss in detail the event-level and the discourse-level annotation, while the entity-level annotation is not discussed, considering its relative simplicity.

4 Event-level Annotation

4.1 Definition of Events

ACE (2005) defines an event as follows: an event is a specific occurrence involving participants; an event is something that happens; an event can frequently be described as a change of state. Following ACE's definition, we define an event as an occurrence that catches somebody's attention and a change of state.

4.2 Obtainment of Event Patterns

The event patterns are the argument structures of certain types of events, and they direct the argument annotation. They are extracted from large-scale texts category by category. The categories are based on the classification of sudden events: sudden events are divided into 4 categories (natural disasters, accidental disasters, public health incidents, and social security incidents), and each category includes different types of events; for example, natural disasters include earthquakes, tsunamis, debris flows and so on. In dealing with a specific kind of text, only the closely related events that appear frequently are annotated. For example, when annotating the events of an earthquake, only the earthquake itself and closely related events, such as loss and rescue, are annotated.

The event patterns are manually extracted from real texts as follows, taking earthquakes for instance:

• A search engine is used to obtain the reports whose titles and main bodies contain the keyword 'earthquake', and texts whose topics are not earthquakes are then manually filtered out;

• The remaining texts are then split into sentences, and only the sentences that narrate an earthquake or are closely related to the earthquake are selected;

• Specific entities in these sentences are replaced with general tags (such as time and location placeholders), as sketched below.
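A simple version of the entity-generalization step mentioned in the last bullet might look like the following sketch: specific magnitude, time and place expressions are replaced by general slot tags so that recurring argument structures become visible across sentences. The toy rules stand in for proper named-entity and numeral recognition.

    import re

    # Toy generalizer: these rules stand in for real entity recognition.
    RULES = [
        (r"\d+(\.\d+)?-magnitude|里氏\s*\d+(\.\d+)?\s*级", "<Magnitude>"),
        (r"on\s+\w+\.?\s*\d+|\d+\s*日", "<Time>"),
        (r"Haiti|Chile|智利|海地", "<Location>"),
    ]

    def generalize(sentence):
        """Replace specific entities with general slot tags."""
        for pattern, tag in RULES:
            sentence = re.sub(pattern, tag, sentence)
        return sentence

    print(generalize("A 7.0-magnitude earthquake hit Haiti on Jan. 12"))
    # -> "A <Magnitude> earthquake hit <Location> <Time>"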

arg0  Time
arg1  Location
arg2  Magnitude
arg3  Epicenter
arg4  Focus
arg5  Focal depth
arg6  Quake-feeling locations
arg7  Frequency

Table 1. The earthquake event pattern.

4.3 Annotation of Types and Arguments

After obtaining the event patterns, we can annotate the types and the arguments of events according to the predefined types and patterns. If a certain event type is not yet defined, the annotator should tag the event as "Other" and retag it later after obtaining the pattern of that category, provided that the category is not too rare in similar texts.

Table 2. The annotation of the Haiti earthquake.

4.4 Annotation of Event Attributes

Besides the types and arguments, the attributes of events are also tagged, which is necessary for a comprehensive description of events. Based on the analysis of various attributes in the reports, we decided to annotate the following: Polarity, Modality, Tense, Aspect, Level, Frequency, Source, and Fulfillment. Among these attributes, Polarity, Modality and Tense are adopted by both ACE and TimeML; Aspect, Frequency and Source are adopted by TimeML. The primary reason for annotating these attributes is that they have an important role in describing events in detail, and different values of some attributes can even imply a totally different meaning.
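One plausible way to picture the resulting annotation unit is as a record combining the argument slots of an event pattern (Table 1) with the attribute set of Section 4.4. The layout below is a hypothetical illustration, not the authors' actual annotation format.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class EventAnnotation:
        event_type: str                     # e.g. "earthquake"
        event_word: str                     # the lexical trigger
        args: dict = field(default_factory=dict)  # arg0..arg7 slots
        polarity: str = "Positive"          # Positive / Negative
        modality: str = "Asserted"          # Asserted / Other
        tense: str = "Underspecified"       # Past/Present/Future/Underspecified
        aspect: str = "Underspecified"      # Perfective/Progressive/Underspecified
        level: Optional[str] = None         # Slight / Medium / Serious
        frequency: int = 1

    quake = EventAnnotation(
        event_type="earthquake",
        event_word="地震",
        args={"arg1": "Haiti", "arg2": "7.0"},  # Location, Magnitude
    )
    print(quake.polarity, quake.args["arg2"])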

Polarity indicates whether the event happened or would happen. The value of polarity can only be "Positive" or "Negative". For example, in "所幸这起火灾没有造成人员伤亡" (Fortunately the fire did not result in any casualties), the polarity of the event "伤亡" (injuries-or-deaths) is "Negative".

Modality is the possibility of the event. Currently, we divide modality simply into "Asserted" and "Other". For example, in "震区许多居民担心再次发生海啸" (Many residents in the earthquake-hit areas worry about a recurrence of the tsunami), the modality of the event "海啸" (tsunami) is "Other".

Tense is the time the event happened compared with the time of the report. It can be "Past", "Present", "Future", or "Underspecified". For example, in "警方目前正在进行调查" (A police investigation is under way), the tense of the event "调查" (investigation) is "Present".

Aspect indicates whether the event is continuing or completed. It can be "Progressive", "Perfective" or "Underspecified". In the sentence above, the aspect of the event "调查" (investigation) is "Progressive".

Level is the extent of the event. It can be "Serious", "Medium" or "Slight". If the annotator cannot make sure, it can also be left unspecified. For example, in "强烈地震袭击印尼" (Strong earthquake hits Indonesia), the level of the event "地震" (earthquake) is "Serious".

Frequency is how many times the event happened. Usually it is only once; yet sometimes, as mentioned above, it may be twice or more.

Source consists of the source of the information about a certain event and the time the information was issued. If not specified, the source is equal to the source of the report itself, and the time of the source is equal to the time that the report was issued. For example, in "巴黎警方 10 日透露" (according to statements by the Paris police on the 10th), the source is "巴黎警方" (the Paris police) and the time issued is "10 日" (the 10th).

Fulfillment is an interesting attribute of events that deserves further study and will be discussed in another paper. This attribute is only applicable to man-made events with an emphasized intention; in other words, it is not applicable to events occurring naturally. It can be "Fulfilled", "Unfulfilled", or "Underspecified". For example, a rescue event is deliberate and has or will have a result. In "中国成功救出被困 8 昼夜的 115 名矿工" (China rescues 115 miners trapped for 8 days), the fulfillment of the event "救" (rescue) is "Fulfilled".

The complete attributes of an event can be represented as a complex feature set as shown below:

Polarity: Positive / Negative
Modality: Asserted / Other
Tense: Past / Present / Future / Underspecified
Aspect: Perfective / Progressive / Underspecified
Level: Slight / Medium / Serious
Frequency: n
Source: (time_1, Source_1) ... (time_m, Source_n)
Fulfillment: Fulfilled / Unfulfilled / Underspecified

Figure 1. The complex feature set of attributes.

4.5 Annotation of Indicators

The recognition of the types, arguments and attributes of the events depends not only on the sense of the annotator, but also on linguistic indicators within the text. To locate the existence of an event and its type, the annotator should find lexical evidence, which we call an Event Word (ACE calls it a trigger), that clearly indicates something has happened. Consider the following sentence: 巴黎警方 10 日透露,位于巴黎一区一座公寓楼 10 日凌晨发生严重火灾,造成至少 2 名妇女死亡,两名消防队员重伤和巨大财物损失。 (According to statements made by the Paris police on the 10th, a serious fire swept through an apartment building in district one in Paris on the morning of the 10th, killing at least 2 women, seriously injuring two firemen and causing huge property damage.)

The Event Words "火灾" (fire), "死亡" (killing), "重伤" (injuring) and "损失" (damage) in the sentence above indicate four events respectively.

Besides annotating Event Words for events, the annotator also needs to annotate indicators from the texts that help locate the attributes of the events. The attributes annotated should be clearly indicated by linguistic hints, so the value of a certain attribute is not specified if the hints are not clear enough.

5 Discourse-level Annotation

The purpose of discourse-level annotation is to integrate the information from the event level into a structure. We annotate two kinds of discourse information, the relationships among the events annotated before and the functional attributes of events, to represent the event schema.

5.1 Annotation of Relations among Events

The events in the same report are not self-sufficient or independent, but are linked by various relationships, such as the causal relationship between an earthquake and an injury. Taking into account both the frequency of relationships between events and the ease and accuracy of distinguishing them, we have decided to focus on the following: causality, co-reference, sequential, purpose, part-whole, juxtaposition and contrast.

Causality is very common in reports. If event A is responsible for the happening of event B, then there exists a causal relationship between A and B. For example, in "海地首都太子港附近发生里氏 7.0 级强烈地震,造成一家医院倒塌,另有多座政府建筑损毁。" (A magnitude 7.0 earthquake hit Haiti, causing a hospital to collapse and damaging government buildings in the capital city of Port-au-Prince.), there are three events, "地震" (earthquake), "倒塌" (collapsing) and "损毁" (damaging), and causal relationships between "地震" and "倒塌" / "损毁".

Co-reference is not a relationship between two different events, but the relationship between two expressions of events that refer to the same object.

Sequential is the relation between A and B such that B follows A chronologically, but there is not necessarily a causal relationship between them. For example, in "尼日利亚南部经济中心拉各斯一名 22 岁妇女 1 月 17 日因病死亡,后经尼卫生部门检测,该妇女死于高致病性禽流感。" (A 22-year-old woman died of illness on Jan. 17 in Lagos, Nigeria's southern economic hub. After testing by the Nigerian health authorities, it was found that the woman had died of bird flu.), the events "死亡" (death) and "检测" (testing) have a sequential relationship.

Purpose is the relation between A and B in which A happened for B. For example, in "尼日利亚政府目前已经在全国范围内加大了卫生监管力度,以控制高致病性禽流感的扩散。" (The Nigerian government has already strengthened hygienic supervision and regulation nationwide to control the spread of the highly pathogenic avian influenza.), the purpose of the event "监管" (supervision) is "控制" (control).

Part-whole is the relationship between A and B in which B is part of A. For example, in "台风"桑美"给福鼎市带来重大人员伤亡,截至目前已有 138 人死亡,其中海上遇难人员 116 人,陆上遇难人员 22 人,还有 86 人失踪。" (Typhoon Saomai caused significant casualties in Fuding: at least 138 people have been killed so far, including 116 at sea and 22 on land, with 86 missing.), the event "遇难" (killed) appears first and is part of the event "伤亡" (casualties).

Juxtaposition means that A and B are caused by the same thing, or that A and B are simultaneous. For example, in "大同市、左云县有关部门已对被困矿工家属进行了妥善安置。同时,环境部门正在对水质进行监测。" (The Datong and Zuoyun authorities have made proper arrangements for the families of the trapped miners. Meanwhile, the department for environmental protection has been monitoring water quality.), the events "安置" (arrangement) and "监测" (monitoring) are simultaneous.

77 the “安置”(arrangement) and “监测” (moni- logic of the report would still be clear, though toring) are simultaneous. the details might be somewhat incomplete. Contrast relationship is when A would After annotating the relationships among usually cause B, but here A happened and didn’t events and functional attributes of these events, in fact cause B. For example, in we can represent a report about an earthquake “萨尔瓦多中部地区 2 日发生里氏 5.3 级 which happened in Kyrgyzstan as follow:

地震,但没有造成人员伤亡和财产损失。” causality (A 5.3 magnitude earthquake hit the central re- 1 2 gion of Salvador on the 2nd, but caused no ca- coreference coreference sualties or property losses.) causality the “地震”(earthquake) usually causes “伤 3 4 contrast 亡” 伤亡” (casualties), but here there is no “ . coreference 5 coreference The contrast relationship between A and B is coreference 6 not equal to the negation of a causal relationship, juxtaposition 7 because in a contrast relationship A is positive coreference and B is negative, while in the negation of caus- causality 9 8 al relationship, the A is negative. sequence Besides those relationships between events 10 described above, the annotator could tag the Figure 2. Event schema of Kyrgyzstan earthquake. relation as “Underspecified” if he/she feels that Nodes of 1, 3, 6, 7 and 8 represent earthquakes; Nodes relationship belongs to a new kind and deserves of 2, 4, and 9 represent damage; Node 5 represents casual- to be annotated. ty; Nodes 10 represents investigation. These relations are also annotated with the attributes similar to those of events, but only In the graph above, the nodes represent the including Polarity, Modality, Tense, Aspect events, and the edges represent the relationships and Source. between events. The gray nodes represent the core events, while the white nodes represent the 5.2 Annotation of Functional Attributes related events. As can be seen from the graph, The annotation of relations among events only the core events are at the center of the text and represents the local discourse structure of the the related events are attached to the core events. report. To represent the overall information it is necessary to integrate the event-level informa- 6 Preliminary Results and Discussion tion globally. We find that the events annotated In order to check the taggability of the annota- in one text are not owning equal significance, tion strategy mentioned above, three graduate and they can be divided into at least two basic students manually annotated about 60 news re- kinds according to their role in expressing the ports in 3 categories, including earthquake, fire highlight of the text. The two basic kinds of role and attack, using sina search engine, according we decide to tag are “core” and “related”. We to the method and principles above. Each text call this the functional attribute of the events. was annotated by two annotators and discussed The core events are the events that are the jointly if the annotation results were inconsis- topics of the reports. Other events are the re- tent or not proper. lated events. If core events were removed, the As can be seen from Table 3 below, 1) the elementary topics would change and the remain- event patterns extracted can cover the texts well ing events could not easily be organized togeth- because up to 78% sentences have been anno- er well. For example, in a report concerning the tated. 2) There are 1.6 times more annotated earthquake that happened in Haiti several events than annotated sentences. This shows months ago, the report’s core events are the that there is generally more than one event in a events representing the earthquake. The other sentence. So, it is reasonable to assume that the events such as the rescue or the injuries are not annotation method can accomplish the task of a integral and cannot be meaningful alone. But if detailed description of relationships between the other events were removed, the topic and events. 3) The relevant events are more numer-
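A schema such as the one in Figure 2 can be represented directly as a small labeled graph, with nodes carrying the event and its functional attribute (core or related) and edges carrying the relation type. The fragment below encodes a few nodes of the Kyrgyzstan example under that assumption; the particular edge labels are illustrative.

    # Nodes: id -> (event label, functional attribute)
    nodes = {
        1: ("earthquake", "core"),
        2: ("damage", "related"),
        5: ("casualty", "related"),
    }

    # Edges: (source id, target id, relation type)
    edges = [
        (1, 2, "causality"),
        (2, 5, "juxtaposition"),   # illustrative relation choice
    ]

    def core_events(nodes):
        """Return the ids of the core events of the schema."""
        return [i for i, (_, role) in nodes.items() if role == "core"]

    print(core_events(nodes))  # -> [1]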

C    T   S    NS   EV    CE   RE   AR
C1   20  277  45   361   191  170  588
C2   20  309  66   394   183  211  515
C3   20  356  93   401   121  280  605
C4   60  942  204  1156  495  661  1708

Table 3. The annotation of EVENTs. C: sub-category; C1: earthquake; C2: fire; C3: terrorist attacks; C4: total; T: number of texts; S: number of sentences; NS: number of sentences not annotated; EV: number of EVENTs; CE: number of core EVENTs; RE: number of relevant EVENTs; AR: number of arguments.

We have also analyzed the event attributes in detail (Zou and Yang, 2010). An interesting event attribute is Fulfillment, which is only applicable to those events with intentions whose result is often emphasized. Sometimes, readers care about the intended results or outcomes as much as or more than the events themselves. Therefore it would be useful to explore the notion of Fulfillment, and to investigate which linguistic categories could play a role in deciding the value of Fulfillment. We plan to create a Fulfillment corpus in the next stage.

The annotation of event schema is time-consuming, partly because all three levels of event information need to be annotated for every text, and partly because of the difficulty of identifying the event information among trivial descriptions; in other words, one question we often discuss is whether certain parts of a text deserve to be annotated. We also often need to strike a balance between obtaining enough event patterns to cover the various types of related events well and omitting low-frequency event types to simplify the obtainment of event patterns. In discourse-level annotation, the main difficulty is the identification of relations between events without lexical hints. This discourse-level annotation is only just underway; we plan to give a detailed analysis in the next stage.

Acknowledgements. This paper is sponsored by the National Philosophy and Social Sciences Fund projects of China (No. 06YY047).

References

ACE. 2005. ACE (Automatic Content Extraction) Chinese Annotation Guidelines for Events. http://www.ldc.upenn.edu/Projects/ACE/docs/Chinese-Events-Guidelines_v5.5.1.pdf

Carlson L., D. Marcu and M. E. Okurowski. 2003. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. Current Directions in Discourse and Dialogue, Jan van Kuppevelt and Ronnie Smith, eds., Kluwer Academic Publishers.

Mann W. and S. Thompson. 1987. Rhetorical Structure Theory: A Theory of Text Organization (No. ISI/RS-87-190). Marina del Rey, CA, Information Sciences Institute.

Miltsakaki E., R. Prasad, A. Joshi, and B. Webber. 2004. Annotating discourse connectives and their arguments. Proceedings of the HLT/NAACL Workshop on Frontiers in Corpus Annotation, Boston, MA.

Pustejovsky J., J. Castaño, R. Ingria, S. Roser, R. Gaizauskas, A. Setzer and G. Katz. 2003. TimeML: robust specification of event and temporal expressions in text. Fifth International Workshop on Computational Semantics.

Roser S. 2008. A Factuality Profiler for Eventualities in Text. Ph.D. Thesis, Brandeis University.

Roser S. and J. Pustejovsky. 2008. From structure to interpretation: a double-layered annotation for event factuality. Proceedings of the 2nd Linguistic Annotation Workshop.

Roser S. and J. Pustejovsky. 2009. FactBank: a corpus annotated with event factuality. Language Resources and Evaluation.

Taboada M. and W. Mann. 2006. Rhetorical Structure Theory: looking back and moving ahead. Discourse Studies 8(3): 423–459.

TimeML. 2005. Annotation Guidelines Version 1.2. http://www.timeml.org/site/publications/timeMLdocs/annguide_1.2.1.pdf

Webber B., A. Joshi, E. Miltsakaki, et al. 2005. A short introduction to the Penn Discourse TreeBank. Copenhagen Working Papers in Language and Speech Processing.

Zou H.J. and E.H. Yang. 2007. Event counts as elementary unit in discourse annotation. International Conference on Chinese Computing 2007.

Zou H.J. and E.H. Yang. 2010. Annotation of event attributes. The 11th Chinese Lexical Semantics Workshop.

Query Expansion for Khmer Information Retrieval

Channa Van and Wataru Kameyama
GITS, Waseda University
Honjo, Saitama, Japan
[email protected], [email protected]

Abstract

This paper presents proposed Query Expansion (QE) techniques based on Khmer-specific characteristics to improve the retrieval performance of a Khmer Information Retrieval (IR) system. Four types of Khmer-specific characteristics have been investigated in this research: spelling variants, synonyms, derivative words and reduplicative words. In order to evaluate the effectiveness and the efficiency of the proposed QE techniques, a prototype Khmer IR system has been implemented. The system is built on top of the popular open-source information retrieval software library Lucene (http://lucene.apache.org). The Khmer word segmentation tool (Chea et al., 2007) is also implemented in the system to improve the accuracy of indexing as well as searching. Furthermore, the Google web search engine is also used in the evaluation process. The results show that the proposed QE techniques improve the retrieval performance of both the proposed system and the Google web search engine. With the reduplicative-word QE technique, an improvement of 17.93% in recall can be achieved for the proposed system.

1 Introduction

As with the other major languages in the world, the amount of Khmer digital content has been rapidly growing on the world-wide web, and it is becoming very difficult to obtain the relevant information as needed from the Internet. Although some major web search engine providers such as Google offer a localized Khmer version of their web search engine (http://www.google.com.kh), these are not specifically designed for Khmer, due to the lack of integration of Khmer-specific characteristics. Consequently, they miss a lot of information from the Khmer websites found on the Internet. For this reason, we propose QE techniques using the specific characteristics of Khmer to improve the effectiveness of Khmer IR systems. Four types of QE technique are proposed, based on four types of Khmer-specific characteristics: spelling variants, synonyms, derivative words (Khin, 2007) and reduplicative words (Long, 2007). Moreover, a prototype Khmer IR system is implemented in order to evaluate the effectiveness of these proposed QE techniques. The proposed system is built on top of the popular open-source information retrieval software library Lucene, in order to take advantage of its powerful indexing and searching algorithms. Furthermore, to improve the accuracy of indexing and searching, we have also implemented Khmer-specific word segmentation in the system. Due to the lack of a Khmer text collection, which is required in the evaluation process, we have also created our own Khmer text corpus. The corpus has been built to be useful and beneficial to all types of research in Khmer Language Processing.

2 Khmer Specific Characteristics

Khmer is the official language of Cambodia. It is the second most widely spoken language of the Austroasiatic language family (http://en.wikipedia.org/wiki/Khmer_language). Due to its long historical contact with India, Khmer has been heavily influenced by Sanskrit and Pali.

However, Khmer still possesses its own specific characteristics, such as its word derivation rules and word reduplication techniques. Furthermore, specific written rules are also found in Khmer, for instance in the case of multiple-spelling words. These characteristics are very useful in Khmer IR because of the lexical-semantic relations between the words.

2.1 Spelling Variants

Multiple-spelling words exist in Khmer. They have the same meaning and pronunciation but differ only in spelling. Most of the spelling variants are loan words, and the others are the result of the substitutability between characters. For example:

• The word សមុ [sămŭt] ''sea'', which originated from Sanskrit, can also be spelled សមុទ䮑 [sămŭt] ''sea'' (Khmer Dictionary, 1967), which originated from Pali.
• The word ឫទ䮒ិ [rœtthi] ''power'' has another spelling រទ䮒ិ [rœtthi] ''power'', because ''ឫ'' can be substituted by ''រ''.

2.2 Synonyms

Synonyms exist in all natural languages, and Khmer is no exception. Khmer has a rich variety of synonym vocabulary. Most of these synonyms are found among the loan words (influenced by Sanskrit and Pali) and the social-status words. For instance, the word ➶ំ ''to eat'' has a synonym for each social status: សុី (impolite word), ពិ羶 (polite word), ᮶ន់ (religious word) and យ (royal word).

2.3 Derivative Words

Derivative words in Khmer are words derived from root words by affixation, including prefixation, infixation and suffixation (Jenner, 1969). Interestingly, a derivative word's meaning is semantically related to its root word. For example:

ាស់ ''old'' + កញ (prefix) → កស់ ''very old''
ើរ ''to walk'' + មណ (infix) → ដំើរ ''the walk''
ឯក殶ជ ''independence'' + 徶ព (suffix) → ឯក殶ជ徶ព ''the state of independence''

2.4 Reduplicative Words

Reduplicative words are very common in Khmer. They are used for several purposes, including emphasis, pluralization and the expression of complex thoughts. Three kinds of duplicating techniques are found in Khmer: word duplication, synonym duplication and symmetrical duplication (Ourn and Haiman, 2000).

2.4.1 Word duplication

Word duplication is the process of duplicating a word to express a plural meaning. The duplication symbol ''ៗ'' is put after the word to indicate the duplication. For example:

䮀ធំ [chhkˆe thum] ''a big dog'' → 䮀ធំៗ [chhkˆe thum thum] ''many big dogs''

2.4.2 Synonym duplication

These are also called synonym compounding words. This kind of word is created by combining two different words which are either synonyms or related in meaning. For example:

បំណង ''goal'' + 䎶 ''intention'' → បំណង䎶 ''goal and intention''

2.4.3 Symmetrical duplication

Symmetrical duplication is the process of creating a word by combining a word with similar phonetic syllables, much like creating words that sound like ''zigzag'' or ''hip hop'' in English. This duplication technique is usually used to emphasize the meaning of the original word. A remarkable number of words of this kind are found in Khmer. For example:

ធំ [thum] ''big'' + ង [théng] (similar phonetic syllable) → ធំង [thum théng] ''very big''
ី [krei] (similar phonetic syllable) +  [krar] ''poor'' → ី [krei krar] ''very poor''

3 Khmer Text Corpus

A large collection of Khmer text is required in the evaluation process. Due to the lack of Khmer language resources, especially a Khmer text corpus, we have built our own. The corpus was designed to be useful and beneficial to all kinds of research in Khmer language processing.

[Figure 1: System Design of Building a Khmer Text Corpus]

Building such a text corpus for Khmer is a challenging task, since there is no implementation yet of Khmer optical character recognition and only a little research on Khmer has been done. All texts had to be collected manually from various Khmer websites on the Internet. The corpus includes some basic corpus annotations, such as word annotation, sentence annotation and part-of-speech annotation. It was encoded in the eXtensible Corpus Encoding Standard (XCES) (Ide et al., 2000) to assure future extensibility. Figure 1 illustrates the whole process of building the Khmer text corpus, in which four main steps were carried out: text collection, preprocessing tasks, corpus annotations and corpus encoding. The details of each step are described in the following subsections.

3.1 Text Collection

Collecting Khmer digital text is the most difficult and time-consuming task in the process of building this corpus. As there is no implementation of Khmer optical character recognition, it is not possible to scan or extract Khmer texts either from books or from other paper sources. However, the websites as well as the Khmer blog community provide valuable Khmer digital texts for this research. All texts were manually collected from all the available sources.

3.2 Preprocessing Tasks

The texts collected from the Internet are usually unclean and unstructured. They may contain unwanted elements such as images, links and some HTML elements. Therefore, a cleaning process is carried out to remove the unwanted elements and to restructure the texts before proceeding to the next step.

After cleaning, each text is categorized into a domain according to its content by the labeling process. There are twelve domains in this corpus: newspaper, magazine, medical, technology, culture, history, law, agriculture, essay, novel, story and other. Text descriptions, such as the author's name, publisher's name, publishing date and publisher's digital address, are kept along with each text.

3.3 Corpus Annotations

Corpus annotation information is very important for corpus-based research. Therefore, this corpus also includes sentence annotation, word annotation and POS annotation.

3.3.1 Sentence Annotation

Each sentence is annotated with three kinds of information: position, identification and length.

1. Position: the position of the first character and the last character of a sentence in the text.

2. Identification: the sequence number of a sentence within a text file.

3. Length: the number of characters of a sentence.

Like English, each sentence in Khmer can be separated by special symbols. In modern Khmer, there exists a series of characters that are used to mark the boundaries of different kinds of sentences (Khin, 2007). Based on these characters, each sentence in a text can be separated easily; a small sketch of this separation step follows the list.

• ។ and ។ល។ : end of declarative sentences.
• ? : end of interrogative sentences.
• ! : end of exclamative sentences.
• ៕ : end of the last sentence in a text.
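The following minimal sketch illustrates this sentence-separation step, assuming plain Unicode Khmer input; the function name and the regular expression are our own illustration, not part of the corpus tools described in this paper.

```python
import re

# Sentence-boundary marks listed above: "។" and "។ល។" end declarative
# sentences, "?" and "!" end interrogative and exclamative sentences,
# and "៕" ends the last sentence of a text. "។ល។" must be tried before
# "។" so that the three-character mark is not split.
BOUNDARY = re.compile(r'.*?(?:។ល។|។|៕|\?|!)', re.S)

def split_sentences(text):
    """Split raw Khmer text into sentences on the boundary marks above.
    Any trailing material without a final mark is kept as-is."""
    sentences = [m.group().strip() for m in BOUNDARY.finditer(text)]
    rest = BOUNDARY.sub('', text).strip()
    if rest:
        sentences.append(rest)
    return [s for s in sentences if s]
```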

3.3.2 Word Annotation

The position, identification and length of each word are annotated as in the sentence annotation.

1. Position: the position of the first character and the last character of a word in the text.

2. Identification: the sequence number of a word within a text file.

3. Length: the number of characters of a word.

Khmer is a non-segmented script: there is no separator between words, which makes the text very difficult to segment. To do this, we have used the Khmer word segmentation tool (Chea et al., 2007) developed by PAN Localization Cambodia (http://www.pancambodia.info/).

3.3.3 Part-of-Speech Annotation

To enhance the usefulness of the corpus, we also include part-of-speech annotation. We have used a Khmer POS tagger based on the work of Nou et al., in which a transformation-based approach with hybrid unknown-word handling for Khmer POS tagging is proposed (Nou et al., 2007). There are 27 types in the Khmer tagset obtained by this POS tagger. Each obtained POS tag is assigned to a word in the corpus and kept along with the word annotation.

3.4 Corpus Encoding

To assure the extensibility of the corpus and the ease of development in future work, this corpus has been encoded in the eXtensible Corpus Encoding Standard (XCES) (Ide et al., 2000). XCES is an XML-based standard for codifying text corpora. It is closely based on the earlier Corpus Encoding Standard (Ide, 1998) but uses XML as the markup language. Since the encoding is based on XML, the corpus is usable from the many programming languages which support XML. Furthermore, it can take full advantage of the powerful XML frameworks, including XQuery, XPath and so on. In addition, XCES supports many types of corpora, especially annotation corpora, on which our corpus is based. The encoding of the annotation files and text description files conforms to XCES schema version 1.0.4.

3.5 Corpus Statistics

Table 1 shows the corpus statistics. We have reached more than one million words across twelve different domains of text. The corpus size is relatively small at the moment; expansion of the corpus is continuously under way.

Table 1: Corpus Statistics

Domain       # of Articles   # of Sentences   # of Words
Newspaper    571             13222            409103
Magazine     52              1335             42566
Medical      3               76               2047
Technical    15              607              16356
Culture      33              1178             43640
Law          43              5146             101739
History      9               276              7778
Agriculture  29              1484             30813
Essay        8               304              8318
Story        108             5642             196256
Novel        78              12012            236250
Other        5               134              5522
Total        954             41416            1100388

4 Retrieval Environment

This section provides the necessary background to understand the context in which the experiment was carried out.

[Figure 2: Query Expansion Procedures]

4.1 Query Expansion Procedures

Query expansion is a technique commonly used in IR (Manning et al., 2009) to improve retrieval performance by reformulating the original query. We propose four types of QE technique based on the four types of Khmer-specific characteristics presented in Section 2. During the expansion process, the original search query is analyzed and expanded according to the type of its words (Figure 2). Four types of QE are carried out: spelling-variant expansion, synonym expansion, derivative-word expansion and reduplicative-word expansion. The expanded query is obtained by adding the expansion terms to the original query; a small sketch of this step is given below.
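The following is a minimal sketch of this expansion step, assuming the query has already been segmented into words. The four lookup tables stand in for the manually prepared resources described in Section 5.2 (there is no Khmer digital thesaurus yet); their names and contents are illustrative placeholders, not part of the described system.

```python
# Illustrative lookup tables: word -> list of expansion terms. In the
# described experiments these expansions were prepared manually.
SPELLING_VARIANTS = {}
SYNONYMS = {}
DERIVATIVES = {}
REDUPLICATIVES = {}

TABLES = {
    "spelling": SPELLING_VARIANTS,
    "synonym": SYNONYMS,
    "derivative": DERIVATIVES,
    "reduplicative": REDUPLICATIVES,
}

def expand_query(words, expansion_type):
    """Expand each expandable word w into (w OR v1 OR v2 ...), keeping
    the remaining words as required terms, in Lucene boolean syntax."""
    table = TABLES[expansion_type]
    clauses = []
    for w in words:
        variants = table.get(w, [])
        if variants:
            clauses.append("(" + " OR ".join([w] + variants) + ")")
        else:
            clauses.append(w)
    return " AND ".join(clauses)
```

For the two-word tribunal query discussed in Section 5.2, this yields the same shape as the expanded Google query given there: the expandable word becomes an OR group while the other word stays a required term.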

4.2 Khmer IR System Design and Implementation

A prototype Khmer IR system, shown in Figure 3, has been implemented to evaluate the efficiency of the proposed QE techniques. There are two main processes in the system: indexing and searching. We implemented the system by first constructing the search index from the text corpus. All texts in the corpus are tokenized into words; these words are then indexed by Lucene's indexer and stored in an index database. In the searching part, the search query is tokenized into words before being analyzed in the QE process. Finally, the search results are obtained by Lucene's searcher, which searches through the index database and returns the results that correspond to the expanded query.

4.3 Indexing

Indexing is very important because it determines what search results are returned when a user submits a query. Our system's indexer is based on Lucene's indexer, with modifications adapted to the Khmer language. Lucene indexes a document by indexing its tokens, which are the words obtained by tokenizing the text of the document (Hatcher et al., 2009). The default Lucene tokenizer only works with segmented Western languages such as English or French, where spaces are used between words; it cannot tokenize documents in Khmer, which belongs to the class of non-segmenting languages, in which words are written continuously without any explicit delimiting character. The Khmer word tokenizer developed by a research group of Cambodia PAN Localization has been used to handle this task.

5 Experiments

The experiments to evaluate the proposed QE techniques were initially conducted only on the prototype Khmer IR system that we implemented. As Google has a localized Khmer version of its popular web search engine, and as it allows the websites to be searched to be specified explicitly (http://www.google.com/advanced_search), we extended our experiments to the Google web search engine in order to obtain a more precise result.

[Figure 3: Proposed Khmer Information Retrieval System Design]

5.1 Experiment Setup

The Khmer text corpus, which consists of 954 documents collected from various websites on the Internet, was used for the experiments. A website containing all the documents of the corpus was hosted on our laboratory's web server so that these documents could be indexed by Google's indexer. We then followed the indexing progress using the Google Webmaster Tools service (http://www.google.com/webmasters/tools). In Khmer, where words are written one after the other, word processors, Internet browsers and other programs that format text need to know where they can cut a sentence in order to start a new line. This problem does not appear in Western languages, where spaces are used between words. The zero-width space character was therefore proposed to solve this problem, by inputting a zero-width space at the end of each word while typing (see http://www.khmeros.info/download/KhmerUnicodeTyping.pdf). In Unicode, the zero-width space is a space character, but it is invisible. Using the zero-width space is quite confusing because of its invisibility, and it is unnatural to the Khmer writing system; as a result, most people use it only partially, for text display purposes. Consequently, the zero-width space can be found in almost all Khmer texts on the Internet, and since all texts in the corpus were collected from the Internet, it is also found in almost all the texts of the corpus. Based on the zero-width space character, Google can separate and index the Khmer texts of the corpus hosted on our laboratory web server. After Google had completely indexed all the documents, the experiment proceeded.

5.2 Experiment Procedures

We conducted an experiment for each type of proposed QE technique on both our implemented system and the Google web search engine; the four experiments were carried out identically on both systems. Due to the small size of the text corpus, only ten original queries were chosen for each type of experiment. Each query has a specific topic so that the relevance of the results can be judged afterwards. The experiment process is as follows:

1. Input the ten original queries into both systems, and calculate the precisions and recalls. All queries are selected from different topics, and each query contains at least one expandable word corresponding to its expansion type.

2. Expand the original queries according to their expansion type. Then input the ten expanded queries into both systems and recalculate the precisions and recalls.

Table 2: Results of Spelling Variants and Synonyms Expansions

                     Spelling Variants                     Synonyms
                     Precision  Recall   F-measure   Precision  Recall   F-measure
Google               39.79%     46.72%   37.35%      38.91%     39.99%   34.66%
Proposed Sys.        62.09%     47.63%   50.78%      46.74%     47.03%   44.71%
Google & QE          46.83%     53.04%   46.13%      44.32%     58.99%   45.28%
Proposed Sys. & QE   60.73%     64.01%   58.38%      48.34%     64.59%   51.66%

Table 3: Results of Derivative Word and Reduplicative Word Expansions

                     Derivative Words                      Reduplicative Words
                     Precision  Recall   F-measure   Precision  Recall   F-measure
Google               28.35%     56.04%   36.56%      21.14%     42.61%   26.16%
Proposed Sys.        41.07%     51.07%   44.10%      31.69%     39.05%   29.00%
Google & QE          29.18%     60.41%   38.32%      24.69%     48.58%   26.35%
Proposed Sys. & QE   33.93%     62.38%   41.71%      34.28%     56.98%   36.25%

The relevance judgments were made manually, based on the content of each document returned by the two IR systems. Since there is no Khmer digital thesaurus yet, the expansions were done manually, respecting the query syntax of Lucene and of Google (http://www.google.com/support/websearch/bin/answer.py?answer=136861). Moreover, as the Google web search engine cannot tokenize Khmer words, the tokenization was also done manually. For example, the query តុ澶ζរ䮘រហម ''Khmer Rouge tribunal'' consists of the two words តុ澶ζរ ''tribunal'' and 䮘រហម ''Khmer Rouge''. The synonym of តុ澶ζរ ''tribunal'' is 羶澶ក䮊ី ''tribunal'', so the expanded query for our proposed system is ''តុ澶ζរ䮘រហម OR 羶澶ក䮊ី䮘រហម''. The expanded query for Google, on the other hand, is ''[តុ澶ζរ OR 羶澶ក䮊ី] AND 䮘រហម''.

5.3 Results and Discussion

Tables 2 and 3 show the precision, recall and F-measure before and after applying each QE technique to the proposed system and to the Google web search engine. The improvement in recall of our proposed system is 16.38%, 17.56%, 11.31% and 17.93% after applying the respective QE techniques, while the increase in recall for the Google web search engine is 6.32%, 19.00%, 4.37% and 5.97% respectively.

In addition, comparing our proposed system with QE to the Google web search engine without QE, the recall improvement is 17.29%, 24.60%, 6.34% and 14.37% respectively; compared to the Google web search engine with QE, the recall improvement is 10.97%, 5.60%, 1.97% and 8.40% respectively.

In summary, the search results of our proposed system with the QE techniques are significantly better than the conventional Google search results. This can be seen clearly from the improvements in F-measure of 21.03%, 17.00%, 5.15% and 10.09% respectively.

6 Conclusion and Future Works

In this research, we have investigated four types of QE technique based on Khmer linguistic characteristics. These QE techniques are specifically designed to improve the retrieval performance of a Khmer IR system. The experiments have demonstrated the improvement in retrieval performance of both the proposed system and the Google web search engine after applying the proposed QE techniques. However, the improvement in precision after applying our proposed QE techniques is not significant, and in the case of derivative words it even shows a slight decrement. This is one of the

main problems that we will tackle in our future research: reducing non-relevant content by semantically analyzing Khmer content. At the moment the size of the corpus is very small, and we are actively addressing this issue in the hope of providing a good Khmer language resource for future research in Khmer language processing.

In addition, due to the lack of research in Khmer IR as well as in Khmer language processing in general, many aspects can still be worked on in order to improve system performance. For example, the improvement of Khmer word segmentation and the building of a Khmer thesaurus for IR, both of which are expected to improve IR system performance, are among the priority tasks of our future work.

References

Beaulieu, Micheline and Susan Jones. 1998. Interactive searching and interface issues in the Okapi best match probabilistic retrieval system. Interacting With Computers, 10(3):237-248.

Buddhist Institute. 1967. វច侶នុម䮘រ ''Khmer Dictionary''. Buddhist Institute, Phnom Penh, Cambodia.

Chea, Sok-Huor, Rithy Top, Pich-Hemy Ros, Navy Vann, Chanthirith Chin and Tola Chhoeun. 2007. Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation. http://www.panl10n.net/english/OutputsCambodia1.htm (last retrieved 30 April 2010).

Hatcher, Erik, Otis Gospodnetić and Michael McCandless. 2009. Lucene in Action, Second Edition. Manning Publications, Connecticut, USA.

Ide, Nancy. 1998. Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. In Proceedings of the First International Language Resources and Evaluation Conference, pp. 463-470.

Ide, Nancy, Patrice Bonhomme and Laurent Romary. 2000. XCES: An XML-based standard for linguistic corpora. In Proceedings of the Second Language Resources and Evaluation Conference (LREC), pp. 825-830, Athens, Greece.

Jenner, Philip Norman. 1969. Affixation in modern Khmer. A dissertation submitted to the graduate division, University of Hawaii.

Khin, Sok. 2007. យករណ៍徶羶䮘រ ''Khmer Grammar''. Royal Academy of Cambodia, Phnom Penh, Cambodia.

Long, Siem. 1999. បចនស័ព䮑វទ䮘រ ''Khmer Lexicology Problem''. National Language Institute, Royal University of Phnom Penh, Phnom Penh, Cambodia.

Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze. 2009. An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England.

Nou, Chenda and Wataru Kameyama. 2007. Khmer POS Tagger: A Transformation-based Approach with Hybrid Unknown Word Handling. In Proceedings of the International Conference on Semantic Computing (ICSC), pp. 482-492. Irvine, USA.

Ourn, Noeurng and John Haiman. 2000. Symmetrical Compounds in Khmer. Studies in Language, 24(3), pp. 483-514.

Singhal, Amit. 2001. Modern Information Retrieval: A Brief Overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43.

Word Segmentation for Urdu OCR System

Misbah Akram
National University of Computer and Emerging Sciences
[email protected]

Sarmad Hussain
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science,
University of Engineering and Technology, Lahore, Pakistan
[email protected]

Abstract

This paper presents a technique for word segmentation for the Urdu OCR system. Word segmentation or word tokenization is a preliminary task for Urdu language processing. Several techniques are available for word segmentation in other languages. This paper proposes a methodology for word segmentation which determines the boundaries of words given a sequence of ligatures, based on the collocation of ligatures and words in the corpus. Using this technique, a word identification rate of 96.10% is achieved, using trigram probabilities normalized over the number of ligatures and words in the sequence.

1 Introduction

Urdu uses the Nastalique style of the Arabic script for writing, which is cursive in nature. Characters join together to form ligatures, which end either with a space or with a non-joining character. A word may be composed of one or more ligatures. In Urdu, space is not used to separate two consecutive words in a sentence; instead, readers themselves identify the boundaries of words, as sequences of ligatures, as they read along the text. Space is used to obtain appropriate character shapes, and thus it may even be used within a word to break the word into its constituent ligatures (Naseem 2007, Durrani 2008). Therefore, as in other languages (Theeramunkong & Usanavasin, 2001; Wang and Liu, 2007; Khankasikam & Muansuwan, 2005; Haruechaiyasak et al., 2008; Haizhou & Baosheng, 1998), word segmentation or word tokenization is a preliminary task for Urdu language processing. It has applications in many areas like spell checking, POS tagging, speech synthesis, information retrieval, etc. This paper focuses on the word segmentation problem from the point of view of an Optical Character Recognition (OCR) system. As space is not visible in typed and scanned text, spacing cues are not available to the OCR for word separation, and therefore segmentation has to be done more explicitly. The word segmentation model for the Urdu OCR system presented here takes input in the form of a sequence of ligatures recognized by an OCR and constructs a sequence of words from them.

2 Literature Review

Many languages, e.g., English, French, Hindi, Nepali, Sinhala, Bengali, Greek, Russian, etc., segment text into a sequence of words using delimiters such as the space, comma and semicolon, but many Asian languages like Urdu, Persian, Arabic, Chinese, Dzongkha, Lao and Thai have no explicit word boundaries. In such languages, words are segmented using more advanced techniques, which can be categorized into three kinds of methods (Haruechaiyasak et al., 2008):

(i) dictionary/lexicon based approaches
(ii) linguistic knowledge based approaches
(iii) machine learning based/statistical approaches

Longest matching (Poowarawan, 1986; Sproat, 1996) and maximum matching (Sproat et al., 1996; Haizhou & Baosheng, 1998) are examples of lexicon based approaches. These techniques segment text using the lexicon.

Their accuracy depends on the quality and size of the dictionary. N-grams (Chang et al., 1992; Li Haizhou et al., 1997; Sproat, 1996; Dai & Lee, 1994; Aroonmanakun, 2002) and maximum collocation (Aroonmanakun, 2002) are linguistic knowledge based approaches, which also rely very much on the lexicon. These approaches select the most likely segmentation from the set of possible segmentations using a probabilistic or cost-based scoring mechanism.

Word segmentation using decision trees (Sornlertlamvanich et al., 2000; Theeramunkong & Usanavasin, 2001) and similar techniques fall into the third category. These approaches use a corpus in which word boundaries are explicitly marked, and they do not require dictionaries. Ambiguity problems are handled by providing a sufficiently large set of training examples to enable accurate classification.

A knowledge based approach was adopted in earlier work on Urdu word segmentation (Durrani 2007; also see Durrani and Hussain 2010). In this technique, word segmentation of Urdu text is achieved by employing knowledge of Urdu linguistics and script. The initial segmentations are ranked using min-word, unigram and bigram techniques. It reports 95.8% overall accuracy for word segmentation of Urdu text. Mukund and Srihari (2009) propose using a character model along with linguistic rules, and report 83% precision. Lehal (2009) proposes a two stage process, which first uses Urdu linguistic knowledge and then, in a second stage, uses statistical information of Urdu and Hindi (also using transliteration into Hindi) for words not addressed in the first stage, reporting an accuracy of 98.57%.

These techniques use characters or words as input, whereas an OCR outputs a series of ligatures. The current paper presents work using statistical methods as an alternative, which works with ligatures as input.

3 Methodology

The current work uses the co-occurrence of ligatures and words to construct a statistical model, based on manually cleaned and segmented training corpora. Ligature and word statistics are derived from these corpora. In the decoding phase, all sequences of words are first generated from the input set of ligatures, and these sequences are ranked based on lexical lookup. The top k sequences are selected for further processing, based on the number of valid words. Finally, the probability of each of the k sequences is calculated for the final decision. Details are described in the subsequent sections.

3.1 Data collection and preparation

An existing lexicon of 49,630 unique words is used (derived from Ijaz et al. 2007). The corpus used for building ligature grams consists of half a million words. Of these, 300,000 words are taken from the Sports, Consumer Information and Culture/Entertainment domains of the 18 million word corpus (Ijaz et al. 2007), 100,000 words are obtained from the Urdu-Nepali-English Parallel Corpus (available at www.PANL10n.net), and another 100,000 words are taken from a previously POS-tagged corpus (Sajjad, 2007; the tags of this corpus are removed before further processing). This corpus is manually cleaned of word segmentation errors by adding missing spaces between words and replacing spaces with Zero Width Non-Joiner (ZWNJ) within words. For the computation of word grams, the 18 million word corpus of Urdu is used (Ijaz et al. 2007).

3.2 Count and probability calculations

Tables 1 and 2 below give the unigram, bigram and trigram counts of the ligatures and of the words derived from the corpora, respectively.

Table 1. Unigram, bigram and trigram counts of the ligature corpus

Ligature Tokens   Ligature Unigrams   Ligature Bigrams   Ligature Trigrams
1508078           10215               35202              65962

Table 2. Unigram, bigram and trigram counts of the word corpus

Word Tokens   Word Unigrams   Word Bigrams   Word Trigrams
17352476      157379          1120524        8143982
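As a minimal illustration of how such tables are produced, the sketch below counts n-grams over a token stream; the function and the `corpus_tokens` variable are our own illustration, with tokens being either ligatures or words.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams over a token stream; tokens may be ligatures or
    words, yielding counts of the kind shown in Tables 1 and 2."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

# e.g., assuming corpus_tokens is the token stream of a training corpus:
# unigrams = ngram_counts(corpus_tokens, 1)
# bigrams  = ngram_counts(corpus_tokens, 2)
# trigrams = ngram_counts(corpus_tokens, 3)
```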
After deriving the word unigrams, bigrams and trigrams, the following cleaning of the corpus is performed. In the 18 million word corpus, certain words are joined together due to missing spaces, though they are separate words, and some of these occur with very high frequency. For example, "ho ga" ("will be") exists as a single token rather than two words due to a missing space. To address this space insertion problem, a list of about 700 words with frequency greater than 50 is obtained from the word unigrams. Each word in the list is manually reviewed and a space is inserted where required. These erroneous words are then removed from the word unigram list and re-added as two or three individual words, incrementing the respective counts.

For the space insertion problem in word bigrams, each erroneous word in the joined-word list (the 700-word list) is checked. Where such a word occurs in the bigram frequency list, for example the entry containing "kiya ho ga" ("will have done"), that bigram entry is removed from the bigram list and the counts of each of its words that are in the dictionary are increased by the count of the removed entry. If these words do not exist in the word bigram list, they are added as new bigrams with that count. The same procedure is performed for the word trigrams.

The second main issue concerns word-affixes, which are sometimes separated from their words by spaces. In the counts, these therefore appear as separate words, existing as bigram entries rather than unigram entries. For example, "sehat+mand" ("healthy") exists as a bigram entry, but in Urdu it is a single word. To cope with this problem, a list of word-affixes is used. If an entry of the word bigrams matches an affix, the word is combined by removing the spurious space (and inserting ZWNJ, if required, to maintain its glyph shape). The word is then inserted into the unigram list with its original bigram count, and the unigram list is updated accordingly. The same procedure is performed if a trigram word matches an affix.

After cleaning, unigram, bigram and trigram counts for both words and ligatures are calculated. To avoid data sparseness, One Count Smoothing (Chen & Goodman, 1996) is applied.

3.3 Word sequence generation from input

The input, in the form of a sequence of ligatures, is used to generate all possible words. These sequences are then ranked based on real words. For this purpose, a tree of these sequences is incrementally built. The first ligature is added as the root of the tree, and at each level two to three additional nodes are added per node. For example, the second level of the tree contains the following nodes:

• The current ligature forms a separate word, separated by a space from the sequence at its parent: l1 l2
• The current ligature concatenates, without a space, with the sequence at its parent: l1l2
• The current ligature concatenates, without a space but with an additional ZWNJ, with the sequence at its parent: l1-ZWNJ-l2

For each node, at each level of the tree, a numeric value is assigned, which is the sum of the squares of the number of ligatures in each word that is in the dictionary. If a word does not exist in the dictionary, it does not contribute to the total sum. If a node string contains only one word and this word does not occur in the dictionary as a valid word, it is checked whether this word may occur at the start of some dictionary entry; in this case a numeric value is also assigned.

After the assignment, the nodes are ranked according to these values and the best k (the beam value) nodes are selected. These selected nodes are further ranked using the statistical methods discussed below; a sketch of this tree generation follows.
3.4 Best word segmentation selection

For the selection of the most probable word segmentation sequence, word and ligature models are used. For the word probabilities, the following is used:

$\hat{W} = \operatorname{argmax}_{w \in S} P(w)$

To reduce the complexity of the computation, Markov assumptions are made to give bigram and trigram approximations (e.g., see Jurafsky & Martin, 2000), as given below:

$P(W) = \operatorname{argmax}_{w \in S} \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

$P(W) = \operatorname{argmax}_{w \in S} \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1})$

Similarly, the ligature models are built by assuming that sentences are made up of sequences of ligatures rather than words, with the space also counted as a valid ligature. Taking the Markov bigram and trigram assumptions for the ligature grams, we get:

$P(L) = \operatorname{argmax}_{l \in S} \prod_{i=1}^{m} P(l_i \mid l_{i-1})$

$P(L) = \operatorname{argmax}_{l \in S} \prod_{i=1}^{m} P(l_i \mid l_{i-2} l_{i-1})$

Given the ligatures, e.g. as input from an OCR, we can formulate the decoding problem as:

$P(W \mid L) = \operatorname{argmax}_{w \in S} P(w \mid l)$

where $w = w_1, w_2, \ldots, w_n$ and $l = l_1, l_2, \ldots, l_m$; $n$ represents the number of words and $m$ the number of ligatures, so $m$ ligatures are to be assigned to $n$ words. Applying Bayes' theorem:

$P(W \mid L) = \operatorname{argmax}_{w \in S} \dfrac{P(l \mid w)\, P(w)}{P(l)}$

As $P(l)$ is the same for all $w$, the denominator does not change the result, simplifying to:

$P(W \mid L) = \operatorname{argmax}_{w \in S} P(l \mid w)\, P(w)$

where

$P(l \mid w) = P(l_1, l_2, \ldots, l_m \mid w) = P(l_1 \mid w) \cdot P(l_2 \mid w, l_1) \cdot P(l_3 \mid w, l_1 l_2) \cdots P(l_m \mid w, l_1 l_2 \ldots l_{m-1})$

Assuming that a ligature $l_i$ depends only on the word sequence $w$ and its previous ligature $l_{i-1}$, and not on the whole ligature history, this simplifies to:

$P(l \mid w) = \prod_{i=1}^{m} P(l_i \mid w, l_{i-1})$

Further, if it is assumed that $l_i$ depends only on the word in which it appears, not on the whole word sequence, the equation simplifies further, as the probability of $l_i$ within its word is determined by the preceding ligature alone:

$P(l \mid w) = \prod_{i=1}^{m} P(l_i \mid l_{i-1})$

Thus, considering bigrams:

$P(W \mid L) = \operatorname{argmax}_{w \in S} \prod_{i=1}^{m} P(l_i \mid l_{i-1}) \prod_{i=1}^{n} P(w_i \mid w_{i-1})$

This gives the most probable word sequence among all the alternative word sequences. The precision of the equation can be taken at the bigram or trigram level for both ligatures and words, giving the possibilities below. Additionally, normalization is applied to better compare different sequences, since each sequence has a different number of words and of ligatures per word: the normalized variants take the NL-th and NW-th roots of the ligature and word products, where NL is the number of ligature bigrams or trigrams and NW the number of word bigrams or trigrams in the given sentence. A log-space sketch of the last of these variants follows the list.

• Ligature trigram and word bigram: $P(W) = \operatorname{argmax}_{w \in S} \prod_i P(l_i \mid l_{i-2} l_{i-1}) \cdot \prod_i P(w_i \mid w_{i-1})$

• Ligature bigram and word trigram: $P(W) = \operatorname{argmax}_{w \in S} \prod_i P(l_i \mid l_{i-1}) \cdot \prod_i P(w_i \mid w_{i-2} w_{i-1})$

• Ligature trigram and word trigram: $P(W) = \operatorname{argmax}_{w \in S} \prod_i P(l_i \mid l_{i-2} l_{i-1}) \cdot \prod_i P(w_i \mid w_{i-2} w_{i-1})$

• Normalized ligature bigram and word bigram: $P(W) = \operatorname{argmax}_{w \in S} \left(\prod_i P(l_i \mid l_{i-1})\right)^{1/NL} \cdot \left(\prod_i P(w_i \mid w_{i-1})\right)^{1/NW}$

• Normalized ligature trigram and word bigram: $P(W) = \operatorname{argmax}_{w \in S} \left(\prod_i P(l_i \mid l_{i-2} l_{i-1})\right)^{1/NL} \cdot \left(\prod_i P(w_i \mid w_{i-1})\right)^{1/NW}$

• Normalized ligature bigram and word trigram: $P(W) = \operatorname{argmax}_{w \in S} \left(\prod_i P(l_i \mid l_{i-1})\right)^{1/NL} \cdot \left(\prod_i P(w_i \mid w_{i-2} w_{i-1})\right)^{1/NW}$

• Normalized ligature trigram and word trigram: $P(W) = \operatorname{argmax}_{w \in S} \left(\prod_i P(l_i \mid l_{i-2} l_{i-1})\right)^{1/NL} \cdot \left(\prod_i P(w_i \mid w_{i-2} w_{i-1})\right)^{1/NW}$
In the current work, all the above techniques are applied, and the best sequence from each is shortlisted. The word sequence which occurs most often in this shortlist is then finally selected.

4 Results and Discussion

The model is tested on a corpus of 150 sentences composed of 2156 words and 6075 ligatures. In these sentences, 62 words are unknown, i.e. words that do not exist in our dictionary. The average sentence length is 14 words and 40.5 ligatures, and the average word length is 2.81 ligatures. All the techniques are tested with beam values k of 10, 20, 30, 40 and 50.

The results can be viewed from two perspectives: the sentence identification rate and the word identification rate. A sentence is considered incorrect even if only one word of the sentence is identified wrongly. The technique gives a sentence identification rate of 76% at a beam value of 30. At the word level, the normalized ligature trigram word trigram technique outperforms the other techniques and gives a 96.10% word identification rate at a beam value of 50.

Table 3. Results changing beam width k of the tree

Beam   Sentences    %age    Total words   %age    Known words   %age    Unknown words   %age
value  identified           identified            identified            identified
10     110/150      73.33%  2060/2156     95.55%  2024/2092     96.75%  36/64           56.25%
20     112/150      74.67%  2066/2156     95.83%  2027/2092     96.89%  39/64           60.94%
30     114/150      76.00%  2062/2156     95.64%  2019/2083     96.93%  43/73           58.90%
40     105/150      70.00%  2037/2156     94.48%  2000/2092     95.60%  37/64           57.81%
50     106/150      70.67%  2040/2156     94.62%  2000/2092     95.60%  40/64           62.50%
Table 4. Results for all techniques for the beam value of 50

Technique                                  Sentences    %age    Total words   %age    Known words   %age    Unknown words   %age
                                           identified           identified            identified            identified
Ligature Bigram                            50/150       33.33%  1835/2156     85.11%  1806/2092     86.33%  29/64           45.31%
Ligature Bigram Word Bigram                68/150       45.33%  1900/2156     88.13%  1865/2092     89.15%  35/64           54.69%
Ligature Bigram Word Trigram               83/150       55.33%  1960/2156     90.91%  1924/2092     91.97%  36/64           56.25%
Ligature Trigram                           16/150       10.67%  1637/2156     75.93%  1610/2092     76.96%  27/64           42.19%
Ligature Trigram Word Bigram               42/150       28.00%  1776/2156     82.38%  1746/2092     83.46%  30/64           46.88%
Ligature Trigram Word Trigram              62/150       41.33%  1868/2156     86.64%  1835/2092     87.72%  33/64           51.56%
Normalized Ligature Bigram Word Bigram     90/150       60.00%  2067/2156     95.87%  2024/2092     96.75%  43/64           67.19%
Normalized Ligature Bigram Word Trigram    100/150      66.67%  2070/2156     96.01%  2028/2092     96.94%  42/64           65.63%
Normalized Ligature Trigram Word Bigram    93/150       62.00%  2071/2156     96.06%  2030/2092     97.04%  41/64           64.06%
Normalized Ligature Trigram Word Trigram   101/150      67.33%  2072/2156     96.10%  2030/2092     97.04%  42/64           65.63%
Word Bigram                                47/150       31.33%  1827/2156     84.74%  1796/2092     85.85%  31/64           48.44%
Word Trigram                               74/150       49.33%  1937/2156     89.84%  1903/2092     90.97%  34/64           53.13%
The normalized data gives much better prediction compared to the unnormalized data.

Sentence identification errors depend heavily on the unknown words. For example, at the beam value of 30 we predict 38 incorrect sentences, of which 25 sentence-level errors are due to unknown words and 13 are due to known-word identification errors. Thus, improving the system vocabulary will have a significant impact on accuracy.

Many of the word errors are caused by insufficient cleaning of the larger word corpus. Though the words with frequency greater than 50 in the 18 million word corpus have been cleaned, the lower frequency words cause these errors. For example, the word list still contains entries such as "bunyad per" ("depends on") and "se taqseem" ("divided by"), with frequencies of 40 and 5 respectively, each of which should be two words with a space between them. If the low frequency words are also cleaned, the results will improve further, though it would take a great deal of manual effort.

Errors are also caused when an alternative ligature sequence exists. For example, the proper noun "kartak" is not identified, as it does not exist in the dictionary, but the alternative two-word sequence "kar tak" ("till the car") is valid.

This work uses the knowledge of ligature grams and word grams. It can be further enhanced by using character grams. We have tried to clean the corpus; further cleaning and additional corpora will improve the results as well. Improvement can also be achieved by handling abbreviations and English words transliterated in the text. The unknown word detection rate can be increased by applying POS tagging to further help rank the multiple possible sentences.

5 Conclusions

This work presents an initial effort towards a statistical solution to word segmentation, especially for Urdu OCR systems. It develops a cleaned corpus of half a million Urdu words for the statistical training of ligature based data, which is now available to the research community. In addition, the work develops a statistical model for word segmentation using ligature and word statistics. Using ligature statistics improves upon using just the word statistics, and normalization has a further significant impact on accuracy.

References

Aroonmanakun, W. (2002). Collocation and Thai Word Segmentation. In Proceedings of the Fifth Symposium on Natural Language Processing & The Fifth Oriental COCOSDA Workshop, pp. 68-75. Pathumthani.

Chang, Jyun-Shen, S.D. Chen, Y. Zhen, X.Z. Liu, & S.J. Ke. (1992). Large-corpus-based methods for Chinese personal name recognition. Journal of Chinese Information Processing, 6(3), 7-15.

Chen, S. F., & Goodman, J. (1996). An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 310-318.

Church, K. W., & Gale, W. A. (1991). A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5, 19-54.

Dai, J.C., & Lee, H.J. (1994). Parsing with Tag Information in a probabilistic generalized LR parser. International Conference on Chinese Computing. Singapore.

Durrani, N. (2007). Typology of word and automatic word segmentation in Urdu text corpus. Thesis, National University of Computer & Emerging Sciences, Lahore, Pakistan.

Durrani, N., & Hussain, S. (2010). Urdu Word Segmentation. In the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, US.

Haizhou, L., & Baosheng, Y. (1998). Chinese Word Segmentation. In Proceedings of the 12th Pacific Asia Conference on Language, Information and Computation, pp. 212-217.

Haruechaiyasak, C., Kongyoung, S., & Dailey, M. N. (2008). A Comparative Study on Thai Word Segmentation Approaches.

Hussain, S. (2008). Resources for Urdu Language Processing. In Proceedings of the Sixth Workshop on Asian Language Resources.

Ijaz, M., & Hussain, S. (2007). Corpus Based Urdu Lexicon Development. In Proceedings of the Conference on Language Technology (CLT07), University of Peshawar, Pakistan.

Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (1st ed.). Prentice Hall.

Khankasikam, K., & Muansuwan, N. (2005). Thai Word Segmentation: a Lexical Semantic Approach. In Proceedings of the Tenth Machine Translation Summit, pp. 331-338. Thailand.

Lehal, G. (2009). A Two Stage Word Segmentation System for Handling Space Insertion Problem in Urdu Script. World Academy of Science, Engineering and Technology 60.

MacKay, D. J., & Peto, L. C. (1995). A Hierarchical Dirichlet Language Model. Natural Language Engineering, 1(3), 1-19.

Mukund, S., & Srihari, R. (2009). NE Tagging for Urdu based on Bootstrap POS Learning. In Proceedings of CLIAWS3, Third International Cross Lingual Information Access Workshop, pp. 61-69, Boulder, Colorado, USA.

Naseem, T., & Hussain, S. (2007). Spelling Error Trends in Urdu. In Proceedings of the Conference on Language Technology (CLT07), University of Peshawar, Pakistan.

Pascale, F., & Dekai, W. (1994). Statistical augmentation of a Chinese machine-readable dictionary. In Proceedings of the Second Annual Workshop on Very Large Corpora, pp. 69-85.

Poowarawan, Y. (1986). Dictionary-based Thai Syllable Separation. In Proceedings of the Ninth Electronics Engineering Conference.

Sajjad, H. (2007). Statistical Part-of-Speech for Urdu. MS thesis, National University of Computer and Emerging Sciences, Centre for Research in Urdu Language Processing, Lahore, Pakistan.

Sornlertlamvanich, V., Potipiti, T., & Charoenporn, T. (2000). Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm. In Proceedings of the 18th Conference on Computational Linguistics.

Sproat, R., Shih, C., Gale, W., & Chang, N. (1996). A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. Computational Linguistics, 22(3).

Theeramunkong, T., & Usanavasin, S. (2001). Non-Dictionary-Based Thai Word Segmentation Using Decision Trees. In Proceedings of the First International Conference on Human Language Technology Research.

Urdu-Nepali-English Parallel Corpus. (n.d.). Retrieved from Center for Research in Urdu Language Processing: http://www.crulp.org/software/ling_resources/urdunepalienglishparallelcorpus.htm

Wang, X.-J., Liu, W., & Qin, Y. (2007). A Search-based Chinese Word Segmentation Method. 16th International World Wide Web Conference.

Wong, P.-K., & Chan, C. (1996). Chinese Word Segmentation based on Maximum Matching and Word Binding Force. In Proceedings of the 16th Conference on Computational Linguistics.
Dzongkha Word Segmentation

Sithar Norbu, Pema Choejey, Tenzin Dendup
Research Division, Department of Information Technology & Telecom
{snorbu, pchoejay, tdendup}@dit.gov.bt

Sarmad Hussain, Ahmed Mauz
Center for Research in Urdu Language Processing, National University of Computer & Emerging Sciences
{sarmad.hussain, ahmed.mauz}@nu.edu.pk
Abstract

Dzongkha, the national language of Bhutan, is continuous in written form and does not mark word boundaries. Dzongkha word segmentation is one of the fundamental problems and a prerequisite that needs to be solved before more advanced Dzongkha text processing and other natural language processing tools can be developed. This paper presents our initial attempt at segmenting Dzongkha sentences into words. The paper describes the implementation of maximal matching (a dictionary based approach) followed by bigram techniques (a non-dictionary based approach) in segmenting Dzongkha script. Although the techniques used are basic and naive, they provide a baseline for the Dzongkha word segmentation task. Preliminary experimental results show the percentage segmentation accuracy. The segmentation accuracy is, however, dependent on the type of document domain and on the size and quality of the lexicon and the corpus. Some related issues and future directions are also discussed.

Keywords: Dzongkha script, word segmentation, maximal matching, bigram technique, smoothing technique.

1 Introduction

Segmentation of a sentence into words is one of the necessary preprocessing tasks and is essential in natural language analysis, because the word is, both syntactically and semantically, the fundamental unit for analyzing language structure. As in any other language processing task, Dzongkha word segmentation is viewed as one of the fundamental and foremost steps in Dzongkha-related language processing.

The most challenging feature of Dzongkha script is the lack of word boundary separation between the words (see http://www.learntibetan.net/grammar/sentence.htm). In order to perform further linguistic and natural language processing tasks, the script must be transformed into a chain of words; segmenting words therefore plays an essential role in Natural Language Processing. Like the Chinese, Japanese and Korean (CJK) languages, Dzongkha script, being written continuously without any word delimiter, poses a major problem for natural language processing tasks. For CJK, Thai and Vietnamese, many solutions have been published before; for Dzongkha, this is the first word segmentation solution to be documented.

In this paper, we describe Dzongkha word segmentation, which is performed first using the dictionary based approach, in which the principle of the maximal matching algorithm is applied to the input text: given the collection of lexicon entries, the maximal matching algorithm selects, from all possible segmentations of the input sentence, the segmentation that yields the minimum number of word tokens. A non-dictionary based approach is then used, in which the bigram technique is applied. The probabilistic model of a word sequence is studied using Maximum Likelihood Estimation (MLE). The approach using MLE has an obvious disadvantage because of the unavoidably limited size of the training corpora (Nugues, 2006). For this problem of data sparseness, the idea of the Katz back-off model with the Good-Turing smoothing technique is applied.

2 Dzongkha Script

Dzongkha is the official and national language of Bhutan. It is spoken as the first language by approximately 130,000 people and as the second language by about 470,000 people (Van Driem and Tshering, 1998). Dzongkha is closely related to the Sino-Tibetan languages and is a member of the Tibeto-Burman language family. It is an alphabetic language, with phonetic characteristics that mirror those of Sanskrit. Like many of the alphabets of India and South East Asia, the Bhutanese script, called Dzongkha script, is also syllabic (see http://omniglot.com/writing/tibetan.htm). A syllable can contain as little as one character or as many as six characters, and a word can consist of one, two or multiple syllables. In written form, Dzongkha script contains a dot, called tsheg (་), that serves as the syllable and phrase delimiter, but words are not separated at all. For example:

Table 1: Dzongkha words of different syllable counts

Dzongkha      Transliteration   English                   Syllables
དམརཔོ         dmarpo            red                       single-syllabled
སོབ་དཔོན       slop-pon          teacher                   two-syllabled
འཇམ་ཏོག་ཏོ     hjam-tog-to       easy                      three-syllabled
འར་རི་འར་རི     har-ri-hur-ri     crowdedness/confusion     four-syllabled

A sentence is terminated with a vertical stroke called shad (།), which acts as a full stop. The frequent appearance of whitespace in a Dzongkha sentence serves as a phrase boundary or comma, and is a faithful representation of speech: after all, in speech we pause not between words, but either after certain phrases or at the end of the sentence.

A sample Dzongkha sentence reads as follows:

རོང་ཁ་གོང་འཕེལ་ལན་ཚོགས་འདི་ འབག་རལ་ཁབ་ནང་ གཞང་གི་ཁ་ཐག་ལས་ འབག་གི་རལ་ཡོངས་སད་ཡིག་ རོང་ཁའི་སིད་བས་རམ་མི་དང་ རོང་ཁའི་མཐར་ཐག་གི་དབང་འཛིན་པ་ རང་དབང་རང་སོང་གི་འདས་ཚོགས་ མཐོ་ཤོས་ཅིག་ཨིན། འདས་ཚོགས་འདི་ འབག་རལ་བཞི་པ་མི་དབང་མངའ་བདག་རིན་པོ་ཆེ་ དཔལ་འཇིགས་མེད་སེང་གེ་དབང་ཕག་མཆོག་གི་ཐགས་དགོངས་དང་འཁིལ་ཏེ་ སི་ལོ་ ༡༩༨༦ ལ་གཞི་བཙགས་གནང་གནངམ་ཨིན།

(English translation: The Dzongkha Development Commission is the leading institute in the country for the advancement of Dzongkha, the national language of Bhutan. It is an independent organization established by the Fourth King of Bhutan, His Majesty the King Jigme Singye Wangchuck, in 1986.)

3 Materials and Methods

Since the language has no word boundary delimiter, the major resource for Dzongkha word segmentation is a collection of lexicon entries (a dictionary). For such languages, dictionaries are needed to segment running text; therefore, the coverage of the dictionary plays a significant role in the accuracy of word segmentation (Pong and Robert, 1994). The dictionary that we use contains 23,333 word entries. The entries were collected from the "Dzongkha Dictionary", 2nd Edition, published by the Dzongkha Development Authority, Ministry of Education, 2005 ([email protected]). A manually segmented text corpus containing 41,739 tokens is also used for the method. The text corpora were collected from different sources such as newspaper articles, dictionaries and printed books, and belong to domains such as World Affairs, Social Sciences, Arts, Literature, Adventure, Culture and History. Some texts, such as poetry and songs, were added manually.
Table 2 below gives a glimpse of the textual domains contained in the text corpora used for the method (Chungku et al., 2010).

Table 2: Textual domains contained in the corpus

Domain           Sub domain                    (%)
World Affairs    Bilateral relations           12%
Social Science   Political Science             2%
Arts             Poetry/Songs/Ballad           9%
Literatures      Essays/Letters/Dictionary     72%
Adventures       Travel Adventures             1%
Culture          Culture Heritage/Tradition    2%
History          Myths/Architecture            2%

Figure 1 below shows the Dzongkha word segmentation process.

[Figure 1: Dzongkha Word Segmentation Process]

Dzongkha word segmentation implements the principle of the maximal matching algorithm followed by a statistical (bigram) method. It first uses a word list/lexicon to segment the raw input sentences. It then uses MLE principles to estimate the bigram probabilities of each segmented word. All possible segmentations of an input sentence produced by maximal matching are re-ranked, and the most likely segmentation is picked from the set of possible segmentations using a statistical approach (the bigram technique) (Huor et al., 2007). This decides the best possible segmentation among all the word sequences generated by the maximal matching algorithm. These mechanisms are described in the following subsections.

3.1 Maximal Matching Algorithm

The basic idea of the maximal matching algorithm is that it first generates all possible segmentations for an input sentence and then selects the segmentation that contains the minimum number of word tokens, using dictionary lookup.

We use the following steps to segment a given input sentence (a sketch of these steps is given at the end of this subsection):

1. Read the input string of text. If an input line contains more than one sentence, a sentence separator is applied to break the line into individual sentences.

2. Split the input string of text by tsheg (་) into syllables.

3. Taking the next syllable, generate all possible strings.

4. If the number of strings is greater than n, for some value n (the greater the value of n, the better the chances of selecting the sentence with the fewest words from the possible segmentations):

   • Look up the series of strings in the dictionary to find matches, and assign a weight-age accordingly: if a string is found among the dictionary entries, the number of syllables in the string is counted and its weight-age is (number of syllables)²; otherwise it carries a weight-age of 0.
   • Sort the strings by the given weight-age.
   • Delete the (number of strings − n) lowest-weighted strings.

5. Repeat from Step 2 until all syllables are processed.

The above steps produce all possible segmented words from the given input sentence based on the provided lexicon. Thus, the overall accuracy and performance depend on the coverage of the lexicon (Pong and Robert, 1994).
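A minimal sketch of these steps follows, under stated assumptions: `lexicon` is a set of word strings whose syllables are joined by the tsheg, the sentence-separation step is omitted, and all names are our own illustration.

```python
TSHEG = "\u0f0b"  # the Dzongkha syllable delimiter "་"

def maximal_matching(sentence, lexicon, n=10):
    """Grow all segmentations syllable by syllable, pruning to the n
    best-weighted candidates; a candidate is a list of words, each word
    a list of syllables. A dictionary word of s syllables weighs s**2
    and unknown words weigh 0, as described above."""
    syllables = [s for s in sentence.split(TSHEG) if s]
    if not syllables:
        return []

    def weight(candidate):
        return sum(len(word) ** 2 for word in candidate
                   if TSHEG.join(word) in lexicon)

    candidates = [[[syllables[0]]]]  # the first syllable starts the first word
    for syl in syllables[1:]:
        grown = []
        for cand in candidates:
            grown.append(cand + [[syl]])                   # start a new word
            grown.append(cand[:-1] + [cand[-1] + [syl]])   # extend the last word
        candidates = sorted(grown, key=weight, reverse=True)[:n]
    # select the segmentation with the minimum number of word tokens
    best = min(candidates, key=len)
    return [TSHEG.join(word) for word in best]
```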

3.2 Bigram Method

(a) Maximum Likelihood Estimation [5]

In the bigram method, we make the approximation that the probability of a word depends only on the identity of the immediately preceding word. That is, we calculate the probability of the next word given the previous word, as follows:

    P(w_1^n) = Π_{i=1}^{n} P(w_i | w_{i-1})

where

    P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})

Here count(w_{i-1} w_i) is the total number of occurrences of the word sequence w_{i-1} w_i in the corpus, and count(w_{i-1}) is the total number of occurrences of the word w_{i-1} in the corpus.

To make P(w_i | w_{i-1}) meaningful for i = 1, we use the distinguished token <s> at the beginning of the sentence; that is, we pretend w_0 = <s>. In addition, to make the sum of the probabilities of all strings equal 1, it is necessary to place a distinguished token </s> at the end of the sentence.

One of the key problems with the MLE is insufficient data. That is, because of the unavoidably limited size of the training corpus, the vast majority of words are uncommon and some of the bigrams may not occur at all in the corpus, leading to zero probabilities. Therefore, the following smoothing technique was used to estimate the probabilities of unseen bigrams.

(b) Smoothing Bigram Probabilities

The above problem of data sparseness underestimates the probability of some of the sentences that are in the test set. The smoothing technique helps to prevent errors by making the probabilities more uniform. Smoothing is the process of flattening a probability distribution implied by a language model so that all reasonable word sequences can occur with some probability. This often involves adjusting zero probabilities upward and high probabilities downward. This way, the smoothing technique not only helps prevent zero probabilities but also significantly improves the overall accuracy of the model (Chen and Goodman, 1998).

In Dzongkha word segmentation, the Katz back-off model based on the Good-Turing smoothing principle is applied to handle the issue of data sparseness. The basic idea of the Katz back-off model is to use the frequency of n-grams and, if no n-grams are available, to back off to (n-1)-grams, and then to (n-2)-grams and so on (Chen and Goodman, 1998).

The summarized procedure of the Katz smoothing technique is given by the following algorithm: [6]

    P_Katz(w_i | w_{i-1}) =
        C(w_{i-1} w_i) / C(w_{i-1})          if r > k
        d_r * C(w_{i-1} w_i) / C(w_{i-1})    if k >= r > 0
        α(w_{i-1}) * P(w_i)                  if r = 0

where

- r is the frequency (count) of the bigram;
- k is taken as some value in the range of 5 to 10; higher counts are not re-estimated;
- d_r = (r*/r - (k+1) n_{k+1}/n_1) / (1 - (k+1) n_{k+1}/n_1), where n_r is the number of bigrams that occur exactly r times and r* = (r+1) n_{r+1}/n_r is the Good-Turing re-estimated count;
- α(w_{i-1}) = (1 - Σ_{w_i : r > 0} P_Katz(w_i | w_{i-1})) / (1 - Σ_{w_i : r > 0} P(w_i)).

With the above equations, bigrams with a non-zero count r are discounted according to the discount ratio d_r, essentially r*/r; the counts subtracted from the non-zero counts are redistributed among the zero-count bigrams according to the next lower-order distribution, the unigram model.

[5] P.M. Nugues. An Introduction to Language Processing with Perl and Prolog: An Outline of Theories, Implementation, and Application with Special Consideration of English, French, and German (Cognitive Technologies) (pp. 95-104).
[6] X. Huang, A. Acero, H.-W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development (Prentice-Hall Inc., New Jersey 07458, 2001), pp. 559-561.
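For concreteness, the following Python sketch estimates the MLE bigram probabilities and applies the Katz/Good-Turing recipe above. It is our own illustrative reading of the equations, not the system's code; the cut-off k = 5 and all names are assumptions, and corner cases (e.g. empty count-of-count cells) are ignored.

    from collections import Counter

    def katz_bigram(tokens, k=5):
        uni = Counter(tokens)
        bi = Counter(zip(tokens, tokens[1:]))
        n = Counter(bi.values())   # n[r]: number of bigram types seen r times
        total = sum(uni.values())

        def p_uni(w):              # unigram MLE, the back-off distribution
            return uni[w] / total

        def d(r):                  # Good-Turing discount d_r for 0 < r <= k
            cut = (k + 1) * n[k + 1] / n[1]
            r_star = (r + 1) * n[r + 1] / n[r]
            return (r_star / r - cut) / (1 - cut)

        def p(w_prev, w):
            r = bi[(w_prev, w)]
            if r > k:              # frequent bigrams keep their MLE estimate
                return r / uni[w_prev]
            if r > 0:              # infrequent bigrams are discounted by d_r
                return d(r) * r / uni[w_prev]
            # Unseen bigram: redistribute the left-over mass alpha(w_prev).
            seen = [v for v in uni if bi[(w_prev, v)] > 0]
            alpha = (1 - sum(p(w_prev, v) for v in seen)) / \
                    (1 - sum(p_uni(v) for v in seen))
            return alpha * p_uni(w)

        return p

Re-ranking then amounts to scoring each maximal-matching candidate by the product of such bigram probabilities (with the distinguished tokens <s> and </s> at the sentence borders) and picking the highest-scoring segmentation.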

98 r manually segmented corpus containing 41,739 discount ratio d = i.e., the count r r tokens are used for the method. subtracted from the non-zero count are In the sample comparison below, the symbol redistributed among the zero count bigrams ( ་ ) does not make the segmentation unit's according to the next lower-order distribution, mark, but ( ། ) takes the segmentation unit's the unigram model. mark, despite its actual mark for comma or full_stop. The whitespace in the sentence are 4 Evaluations and Results phrase boundary or comma, and is a faithful representation of speech where we pause not Subjective evaluation has been performed by between words, but either after certain phrases comparing the experimental results with the or at the end of sentence. manually segmented tokens. The method was evaluated using different sets of test Consider the sample input sentence: documents from various domains consisting of 714 manually segmented words. Table 3 རོང་ཁ་ལི་ནགསི་འདི་ རོང་ཁ་གོག་རིག་ནང་བཙགས་ནིའི་དོན་ལ་ རབ་ summarizes the evaluation results. སོར་ཧིལ་བ་གཅིག་ཁར་བསོམ་མི་ རང་དབང་ལི་ནགསི་ བཀོལ་སོད་

Document text Correct Detect Accuracy རིམ་ལགས་འདི་གི་ ཉེ་གནས་སེ་མཐན་འགར་བཟོ་ཡོད་པའི་ཐོན་རིམ་ (Correctly segmented tokens / total no. of ཅིག་ཨིན། དེ་གིས་ ཆ་ཚང་སེ་སད་བསར་འབད་ཡོད་པའི་ལག་ལེན་པའི་ words) ངོས་འད་བ་ཚ་སོནམ་ཨིན། Astrology.txt 102/116 87.9% dzo_linux.txt 85/93 91.4% Manually segmented sentence of the sample input sentence: movie_awards.txt 76/84 90.5% News.txt 78/85 91.8% རོང་ཁ།ལི་ནགསི།འདི། རོང་ཁ།གོག་རིག།ནང།བཙགས།ནིའི།དོན་ལ། Notice.txt 83/92 90.2% རབ་སོར།ཧིལ་བ།གཅིག།ཁར།བསོམ།མི། རང་དབང།ལི་ནགསི། བཀོལ་ Religious.txt 63/73 89.0% སོད།རིམ་ལགས།འདི།གི། ཉེ་གནས།སེ།མཐན་འགར།བཟོ།ཡོད།པའི། Song.txt 57/60 95.0% ཐོན།རིམ།ཅིག།ཨིན། དེ་གིས། ཆ་ཚང།སེ།སད་བསར།འབད།ཡོད།པའི། Tradition.txt 109/111 98.2% ལག་ལེན།པའི།ངོས།འད་བ།ཚ།སོནམ།ཨིན། Total 653/714 91.5% Using maximal matching algorithm: Table 3: Evaluation Results རོང་ཁ། ལི། ནགསི། འདི། རོང་ཁ། གོག་རིག། ནང། བཙགས།

Accuracy in %age are measured as: ནིའི། དོན། ལ། རབ་སོར། ཧིལ། བ། གཅིག། ཁར། བསོམ། མི། རང་དབང། ལི། ནགསི། བཀོལ་སོད། རིམ་ལགས། འདི་གི། N Accuracy(%) = ∗100 ཉེ་གནས། སེ། མཐན་འགར། བཟོ། ཡོད། པའི། ཐོན། རིམ། T where ཅིག་ཨིན། དེ། གིས། ཆ་ཚང། སེ། སད་བསར། འབད། ཡོད།  N is the number of correctly པའི། ལག་ལེན། པའི། ངོས། འད། བ། ཚ། སོནམ། ཨིན། segmented tokens  T is the total number of manually System segmented version of the sample input segmented tokens/ Total number of sentence: Underlined text shows the incorrect words. segmentation.

We have taken the extract of different test data རོང་ཁ།ལི་ནགསི་འདི། རོང་ཁ།གོག་རིག།ནང།བཙགས།ནིའི་དོན་ལ། hoping it may contain fair amount of general རབ་སོར།ཧིལ་བ།གཅིག།ཁར།བསོམ།མི། རང་དབང།ལི་ནགསི།བཀོལ་ terms, technical terms and common nouns. The སོད།རིམ་ལགས།འདི།གི། ཉེ་གནས།སེ།མཐན་འགར།བཟོ།ཡོད།པའི།
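The accuracy measure above is easy to compute once the gold and system tokens are available; the following tiny helper is our own sketch (the multiset overlap is only an approximation of "correctly segmented tokens"), not the authors' evaluation script.

    from collections import Counter

    def accuracy(gold_tokens, system_tokens):
        # N: correctly segmented tokens, approximated as the multiset
        # overlap of the two tokenizations; T: total gold tokens.
        n = sum((Counter(gold_tokens) & Counter(system_tokens)).values())
        return 100.0 * n / len(gold_tokens)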

5 Discussion

During the process of word segmentation, it became clear that the maximal matching algorithm is simple and effective, but can produce accurate segmentation only if all the words are present in the lexicon. Since not every word entry can be found in the lexicon database in a real application, the performance of word segmentation degrades when it encounters words that are not in the lexicon (Chiang et al., 1992).

The following are the significant problems with the dictionary-based maximal matching method arising from the coverage of the lexicon (Emerson, 2000):

- incompleteness and inconsistency of the lexicon database;
- absence of technical domains in the lexicon;
- transliterated foreign names;
- some common nouns not included in the lexicon;
- the lexicon/word lists do not contain the genitive endings པའི (which expresses the genitive relationship as a quality or characteristic of the second element, for example དབལ་པའི་བ 'son of a pauper') and འི (first singular possessive, for example ངེའི་བམོ, which actually is ངི་གི་བམོ 'my daughter'), which indicate possession or a part-to-whole relationship, like English 'of'.

A Dzongkha sentence like:

འདི་རོང་ཁ་གི་ ཞིབ་འཚོལ་ཡིག་ཆ་ ཨིན།

may receive the following ambiguous possible segmentations based on simple dictionary lookup:

1. འདི།རོང་ཁ།གི།ཞིབ་འཚོལ།ཡིག་ཆ།ཨིན
   this | Dzongkha | of | research | written document | is
2. འདི།རོང་ཁ།གི།ཞིབ།འཚོལ།ཡིག་ཆ།ཨིན
   this | Dzongkha | of | arrange together | search/expose | written document | is
3. འདི།རོང།ཁ།གི།ཞིབ་འཚོལ།ཡིག་ཆ།ཨིན
   this | fortress | mouth/surface | of | research | written document | is

These problems of ambiguous word divisions and unknown proper names are lessened and partially solved when the output is re-ranked using the bigram technique. Still, solutions to the following issues need to be discussed in the future. Although the texts were collected from the widest range of domains possible, the lack of available electronic resources of informative text adds to the following issues:

- the small corpus was not very impressive for the method;
- ambiguity and inconsistency of the manual segmentation of tokens in the corpus, resulting in incompatibility and sometimes in conflict.

Ambiguity and inconsistency occur because of the difficulty of identifying a word. Since the manual segmentation of corpus entries was carried out by humans rather than by computer, those humans have to be well skilled in identifying or understanding what a word is.

The problems with the Dzongkha script that also hamper the accuracy of Dzongkha word segmentation include issues such as the ambiguous use of the tsheg ( ་ ) in different documents. There are two different types of tsheg: Unicode 0F0B ( ་ ), called Tibetan Mark Intersyllabic Tsheg, is a normal tsheg that provides a break opportunity; Unicode 0F0C ( ༌ ), called Tibetan Mark Delimiter Tsheg Bstar, is a non-breaking tsheg that inhibits line breaking.

For example, the following input sentence with tsheg 0F0B:

སངས་རས་དང་ཚེ་རིང་གཉིས་ བར་དོན་དང་འཕལ་རིག་ནང་ ལ་འབད་དོ་ཡོདཔ་ཨིན་པས།

achieves 100% segmentation as follows:

སངས་རས། དང། ཚེ་རིང། གཉིས། བར་དོན། དང། འཕལ། རིག། ནང། ལ། འབད། དོ། ཡོདཔ། ཨིན། པས།

whereas the same input sentence with tsheg 0F0C is incorrectly segmented as follows:

སངས༌རས༌དང༌ཚེ༌རིང༌གཉིས༌། བར༌དོན༌དང༌འཕལ༌རིག༌ནང༌། ལ༌འབད༌དོ༌ཡོདཔ༌ཨིན༌པས།

There are also cases of shortening of words, removal of inflectional endings and abbreviation of words for the convenience of the writer. These are not reflected in the dictionaries, thus affecting the accuracy of segmentation. The following words have a special abbreviated way of writing a letter or sequence of letters at the end of a syllable:

རོ་རེ as རོེ
ཡེ་ཤེས as ཡེས, etc.

6 Conclusion and Future Work

This paper describes an initial effort in segmenting Dzongkha script. In this preliminary analysis of Dzongkha word segmentation, preprocessing and normalization are not dealt with; numbers, special symbols and characters are also not included. These issues will have to be studied in the future. Much discussion and work also remain to improve the performance of word segmentation. Although the study was a success, there are still some obvious limitations, such as its dependency on dictionaries/lexica; the current Dzongkha lexicon is not comprehensive, and there is also an absence of a large corpus collection from various domains. Future work may include overall improvement of the method for better efficiency, effectiveness and functionality by exploring different algorithms. Furthermore, the inclusion of POS tag sets applied with n-gram techniques, which have proven helpful in handling the unknown-word problem, might enhance the performance and accuracy. Increasing the corpus size might also help to improve the results.

Acknowledgment

This research work was carried out as a part of the PAN Localization Project (http://www.PANL10n.net) with the aid of a grant from the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Center for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan. The research team would also like to express its gratitude to all the PAN Localization Project members of the Bhutanese team based at the Department of Information Technology and Telecom, for their efforts in collecting, preparing and providing the lexicon, corpus, and useful training and testing materials, and for their valuable support and contribution that made this research successful.

References

Chen, Stanley F. and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Science Group, Harvard University, Cambridge, Massachusetts.

Chiang, T-Hui, J-Shin Chang, M-Yu Lin and K-Yih Su. 1992. Statistical models for word segmentation and unknown word resolution. Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan.

Chungku, Jurmey Rabgay and Gertrud Faaß. 2010. NLP Resources for Dzongkha. Department of Information Technology & Telecom, Ministry of Information & Communications, Thimphu, Bhutan.

Durrani, Nadir and Sarmad Hussain. 2010. Urdu Word Segmentation. Human Language Technologies: 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, June 2010.

Emerson, Thomas. 2000. Segmenting Chinese in Unicode. 16th International Unicode Conference, Amsterdam, The Netherlands, March 2000.

Haizhou, Li and Yuan Baosheng. 1998. Chinese Word Segmentation. Language, Information and Computation (PACLIC 12), 1998.
Haruechaiyasak, C., S. Kongyoung and M.N. Dailey. 2008. A Comparative Study on Thai Word Segmentation Approaches. In Proceedings of ECTI-CON, 2008.

Huang, X., A. Acero and H.-W. Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm and System Development (pp. 539-578). Prentice-Hall Inc., New Jersey 07458.

Huor, C.S., T. Rithy, R.P. Hemy, V. Navy, C. Chanthirith and C. Tola. 2007. Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation. PAN Localization Working Papers 2004-2007. PAN Localization Project, National University of Computer and Emerging Sciences, Lahore, Pakistan.

Jurafsky, D. and J.H. Martin. 1999. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (pp. 189-230). Prentice-Hall Inc., New Jersey 07458.

Nugues, P.M. 2006. An Introduction to Language Processing with Perl and Prolog: An Outline of Theories, Implementation, and Application with Special Consideration of English, French, and German (Cognitive Technologies) (pp. 87-104). Springer-Verlag Berlin Heidelberg.

Pong, L.W. and Robert. 1994. Chinese word segmentation based on maximal matching and bigram techniques. Retrieved from The Association for Computational Linguistics and Chinese Language Processing. On-line: http://www.aclclp.org.tw/rocling/1994/P04.pdf

Supnithi, Thepchai. 2007. Word Segmentation and POS tagging. ADD-2 Workshop, SIIT, NECTEC, Thailand.

Van Driem, George and Karma Tshering (collab.). 1998. Languages of the Greater Himalayan Region.

Building NLP resources for Dzongkha: A Tagset and A Tagged Corpus

Chungku Chungku, Jurmey Rabgay
Research Division
Department of Information Technology & Telecom
{chungku,jrabgay}@dit.gov.bt

Gertrud Faaß
Institut für Maschinelle Sprachverarbeitung (NLP processing)
University of Stuttgart
[email protected]

Abstract

This paper describes the application of probabilistic part of speech taggers to the Dzongkha language. A tag set containing 66 tags is designed, which is based on the Penn Treebank. [1] A training corpus of 40,247 tokens is utilized to train the model. Using the lexicon extracted from the training corpus and a lexicon from the available word list, we used two statistical taggers for comparison reasons. The best result achieved was 93.1% accuracy in a 10-fold cross validation on the training set. The winning tagger was thereafter applied to annotate a 570,247 token corpus.

1 Introduction

Dzongkha is the national language of Bhutan. Bhutan has only recently begun applying Natural Language Processing (henceforth NLP) methodologies and tools. However, Dzongkha computing is currently progressing rapidly.

Part of speech (henceforth POS) tagging means annotating each word with its respective POS label according to its definition and context. Such annotation generates a description of the text on a meta-level, i.e. a representation of linguistic units on the basis of their properties. This POS level provides significant information usable by further linguistic research, be it of the morphological, syntactic or semantic kind. Producing such enriched data has proven to be useful especially when designing NLP representations of higher levels of representation, e.g. syntactic parses.

Our project is designed to annotate Dzongkha encyclopedia text with parts of speech using a probabilistic tagger. This means that a set of tags is to be developed and applied manually to parts of these texts, creating training data. Probabilistic taggers can then be applied to annotate other texts with their parts of speech automatically. In this paper, we make use of two such taggers and report on their results.

At present, our POS tagged data is already in use in projects concerning Dzongkha Text to Speech (TTS) processing, further tests on word segmentation (see current state below) and corpus-linguistic research. Future work entails its utilization for higher-level NLP tasks such as parsing, building parallel corpora, research on semantics, machine translation, and many more.

Sections 2 and 3 of this paper describe the Dzongkha script and the challenges in computing Dzongkha; section 4 presents our resources, tagset and corpus. Section 5 describes the tagging and validation processes and reports on their results. Section 6 concludes and discusses future work.

2 The Dzongkha Language

Dzongkha is recognized as the national and official language of Bhutan. It is categorized as a Sino-Tibetan language and said to have derived from classical Tibetan, or choka: Dzongkha consonants, vowels, phonemes, phonetics and writing system are all identical.

[1] http://www.cis.upenn.edu/~treebank/


From a linguistic perspective, the Dzongkha script is syllabic; a syllable can contain one character or as many as six characters. A syllable marker known as "tsheg", which is simply a superscripted dot, separates the syllables of a word. Linguistic words may contain one or more syllables and are also separated by the same symbol, "tsheg"; thus the language lacks word boundaries.

Sentences of Dzongkha contain one or more phrases, which themselves contain one or more words. A character known as "shed" marks a sentence border; it looks like a vertical pipe. Phonetic information is available, too: in most sentences, a pause of silence is taken after each phrase while speaking the Dzongkha language. The written form of Dzongkha represents this pause with a space after each phrase in the case that it does not occur at the end of the sentence. The Dzongkha writing system leads to a serious problem: the detection of word borders, because only phrases are separated by a space. POS tagging usually requires a one-token-per-line format, which is produced by a process called word segmentation. The tagger then adds the POS category to each token.

The training data (40,247 tokens) was segmented manually, to achieve higher accuracy of word boundaries and also due to the lack of word segmentation at that time. After a Dzongkha word segmentation tool [2] was developed, the remaining text was segmented with this tool, which works basically with a lexicon and the longest string matching method.

3 Challenges and suggestions for tagging Dzongkha texts

3.1 Words unknown to the language model

A statistical tagger learns from POS distributions in manually tagged data while being trained and, when being applied to unknown text, "guesses" the POS of each word. The TreeTagger (Schmid, 1994) additionally makes use of an externally provided lexicon when producing its language model (the "parameter file"). We had opted for using the TreeTagger and hence have listed about 28,300 Dzongkha words with their POS in a lexicon selected from the 570,247 token corpus to be tagged. We fed these data to the tagger during its training phase. Note, however, that such a lexicon may never be complete, as there are morphologically productive processes creating new forms (these belong to POS classes that are often named "open"). Such forms may be taken into account when developing a tagset; however, in this work, we opted for postponing the issue until a morphological analyser can be developed.

3.2 Ambiguous function words

A number of Dzongkha function words are ambiguous; for each occurrence of such a word, the tagger has to decide on the basis of the word's context which of the possible tags is to be assigned. Here, the tagset itself comes into view: whenever it is planned to utilize probabilistic POS taggers, the tagset should be designed on the basis of the words' distributions, otherwise the potential accuracy of the taggers may never be achieved.

In Dzongkha it is mainly the function words that are ambiguous in terms of their POS. A typical example is ལས་ /le/ (from), belonging to the category PP (post position), and ལས་ /le/ (so), which is of the category CC (conjunction).

3.3 Fused forms

Some morpho-phonemic processes in Dzongkha lead to the fusing of words, presenting another challenge for tagging. Such words are not very frequent, thus posing a challenge to statistical taggers. [3] The word རལ་པོའི་ /gelpoi/ (king[+genitive]), for example, is fused from the phrase རལ་པོ་གི་ /gelpo gi/ (king is); another example is the fused form བསེདན་ /sen/ ([to] kill), made from བསེད་་བ་ཅིན་ /se wa cin/ (if [to] kill).

When a tagset does not cater for fused forms of words, one could split these forms while tokenizing, adding an intermediate level of representation between the original text level and the POS level: a level of text material to be utilized for tagging or other further processing, as e.g. done by Taljard et al. (2008) for the Bantu language Northern Sotho.

[2] This tool was developed at NECTEC (National Electronics and Computer Technology Center), Thailand.
[3] In our training set, 1.73% of all words were detected as fused forms.

However, the forms could not easily be split, as the resulting first parts of the words would not contain the separator "tsheg". Splitting the word རལ་པོའི་ /gelpoi/ (king[+genitive]), for example, would result in རལ་པོ /gelpo/ (king) and འི་ /yi/ [+genitive]. The language model does not cater for words ending in characters other than "tsheg" (word border) or being directly followed by "shed" (a word like རལ་པོ /gelpo/ (king) may only appear if preceding a sentence border). Tagging accuracies for such theoretical forms are not expected to be acceptable. Fusing processes are productive; therefore, further research in the frame of a project developing a Dzongkha tokenizer is deemed necessary.

We examined all fused words contained in our textual data to find a workable solution at the current stage of the project. As long as the problem of tokenizing automatically is not solved, we opted for keeping the fused forms as they are. To enhance our tagger results, we suggest adding a number of tags to our tagset that consist of the two POS tags involved. རལ་པོའི་ /gelpoi/ (king[+genitive]), for example, is tagged as "NN+CG" and བསེདན་ /sen/ ([to] kill) as "VB+SC". The "+" indicates combined individual tags. All known forms are added to a new version of the lexicon. Note, however, that all tagging results reported upon in this paper are still based on the tag set described below.

4 Resources used

4.1 Tagset

During the first phase of the PAN Localization project, the first Dzongkha POS tagset [4] was created. It consisted of 47 tags; its design is based on the Penn Guidelines [5] and its categories of POS correspond to the respective English Penn categories. PAN generally makes use of the Penn Treebank tags as a basis for tagging. Examining the similar features exhibited by both languages (Dzongkha and English), the tags that were applicable to Dzongkha were taken directly from the Penn Treebank. In cases where these languages showed dissimilarities in their nature, new tags for Dzongkha were assigned (based e.g. on the work on Urdu of Sajjad and Schmid, 2009). As an example of such dissimilarity, Dzongkha postpositions are mentioned here, cf. (1); the respective tag (PP) only exists for Dzongkha, whereas in English the whole set of adposition tags (prepositions and postpositions) exists.

(1) བི་ལི་ ཤིང་གི་འོག་ལའདག
    j'ili shing-gi wôlu-dû
    Cat tree[posp] under[PP] be
    ''A cat is under the tree''

Whenever a tagset designed on theoretical implications is applied to text, it will be found in the course of creating the training data that not all morpho-syntactic phenomena of the language had been considered. This happened for Dzongkha, too: words appeared in the texts that didn't fit in any of the pre-defined classes. Dzongkha uses honorific forms: ན་བཟའ་ /nam za/ (cloths) is the honorific form of the noun གོ་ལ་ /gola/ (cloths), and གསངས་ /sung/ (tell) the honorific form of the verb སབ་ /lab/ (tell). We opted to mark them by adding the tags NNH (honorific common noun) and VBH (honorific verb) to enable future research on this specific usage of the Dzongkha language. A number of tags were added to the set, of which we describe four in more detail: two of the additional tags are subclasses of verbs: VBH (honorific verb form), and VBN, which describes past participle forms like, e.g., བངམ་ /jun/ (created), the past participle form of བང་ /jung/ (create).

Concerning case, we added two subclasses of case: CDt and CA. These differentiate between dative (CDt) and ablative (CA): the CDt (dative case) labels e.g. དོན་ལས་ /doen le/ (for it) and དོན་ལ་ /doen lu/ (for this).

[4] The original Dzongkha tag set is described at http://www.panl10n.net
[5] The Penn Guidelines can be downloaded from: http://www.cis.upenn.edu/~treebank/

The ablative case (CA) is used when the argument of the preposition describes a source. For example, in the phrase ཤིང་ལས་རང་ཁི་ /shing le kang thri/ (from wood chair), ལས་ /le/ (from) will be labeled CA, since the chair described is made from (the source) wood (Muaz et al., 2009). The tagset utilized in our experiment consists of a total of 66 parts of speech, as shown in Appendix A.

4.2 Collecting a Corpus and generating a training data set

The corpus collection process. The process of collecting a corpus should be based on its purpose. As our goal was to design a Dzongkha text corpus as balanced as possible in terms of its linguistic diversity, the text data was gathered from different sources like newspaper articles, samples from traditional books, and dictionaries; some text was added manually (poetry and songs). The text selection was also processed with a view on the widest range of genres possible: texts from social science, arts and culture, and texts describing world affairs, travel adventure, fiction and history books were added, as our goal is to make it representative of every linguistic phenomenon of the language (Sarkar et al., 2007). The corpus is however not balanced, for a lack of available electronic resources of informative text (so far only 14% belong to this category). Future work will therefore entail collecting more data from respective websites and newspapers.

The entire corpus contains 570,247 tokens; it is made from the domains described in Table (1).

Domain             Share %   Text type
1) World Affairs   12%       Informative
2) Social Science  2%        Informative
3) Arts            9%        Descriptive
4) Literature      72%       Expository
5) Adventure       1%        Narrative
6) Culture         2%        Narrative
7) History         2%        Descriptive

Table (1): Textual domains contained in the corpus

The training data set

Cleaning of texts. Raw text usually has to be modified before it can be offered to a tagger. It is to be cleaned manually, e.g. by removing extra blank spaces, inserting missing blanks, correcting spelling mistakes, and by removing duplicate occurrences of sequences. Secondly, the process of tokenization ("word segmentation") is to be applied.

Design and generation of training data. The training data set was produced in several steps: firstly, 20,000 tokens were manually labeled with their respective parts of speech (for a comparison of tagging techniques, cf. Hasan et al., 2007). Thereafter, the problems that had occurred during the manual process were summarized and the tagset revised as described in section 4.1. Thereafter, we added another 20,247 tokens. The final training data set hence consists of 40,247 tokens (2,742 sentences, 36,362 words, 3,265 punctuation, 650 numbers).

4.3 Tagging technique: TreeTagger and TnT

TreeTagger (Schmid, 1994): The TreeTagger is a probabilistic part of speech tagger operating on the basis of decision trees. Helmut Schmid developed it in the frame of the "TC" project [6] at the Institute for Computational Linguistics at the University of Stuttgart, Germany.

The software consists of two modules:

a) train-tree-tagger: utilized to generate a parameter file from a lexicon and a hand-tagged corpus.
b) tree-tagger: makes use of the parameter file generated with a); annotates text (which is to be tokenized first) with part-of-speech automatically.

a) Generating a language model: Training

When generating a language model stored in a so-called "parameter file", three files are required: a lexicon describing tokens and their respective tags, a list of open tags, and training data.

[6] The tagger is freely available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

The "train-tree-tagger" module generates a binary parameter file.

The lexicon is a file that contains tokens (word forms and punctuation), in the format of one per line. The TreeTagger was developed for languages with inflection, i.e. languages where one word may occur in a number of allomorphs. To ease the task of tagging such languages, the tool can operate on the level of a base form, the "lemma", which may be added to every word form. In the case of Dzongkha, lemmatization has not been developed yet; therefore, we either make use of the word form itself, or of a hyphen in the case that no lemma is available. As Table (2) demonstrates, the lexicon contains the word forms in the first column, the second column contains the POS and the third a hyphen. In the case of ambiguous entries, one line may contain a sequence of tag-"lemma" pairs that follow the word form of the first column. The lexicon may not contain any spaces; all columns must be separated by exactly one tab(ulator). Because the lexicon is only made use of during the training phase of the tagger, any update must result in reproducing the parameter file.

Word       POS tag   lemma   POS tag   lemma
ཀ་         NN        ཀ་
ཀ་ཀ་       NN        ཀ་ཀ
ཀ་ཀ་ར་     NNP       --
ལས་        CC        ལས་     PP        ལས་

Table (2): Example entries of the lexicon
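To illustrate the format (not to reproduce the project's own tooling), a lexicon file as described above could be written as follows; the entry shown mirrors the ambiguous ལས་ line of Table (2), and the function name is our own.

    def write_lexicon(entries, path):
        # entries: {word form: [(POS tag, lemma or None), ...]}
        with open(path, "w", encoding="utf-8") as f:
            for form, readings in entries.items():
                cols = [form]
                for tag, lemma in readings:
                    # No Dzongkha lemmatizer exists yet: fall back to a hyphen.
                    cols.extend([tag, lemma if lemma is not None else "-"])
                # Exactly one tab between columns, and no spaces.
                f.write("\t".join(cols) + "\n")

    write_lexicon({"ལས་": [("CC", "ལས་"), ("PP", "ལས་")]}, "lexicon.txt")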

Open class tags: a file containing the list of open class tags, i.e. the productive classes (one entry per line), cf. Appendix A. In the upcoming version of the tagset, tags of the fused forms will be added, like, e.g., NN+CG (combination of all noun forms with genitive case CG), VB+CG (combination of all verb forms with genitive case CG), JJ+CG (combination of all adjective forms with genitive case CG), RB+CG (combination of all adverb forms with genitive case CG), and the same with the combinations with subordinate conjunction: NN+SC, VB+SC, JJ+SC, RB+SC, just to name a few.

Tagged training data: a file that contains tagged training data. The data must be stored in one-token-per-line format. This means that each line contains one token and its respective tag, separated by one tabulator. The file should be cleaned of empty lines; no meta-information, like, e.g., SGML markup, is allowed.

b) Tagging

Two files serve as input: the binary parameter file and a text file that is to be tagged.

Parameter file: the file that was generated by step a) above.

Text file: a file that is to be tagged; it is mandatory that the data in this file appears in a one-token-per-line format.

TnT (Brants, 2000): The Trigrams'n'Tags (TnT) tagger was developed by Thorsten Brants (Brants, 2000). It is language independent and has been used widely for a number of languages, often yielding an accuracy of +96% without utilizing an external lexicon or an open word class file. TnT is based on Markov models, and takes not only distributional data of tokens, but also the final sequences of characters of a token into account when guessing the POS of a word. It can use the same format for training data as the TreeTagger; therefore, in order to use TnT for comparison reasons, no additional preparations for tagging Dzongkha are necessary.

5 Validation and Results

5.1 k-fold cross validation and bootstrapping

When applying a tagset to training data for the first time, it is advisable to progress in steps and to validate each step separately: one begins with annotating a rather small portion of text that is then divided into k slices. Slices k-1 are then utilized to create a parameter file; slice k is stripped of its annotations and annotated by the tagger using that parameter file. The same procedure is followed for all other slices ("k-fold cross validation"). A sketch of this procedure is given below.
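The procedure can be outlined as follows; `train_and_tag` is a placeholder of ours for whichever tagger is being validated, not a real TreeTagger or TnT interface.

    from statistics import mean, median

    def k_fold(tagged, train_and_tag, k=10):
        # tagged: list of (token, tag) pairs; train_and_tag(train, tokens)
        # trains on k-1 slices and returns predicted tags for the held-out
        # tokens.
        size = len(tagged) // k
        slices = [tagged[i * size:(i + 1) * size] for i in range(k)]
        accuracies = []
        for i, held_out in enumerate(slices):
            train = [pair for j, s in enumerate(slices) if j != i for pair in s]
            predicted = train_and_tag(train, [tok for tok, _ in held_out])
            hits = sum(p == gold for p, (_, gold) in zip(predicted, held_out))
            accuracies.append(100.0 * hits / len(held_out))
        return accuracies, mean(accuracies), median(accuracies)

The per-slice accuracies and their mean and median are the kind of figures reported for both taggers in Table (3) below.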

Afterwards, a comparison of the original tags with the tags assigned by the tagger will help to judge a number of issues, like, e.g., whether the size of the training data is sufficient (quantitative review). Examining the most frequent (typical) assignment errors of the tagger will also support the enhancement of the tagset: if e.g. the distribution of two different tags is more or less identical, a probabilistic tagger will not succeed in making the right choices; here, one is to consider whether using one tag would be acceptable from a linguistic point of view (qualitative review).

The knowledge gained here usually leads to updates in the tagset and/or to the necessity of adding more texts containing constellations that were found to be problematic for probabilistic tagging because they occur too rarely in the texts. After such updates are done on the existing training texts and tagset respectively, the k-fold validation may be repeated and reviewed again. Updating training data and tagset is repeated until the tagging results are satisfying (such a progressing method is usually called "bootstrapping").

5.2 TreeTagger results

The work on automatic part of speech tagging for Dzongkha began with the manual annotation of 20,000 tokens. Because a non-linguist performed the process manually, the language coordinator did thorough correction. The 20,000 token training set made use of 43 different single tags (of 47 provided by the tagset). The token-tag combinations from there were combined with an external lexicon produced from a dictionary; the resulting lexicon file thus contained all types.

The 10-fold cross validation resulted in an accuracy of around 78%. Result introspection led to the knowledge that more data had to be added and that fused words would have to receive separate tags. It also showed that manual tokenization is an error-prone procedure, as a significant number of word and sentence borders had to be corrected in the data.

After updating tagset and training data, another 20,247 tokens were added to the training set and the lexicon was updated accordingly, except for the fused forms, for which a final solution on how to tag them has not been found yet. The tagset was extended to 66 tags (cf. Appendix A). With a full knowledge of the possible tag-token combinations, the TreeTagger achieved a median accuracy of 93.1%.

5.3 TnT results and comparison with the TreeTagger

Using the 40,247 token text segment, a 10-fold cross validation was also performed with the TnT tagger. It achieves a 91.5% median accuracy when the tagset containing 47 tags is applied. Results for each slice and the mean/median of both taggers can be found in Table (3) for comparison reasons. TnT reports on the number of unknown tokens detected in each slice; the mean of 16.49% (median 14.18%) of unknown tokens offers an explanation why TnT does not perform as well as the TreeTagger, which was supplied with a complete lexicon and was thus not faced with unknown tokens at all.

           TreeTagger    TnT tagger
           accuracy %    accuracy %
slice 1    92.13         92.33
slice 2    84.61         89.73
slice 3    89.08         89.88
slice 4    90.17         90.43
slice 5    92.95         91.01
slice 6    93.32         91.35
slice 7    94.24         91.69
slice 8    93.32         92.03
slice 9    95.21         92.55
slice 10   94.56         92.60
Mean       91.96         91.36
Median     93.14         91.20

Table (3): 10-fold cross validation results for TreeTagger and TnT

A qualitative review of the results showed that it is usually the tag CC that is confused with others (NN, DT, PRL, etc.) by TnT, while the TreeTagger rather confuses NN (with VB, NNP, PRL, CC). However, a more thorough qualitative examination of these results is still to be done and may lead to further updates on the tagset.

6 Discussion, Conclusions and Future Work

This paper describes the building of NLP resources for the national language of Bhutan, namely Dzongkha. We have designed and built an electronic corpus containing 570,247 tokens belonging to different text types and domains. Automated word segmentation with a high precision/recall still remains a challenge. We have begun to examine statistical methods to find solutions for this, and we plan to report on our progress in the near future.

We have developed a first version of a tag set on the basis of the Penn Treebank tagset for English (cf. section 4.1). A training data set of 40,247 tokens has been tagged manually and thoroughly checked. Lastly, we have tagged the corpus with the TreeTagger (Schmid, 1994) using a full form lexicon, achieving 93.1%, and, for comparison reasons, with TnT (Brants, 2000), without a lexicon, achieving 91.5%.

We have used the present output in the construction of an advanced Dzongkha TTS (text to speech) system using an HMM-based method, which is being developed by the Bhutan team in collaboration with the HLT team at NECTEC, Thailand. [7]

A lot of work still remains: we have yet to examine the tagger results from a qualitative aspect in order to answer, inter alia, the following questions: are any further updates on the tag set necessary, and what is the best way to process fused forms? Quantitative aspects might also still play a role: it might be necessary to add further training data containing part of speech constellations that rarely occur, so that tagger results for those will improve.

We also plan to increase our corpus collection from various ranges of domains. At present there are more media, e.g. newspapers, available on the world wide web, so we will be able to collect such texts easily. In Bhutan, there is an ongoing project on OCR (optical character recognition) of Dzongkha under the PAN project (www.PANL10n.net). Given the success of this project, we will be able to scan text from textbooks.

Acknowledgments

This research work was carried out as part of the PAN Localization Project (www.PANL10n.net) with the aid of a grant from the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Center for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan. The research team would also like to thank PD Dr. Helmut Schmid and Prof. Dr. Heid, Institut für Maschinelle Sprachverarbeitung (NLP institute), Universität Stuttgart, Germany, for their valuable support and contributions that made this research successful.

References

Brants, Thorsten. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, USA, pages 224-231.

Hasan, Fahim Muhammad, UzZaman, and Mumit Khan. 2007. Comparison of different POS Tagging Techniques (N-Gram, HMM and Brill's tagger) for Bangla. PAN Localization Working Papers, 2004-2007, pages 31-37.

Sajjad, Hassan and Helmut Schmid. 2009. Tagging Urdu Text with Parts of Speech: A Tagger Comparison. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, Greece, 2009.

Muaz, Ahmed, Aasim Ali, and Sarmad Hussain. 2009. Analysis and Development of Urdu POS Tagged Corpus. Association for Computational Linguistics, Morristown, NJ, USA. Retrieved December 1, 2009, from http://www.lancs.ac.uk

Schmid, Helmut. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, pages 44-49.

Sarkar, Asif Iqbal, Dewan Shahriar Hossain Pavel and Mumit Khan. 2007. Automatic Bangla Corpus Creation. BRAC University, Dhaka, Bangladesh. PAN Localization Working Papers, 2004-2007, pages 22-26.

[7] http://www.nectec.or.th/en/

Taljard, Elsabé, Gertrud Faaß, Ulrich Heid, and Daan J. Prinsloo. 2008. On the development of a tagset for Northern Sotho with special reference to standardization. Literator 29(1), April 2008 (special edition on Human Language Technologies), South Africa, pages 111-137.

APPENDIX A

The Dzongkha Tagset as used for the validation tests

Open classes:

Type           SubClass            Label
Noun           Common Noun         NN
               Honorific form      NNH
               Particular/Person   NNP
               Quantifier          NNQ
               Plural              NNS
Verb           Aspirational        VBAs
               Honorific           VBH
               Agentive            VBAt
               Non-Agentive        VBNa
               Auxiliary           VBAUX
               Imperative          VBI
               Modal               VBMD
               Past participle     VBN
               Verb                VB
Adjective      Characteristic      JJCt
               Periodic            JJP
               Comparative         JJR
               Superlative         JJS
               Adjective           JJ
Adverb         Behavioral          RBB
               Comparative         RBR
               Superlative         RBS
               Adverb              RB
Interjection                       UH

Closed classes:

Type           SubClass            Label
Marker         Affirmative         AM
               Interrogative       IrM
               Tense               TM
Case marker    Ablative Case       CA
               Dative Case         CDt
               Genitive Case       CG
               Vocative Case       CV
Pronouns       Locative            PRL
               Differential        PRD
               Personal            PRP
               Reflexive           PRRF
Conjunction    Coordinate          CC
               Subordinate         SC
Number         Cardinal Number     CD
               Ordinal Number      OD
               Nominal Number      ND
Ad position    Post position       PP
Determiner     Definite            DT
               Possessive          DT$
               Indefinite          DTI
Negator                            NEG
Punctuation                        PUN

Combined tags:

Type                           SubClass            Label
Noun+Genitive case (CG)        Common+CG           NNCG
                               Particular+CG       NNPCG
                               Quantifier+CG       NNQCG
                               Plural+CG           NNSCG
Adjective+CG                   Adjective+CG        JJCG
                               Characteristic+CG   JJCtCG
                               Periodic+CG         JJPCG
Verb+CG                        Honorific+CG        VBHCG
                               Agentive+CG         VBAtCG
                               Verb+CG             VBCG
                               Modal+CG            VBMDCG
Definite Determiner+CG         Determiner+CG       DTCG
Locative Pronoun+CG            Locative+CG         PRLCG
Negator+CG                     Negator+CG          NEGCG
Noun+Subordinate Conj. (SC)    Common Noun+SC      NNSC
Verb+SC                        Verb+SC             VBSC
                               Agentive+SC         VBAtSC
                               Modal verb+SC       VBMDC
Affirmative+SC                 Affirmative+SC      AMSC
Negator+SC                     Negator+SC          NEGSC


Unaccusative/Unergative Distinction in Turkish: A Connectionist Approach

Cengiz Acartürk
Middle East Technical University
Ankara, Turkey
[email protected]

Deniz Zeyrek
Middle East Technical University
Ankara, Turkey
[email protected]

Abstract

This study presents a novel computational approach to the analysis of the unaccusative/unergative distinction in Turkish by employing feed-forward artificial neural networks with a backpropagation algorithm. The findings of the study reveal correspondences between semantic notions and syntactic manifestations of the unaccusative/unergative distinction in this language, thus presenting a computational analysis of the distinction at the syntax/semantics interface. The approach is applicable to other languages, particularly ones which lack an explicit diagnostic such as auxiliary selection but have a number of diagnostics instead.

1 Introduction

Ever since the Unaccusativity Hypothesis (UH, Perlmutter, 1978), it has been widely recognized that there are two heterogeneous subclasses of intransitive verbs, namely unaccusatives and unergatives. The phenomenon of the unaccusative/unergative distinction is wide-ranging and labeled in a variety of ways, including active, split S, and split intransitivity (SI) (cf. Mithun, 1991). [1]

Studies dealing with SI are numerous, and recently works taking auxiliary selection as the basis of this syntactic phenomenon have increased (cf. McFadden, 2007 and the references therein). However, SI in languages that lack explicit syntactic manifestations such as auxiliary selection has been less studied. [2] Computational approaches are even scarcer. The major goal of this study is to discuss the linguistic issues surrounding SI in Turkish and present a novel computational approach that decides which verbs are unaccusative and which verbs are unergative in this language. The computational approach may in turn be used to study the split in lesser-known languages, especially ones lacking a clear diagnostic. It may also be used with well-known languages where the split is observed, as a means to confirm earlier predictions made about SI.

2 Approaches to Split Intransitivity (SI)

Broadly speaking, approaches to SI may be syntactic or semantic. Syntactic approaches divide intransitive verbs into two syntactically distinct classes. According to the seminal work of Perlmutter (1978), unaccusative and unergative verbs form two syntactically distinct classes of intransitive verbs. Within the context of Relational Grammar, Perlmutter (1978) proposed that unaccusative verbs have an underlying object promoted to the subject position, while unergative verbs have a base-generated subject. This hypothesis, known as the Unaccusativity Hypothesis (UH), maintains that the mapping of the sole argument of an intransitive verb onto syntax as subject or direct object is semantically predictable. The UH distinguishes active or activity clauses (i.e., unergative clauses) from unaccusative ones.

[1] In this paper, the terms unaccusative/unergative distinction and split intransitivity (SI) are used interchangeably.
[2] An exception is Japanese. For example see Kishimoto (1996), Hirakawa (1999), Oshita (1997), Sorace and Shomura (2001), and the references therein. Also see Richa (2008) for Hindi.

Unergative clauses include willed or volitional acts (work, speak) and certain involuntary bodily process predicates (cough, sleep); unaccusative clauses include predicates whose initial term is semantically patient (fall, die), predicates of existing and happening (happen, vanish), nonvoluntary emission predicates (smell, shine), aspectual predicates (begin, cease), and duratives (remain, survive).

From a Government and Binding perspective, Burzio (1986) differentiates between the two intransitive classes by the verbs' theta-marking properties. In unaccusative verbs (labeled 'ergatives'), the sole argument is the same as the deep structure object; in unergative verbs, the sole argument is the same as the agent at the surface. The configuration of the two intransitive verb types may be represented simply as follows:

Unergatives:   NP [VP V]    John ran.
Unaccusatives: [VP V NP]    John fell.

In its original formulation, the UH claimed that the determination of verbs as unaccusative or unergative somehow correlated with their semantics, and since then there has been much theoretical discussion about how strong this connection is. It has also been noted that a strict binary division is actually not tenable, because across languages some verbs fail to behave consistently with respect to certain diagnostics. For example, it has been shown that, with standard diagnostics, certain verbs such as last, stink, bleed, die, etc. can be classified as unaccusative in one language and unergative in a different language (Rosen, 1984; Zaenen, 1988, among many others). This situation is referred to as unaccusativity mismatches. New proposals that specifically focus on these problems have also been made (e.g., Sorace, 2000, below).

2.1 The Connection of Syntax and Semantics in SI

Following the initial theoretical discussions about the connection between syntactic diagnostics and their semantic underpinnings, various semantic factors were suggested. These involve directed change and internal/external causation (Levin & Rappaport-Hovav, 1995), inferable eventual position or state (Lieber & Baayen, 1997), telicity and controllability (Zaenen, 1993), and locomotion (see Randall, 2007; Alexiadou et al., 2004; and Aranovich, 2007, and McFadden, 2007 for reviews). Some researchers have suggested that syntax has no role in SI. For example, van Valin (1990), focusing on Italian, Georgian, and Achenese, proposed that SI is best characterized in terms of Aktionsart and volitionality. Kishimoto (1996) suggested that volitionality is the semantic parameter that largely determines the unaccusative/unergative distinction in Japanese.

Auxiliary selection is among the most reliable syntactic diagnostics proposed for SI. This refers to the auxiliary selection properties of languages that have two perfect auxiliaries corresponding to be and have in English. In Romance and Germanic languages such as Italian, Dutch, German, and to a lesser extent French, the equivalents of be (essere, zijn, sein, être) tend to be selected by unaccusative predicates, while the equivalents of have (avere, haben, hebben, avoir) tend to be selected by unergative predicates (Burzio, 1986; Zaenen, 1993; Keller, 2000; Legendre, 2007, among others). In (1a-b) the situation is illustrated in French (F), German (G) and Italian (I). (Examples are from Legendre, 2007.)

(1) a. Maria a travaillé (F) / hat gearbeitet (G) / ha lavorato (I).
       'Maria worked.'
    b. Maria est venue (F) / ist gekommen (G) / è venuta (I).
       'Maria came.'

Van Valin (1990) and Zaenen (1993) discuss auxiliary selection as a manifestation of the semantic property of telicity. Hence in Dutch, zijn-taking verbs are by and large telic, and hebben-taking verbs are atelic.

Impersonal passivization is another diagnostic that seems applicable to a wide range of languages and is used by a number of authors, e.g. Perlmutter (1978), Hoekstra and Mulder (1990), Keller (2000). This construction is predicted to be grammatical with unergative clauses but not with unaccusative clauses. Zaenen (1993) notes that impersonal passivization is controlled by the semantic notion of protagonist control in Dutch; therefore the incompatibility of examples such as bleed with impersonal passivization is

attributed to the fact that bleed is not a protagonist control verb. Levin and Rappaport-Hovav (1995:141) take impersonal passivization as an unaccusativity diagnostic, but take its sensitivity to protagonist control as a necessary but insufficient condition for unergative verbs to allow it. In other words, only unergative verbs will be found in this construction, though not all of them.

Refinements of the UH have also been proposed. Most notably, Sorace (2000) argued that the variation attested across languages (as well as within the dialects of a single language) is orderly, and that there are a number of cut-off points to which verb classes can be sensitive. Sorace's work on (monadic) intransitive verbs is built on variation in the perfective auxiliary selection of verbs in Romance and Germanic languages, called the Auxiliary Selection Hierarchy (ASH). She demonstrates that the variation is based on a hierarchy of thematic and aspectual specification of the verbs (viz., telicity and agentivity) and that it is a function of the position of a verb on the hierarchy. Verbs with a high degree of aspectual and thematic specification occupy the extreme ends; variable verbs occupy the middle position, reflecting the decreasing degree of aspectual specification. Both cross-dialectally and across languages, these verbs may be used either with unaccusative or unergative syntax. [3] The ASH therefore is a descriptive statement considering auxiliary selection as a property characterized by both syntax and semantics, as originally viewed by the UH.

We now turn to Turkish, which lacks perfective auxiliaries. A number of other syntactic diagnostics, reviewed below, have been proposed, but unlike auxiliaries in other languages, these are not obligatory constructions in Turkish. In addition, the semantic properties underlying the proposed diagnostics have not been studied extensively. Therefore, Turkish presents a particular challenge for any study about SI.

3 Diagnostics for SI in Turkish

Just as in other languages, intransitive verbs in Turkish are sensitive to a set of syntactic environments, summarized below.

3.1 The -ArAk Construction

One of the diagnostics is the -ArAk construction, an adverbial clause formed with the root verb plus the morpheme -ArAk (Özkaragöz, 1986). In a Turkish clause which involves the verbal suffix -ArAk, both the controller (the complement verb) and the target (the matrix verb) have to be either unaccusative or unergative. In addition, both the controller and the target have to be the final (surface) subjects of the clause. The examples below contain sentences where both the controller and the target verbs are unaccusative (2) or both are unergative (3). The examples also contain ungrammatical sentences where the controller verb is unergative whereas the target verb is unaccusative (4), and those in which the controller verb is unaccusative whereas the target verb is unergative (5). (Examples are from Özkaragöz, 1986.)

(2) Hasan [kol-u kana-y-arak] acı çek-ti.
    arm-POSS bleed-GL-ArAk suffer-PST
    'Hasan, while his arm bled, suffered.'
(3) Kız [(top) oyna-y-arak] şarkı söyle-di.
    girl ball play-GL-ArAk sing-PST
    'The girl, while playing (ball), sang.'
(4) *Kız [(top) oyna-y-arak] kay-dı.
    girl ball play-GL-ArAk slip-PST
    'The girl, while playing (ball), slipped.'
(5) *Kız [kayak kay-arak] düş-tü.
    girl ski-ArAk fall-PST
    'The girl, while skiing, fell.'

3.2 Double Causatives

The double causative construction is allowed with unaccusative verbs but not with unergatives, as shown in (6) and (7) below (Özkaragöz, 1986).

(6) Sema Turhan-a çiçeğ-i sol-dur-t-tu.
    Sema Turhan-DAT flower-ACC fade-CAUS-CAUS-PST
    'Sema made Turhan cause the flower to fade.'

[3] The claim that the two notions of the ASH lie within a single dimensional hierarchy has been questioned by Randall (2007). The ASH has also been criticized since it does not explain the reason why a certain language shows the pattern it does (McFadden, 2007).

(7) *Ben Turhan-a Sema-yı koş-tur-t-t-um.
    I Turhan-DAT Sema-ACC run-CAUS-CAUS-PST-1sg
    'I made Turhan make Sema run.'

3.3 Gerund Constructions

The gerund constructions -Irken 'while' and -IncE 'when' are further diagnostics. The former denotes simultaneous action and the latter denotes consecutive action. Unergative verbs are predicted to be compatible with the -Irken construction, whereas unaccusatives are predicted to be compatible with the -IncE construction, as shown in (8) and (9). [4]

(8) Adam çalış-ırken esne-di.
    man work-Irken yawn-PAST.3per.sg
    'The man yawned while working.'
(9) Atlet takıl-ınca düş-tü.
    athlete trip-IncE fall-PAST.3per.sg
    'The athlete, when tripped, fell.'

3.4 The Suffix -Ik

It has also been suggested that the derivational suffix -Ik, used for deriving adjectives from verbs, is compatible with unaccusatives but not with unergatives, as shown in (10) and (11).

(10) bat-ık gemi
     sink-Ik ship
     the sunk ship
(11) *çalış-ık adam
     work-Ik man
     the worked man

3.5 The -mIş Participle

The past participle marker -mIş, which is used for deriving adjectives from verbs, has been proposed as yet another diagnostic. The suffix -mIş forms participles with transitive and intransitive verbs, as well as passivized verbs. The basic requirement for the acceptability of the -mIş participle is the existence of an internal argument in the clause. In well-formed -mIş participles, the modified noun must be the external argument of a transitive verb (e.g., anne 'mother' in [12]), or the internal argument of a passivized verb (e.g., borç 'debt' in [13]). The internal argument of a transitive verb is not allowed as the modified noun, as illustrated in (14).

(12) Çocuğu-n-u bırak-mış anne
     child-POSS-ACC leave-mIş mother
     'a/the mother who left her children'
(13) Öde-n-miş borç
     pay-PASS-mIş debt
     'the paid debt'
(14) *Öde-miş borç
     pay-mIş debt
     *'the pay debt'

As expected, adjectives formed by intransitive verbs and the -mIş participle are more acceptable with unaccusatives compared to unergatives, as shown in (15) and (16).

(15) sol-muş / karar-mış çiçek
     wilt / blacken -mIş flower
     'The wilted/blackened flower'
(16) *sıçra-mış / yüz-müş / bağır-mış çocuk
     jump / swim / shout -mIş child
     'The jumped/swum/shouted child'

3.6 Impersonal Passivization

Impersonal passivization, used as a diagnostic to single out unergatives by some researchers, appears usable for Turkish as well. In Turkish, impersonal passives carry the phonologically conditioned passive suffix marker -Il, accompanied by an indefinite human interpretation and a resistance to agentive by-phrases. It has been suggested that the tense in which the verb appears affects the acceptability of impersonal passives: when the verb is in the aorist, the implicit subject has an arbitrary interpretation, i.e. either a generic or existential interpretation. On the other hand, in those cases when the verb is in the past tense, the implicit subject has a referential meaning, namely a first person plural reading. It was therefore suggested that impersonal passivization is a proper diagnostic environment only in the past tense, which was also adopted in the present study (Nakipoğlu-Demiralp, 2001, cf. Sezer, 1991). (17) and (18) exemplify impersonal passivization with the verb in the past tense.

[4] Examples in sections 3.3 and 3.4 are from Nakipoğlu (1998).

(17) Burada koşuldu.
     here run-PASS-PST
     'There was running here.' (existential interpretation)
(18) ??Bu yetimhanede büyündü.
     this orphanage-LOC grow-PASS-PST
     'It was grown in this orphanage.'

The diagnostics summarized above do not always pick out the same verbs in Turkish. For example, most diagnostics will fare well with the verbs düş- 'fall', gel- 'come', gir- 'enter' (with a human subject) just as well as impersonal passivization. In other words, these verbs are unaccusative according to most diagnostics and unergative according to impersonal passivization. The opposite of this situation also holds: the stative verb devam et- 'continue' is bad or marginally acceptable with most diagnostics as well as with impersonal passivization.

The conclusion is that in Turkish, acceptability judgments with the proposed diagnostic environments do not yield a clear distinction between unaccusative and unergative verbs. In addition, it is not clear which semantic properties these diagnostics are correlated with. The model described below is expected to provide some answers to these issues. It is based on native speaker judgments, but it goes beyond them by computationally showing that there are correspondences between semantic notions and syntactic manifestations of SI in Turkish. The model is presented below.

4 The Model

This study employs feed-forward artificial neural networks with a backpropagation algorithm as computational models for the analysis of the unaccusative/unergative distinction in Turkish.

4.1 Artificial Neural Networks and Learning Paradigms

An artificial neural network (ANN) is a computational model that can be used as a non-linear statistical data modeling tool. ANNs are generally used for deriving a function from observations, in applications where the data are complex and it is difficult to devise a relationship between observations and outputs by hand. ANNs are characterized by an interconnected group of artificial neurons, namely nodes. An ANN generally has three major layers of nodes: a single input layer, a single or multiple hidden layers, and a single output layer. In a feed-forward ANN, the outputs from all the nodes go to succeeding but not preceding layers.

There are three major learning paradigms that are used for training ANNs: supervised learning, unsupervised learning, and reinforcement learning. A backpropagation algorithm is a supervised learning method which is used for teaching a neural network how to perform a specific task. Accordingly, a feed-forward ANN with a backpropagation algorithm is a computational tool that models the relationship between observations and output by employing a supervised learning method (see Hertz et al., 1991; Anderson & Rosenfeld, 1988, among many others for ANNs). The following section presents how such an ANN is used for analyzing the unaccusative/unergative distinction in Turkish.
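As a minimal illustration of such a network (one hidden layer, sigmoid activations as in (23) below, weight updates by backpropagation), consider the following Python sketch. It is our own toy implementation, not the study's code: bias terms, the momentum term and the stopping criterion used in the study are omitted, and the hyperparameter values are placeholders.

    import math, random

    def sigmoid(x):
        # The activation function, cf. equation (23) below.
        return 1.0 / (1.0 + math.exp(-x))

    class TinyANN:
        # Single input layer, single hidden layer, single output node
        # (0 = unaccusative, 1 = unergative), as in DIAG and SEMANP.
        def __init__(self, n_in, n_hidden=3, lr=0.25):
            rnd = lambda: random.uniform(-0.5, 0.5)
            self.w_hidden = [[rnd() for _ in range(n_in)]
                             for _ in range(n_hidden)]
            self.w_out = [rnd() for _ in range(n_hidden)]
            self.lr = lr

        def forward(self, x):
            self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x)))
                      for row in self.w_hidden]
            self.o = sigmoid(sum(w * h for w, h in zip(self.w_out, self.h)))
            return self.o

        def train_step(self, x, target):
            o = self.forward(x)
            # Backpropagation of the squared error: output delta first,
            # then one delta per hidden node (using the old output weights).
            d_out = (target - o) * o * (1 - o)
            d_hid = [d_out * w * h * (1 - h)
                     for w, h in zip(self.w_out, self.h)]
            self.w_out = [w + self.lr * d_out * h
                          for w, h in zip(self.w_out, self.h)]
            self.w_hidden = [[w + self.lr * dh * xi
                              for w, xi in zip(row, x)]
                             for row, dh in zip(self.w_hidden, d_hid)]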

(19) sol-muş çiçek-ler
     wilt-mIş flower-PLU
     'wilted flowers'

(20) *sıçra-mış sporcu-lar
     jump-mIş sportsman-PLU
     'jumped sportsmen'

Accordingly, for the verb sol- 'wilt' the –mIş participle diagnostic provides the value 1 to one of the input nodes, whereas for the verb sıçra- 'jump, leap' it provides the value 0 to the corresponding input node. In this way, the syntactic diagnostics constituted an input pattern with eight members for each verb. (One of the syntactic diagnostics, the gerund suffix –ArAk, involves two verbs, i.e., the target and the matrix verb. Therefore, two sentences/phrases were formed, one with unaccusatives and the other with unergatives, which provided two binary values in the input pattern.) The construction of an input pattern is exemplified in (21) for the unergative verb konuş- 'talk'.

(21) A sample input pattern for DIAG.
a. *Adam konuşarak kızardı. :0
   the man talk-ArAk blush-PST
   'The man blushed by talking.'
b. Adam konuşarak yürüdü. :1
   the man talk-ArAk walk-PST
   'The man walked by talking.'
c. *Adam kadına çocuğu konuşturttu. :0
   'The man made the woman have the boy talked.'
d. Adam konuşurken yürüdü. :1
   the man talk-Irken walk-PST
   'The man walked while talking.'
e. Adam konuşunca yürüdü. :1
   the man speak-IncE walk-PST
   'The man walked when he talked.'
f. *Konuş-uk adam :0
   talk-Ik man
   'the talked man'
g. *Konuş-muş adam :0
   talk-mIş man
   'the talked man'
h. Törende konuşuldu. :1
   ceremony-LOC talk-PASS
   'It was spoken in the ceremony.'

Accordingly, the input pattern for the verb konuş- 'talk' is schematically shown below.

0 1 0 1 1 0 0 1

The Semantic Parameters Model (SEMANP): The input nodes for the SEMANP network were constituted by four binary values that represented the status of four semantic parameters (telicity, volitionality, dynamicity, and directed motion) for each verb. Each semantic parameter provided a binary value (either 0 or 1) to one of the input nodes. The values of the input nodes were determined by applying the following tests for the relevant semantic aspects: (1) the in/for an hour test for telicity (e.g., the phrase to talk *in/for an hour shows that the verb talk is atelic, whereas the phrase to wilt in/*for an hour shows that the verb wilt is telic), (2) the on purpose test for volitionality, (3) the hala 'still' test for dynamicity, and (4) the dative test (i.e., the acceptability of adding a dative term to the verb) for directed motion.

The construction of an input pattern for SEMANP is exemplified in (22) for the unaccusative verb sol- 'wilt'.

(22) A sample input pattern for SEMANP.
a. Telic :1
b. Non-volitional :0
c. Non-dynamic :0
d. No directed motion :0

Accordingly, the input pattern for the verb sol- 'wilt' is schematically shown below.

1 0 0 0

4.3 The Training Phase

The network was trained by providing patterns for 52 verbs that are recognized as unaccusatives in the SI literature or placed closer to the unaccusative end rather than the unergative end of the Auxiliary Selection Hierarchy (ASH, Sorace, 2000), and 52 verbs that are recognized as unergatives in the SI literature or placed closer to the unergative end rather than the unaccusative end of the ASH. As a result, a total of 104 input patterns, each composed of eight nodes, were used to train the DIAG model, and 104 input patterns, each composed of four nodes, were used to train the SEMANP model. The single output node was set to 0 if the verb with the given input pattern was unaccusative, and it was set to 1 if the verb was unergative. A supervised learning method was used, as employed by the backpropagation algorithm.

One hidden layer with a variable number of hidden units was used (see below for the analysis of model parameters). The sigmoid function, shown in (23), was used as the activation function.

(23) f(x) = 1 / (1 + e^(-x))

The number of maximum iterations per epoch was set to 20. The system sensitivity was defined by a global variable (ε = 0.01), which decided whether the loops in the code converge or not.
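The setup described in this section is small enough to reproduce directly. The following is a minimal sketch, not the authors' original code: a feed-forward network with one hidden layer, sigmoid activations as in (23), and plain backpropagation, trained on binary input patterns such as the eight-bit DIAG vector for konuş- 'talk' in (21). The hidden-layer size and convergence threshold follow the values reported in the paper, while the second training pattern and the training budget are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    # Activation function (23): f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

# Toy training data (illustrative, not the paper's 104 verbs):
# each row is an 8-bit DIAG pattern; target 0 = unaccusative, 1 = unergative.
X = np.array([[0, 1, 0, 1, 1, 0, 0, 1],   # konus- 'talk' -> unergative, cf. (21)
              [1, 0, 1, 0, 0, 1, 1, 0]])  # sol- 'wilt' -> unaccusative (pattern assumed)
y = np.array([[1.0], [0.0]])

rng = np.random.default_rng(0)
n_hidden = 3                         # initial hidden-layer size used in the paper
W1 = rng.normal(size=(8, n_hidden))  # input -> hidden weights
W2 = rng.normal(size=(n_hidden, 1))  # hidden -> single output node
eta = 0.25                           # learning-rate magnitude (paper reports eta = -0.25)

for step in range(2000):             # toy budget; the paper capped iterations per epoch at 20
    h = sigmoid(X @ W1)              # forward pass: hidden activations
    out = sigmoid(h @ W2)            # output node: 0 = unaccusative, 1 = unergative
    err = y - out
    if np.mean(err ** 2) < 0.01:     # convergence check, cf. epsilon = 0.01
        break
    # backpropagation: deltas use the sigmoid derivative f(x) * (1 - f(x))
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 += eta * h.T @ d_out
    W1 += eta * X.T @ d_hid

print((out > 0.5).astype(int))       # predicted class per verb
```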

4.4 The Test Phase

The DIAG and SEMANP models were tested by providing the following input patterns:

Group A: five verbs that are either recognized as unaccusatives in the SI literature or placed closer to the unaccusative end rather than the unergative end of the ASH.

Group B: five verbs that are either recognized as unergatives in the SI literature or placed closer to the unergative end rather than the unaccusative end of the ASH.

Group C: three verbs that are reported to exhibit variable behavior within the ASH.

After the training, the networks provided the binary outputs for the test verbs, which showed whether a test verb was unaccusative or unergative according to the models.

5 Results

The results are presented in the two sections below, separately for the DIAG model and for the SEMANP model.

5.1 The DIAG Model

After the training of the network and the optimization of the number of hidden units and the learning rate, the DIAG model classified all verbs in Group A as unaccusatives. The model also classified all Group-B verbs as unergatives. Finally, the model categorized the three Group-C verbs that were reported to show variable behavior (kana- 'bleed', parla- 'shine' and üşü- 'be, feel cold') as unaccusative verbs in Turkish.

The distribution of weights after the training showed that the –mIş participle received the highest weight, which indicates that the –mIş participle is the most reliable diagnostic for analyzing the unaccusative/unergative distinction in Turkish.

5.2 The SEMANP Model

The SEMANP model classified two of the Group-A verbs (namely, gir- 'enter' and yetiş- 'grow') as unaccusatives and the three remaining verbs (dur- 'remain, stay', kal- 'stall, stay' and varol- 'exist') as unergatives. The model also classified four of the five Group-B verbs (gül- 'laugh', sırıt- 'grin', söylen- 'mutter', yakın- 'complain') as unergatives and the remaining verb (yüz- 'swim') as unaccusative. Finally, the model categorized the three Group-C verbs (kana- 'bleed', parla- 'shine' and üşü- 'be, feel cold') as unaccusative verbs in Turkish.

The distribution of weights after the training showed that among the four semantic parameters selected in this study, telicity received the highest weight, which indicates that unaccusative and unergative verbs are most sensitive to the telicity aspect of the verb in Turkish.

5.3 Evaluation of Model Parameters

Four model design parameters, their initial values and their acceptable ranges after optimization are discussed below.

The number of hidden units: The number of hidden layers was set to 1 as a non-variable design parameter of the network. The initial number of hidden units was set to 3. Keeping the learning rate (η=-0.25) and the momentum term (λ=0.25) constant, the number of hidden units was adjusted and the behavior of the network was observed. The analyses showed that the optimum range for the number of hidden units was between 2 and 6.

The learning rate: The learning rate was initially set to η=-0.25. Keeping the number of hidden units (hidden_size=3) and the momentum term (λ=0.25) constant, adjusting the learning rate between η=-0.005 and η=-0.9 did not have an effect on the results.

The momentum term: The momentum term was initially set to λ=0.25. Keeping the number of hidden units (hidden_size=3) and the learning rate (η=-0.25) constant, adjusting the momentum term between λ=0.01 and λ=1.0 did not have an effect on the results. However, the system did not converge to a solution for a momentum term equal to or greater than λ=1.0.

6 Discussion

A major finding of the suggested model is that the predictions of the two models are compatible with the UH (Perlmutter, 1978) in that they divide most intransitive verbs into two, as expected. Furthermore, the differences between the decisions of the diagnostics-based DIAG model and the semantic-parameters-based SEMANP model reflect a reported finding in the unaccusativity literature, i.e., the tests used to differentiate between unaccusatives and unergatives do not uniformly delegate all verbs to the same classes (the question of why such mismatches occur in Turkish is beyond the scope of this study; see Sorace, 2000, and Randall, 2007, for some suggestions). More specifically, the three Group-A verbs that were predicted as unaccusative by the DIAG model and unergative by the SEMANP model (dur- 'remain, stay', kal- 'stall, stay' and varol- 'exist') are stative verbs, which are known to show inconsistent behavior in the literature and are classified as variable-behavior verbs by Sorace (2000). An unexpected finding is the Group-B verb yüz- 'swim', which is predicted as unergative by the DIAG model and unaccusative by the SEMANP model. This seems to reflect the role of semantic parameters other than telicity (namely, dynamicity and directed motion) in Turkish. The remaining nine of the thirteen tested verbs were predicted to be of the same type (either unaccusative or unergative) by both models.

Another finding of the model is the alignment between the most weighted syntactic diagnostic for the unaccusative/unergative distinction in Turkish, namely the –mIş participle, which received the highest weight after the training, and the most weighted semantic parameter, namely telicity.

7 Conclusion and Future Research

This study contributes to our understanding of the distinction in several respects. Firstly, it proposes a novel computational approach that tackles the unaccusative/unergative distinction in Turkish. The model confirms that a split between unaccusative and unergative verbs indeed exists in Turkish, but that the division is not clear-cut. The model suggests that certain verbs (e.g., stative verbs) behave inconsistently, as mentioned in most accounts in the literature. Moreover, the model reflects a correspondence between syntactic diagnostics and semantics, which supports the view that unaccusativity is semantically determined and syntactically manifested (Perlmutter, 1978; Levin & Rappaport-Hovav, 1995). Since this approach uses relevant language-dependent features, it is particularly applicable to languages that lack explicit syntactic diagnostics of SI.

The computational approach is based on the connectionist paradigm, which employs feed-forward artificial neural networks with a backpropagation algorithm. There are several dimensions in which the model will be developed further. First, the reliability of the input node values will be strengthened by conducting acceptability judgment experiments with native speakers, and the training of the model will be improved by increasing the number of verbs used for training. Acceptability judgments are influenced not only by verbs but also by other constituents in clauses or sentences; therefore, the input data will be improved to involve different senses of verbs under various sentential constructions. Second, alternative classifiers, such as decision trees and naïve Bayes, as well as classifiers that use discretized weights, may provide more informative accounts of the findings of SI in Turkish. These alternatives will be investigated in further studies.

Acknowledgements

We thank Cem Bozşahin and three anonymous reviewers for their helpful comments and suggestions. All remaining errors are ours.

References

Alexiadou, Artemis, Anagnostopoulou, Elena, and Everaert, Martin (eds.). 2004. The unaccusativity puzzle: Explorations of the syntax-lexicon interface. Oxford University Press.

Anderson, James, and Rosenfeld, Edward. 1988. Neurocomputing: Foundations of research. MIT Press.

Aranovich, Raúl (ed.). 2007. Split auxiliary systems. Amsterdam/Philadelphia: John Benjamins.

Burzio, Luigi. 1986. Italian syntax: A Government-Binding approach. Dordrecht: Reidel.

Hertz, John, Krogh, Anders, and Palmer, Richard G. 1991. Introduction to the theory of neural computation. Addison-Wesley, Massachusetts.

Hirakawa, Makiko. 1999. L2 acquisition of Japanese unaccusative verbs by speakers of English and Chinese. In The acquisition of Japanese as a second language, ed. Kazue Kanno, 89-113. Amsterdam: John Benjamins.

Hoekstra, Teun, and Mulder, René. 1990. Unergatives as copular verbs: Locational and existential predication. The Linguistic Review 7: 1-79.

Keller, Frank. 2000. Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. University of Edinburgh dissertation.

Kishimoto, Hideki. 1996. Split intransitivity in Japanese and the Unaccusative Hypothesis. Language 72: 248-286.

Legendre, Géraldine. 2007. On the typology of auxiliary selection. Lingua 117: 1522-1540.

Levin, Beth, and Rappaport-Hovav, Malka. 1995. Unaccusativity at the syntax-lexical semantics interface. MIT Press.

Lieber, R., and Baayen, H. 1997. A semantic principle of auxiliary selection in Dutch. Natural Language and Linguistic Theory 15: 789-845.

McFadden, Thomas. 2007. Auxiliary selection. Language and Linguistics Compass 1/6: 674-708.

Mithun, Marianne. 1991. Active/agentive case marking and its motivations. Language 67: 510-546.

Nakipoğlu, Mine. 1998. Split intransitivity and the syntax-semantics interface in Turkish. Minneapolis: University of Minnesota dissertation.

Nakipoğlu-Demiralp, Mine. 2001. The referential properties of the implicit arguments of impersonal passive constructions. In The verb in Turkish, ed. E. Eser Taylan, 129-150. John Benjamins.

Oshita, Hiroyuki. 1997. The unaccusative trap: L2 acquisition of English intransitive verbs. University of Southern California dissertation.

Özkaragöz, İnci Z. 1986. The relational structure of Turkish syntax. San Diego: University of California dissertation.

Perlmutter, David M. 1978. Impersonal passives and the Unaccusative Hypothesis. In Proceedings of the Fourth Annual Meeting of the Berkeley Linguistics Society, ed. Farrell Ackerman et al., 157-189.

Randall, J. 2007. Parameterized auxiliary selection: A fine-grained interaction of features and linking rules. In Split auxiliary systems, ed. Raúl Aranovich, 207-236. John Benjamins.

Richa, Srishti. 2008. Unaccusativity, unergativity and the causative alternation in Hindi: A minimalist analysis. New Delhi: Jawaharlal Nehru University dissertation.

Rosen, Carol G. 1984. The interface between semantic roles and initial grammatical relations. In Studies in relational grammar, eds. Carol G. Rosen and David M. Perlmutter. University of Chicago Press.

Sezer, Engin. 1991. Issues in Turkish syntax. Cambridge: Harvard University dissertation.

Sorace, Antonella. 2000. Gradients in auxiliary selection with intransitive verbs. Language 76: 859-890.

Sorace, Antonella, and Shomura, Yoko. 2001. Lexical constraints on the acquisition of split intransitivity: Evidence from L2 Japanese. Studies in Second Language Acquisition 23: 247-278.

van Valin, Robert D., Jr. 1990. Semantic parameters of split transitivity. Language 66: 221-260.

Zaenen, Annie. 1988. Unaccusative verbs in Dutch and the syntax-semantics interface. CSLI Reports 88-123, CSLI, Stanford University.

Zaenen, Annie. 1993. Unaccusativity in Dutch: Integrating syntax and lexical semantics. In Semantics and the lexicon, ed. James Pustejovsky. Dordrecht: Kluwer Academic Publishers.

A Preliminary Work on Causative Verbs in Hindi

Rafiya Begum, Dipti Misra Sharma
Language Technologies Research Centre, IIIT
[email protected], [email protected]

Abstract

This paper introduces a preliminary work on Hindi causative verbs: their classification, a linguistic model for their classification, and their verb frames. The main objective of this work is to come up with a classification of the Hindi causative verbs. In the classification we show how different types of Hindi verbs have different types of causative forms. The result is a linguistic resource for Hindi causative verbs which can be used in various NLP applications. This resource enriches the already available linguistic resource on Hindi verb frames (Begum et al., 2008b) and will be helpful in getting proper insight into Hindi verbs. In this paper, we present the morphology, semantics and syntax of the causative verbs. The morphology is captured by the word generation process, the semantics by the linguistic model followed for classifying the verbs, and the syntax by the verb frames using relations given by Panini.

1 Introduction

Verbs play a major role in expressing the meaning of a sentence and its syntactic behavior. They decide the number of participants that will participate in the action. Semantically, verbs are classified into action verbs, state verbs and process verbs. Syntactically, they are classified into intransitives, transitives and ditransitives. The morphological, semantic and syntactic properties of verbs play an important role in deeper-level analysis such as parsing.

Causative verbs are realized differently in different languages and have been an interesting area of study. The study of causative constructions is important as it involves the interaction of various components such as semantics, syntax and morphology (Comrie, 1981). This paper presents preliminary work on Hindi causative verbs.

2 Causative Verbs

Causative verbs mean that some actor makes somebody else do something or causes him to be in a certain state (Agnihotri, 2007). The causal verb indicates the causing of another to do something, instead of doing it oneself (Greaves, 1983). Semantically, causative verbs refer to a causative situation which has two components: (a) the causing situation, or the antecedent; (b) the caused situation, or the consequent. These two combine to make a causative situation (Nedyalkov and Silnitsky, 1973). There are different ways in which causation is indicated in different languages, and three types of causatives: morphological causatives, periphrastic causatives and lexical causatives (Comrie, 1981).

Morphological causatives indicate causation with the help of verbal affixes. Sanskrit, Hindi/Urdu, Persian, Arabic, Hebrew, Japanese, Khmer and Finnish have morphological causatives. Periphrastic causatives indicate causation with the help of a verb which occurs along with the main verb, as in the following English sentence:

(1) John made the child drink milk.

In the above example, the verb make expresses causation and occurs along with the verb drink, which in turn expresses the main action.

English, German and French are some of the languages which have periphrastic causatives. Lexical causatives are those in which there is no morphological similarity between the base verb root and the causative verb form; a different lexical item is used to indicate causation. For example, the causative of English eat is feed. English and Japanese have lexical causatives. English has both periphrastic and lexical causatives.

3 Causative Verbs in Hindi

Causatives in Hindi are realized through a morphological process. In Hindi, a base verb root changes to a causative verb when affixed by either an '-aa' or a '-vaa' suffix.

Base verb: so 'sleep' → First causal: sul-aa 'put to sleep' → Second causal: sul-vaa 'cause to put to sleep'

In each step of causative derivation there is an increase in the valency of the verb (Kachru, 2006; Comrie, 1981).

(2) baccaa soyaa
    child sleep.Pst
    'The child slept.'

(3) aayaa ne bacce ko sulaayaa
    maid Erg. child Acc. sleep.Caus.Pst
    'The maid put the child to sleep.'

(4) maa.N ne aayaa se bacce ko sulvaayaa
    mother Erg. maid by child Acc. sleep.Caus.Pst
    'Mother caused the maid to put the child to sleep.'

Hindi verbs are divided into two groups based on their behaviour in causative sentences: affective verbs and effective verbs (Kachru, 2006). The action of affective verbs benefits or affects the agent. Affective verbs have both first and second causal forms. Verbs such as ronaa 'to cry' and dau.Dnaa 'to run' are affective intransitive verbs. Only verbs belonging to the khaanaa 'to eat' class come under affective transitive verbs. The agent of an affective intransitive verb becomes the patient, and the agent of an affective transitive verb becomes the recipient, in the first causal; both take a ko postposition (Hindi case marker). Effective verbs and ditransitive verbs have only one causal form. The agent of an effective verb or a ditransitive verb becomes the causative agent in the first causal, and this causative agent takes a se 'with' postposition (Hindi case marker). Verbs belonging to the karnaa 'to do' class come under the effective verbs.

The major studies on Hindi causatives are Kachru (1966), Kachru (1980), Kachru (2006), McGregor (1995), Greaves (1983), Kellogg (1972), Agnihotri (2007), Sahaay (2004), Sahaay (2007), Singh (1997) and Tripathi (1986). Kachru (1966) gave a classification of Hindi verbs based on their causativization behavior; the others have mostly discussed the derivation process of the causative verbs.

However, the classification of causative verbs in Hindi remains an issue of discussion. Since the forms are morphologically related, the decision of what is the base verb form remains a point of discussion. There are two approaches followed in deciding the base verb: (1) causative formation based only on morphology, and (2) causative formation based on morphology and semantics.

I. Based on morphology:

Base verb (Intransitive): khul 'open' → First causal (Transitive): khol 'open' → Second causal (Causative): khulvaa 'cause to open'

II. Based on morphology and semantics:

Derived intransitive: khul 'open' ← Base verb (Transitive): khol 'open' → Second causal (Causative): khulvaa 'cause to open'

In I, khul 'open' is taken as the base verb, and khol 'open' and khulvaa 'cause to open' are derived from it by adding the suffixes '-aa' and '-vaa' respectively to the base verb (Kachru, 1966; Kachru, 1980). The arrow denotes the direction of derivation from the base verb; the forward arrow denotes the increment of one argument from the base form to the causal forms.

On the other hand, in II, khol 'open' is taken as the base verb. Here, other than morphology, the semantics of the verbs is also taken into consideration. Here khul 'open' and khulvaa 'cause to open' are derived from the base verb khol 'open': khulvaa 'cause to open' is a causative verb derived from the base verb by adding the suffix '-vaa' to it, and khul 'open' is a derived intransitive form. The agent of the base verb khol 'open' is not realized on the surface level of the derived intransitive verb khul 'open', though it is implied semantically. Here there is both forward and backward derivation. From the base verb to the derived intransitive it is a backward derivation, which means there is a reduction of one argument from the base verb to the derived intransitive verb (Tripathi, 1986; Reinhart, 2005).

In this paper, we motivate our work by presenting our approach for classifying the causative verbs in Hindi.

4 Our Approach

4.1 Linguistic Model for Classifying Causative Verbs

We have followed the Paninian grammatical framework in this model as the theoretical basis for our approach. The meaning of every verbal root (dhaatu) consists of: (a) activity (vyaapaara) and (b) result (phala). Activity denotes the actions carried out by the various participants (karakas) involved in the action. Result denotes the state that is reached when the action is completed (Bharati et al., 1995). The participants of the action expressed by the verb root are called karakas. There are six basic karakas, namely adhikarana 'location', apaadaan 'source', sampradaan 'recipient', karana 'instrument', karma 'theme' and karta 'agent' (Begum et al., 2008a). The mapping between karakas and theta roles is a rough mapping.

The karta karaka is the locus of the activity. Similarly, the karma karaka is the locus of the result. The locus of the activity implied by the verbal root can be animate or inanimate. Sentence (2), given above, is an example where the locus of the activity is animate. Sentence (5), given below, is an example where the locus of the activity is inanimate.

(5) darvaazaa khulaa
    door open.Pst
    'The door opened.'

(6) raam ne darvaazaa kholaa
    ram Erg. door open.Pst
    'Ram opened the door.'

(7) maiM ne raam se darvaazaa khulvaayaa
    I Erg. ram by door open.Caus.Pst
    'I made Ram open the door.'

When we come to the causatives, the notions of prayojak karta 'causer', prayojya karta 'causee' and madhyastha karta 'mediator causer' are introduced. The prayojak karta 'causer' is the initiator of the action. The prayojya karta 'causee' is the one who is made to do the action by the prayojak karta 'causer'. The madhyastha karta 'mediator causer' is the causative agent of the action. The karta of the base verb becomes the prayojya karta of the causative verb, and the prayojak karta of the first causative becomes the madhyastha karta of the second causative. This model takes both semantics and morphology into consideration.

4.1.1 Semantics

(8) caabii ne taalaa kholaa
    key Erg. lock open.Pst
    'The key opened the lock.'

(9) *raam ne caabii se taalaa khulvaayaa
    ram Erg. key by lock open.Caus.Pst
    'Ram caused the key to open the lock.'

(10) raam ne mohan dvaaraa caabii se taalaa khulvaayaa
     ram Erg. mohan by key with lock open.Caus.Pst
     'Ram made Mohan open the lock with the key.'

In (8), caabii 'key' is the karta of the transitive verb khol 'open'. caabii 'key' is an inanimate karta, so this sentence cannot be causativized. An attempt to causativize (8) is made in (9), which is unacceptable. (9) is actually interpreted as (10), where an inanimate noun with a se 'with' postposition acts as an instrument and not as a prayojya karta 'causee'.

So in (10), caabii 'key' is an inanimate noun and takes the se 'with' postposition, so caabii se 'with the key' acts as an instrument and mohan 'Mohan' acts as the prayojya karta 'causee' (Kachru, 1966). It seems that only those verbs can be causativized which take an animate karta.

Out of the two approaches given above, we follow approach II, where both morphology and semantics are considered. In our approach, only those base verbs can be causativized which take an animate karta that also has volitionality (Tripathi, 1986; Reinhart, 2005). Base verbs which take an inanimate karta cannot be causativized. So in our approach the prayojya karta 'causee' in a causative sentence is always animate, as the karta of the base verb becomes the prayojya karta 'causee' in the causative sentence. In our approach we also have the notion of karmakartri, which says that an intransitive can be derived out of a basic transitive verb, with the karma of the basic transitive verb becoming the karta of the derived intransitive verb. The karta of the derived intransitive verb is therefore called karmakartri. The derived intransitive verbs are like the unaccusative verbs of English.

In approach I, by contrast, intransitive base verbs that take either an animate or an inanimate karta can be causativized, but in the case of transitives, only base verbs which take an animate karta can be causativized. Ditransitives can also be causativized (Kachru, 1966; Kachru, 1980).

We follow the dependency tagging scheme proposed by Begum et al. (2008a) for the development of a dependency annotation for Indian languages. In this scheme, prayojak karta 'causer', prayojya karta 'causee' and madhyastha karta 'mediator causer' are represented as pk1, jk1 and mk1 respectively.

Some of the base verb forms and their causative sentences are given below, with the dependency relations marked in brackets for the appropriate arguments:

(11) raam ne(k1) seb(k2) khaayaa
     ram Erg. apple eat.Pst
     'Ram ate an apple.'

(12) siitaa ne(pk1) raam ko(jk1) seb(k2) khilaayaa
     sita Erg. ram Acc. apple eat.Caus.Pst
     'Sita fed Ram an apple.'

(13) maa.N ne(pk1) siitaa se(mk1) raam ko(jk1) seb(k2) khilvaayaa
     mother Erg. sita by ram Acc. apple eat.Caus.Pst
     'Mother caused Sita to feed Ram an apple.'

(14) naukar ne(k1) kaam(k2) kiyaa
     servant Erg. work do.Pst
     'The servant did the work.'

(15) maalik ne(pk1) naukar se(mk1) kaam(k2) karvaayaa
     master Erg. servant by work do.Caus.Pst
     'The master made the servant do the work.'

(16) raam ne(k1) siitaa ko(k4) kitaab(k2) dii
     ram Erg. sita Dat. book give.Pst
     'Ram gave a book to Sita.'

(17) mohan ne(pk1) raam se(jk1) siitaa ko(k4) kitaab(k2) dilaaii
     mohan Erg. ram by sita Dat. book give.Caus.Pst
     'Mohan made Ram give a book to Sita.'

(18) mujhko(k4a) chaa.Nd(k1) dikhaa
     I.Dat. moon appear.Pst
     'The moon became visible to me.'

(19) maiM ne(k1) chaa.Nd(k2) dekhaa
     I Erg. moon see.Pst
     'I saw the moon.'

(20) maiM ne(pk1) raam ko(jk1) chaa.Nd(k2) dikhaayaa
     I Erg. ram Dat. moon see.Caus.Pst
     'I showed the moon to Ram.'

(21) maiM ne(pk1) mohan se(mk1) raam ko(jk1) chaa.Nd(k2) dikhlaayaa
     I Erg. mohan by ram Dat. moon see.Caus.Pst
     'I made Mohan show the moon to Ram.'
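To make the pk1/jk1/mk1 scheme concrete, here is a small illustrative encoding of sentence (13) as word-relation pairs. This is our own sketch, not the annotation tool or format used by the authors; the tuple layout and the helper check are assumptions for exposition only.

```python
# Sentence (13): maa.N ne siitaa se raam ko seb khilvaayaa
# 'Mother caused Sita to feed Ram an apple.'
# Each tuple: (word, postposition, dependency relation to the verb)
sentence_13 = [
    ("maa.N",      "ne", "pk1"),   # causer (prayojak karta)
    ("siitaa",     "se", "mk1"),   # mediator causer (madhyastha karta)
    ("raam",       "ko", "jk1"),   # causee (prayojya karta)
    ("seb",        None, "k2"),    # theme (karma)
    ("khilvaayaa", None, "root"),  # second causal of khaa 'eat'
]

# The karta of the base verb surfaces as jk1 or mk1 under causativization,
# so the relation inventory alone indicates which causal step a clause is:
relations = {rel for _, _, rel in sentence_13}
is_second_causal = {"pk1", "mk1", "jk1"} <= relations
print(is_second_causal)  # True
```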

4.1.2 Morphology

In this section we give the derivation process of the Hindi causative verbs. We have studied 160 Hindi verbs and have come up with a set of rules for the derivation process of causative verbs. When causative affixes are added to the base verb roots, some of the base verb roots change in form and some do not. Various causal affixes are added to each verb type to form causatives. An example of affix addition for each verb type is discussed below; the changes in the base verb root are shown in both the root form and the causal form.

4.1.2.1 Type-1 and its causative forms

The suffix '-aa' is added to the verb root to form the first causal and '-vaa' to form the second causal.

No change in the root:
Chip 'hide' → Chip-aa 'hide' → Chip-vaa 'cause to hide'

Change in the root (aa → a):
naach 'dance' → nach-aa 'make someone dance' → nach-vaa 'cause to make someone dance'

4.1.2.2 Type-2 and its causative forms

The suffix '-aa' is added to the verb root to form the first causal and '-vaa' to form the second causal.

No change in the root:
likh 'write' → likh-aa 'dictate' → likh-vaa 'cause to dictate'

Change in the root (aa, ii → i; in addition, 'l' is inserted between the root and the causative suffix):
khaa 'eat' → khi-l-aa 'feed' → khi-l-vaa 'cause to feed'
pii 'drink' → pi-l-aa 'make someone drink' → pi-l-vaa 'cause to make someone drink'

4.1.2.3 Type-3 and its causative forms

The suffix '-vaa' is added to the verb root to form the first causal.

No change in the root:
khariid 'buy' → khariid-vaa 'cause to buy'

Change in the root (aa → a):
gaa 'sing' → ga-vaa 'cause to sing'

4.1.2.4 Type-4 and its causative forms

The suffix '-aa/-vaa' is added to the verb root to form the first causal.

No change in the root:
paros 'serve' → paros-vaa 'cause to serve'

Change in the root (e → i; in addition, 'l' is inserted between the root and the causative suffix):
de 'give' → di-l-aa / di-l-vaa 'cause to give'

In the case of type-5 and type-6 verbs, we can derive intransitive verbs out of transitive verbs. Here we have two types of word formation:

- causative formation,
- derived intransitive verb formation.

Causative derivation is the forward derivation, and intransitive derivation is the backward derivation.

4.1.2.5 Type-5 and its causative forms

The suffix '-aa' is added to the verb root to form the first causal and '-vaa' to form the second causal. In this verb type there is no example where the verb root form does not change.

Causative formation (change in the root: e → i):
dekh 'see' → dikh-aa 'show' → dikh-vaa 'cause to show'

Derived intransitive formation (change in the above root: i → e):
dikh 'appear' ← dekh 'see'

4.1.2.6 Type-6 and its causative forms

The suffix '-aa/-vaa' is added to the transitive verb root to form the first causal.

Causative formation (no change in the root):
bhar 'fill' → bhar-vaa / bhar-aa 'cause someone to fill'

Derived intransitive formation (no change in the root):
bhar 'to fill' ← bhar 'to fill'

Causative formation (change in the root: o → u; in addition, 'l' is inserted between the root and the causative suffix):
dho 'wash' → dhu-l-aa / dhu-l-vaa 'cause to wash'

Derived intransitive formation (change in the above root: u → o; 'l' is inserted at the end of the root):
dhu-l 'be washed' ← dho 'wash'

In the implementation of the causative verbs, the causative feature of a verb is reflected in the morph analysis. There are two possible ways to implement the causative information: (i) all causative verb roots are included in the root dictionary of the morph analyzer, with an additional feature marking the causative verb type; (ii) for all causative verbs the following information is marked: causative root, base root, verb type and causative suffix. In (i), the information about the base verb root from which the causative root is derived is missing; this is captured in (ii). Of the two ways, the latter therefore gives more detailed information than the former.
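The rules above are regular enough to state programmatically. The following sketch encodes a few of them for illustration; the transliteration scheme, the rule table and the function names are our own assumptions, not part of the paper's morph analyzer.

```python
# Illustrative derivation rules for a few verb types described above.

def vowel_shorten(root):
    # aa -> a in the root, e.g. naach -> nach, gaa -> ga (types 1 and 3)
    return root.replace("aa", "a", 1) if "aa" in root else root

def high_vowel_plus_l(root):
    # aa/ii -> i with linking 'l', e.g. khaa -> khil, pii -> pil (type 2)
    return root.replace("aa", "i").replace("ii", "i") + "l"

# verb type -> (root-change function, causal suffixes)
RULES = {
    "type-1": (vowel_shorten,     ["aa", "vaa"]),  # naach -> nachaa, nachvaa
    "type-2": (high_vowel_plus_l, ["aa", "vaa"]),  # khaa  -> khilaa, khilvaa
    "type-3": (vowel_shorten,     ["vaa"]),        # gaa   -> gavaa
}

def causal_forms(root, verb_type):
    change, suffixes = RULES[verb_type]
    stem = change(root)
    return [stem + s for s in suffixes]

print(causal_forms("naach", "type-1"))  # ['nachaa', 'nachvaa']
print(causal_forms("khaa", "type-2"))   # ['khilaa', 'khilvaa']
print(causal_forms("gaa", "type-3"))    # ['gavaa']
```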

4.2 Methodology of the Work

For this work, 160 base verbs were taken, their causative forms were given, and the verbs were classified. Rules for deriving causative verb forms from their base forms were formulated. Verb frames for the base verbs and their causative forms were developed. Based on the analysis of the base verbs, certain problem cases were logged and generalizations regarding causativization were made. In this paper, we briefly discuss all the points mentioned above.

4.3 Classification of Hindi Causative Verbs

Here Hindi verbs are classified into six types based on their causativization behavior:

- Type-1: Basic intransitive verb

A basic intransitive verb has two causal forms, i.e., a first causal and a second causal form. The first causal of a basic intransitive verb functions as a transitive verb. The subject of the basic intransitive verb becomes the object of the transitive verb, i.e., the first causal form, and the subject of the first causal form becomes the causative agent of the second causal form. Sentence (2) is an example of a basic intransitive, and sentences (3) and (4) are its causative forms.

- Type-2: Basic transitive verb type-I (similar to the khaanaa 'to eat' verb type given by Kachru (1966))

- Type-3: Basic transitive verb type-II (similar to the karnaa 'to do' verb type given by Kachru (1966))

Type-2 and type-3 are transitive verbs which are divided into two types based on their causativization behavior. Basic transitive verbs of type-I, like khaanaa 'to eat', have two causal forms.

The first causal of a khaanaa 'to eat' type verb also functions as a ditransitive. Transitive verbs of type-II, like karnaa 'to do', have one causal form, and the first causal of a karnaa 'to do' type verb functions as a causative. Sentences (11-13) are examples of type-2 verbs, and sentences (14-15) are examples of type-3 verbs.

- Type-4: Basic ditransitive verb

Ditransitive verbs also have one causal form. Sentences (16-17) are examples of type-4 verbs.

- Type-5: Basic transitive verb type-I, out of which intransitive verbs can be derived, and which takes a dative subject.

- Type-6: Basic transitive verb type-II, out of which intransitive verbs can be derived.

Type-5 and type-6 are transitive verbs whose causal forms depend on whether the verb is a type-I (khaanaa 'to eat') transitive or a type-II (karnaa 'to do') transitive; in addition, both have a derived intransitive form. Type-5 takes a dative subject in the base form. Sentences (18-21) are examples of type-5 verbs, and sentences (5-7) are examples of type-6 verbs. Other than the four classes identified by Kachru (1966), we thus have two extra classes, i.e., type-5 and type-6. An example for each verb type in the classification is given below:

Type-1:
Base verb: so 'sleep' → First causal: sulaa 'put to sleep' → Second causal: sulvaa 'cause to put to sleep'

Type-2:
Base verb: khaa 'eat' → First causal: khilaa 'feed' → Second causal: khilvaa 'cause to feed'

Type-3:
Base verb: kar 'do' → First causal: karaa/karvaa 'cause to do'

Type-4:
Base verb: de 'give' → First causal: dilaa/dilvaa 'cause to give'

Type-5:
Derived intransitive: dikh 'appear' ← Base verb: dekh 'see' → First causal: dikhaa 'show' → Second causal: dikhvaa 'cause to show'

Type-6:
Derived intransitive: khul 'open' ← Base verb: khol 'open' → First causal: khulvaa 'cause to open'

5 Verb Frames

In this section we list the syntactic frames for all the causative types discussed in the previous sections. A verb frame (Begum et al., 2008b) is given for the base form and for its first and second causal forms. For ease of exposition, we show below only the relevant information of a verb frame; components not necessary for the present discussion have been left out. The structure of a verb frame is given in terms of dependency relation, postposition (Hindi case marker) and TAM. We have taken the past tense in the TAM (yaa is the past tense marker). Refer to the examples given above for each type of causative for a better understanding of the frames.

I. Frame of Type-1 and its causative forms:

Relation-Postposition | TAM
(a) k1-0 | yaa
(b) pk1-ne jk1-ko | yaa
(c) pk1-ne mk1-se jk1-ko | yaa

II. Frame of Type-2 and its causative forms:

Relation-Postposition | TAM
(a) k1-ne k2-0 | yaa
(b) pk1-ne jk1-ko k2-0 | yaa
(c) pk1-ne mk1-se jk1-ko k2-0 | yaa

III. Frame of Type-3 and its causative forms:

Relation-Postposition | TAM
(a) k1-ne k2-0 | yaa
(b) pk1-ne jk1-se k2-0 | yaa

IV. Frame of Type-4 and its causative forms:

Relation-Postposition | TAM
(a) k1-ne k4-ko k2-0 | yaa
(b) pk1-ne jk1-se k4-ko k2-0 | yaa

V. Frame of Type-5 and its causative forms:

Relation-Postposition | TAM
(a) k4a-ko k1 | aa
(b) k1-ne k2-0 | aa
(c) pk1-ne jk1-ko k2-0 | yaa
(d) pk1-ne mk1-se jk1-ko k2-0 | yaa

VI. Frame of Type-6 and its causative forms:

Relation-Postposition | TAM
(a) k1-0 | aa
(b) k1-ne k2-0 | aa
(c) pk1-ne jk1-se k2-0 | yaa
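Frames of this kind are straightforward to store as data once the notation is fixed. Below is a small illustrative representation of frames I and II as Python structures, under our own assumed encoding (relation-postposition pairs plus a TAM marker); this is not the format used in the authors' resource.

```python
# One frame entry = list of (relation, postposition) pairs plus the TAM suffix.
# "0" marks an unmarked (postposition-less) argument, as in the tables above.
VERB_FRAMES = {
    "type-1": [
        ([("k1", "0")],                                 "yaa"),  # base: baccaa soyaa (2)
        ([("pk1", "ne"), ("jk1", "ko")],                "yaa"),  # first causal (3)
        ([("pk1", "ne"), ("mk1", "se"), ("jk1", "ko")], "yaa"),  # second causal (4)
    ],
    "type-2": [
        ([("k1", "ne"), ("k2", "0")],                   "yaa"),  # (11)
        ([("pk1", "ne"), ("jk1", "ko"), ("k2", "0")],   "yaa"),  # (12)
        ([("pk1", "ne"), ("mk1", "se"), ("jk1", "ko"), ("k2", "0")], "yaa"),  # (13)
    ],
}

# Valency grows by one at each causal step, as noted in section 3:
for vtype, frames in VERB_FRAMES.items():
    print(vtype, [len(args) for args, _ in frames])  # e.g. type-1 [1, 2, 3]
```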

6 Issues and Observations

There are some verbs which cannot be causativized. Motion verbs like aa 'come' and jaa 'go' cannot be causativized. After analysing a certain amount of corpus data, we observed that not all motion verbs behave like these two. aanaa 'to come' and jaanaa 'to go' cannot be causativized because these verbs always occur as main verbs and take the following verbs as manner adverbs: chalnaa, bhaagnaa, daudnaa. For instance, chalkar aayaa 'came running' and daudkar gayaa 'went running'. Those motion verbs which occur as manner adverbs and modify another motion verb can be causativized, whereas those which occur as main verbs and never occur as manner adverbs of another motion verb cannot be causativized. Natural process verbs like khil 'blossom', garajnaa 'thunder' and ug 'rise' also cannot be causativized.

There are three types of the verb nikal 'leave'. All three are used as intransitives.

- Derived intransitive (sense: drain out):

(22) paanii kamre se nikal gayaa
     water room from leave go.Pst.
     'Water drained out of the room.'

- Basic intransitive (sense: walked out):

(23) raam kamre se baahar nikal gayaa
     ram room from out leave go.Pst.
     'Ram walked out of the room.'

- Basic intransitive which involves a natural process:

(24) gangaa gangotrii se nikaltii hai
     ganga gangotri from emerge.Impf. be.Pres.
     'The Ganga emerges from Gangotri.'

The first type is a derived intransitive, derived from the base transitive verb nikaal 'remove'; this base transitive verb root can be causativized. The second type is a basic intransitive which can also be causativized. The third type, which denotes a natural process, cannot be causativized. This shows how important the property of animacy is for making causatives.

7 Conclusion and Future Work

In this work we flesh out the linguistic devices that are at work in causativization. In this paper we have introduced a preliminary work on Hindi causative verbs. We have given the classification of causative verbs and the linguistic model followed for their classification. We have also given the verb frames of the causative verbs. These insights have been incorporated in the Hindi dependency treebank (Bhatt et al., 2009). We also plan to use the verb frames in a Hindi dependency parser (Bharati et al., 2009) to improve its performance.

References

Agnihotri, Rama K. 2007. Hindi, An Essential Grammar. Routledge, London and New York, pp. 121-126.

Balachandran, Lakshmi B. 1973. A Case Grammar of Hindi. Central Institute of India, Agra.

Begum, Rafiya, Samar Husain, Arun Dhwaj, Dipti M. Sharma, Lakshmi Bai and Rajeev Sangal. 2008. Dependency Annotation Scheme for Indian Languages. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India.

Begum, Rafiya, Samar Husain, Dipti M. Sharma and Lakshmi Bai. 2008. Developing Verb Frames in Hindi. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco.

Bharati, Akshar, Vineet Chaitanya, and Rajeev Sangal. 1995. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India, New Delhi, pp. 65-106.

Bharati, Akshar, Samar Husain, Dipti Misra Sharma and Rajeev Sangal. 2009. Two stage constraint based hybrid approach to free word order language dependency parsing. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT09), Paris.

Bhatt, Rajesh, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma and Fei Xia. 2009. A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu. In the Third Linguistic Annotation Workshop (The LAW III) in conjunction with ACL/IJCNLP 2009, Singapore.

Comrie, Bernard. 1981. Language Universals and Linguistic Typology: Syntax and Morphology. The University of Chicago Press.

Greaves, Edwin. 1983. Hindi Grammar. Asian Educational Services, New Delhi, pp. 301-313.

Kachru, Yamuna. 1966. Introduction to Hindi Syntax. University of Illinois, Illinois, pp. 62-81.

Kachru, Yamuna. 1980. Aspects of Hindi Grammar. Manohar Publications.

Kachru, Yamuna. 2006. Hindi. London Oriental and African Language Library.

Kale, M. R. 1960. A Higher Sanskrit Grammar. Motilal Banarasidass, Delhi.

Kellogg, S. H. 1972. A Grammar of the Hindi Language. Munshiram Manoharlal, New Delhi, pp. 253-257.

McGregor, R. S. 1995. Outline of Hindi Grammar. Oxford University Press, pp. 118-125.

Nedyalkov, Vladimir P., and Georgij G. Silnitsky. 1973. The typology of morphological and lexical causatives. In F. Kiefer (ed.), Trends in Soviet Theoretical Linguistics, Foundations of Language, Supplementary Series 18, Dordrecht, 1-32.

Reinhart, Tanya. 2005. Causativization and Decausativization. Unpublished technical report, LSA 2005.

Sahaay, Chaturbhuj. 2004. Hindi ke Muul Vaakya Saanche. Kumar Prakaashan, Agra, pp. 91-100.

Sahaay, Chaturbhuj. 2007. Hindi Padvigyaan. Kumar Prakaashan, Agra.

Shastri, Charu D. 1990. Panini: Re-Interpreted. Motilal Banarasidass, Delhi.

Singh, Rajendra, and Rama K. Agnihotri. 1997. Hindi Morphology: A Word-based Description. Motilal Banarasidass, Delhi.

Tripathi, Acharya R. 1986. Hindi Bhashaanushasan. Bihar Granth Academy, Patna.

A Supervised Learning based Chunking in Thai using Categorial Grammar

Thepchai Supnithi, Peerachet Porkaew, Taneth Ruangrajitpakorn, Kanokorn Trakultaweekool
Human Language Technology, National Electronics and Computer Technology Center
{thepchai.sup, peerachet.por, taneth.rua, kanokorn.tra}@nectec.or.th

Chanon Onman, Asanee Kawtrakul
Department of Computer Engineering, Kasetsart University and National Electronics and Computer Technology Center
[email protected], [email protected]

Abstract

One of the challenging problems in Thai NLP is to manage the syntactic analysis of long sentences. This paper applies conditional random fields and categorial grammar to develop a chunking method which can group words into units larger than words. The experiments show impressive results: we gain around 74.17% on sentence-level chunking. Furthermore, we obtain more correctly parsed trees with our technique; around 50% more trees can be added. Finally, we address the problem of implicit sentential NPs, one of the difficult issues in Thai language processing: 58.65% of sentential NPs are correctly detected.

1 Introduction

Recently, chunking, or shallow parsing, has been applied to many languages using supervised learning approaches. Basili (1999) utilized clause boundary recognition for shallow parsing. Osborne (2000) and McCallum et al. (2000) applied a Maximum Entropy tagger for chunking. Lafferty (2001) proposed Conditional Random Fields (CRF) for sequence labeling; CRF can be recognized as a model that is able to reach a global optimum, while other sequential classifiers focus on making the best local decision. Sha and Pereira (2003) compared CRF to other supervised learning methods in the CoNLL task and achieved results better than other approaches. Molina et al. (2002) improved the accuracy of an HMM-based shallow parser by introducing specialized HMMs.

In Thai language processing, much research focuses on the fundamental levels of NLP, such as word segmentation and POS tagging. For example, Kruengkrai et al. (2006) introduced CRF for word segmentation and POS tagging trained over the Orchid corpus (Sornlertlamvanich et al., 1998). However, the tagged texts in Orchid are specific to technical reports, which makes them difficult to apply to other domains such as news or general documents. Furthermore, very little research has been done on other fundamental tools, such as chunking, unknown word detection and parsing. Pengphon et al. (2002) analyzed chunks of noun phrases in Thai for an information retrieval task. All of these works assume that sentence segmentation has already been done in the corpus. Since Thai has no explicit sentence boundary, defining a concrete concept of a sentence break is extremely difficult.

Most sentence segmentation research concentrates on "space" and applies to the Orchid corpus (Meknavin 1987, Pradit 2002). Because of ambiguities in the use of space, the accuracy is not impressive when applied to a real application. Consider the following paragraph, a practical usage example from news:


"สําหรับการวางกําลังของคนเสือแดง ได้มีการวางบังเกอร์โดยรอบพืนทีชุมนุม into a larger unit than word effectively?" We are และใช้นํามันราด | รวมทังมียางรถยนต์ | ขณะการจราจรยังเปิดเป็นปกติ" looking at the possibility of combining words lit: “The red shirts have put bunkers around into a larger grain size. It enables the system to the assembly area and put oil and tires. The understand the complicate structure in Thai as traffic is opened normally.” explained in the example. Chunking approach in this paper is closely similar to the work of Sha We found that three events are described in and Pereira (2003). Second question is "How to this paragraph. We found that both the first and analyze the compound noun structure in Thai?" second event do not contain a subject. The third event does not semantically relate to the previ- Thai allows a compound construction for a noun ous two events. With a literal translation to Eng- and its structures can be either a sequence of lish, the first and second can be combined into nouns or a combination of nouns and verbs. The one sentence; however, the third events should second structure is unique since the word order be separated. is as same as a word order of a sentence. We call this compound noun structure as a “senten- As we survey in BEST corpus (Kosawat tial NP”. 2009), a ten-million word Thai segmented cor- pus. It contains twelve genres. The number of Let us exemplify some Thai examples related to word in sentence is varied from one word to compound word and serial construction problem 2,633 words and the average word per line is in Figure 1. The example 1 shows a sentence 40.07 words. Considering to a News domain, which contains a combination of nouns and which is the most practical usage in BEST, we verbs. It can be ambiguously represented into found that the number of words are ranged from two structures. The first alternative is that this one to 415 words, and the average word length sentence shows an evidence of a serial verb in sentence is 53.20. It is obvious that there is a construction. The first word serves as a subject heavy burden load for parser when these long of the two following predicates. Another alter- texts are applied. native is that the first three word can be formed together as a compound noun and they refer to Example 1: “a taxi driver” which serve as a subject of the คน ขับ รถแท็กซี พบ กระเป๋าสตางค์ following verb and noun. The second alternative is more commonly used in practical language. man(n) drive(v) taxi(n) find(v) wallet(n) However, to set the “N V N” pattern as a noun can be very ambiguous since in the example 1 can be formed a sentential NP from either the

first three words or the last three words. lit1: A man drove a taxi and found a wallet. From the Example 2, an auxiliary verb serial- lit2: A taxi chauffeur found a wallet. ization is represented. It is a combination of Example 2: auxiliary verbs and verb. The word order is น่า จะ ต้อง สามารถ พัฒนา ประเทศ shown in Aux Aux Aux Aux V N sequence. should will must can develop(v) country(n) The given examples show complex cases that require chunking to reduce an ambiguity while Thai text is applied into a syntactical analysis lit: possibly have to develop country. such as parsing. Moreover, there is more chance to get a syntactically incorrect result from either rule-based parser or statistical parser with a high Figure 1. Examples of compounds in Thai amount of word per input.

Two issues are raised in this paper. The first This paper is organized as follows. Section 2 question is "How to separate a long paragraph explains Thai categorial grammar. Section 3

130 illustrates CRF, which is supervised method example, intransitive verb is needed to combine applied in this work. Section 4 explains the with a subject to complete a sentence therefore methodology and experiment framework. Sec- intransitive verb is written as s\np which means tion 5 shows experiments setting and result. Section 6 shows discussion. Conclusion and future work are illustrated in section 7.

2 Linguistic Knowledge

2.1 Categorial Grammar

Categorial grammar (a.k.a. CG or classical categorial grammar) (Ajdukiewicz, 1935; Bar-Hillel, 1953; Carpenter, 1992; Buszkowski, 1998; Steedman, 2000) is a formalism in natural language syntax motivated by the principle of compositionality and organized according to syntactic elements. The syntactic elements are categorized in terms of their ability to combine with one another to form larger constituents as functions, or according to a function-argument relationship. All syntactic categories in CG are distinguished by a syntactic category identifying them as one of the following two types:

1. Argument: this type is a basic category, such as s (sentence) and np (noun phrase).

2. Functor (or function category): this category type is a combination of arguments and the operators '/' and '\'. A functor is marked on a complex constituent that requires an argument to complete a sentence, such as s\np (intransitive verb), which requires a noun phrase on its left to complete a sentence.

CG captures the same information by associating a functional type or category with all grammatical entities. The notation α/β denotes a rightward-combining functor over a domain β into a range α; the notation α\β denotes a leftward-combining functor over β into α. α and β are both argument syntactic categories (Hockenmaier and Steedman, 2002; Baldridge and Kruijff, 2003).

The basic concept is to find the core of the combination and to replace the grammatical modifier and complement with a set of categories, based on the same concept as fractions. For example, an intransitive verb needs to combine with a subject to complete a sentence; therefore, an intransitive verb is written as s\np, which means it needs a noun phrase on its left side to complete a sentence. If a noun phrase exists on the left side, the rule of fraction cancellation is applied: np * s\np = s. With CG, each constituent is annotated with its own syntactic category as its function in the text. Currently there are 79 categories for Thai. An example of a CG derivation for Thai is shown in Figure 2.

Figure 2. Example of a Thai CG-parsed tree.

2.2 CG-Set

CG-sets are used as features when no CG is tagged on the input. We aim to apply our chunker to real-world applications; therefore, in case we have only a sentence without CG tags, we use the CG-set instead.

Cat-Set Index | Cat-Set Member | Example words
0   | np | คุณสมบัติ
2   | s\np/pp, s\np/np, s\np/pp/np, s\np | เก็บ, กรอง
3   | (np\np)/(np\np), ((s\np)\(s\np))/spnum, np, (np\np)\num, np\num, (np\np)/spnum, ((s\np)\(s\np))\num | วงจร, สัญญาณ
62  | (s\np)\(s\np), s\s | มัย, มัง, ล่ะ
134 | np/(s\np), np/((s\np)/np) | การ, ความ

Table 1. Examples of CG-sets.
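The fraction-cancellation rule lends itself to a few lines of code. The following toy implementation of backward and forward application is our own illustration; the plain-string category encoding and the function names are assumptions, not part of the authors' GLR parser.

```python
def backward_apply(left, functor):
    # np * s\np = s : a functor "a\b" consumes category b on its left.
    if "\\" in functor:
        result, arg = functor.split("\\", 1)
        if arg == left:
            return result
    return None

def forward_apply(functor, right):
    # "a/b" consumes category b on its right, e.g. np/np * np = np.
    if "/" in functor:
        result, arg = functor.split("/", 1)
        if arg == right:
            return result
    return None

print(backward_apply("np", "s\\np"))  # 's'  -> np * s\np = s
print(forward_apply("np/np", "np"))   # 'np'
```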

The concept of the CG-set is to group together words whose sets of possible CGs are equivalent; every word is thus assigned to exactly one CG-set. By using CG-sets, we can use a lookup table for tagging the input. Table 1 shows examples of CG-sets. Currently, there are 183 CG-sets.

3 Conditional Random Fields (CRF)

CRF is an undirected graphical model in which each vertex represents a random variable whose distribution is to be inferred, and each edge represents a dependency between two random variables. It is a supervised framework for labeling sequence data, such as POS tagging and chunking. Let X be a random variable over observed input sequences, such as sequences of words, and Y a random variable over label sequences corresponding to X, such as sequences of POS tags or CGs. The most probable label sequence ŷ can be obtained by

ŷ = argmax_y p(y|x)

where x = x1, x2, ..., xn and y = y1, y2, ..., yn, and p(y|x) is the conditional probability distribution of a label sequence given an input sequence. CRF defines p(y|x) as

P(y|x) = (1/Z) exp( Σ_{i=1..n} F(y, x, i) )

where Z = Σ_y exp( Σ_{i=1..n} F(y, x, i) ) is a normalization factor over all state sequences. F(y, x, i) is the global feature vector of the CRF for the sequences x and y at position i, and can be calculated as a summation of local features:

F(y, x, i) = Σ_i λ_i f_i(y_i, y_{i-1}, t) + Σ_j λ_j g_j(x, y, t)

Each local feature consists of a transition feature function f_i(y_i, y_{i-1}, t) and a per-state feature function g_j(x, y, t), where λ_i and λ_j are the weight vectors of the transition and per-state feature functions respectively.

The parameters of the CRF can be calculated by maximizing the likelihood function on the training data. The Viterbi algorithm is normally applied to search for the most suitable output.
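As a sanity check on the formula, the unnormalized score exp(Σ_i F(y, x, i)) can be computed directly once the feature functions and weights are fixed. The sketch below uses two made-up indicator features and made-up weights purely to illustrate the decomposition into transition and per-state terms.

```python
import math

# Hypothetical indicator features over a toy label set {"B", "I"}.
def f_trans(y_cur, y_prev):
    # transition feature: fires when a chunk continues
    return 1.0 if (y_prev, y_cur) == ("B", "I") else 0.0

def g_state(x, y, i):
    # per-state feature: fires when a known chunk-initial word is labeled B
    return 1.0 if (x[i] == "เขา" and y[i] == "B") else 0.0

LAMBDA_F, LAMBDA_G = 0.8, 1.2    # illustrative weights

def unnormalized_score(x, y):
    total = 0.0
    for i in range(len(x)):
        if i > 0:
            total += LAMBDA_F * f_trans(y[i], y[i - 1])
        total += LAMBDA_G * g_state(x, y, i)
    return math.exp(total)       # exp(sum_i F(y, x, i))

x = ["เขา", "สวม", "เสือ"]
print(unnormalized_score(x, ["B", "I", "I"]))
# Z would sum this quantity over all 2^3 label sequences of length 3.
```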

4 Methodology

Figure 3 shows the methodology of our experiments. To prepare the training set, we start with our corpus annotated with CG tags. Each sentence in the corpus was then parsed by our Thai CG parser, developed with the GLR technique.

Figure 3. Experimental framework.

132 our Thai CG parser, developed by GLR tech- with five chunk types. "NP" stands for noun technique. However, not all sentences can be phrase, "VP" stands for verb phrase, "PP" stands parsed successfully due to the complexity of the for preposition phrase, "ADVP" stands for sentence. We kept parsable sentences and adverb phrase and S-BAR stands for unparsable sentences separately. The parsable complementizer that link two phrases. sentences were selected to be the training set. Surface and CG-set are developed from CG There are four features – surface, CG, CG-set dictionary. CG is retrieved from CG tagged and chunk marker – in our experiments. CRF is corpus. IOB is developed by parsing tree. We applied using 5-fold cross validation over apply Thai CG parser to obtain the parsed tree. combination of these features. Accuracy in term Figure 4 shows an example of our prepared of averaged precision and recall are reported. data. We provide 4,201 sentences as a training data in CRF to obtain a chunked model. In this We select the best model from the experiment experiment, we use 5-fold cross validation to to implement the chunker. To investigate evaluation the model in term of F-measure. performance of the chunker, we feed the unparsable sentences to the chunker and surface cg_set cg chunk_label evaluate them manually. ใน 74 s/s/np BADVP After that, the sentences which are correctly วัน 3 np IADVP chunked will be sent to our Thai CG parser. We ที 180 (np\np)/(s\np) IADVP calculate the number of successfully-parsed ไม ่ 54 (s\np)/(s\np) IADVP sentences and the number of correct chunks. หนาว 7 s\np IADVP 5 Experiment Settings and Results หรือ 130 ((s/s)\(s/s))/(s/s) IADVP ใน 74 s/s/np IADVP 5.1 Experiment on chunking ฤดูร้อน 0 np IADVP 5.1.1 Experiment setting เขา 0 np BNP To develop chunker, we apply CG Dictionary สวม 8 s\np/np BVP and CG tagged corpus as input. Four features เสือ 0 np BNP are provided to CRF. Surface is a word surface. มา 148 (s\np)/(s\np) BVP CG is a categorial grammar of the word. CG-set is a combination of CG of the word. IOB เข้าเฝ้า 2 s\np IVP represents a method to mark chunk in a sentence. "I" means "inner" which represents Figure 4 An example of prepared data the word within the chunk. "O" means "outside" which represents the word outside the chunk. "B" means "boundary" which represents the word as a boundary position. It accompanied

Table 2 Chunking accuracy of each chunk

Table 3 Chunking accuracy based on word and sentence.

5.1.2 Experiment result

From Table 2, considering the chunk-based level, we found that CG gives the best result among surface, CG-set, CG and their combinations. The average over the three types in terms of F-measure is 86.20. Analyzing the information in detail, we found that NP, VP and PP show the same pattern: using CG, the F-measures for them are 81.15, 90.96 and 99.56 respectively. From Table 3, considering both the word level and the sentence level, we obtained similar results: CG gives the best results, with an F-measure of 93.24 at word level and 74.17 at sentence level. This is evidence that CG plays an important role in improving chunking accuracy.

5.2 Experiment on parsing

5.2.1 Experiment setting

We investigate the improvement of parsing on the unparsable sentences. There are 14,885 unparsable sentences from our CG parser. These sentences are input to the chunking model to obtain a chunked corpus. We manually evaluate the results: linguists label each chunked output with one of three types, where 0 means an incorrect chunk, 1 means a correct chunk, and 2 represents a special case for Thai NP, a sentential NP.

5.2.2 Experiment result

From the experiment, we obtained an impressive result: 11,698 sentences (78.59%) changed from unparsable to parsable, and only 3,187 (21.41%) remain unparsable. We manually evaluated the parsable sentences by randomly selecting 7,369 of them; linguists found 3,689 correct sentences (50.06%). In addition, we investigated the number of parsable chunks calculated from the parsable results, and found 37,743 correct chunks out of 47,718 chunks (78.47%). We also classified the chunks into three types, NP, VP and PP, obtaining accuracies of 79.14%, 74.66% and 92.57% respectively.
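The chunk-level scores reported here can be computed from span sets of the kind produced by the extraction sketch above. A minimal sketch, under the assumption that a predicted chunk counts as correct only when both its type and its boundaries match a gold chunk:

def chunk_prf(gold_chunks, predicted_chunks):
    # gold/predicted chunks are iterables of (type, start, end) spans
    gold, pred = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f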

6 Discussion

6.1 Error analysis

From the experiment results, we found the following errors.

6.1.1 Chunking Type missing

Some chunks are assigned the wrong type in the experiment results. For example, in [PP บันทึก (record)][NP ตัวอักษรได้ประมาณ (character about)], [PP บันทึก (record)] should be labeled VP instead of PP.

Figure 4 An Example of sentential NP

6.1.2 Over-grouping

In the sentence "[VP ใช้ (using)][NP (medicine)][VP รักษา (treat)][NP โรคแต่ละครังต้องเป็นไป (each disease has to)][PP ตาม (follow)][NP คําแนะนําของแพทย์ (doctor's instruction)]", we found that the NP โรคแต่ละครังต้องเป็นไป (each disease has to) is over-grouped. It is necessary to break it down into the NP โรคแต่ละครัง (each disease) and the VP ต้องเป็นไป (has to). The reason for this error is that the sentential structure NP VP NP is allowed, and the NP and VP are then combined.

6.1.3 Sentential NP

We investigated the number of sentential NPs. If the number of chunks equals 1, the sentence should not be recognized as an NP; other cases are defined as NPs. We found that 929 out of 1,584 sentences (58.65%) are correct sentential NPs. This shows impressive results for solving implicit NPs in Thai. Figure 4 shows an example of a sentential NP.

6.1.4 CG-set

Since the CG-set is another representation of a word and can be detected from the CG dictionary alone, it is very easy to develop a tag sequence using CG-sets. We found that CG-set is more powerful than surface; it might be an alternative in situations where language resources are scarce.

6.2 The Effect of Linguistic Knowledge on Chunking

Since CG is a formalism of natural language syntax motivated by the principle of constituency and organised according to syntactic elements, we would like to find out whether linguistic knowledge affects the model. We grouped the 89 categorial grammars into 17 groups, called CG-17, categorized into Noun, Prep, Noun Modifier, Number modifier for noun, Number modifier for verb, Number, Clause Marker, Verb with no argument, Verb with 1 argument, Verb with 2 or more arguments, Prefix noun, Prefix predicate, Prefix predicate modifier, Noun linker, Predicate Modification, Predicate linker, and Sentence Modifier.

We found that the F-measure slightly improved, from 74.17% to 75.06%. This is evidence that carefully categorizing the data from a linguistic viewpoint may improve accuracy further.

7 Conclusions and Future Work

In this paper, we described the problems that long sentence patterns pose for Thai and presented a method to chunk a sentence into smaller units that are larger than words. We conclude that using CRF together with categorial grammar gives impressive results: the accuracy of chunking at sentence level is 74.17%, and we are able to collect 50% more correct trees. This technique also enables us to address the implicit sentential NP problem; with our technique, we found 58% of the implicit sentential NPs. In future work, several issues remain. First, we have to trade off between the over-grouping problem and the implicit sentential NP problem. Second, we plan to consider ADVP and S-BAR, which have very small amounts of data, not adequate to train a good model. Finally, we plan to apply more linguistic knowledge to improve accuracy further.

References

Abney S., and Tenny C., editors. 1991. Parsing by Chunks. Principle-Based Parsing. Kluwer Academic Publishers.

Awasthi P., Rao D., and Ravindram B. 2006. Part of Speech Tagging and Chunking with HMM and CRF. Proceedings of the NLPAI Machine Learning Competition.

Basili R., Pazienza T., and Massio F. 1999. Lexicalizing a shallow parser. Proceedings of Traitement Automatique du Langage Naturel 1999, Corgese, Corsica.

Charoenporn Thatsanee, Sornlertlamvanich Virach, and Isahara Hitoshi. 1997. Building A Large Thai Text Corpus - Part-Of-Speech Tagged Corpus: ORCHID. Proceedings of the Natural Language Processing Pacific Rim Symposium.

Kosawat Krit, Boriboon Monthika, Chootrakool Patcharika, Chotimongkol Ananlada, Klaithin Supon, Kongyoung Sarawoot, Kriengket Kanyanut, Phaholphinyo Sitthaa, Purodakananda Sumonmas, Thanakulwarapas Tipraporn, and Wutiwiwatchai Chai. 2009. BEST 2009: Thai Word Segmentation Software Contest. The Eighth International Symposium on Natural Language Processing: 83-88.

Kruengkrai C., Sornlertlumvanich V., and Isahara H. 2006. A Conditional Random Field Framework for Thai Morphological Analysis. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006).

Kudo T., and Matsumoto Y. 2001. Chunking with support vector machines. Proceedings of NAACL.

Lafferty J., McCallum A., and Pereira F. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of ICML-01, 282-289.

McCallum A., Freitag D., and Pereira F. 2000. Maximum entropy Markov models for information extraction and segmentation. Proceedings of ICML.

Molina A., and Pla F. 2002. Shallow Parsing using Specialized HMMs. Journal of Machine Learning Research 2:595-613.

Nguyen L. Minh, Nguyen H. Thao, and Nguyen P. Thai. 2009. An Empirical Study of Vietnamese Noun Phrase Chunking with Discriminative Sequence Models. Proceedings of the 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, 9-16.

Osborne M. 2000. Shallow Parsing as Part-of-Speech Tagging. Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal.

Pengphon N., Kawtrakul A., and Suktarachan M. 2002. Word Formation Approach to Noun Phrase Analysis for Thai. Proceedings of SNLP2002.

Sha F. and Pereira F. 2003. Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.

Development of the Korean Resource Grammar: Towards Grammar Customization

Sanghoun Song, Dept. of Linguistics, Univ. of Washington, [email protected]
Jong-Bok Kim, School of English, Kyung Hee Univ., [email protected]
Francis Bond, Linguistics and Multilingual Studies, Nanyang Technological Univ., [email protected]
Jaehyung Yang, Computer Engineering, Kangnam Univ., [email protected]

Abstract

The Korean Resource Grammar (KRG) is a computational open-source grammar of Korean (Kim and Yang, 2003) that has been constructed within the DELPH-IN consortium since 2003. This paper reports the second phase of the KRG development, which moves from a phenomena-based approach to grammar customization using the LinGO Grammar Matrix. This new phase of development not only improves parsing efficiency but also adds generation capacity, which is necessary for many NLP applications.

1 Introduction

The Korean Resource Grammar (KRG) has been under development since 2003 (Kim and Yang, 2003) with the aim of building an open-source grammar of Korean. The grammatical framework for the KRG is Head-driven Phrase Structure Grammar (HPSG: Pollard and Sag (1994); Sag et al. (2003)), a non-derivational, constraint-based, and surface-oriented grammatical architecture. The grammar models human languages as systems of constraints on typed feature structures. This enables the extension of the grammar in a systematic and efficient way, resulting in linguistically precise and theoretically motivated descriptions of languages.

The initial stage of the KRG (hereafter, KRG1) covered a large part of the Korean grammar with fine-grained HPSG analyses. However, this version, focusing on linguistic data with theory-oriented approaches, is unable to yield efficient parsing or generation. An additional limit of the KRG1 is its unattested parsing efficiency on large amounts of naturally occurring data, which is a prerequisite to practical uses of the grammar in the area of MT.

Such weak points have motivated us to move the development of the KRG from the theory-based approach on which the KRG1 is couched to a data-driven one. In particular, this second phase of the KRG (henceforth, KRG2) starts with two methods: shared grammar libraries (the Grammar Matrix (Bender et al., 2002; Bender et al., 2010)) and data-driven expansion (using the Korean portions of multilingual texts).

Next, we introduce the resources we used (§2). This is followed by more detailed motivation for our extensions (§3). We then detail how we use the grammar libraries from the Grammar Matrix to enable generation (§4) and then expand the coverage based on a corpus study (§5).

2 Background

2.1 Open Source NLP with HPSG

The Deep Linguistic Processing with HPSG Initiative (DELPH-IN: www.delph-in.net) provides an open-source collection of tools and grammars for deep linguistic processing of human language within the HPSG and MRS (Minimal Recursion Semantics (Copestake et al., 2005)) framework. The resources include software packages, such as the LKB for parsing and generation, PET (Callmeier, 2000) for parsing, and a profiling tool, [incr tsdb()] (Oepen, 2001).


There are also several grammars: e.g. ERG, the English Resource Grammar (Flickinger, 2000); Jacy, a Japanese grammar (Siegel and Bender, 2002); the Spanish grammar; and so forth. These, along with some pre-compiled versions of preprocessing and experimental tools, are packaged in the LOGON distribution (wiki.delph-in.net/moin/LogonTop). Most resources are under the MIT license, with some parts under other open licenses (www.opensource.org/licenses/) such as the LGPL. The KRG has been constructed within this open-source infrastructure, and is released under the MIT license, which allows people ". . . without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies" so long as "The above copyright notice and this permission notice shall be included . . . ".

2.2 The Grammar Matrix

The Grammar Matrix (Bender et al., 2002; Bender et al., 2010) offers a well-structured environment for the development of precision grammars. This framework makes it possible to build an HPSG/MRS-based grammar in a short time and to improve it continuously. The Grammar Matrix covers quite a few linguistic phenomena from a typological point of view. There is also a starter-kit, the Grammar Matrix customization system, which can build the backbone of a computational grammar from a linguistic description.

2.3 A Data-driven Approach

Normally speaking, building up a computational grammar is painstaking work, because developing a grammar by hand alone costs too much time and effort. An alternative is a data-driven approach, which ensures 'cheap, fast, and easy' development. However, this does not mean that one approach is better than the other: each has its own merits, and to achieve the best performance in parsing and generation, each needs to complement the other.

3 Directions for Improvement

3.1 Generation for MT

Figure 1: HPSG/MRS-based MT Architecture

HPSG/MRS-based MT architecture consists of parsing, transfer, and generation, as assumed in Figure 1 (Bond et al., 2005). As noted earlier, the KRG1, with no generation function, is limited to the Source Analysis step in Figure 1. In addition, since its first aim was to develop a Korean grammar that reflects the language's individual properties in detail, the KRG1 lacks semantic representations compatible with other grammars such as the ERG and Jacy. The mismatches between the components of the KRG1 and those of other grammars make it difficult to adopt the Korean grammar for an MT system. To take a representative example, the KRG1 treats tense information as a feature of HEAD, while other grammars incorporate it into the semantics; thus, during the transfer process in Figure 1, some information will be missing. In addition, KRG1 used default inheritance, which makes the grammar more compact but means it could not be used with the faster PET parser. We discuss this issue in more detail in Section 4.1.

Another main issue in the KRG1 is that some of the defined types and rules in the grammar are inefficient in generation. Because the declared types and rules are defined with theoretical motivations, the run-time for generating any parsing units within the system takes longer than expected and further causes memory overflow errors to crop up almost invariably, even when the input is quite simple. This problem is partially due to the complex morphological inflection system in the KRG1. Section 4.2 discusses how KRG2 solves this problem.

Third, it is better "to be robust for parsing and strict for generation" (Bond et al., 2008). That means robust rules will apply in parsing, even when the input sentence does not sound perfect, but not in generation. For example, the sentence (1b), the colloquial form of the formal, standard sentence (1a), is used more frequently in colloquial contexts:

(1) a. ney-ka cham yeppu-ney.
       you-NOM really pretty-DECL
       'You are really pretty.'
    b. ni-ka cham ippu-ney

The grammar needs to parse both (1a) and (1b) and needs to yield the same MRS for both, because the two sentences convey the same truth-conditional meaning. However, the KRG1 handles only the legitimate sentence (1a), excluding (1b). The KRG1 is thus not sophisticated enough to distinguish these two stylistically different sentences. Therefore, we need generation procedures that can choose a proper sentence style. Section 4.3 proposes the 'STYLE' feature structure as the choice device.

3.2 Exploiting Corpora

One of the main motivations for our grammar improvement is to achieve a better balance between linguistic motivation and practical purposes. We first evaluated the coverage and performance of the KRG1 on a large amount of data, in order to track down the problems that may cause parsing inefficiency and generation clogs. In other words, referring to the experimental results, we identified the problematic patterns in the current version. According to these error patterns, on the one hand, we expanded the lexicon from naturally occurring texts in our generalization corpus; on the other hand, we fixed the previous rules and sometimes introduced new rules with reference to their occurrence in texts.

3.3 How to Improve

In developing the KRG, we have employed two strategies for improvement: (i) shared grammar libraries and (ii) exploiting large text corpora. We share grammar libraries with the Grammar Matrix (Bender et al., 2002) as the foundation of KRG2. The Grammar Matrix provides types and constraints that assist the grammar in producing well-formed MRS representations, and the Grammar Matrix customization system provides a linguistically motivated broad-coverage grammar for Korean as well as the basis for multilingual grammar engineering. In addition, we exploit naturally occurring texts as the generalization corpus. We chose as our corpora Korean texts that have translations available in English or Japanese, because they can serve as the baseline of multilingual MT. Since the data-driven approach is influenced by data type, multilingual texts help us make the grammar more suitable for MT in the long term. In developing the grammar in this next phase, we assumed the following principles:

(2) a. The Grammar Matrix will apply when a judgment about structure (e.g. semantic representation) is needed.
    b. The KRG will apply when a judgment about Korean is needed.
    c. The resulting grammar has to run on both PET and LKB without any problems.
    d. Parsing needs to be accomplished as robustly as possible, and generation needs to be done as strictly as possible.

4 Generation

It is hard to alter the structure of the KRG1 from top to bottom in a relatively short time, mainly because of the difficulty of converting each grammar module (optimized only for parsing) into something applicable to generation, and further of making the grammar run separately for parsing and generation. Therefore, we first rebuilt the basic schema of the KRG1 on the Grammar Matrix customization system, and then imported each grammar module from KRG1 into the Matrix-based frame (§4.1). In addition, we reformed the inflectional hierarchy assumed in the KRG1, so that the grammar no longer impedes generation (§4.2). Finally, we introduced the STYLE feature structure for sentence choice, in accordance with our principles (2c-d) (§4.3).

4.1 Modifying the Modular Structure

The root folder krg contains the basic type definition language files (*.tdl). In the KRG2, we subdivided types.tdl into: matrix.tdl, which corresponds to general principles; korean.tdl, with language-particular rules; types-lex.tdl for lexical types; and types-ph.tdl for phrasal types. In addition, we reorganized the KRG1's lexicons.tdl file into the lex folder, consisting of several sub-files in accordance with the POS values (e.g. lex-v.tdl for verbs).

The next step is to revise the grammar modules in order to use the Grammar Matrix to its full extent. In this process, when inconsistencies arise between KRG1 and KRG2, we followed (2a-b).

We further transplanted each previous module into the KRG2, while checking the attested test items used in the KRG1. The test items, consisting of 6,180 grammatical sentences and 118 ungrammatical sentences, were divided into subgroups according to the related phenomena (e.g. light verb constructions).

4.2 Simplifying the Inflectional Hierarchy

Korean has rigid ordering restrictions in the morphological paradigm for verbs, as shown in (3).

(3) a. V-base + HON + TNS + MOOD + COMP
    b. ka-si-ess-ta-ko 'go-HON-PST-DC-COMP'

KRG1 dealt with this ordering of suffixes by using a type hierarchy that represents a chain of inflectional slots (Figure 2: Kim and Yang (2004)).

Figure 2: Korean Verbal Hierarchy

This hierarchy has its own merits, but it is not effective for generating sentences, because it requires a large number of calculations in the generation process. Figure 3 and Table 1 explain the difference in computational complexity according to each structure.

Figure 3: Calculating Complexity

In Figure 3, (a) is similar to Figure 2, while (b) corresponds to the traditional template approach. Let us compare the complexity of obtaining the target node D. For convenience's sake, assume that each node has ten constraints to be satisfied. In (a), since there are three parent nodes (i.e. A, B, and C) on top of D, D cannot be generated until A, B, and C have been checked; hence, it costs at least 10,000 (10[A] × 10[B] × 10[C] × 10[D]) calculations. In contrast, in (b), only 100 (10[A] × 10[D]) calculations are enough to generate node D. That means the deeper the hierarchy is, the more the complexity increases. Table 1 shows that (a) requires more than 52 times as much computation as (b), though they have the same number of nodes.

Table 1: Complexity of (a) and (b)

      (a)                                        (b)
B'    10[A] × 10[B']                 =    100    10[A] × 10[B']  =  100
C'    10[A] × 10[B] × 10[C']         =  1,000    10[A] × 10[C']  =  100
D'    10[A] × 10[B] × 10[C] × 10[D'] = 10,000    10[A] × 10[D']  =  100
D     10[A] × 10[B] × 10[C] × 10[D]  = 10,000    10[A] × 10[D]   =  100
Σ                                      21,100                       400

When generation is processed by the LKB, all potential inflectional nodes are built before syntactic configuration, according to the given MRS. Thus, if the hierarchy becomes deeper and contains more nodes, the complexity of an (a)-style hierarchy grows almost geometrically. This makes generation virtually impossible, causing memory overflow errors in generation within the KRG1.

A fully flat structure (b) is not always superior to (a). First of all, the flat approach ignores the fact that Korean is an agglutinative language: the Korean morphological paradigm can yield a wide variety of forms, so enumerating all potential forms is not only undesirable but impossible. The KRG2 thus follows a hybrid approach (c) that takes the advantages of both (a) and (b). (c) is more flattened than (a), which lessens computational complexity. On the other hand, in (c) the depth of the inflectional hierarchy is fixed at two, and the skeleton looks like a unary form, though each major node (marked as a bigger circle) has its own subtypes (marked as dotted lines). Even though the depth has been diminished, the hierarchy is not perfectly flat; therefore, it can partially represent the austere suffix ordering of Korean. The hierarchy (c) thereby curtails the cost of generation.

In this context, we sought to use the minimum number of inflectional slots for Korean. We need at least three: root + semantic slot(s) + syntactic slot(s).
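The figures in Table 1 are plain arithmetic over the assumed ten constraints per node, as the short Python check below shows; the two helper functions are ours, not part of the grammar.

def layered_cost(depth, constraints=10):
    # hierarchy (a): every ancestor's constraints multiply in
    return constraints ** (depth + 1)

def flat_cost(constraints=10):
    # hierarchy (b): only the root and the node itself are checked
    return constraints * constraints

# B' at depth 1, C' at depth 2, D' and D at depth 3:
total_a = layered_cost(1) + layered_cost(2) + 2 * layered_cost(3)
total_b = 4 * flat_cost()
print(total_a, total_b, total_a / total_b)   # 21100 400 52.75 -- the '52 times' of the text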

That is, a series of suffixes denoting semantic information attaches to the second slot, and a series of suffixes likewise attaches to the third slot. Since semantic suffixes are almost invariably followed by syntactic ones in Korean, this ordering is convincing, granting that it does not fully represent the fact that there is also an ordering among semantic forms or among syntactic ones. (4) is an example from hierarchy (c). There are three slots: the root ka 'go', the semantic suffixes si-ess, and the syntactic ones ta-ko.

(4) a. V-base + (HON+TNS) + (MOOD+COMP)
    b. ka-si+ess-ta+ko 'go-HON+PST-DC+COMP'

Assuming there are ten constraints on each node, the complexity of generating D in (c) is just 10,000. This measure is, of course, bigger than that of (b), but the number never increases any further: all forms at the same depth have equal complexity, and it is fully predictable. Table 2 compares the complexity of (a) to (c). By converting (a) to (c), we made it possible to generate with KRG2.

Table 2: Complexity of (a-c)

      Depth     Complexity
(a)   n ≥ 3     ≥ 10,000
(b)   n = 1     100
(c)   n = 2     10,000
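A toy Python rendering of the two-slot idea in (4) may make the fixed depth concrete; the suffix inventories here are deliberately tiny and only illustrative, not the grammar's actual paradigm.

SEMANTIC = {"HON+PST": "siess", "HON": "si", "PST": "ess", "": ""}
SYNTACTIC = {"DECL+COMP": "tako", "DECL": "ta"}

def inflect(root, semantic, syntactic):
    # root + one semantic suffix cluster + one syntactic suffix cluster
    return root + SEMANTIC[semantic] + SYNTACTIC[syntactic]

print(inflect("ka", "HON+PST", "DECL+COMP"))   # 'kasiesstako', i.e. ka-si-ess-ta-ko

However many concrete clusters the two dictionaries hold, the lookup depth stays at two, which is the point of hierarchy (c).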

4.3 Choosing a Sentence Style

The choice between formal and informal (colloquial) sentence styles depends on context. A robust parser should cover both styles, but we generally want a consistent style when generating. These kinds of stylistic differences can take place at the level of (i) the lexicon, (ii) morphological combination, and (iii) syntactic configuration. In such cases, the grammar resorts to STYLE to filter out infelicitous results. The type hierarchy is sketched out in Figure 4: strict is close to school grammar (e.g. written is the style of newspapers), while variant forms that stem from corresponding canonical forms fall under robust. For instance, if the text domain for generation is newspaper, we can select only written as our sentence choice, which excludes differently styled sentences from the result.

Figure 4: Type Hierarchy of STYLE

Let us look at (1a-b) again. ni 'you' in (1b) is a dialect form of ney, but it has been used more productively than its canonical form in daily speech. In that case, we can specify the STYLE of ni as dialect, as given below; in contrast, the neutral form ney has an unspecified STYLE feature:

ni := n-pn-2nd-non-pl &
  [ STEM < "ni" >, STYLE dialect ].
ney := n-pn-2nd-non-pl &
  [ STEM < "ney" > ].

Likewise, since the predicate ippu 'pretty' in (1b) stems from yeppu in (1a), they share the predicate name '_yeppu_a_1_rel' (i.e. the RMRS standard for predicate names, '_lemma_pos_sense_rel'), but differ in their STYLE features. That means (1a-b) share the same MRS structure, shown in Figure 5; the KRG can thereby parse (1b) into the same MRS as (1a) and generate (1a) from it.

Figure 5: MRS of (1a-b) (the shared MRS has LTOP h1 and INDEX e2, with RELS containing person_rel, exist_q_rel, _cham_d_1_rel and _yeppu_a_1_rel, and qeq constraints in HCONS)

The KRG2 revised each rule with reference to its style type; we thereby obtained 96 robust rules in total. As a welcome result, we could control our generation, which was successful with respect to (2c-d). Let us call the version reconstructed so far 'base'.
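The effect of the STYLE feature on generation can be pictured as a post-filter over candidate realizations. The sketch below is a hypothetical Python analogue, not the TDL mechanism itself: a realization survives only if its style is unspecified (neutral) or matches the requested one.

def filter_for_generation(realizations, wanted="written"):
    # 'robust for parsing, strict for generation'
    return [r for r in realizations if r.get("style") in (None, wanted)]

candidates = [
    {"text": "ney-ka cham yeppu-ney", "style": None},       # neutral form (1a)
    {"text": "ni-ka cham ippu-ney", "style": "dialect"},    # dialect form (1b)
]
print(filter_for_generation(candidates))   # only the neutral form (1a) survives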

5 Exploiting Corpora

5.1 Resources

This study uses two multilingual corpora: one is the Sejong Bilingual Corpora: SBC (Kim and Cho, 2001), and the other is the Basic Travel Expression Corpus: BTEC (Kikui et al., 2003). We exploited the Korean parts of each corpus, taking them as our generalization corpus. Table 3 shows the configuration of the two resources (KoEn: Korean-English, KoJa: Korean-Japanese):

Table 3: Generalization Corpora

              SBC                   BTEC
Type          Bilingual             Multilingual
Domain        Balanced Corpus       Tourism
Words         KoEn: 243,788         914,199
              KoJa: 276,152
T/T ratio     KoEn: 27.63           92.7
              KoJa: 20.28
Avr length    KoEn: 16.30           8.46
              KoJa: 23.30

We also make use of nine test suites, sorted into three types (each test suite includes 500 sentences). As the first type, we used three test sets covering overall sentence structures in Korean: Korean Phrase Structure Grammar (kpsg; Kim (2004)), Information-based Korean Grammar (ibkg; Chang (1995)), and the SERI test set (seri; Sung and Jang (1997)).

Second, we randomly extracted sentences from each corpus, separately from our generalization corpus; two suites were taken from the Korean-English and Korean-Japanese pairs in SBC (sj-ke and sj-kj, respectively). The other two suites are from the BTEC-KTEXT (b-k) and the BTEC-CSTAR (b-c); the former consists of relatively plain sentences, while the latter is composed of spoken ones.

Third, we obtained two test suites from sample sentences in two dictionaries: Korean-English (dic-ke) and Korean-Japanese (dic-kj). These suites can be assumed to have at least two advantages with respect to our evaluation: (i) the sentence length is longer than that of BTEC as well as shorter than that of SBC, and (ii) the sample sentences in dictionaries are normally made up of useful expressions for translation.

5.2 Methods

We did experiments and improved the KRG by following three steps repeatedly: (i) evaluating, (ii) identifying, and (iii) exploiting. In the first step, we tried to parse the nine test suites and to generate sentences from the MRS structures obtained from the parsing results, and we measured coverage and performance. Here, 'coverage' means how many sentences can be parsed or generated, and 'performance' means how many seconds it takes on average. In the second step, we identified the most serious problems. In the third step, we sought to exploit our generalization corpora in order to remedy the drawbacks. After that, we repeated the procedure until we obtained the desired results.

5.3 Experiments

So far, we have two versions: KRG1 and base. Our further experiments consist of four phases: lex, MRS, irules, and KRG2.

Expanding the lexicon: To begin with, in order to broaden our coverage, we expanded our lexical entries with reference to our generalization corpus and previous literature. Verbal items are taken from Song (2007) and Song and Choe (2008), which classify the argument structures of the Korean verbal lexicon into subtypes within the HPSG framework in a semi-automatic way. The reason why we do not use our corpus here is that the verbal lexicon commonly requires subcategorization frames, which we cannot induce easily from corpora alone. For other word classes, we extracted lexical items from the POS-tagged SBC and BTEC corpora. Table 4 shows how many items we extracted from our generalization corpus. Let us call this version 'lex'.

Table 4: Expansion of Lexical Items

verbal nouns             4,474
verbs and adjectives     1,216
common nouns            11,752
proper nouns             7,799
adverbs                  1,757
numeral words            1,172
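The coverage and performance measures of Section 5.2, plus the ambiguity figure reported later in Table 6, amount to the following loop. Here `parse` stands in for a run of the LKB or PET over one test suite, so this is a schematic Python harness rather than the profiling actually done with [incr tsdb()].

import time

def profile(sentences, parse):
    parsed, n_analyses = 0, 0
    start = time.time()
    for s in sentences:
        analyses = parse(s)                # list of analyses, possibly empty
        if analyses:
            parsed += 1
            n_analyses += len(analyses)
    coverage = 100.0 * parsed / len(sentences)           # % of sentences parsed
    seconds = (time.time() - start) / len(sentences)     # average time per sentence
    ambiguity = n_analyses / parsed if parsed else 0.0   # # of parses / # of sentences
    return coverage, seconds, ambiguity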

MRS: Generation in the LKB, as shown in Figure 1, deploys MRS as the input, which means our generation performance hinges on the well-formedness of the MRS. In other words, if our MRS is broken somewhere or constructed inefficiently, the generation results are directly affected; for instance, if the semantic representation does not scope, we will not generate correctly. We were able to identify such sentences by parsing the corpora, storing the semantic representations, and then using the semantic well-formedness checkers in the LKB. We identified all rules and lexical items that produced ill-formed MRSs using a small script and fixed them by hand. This had an immediate and positive effect on coverage as well as on performance in generation. We refer to these changes as 'MRS'.

Different inflectional forms for sentence styles: Texts in daily life are composed of various styles. For example, spoken forms normally differ more or less from written ones, and the difference between them in Korean is so big that the previous version of the KRG could hardly parse spoken forms. Besides, Korean has many compound nouns and derived words. Therefore, we included these forms in our inflectional rules and expanded the lexical entries again (3,860 compound nouns, 2,791 derived words). This greatly increased parsing coverage. We call this version 'irules'.

Grammaticalized and lexicalized forms: Some problems still remain, because our test suites contain further noteworthy forms. First, Korean has quite a few grammaticalized forms; for instance, kupwun is composed of a definite determiner ku and a classifier for humans pwun "the person", but it functions like a single word (i.e. a third person singular pronoun). In a similar vein, there are quite a few lexicalized forms as well; for example, the verbal lexeme kkamek- is composed of kka- "peel" and mek- "eat", but it conveys a sense of "forget" rather than "peel and eat". In addition, we also need to cover idiomatic expressions (e.g. "thanks") for robust parsing. Exploiting our corpus, we added 1,720 grammaticalized or lexicalized forms and 352 idioms. We call this version 'KRG2'.

Table 5 compares KRG2 with KRG1, and Figure 6 shows how many lexical items we have covered so far.

Table 5: Comparison between KRG1 and KRG2

                           KRG1     KRG2
# of default types          121      159
# of lexical types          289      593
# of phrasal types           58      106
# of inflectional rules      86      244
# of syntactic rules         36       96
# of lexicon              2,297   39,190

Figure 6: Size of Lexicon

5.4 Evaluation

Table 6 shows the evaluation measures of this study. 'p' and 'g' stand for 'parsing' and 'generation', respectively; '+' represents the difference compared to KRG1. Since KRG1 does not generate, there is no 'g+'.

Table 6: Evaluation

          coverage (%)                   ambiguity
          p      p+     g      s         p        g
kpsg      77.0   -5.5   55.2   42.5      174.9    144.4
ibkg      61.2   41.8   68.3   41.8      990.5    303.5
seri      71.3   -0.8   65.7   46.8      289.1    128.4
b-k       43.0   32.6   62.8   27.0      1769.4   90.0
b-c       52.2   45.8   59.4   31.0      1175.8   160.6
sj-ke     35.4   31.2   58.2   20.6      358.3    170.3
sj-kj     23.0   19.6   52.2   12.0      585.9    294.9
dic-ke    40.4   31.0   42.6   17.2      1392.7   215.9
dic-kj    34.8   25.2   67.8   23.6      789.3    277.9
avr       48.7   24.5   59.1   28.8      836.2    198.4

On average, parsing coverage increases by 24.5%. The reason why there are negative values in 'p+' for kpsg and seri is that we discarded some modules that run counter to efficient processing (e.g., the grammar module for handling floating quantifiers sometimes produces too many ambiguities). Since KRG1 was constructed largely around these test sets, we expected it to perform well on them; if we measure parsing coverage again, excluding the results of kpsg and seri, the increase accounts for 32.5%. The generation coverage of KRG2 accounts for almost 60% per parsed sentence on average; note that KRG1 could not generate at all. The running times, meanwhile, become slower, as we would expect for a grammar with greater coverage; however, we can make up for this using the PET parser, as shown in Figure 9.


Figure 7: Parsing Coverage (%)
Figure 9: Parsing Performance (s)

Figure 8: Generation Coverage (%)
Figure 10: Generation Performance (s)

's' (short for 'success') means the portion of sentences both parsed and generated (i.e. p × g), which accounts for about 29%. Ambiguity means '# of parses / # of sentences' for parsing and '# of realizations / # of MRSes' for generation. These numbers look rather big and should be narrowed down in future work.

In addition, we can see in Table 6 that there is a coverage ordering with respect to the type of test suite: test sets > BTEC > dic > SBC. This is influenced by three factors: (i) lexical variety, (ii) sentence length, and (iii) text domain. This difference implies that it is highly necessary to use variegated texts in order to improve a grammar in a comprehensive way.

Figures 7 to 10 show how much each experiment in §5.3 contributes to the improvement. First, consider Figures 7 and 8. As we anticipated, lex and irules contribute greatly to the growth of parsing coverage. In particular, the line of b-c, which mostly consists of spoken forms, rises rapidly in irules and KRG2; that implies that Korean parsing largely depends on the richness of the lexical rules. On the other hand, as we also expected, MRS makes a great contribution to generation coverage (Figure 8): in MRS, the growth accounts for 22% on average. That implies that testing with large corpora must take precedence in order for coverage to grow.

Figures 9 and 10 show performance in parsing and generation, respectively. Compared to KRG1, our Matrix-based grammars (from base to KRG2) yield fairly good performance. This is mainly because we deployed the PET parser, which runs fast, whereas KRG1 runs only on the LKB. Figure 10, on the other hand, shows that the revision of MRS also does much to enhance generation performance, in line with the coverage results mentioned before: it decreases the running times by about 3.1 seconds on average.

6 Conclusion

The newly developed KRG2 has been successfully included in the LOGON repository since July 2009; thus, it is readily available. In future research, we plan to apply the grammar in an MT system (for which we already have a prototype). In order to achieve this goal, we need to construct multilingual treebanks: Korean (KRG), English (ERG), and Japanese (Jacy).

Acknowledgments

We thank Emily M. Bender, Dan Flickinger, Jae-Woong Choe, Kiyotaka Uchimoto, Eric Nichols, Darren Scott Appling, and Stephan Oepen for comments and suggestions at various stages. Parts of this research were conducted while the first and third authors were at the National Institute of Information and Communications Technology (NICT), Japan; we thank NICT for their support. Our thanks also go to three anonymous reviewers for helpful feedback.

References

Bender, Emily M., Dan Flickinger, and Stephan Oepen. 2002. The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics.

Bender, Emily M., Scott Drellishak, Antske Fokkens, Michael Wayne Goodman, Daniel P. Mills, Laurie Poulson, and Safiyyah Saleem. 2010. Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System. In Proceedings of ACL 2010 Software Demonstrations.

Bond, Francis, Stephan Oepen, Melanie Siegel, Ann Copestake, and Dan Flickinger. 2005. Open Source Machine Translation with DELPH-IN. In Proceedings of Open-Source Machine Translation: Workshop at MT Summit X.

Bond, Francis, Eric Nichols, Darren Scott Appling, and Michael Paul. 2008. Improving Statistical Machine Translation by Paraphrasing the Training Data. In Proceedings of the 5th International Workshop on Spoken Language Translation.

Kim, Jong-Bok and Jaehyung Yang. 2004. Projections from Morphology to Syntax in the Korean Resource Grammar: Implementing Typed Feature Structures. In Lecture Notes in Computer Science, volume 2945, pages 13–24. Springer-Verlag.

Kim, Jong-Bok. 2004. Korean Phrase Structure Grammar. Hankwuk Publishing, Seoul.

Oepen, Stephan. 2001. [incr tsdb()]: competence and performance laboratory. User manual. Technical report, Computational Linguistics, Saarland University.

Pollard, Carl and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. The University of Chicago Press, Chicago, IL.

Sag, Ivan A., Thomas Wasow, and Emily M. Bender. 2003. Syntactic Theory: A Formal Introduction. CSLI Publications, Stanford, CA.

Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization.

Callmeier, Ulrich. 2000. PET – a Platform for Experimentation with Efficient HPSG Processing Techniques. Natural Language Engineering, 6(1):99–107.

Song, Sanghoun and Jae-Woong Choe. 2008. Automatic Construction of Korean Verbal Type Hierarchy using Treebank. In Proceedings of HPSG2008.

Chang, Suk-Jin. 1995. Information-based Korean Grammar. Hanshin Publishing, Seoul.

Copestake, Ann, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal Recursion Semantics: An Introduction. Research on Language and Computation, 3(4):281–332.

Flickinger, Dan. 2000. On Building a More Efficient Grammar by Exploiting Types. Natural Language Engineering, 6(1) (Special Issue on Efficient Processing with HPSG):15–28.

Song, Sanghoun. 2007. A Constraint-based Analysis of Passive Constructions in Korean. Master's thesis, Korea University, Department of Linguistics.

Sung, Won-Kyung and Myung-Gil Jang. 1997. SERI Test Suites '95. In Proceedings of the Conference on Hanguel and Korean Language Information Processing.

Kikui, Genichiro, Eiichiro Sumita, Toshiyuki Takezawa, and Seiichi Yamamoto. 2003. Creating corpora for speech-to-speech translation. In Proc. of the EUROSPEECH03, pages 381–384, Geneve, Switzerland.

Kim, Se-jung and Nam-ho Cho. 2001. The progress and prospect of the 21st century Sejong project. In ICCPOL-2001, pages 9–12, Seoul.

Kim, Jong-Bok and Jaehyung Yang. 2003. Korean Phrase Structure Grammar and Its Implementations into the LKB System. In Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation.

An Open Source Urdu Resource Grammar

Shafqat M Virk, Department of Applied IT, University of Gothenburg, [email protected]
Muhammad Humayoun, Laboratory of Mathematics, University of Savoie, [email protected]
Aarne Ranta, Department of CS & Eng, University of Gothenburg, [email protected]

Abstract

We develop a grammar for Urdu in Grammatical Framework (GF). GF is a programming language for defining multilingual grammar applications. The GF resource grammar library currently supports 16 languages. These grammars follow an Interlingua approach and consist of morphology and syntax modules that cover a wide range of features of a language. In this paper we explore different syntactic features of the Urdu language and show how to fit them into the multilingual framework of GF. We also discuss how we cover some of the distinguishing features of Urdu, such as ergativity in verb agreement (see Section 4.2). The main purpose of the GF resource grammar library is to provide an easy way to write natural language applications without knowing the details of syntax, morphology and lexicon. To demonstrate this, we use the Urdu resource grammar to add support for Urdu to the work reported in (Angelov and Ranta, 2010), an implementation of Attempto (Attempto 2008) in GF.

1. Introduction

Urdu is an Indo-European language of the Indo-Aryan family, widely spoken in south Asia. It is the national language of Pakistan and one of the official languages of India. It is written in a modified Perso-Arabic script from right to left. As regards vocabulary, it has a strong influence of Arabic and Persian, along with some borrowing from Turkish and English. Urdu is an SOV language with fairly free word order. It is closely related to Hindi, as both originated from the dialect of the Delhi region called khari boli (Masica, 1991).

We develop a grammar for Urdu that addresses problems related to automated text translation using an Interlingua approach and that provides a way to translate text precisely. This is described in Section 2. We then describe different levels of grammar development, including morphology (Section 3) and syntax (Section 4). In Section 6, we discuss an application in which a semantics-driven translation system is built upon these components.

2. GF (Grammatical Framework)

GF (Grammatical Framework, Ranta 2004) is a tool for working with grammars, implementing a programming language for writing grammars, which in turn is based on a mathematical theory of languages and grammars (http://www.grammaticalframework.org). Many multilingual dialog and text generation applications have been built using GF. GF grammars have two levels: the abstract and the concrete syntax. (In the example code given below, 'fun' and 'cat' belong to the abstract syntax, while 'lin' and 'lincat' belong to the concrete syntax.) The abstract syntax is language independent and is common to all languages in the GF grammar library. It is based on common syntactic or semantic constructions, which work for all the involved languages at an appropriate level of abstraction. The concrete syntax is language dependent and defines a mapping from the abstract syntax to the actual textual representation in a specific language. GF uses the term 'category' to model different parts of speech (e.g. verbs, nouns, adjectives etc.). An abstract syntax defines a set of categories, as well as a set of tree-building functions; a concrete syntax contains rules telling how these trees are linearized. Separating the tree-building rules (abstract syntax) from the linearization rules (concrete syntax) makes it possible to have multiple concrete syntaxes for one abstract syntax.

This makes it possible to parse text in one language and translate it into multiple languages.

Grammars in GF can be roughly classified into two kinds: resource grammars and application grammars. Resource grammars are general-purpose grammars (Ranta, 2009a) that try to cover the general aspects of a language linguistically, and whose abstract syntax encodes syntactic structures. Application grammars, on the other hand, encode semantic structures, but in order to be accurate they are typically limited to specific domains. However, they are not written from scratch for each domain; they use resource grammars as libraries (Ranta, 2009b).

Previously GF had resource grammars for 16 languages: English, Italian, Spanish, French, Catalan, Swedish, Norwegian, Danish, Finnish, Russian, Bulgarian, German, Interlingua (an artificial language), Polish, Romanian and Dutch. Most of these are European languages. We developed a resource grammar for Urdu, making it the 17th in total and the first south Asian language. Resource grammars for several other languages (e.g. Arabic, Turkish, Persian, Maltese and Swahili) are under development.

3. Morphology

In GF resource grammars, a test lexicon of 350 words is provided for each language. These words are built through lexical functions. The rules for defining Urdu morphology are borrowed from (Humayoun et al., 2006), in which Urdu morphology was developed in the Functional Morphology toolkit (Forsberg and Ranta, 2004). Although it is possible to automatically generate equivalent GF code from it, we wrote the rules of morphology from scratch in GF, to obtain better abstractions than are possible in generated code. Furthermore, we extend this work by including compound words. However, the details of morphology are beyond the scope of this paper, whose focus is on syntax.

4. Syntax

While morphological analysis deals with the formation and inflection of individual words, syntax shows how these words (parts of speech) are grouped together to build well-formed phrases. In this section we show how this works and how it is implemented for Urdu.

4.1 Noun Phrases (NP)

When nouns are used in sentences as parts of speech, several linguistic details need to be considered: for example, other words can modify a noun, and nouns have characteristics such as gender and number. When all such required details are grouped together with the noun, the resulting structure is known as a noun phrase (NP). The basic structure of the Urdu noun phrase is "(M) H (M)" according to (Butt M., 1995), where (M) is a modifier and (H) is the head of the NP. The head is the compulsory word, and modifiers may or may not be present. In Urdu, modifiers are of two types: pre-modifiers, i.e. modifiers that come before the head, for instance (kali: bli: "black cat"), and post-modifiers, which come after the head, for instance (tm sb "you all"). In the GF resource library we represent NP as a record:

lincat NP : Type = {s : NPCase => Str ; a : Agr} ;

where

NPCase = NPC Case | NPErg | NPAbl
       | NPIns | NPLoc1 | NPLoc2
       | NPDat | NPAcc ;
Case = Dir | Obl | Voc ;
Agr = Ag Gender Number UPerson ;
Gender = Masc | Fem ;
UPerson = Pers1 | Pers2_Casual
        | Pers2_Familiar | Pers2_Respect
        | Pers3_Near | Pers3_Distant ;
Number = Sg | Pl ;

Thus NP is a record with two fields, 's' and 'a'. 's' is an inflection table and stores the different forms of a noun; 'a' is the agreement feature of the noun, used for selecting the appropriate forms of other categories that agree with nouns.

The Urdu NP has a system of syntactic cases which is partly different from the morphological cases of the category noun (N).

The case markers that follow nouns, in the form of postpositions, cannot be handled at the lexical level through morphological suffixes, and are thus handled at the syntactic level (Butt et al., 2002). Here we create different forms of a noun phrase to handle the case markers of Urdu nouns. Here is a short description of the different cases of NP:

- NPC Case: retains the original case of the noun
- NPErg: ergative case with case marker 'ne'
- NPAbl: ablative case with case marker 'se:'
- NPIns: instrumental case with case marker 'se:'
- NPLoc1: locative case with case marker 'mi:ɳ'
- NPLoc2: locative case with case marker 'pr'
- NPDat: dative case with case marker 'kʋ'
- NPAcc: accusative case with case marker 'kʋ'

A noun is converted to the intermediate category common noun (CN; also known as N-bar), which is then converted to the NP category. CN deals with nouns and their modifiers. As an example, consider adjectival modification:

fun AdjCN : AP -> CN -> CN ;

lin AdjCN ap cn = {
  s = \\n,c => ap.s ! n ! cn.g ! c ! Posit ++ cn.s ! n ! c ;
  g = cn.g
} ;

The linearization of AdjCN gives us common nouns such as (ʈʰnɖa pani: "cold water"), where the CN (pani: "water") is modified by the AP (ʈʰnɖa "cold"). Since Urdu adjectives also inflect in number, gender, case and degree, we need to concatenate the form of the adjective that agrees with the common noun. This is ensured by selecting the corresponding forms of the adjective and the common noun from their inflection tables using the selection operator ('!'). Since CN does not inflect in degree but the adjective does, we fix the degree to positive (Posit) in this construction. Other modifiers include adverbs, relative clauses, and appositional attributes.

A CN can be converted to an NP using different functions: common nouns with determiners, proper names, pronouns, and bare nouns as mass terms:

fun DetCN : Det -> CN -> NP   (e.g. the boy)
fun UsePN : PN -> NP          (e.g. John)
fun UsePron : Pron -> NP      (e.g. he)
fun MassNP : CN -> NP         (e.g. milk)

These different ways of building NPs, which are common to different languages, are defined in the abstract syntax of the resource grammar, but the linearization of these functions is language dependent and is therefore defined in the concrete syntaxes.

4.2 Verb Phrases (VP)

A verb phrase is a single word or a group of words that acts as a predicate. In our construction, the Urdu verb phrase has the following structure:

lincat VP = {
  s : VPHForm => {fin, inf : Str} ;
  obj : {s : Str ; a : Agr} ;
  vType : VType ;
  comp : Agr => Str ;
  embComp : Str ;
  ad : Str
} ;

where

VPHForm = VPTense VPPTense Agr
        | VPReq HLevel | VPStem ;
VPPTense = VPPres | VPPast | VPFutr ;
HLevel = Tu | Tum | Ap | Neutr ;
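Since every syntactic case beyond the morphological ones is just a postposition after the noun, the NPCase table can be pictured with a small Python sketch. The noun forms below ('lRka' for the boy-word, oblique 'lRke') and the romanized markers are illustrative stand-ins for the GF inflection tables, and the sketch assumes the postpositional cases are all formed off the oblique form.

CASE_MARKERS = {
    "NPErg": "ne", "NPAbl": "se", "NPIns": "se",
    "NPLoc1": "miN", "NPLoc2": "pr", "NPDat": "ko", "NPAcc": "ko",
}

def np_forms(direct, oblique, vocative):
    # NPC keeps the noun's own morphological case; the rest add a marker
    forms = {"NPC Dir": direct, "NPC Obl": oblique, "NPC Voc": vocative}
    for case, marker in CASE_MARKERS.items():
        forms[case] = oblique + " " + marker
    return forms

print(np_forms("lRka", "lRke", "lRke")["NPErg"])   # 'lRke ne' -- boy + ergative marker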

A noun is converted to the intermediate category common noun (CN; also known as N-Bar), which is then converted to the NP category. CN deals with nouns and their modifiers. As an example, consider adjectival modification:

  fun AdjCN : AP -> CN -> CN ;

  lin AdjCN ap cn = {
    s = \\n,c => ap.s ! n ! cn.g ! c ! Posit ++ cn.s ! n ! c ;
    g = cn.g
  } ;

The linearization of AdjCN gives us common nouns such as (ʈȹnɖa pani: "cold water"), where a CN (pani: "water") is modified by an AP (ʈȹnɖa "cold"). Since Urdu adjectives also inflect in number, gender, case and degree, we need to concatenate the appropriate form of the adjective that agrees with the common noun. This is ensured by selecting the corresponding forms of the adjective and the common noun from their inflection tables using the selection operator ('!'). Since CN does not inflect in degree but the adjective does, we fix the degree to be positive (Posit) in this construction. Other possible modifiers include adverbs, relative clauses, and appositional attributes.

A CN can be converted to an NP using different functions: common nouns with determiners, proper names, pronouns, and bare nouns as mass terms:

  fun DetCN   : Det -> CN -> NP   -- e.g. the boy
  fun UsePN   : PN -> NP          -- e.g. John
  fun UsePron : Pron -> NP        -- e.g. he
  fun MassNP  : CN -> NP          -- e.g. milk

These different ways of building NPs, which are common across languages, are defined in the abstract syntax of the resource grammar, but the linearization of these functions is language dependent and is therefore defined in the concrete syntaxes.

4.2 Verb Phrases (VP)

A verb phrase is a single word or a group of words that acts as a predicate. In our construction, the Urdu verb phrase has the following structure:

  lincat VP = {
    s       : VPHForm => {fin, inf : Str} ;
    obj     : {s : Str ; a : Agr} ;
    vType   : VType ;
    comp    : Agr => Str ;
    embComp : Str ;
    ad      : Str
  } ;

where

  VPHForm  = VPTense VPPTense Agr | VPReq HLevel | VPStem ;
  VPPTense = VPPres | VPPast | VPFutr ;
  HLevel   = Tu | Tum | Ap | Neutr ;

In the GF representation a VP is a record with different fields. The most important field is 's', which is an inflection table and stores the different forms of the verb. At the VP level we define Urdu tenses using a simplified tense system with only three tenses, named VPPres, VPPast and VPFutr. In the case of VPTense, for every possible combination of VPPTense and agreement (gender, number, person) a tuple of two string values {fin, inf : Str} is created: 'fin' stores the copula (auxiliary verb), and 'inf' stores the corresponding form of the verb. VPStem is a special tense which stores the root form of the verb. This form is used to create the full set of Urdu tenses at clause level (the tenses in which the root form of the verb is used, i.e. the perfective and progressive tenses). Handling these tenses at clause level rather than at verb phrase level simplifies the VP and results in a more efficient grammar.

The resource grammar has a common API with a much simplified tense system, close to the Germanic languages. It is divided into tense and anteriority. There are only four tenses, named present, past, future and conditional, and two possibilities for anteriority (Simul, Anter). This means it creates 8 combinations. This abstract tense system does not cover all the tenses of Urdu. We cover the remaining tenses at clause level; even though these tenses are not accessible through the common API, they can still be used in language-specific modules.

Other forms of verb phrases include the request form (VPReq) and the imperative form (VPImp). There are four levels of request in Urdu. Three of them correspond to the honorific levels (tʋ, tm, a:p), and the fourth is neutral with respect to honorific level.

The Urdu VP is a complex structure with different parts: the main part is a verb, and there are other auxiliaries attached to the verb. For example, an adverb can be attached to a verb as a modifier. We have a special field 'ad' in our VP representation for this; it is a simple string that can be attached to the verb to build a modified verb. In Urdu the complement of a verb precedes the actual verb, e.g. (ʋo dʋɽna tʃahti: he: "she wants to run"), where (dʋɽna "run") is the complement of the verb (tʃahna "want"), except in the case where a sentence or a question is the complement of the verb. In that case the complement comes at the very end of the clause, e.g. (ʋo khta he: kh ʋo dʋɽti: he: "he says that she runs"). We have two different fields, 'comp' and 'embComp', in the VP to deal with these two situations.

The 'vType' field is used to store information about the type of a verb. In Urdu a verb can be transitive, intransitive or double-transitive (Schmidt, 1999). This information is important when dealing with ergativity in verb agreement. The information about the object of the verb is stored in the 'obj' field. All this information carried by a VP is used when the VP enters the construction of a clause.

A distinguishing feature of Urdu verb agreement is ergativity. Urdu is one of the languages that show split ergativity at the verb level: final verb agreement is with the direct subject, except in the transitive perfective tense, where verb agreement is with the direct object. In this case the subject takes the ergative construction (the subject with the addition of the ergative case marker ne:). In the simple past tense the verb always shows this ergative behaviour, but in the other perfective tenses (e.g. immediate past, remote past, etc.) there are two different approaches. In the first, the auxiliary verb (tʃka) is used to make the clause; if (tʃka) is used, the verb does not show ergative behaviour and final verb agreement is with the direct subject. Consider the following example:

  lɽka    ktab    xri:d  tʃka      he:
  Direct  Direct  Root   aux_verb
  "The boy has bought a book"

The second way to make the same clause is

  lɽke:  ne:  ktab        xri:di:     he:
  Erg         Direct_Fem  Direct_Fem
  "The boy has bought a book"

In the first case the subject (lɽka "boy") is in the direct case and the auxiliary verb agrees with the subject; in the second case the verb agrees with the object and the ergative case of the subject is used. In the current implementation we follow the first approach.
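Before looking at the GF code, the agreement rule can be summarised in a small Python sketch (ours; the names are illustrative only):

  def agreement(tense, verb_type, subject, obj):
      # Split ergativity: in the transitive perfective (here: simple past)
      # the subject takes the ergative marker "ne:" and the verb agrees
      # with the object; otherwise the subject stays in the direct case
      # and controls agreement.
      if tense == "VPPast" and verb_type in ("VTrans", "VTransPost"):
          return {"subject_case": "ergative (ne:)", "agrees_with": obj}
      return {"subject_case": "direct", "agrees_with": subject}

  print(agreement("VPPast", "VTrans", "lɽka", "ktab"))
  # {'subject_case': 'ergative (ne:)', 'agrees_with': 'ktab'}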

In the concrete syntax we ensure this ergative behaviour through the following code segment in GF (only the relevant segment is shown; elided branches are marked with ...):

  case vt of {
    VPPast => case vp.vType of {
      (VTrans | VTransPost) => ... ;  -- ergative subject, object agreement
      _ => ...                        -- direct subject, subject agreement
    } ;
    _ => ...
  } ;

That is, in the case of the simple past tense, if the verb is transitive then the ergative case of the noun is used and agreement is with the object of the verb; in all other cases the direct case of the noun is used and agreement is with the subject of the verb.

A VP is constructed in different ways; the simplest is

  fun UseV : V -> VP ;

where V is the morphological category and VP is the syntactic category. There are other ways to make a VP from other categories, or from combinations of categories. For example

  fun AdvVP : VP -> Adv -> VP ;

attaches an adverb to a VP to make an adverbially modified VP, for example with (i:haɳ "here").

4.3 Adjective Phrases (AP)

Adjectives (A) are converted into the much richer category of adjectival phrases (AP) at the syntax level. The simplest function for this conversion is

  fun PositA : A -> AP ;

Its linearization is very simple, since in our case AP is similar to A:

  lin PositA a = a ;

There are other ways of making an AP, for example

  fun ComparA : A -> NP -> AP ;

When a comparative AP is created from an adjective and an NP, the constant "se:" is placed between the oblique form of the noun and the adjective. The linearization of the above function is

  lin ComparA a np = {
    s = \\n,g,c,d => np.s ! NPC Obl ++ "se:" ++ a.s ! n ! g ! c ! d
  } ;

4.4 Clauses

A clause is a syntactic category that has variable tense, polarity and order. Predication of an NP and a VP gives the simplest clause:

  fun PredVP : NP -> VP -> Cl ;

The subject-verb agreement is ensured through the agreement feature of the NP, which is passed to the verb as an inherent feature. A clause is of the following type:

  lincat Clause : Type = {s : VPHTense => Polarity => Order => Str} ;

Here VPHTense represents the different tenses of Urdu. Even though the current abstract level of the common API does not cover all tenses of Urdu, we cover them at clause level, and they can be accessed through the language-specific module. VPHTense is of the following type:

  VPHTense = VPGenPres | VPPastSimple | VPFut
           | VPContPres | VPContPast | VPContFut
           | VPPerfPres | VPPerfPast | VPPerfFut
           | VPPerfPresCont | VPPerfPastCont | VPPerfFutCont
           | VPSubj ;

Polarity is used to make positive and negative sentences; Order is used to make declarative and interrogative sentences. These parameters are of the following forms:

  Polarity = Pos | Neg ;
  Order    = ODir | OQuest ;

The PredVP function creates clauses with variable tense, polarity and order, which are fixed at the sentence level by different functions; one of them is

  fun UseCl : Temp -> Pol -> Cl -> S

Here Temp is a syntactic category in the form of a record with fields for Tense and Anteriority. Tense in the Temp category refers to the abstract-level Tense, and we map it to the Urdu tenses by selecting the appropriate clause. This creates a simple declarative sentence; other forms of sentences (e.g. question sentences) are handled in the Questions categories of GF, which follow next.

4.5 Question Clauses and Question Sentences

The common API provides different ways to create question clauses. The simplest way is to create one from a simple clause:

  fun QuestCl : Cl -> QCl ;

In Urdu, simple interrogative sentences are created by just adding "ki:a" at the start of a direct clause that has already been created at clause level. Hence, the linearization of the above function simply selects the appropriate form of the clause and adds "ki:a" at the start. However, this clause still has variable tense and polarity, which are fixed at sentence level, e.g.

  fun UseQCl : Temp -> Pol -> QCl -> QS

Other forms of question clauses include clauses made with the interrogative pronoun (IP), interrogative adverb (IAdv), and interrogative determiner (IDet) categories. Some of the functions for creating question clauses are

  fun QuestVP   : IP -> VP -> QCl    -- e.g. who walks
  fun QuestIAdv : IAdv -> Cl -> QCl  -- e.g. why does he walk

IP, IAdv, IDet etc. are built at the morphological level and can also be created with the following functions:

  fun AdvIP     : IP -> Adv -> IP
  fun IdetQuant : IQuant -> Num -> IDet ;
  fun PrepIP    : Prep -> IP -> IAdv ;

5. Example

As an example, consider the translation of the following sentence from English to Urdu, to see how our proposed system works at different levels:

  He drinks hot milk.

Figure 1 shows the parse tree for this sentence. As resource grammar developers, our goal is to provide a correct concrete-level linearization of this tree for Urdu.

[Figure 1. Parse tree of an example sentence]

The nodes in this tree represent different categories, and its branching shows how a particular category is built from other categories and/or leaves (words from the lexicon). In GF notation these are the syntactic rules which are declared at the abstract level. For example, the category CN can be built from an AP (adjectival phrase) and a CN, so in GF representation it has the following type signature:

  fun AdjCN : AP -> CN -> CN ;

A correct implementation of this rule in the Urdu concrete syntax ensures the correct formation of a common noun (grm dʋdȺ "hot milk") from a CN (dʋdȺ "milk") modified by an adjective (grm "hot").
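Since the parse tree of Figure 1 does not reproduce well in this version, the following nested structure sketches one plausible shape of the abstract tree, written as Python tuples. PredVP, ComplSlash, AdjCN, MassNP, PositA and UsePron are functions discussed in this paper; SlashV2a (building a VPSlash from a two-place verb), UseN, and the lexical names are our assumptions about the exact tree:

  tree = ("PredVP",
          ("UsePron", "he_Pron"),
          ("ComplSlash",
           ("SlashV2a", "drink_V2"),        # assumed constructor name
           ("MassNP",
            ("AdjCN",
             ("PositA", "hot_A"),
             ("UseN", "milk_N")))))         # UseN: lexical N -> CN, assumed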

An NP is constructed from this CN by one of the NP construction rules (see Section 4.1 for details). A VPSlash (an object-missing VP) is built from a two-place verb (pi:ta "drinks"). This VPSlash is then converted to a VP, together with the object NP, through the function

  fun ComplSlash : VPSlash -> NP -> VP ;

which yields the VP (grm dʋdȺ pi:ta he: "drinks hot milk"). Finally, the clause (ʋh grm dʋdȺ pi:ta he: "he drinks hot milk") is built from the NP, built in turn from the pronoun (ʋh "he"), and the VP (grm dʋdȺ pi:ta he: "drinks hot milk"). The language-dependent concrete syntax ensures that the correct forms of words are selected from the lexicon and that the word order follows the rules of that specific language, while the morphology makes sure that correct forms of words are built during lexicon development.

6. An application: Attempto

An experiment in implementing controlled languages in GF is reported in (Angelov and Ranta, 2010). In this experiment, a grammar for Attempto Controlled English (Attempto, 2008) is implemented and then ported to six languages (English, Finnish, French, German, Italian, and Swedish) using the GF resource library. To demonstrate the usefulness of our grammar and to check its correctness, we have added Urdu to this set. Now we can translate Attempto documents between all of these seven languages. The implementation followed the general recipe for how new languages can be added (Angelov and Ranta, 2009) and created no surprises. However, the details of this implementation are beyond the scope of this paper.

7. Related Work

A suite of Urdu resources was reported in (Humayoun et al., 2006), including a fairly complete open-source Urdu morphology and a small fragment of syntax in GF. In this sense it is a predecessor of the Urdu resource grammar, implemented in a different but related formalism.

Like the GF resource library, the Pargram project (Butt et al., 2007) aims at building a set of parallel grammars including Urdu. The grammars in Pargram are connected with each other by transfer functions, rather than by a common representation. Further, the Urdu grammar is still one of the least implemented grammars in Pargram at the moment. This project is based on the theoretical framework of lexical functional grammar (LFG).

Other than Pargram, most work is based on LFG and the translation is unidirectional, i.e. from English to Urdu only. For instance, an English-to-Urdu MT system is developed under the Urdu Language Localization Project (Hussain, 2004), (Sarfraz and Naseem, 2007) and (Khalid et al., 2009). Similarly, (Zafar and Masood, 2009) reports another English-Urdu MT system developed with an example-based approach. On the other hand, (Sinha and Mahesh, 2009) presents a strategy for deriving Urdu sentences from an English-Hindi MT system; however, it seems to be a partial solution to the problem.

8. Future Work

The common resource grammar API does not cover all aspects of the Urdu language, and non-generalizable language-specific features are supposed to be handled in language-specific modules. In our current implementation of the Urdu resource grammar we have not covered those features. For example, in Urdu it is possible to build a VP from only a VPSlash (the VPSlash category represents an object-missing VP), e.g. (kȹata he: "eats"), without adding the object. This rule is not present in the common API. One direction for future work is to cover such language-specific features. Another direction for future work could be to include the causative forms of the verb, which are not included in the current implementation due to efficiency issues.

9. Conclusion

The resource grammar we developed consists of 44 categories and 190 functions[3].

These cover a fairly large part of the language and are sufficient for building domain-specific application grammars, including multilingual dialogue systems, controlled language translation, software localization, etc. Since a common API for multiple languages is provided, this grammar is useful in applications where we need to parse and translate text from one language to many others.

However, our approach of a common abstract syntax has its limitations and does not cover all aspects of the Urdu language. This is why it is not possible to use our grammar for arbitrary text parsing and generation.

[3] http://www.grammaticalframework.org/lib/doc/synopsis.html

10. References

Angelov K. and Ranta A. 2010. Implementing Controlled Languages in GF. Controlled Natural Language (CNL) 2009, LNCS/LNAI Vol. 5972 (to appear).

Attempto. 2008. Project Homepage. attempto.ifi.uzh.ch/site/

Butt M. 1995. The Structures of Complex Predicates in Urdu. Stanford: CSLI Publications.

Butt M., Dyvik H., King T. H., Masuichi H., and Rohrer C. 2002. The Parallel Grammar Project. In Proceedings of the COLING-2002 Workshop on Grammar Engineering and Evaluation, pp. 1-7.

Butt M. and King T. H. 2007. Urdu in a Parallel Grammar Development Environment. In T. Takenobu and C.-R. Huang (eds.), Language Resources and Evaluation: Special Issue on Asian Language Processing: State of the Art Resources and Processing, 41:191-207.

Forsberg M. and Ranta A. 2004. Functional Morphology. In Proceedings of the Ninth ACM SIGPLAN International Conference on Functional Programming, Snowbird, Utah.

Humayoun M., Hammarström H., and Ranta A. 2007. Urdu Morphology, Orthography and Lexicon Extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, July 21-22, 2007, LSA 2007 Linguistic Institute, Stanford University.

Hussain S. 2004. Urdu Localization Project. COLING 2004 Workshop on Computational Approaches to Arabic Script-based Languages, Geneva, pp. 80-81.

Khalid U., Karamat N., Iqbal S., and Hussain S. 2009. Semi-Automatic Lexical Functional Grammar Development. In Proceedings of the Conference on Language & Technology 2009.

Masica C. 1991. The Indo-Aryan Languages. Cambridge: Cambridge University Press. ISBN 9780521299442.

Ranta A. 2004. Grammatical Framework: A Type-Theoretical Grammar Formalism. The Journal of Functional Programming, 14(2):145-189.

Ranta A. 2009a. The GF Resource Grammar Library: A systematic presentation of the library from the linguistic point of view. To appear in the on-line journal Linguistics in Language Technology.

Ranta A. 2009b. Grammars as Software Libraries. In From Semantics to Computer Science. Cambridge: Cambridge University Press, pp. 281-308.

Rizvi S. M. J. 2007. Development of Algorithms and Computational Grammar of Urdu. Department of Computer & Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.

Sarfraz H. and Naseem T. 2007. Sentence Segmentation and Segment Re-Ordering for English to Urdu Machine Translation. In Proceedings of the Conference on Language and Technology, August 07-11, 2007, University of Peshawar, Pakistan.

Schmidt R. L. 1999. Urdu: An Essential Grammar. Routledge Grammars.

Sinha R. and Mahesh K. 2009. Developing English-Urdu Machine Translation via Hindi. Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), in conjunction with the Twelfth Machine Translation Summit, Ottawa, Ontario, Canada.

Zafar M. and Masood A. 2009. Interactive English to Urdu Machine Translation using an Example-Based Approach. International Journal on Computer Science and Engineering, 1(3):275-282.

A Current Status of Thai Categorial Grammars and Their Applications

Taneth Ruangrajitpakorn and Thepchai Supnithi
Human Language Technology Laboratory
National Electronics and Computer Technology Center

{taneth.ruangrajitpakorn,thepchai.supnithi}@nectec.or.th

Abstract

This paper presents the current status of Thai resources and tools for CG development. We also propose a Thai categorial dependency grammar (CDG), an extended version of CG which includes dependency analysis in the CG notation. Besides, an idea of how to group words that have the same functions is presented, to obtain a definite type of category per word. We also discuss the difficulty of building a treebank, and describe a toolkit assisting in Thai CGs tree building and tree format representation. Finally, we give a summary of applications related to Thai CGs.

1 Introduction

Recently, the CG formalism was applied to several Thai NLP applications, such as syntactic information for Thai to English RBMT (Ruangrajitpakorn et al., 2007), a CG treebank (Ruangrajitpakorn et al., 2009), and an automatic CG tagger (Supnithi et al., 2010). CG shows promise in handling Thai syntax expeditiously, since it can widely control the utilisation of function words, which are the main grammatical expression of Thai. In previous research, CG was employed as a feature for an English to Thai SMT, and it resulted in better accuracy in terms of BLEU score, by 1.05% (Porkaew and Supnithi, 2009). CG was also used in research on translation of noun phrases from English to Thai using phrase-based SMT with CG reordering rules, and it returned 75% better and smoother translations in human evaluation (Porkaew et al., 2009).

Though CG has high potential for immediate constituency analysis of Thai, it still lacks a dependency analysis, which is also important in syntactic parsing. In this paper, we propose a categorial dependency grammar, an upgraded version of CG, to express a dependency relation alongside an immediate constituency bracketing. Moreover, some Thai dependency banks, such as the NAIST dependency bank (Satayamas and Kawtrakul, 2004), have been developed. It would be desirable to interchange data between a Thai CG treebank and a Thai dependency bank in order to increase the amount of data, since building a treebank from scratch has a high cost.

In terms of resources and applications, Thai CG and CDG are still supported by only a few tools. Our CG treebank still contains insufficient data, and its sentences are syntactically simple and do not reflect natural Thai usage. In adding complex Thai trees, we found that practical Thai usage, such as the news domain, contains a large number of words and is very complex. An example of natural Thai text from news, which contains 25 words including nine underlined function words, is given with a translation in Figure 1.

สำำหรับ|กำร|วำง|กำำลัง|ของ|คน|เสื้อ|แดง| |ได้|มี|กำร|วำง|บังเกอร์|รอบ|พื้นที่|ชุมนุม| |เอำ|น้ำำมัน|รำด| |รวมทั้ง|ยำง|รถยนต์|ที่|เสื่อม|สภำพ|แล้ว

lit: The red-shirts have put bunkers around the assembly area and poured oil and worn-out tires.

Figure 1. An example of Thai usage in natural language


We parsed the example in Figure 1 with CG, and our parser returned 1,469 trees. The result is a large number because many Thai structural issues at the syntactic level cause ambiguity.

The first issue is that many Thai words can have multiple functions, including serving a grammatical usage and representing a meaning. For instance, the word "ที่" /tee/ can be a noun, a relative clause marker, a classifier, a preposition, and an adjective marker. The word "คน" /kon/ can refer to a person, serve as a classifier for human beings, and denote an action. The word "กำำลัง" /kumlung/ can serve as an auxiliary verb expressing progressive aspect, and also has a meaning as a noun. A function word is a main grammatical representation, and it hints the analyser towards the overall structure of the context. Regretfully, it is difficult for a system to indicate the Thai function words directly by focusing on the lexical surface and the surrounding lexicons. This circumstance stimulates the over-generation of many improper trees.

The second issue is the problem of Thai verb utilisations. Thai ordinarily allows omitting either the subject or the object of a verb. Moreover, a Thai intransitive verb occasionally attaches its indirect object without a preposition. Furthermore, a Thai adjective is allowed to act as a predicate without a marker. Together with the allowance of verb serialisation, these phenomena make it complex for linguists to design a well-crafted category set for verbs. Therefore, many Thai verbs carry several syntactic categories to serve their many functions.

The last issue is the lack of an explicit boundary for a word, a phrase and a sentence in Thai. Thai word and phrase boundaries are implicit, and a space does not reliably signify a boundary in context. In addition, most modifiers are attached after the core element. This leads to ambiguity in finding the end of a subject with an attached adjective or relative clause, since the verbs in the attachment can be serialised and consequently placed next to the following main verb phrase (which is itself likely to be serialised) without any overt indicator.

With these issues, a parser with only syntactic information merely returns a large number of all possible trees. It becomes difficult and time-consuming for linguists to select the correct one among them. Moreover, with many lexical elements, using a statistical parser has a very low chance of generating a correct tree, and a manual tree construction is also required as a gold standard. Thus, we recently implemented an assistant toolkit for tree construction and tree representation to reduce linguists' workload and time consumption.

This paper aims to explain the current status of resources and tools for CG and CDG development for the Thai language. We also list open tools and applications that relate to CGs.

The rest of the paper is organised as follows. Section 2 presents the Thai categorial grammar and its related formalism. Section 3 explains the status of CGs resources, including the syntactic dictionary and the treebank. Section 4 gives the details of a toolkit which assists linguists in managing and constructing CGs derivation trees and tree representations. Section 5 provides information on applications that involve Thai CGs. Lastly, Section 6 concludes the paper and lists future work.

2 Thai Categorial Grammars

2.1 Categorial Grammar

Categorial grammar (aka CG or classical categorial grammar) (Ajdukiewicz, 1935; Carpenter, 1992; Buszkowski, 1998) is a formalism in natural language syntax motivated by the principle of compositionality and organised according to the syntactic elements. The syntactic elements are categorised in terms of their ability to combine with one another to form larger constituents as functions, or according to a function-argument relationship. CG captures this information by associating a functional type, or category, with all grammatical entities. Each word is assigned at least one syntactic category, denoted by an argument symbol (such as np and num) or by a functional symbol X/Y or X\Y, which requires a Y on the right or on the left, respectively, to form an X.

The basic concept is to find the core of a combination and to replace the grammatical modifier and complement with a set of categories, based on the same concept as the rule of fraction cancellation:

  X/Y  Y  =>  X
  Y  X\Y  =>  X

Upon applying CG to Thai, we modified the argument set and designed eight arguments, shown in Table 1. Compared with the last version, two arguments were additionally designed. The "ut" argument was added to denote an utterance that follows the word "ว่ำ". The word "ว่ำ" has a special function: it lets the word after it act as an exemplified utterance, ignoring its usual category, since it is signified as an example in the context. In contrast, the "ws" argument covers the other sense of the word "ว่ำ", in which it denotes the beginning of a subordinate clause.

The "X" category is used for a punctuation mark or symbol which takes the same category from the left or right side and produces the taken category. For instance, "ๆ" is a marker placed after many types of content words. In detail, this symbol signifies plurality when it is placed after a noun, but it intensifies the degree of meaning when it is placed after an adjective.

In the CG design, we allowed only binary bracketing of two immediate constituents. To handle serial constructions in Thai, including serial verb constructions, we permitted two consecutive constituents with exactly the same category to be combined. For example, the Thai noun phrase 'มติ(np)|คณะรัฐมนตรี(np)' (lit: a consensus of the government) contains two consecutive nouns, without a joining word, forming a noun phrase. Unfortunately, there still remain limits to syntactic parsing in CG, which cannot handle long-distance dependency and word omission at this state.

2.2 Categorial Dependency Grammar

Categorial dependency grammar (CDG) is an extension of CG. CDG differs from CG in that a dependency direction, motivated by Collins (1999), is additionally annotated on each slash notation in a syntactic category. The derivation rules of CDG are listed as follows:

  X/<Y : d1   Y : d2     =>  X : h(d1) → h(d2)
  X/>Y : d1   Y : d2     =>  X : h(d1) ← h(d2)
  Y : d1      X\<Y : d2  =>  X : h(d1) → h(d2)
  Y : d1      X\>Y : d2  =>  X : h(d1) ← h(d2)

where the notations h(d1) → h(d2) and h(d1) ← h(d2) mean a dependency link from the head of the dependency structure d1 to the head of d2, and a link from the head of d2 to the head of d1, respectively. Throughout this paper, a constituent with the syntactic category c and the dependency structure d is represented by c:d.
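The following Python sketch (ours; a toy illustration, not the paper's implementation) shows forward application of a CDG functor of the form X/<Y or X/>Y to an argument Y, returning the result category together with the dependency link. It handles only a single marked slash; backward application with X\<Y or X\>Y is analogous:

  def apply_forward(functor, arg):
      # Combine "X/<Y" or "X/>Y" (d1) with a following "Y" (d2) into X.
      x, slash, rest = functor.partition("/")
      if not slash or not rest:
          return None
      mark, y = rest[0], rest[1:]
      if y != arg:
          return None
      link = "h(d1) -> h(d2)" if mark == "<" else "h(d1) <- h(d2)"
      return x, link

  print(apply_forward("np/>num", "num"))   # ('np', 'h(d1) <- h(d2)')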

Let us exemplify a dependency-driven CDG derivation of the sentence 'Mary drinks fresh milk' in Figure 2. In Figure 2(a), each pair of constituents is combined to form a larger constituent with its head word. Figure 2(b) shows a dependency structure equivalent to the derivation in Figure 2(a).

Compared to PF-CCG (Koller and Kuhlmann, 2009), the difference is that the PF-CCG dependency markers are fixed to the direction of the slashes, while the CDG dependency markers are customised based on the behaviour of a constituent.

CDG offers an efficient way to represent dependency structures alongside syntactic derivations. Apart from immediate constituency analysis, we can also investigate the correspondence between the syntactic derivations and the dependency structures. This benefits linguists in detailing a grammar for a specific language, because we can restrain the grammar at the lexical level.

  argument category | definition | example
  np    | a noun phrase | ช้ำง (elephant), ผม (I, me)
  num   | a digit or a spelled-out number | หนึ่ง (one), 2 (two)
  spnum | a number which succeeds a classifier | นึง (one), เดียว (one)
  pp    | a prepositional phrase | ในรถ (in car), บนโต๊ะ (on table)
  s     | a sentence | ช้ำงกินกล้วย (an elephant eats a banana)
  ws    | a specific category for Thai, assigned to a sentence or a phrase that begins with the Thai word ว่ำ (that: subordinate clause marker) | * ว่ำเขำจะมำสำย 'that he will come late', * ว่ำจะมำสำย 'that (he) will come late'
  ut    | an utterance used to exemplify a specific word after the word ว่ำ | คำำ ว่ำ ดี 'the word "good"'
  X     | an undefined category that takes the same categories from the left or right sides and produces the taken category | เด็ก ๆ (plural marker), สะอำด ๆ (intensifier)

Table 1. A list of Thai CDG arguments

[Figure 2. Syntactic derivation of 'Mary drinks fresh milk' based on CDG]

In this paper, our Thai CG was applied to CDG. For the case of serial construction, we set the left word as the head of the dependency, since Thai modifiers and dependents are ordinarily attached on the right side.

2.3 Categorial Set

A categorial set is a group of lexicons that contain exactly the same function(s), in terms of both the number of their categories and the categories themselves. With a specific surface form, each word is in exactly one categorial set. For example, suppose that we have the following words and categories:

  word        category
  ล้อ, เกำะ    np, s\np/np, np\np/num
  ขวบ         np\np/num

We can then group the given words into categorial sets, for instance:

  set no.  category                   word
  4        np, s\np/np, np\np/num     ล้อ, เกำะ
  5        np\np/num                  ขวบ

Table 2. An example of categorial sets

At the current status, we attain 183 categorial sets in total, and the maximum number of category members in a categorial set is 22 categories.

3 Categorial Grammars Resources

To apply categorial grammars to Thai NLP, a syntactic dictionary and a treebank are mandatory.

3.1 Categorial Grammars Dictionary

For use in other works and research, we collected all CGs information into one syntactic dictionary. An example of the CGs dictionary is shown in Table 3. In summary, our Thai CGs dictionary currently contains 70,193 lexical entries, with 82 categories for both CG and CDG and 183 categorial sets.

  Lexicon  CG                       CDG                          Cset no.
  สมุด     np                       np                           0
  เกำะ     np, s\np/np, np\np/num   np, s\<np/>np, np\>np/>num   15
  พูด      s\np/pp, s\np, s\np/ws,  s\<np/>pp, s\<np, s\<np/>ws, 136
           np\np, np\np/ut          np\>np, np\>np/>ut
  เพรำะ    s\s/s, s/s/s             s\<s/>s, s/>s/>s             43

Table 3. An example of the Thai CGs dictionary
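The set numbers in Tables 2 and 3 can be assigned mechanically; a small Python sketch of the grouping (ours; the data below is taken from Table 2):

  from collections import defaultdict

  lexicon = {
      "ล้อ":  ("np", "s\\np/np", "np\\np/num"),
      "เกำะ": ("np", "s\\np/np", "np\\np/num"),
      "ขวบ":  ("np\\np/num",),
  }

  # Words with exactly the same category set share one categorial set.
  csets = defaultdict(list)
  for word, cats in lexicon.items():
      csets[frozenset(cats)].append(word)

  for no, (cats, words) in enumerate(csets.items()):
      print(no, sorted(cats), words)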

3.2 Thai CDG Treebank

Our CG treebank was recently transformed into dependency-driven derivation trees with CDG. An example derivation tree of the sentence |กำร|ล่ำ|เสือ|เป็น|กำร|ผจญภัย| 'lit: Tiger hunting is an adventure', comparing CG and CDG, is illustrated in Figure 3.

[Figure 3. CG (left) and CDG (right) derivation trees of the example sentence; in the CDG tree every slash carries a dependency direction marker, e.g. np/(s\np)[กำร] in CG corresponds to np/>(s\np)[กำร] in CDG]

We implemented a toolkit to assist linguists in constructing a treebank with such long and complicated sentences. The manually annotated trees will be used as a gold standard and can confidently be applied to statistical parser development.

Building a resource is laboured work, and treebank construction especially so. For the Thai language, which uses several function words to express grammatical functions in context, immediate constituency analysis becomes difficult, since many word pairs can cause ambiguity and complexity.

A snapshot of the dependency-driven derivation tree generator working on immediate constituency and dependency annotation is shown in Figure 4.

[Figure 4. A snapshot of the dependency-driven derivation tree generator]

We provide a user interface for linguists and experts to easily annotate bracket coverings. Users begin the process by selecting a pair of words that are terminals of a leaf node. The system then shows only the categories of the words that can possibly be combined within the bracket, for selection. After the categories of those two constituents are chosen, the system continues the process for the other constituents until one top result category is left. When users finish the bracketing process, the dependency relations are generated from the dependency markers annotated within the categories, without manual assignment.

4.1.3 Tree Visualiser

The system includes a function to create a graphical tree from a file in the textual formats. It provides functions to modify a tree by editing a word's spelling and its syntactic category, and by shifting a branch of the syntactic tree to another position.

4.2 Tree Representation

The CGs toolkit allows users to export a tree output in two representations: a traditional textual tree format and an XML format. Throughout all tree format examples, we use the Thai sentence 'นักวิชำกำร ตรวจ พบ ไวรัส โคโรน่ำ' (lit: an expert discovers corona virus) with the following categories:

  นักวิชำกำร (expert), ไวรัส (virus), โคโรน่ำ (corona) ├ np
  ตรวจ (diagnose), พบ (find) ├ s\np/np

4.2.1 Traditional Textual Tree Format

The traditional textual tree format represents a terminal (w) with its category (c) in the form c[w]. Two constituents, separated by a space, are enclosed in parentheses, and the result category (cr) is placed before the open parenthesis, in the format cr(c[w] c[w]). Figure 5 shows an example of the traditional textual tree format.

  s(np[นักวิชำกำร] s\np(s\np/np(s\np/np[ตรวจ] s\np/np[พบ]) np(np[ไวรัส] np[โคโรน่ำ])))

Figure 5. An example of the traditional textual tree format of 'นักวิชำกำร ตรวจ พบ ไวรัส โคโรน่ำ'
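A small Python sketch (ours) of reading this textual format, for the simple case where categories contain no parentheses (categories such as np/(s\np) would need a fuller tokenizer):

  import re

  TOKEN = re.compile(r'([^\s()\[\]]+)\[([^\]]*)\]|([^\s()\[\]]+)\(|\)')

  def parse(text):
      stack, root = [], []
      for m in TOKEN.finditer(text):
          cat_w, word, cat_open = m.group(1), m.group(2), m.group(3)
          if cat_w:                                  # terminal c[w]
              (stack[-1][1] if stack else root).append((cat_w, word))
          elif cat_open:                             # internal node cr(
              stack.append((cat_open, []))
          else:                                      # ')': close current node
              node = stack.pop()
              (stack[-1][1] if stack else root).append(node)
      return root[0]

  print(parse("np(np[ไวรัส] np[โคโรน่ำ])"))
  # ('np', [('np', 'ไวรัส'), ('np', 'โคโรน่ำ')])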

4.2.2 XML Tree Format

For the XML tree format, we designed three tag sets: a word tag, a tree tag and an input tag. The word tag bounds a terminal to mark a word. The start-tag of a word tag has two attributes: cat, whose value is the assigned category, and text, whose value is the given text. The tree tag marks a combination of either word tags or tree tags forming another result category. It carries the two previous attributes plus an additional head attribute, which records the head-outward relation: the value '0' indicates that the head comes from the left constituent, and '1' indicates that the head comes from the right constituent. The input tag marks the boundary of the whole input, and it has attributes giving the line number, the raw input text, and the status of the tree-building process. Figure 6 illustrates an XML tree representation.
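A sketch (ours) of emitting this format with Python's standard library; the tag and attribute names follow the description above, while the exact schema of the original toolkit is an assumption:

  import xml.etree.ElementTree as ET

  root = ET.Element("input", line="1", text="ไวรัส โคโรน่ำ", status="complete")
  # head="0": the head comes from the left constituent.
  tree = ET.SubElement(root, "tree", cat="np", head="0")
  ET.SubElement(tree, "word", cat="np", text="ไวรัส")
  ET.SubElement(tree, "word", cat="np", text="โคโรน่ำ")
  print(ET.tostring(root, encoding="unicode"))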

[Figure 6. An example of the XML tree format of 'นักวิชำกำร ตรวจ พบ ไวรัส โคโรน่ำ']

5 Thai CGs Related Applications

Several applications that relate to Thai CGs, or that use Thai CGs as their syntactic information, have been implemented recently. Below is a summary of their methodology and results.

5.1 CG AutoTagger for Thai

To reduce the number of trees generated by a parser from all possible categories, an automatic syntactic category tagger (Supnithi et al., 2010) was developed to disambiguate inappropriate combinations of categories. The system is based on CRF and a Statistical Alignment Model based on information theory (SAM). An accuracy of 89.25% at the word level was obtained. The system also has a function to predict a syntactic category for an unknown word, and 79.67% of unknown words are predicted correctly.

5.2 Chunker

Because of the problem of long sentences in Thai, a chunker was implemented to group sequences of words into larger units, in order to reduce the difficulty of parsing too many lexical elements. A CRF method with syntactic information from CG and categorial sets was applied in the system to chunk a text into noun phrases, verb phrases, prepositional phrases, and adverbial phrases. Moreover, the system also attempts to handle compound words that have a sentence-like form. The result was impressive, achieving an accuracy of 74.17% on sentence-level chunking and 58.65% on sentence-form-like compound nouns.

5.3 GLR parser for Thai CG and CDG

Our previously implemented LALR parser (Aho and Johnson, 1974) was improved to a GLR parser for syntactically parsing Thai text. This parser was developed to return all possible trees for an input, providing a baseline that covers all syntactic possibilities. For our GLR parser, the grammar rules are not determined manually but are produced automatically from the syntactic notations aligned with the lexicons in a dictionary. Hence, this GLR parser covers parsing with both the CG and the CDG formalism. Furthermore, our GLR parser accepts a sentence, a noun phrase, a verb phrase or a prepositional phrase. The parser does not return only the best-first tree but all parsable trees, to gather all ambiguous readings, since Thai tends to be ambiguous because of the lack of explicit sentence, phrase and word boundaries. The parser includes a pre-process to handle named entities, numerical expressions and time expressions.

6 Conclusion and Future Work

In this paper, we updated our Thai CG information and the status of its resources. We also proposed CDG for Thai, an extended version of CG. CDG offers an efficient way to represent dependency structures together with syntactic derivations. It benefits linguists in that they can restrain the Thai grammar at the lexical level. With CDG dependency-driven derivation trees, both bracketing information and dependency relations are annotated on every lexical unit. In the current state, we have transformed our CG dictionary and CG treebank into the CDG formalism.

In an attempt to increase the size of our treebank with complex text, a CDG tree toolkit was developed for linguists to manually manage derivation trees. This toolkit includes a CDG category tagger tool, a dependency-driven derivation tree generator, and a tree visualiser. The toolkit can generate output in two formats: the traditional textual tree and the XML tree. The XML tree format is an option for a standardised format and for further usage, such as applying trees to ontology. We also summarised CG-related works and their accuracy, including an automatic CG tagger and a Thai phrase chunker.

In the future, we plan to increase the number of CGs derivation trees for complex sentences and practical language. Moreover, we will implement a system to transform an existing Thai dependency bank into the CDG format to obtain a larger number of trees. We also plan to include semantic meaning in the derivation trees and to represent such trees in an RDF format. In addition, a statistical parser will be implemented based on the CDG derivation trees.

References

Aho A. and Johnson S. 1974. LR Parsing. Computing Surveys, 6(2).

Ajdukiewicz K. 1935. Die syntaktische Konnexität. In Polish Logic.

Bar-Hillel Y. 1953. A quasi-arithmetical notation for syntactic description. Language, 29(1):47-58.

Carpenter B. 1992. Categorial Grammars, Lexical Rules, and the English Predicative. In R. Levine (ed.), Formal Grammar: Theory and Implementation. OUP.

Collins M. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Koller A. and Kuhlmann M. 2009. Dependency trees and the strong generative capacity of CCG. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: 460-468.

Kosawat K., Boriboon M., Chootrakool P., Chotimongkol A., Klaithin S., Kongyoung S., Kriengket K., Phaholphinyo S., Purodakananda S., Thanakulwarapas T., and Wutiwiwatchai C. 2009. BEST 2009: Thai Word Segmentation Software Contest. In The 8th International Symposium on Natural Language Processing: 83-88.

Porkaew P., Ruangrajitpakorn T., Trakultaweekoon K., and Supnithi T. 2009. Translation of Noun Phrase from English to Thai using Phrase-based SMT with CCG Reordering Rules. In Proceedings of the 11th Conference of the Pacific Association for Computational Linguistics (PACLING).

Porkaew P. and Supnithi T. 2009. Factored Translation Model in English-to-Thai Translation. In Proceedings of the 8th International Symposium on Natural Language Processing.

Ruangrajitpakorn T., Na Chai W., Boonkwan P., Boriboon M., and Supnithi T. 2007. The Design of Lexical Information for Thai to English MT. In Proceedings of the 7th International Symposium on Natural Language Processing.

Ruangrajitpakorn T., Trakultaweekoon K., and Supnithi T. 2009. A Syntactic Resource for Thai: CG Treebank. In Proceedings of the 7th Workshop on Asian Language Resources (ACL-IJCNLP): 96-102.

Satayamas V. and Kawtrakul A. 2004. Wide-Coverage Grammar Extraction from Thai Treebank. In Proceedings of the Papillon 2004 Workshops on Multilingual Lexical Databases, Grenoble, France.

Supnithi T., Ruangrajitpakorn T., Trakultaweekoon K., and Porkaew P. 2010. AutoTagTCG: A Framework for Automatic Thai CG Tagging. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC).

Chained Machine Translation Using Morphemes as Pivot Language

Wen Li, Lei Chen, Wudabala, Miao Li
Institute of Intelligent Machines, Chinese Academy of Sciences
(Wen Li is also with the University of Science and Technology of China)

Abstract

As the smallest meaning-bearing elements of languages with rich morphological information, morphemes are often integrated into state-of-the-art statistical machine translation to improve translation quality. This paper proposes an approach that makes novel use of morphemes as a pivot language in a chained machine translation system. A machine translation based method is used therein to find the mapping relations between morphemes and words. Experiments show the effectiveness of our approach, achieving an 18.6 percent increase in BLEU score over the baseline phrase-based machine translation system.

1 Introduction

Recently, most evaluations of machine translation systems (Callison-Burch et al., 2009) indicate that the performance of corpus-based statistical machine translation (SMT) has come up to that of the traditional rule-based method. In corpus-based SMT, it is difficult to select exactly the correct inflections (word endings) if the target language is highly inflected. This problem becomes more severe if the source language is an isolating language without morphology (e.g. Chinese) and the target language is an agglutinative language with productive derivational and inflectional morphology (e.g. Mongolian, a minority language of China). In addition, the lack of a large-scale parallel corpus may cause a sparse data problem, which again becomes more severe if one of the source language and the target language is highly inflected. As the smallest meaning-bearing elements of languages with rich morphological information, morphemes are the compact representation of words. Using morphemes as the semantic units in the parallel corpus can not only help choose the correct inflections, but also partially alleviate the data sparseness problem.

Many strategies for integrating morphological information into state-of-the-art SMT systems at different stages have been proposed. (Ramanathan et al., 2009) proposed a preprocessing approach for incorporating syntactic and morphological information within a phrase-based English-Hindi SMT system. (Watanabe et al., 2006) proposed a method which uses Porter stems and even 4-letter prefixes for word alignment. (Koehn et al., 2007) proposed the factored translation models, which combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model during training. (Minkov et al., 2007) made use of information about morphological structure and the source language in postprocessing to improve SMT quality. (de Gispert et al., 2009) adopted a Minimum Bayes Risk decoding strategy to combine the output of identical SMT systems trained on alternative morphological decompositions of the source language.

Meanwhile, SMT-based methods are widely used in the area of natural language processing. (Quirk et al., 2004) applied SMT to generate novel paraphrases. (Riezler et al., 2007) adopted an SMT-based method for query expansion in answer retrieval.

(Jiang and Zhou, 2008) used SMT to generate the second sentence of Chinese couplets.

As opposed to the above strategies, this paper proposes an approach that uses morphemes as a pivot language in a chained SMT system for translating Chinese into Mongolian, which consists of two SMT systems. First, Chinese sentences are translated into Mongolian morphemes instead of Mongolian words by the Chinese-Morphemes SMT (SMT1). Then Mongolian words are generated from morphemes by the Morphemes-Mongolian SMT (SMT2). The essential part of the chained SMT system is how to find the mapping relations between the morphemes and words, which is considered as a machine translation procedure in our approach. More concretely, the first challenge of this approach is to investigate effective strategies for segmenting the Mongolian side of the Chinese-Mongolian parallel corpus, and the second challenge is how to efficiently generate Mongolian words from morphemes. Additionally, on the one hand Mongolian words may have multiple possible morphological segmentations; on the other hand there is also an ambiguity of word boundaries in the process of generating Mongolian words from morphemes. To resolve these ambiguities, an SMT-based method is applied, in which word context and morpheme context can be taken into account.

The remainder of the paper is organized as follows. Section 2 introduces two methods of morphological segmentation. Section 3 presents the details of the chained SMT system. Section 4 describes the experimental results and evaluation. Section 5 gives concluding remarks.

2 Morphological segmentation

Mongolian is a highly agglutinative language with a rich set of affixes. Mongolian contains about 30,000 stems and 297 distinct affixes. A large growth in the number of possible word forms may occur due to inflectional and derivational productions. An inflectional suffix is a terminal affix that does not change the part of speech of the root during concatenation; it is added to maintain the syntactic environment of the root. For instance, the Mongolian word "YABVGSAN" (walking), in the present continuous tense syntactic environment, consists of the root "YABV" (walk) and the suffix "GSAN" (-ing). In contrast, when the verb root "UILED" (do) concatenates with the noun derivational suffix "BURI", it becomes the noun "UILEDBURI" (factory). According to whether linguistic lemmatization (the reduction to base form) is considered or not, this paper proposes two methods of morphological segmentation. The two methods are tested on the same training databases.

Root lemmatization is considered in the first method, which is called SMT-based morphological segmentation (SMT-MS) in this paper. Given the Mongolian-morphemes parallel corpus, this method trains a Mongolian-morphemes SMT to segment Mongolian words. Root lemmatization is present in the original morphologically pre-segmented training corpus, so the SMT-based method can also deal with root lemmatization when it segments a Mongolian word. For instance, the Mongolian word "BAYIG_A" exhibits a change of spelling during the concatenation of the morphemes "BAI" and "G_A". We also investigate whether it is effective to keep the roots identical to the original word forms. In other words, root lemmatization is ignored in the second method, which takes the gold-standard morphological segmentation corpus as a trained model for Morfessor (Creutz and Lagus, 2007) and uses the Viterbi decoding algorithm to segment new words. Therefore, this method is called Morfessor-based morphological segmentation (Mor-MS). For instance, the word "BAYIG_A" will be segmented into "BAYI" and "G_A" instead of "BAI" and "G_A".

The mathematical description of SMT-MS is the same as that of a traditional machine translation system. In the Mor-MS method, the morphological segmentation of a word can be regarded as a flat tree (a morphological segmentation tree), where the root node corresponds to the whole word and the leaves correspond to the morphemes of this word. Figure 1 gives an example.

First, the joint probabilistic distribution (Creutz and Lagus, 2007) of all morphemes in the morphological segmentation tree is calculated. Then, by using the Viterbi decoding algorithm, the maximum-probability segmentation combination is selected.

[Figure 1: Morphological segmentation tree; the word $0G0DB0RILAGDAHV-ACA is the root node, with the morphemes $0G0DB0RILA, GDA, HV and -ACA as its leaves]
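A toy Python sketch (ours) of this selection step: among candidate segmentations of a word, pick the one with the highest total morpheme log-probability (a stand-in for the Morfessor model probability; the real system uses Morfessor's Viterbi decoding over its learned lexicon):

  import math

  def best_segmentation(candidates, morph_logprob, unseen=-20.0):
      # Score a candidate segmentation by the sum of its morpheme
      # log-probabilities; unseen morphemes get a fixed penalty.
      def score(seg):
          return sum(morph_logprob.get(m, unseen) for m in seg)
      return max(candidates, key=score)

  logp = {"BAYI": math.log(0.05), "G_A": math.log(0.2),
          "BAYIG_A": math.log(0.001)}
  print(best_segmentation([("BAYIG_A",), ("BAYI", "G_A")], logp))
  # ('BAYI', 'G_A')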

3 Chained SMT system

3.1 Overview

In order to improve the performance of Chinese-Mongolian machine translation, this paper proposes an approach which incorporates morphological information within a chained SMT system. More concretely, this system first translates Chinese into Mongolian morphemes instead of Mongolian words by the Chinese-Morphemes SMT, and then uses the Morphemes-Mongolian SMT to translate the Mongolian morphemes into Mongolian words. Namely, morphemes are regarded as a pivot language in this system.

The chained SMT system consists of a morphological segmentation system and two phrase-based machine translation systems, as follows:

  • Morphological segmentation: segmenting the Mongolian words (from the Chinese-Mongolian parallel corpus) into Mongolian morphemes, obtaining two parallel corpora: the Chinese-Morphemes parallel corpus and the Morphemes-Mongolian parallel corpus.
  • SMT1: training the Chinese-Morphemes SMT on the Chinese-Morphemes parallel corpus.
  • SMT2: training the Morphemes-Mongolian SMT on the Morphemes-Mongolian parallel corpus.

Figure 2 illustrates the overview of the chained SMT system.

3.2 Phrase-based SMT

The authors assume the reader to be familiar with current approaches to machine translation, so we only briefly introduce the phrase-based statistical machine translation model (Koehn et al., 2003), which is the foundation of the chained SMT system.

In statistical machine translation, given a source-language sentence f, the aim is to seek a target-language sentence e such that P(e|f) is maximized. The phrase-based translation model can be expressed by the following formula:

  e* = argmax_e P(e|f) = argmax_e { P(f|e) P(e) }

where e* indicates the best result, P(e) is the language model and P(f|e) is the translation model. According to the standard log-linear model proposed by (Och and Ney, 2002), the best result e* that maximizes P(e|f) can be expressed as follows:

  e* = argmax_e { Σ_{m=1}^{M} λ_m h_m(e, f) }

where M is the number of feature functions, λ_m is the corresponding feature weight, and each h_m(e, f) is a feature function.

In our chained SMT system, SMT1, SMT2 and the SMT for morphological segmentation (namely SMT-MS in Section 2) are all phrase-based SMTs.
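A minimal Python sketch (ours) of this decision rule: score every candidate translation by the weighted sum of its feature values and return the argmax. The feature names and values are illustrative only:

  def decode(candidates, weights):
      # candidates: list of (e, {feature_name: h_value}); weights: {name: λ_m}
      def score(item):
          _, feats = item
          return sum(weights[name] * val for name, val in feats.items())
      return max(candidates, key=score)[0]

  weights = {"tm": 1.0, "lm": 0.5, "wp": -0.3}
  candidates = [
      ("e1", {"tm": -2.1, "lm": -4.0, "wp": 5.0}),
      ("e2", {"tm": -1.8, "lm": -4.5, "wp": 6.0}),
  ]
  print(decode(candidates, weights))   # e1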

[Figure 2: Morphemes as pivot language in the chained SMT system. The diagram shows the training-time flow, in which the Mongolian side of the Chinese-Mongolian corpus is morphologically segmented into a morpheme corpus, from which SMT1 (Chinese-Morphemes) and SMT2 (Morphemes-Mongolian) are trained, and the translation-time flow: Chinese (input) to SMT1, to morphemes, to SMT2, to Mongolian (output).]

3.3 Features of Chained SMT system

As shown in Figure 2, Chinese is translated into Mongolian morphemes in SMT1, which is the core part of the chained SMT system. Here morphemes are regarded as words. Therefore, morphemes can play important roles in SMT1 as follows: the roots present the meaning of the word, and the suffixes help select the correct grammatical environment. The word alignments between Chinese words and Mongolian morphemes are learned automatically by GIZA++. Figure 3 gives an instance of word alignment in SMT1.

  Chinese:    wo he ni yiqi qu meiyou yidian bangzhu
  Morphemes:  BI TAN -TAI 0CI GAD -CV NEMERI ALAG_A

Figure 3: Word alignments between Chinese words and Mongolian morphemes in SMT1

We can see that the morphemes "BI", "TAN" etc. are all regarded as words.

All the most commonly used features of standard phrase-based SMT, including the phrase translation model, language model, distortion model and word penalty, are selected in SMT1. These commonly used features together determine the quality of the translation. The phrase translation model P(f|e) ensures that the phrases of f and e are good translations of each other. The language model LM(e) guarantees fluent output. The distortion model D(e, f) allows reordering of the input sentence, though translation becomes more expensive with more reordering. The word penalty W(e) ensures that the translation results are neither too long nor too short.

In SMT-MS and SMT2, the task is to find the mapping relations between Mongolian morphemes and Mongolian words, which is considered as word-for-word translation. Therefore, only the phrase translation model and the language model are used, and all feature weights are uniform by default. Mongolian words may have multiple possible morphological segmentations, and there is an ambiguity of word boundaries in the process of generating Mongolian words from morphemes. These ambiguities can be resolved in SMT-MS and SMT2 respectively, since the SMT-based method can endure mapping errors and resolve mapping ambiguities through its multiple features, which take the context of Mongolian words into account.

4 Experiments

4.1 Experimental setup

In the experiments, we first preprocess the corpus, e.g. converting Mongolian into Latin Mongolian and filtering apparently noisy segmentations out of the gold-standard morphological segmentation corpus. Then we evaluate the effectiveness of the SMTs which find the mapping relations between the morphemes and their corresponding word forms, namely SMT-MS and SMT2. As mentioned above, SMT1 is the core part of the chained SMT system and decides the final quality of the translation results, so the evaluation of SMT1 is reflected in the evaluation of the translation results of the whole chained SMT system. Finally, we evaluate and analyze the performance of the chained SMT system using automatic evaluation tools.

The translation model consists of a standard phrase table with lexicalized reordering. Bidirectional word alignments obtained with GIZA++ are intersected using the grow-diag-final heuristic (Koehn et al., 2003). Translations of phrases of up to 7 words are collected and scored with translation probabilities and lexical weighting. The language model over morphemes is a 5-gram model with Kneser-Ney smoothing.

172 smoothing. The language model of Mongo- from morphemes. To evaluate the effectiveness lian word is 3-gram model with Kneser-Ney of SMT-MS and SMT2, we divide the filtered smoothing too. All the language models are gold standard corpus into two sets for training built with the SRI language modeling toolkit (90%) and testing (10%) respectively. The cor- (Stolcke, 2002). The log-linear model feature rect morpheme boundaries are counted for SMT- weights are learned by using minimum error rate MS evaluation, while the correct word bound- training (MERT) (Och, 2003) with BLEU score aries are counted for SMT2 evaluation. We use (Papineni et al., 2002) as the objective function. the two measures precision and recall on dis- covered word boundaries to evaluate the effec- 4.2 Corpus preprocessing tiveness of SMT-MS and SMT2, where precision The Chinese-Mongolian parallel corpus and is the proportion of correctly discovered bound- monolingual sentences are obtain from the 5th aries among all discovered boundaries by the al- China Workshop on Machine Translation. In gorithm, and recall is the proportion of correctly the experiments, we first convert Mongolian discovered boundaries among all correct bound- corpus into Latin Mongolian. In morphologi- aries. A high precision indicates that a mor- cal segmentation, the gold standard morpholog- pheme boundary is probably correct when it is ical segmentation corpus contains 38000 Mon- suggested. However the proportion of missed golian sentences, which are produced semi- boundaries can not be obtained from it. A high automatically by using the morphological ana- recall indicates that most of the desired bound- lyzer Darhan (Nashunwukoutu, 1997) of Inner aries were indeed discovered. However it can not Mongolia University. Moreover, in order to ob- point out how many incorrect boundaries were tain the higher quality corpus, most of the wrong suggested either. In order to get a comprehensive segmentation in the results of morphological an- idea, we also make use of the evaluation method: alyzer are modified manually by the linguistic F-measure as a compromise. experts. However, there are still some wrong 1 segmentation in the gold standard corpus. There- F-measure = 1 1 1 fore, we adopt a strategy to filter the apparent ( + ) 2 precision recall noisy segmentation. In this strategy, the sum of the lengths of all the morphemes is required to These measures assume values between zero and be equivalent to the length of the original word. 100%, where high values reflect good perfor- After filtering, there are still 37967 sentences re- mance. Therefore, we evaluate the SMT-based mained. In addition, the word alignment is vul- methods by incrementally evaluating the features nerable to punctuation in SMT-MS. So all punc- used in our phrase-based SMT model. tuation of the gold standard morphological seg- Table 1 gives the evaluation results, where mentation corpus are removed to eliminate some PTM denotes Phrase Translation Model, LW de- mistakes of the word alignment. notes Lexical Weight, LM denotes Language Meanwhile, since the Chinese language Model, IPTM denotes Inverted PTM, ILW de- does not have explicit word boundaries, we notes Inverted LW. Table 1(a) and Table 1(b) also need to do the segmentation of Chinese are corresponding to the evaluations of SMT-MS words. The word segmentation tool ICTCLAS and SMT2 respectively, where P , R and F de- (Zhang, 2008) is used in the experiments. note the three measures, namely precision, recall and F-measure. 
4.3 Evaluation of SMT-MS and SMT2

The tasks of SMT-MS and SMT2 are to find the mapping relations between morphemes and their corresponding word forms: morphological segmentation is done by SMT-MS, while SMT2 works in the opposite direction, generating words from morphemes. To evaluate the effectiveness of SMT-MS and SMT2, we divide the filtered gold-standard corpus into two sets for training (90%) and testing (10%). Correct morpheme boundaries are counted for the SMT-MS evaluation, while correct word boundaries are counted for the SMT2 evaluation. We use precision and recall on the discovered boundaries, where precision is the proportion of correctly discovered boundaries among all boundaries discovered by the algorithm, and recall is the proportion of correctly discovered boundaries among all correct boundaries. A high precision indicates that a suggested boundary is probably correct, but says nothing about the proportion of missed boundaries. A high recall indicates that most of the desired boundaries were indeed discovered, but does not reveal how many incorrect boundaries were suggested. To get a comprehensive picture, we therefore also use the F-measure as a compromise:

    F-measure = 1 / ((1/2) (1/precision + 1/recall))

i.e., the harmonic mean of precision and recall. These measures take values between zero and 100%, where high values reflect good performance. We evaluate the SMT-based methods by adding the features of our phrase-based SMT model incrementally.

Table 1 gives the evaluation results, where PTM denotes the Phrase Translation Model, LW the Lexical Weight, LM the Language Model, IPTM the Inverted PTM, and ILW the Inverted LW. Table 1(a) and Table 1(b) correspond to the evaluations of SMT-MS and SMT2 respectively, where P, R and F denote precision, recall and F-measure.

The results show that as we add more features incrementally, precision, recall and F-measure improve consistently. This indicates that the features are helpful for finding the mapping relations between morphemes and Mongolian words.
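As a worked sketch of these boundary measures (ours, not the paper's evaluation script; boundaries are represented as hypothetical integer offsets into the letter sequence of a word):

    def boundary_prf(gold, predicted):
        # Precision, recall and F-measure over sets of boundary positions.
        correct = len(gold & predicted)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        # F = 1 / ((1/2) (1/p + 1/r)), the harmonic mean of p and r
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Gold boundaries after letters 2 and 5; the system finds the one
    # after letter 2 but also suggests a wrong one after letter 4.
    print(boundary_prf({2, 5}, {2, 4}))  # (0.5, 0.5, 0.5)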

Table 1: Evaluation of SMT-MS and SMT2

(a) Evaluation of SMT-MS
Feature              P (%)   R (%)   F (%)
(1): PTM+LW          73.35   72.45   72.90
(2): (1)+LM          94.91   94.91   94.91
(3): (2)+IPTM+ILW    94.95   94.95   94.95

(b) Evaluation of SMT2
Feature              P (%)   R (%)   F (%)
(1): PTM+LW          75.86   60.04   67.03
(2): (1)+LM          95.13   89.92   92.45
(3): (2)+IPTM+ILW    95.13   90.02   92.51

Table 2: Evaluation of systems

(a) without MERT
System     NIST     BLEU (%)
Baseline   5.3586   20.71
Chain1     5.6471   23.91
Chain2     5.6565   24.57

(b) with MERT
System     NIST     BLEU (%)
Baseline   5.6911   24.13
Chain1     5.7439   24.70
Chain2     5.8401   25.80

4.4 Evaluation of chained SMT system

We use the NIST score (Doddington, 2002) and the BLEU score (Papineni et al., 2002) to evaluate the chained SMT system. The training set contains 67288 Chinese-Mongolian parallel sentences. The test set contains 400 sentences, each with four reference sentences translated by native experts.

In the training phase, we convert the Mongolian text into Latin Mongolian; in the test phase, we convert the Latin Mongolian back into traditional Mongolian words. We compare the chained SMT system with the standard phrase-based SMT. Table 2 gives the evaluation results for each system, where Baseline is the standard phrase-based SMT, Chain1 is a chained SMT consisting of SMT-MS, SMT1 and SMT2, and Chain2 is a chained SMT consisting of Mor-MS, SMT1 and SMT2. In Table 2(b), we use MERT to train the feature weights of the baseline system and the feature weights of SMT1 in Chain1 and Chain2.

The experiment results show that both Chain1 and Chain2 are much better than the baseline system: the BLEU score improves by 18.6 percent relative, from 20.71 (Baseline) to 24.57 (Chain2). In addition, Chain2 is better than Chain1. We believe this is essentially related to their different morpheme corpora: the morpheme corpus of Chain1 takes lemmatization into account, while that of Chain2 changes all morphemes to inflected forms identical to the original word forms. As in the example in Section 2, the word "BAYIG A" is segmented into "BAI+G A" in Chain1 and into "BAYI+G A" in Chain2. Meanwhile, "BAI" is an independent Mongolian word in the corpus, so Chain1 cannot discriminate the word "BAI" from the morpheme "BAI".

As is well known, the translation quality of SMT relies on the performance of morphological segmentation. We give the following examples to show the translation quality of the chained SMT system intuitively.

Example 1. Table 3 gives four examples of translating Chinese into Mongolian. In each example, four reference sentences translated by native experts are also given. These examples indicate that the chained SMT system can help choose the correct inflections and can partly alleviate the data sparseness problem.

In Table 3(a), the Mongolian word "HAGAS" (corresponding to the Chinese word "yiban") has multiple inflectional forms, as follows:

    Mongolian     Chinese
    HAGAS-VN      yi bande
    HAGAS-IYAR    yiban de
    HAGAS-TV      zaiban
    HAGAS-I       ba ban

From this example, we can see that the baseline system translates the Chinese word "ban" to the incorrect inflection "HAGAS-TV", while Chain2 translates it to the correct inflection "HAGAS", which is the stem morpheme underlying all the other inflections.
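Before turning to the examples in Table 3, the following schematic (ours; segment_zh, smt1, smt2 and to_traditional are hypothetical stand-ins for the components described above) summarizes the data flow through the chain:

    def translate_chained(chinese, segment_zh, smt1, smt2, to_traditional):
        zh_words = segment_zh(chinese)   # ICTCLAS-style word segmentation
        morphemes = smt1(zh_words)       # SMT1: Chinese words -> Mongolian morphemes
        latin = smt2(morphemes)          # SMT2: morphemes -> Latin Mongolian words
        return to_traditional(latin)     # convert back to traditional Mongolian

    # Toy stand-ins, just to show the data flow through the chain:
    out = translate_chained(
        "xianzai shi jiu dian ban .",
        segment_zh=lambda s: s,
        smt1=lambda s: "0D0 B0L YISUN CAG HAGAS B0LJV BAYIN +A .",
        smt2=lambda s: "0D0 B0L YISUN CAG HAGAS B0LJV BAYIN A .",
        to_traditional=lambda s: s)
    print(out)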

Table 3: Examples of translating Chinese into Mongolian

(a) Lexicalization of morphemes
Chinese     xianzai shi jiu dian ban .
Baseline    0D0 B0L YISUN CAG HAGAS-TV.
Chain1      0D0 B0L YISUN CAG HAGAS-TV.
Chain2      0D0 B0L YISUN CAG HAGAS B0LJV BAYIN A.
References  0D0 YISUN CAG HAGAS B0LJV BAYIN A.
            0D0 YISUN CAG HAGAS.
            0D0 YISUN CAG HAGAS B0LBA.
            0D0 YISUN CAG HAGAS B0LJV BAYIN A.

(b) Tense
Chinese     qunian zheshihou ni zai ganshenme ?
Baseline    NIDVNVN ENE HIRI CI YAGV HIJU BAYIHV BVI?
Chain1      NIDVNVN ENE UYES CI YAGV HIJU BAYIHV BVI?
Chain2      NIDVNVN ENE UY E-DU CI YAGV HIJU BAYIG A BVI?
References  NIDVNVN-V ENE UYE-DU TA YAGV HIJU BAYIG A BVI?
            NIDVNVN ENE UY E-DU TA YAGV HIJU BAYIBA?
            NIDVNVN JIL-VN ENE UYES TA YAGV HIJU BAYIBA?
            0D0 NIDVNVN-V ENE UYE-DU TA YAGV HIJU BAYIG A BVI?

(c) Syntax
Chinese     wo xiwang jinnian dong tian hui xiaxue .
Baseline    BI ENE JIL EBUL-UN EDUR-UN CASV 0R0JV B0L0N A.
Chain1      BI ENE EBUL-UN EDUR CASV 0R0HV-YI HUSEJU BAYIN A.
Chain2      BI ENE EBUL-UN EDUR CASV 0R0HV-YI HUSEJU BAYIN A.
References  BI ENE EBUL CAS 0R0HV-YI HUSEJU BAYIN A.
            ENE EBUL CASV 0R0HV-YI HUSEJU BAYIN A.
            BI ENE EBUL CASV 0R0HV-YI HUSEN E.
            BI ENE EBUL CAS 0R0HV-YI HUSEJU BAYIN A.

(d) Out-Of-Vocabulary words
Chinese     wo guoqu chang yidazao chuqu sanbu .
Baseline    ... URGULJI yidazao GADAN AGARCVSELEGUCENALHVLABA.
Chain1      ... URGULJI BODORIHU-BER GADAGVR ALHVLAN A.
Chain2      ... URGULJI ORLOGE ERTE GARCV ALHVLAN A.
References  ... URGULJI OROLGE ERTE GARCV AVHVDAG.
            ... URGULJI ORLOGE ERTE GADAGVR ALHVLADAG.
            ... YERU NI ORLOGE ERTE B0S0GAD GADAGVR ALHVLADAG.
            ... URGULJI OROLGE ERTE GARCV AVHVDAG.

In Table 3(b), the word "BAYIN" in the result of the baseline system indicates a past tense environment. However, the correct environment is the past continuous tense, which is indicated by the word "BAYIN A" appearing in the results of Chain1 and Chain2.

In Table 3(c), the baseline system translates "dongtian" into "EDUR-UN" as an attribute, while the correct translation should be "EDUR" as the subject of the object clause.

Statistics from the word alignment corpus show that the vocabulary of the baseline system includes 376203 Chinese-Mongolian word pairs, while Chain1 and Chain2 contain 326847 and 291957 Chinese-morpheme pairs respectively. This indicates that the chained SMT system can partly alleviate the data sparseness problem. As shown in Table 3(d), the baseline system cannot translate the word "yidazao", while Chain1 and Chain2 can.

5 Concluding remarks

This paper proposes a chained SMT system that uses morphemes as a pivot language for translating an isolating language with almost no morphology into an agglutinative language with productive derivational and inflectional morphology. The experiment results show that the performance of the chained SMT system is encouraging, and that the SMT-based method is quite effective for finding the mapping relations between morphemes and words. As more features are added, precision, recall and F-measure all improve noticeably.

In the future, we will consider using a confusion network or lattice over the N-best translation results, instead of the single best translation, in the chained SMT system. Meanwhile, the distortion of morpheme order in Mongolian is still obscure and needs more investigation. Applying the approach to other language pairs, such as English-to-French and English-to-Spanish translation, is also of interest.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful reviews. The work is supported by the National Key Technology R&D Program of China under No. 2009BAH41B06 and the Dean Foundation (2009) of Hefei Institutes of Physical Science, Chinese Academy of Sciences.

References

Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 workshop on statistical machine translation. In StatMT, pages 1–28.

Creutz, Mathias and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. TSLP, 4(1):1–34.

de Gispert, Adrià, Sami Virpioja, Mikko Kurimo, and William Byrne. 2009. Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. In HLT, pages 73–76.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In HLT, pages 128–132.

Jiang, Long and Ming Zhou. 2008. Generating Chinese couplets using a statistical MT approach. In COLING, pages 377–384.

Koehn, Philipp, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In HLT-NAACL, pages 48–54.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In ACL, pages 177–180.

Minkov, Einat, Kristina Toutanova, and Hisami Suzuki. 2007. Generating complex morphology for machine translation. In ACL, pages 128–135.

Nashunwukoutu. 1997. An automatic segmentation system for the root, stem, suffix of the Mongolian. Journal of Inner Mongolia University, 29(2):53–57.

Och, Franz Josef and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In ACL, pages 295–302.

Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In ACL, pages 160–167.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318.

Quirk, Chris, Chris Brockett, and William B. Dolan. 2004. Monolingual machine translation for paraphrase generation. In EMNLP, pages 142–149.

Ramanathan, Ananthakrishnan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: addressing the crux of the fluency problem in English-Hindi SMT. In ACL-IJCNLP, pages 800–808.

Riezler, Stefan, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu O. Mittal, and Yi Liu. 2007. Statistical machine translation for query expansion in answer retrieval. In ACL, pages 464–471.

Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In Proc. Intl. Conf. on Spoken Language Processing, pages 901–904.

Watanabe, Taro, Hajime Tsukada, and Hideki Isozaki. 2006. NTT system description for the WMT2006 shared task. In WMT, pages 122–125.

Zhang, Huaping. 2008. ICTCLAS. http://ictclas.org/.

Author Index

Acarturk, Cengiz, 111
Akram, Misbah, 88
Ali, Wajid, 137
Bandyopadhyay, Sivaji, 47, 56
Begum, Rafiya, 120
Belz, Anja, 38
Bond, Francis, 144
Budiono, Budiono, 9
Choejey, Pema, 95
Chungku, Chungku, 103
Das, Amitava, 56
Das, Dipankar, 47
Dendup, Tenzin, 95
Faaß, Gertrud, 103
Fujita, Atsushi, 1
GAO, Yahui, 22
Gao, Yan, 72
Hakim, Chairil, 9
Haruechaiyasak, Choochart, 64
Humayoun, Muhammad, 153
Hussain, Sarmad, 88, 95, 137
Iida, Ryu, 38
Inui, Kentaro, 1
Kaji, Hiroyuki, 30
Kameyama, Wataru, 80
Kawtrakul, Asanee, 129
Kim, Jong-Bok, 144
Kongthon, Alisa, 64
Kwon, Hyuk-Chul, 14
Lei, Chen, 169
LI, Jihong, 22
Miao, Li, 169
Morris, David, 38
Muaz, Ahmed, 95
Norbu, Sithar, 95
Onman, Chanon, 129
Palingoon, Pornpimon, 64
Park, Heum, 14
Park, Woo Chul, 14
Porkaew, Peerachet, 129
Rabgay, Jurmey, 103
Ranta, Aarne, 153
Riza, Hammam, 9
Ruangrajitpakorn, Taneth, 129, 161
Sangkeettrakarn, Chatchawal, 64
Sharma, Dipti Misra, 120
Song, Sanghoun, 144
Supnithi, Thepchai, 129, 161
Takeuchi, Koichi, 1
Takeuchi, Nao, 1
Terai, Asuka, 38
Tokunaga, Takenobu, 38
Trakultaweekoon, Kanokorn, 129
Tsunakawa, Takashi, 30
Van, Channa, 80
Virk, Shafqat Mumtaz, 153
WANG, Ruibo, 22
Wen, Li, 169
Wudabala, Han, 169
Yang, Erhong, 72
Yang, Jaehyung, 144
Yasuhara, Masaaki, 38
Yoon, Ae sun, 14
Zeng, Qingqing, 72
Zeyrek, Deniz, 111
Zou, Hongjian, 72
