IJCNLP 2008

The 6th Workshop on Asian Language Resources (ALR 6)

Proceedings of the Workshop

11-12 January 2008 Indian School of Business, Hyderabad, India °c 2008 Asian Federation of Natural Language Processing

Sponsor

Special Coordination Funds for Promoting Science and Technology, Ministry of Education, Culture, Sport, Science and Technology, MEXT Japan. Preface

This volume contains the papers presented at the sixth workshop on Asian Language Resources, held on 11–12 January 2008 in conjunction with the third International Joint Conference on Natural Langauge Processing (IJCNLP 2008). Language resources have played an essential role in empirical approaches natural language processing (NLP) for the last two decades. Previous concerted efforts on construction of language resources, particu- larly in the US and European countries, have laid solid foundation for the pioneering NLP researches in these two communities. In comparison, the availability and accessibility of many Asian language resources are still very limited except for a few languages. Moreover, there is a greater diversity in Asian languages with respect to character sets, grammatical properties and the cultural background. Motivated by such a context, have organised a series of workshops on Asian language resources since 2001. This workshop series has contributed to the activation of the NLP research in Asia particularly of building and utilising corpora of various types and languages. In this sixth workshop, we had 31 submis- sions encompassing 13 languages. The paper selection was highly competitive compared with the last five workshops. The program committee selected 10 regular papers, 3 short papers and 8 resource reports for presentation at the workshop. The workshop is comprised of two parts, technical sessions and a session devoted to reporting activities related to language resources in several languages. Following the resource report session, we have an open discussion on the collaboration in building, standardising and exchanging language resources in Asia. We hope this workshop further accelerates the already thriving NLP research in Asia.

Chu-Ren Huang Mikami Yoshiki Workshop Co-chairs

Hasida Koitiˆ Tokunaga Takenobu Program Co-chairs

Organiser

Workshop chairs Huang, Chu-Ren Academia Sinica Mikami, Yoshiki Nagaoka University of Technology

Program chairs Hasida, Koitiˆ National Institute of Advanced Industrial Science and Technology Tokunaga, Takenobu Tokyo Institute of Technology

Program Committee Bhattacharyya, Pushpak IIT, Bombay Fang, Alex Chengyu City University of Hong Kong Riza, Hammam IPTEKnet–BPPT Hasida, Koitiˆ National Institute of Advanced Industrial Science and Technology , Tingting Huazhong Normal University Huang, Chu-Ren Academia Sinica Hussain, Sarmad National University of Computer & Emerging Sciences Itahashi, Shuichi National Institute of Informatics Lu, Qin Hong Kong Polytechnic University Luong, Mai National Center for Sciences and Technologies of Vietnam Mikami, Yoshiki Nagaoka University of Technology Nandasara, Shakrange Turrance University of Colombo, School of Computing Nguyen, Thi Minh Huyen Hanoi University of Sciences Oo, Thein Myanmar Computer Federation Rau, Victoria Providence University Rim, Hae-Chang Korea University Roxas, Rachel Edita De La Salle University, Manila Shirai, Kiyoaki Japan Advanced Institute of Science and Technology Sornlertlamvanich, Virach Thai Computational Linguistics Laboratory, NICT Sui, Zhifang Peking University Tokunaga, Takenobu Tokyo Institute of Technology Vikas, Om Indian Institute of Information Technology and Management Zhao, Jun Chinese Academy of Sciences

This workshop is supported by Special Coordination Funds for Promoting Science and Technology, Min- istry of Education, Culture, Sport, Science and Technology, MEXT Japan.

ii Workshop Program 11-12 January 2008 Indian School of Business, Hyderabad, India

Day 1 (11 January)

9:00 Registration 9:20 Opening 9:30 Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems Asif Ekbal and Sivaji Bandyopadhyay 9:55 Gazetteer Preparation for Named Entity Recognition in Indian Languages Sujan Kumar Saha, Sudeshna Sarkar and Pabitra Mitra 10:20 Preliminary Chinese Term Classification for Ontology Construction Gaoying Cui, Qin Lu and Wenjie Li 10:45 Break 11:05 Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms Makiko Matsuda, Tomoe Takahashi, Hiroki Goto, Yoshikazu Hayase, Robin Lee Nagano and Yoshiki Mikami 11:30 Selection of XML tag set for Myanmar National Corpus Wunna Ko and Thin Zar Phyo 11:55 Myanmar Word Segmentation using Syllable level Longest Matching Hla Hla Htay and Kavi Narayana Murthy 12:20 Lunch 13:50 The Link Structure of Language Communities and its Implication for Language-specific Crawling Rizza Caminero and Yoshiki Mikami 14:15 A Multilingual Multimedia Indian Sign Language Dictionary Tool Tirthankar Dasgupta, Sambit Shukla, Sandeep Kumar, Synny Diwakar and Anupam Basu 14:40 A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus Deniz Zeyrek and Bonnie Webber 15:05 Towards an Annotated Corpus of Discourse Relations in Hindi Rashmi Prasad, Samar Husain, Dipti Sharma and Aravind Joshi 15:30 Break 15:50 A Semantic Study on Yami Ontology in Traditional Songs Yin-Sheng Tai, D. Victoria Rau and Meng-Chien Yang 16:05 Assessment and Development of POS Tag Set for Telugu Rama Sree R.J., Uma Maheswara Rao G and Madhu Murthy K.V. 16:20 Designing a Common POS-Tagset Framework for Indian Languages Sankaran Baskaran, Kalika Bali, Tanmoy Bhattacharya, Pushpak Bhattacharyya, Girish Nath Jha, Rajendran S, Saravanan K, Sobha L and Subbarao K V

iii Day 2 (12 January)

9:00 Resources Report on Languages of Indonesia Hammam Riza 9:15 Confirmed Language Resource for Answering How Type Questions Developed by Using Mails Posted to a Mailing List Ryo Nishimura, Yasuhiko Watanabe and Yoshihiro Okada 9:30 Corpus building for Mongolian language Purev Jaimai and Odbayar Chimeddorj 9:45 Resources for Urdu Language Processing Sarmad Hussain 10:00 Balanced Corpus of Contemporary Written Japanese Kikuo Maekawa 10:15 Break 10:35 A Basic Framework to Build a Test Collection for the Vietnamese Text Catergorization Viet Hoang-Anh, Thu Dinh-Thi-Phuong and Thang Huynh-Quyet 10:50 Enhanced Tools for Online Collaborative Language Resource Development Virach Sornlertlamvanich, Thatsanee Charoenporn, Suphanut Thayaboon, Chumpol Mokarat and Hitoshi Isahara 11:05 Japanese Effort Toward Sharing Text and Speech Corpora Shuichi Itahashi and Koitiˆ Hasida 11:20 Open Discussion 12:20 Closing

iv Table of Contents hRegular papersi Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems Asif Ekbal and Sivaji Bandyopadhyay ...... 1 Gazetteer Preparation for Named Entity Recognition in Indian Languages Sujan Kumar Saha, Sudeshna Sarkar and Pabitra Mitra ...... 9 Preliminary Chinese Term Classification for Ontology Construction Gaoying Cui, Qin Lu and Wenjie Li ...... 17 Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms Makiko Matsuda, Tomoe Takahashi, Hiroki Goto, Yoshikazu Hayase, Robin Lee Nagano and Yoshiki Mikami ...... 25 Selection of XML tag set for Myanmar National Corpus Wunna Ko Ko and Thin Zar Phyo ...... 33 Myanmar Word Segmentation using Syllable level Longest Matching Hla Hla Htay and Kavi Narayana Murthy ...... 41 The Link Structure of Language Communities and its Implication for Language-specific Crawling Rizza Caminero and Yoshiki Mikami ...... 49 A Multilingual Multimedia Indian Sign Language Dictionary Tool Tirthankar Dasgupta, Sambit Shukla, Sandeep Kumar, Synny Diwakar and Anupam Basu ...... 57 A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus Deniz Zeyrek and Bonnie Webber ...... 65 Towards an Annotated Corpus of Discourse Relations in Hindi Rashmi Prasad, Samar Husain, Dipti Sharma and Aravind Joshi ...... 73 hShort papersi A Semantic Study on Yami Ontology in Traditional Songs Yin-Sheng Tai, D. Victoria Rau and Meng-Chien Yang ...... 81 Assessment and Development of POS Tag Set for Telugu Rama Sree R.J., Uma Maheswara Rao G and Madhu Murthy K.V...... 85 Designing a Common POS-Tagset Framework for Indian Languages Sankaran Baskaran, Kalika Bali, Tanmoy Bhattacharya, Pushpak Bhattacharyya, Girish Nath Jha, - jendran S, Saravanan K, Sobha L and Subbarao K V ...... 89 hResource reportsi Resources Report on Languages of Indonesia Hammam Riza ...... 93 Confirmed Language Resource for Answering How Type Questions Developed by Using Mails Posted to a Mailing List Ryo Nishimura, Yasuhiko Watanabe and Yoshihiro Okada ...... 95

v Corpus building for Mongolian language Purev Jaimai and Odbayar Chimeddorj ...... 97 Resources for Urdu Language Processing Sarmad Hussain ...... 99 Balanced Corpus of Contemporary Written Japanese Kikuo Maekawa ...... 101 A Basic Framework to Build a Test Collection for the Vietnamese Text Catergorization Viet Hoang-Anh, Thu Dinh-Thi-Phuong and Thang Huynh-Quyet ...... 103 Enhanced Tools for Online Collaborative Language Resource Development Virach Sornlertlamvanich, Thatsanee Charoenporn, Suphanut Thayaboon, Chumpol Mokarat and - toshi Isahara ...... 105 Japanese Effort Toward Sharing Text and Speech Corpora Shuichi Itahashi and Koitiˆ Hasida ...... 107

vi The 6th Workshop on Asian Languae Resources, 2008

Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems

Asif Ekbal Sivaji Bandyopadhyay Department of Computer Science and Department of Computer Science and Engineering, Jadavpur University Engineering, Jadavpur University Kolkata-700032, India Kolkata-700032, India [email protected] [email protected]

tional Random Field (CRF) and Support Abstract Vector Machine (SVM). Experimental - sults of the 10-fold cross validation test The rapid development of language tools have demonstrated that the SVM based using machine learning techniques for less NER system performs the best with an computerized languages requires appropri- overall F-Score of 91.8%. ately tagged corpus. A Bengali news cor- pus has been developed from the web ar- 1 Introduction chive of a widely read Bengali newspaper. A web crawler retrieves the web pages in The mode of language technology work has been Hyper Text Markup Language (HTML) changed dramatically since the last few years with format from the news archive. At present, the web being used as a data source in a wide the corpus contains approximately 34 mil- range of research activities. The web is anarchic, lion wordforms. The date, location, re- and its use is not in the familiar territory of compu- porter and agency tags present in the web tational linguistics. The web walked into the ACL pages have been automatically named en- meetings started in 1999. The use of the web as a tity () tagged. A portion of this partially corpus for teaching and research on language tech- NE tagged corpus has been manually anno- nology has been proposed a number of times tated with the sixteen NE tags with the help (Rundel, 2000; Fletcher, 2001; Robb, 2003; 1 of Sanchay Editor , a text editor for Indian Fletcher, 2003). There is a long history of creating languages. This NE tagged corpus contains a standard for western language resources. The 150K wordforms. Additionally, 30K word- human language technology (HLT) society in forms have been manually annotated with Europe has been particularly zealous for the stan- the twelve NE tags as part of the IJCNLP- dardization, making a series of attempts such as 08 NER Shared Task for South and South EAGLES3, PROLE/SIMPLE (Lenci et al., 2000), 2 East Asian Languages . A table driven ISLE/MILE (Calzolari et al., 2003; Bertagna et al., semi-automatic NE tag conversion routine 2004) and more recently multilingual lexical data- has been developed in order to convert the base generation from parallel texts in 20 European sixteen-NE tagged corpus to the twelve-NE languages (Giguet and Luquet, 2006). On the other tagged corpus. The 150K NE tagged cor- hand, in spite of having great linguistic and cul- pus has been used to develop Named Entity tural diversities, Asian language resources have Recognition (NER) system in Bengali us- received much less attention than their western ing pattern directed shallow parsing ap- counterparts. A new project (Takenobou et al., proach, Hidden Markov Model (HMM), 2006) has been started to create a common stan- Maximum Entropy () Model, Condi- dard for Asian language resources. They have ex- tended an existing description framework, the

1 sourceforge.net/project/nlp-sanchay 2 http://ltrc.iiit.ac.in/ner-ssea-08 3 http://www.ilc.cnr.it/Eagles96/home.html

1 The 6th Workshop on Asian Languae Resources, 2008

MILE (Bertagna et al., 2004), to describe several tion of the corpus has been manually annotated lexical entries of Japanese, Chinese and Thai. India with the twelve NE tags as part of the IJCNLP-08 is a multilingual country with the enormous cul- NER Shared Task for South and South East Asian tural diversities. (Bharati et al., 2001) reports on Languages. A table driven semi-automatic NE tag efforts to create lexical resources such as transfer conversion routine has been developed in order to lexicon and grammar from English to several In- convert this corpus to a form tagged with the dian languages and dependency tree bank of anno- twelve NE tags. The NE tagged corpus has been tated corpora for several Indian languages. Corpus used to develop Named Entity Recognition (NER) development work from web can be found in (Ek- system in Bengali using pattern directed shallow bal and Bandyopadhyay, 2007d) for Bengali. parsing approach, HMM, ME, CRF and SVM Named Entity Recognition (NER) is one of the frameworks. core components in most Information Extraction A number of detailed experiments have been (IE) and Text Mining systems. During the last few conducted to identify the best set of features for years, the probabilistic machine learning methods NER in Bengali. The ME, CRF and SVM based have become state of the art for NER (Bikel et al., NER models make use of the language independ- 1999; Chieu and Ng, 2002) and for field extraction ent as well as language dependent features. The (McCallum et al., 2000). Most prominently, Hid- language independent features could be applicable den Markov Models (HMMs) have been used for for NER in other Indian languages also. The sys- the information extraction task (Bikel et al., 1999). tem has demonstrated the highest F-Score value of Beside HMM, there are other systems based on 91.8% with the SVM framework. One possible Support Vector Machine (Sun et al., 2003) and reason behind its best performance may be the Naïve Bayes (De Sitter and Daelemans, 2003). flexibility of the SVM framework to handle the Maximum Entropy (ME) conditional models like morphologically rich Indian languages. ME Markov models (McCallum et al., 2000) and Conditional Random Fields (CRFs) (Lafferty et al., 2 Development of the Named Entity 2001) were reported to outperform the generative Tagged Bengali News Corpus HMM models on several IE tasks. The existing works in the area of NER are 2.1 Language Resource Acquisition mostly in non-Indian languages. There has been a A web crawler has been developed that retrieves very little work in the area of NER in Indian lan- the web pages in Hyper Text Markup Language guages (ILs). In ILs, particularly in Bengali, the (HTML) format from the news archive of a leading work in NER can be found in (Ekbal and Bengali newspaper within a range of dates Bandyopadhyay, 2007a; Ekbal and Bandyop- provided as input. The crawler generates the adhyay, 2007b; Ekbal et al., 2007c). Other than Universal Resource Locator (URL) address for the Bengali, the work on NER can be found in (Li and index (first) page of any particular date. The index McCallum, 2003) for Hindi. page contains actual news page links and links to Newspaper is a huge source of readily available some other pages (.g., Advertisement, TV documents. In the present work, the corpus has schedule, Tender, Comics and Weather etc.) that been developed from the web archive of a very do not contribute to the corpus generation. The well known and widely read Bengali newspaper. HTML files that contain news documents are Bengali is the seventh popular language in the identified and the rest of the HTML files are not world, second in India and the national language of considered further. Bangladesh. A code conversion routine has been written that converts the proprietary codes used in 2.2 Language Resource Creation the newspaper into the standard Indian Script Code The HTML files that contain news documents are for Information Interchange (ISCII) form, which identified by the web crawler and require cleaning can be processed for various tasks. A separate code to extract the Bengali text to be stored in the conversion routine has been developed for convert- corpus along with relevant details. The HTML file ing ISCII codes to UTF-8 codes. A portion of this is scanned from the beginning to look for tags like corpus has been manually annotated with the six- ..., teen NE tags as described in Table 3. Another por- where the BENGALI_FONT_NAME is the name

2 The 6th Workshop on Asian Languae Resources, 2008

of one of the Bengali font faces as defined in the and the various date expressions that appear in the news archive. The Bengali text enclosed within fixed places of the newspaper. These tags are not font tags are retrieved and stored in the database able to recognize the various NEs that appear after appropriate tagging. Pictures, captions and within the actual news body. tables may exist anywhere within the actual news. In order to identify NEs within the actual news Tables are integral part of the news item. The body, we have defined a tagset consisting of pictures, its captions and other HTML tags that are seventeen tags. We have considered the major four not relevant to our text processing tasks are NE classes, namely ‘Person name’, ‘Location discarded during the file cleaning. The Bengali name’, ‘Organization name’ and ‘Miscellaneous news corpus has been developed in both ISCII and name’. Miscellaneous names include the date, UTF-8 codes. The tagged news corpus contains time, number, percentage and monetary 108,305 number of news documents with about expressions. The four major NE classes are further five (5) years (2001-2005) of news data collection. divided in order to properly denote each Some statistics about the tagged news corpus are component of the multiword NEs. The NE tagset is presented in Table 1. shown in Table 3 with the appropriate examples. We have also tagged a portion of the corpus as Total number of news documents 108, 305 part of the IJCNLP-08 NER Shared Task for South in the corpus and South East Asian Languages. This tagset has Total number of sentences in the 2, 822, 737 twelve different tags. The underlying reason for corpus adopting these tags was the necessity of a slightly Avgerage number of sentences in 27 finer tagset for various natural language processing a document (NLP) applications and particularly for machine Total number of wordforms in 33, 836, 736 translation. The IJCNLP-08 NER shared task the corpus tagset is shown in Table 4. Avgerage number of wordforms 313 One important aspect of IJCNLP-08 NER in a document shared task was to annotate only the maximal NEs Total number of distinct 467, 858 and not the structures of the entities. For example, wordforms in the corpus mahatma gandhi road is annotated as location and Table 1. Corpus Statistics assigned the tag ‘NEL’ even if mahatma and gan- dhi are NE title person and person name, respec- 2.3 Language Resource Annotation tively, according to the IJCNLP-08 shared task The Bengali news corpus collected from the web is tagset. These internal structures of the entities need annotated using a tagset that includes the type and to be identified during testing. , mahatma gan- subtype of the news, title, date, reporter or agency dhi road will be tagged as mahatma/NETP gan- dhi/NEP road/NEL. The structure of the tagged name, news location and the body of the news. A 5 part of this corpus is then tagged with a tagset, element using the Shakti Standard Format (SSF) consisting of sixteen NE tags and one non-NE tag. will be as follows: Also, a part of the corpus has been tagged with a 1 (( NP tagest of twelve NE tags4, defined for the IJCNLP- 1.1 (( NP 08 NER Shared Task for South and South East 1.1.1 (( NP Asian Languages. 1.1.1.1 mahatma A news corpus, whether in Bengali or in any )) other language, has different parts like title, date, 1.1.2 gandhi reporter, location, body etc. To identify these parts )) in a news corpus the tagset, described in Table 2, 1.2 road has been defined. The reporter, agency, location, )) date, bd, day and ed tags help to recognize the person name, organization name, location name

4http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=3 5http://shiva.iiit.ac.in/SPSAL 2007/ssf.html

3 The 6th Workshop on Asian Languae Resources, 2008

2.4 Partially Tagged News Corpus Develop- dates are stored in the form ‘day month year’. ment A sequence of four Bengali digits separates the Bengali date from the Bengali day. The English A news document is stored in the corpus in XML date starts with one/two digits in Bengali font. format using the tagset, mentioned in Table 2. In Bengali date, day and English date can be distin- the HTML news file, the date is stored at first and guished by checking the appearance of the numer- is divided into three parts. The first one is the date als and these are tagged as , and , according to Bengali calendar, second one is the respectively. For e.g., 25 sraban 1412 budhbar 10 day in Bengali and the last one is the date accord- august 2005 is tagged as shown in Table 5. ing to English calendar. Both Bengali and English

Tag Definition Tag Definition Tag Definition header Header of the news day Day body Body of the news documents document title Headline of the news ed English date p Paragraph document t1 1st headline of the reporter Reporter name table Information in tabular title form t2 agency Agency tc 2nd headline of the providing news Table column title date location News location tr Table row Date of the news document bd Bengali date Table 2. News Corpus Tagset

Tag Meaning Example PER Single word person name sachin / PER, manmohan/PER LOC Single word location name jadavpur / LOC, delhi/LOC ORG Single word organization name infosys / ORG, tifr/ORG MISC Single word miscellaneous name 100% / MISC, 100/MISC B-PER Beginning, Internal or the end of sachin/ B-PER ramesh / I-PER I-PER a multiword person name tendulkar / E- PER E-PER B-LOC Beginning, Internal or the end of mahatma/ B-LOC gandhi / I-LOC road / E-LOC I-LOC a multiword location name E-LOC B-ORG Beginning, Internal or the end of bhaba / B-ORG atomic / I-ORG I-ORG a multiword organization name research / I-ORG centre / E-ORG E-ORG B-MISC Beginning, Internal or the end of 10 e / B-MISC magh / I-MISC I-MISC a multiword miscellaneous name 1402 / E-MISC E-MISC NNE Words that are not named entities neta/NNE, bidhansabha/NNE

Table 3. Named Entity Tagset

4 The 6th Workshop on Asian Languae Resources, 2008

NE tag Meaning Example NEP Person name sachin ramesh tendul- kar / NEP 1 biganni NNE NEL Location mahatma gandhi road / 2 newton PER name NEL 3 . NEO Organization bhaba atomic name research centre / NEO . NED Designation chairman/NED, sangsad/NED NEA Abbreviation b a/NEA, c m d a/NEA, Another portion of the partially NE tagged b j p/NEA Bengali news corpus has been manually annotated NEB Brand fanta/NEB, as part of the IJCNLP-08 NER shared task with windows/NEB the twelve NE tags, as listed in Table 4. The NETP Title-person sriman/NED, sree/NED annotation process has been very difficult due to NETO Title-object american beauty/NETO the presence of a number of ambiguous NE tags. NEN Number 10/NEN, dash/NEN The potential ambiguous NE tags are: NED vs NETP, NEO vs NEB, NETE vs NETO, NETO vs NEM Measure tin din/NEM, panch NETP and NEN vs NEM. For example, it is keji/NEM difficult to decide whether ‘Agriculture’ is NETE Terms hidden markov ‘NETE’, and if then whether ‘Horticulture’ is model/NETE ‘NETE’ or not. In fact, this the most difficult class NETI Time 10 e magh 1402/ to identify. This NE tagged corpus contains NETI approximately 30K wordforms. Details statistics Table 4. IJCNLP-08 NER Shared Task Tagset of this tagged corpus are shown in Table 7. This

NE tagged corpus is in the following SSF form. Original date pattern Tagged date pattern 25 sraban 1412 25 sraban 1412 1 (( NP budhbar budhbar 1.1 (( NP 1.1.1 biganni 10 august 2005 10 august 2005 )) 1.1.2 newton NEP )) Table 5. Example of a Tagged Date Pattern 2 . 2.5 Named Entity Tagged Corpus Development The partially NE tagged corpus contains 34 NE Class Number of Number of dis- million wordforms and are in both ISCII and UTF- wordforms tinct wordforms 8 forms. A portion of this corpus, containing 150K Person name 20, 455 15, 663 wordforms, has been manually annotated with the Location 11, 668 5, 579 sixteen NE tags that are listed in Table 3. The name corpus has been annotated with the help of Organization 963 867 Sanchay Editor, a text editor for Indian languages. name The detailed statistics of this NE-tagged corpus are Miscellane- 11,554 3, 227 given in Table 6. The corpus is in SSF form, ous name which has the following structure: Table 6. Statistics of the 150K-tagged Corpus

5 The 6th Workshop on Asian Languae Resources, 2008

2.6 Tag Conversion tern directed shallow parsing, HMM, ME, CRF and SVM frameworks. A tag conversion routine has been developed in order to convert the sixteen-NE tagged corpus of NE Class Number of Number of dis- 150K wordforms to the corpus, tagged with the wordforms tinct wordforms IJCNLP-08 twelve-NE tags. This conversion is a semi-automatic process. Some of our sixteen NE Person name 5, 123 3, 201 tags can be automatically mapped to some of the Location 1, 675 1, 119 IJCNLP-08 shared task tags. The tags that repre- name sent person, location and organization names can Organization 168 131 be directly mapped to the NEP, NEL and NEO name tags, respectively. Other IJCNLP-08 shared task Designation 231 102 tags can be obtained with the help of gazetteer lists Abbreviation 32 21 and simple heuristics. In our earlier NER experi- Brand 15 12 ments, we have already developed a number of Title-person 79 51 gazetteer lists such as: lists of person, location and Title-object 63 42 organization names; list of prefix words (e.g., sree, Number 324 126 sriman etc.) that predict the left boundary of a per- Measure 54 31 son name; list of designation words (e.g., mantri, Time 337 212 sangsad etc.) that helps to identify person names. Terms 46 29 The lists of prefix and designation words are help- Table 7. Statistics of the 30K-tagged Corpus ful to find the NETP and NED tags. The sixteen- NE tagged corpus is searched for the person name Sixteen-NE tagset IJCNLP-08 tags and the previous word is matched against the tagset lists of prefix and designation words. The previous PER, LOC, ORG NEP, NEL, word is tagged as NED or NETP if there is a NEO match with the lists of designation words and pre- B-PER, I-PER, E-PER NEP fix words, respectively. The NEN and NETI tags B-LOC, I-LOC, E-LOC NEL can be obtained by looking at our miscellaneous B-ORG, I-ORG, E-ORG NEO tags and using some simple heuristics. The NEN MISC NEN tags can be simply obtained by checking whether B-MISC, I-MISC, E-MISC NETI, NEM the MISC tagged element consists of digits only. Brand name gazetteer NEB The lists of cardinal and ordinal numbers have Title-object gazetteer NETO been also kept to recognize NETI tags. A list of Term gazetteer NETE words that denote the measurements (e.g., kilo- Person prefix word + PER/B- NETP gram, taka, dollar etc.) has been kept in order to PER, I-PER, E-PER get the NEM tag. The lists of words, denoting the Designation word +PER/B- NED brand names, title-objects and terms, have been PER, I-PER, E-PER prepared to get the NEB, NETO and NETE tags. The NEA tags can be obtained with the help of a Abbreviation gazetteer + NEA gazetteer list and using some simple heuristics Heuristics (whether the word contains the dot and there is no Table 8. Tagset Mapping Table space between the characters). The mapping from our NE tagset to the IJCNLP-08 NER shared task tagset is shown in Table 8. We have considered the sixteen NE tags to de- velop these systems. Named entity recognition in Indian Languages (ILs) in general and particularly 3 Use of Language Resources in Bengali is difficult and challenging as there is no concept of capitalization in ILs. The NE tagged news corpus, developed in this The Bengali NER systems that use pattern di- work, has been used to develop the Named Entity rected shallow parsing approach can be found in Recognition (NER) systems in Bengali using pat-

6 The 6th Workshop on Asian Languae Resources, 2008

(Ekbal and Bandyopadhyay, 2007a) and (Ekbal •CRF based NER: Contextual window of size five and Bandyopadhyay, 2007b). An HMM-based (current, previous two words and the next two Bengali NER system can be found in (Ekbal and words), prefix and suffix of length upto three of the Bandyopadhyay, 2007c). These systems have been current word, POS information of window three trained and tested with the corpus tagged with the (current word, previous word and the next word), sixteen NE tags. NE information of the previous word (dynamic A number of experiments have been conducted feature), different digit features and the various in order to find out the best feature set for NER in gazetteer lists. Bengali using the ME, CRF and SVM frameworks. •SVM based NER: Contextual window of size six In all these experiments, we have used a number of (current, previous three words and the next two gazetteer lists such as: first names (72, 206 en- words), prefix and suffix of length upto three of the tries), middle names (1,491 entries) and sur names current word, POS information of window three (5,288 entries) of persons; prefix (245 entries) and (current word, previous word and the next word), designation words (947 entries) that help to detect NE information of the previous two words (dy- person names; suffixes (45 and 23 entries) that namic feature), different digit features and the vari- help to identify person and location names; clue ous gazetteer lists. words (94 entries) that help to detect organization Evaluation results of the 10-fold cross validation names; location name (7, 870 entries) and organi- test for all the models are presented in Table 9. zation name (2,225 entries). Apart from these gaz- Evaluation results clearly show that the SVM etteer lists, we have used the prefix and suffix based NER model outperforms other models due to (may not be linguistically meaningful suf- it’s efficiency to handle the non-independent, di- fix/prefix) features, digit features, first word fea- verse and overlapping features of Bengali lan- ture and part of speech information of the words guage. etc. We have used the C++ based Maximum En- tropy package6, C++ based OpenNLP CRF++ pack- Model F-Score (in %) age7 and open source YamCha8 tool for ME based A 74.5 NER, CRF based NER and SVM based NER, re- B 77.9 spectively. For SVM based NER system, we have HMM 84.5 used TinySVM 9 classifier, pair wise multi-class ME 87.4 decision method and the second-degree polynomial CRF 90.7 kernel function. The brief descriptions of all the SVM 91.8 models are given below: Table 9.Results of 10-fold Cross Validation Test •A: Pattern directed shallow parsing approach without linguistic knowledge. 4 Conclusion •B: Pattern directed shallow parsing approach with linguistic knowledge. In this work we have developed a Bengali news •HMM based NER: Trigram model with additional corpus of approximately 34 million wordforms context dependency, NE suffix lists for handling from the web archive of a leading Bengali newspa- unknown words. per. The date, location, reporter and agency tags present in the web pages have been automatically •ME based NER: Contextual window of size three (current, previous and the next word), prefix and NE tagged. Around 150K wordforms of this par- suffix of length upto three of the current word, tially NE tagged corpus has been manually anno- POS information of the current word, NE informa- tated with the sixteen NE tags. We have also tion of the previous word (dynamic feature), dif- tagged around 30K wordforms with the twelve NE ferent digit features and the various gazetteer liststs. tags, defined for the IJCNLP-08 NER shared task. A tag conversion routine has also been developed

6 in order to convert any sixteen-NE tagged corpus http://homepages.inf.ed.ac.uk/s0450736/software/maxe to the twelve-NE tagged corpus. The sixteen-NE nt/maxent-20061005.tar.bz2 tagged corpus of 150K wordforms has been used to 7 http://crfpp.sourceforge.net 8 http://chasen.org/~taku/software/yamcha/ 9http://cl.aist-nara.ac.jp/taku-ku/software/TinySVM

7 The 6th Workshop on Asian Languae Resources, 2008

develop the NER systems using various machine- ognition. Language Resources and Evaluation Jour- learning approaches. nal (Accepted and to appear by December 2007). This NE tagged corpus can be used for other Fletcher, W. H. 2001. Making the Web More Useful as NLP research activities such as machine Source for Linguistics Corpora. In Ulla Conor and translation, information retrieval, cross-lingual Thomas A. Upton (eds.), Applied corpus Linguistics: event tracking, automatic summarization etc. A Multidimensional Perspective, 191-205. Fletcher, W. H. 2003. Concording the Web with References KwiCFinder. In Proceedings of the Third North Bertagna, M. and A. Lenci, M. Monachini and . Cal- American Symposium on Corpus Linguistics and zolari. 2004. CotentInteroperability of Lexical Re- Language Teaching, Boston. sources, Open Issues and “MILE” Perspectives, In Giguet, E., and P. Luquet. 2006. Multilingual Lexical Proceedings of the LREC, 131-134. Database Generation from Parallel Texts in 20 Euro- Bharthi, A., D.M. Sharma, V. Chaitnya, A. P. Kulkarni pean Languages with Endogeneous Resources. In and R. Sanghal. 2001. LERIL: Collaborative Effort Proceedings of the COLING/ACL, Sydney, 271-278. for Creating Lexical Resources. In Proceedings of Lafferty, J., McCallum, A., and Pereira, F. 2001. Condi- th the 6 NLP Pacific Rim Symposium Post-Conference tional Random Fields: Probabilistic Models for Seg- Workshop, Japan. menting and Labeling Sequence Data, In Proceedings th Bikel, D. M., Schwartz, R., Weischedel, R. M. 1999. An of the 18 International Conference on Machine Algorithm that Learns What’s in a Name. Machine Learning, 282-289. Learning, 34, 211-231. Lenci, A., N. Bel, F. Busu, N. Calzolari, E. Gola, M. Calzolari, N., F. Bertagna, A. Lenci and M. Monachni. Monachini, A. Monachini, A. Ogonowski, I. Peters, 2003. Standards and best Practice for Miltilingual W. Peters, N. Ruimy, M. Villegas and A. Zampolli. Computational Lexicons, MILE (the multilingual 2000. SIMPLE: A general Framework for the Devel- ISLE lexical entry). ISLE Deliverable D2.2 &3.2. opment of Multilingual Lexicons. International Journal of Lexicography, Special Issue, Dictionaries, Chieu, H. L., Tou Ng, H. 2002. Named Entity Recogni- Thesauri and Lexical-Semantic Relations, XIII(4): tion: A Maximum Entropy Approach Using Global 249-263. Information, In Proc. of the 6th Workshop on Very Large Corpora. Li, Wei and Andrew McCallum. 2004. Rapid Develop- ment of Hindi Named Entity Recognition Using De Sitter, A., Daelemans W. 2003. Information Extrac- Conditional Random Fields and Feature Inductions. tion via Double Classification. In Proeedings of In- ACM TALIP, Vol. 2(3), 290-294. ternational Workshop on Adaptive Text Extraction and Mining, Dubronik. McCallum, A., Freitag, D., Pereira, F. 2000. Maximum Entropy Markov Models for Information Extraction Ekbal, Asif, and S. Bandyopadhyay. 2007a. Pattern and Segmentation. In Proceedings of the 17th Inter- Based Bootstrapping Method for Named Entity Rec- national Conference Machine Learning. ognition. In Proceedings of the 6th International Con- ference on Advances in Pattern Recognition, 2007, Robb, T. 2003. Google as a Corpus Tool? ETJ Journal, India, 349-355. 4(1), Spring 2003. Ekbal, Asif, and S. Bandyopadhyay. 2007b. Lexical Rundell, M. 2000. The Biggest Corpus of All. Humanis- Pattern Learning from Corpus Data for Named Entity ing Language Teaching, 2(3). th Recognition. In Proceedings of the 5 International Sun, A., et al. 2003. Using Support Vector Machine for Conference on Natural Language Processing Terrorism Information Extraction. In Proceedings of (ICON), India, 123-128. 1st NSF/NIJ Symposium on Intelligence and Security. Ekbal, Asif, Naskar, Sudip and S. Bandyopadhyay. Takenobou, T., V. Sornlertlamvanich, T. Charoenporn, 2007c. Named Entity Recognition and Transliteration N. Calzolari, M. Monachini, C. Soria, C. Huang, X. in Bengali, Named Entities: Recognition, YingJu, Y. Hao, L. Prevot and S. Kiyoaki. 2006. In- Classification and Use, Special Issue of Lingvisticae frastructure for Standardization of Asian Languages Investigationes Journal, 30:1 (2007), 95-114. Resources. In Proceedings of the COLING/ACL Ekbal, Asif, and S. Bandyopadhyay. 2007d. A Web- 2006, Sydney, 827-834. based Bengali News Corpus for Named Entity Rec-

8 The 6th Workshop on Asian Languae Resources, 2008

Gazetteer Preparation for Named Entity Recognition in Indian Languages

Sujan Kumar Saha Sudeshna Sarkar Pabitra Mitra Indian Institute of Technology Indian Institute of Technology Indian Institute of Technology Kharagpur, West Bengal Kharagpur, West Bengal Kharagpur, West Bengal India - 721302 India - 721302 India - 721302 [email protected] [email protected] [email protected]

Abstract languages there is no reasonable size publicly avail- able list. The web contains lots of such resources, This paper describes our approaches for the which can be used for Indian language NER devel- preparation of gazetteers for named entity opment. But most of the web resources are in En- recognition (NER) in Indian languages. We glish. Our approach is to transliterate the relevant have described two methodologies for the English resources and name dictionaries into Indian preparation of gazetteers1. Since the rel- languages to make them useful for Indian language evant gazetteer lists are more easily avail- NER task. But direct transliteration from English to able in English we have used a translitera- an Indian languages is not easy. Few attempts are tion based approach to convert available En- taken to build English to Indian language transliter- glish name lists to Indian languages. The ation systems but the word agreement ratio (WAR) second approach is a context pattern induc- reached is upto 69.3% (Ekbal et al., 2006). tion based domain specific gazetteer prepa- We have attempted to build a transliteration sys- ration. This approach uses a domain specific tem which uses an intermediate alphabet. Both raw corpus and a few seed entities to learn the English and the Indian language strings are context patterns and then the corresponding transliterated to the intermediate alphabet and for a name lists are generated by using bootstrap- English-Indian language pair, if the transliterated in- ping. termediate alphabet strings are same then we have concluded that the strings are the transliteration of 1 Introduction one another. We have transliterated the available En- Named entity recognition involves locating and clas- glish name lists into the intermediate alphabet and sifying the names in text. NER is an important task, these might be used as gazetteers. The Indian lan- having applications in information extraction (IE), guage words need to be transliterated to the inter- question answering (QA), machine translation and mediate format to check whether the word is in a in most other NLP applications. gazetteer or not. This system does not transliter- NER systems have been developed for resource- ate the English name lists into Indian languages but rich languages like English with very high accura- makes them useful in Indian languages NER task. cies. But constructing an NER for a resource-poor Transliteration based approaches are useful when language is very challenging due to unavailability of there is availability of English name lists. But when proper resources. Name-dictionaries or gazetteers relevant English name lists are not available then are very helpful NER resources and in most Indian also we can prepare gazetteers from raw corpus. We have defined a semi-automatic context pattern 1Specialized list of names for a particular class of Named Entity (NE). For Example, India is in the location gazetteer, (CP) extraction based gazetteer preparation frame- Sachin is in the person first name gazetteer. work. This framework uses bootstrapping to prepare

9 The 6th Workshop on Asian Languae Resources, 2008

the gazetteers from a large raw corpus starting from several gazetteers like organization names, location few seed entities. Firstly fixed length patterns are names, person names, human titles etc. formed using the surrounding words of the seeds. We will now mention some ML based sys- Depending on the pattern precision, the patterns are tems. MENE is a MaxEnt based system de- discarded or generalized by dropping tokens from veloped by Borthwick. This system has used 8 the patterns. This set of high precision patterns ex- dictionaries (Borthwick, 1999), which are: First tracts other named entities (NEs) which are added to names (1,245), Corporate names (10,300), Corpo- the seed list for the next iteration of the process. Fi- rate names without suffix (10,300), Colleges and nally we are able to prepare the required gazetteers. Universities (1,225), Corporate suffixes (244), Date To prove the effectiveness of the gazetteer prepa- and Time (51) etc. The italics numbers in bracket ration approach, we have prepared some gazetteers indicates the size of the dictionaries. The hy- like names of cricketers, names of tennis players etc. brid system developed by Srihari et al.(2000) com- from a raw Hindi sports domain corpus. The de- bines several modules built by using MaxEnt, HMM tails of the approaches are given in the following and handcrafted rules. This system uses the fol- sections. lowing gazetteers: First name (8,000), Family The paper is organized as follows. Usefulness name (14,000) and a big gazetteer of Locations of gazetteers in NER, transliteration approaches in (250,000). There are many other systems which general and specific for Indian languages and gen- have used name dictionaries to improve the accu- eral pattern extraction methodologies are discussed racy. Kozareva (2006) described a methodology in section 2. Section 3 presents the architecture of to generate gazetteer lists automatically for Spanish the 2-phase transliteration system and preparation of and to build NER system with labeled and unlabeled gazetteers using that. In section 4 context pattern data. The location gazetteer is built by finding loca- extraction based gazetteer preparation is discussed. tion patterns which looks for specific prepositions. Finally section 5 concludes the paper. And the person gazetteer is constructed with graph exploration algorithm. 2 Previous Work Transliteration is also a very important topic and lots of transliteration systems for different lan- The main approaches to NER are Linguistic ap- guages have been developed using different ap- proaches and Machine Learning (ML) based ap- proaches. The basic approaches for transliteration proaches. The linguistic approach typically uses are phoneme based or spelling-based. A phoneme- rule-based models manually written by linguists. based statistical transliteration system from Ara- ML based techniques make use of a large amount bic to English was developed by Knight and of annotated training data to acquire high-level lan- Graehl(1998). This system uses a finite state guage knowledge. Several ML techniques like Hid- transducer that implements transformation rules to den Markov Model (HMM)(Bikel et al., 1997), do back-transliteration. A spelling-based model Maximum Entropy Model(MaxEnt) (Borthwick, that directly maps English letter sequences into 1999), Conditional Random Field(CRF) (Li and Arabic letters was developed by Al-Onaizan and McCallum, 2004) etc. have been successfully used Knight(2002). Several transliteration systems ex- for the NER task. Both the approaches may make ist for English-Japanese, English-Chinese, English- use of gazetteer information to build systems. There Spanish and many other languages to English. But are many systems which use gazetteers to improve very few attempts have been reported on the de- the accuracy. velopment of transliteration systems between Indian Ralph Grishman has developed a rule-based NER languages and English. We can mention a transliter- system which uses some specialized name dictionar- ation system for Bengali-English transliteration de- ies including names of all countries, names of major veloped by Ekbal et al.(2006). They have proposed cities, names of companies, common first names etc different models modifying the joint source chan- (Grishman, 1995). Another rule based NER system nel model. In that system a Bengali string is di- is developed by Wakao et al. (1996) which has used vided into transliteration units containing a vowel

10 The 6th Workshop on Asian Languae Resources, 2008

modifier or matra at the end of each unit. Sim- List Web Sources ilarly the English string is also divided into units. http://www.bsnl.co.in/ onlinedi- Then they defined various unigram, bigram or tri- rectory.htm gram models depending on the consideration of the First http://web1.mtnl.net.in/ direc- contexts of the units. They have also considered lin- Name tory/ guistic knowledge in the form of possible conjuncts http://www.eci.gov.in/ and diphthongs in Bengali and their representations http://hiren.info/indian-baby- in English. This system is capable of transliterat- names/ ing mainly person names. The highest transliteration http://www.indiaexpress.com/ accuracy achieved by them is 69.3% Word Agree- specials/babynames/ ment Ratio (WAR) for Bengali to English and 67.9% http://surnamedirectory.com/ WAR for English to Bengali transliteration. surname-index.html Surname http://web1.mtnl.net.in/ direc- In the field of IE, patterns play a key role in tory/ identifying relevant pieces of information. Soder- http://en.wikipedia.org land et al.(1995), Rillof and Jones(1999), Lin et India Lo- http://indiavilas.com/ indi- al.(2003), Downey et al.(2004), Etzioni et al.(2005) cation ainfo/pincodes.asp described different approaches to context pattern in- http://www.indiapost.gov.in duction. Talukder et al.(2006) combined grammati- http://www.eci.gov.in/ cal and statistical techniques to create high precision World http://www.maxmind.com/ patterns specific for NE extraction. An approach to Location app/worldcities lexical pattern learning for Indian languages is de- http://en.wikipedia.org/wiki scribed by Ekbal and Bandopadhyay (2007). They used seed data and annotated corpus to find the pat- Table 1: Web sources for some relevant name lists terns for NER.

3.1 Transliteration 3 Transliteration based Gazetteer The transliteration from English to Hindi is quite Preparation difficult. English alphabet contains 26 characters whereas the Hindi alphabet contains 52 characters. So the mapping is not trivial. We have already men- Gazetteers or name dictionaries are helpful in NER. tioned that for Bengali a transliteration system was We have already discussed about some English NER developed by Ekbal et al. Similar approach can be systems where the usefulness of the gazetteers have used to develop transliteration systems for other In- been established. However while developing NER dian languages. But this approach uses a bilingual systems in Indian languages, we tried to find relevant transliteration corpus, which requires much efforts gazetteers. But we could not obtain openly available to built, is unavailable in proper size in all Indian gazetteer lists for these languages. But we found languages. Also using this approach the word agree- that there are a lot of resources of names of Indian ment ratio obtained is below 70%. persons, Indian places, organizations etc. in English To make the transliteration process easier and available in the web. In Table 1 we have mentioned more accurate, we have decided to build a 2-phase some of the sources which contains relevant name transliteration module. Our goal is to make deci- lists. sion that a particular Indian language string is in an But it is not possible to use the available name English gazetteer or not. We need not transliterate lists directly in the Indian language NER task as directly from Indian language strings to English or these are in English. We have decided to translit- English name lists into Indian languages. Our idea is erate the English lists into Indian languages to make to define an intermediate alphabet and both English them useful in the Indian language NER task. and Indian language strings will be transliterated to

11 The 6th Workshop on Asian Languae Resources, 2008

the intermediate alphabet. For two English-Hindi S = S − G. string pair, if the intermediate alphabet is same then Go to step 2. we can conclude that one string is the transliteration = − 1 of the other. 5. Else set n n . First of all we need to decide the size of the inter- Go to step 3. mediate alphabet. Preserving the phonetic proper- Here we can take an Indian language name, ties we have defined our intermediate alphabet con- ‘surabhi’, as example to explain the procedure in sisting of 34 characters. To indicate these 34 charac- details. When the name is written in English, it ters, we have given unique character-id to each char- may be written in several ways like ‘suravi, ‘shu- acter. ravi’, ‘surabhee’, ‘shurabhi’ etc. The English string 3.2 English to Intermediate Alphabet ‘surabhi’ is transliterated to ‘ˆsˆuˆrˆaˆvˆı’ by the translit- Transliteration erator. Again if we see the transliteration for ‘shu- ravi’, then also the intermediate transliterated string For transliterating English strings into the interme- is same as the previous one. diate state, we have built a phonetic map table. This phonetic map table maps an English n-gram into an 3.3 Indian Language to Intermediate Alphabet intermediate character. A part of the map table is Transliteration given in Table 2. In the map table, the mapping is This is a 2-phase process. The first phase transliter- from strings of varying length in the English to one ates the Indian language string into itrans. Itrans is character in the intermediate alphabet. In our table representation of Indian language alphabets in terms the length of the left hand side varies from 1 to 3. of ASCII. Since Indian text is composed of syllabic English Intermediate units rather than individual alphabetic letters, itrans uses combinations of two or more letters of En- A ˆa glish alphabet to represent an Indian language syl- EE, I, II ˆı lable. However, there being multiple sounds in In- OO, ˆu dian languages corresponding to the same English B,W bˆ letter, not all Indian syllables can be represented by BH,V ˆv logical combinations of English alphabet. Hence, CH ˆc itrans uses some non-alphabetic special characters R,RH ˆr also in some of the syllables. A map table2, with SH,S ˆs some heuristic knowledge, is used for the translit- Table 2: A part of the map table eration. For example, the Hindi word ‘surabhi’ is converted ‘sUrabhI’ in itrans. In the following we have described the procedure In the second phase the itrans string is translit- of transliteration. erated into the intermediate state using the similar procedure described section 3.2. Here also we use Procedure 1: Transliteration a map-table containing the mappings from itrans to Source string - English, Output string - Intermediate. intermediate alphabet. This procedure transliterates 1. Scan the source string (S) from left to right. the example itrans word ‘sUrabhI’ to ‘ˆsˆuˆrˆaˆvˆı’.

2. Extract the first n-gram (G) from the string. 3.4 Evaluation (n = 3) In section 3.2 and 3.3 we have described two phase transliteration with an example word. We have 3. Find if it is in the map-table. shown that our transliteration system transliterates the Indian language name ‘surabhi’ and the corre- 4. If yes, insert its corresponding intermediate sponding English strings into the same intermediate state entity into target string T. Remove the n-gram from S. 2The map table is available at www.aczoom.com/itrans.

12 The 6th Workshop on Asian Languae Resources, 2008

string. The system has limitations like sometimes igrams (e.g., kolakAtA3 (Kolkata), bihAra (Bihar)), two different strings can be mapped into a same in- bigrams (e.g., nayI dillI (New Delhi), pashchima termediate alphabet string. bA.nglA (West Bengal)) and trigrams (e.g. uttaara For the evaluation of the system we have applied chabisha paraganA (North 24 Pargana)). The words the transliteration system for two languages - Hindi are matched with unigrams, sequences of two con- and Bengali. For evaluating the system for Hindi secutive words are matched against bigrams and we have created a bi-lingual corpus containing 1070 sequences of three consecutive words are matched English-Hindi word pair most of which are names. against trigrams. 980 of them are transliterated correctly by the sys- World Location: The list contains the names of tem. The system accuracy is 980 × 100/1070 = the countries, different state and city names in world 91.59%. For evaluating the system for Bengali, we and also the names of important rivers, mountains have used a similar bi-lingual corpus and the system etc. The list contains about 4,000 location names. transliterates with 89.3% accuracy. Similar to the Indian location list, this list also needs to be processed as unigram, bigrams and trigrams. 3.5 Prepared Gazetteer Lists Previously we have mentioned the web sources 4 Context Pattern Extraction based where some name lists are available. Names of Gazetteer Preparation a particular category are collected from different Gazetteers can also be prepared by extracting con- sources and merged to build a English name list of text patterns. Transliteration based gazetteer prepa- that category. Then we have applied our transliter- ration is applicable while there is availability of En- ation procedure on the list and transliterated the list glish or parallel language name list. But if such into the intermediate alphabet. This intermediate al- relevant name lists are not available, but a large phabet list acts as a gazetteer in NER task in Indian raw corpus is available, then we can use the con- languages. When an Indian language NER system text pattern extraction based methodology to prepare needs to access the gazetteer lists, it transliterates the the gazetteers. This method seeks some high pre- Indian language strings into the intermediate alpha- cision context patterns by using some seed entities bet, and searches into the list. In the following we and hits the patterns to the raw corpus to prepare the have described the prepared gazetteer lists which are gazetteers. useful for a general domain Indian language NER The overall methodology of extracting context system. patterns from a raw corpus is summarized as fol- First Name List: This list contains 10,200 first lows: names collected from the web. Most of the collected first names are of Indian origin. Apart from the In- 1. Find a large raw corpus and some seed entities dian names, we have also collected some non-Indian (E) for each class of NEs. names. These non-Indian names are generally the 2. For each seed entity e in E, from the corpus names of some famous persons, like sports stars, find context string(C) comprised of n tokens film stars, scientists, politicians, who are likely to before e, a placeholder for the class instance come in Indian language texts. In our first name list and n tokens after e. [We have used n = 3] 1500 such names are included. This set of words form initial pattern. Surname List: This is a very important list which contains common surnames. We have prepared the 3. Search the pattern to the corpus and find the surname list from different sources containing about coverage and precision. 1500 Indian surnames and 400 other surnames. 4. Discard the patterns having low precision. Indian Locations: This list contains about 14,000 entities. The names of states, cities and 5. Generalize the patterns by dropping one or towns, districts, important places in different cities more tokens to increase coverage. and even lots of village names are collected in the 3The Indian languages strings are written in italics font and list. The list needs to be processed into a list of un- using itrans transliteration.

13 The 6th Workshop on Asian Languae Resources, 2008

6. Find best patterns having good precision and For the seed sachina tedulkara (Sachin Ten- coverage. dulkar) we extract 92 initial patterns. Some of which are: The details of the context pattern extraction based gazetteer preparation methodology is de- • mAsTara blAsTara CRIC ko Tima scribed in the following subsections. We have taken • dravi.Da aura CRIC 241 nAbAda ke a Hindi sports domain raw corpus and prepared some gazetteers like names of cricketers, names of • bhAratiya ballebAja CRIC ne 100 rana tennis players to prove the effectiveness of the pro- posed methodology. • mere vichAra CRIC ko Takkara dene

4.1 Selection of Seed Entity 4.3 Pattern Quality Measure Context pattern extraction based gazetteer prepa- We measured the quality of a pattern depending on ration methodology is applied to a sports domain its precision and coverage. Precision is the ratio of corpus which contains about 20 lakhs words col- correct identification and the total identification. If lected from the popular Hindi newspaper “Dainik the precision is high then also we have assumed that Jagaran”. We have worked on preparing the lists the pattern is a good pattern. In our development of cricket players, list of tennis players. We have we have marked a pattern as good if the precision is collected the most frequent names to build the seed 100%. list. For preparing the list of tennis players, we We search the initial patterns in the correspond- have taken 5 names as seed entities : Andre Agassi, ing ‘part’ corpus to measure the precision and cov- Steffi Graf, Serena Williams, Roger Federer and Jus- erage. If the precision is less than 100% for a pat- tine Henin. Similarly the seed list of cricket players tern then we have rejected the pattern. Otherwise we name list contains only 3 names: Sachin Tendulkar, have tried to make it more generalized to increase Brian Lara and Glenn McGrath. the coverage. To make the generalization we have dropped the left most and right most tokens one by 4.2 Context Extraction one and checked the pattern quality. If for a initial To extract the patterns for a particular category, we pattern, several patterns presents with 100% preci- select a part of the corpus where the target seeds will sion then we have selected those patterns for which be available with high frequency. For example to no subset of those is a good pattern. By this way we get the patterns for the names of the cricketers, we have prepared a list of good patterns for a particular select a part of the corpus where most of the sen- gazetteer type. tences are cricket related. To select the cricket re- In time of ‘good’ pattern selection we have made lated sentences, we prepared a list containing the some interesting observations. most frequent words related to cricket like, rAna • There are some patterns which satisfy the 100% (run), ballebAja (batsman), gedabAja (bowler) etc. precision criteria but the coverage is very poor Depending upon the presence of such words we have in terms of new entity extraction. For exam- selected the ‘part’. In our development the cricket ple, “mAsTara blAsTara CRIC ko” is a pattern ‘part’ contains 120K words. Similar ‘part’ is devel- with 100% precision. The pattern has 24 in- oped for other gazetteers. For a particular seed, we stances in the ‘part’ corpus, but all the extracted find the occurrences of the seed entity in the corre- entities are ‘sachina tedulkara’. We have also sponding raw ‘part’ corpus. Then we extracted three examined the pattern in the total raw corpus. tokens immediately preceding the seed and three to- It is capable of extracting ‘sachina tedulkara’ kens immediately following the seed. A placeholder only. So in spite of fulfilling all the criteria the (CRIC for cricketers, TENS for tennis players) pattern is not a ‘good’ pattern. replaces the seed. The placeholder and the surround- ing tokens t−3 t−2 t−1 placeholder t+1 t+2 t+3) • Another interesting observation is, that there form the initial set of patterns. are some patterns which are ‘good’ patterns in

14 The 6th Workshop on Asian Languae Resources, 2008

the context of the ‘part’ corpus, but when used entities we become able to prepare a gazetteer which in the total raw corpus, it extracts non-relevant contains 412 names of the cricketers. Even using entities. For example “mere vichAra se CRIC this approach only one seed ‘Sachin Tendualkar’ ex- ko Takkara” is a ‘good’ pattern so it should ex- tracts 297 names after the second iteration. Similarly tract the names of the cricketers. But when we have collected 245 names of tennis players from this is used in the total raw corpus it extracts 5 seed entities. non-cricketer entities (e.g. tennis players, chess players) also. To make such patterns useful we 5 Conclusion have extracted all the cricket related sentences In this paper we have described our approaches for in the similar way which was used for selecting the preparation of gazetteers. We have also prepared the ‘part’ corpus and then the patterns are used some gazetteers using both the approaches to show to extract entities from these sentences. their effectiveness. These approaches are very use- • There present are patterns with very high cover- ful for the NER task in resource-poor languages and age but precision is just below 100%. We have also in domain specific NER task. analyzed these patterns and manually identified the wrongly extracted entities. If the wrong en- References tities can be grouped together and are having some specific properties then we have added Al-Onaizan Y. and Knight K. 2002. Machine Translit- eration of Names in Arabic Text. Proceedings of these entities in a ‘pattern exception list’. Then the ACL Workshop on Computational Approaches to the pattern is used as a ‘good’ pattern and the Semitic Languages. exception list is used to detect the wrong iden- Bikel Daniel M., Miller Scott, Schwartz Richard and tifications. Weischedel Ralph. 1997. Nymble: A high perfor- mance learning name-finder. In Proceedings of the In the following we have given some example of Fifth Conference on Applied Natural Language Pro- ‘good’ patterns which are useful in identification of cessing, peges 194–201. the names of the cricket players. Borthwick Andrew. 1999. A Maximum Entropy Ap- • ballebAja CRIC ko Tima ke proach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University. • ballebAja CRIC ne Downey D., Etzioni O., Soderland S., Weld D.S. 2004. • CRIC arddhashataka Learning text patterns for Web information extraction and assessment. In AAAI-04 Workshop on Adaptive 4.4 Gazetteer Preparation Text Extraction and Mining, pages 50–55. The extracted ‘good’ patterns are capable of iden- Ekbal A. and Bandyopadhyay S. 2007. Lexical Pattern tifying NEs from a raw corpus. These patterns are Learning from Corpus Data for Named Entity Recog- nition. In Proceedings of International Conference on then used to prepare the gazetteers. The seeds form Natural Language Processing (ICON), 2007. the initial gazetteer list for a particular gazetteer type. The ‘good’ patterns are used to extract entities Ekbal A., Naskar S. and Bandyopadhyay S. 2006. A from the total raw corpus. The entities identified by Modified Joint Source Channel Model for Translitera- tion. In Proceedings of the COLING/ACL 2006, Aus- the patterns are added to the corresponding gazetteer tralia, pages 191–198. list. In that way we can add more entities in our first phase gazetteer list. These new entities are taken as Etzioni Oren, Cafarella Michael, Downey Doug, Popescu Ana-Maria, Shaked Tal, Soderland Stephen, Weld seeds for the next phase. Then the same procedure Daniel S. and Yates Alexander. 2005. Unsupervised is followed repeatedly to develop a large gazetteer. named-entity extraction from the Web: An experimen- We have already mentioned that we have worked tal study. In Artificial Intelligence, 165(1): 91-134. with a sports domain corpus and prepared some Grishman Ralph. 1995. The New York University Sys- gazetteers. This gazetteers are prepared just to prove tem MUC-6 or Where’s the syntax? In Proceedings of the efficiency of our approach. By using only 3 seed the Sixth Message Understanding Conference.

15 The 6th Workshop on Asian Languae Resources, 2008

Knight K. and Graehl J. 1998. Machine Transliteration. Computational Linguistics, 24(4): 599–612. Kozareva Zornitsa. 2006. Bootstrapping Named Entity Recognition with Automatically Generated Gazetteer Lists. In Proceedings of EACL student session (EACL 2006). Li Wei and McCallum Andrew. 2004. Rapid Develop- ment of Hindi Named Entity Recognition using Condi- tional Random Fields and Feature Induction (Short Pa- per). In ACM Transactions on Computational Logic. Lin Winston, Yangarber Roman and Grishman Ralph. 2003. Bootstrapped learning of semantic classes from positive and negative examples. In Proceedings of ICML-2003 Workshop on The Continuum from La- beled to Unlabeled Data. Riloff E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Articial Intelli- gence, pages 1044–1049. Srihari R., Niu C. and Li W. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceed- ings of the sixth conference on Applied natural lan- guage processing. Soderland Stephen, Fisher David, Aseltine Jonathan, Lehnert Wendy. 1995. CRYSTAL: Inducing a Con- ceptual Dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelli- gence. Talukdar P. Pratim, T. Brants, M. Liberman and F. Pereira. 2006. A context pattern induction method for named entity extraction. In Proceedings of the Tenth Conference on Computational Natural Lan- guage Learning (CoNLL-X). Wakao T., Gaizauskas R. and Wilks Y. 1996. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of COLING-96.

16 The 6th Workshop on Asian Languae Resources, 2008

Preliminary Chinese Term Classification for Ontology Construction

Gaoying Cui, Qin Lu, Wenjie Li Department of Computing, Hong Kong Polytechnic University {csgycui, csluqin, cswjli}@comp.polyu.edu.hk

Abstract 1 Introduction Ontology construction has two main issues An ontology can be seen as a representa- including the acquisition of domain concepts and tion of concepts in a specific domain. Ac- the acquisition of appropriate taxonomies of these cordingly, ontology construction can be re- concepts. These concepts are labeled by the terms garded as the process of organizing these used in the domain which are described by concepts. If the terms which are used to la- different attributes. Since domain specific terms bel the concepts are classified before build- (terminology) are labels of concepts among other ing an ontology, the work of ontology con- things, terminology extraction is the first and the struction can proceed much more easily. foremost important step of domain concept Part-of-speech (PoS) tags usually carry acquisition. Most of the existing algorithms in some linguistic information of terms, so Chinese terminology extraction only produce a list PoS tagging can be seen as a kind of pre- of terms without much linguistic information or liminary classification to help constructing classification information (Yun Li and Qiangjun concept nodes in ontology because features Wang, 2001; Yan He et al., 2006; Feng Zhang et or attributes related to concepts of different al., 2006). This fact makes it difficult in ontology PoS types may be different. This paper pre- construction as the fundamental features of these sents a simple approach to tag domain terms are missing. The acquisition of taxonomies is terms for the convenience of ontology con- in fact the process of organizing domain specific struction, referred to as Term PoS (TPoS) concepts. These concepts in an ontology should be Tagging. The proposed approach makes defined using a subclass hierarchy by assigning use of segmentation and tagging results and defining properties and by defining from a general PoS tagging software to pre- relationship between concepts etc. (Van Rees, R., dict tags for extracted domain specific 2003). These methods are all concept descriptions. terms. This approach needs no training and The linguistic information associated with domain no context information. The experimental terms such as PoS tags and semantic classification results show that the proposed approach information of terms can also make up for the achieves a precision of 95.41% for ex- concept related features which are associated with tracted terms and can be easily applied to concept labels. Terms with different PoS tags different domains. Comparing with some usually carry different semantic information. For existing approaches, our approach shows example, a noun is usually a word naming a thing that for some specific tasks, simple method or an object. A verb is usually a word denoting an can obtain very good performance and is action, occurrence or state of existence, which are thus a better choice. all associated with a time and a place. Thus in ontology construction, noun nodes and verb nodes Keywords: ontology construction, part-of- should be described using different attributes with speech (PoS) tagging, Term PoS (TPoS) different discriminating characters. With this tagging. information, extracted terms can then be classified

17 The 6th Workshop on Asian Languae Resources, 2008

accordingly to help in ontology construction and information from the words themselves. A number retrieval work. Thus PoS tags can help identify the of features based on prefixes and suffixes and different features needed for concept representation spelling cues like capitalization were adopted in in domain ontology construction. these researches (Mikheev, A, 1997; Brants, Thor- sten, 2000; Mikheev, A, 1996). Mikheev presented It should be pointed out that Term PoS (TPoS) a technique for automatically acquiring rules to tagging is different from the general PoS tagging guess possible POS tags for unknown words using tasks. It is designed to do PoS tagging for a given their starting and ending segments (Mikheev, A, list of terms extracted from some terminology ex- 1997). Through an unsupervised process of rule traction algorithms such as those presented in acquisition, three complementary sets of word- (Luning Ji et al., 2007). The granularity of general guessing rules would be induced from a general PoS tagging is smaller than what is targeted in this purpose lexicon and a raw corpus: prefix morpho- paper because terms representing domain specific logical rules, suffix morphological rules and end- concepts are more likely to be compound words ing-guessing rules (Mikheev, A, 1996). Brants and sometimes even phrases, such as “文件管理 used the linear interpolation of fixed length suffix 器 ”(file manager), “ 并发描述”(description of model for word handling in his POS tagger, named concurrency), etc.. Even though current general TnT. For example, an English word ending in the word segmentation and PoS tagging can achieve suffix –able was very likely to be an adjective precision of 99.6% and 97.58%, respectively (Brants, Thorsten, 2000). (Huaping Zhang et al., 2003), its performance for domain specific corpus is much less satisfactory Some existing methods are based on the analysis (Luning Ji et al., 2007), which is why terminology of word morphology. They exploited more features extraction algorithms need to be developed. besides morphology or took morphology as sup- In this paper, a very simple but effective method plementary means (Toutanova et al., 2003; Huihsin is proposed for TPoS tagging which needs no train- Tseng et al., 2005; Samuelsson, Christer, 1993). ing process or even context information. This Toutanova et al. demonstrated the use of both pre- method is based on the assumption that every term ceding and following tag contexts via a depend- has a headword. For a given list of domain specific ency network representation and made use of some terms which are segmented and each word in the additional features such as lexical features includ- term already has a PoS tag, the TPoS tagging algo- ing jointly conditioning on multiple consecutive rithm then identifies the position of the headword words and other fine-grained modeling of word and take the tag of the headword as the tag of the features (Toutanova et al., 2003). Huihisin et al. term. Experiments show that this method is quite proposed a variety of morphological word features, effective in giving good precision and minimal such as the tag sequence features from both left computing time. and right side of the current word for POS tagging The remaining of this paper is organized as fol- and implemented them in a Maximum Entropy lows. Section 2 reviews the related work. Section 3 Markov model (Huihsin Tseng et al., 2005). gives the observations to the task and correspond- Samuelsson used n-grams of letter sequences end- ing corpus, then presents our method for TPOS ing and starting each word as word features. The tagging. Section 4 gives the evaluation details and main goal of using Bayesian inference was to in- discussions on the proposed method and reference vestigate the influence of various information methods. Section 5 concludes this paper. sources, and ways of combining them, on the abil- ity to assign lexical categories to words. The 2 Related Work Bayesian inference was used to find the tag as- signment T with highest probability P(T|M, S) Although TPoS tagging is different from general given morphology M (word form) and syntactic PoS tagging, the general POS tagging methods are context S (neighboring tags) (Samuelsson, Christer, worthy of referencing. There are a lot of existing 1993). POS tagging researches which can be classified into following categories in general. Natural ideas Other researchers inclined to regard this POS of solving this problem were to make use of the tagging work as a multi-class classification prob- lem. Many methods used in machine learning, such

18 The 6th Workshop on Asian Languae Resources, 2008

as Decision Tree, Support Vector Machines (SVM) principle, the context information of terms can help and k-Nearest-Neighbors (k-NN), were used for TPoS tagging and the individual PoS tags may be guessing possible POS tags of words (G. Orphanos good choices as classification features. and D. Christodoulakis, 1999; Nakagawa T, 2001; The proposed TPoS tagging algorithm consists Maosong Sun et al., 2000). Orphanos and Christo- of two modules. The first module is a terminology doulakis presented a POS tagger for Modern Greek extraction preprocessing module. The second and focused on a data-driven approach for the in- module carries out the TPoS tag assignment. In the duction of decision trees used as disambiguation or terminology extraction module, if the result of ter- guessing devices (G. Orphanos and D. Christodou- minology extraction algorithm is a list of terms lakis, 1999). The system was based on a high- without PoS tags, a general purpose segmenter coverage lexicon and a tagged corpus capable of called ICTCLAS1 will be used to give PoS tags to showing off the behavior of all POS ambiguity all individual words. ICTCLAS is developed by schemes and characteristics of words. Support Chinese Academy of Science, the precision of Vector Machine is a widely used (or effective) which is 97.58% on tagging general words classification approach for solving two-class pat- (Huaping Zhang et al., 2003). Then the output of tern recognition problems. Selecting appropriate this module is a list of terms, referred to as Term- features and training effective classifiers are the List, using algorithms such as the method de- main points of SVM method. Nakagawa et al. used scribed in (Luning Ji et al., 2007). substrings and surrounding context as features and achieve high accuracy in POS tag prediction (- In this paper, two simple schemes for the term kagawa T, 2001). Furthermore, Sun et al presented PoS tag assignment module are proposed. The first a POS identification algorithm based on k-nearest- scheme is called the blind assignment scheme. It neighbors (k-NN) strategy for Chinese word POS simply assigns the noun tag to every term in the tagging. With the auxiliary information such as TermList. This is based on the assumption that existing tagged lexicon, the algorithm can find out most of the terms in a specific domain represent k nearest words which were mostly similar with the certain concepts that are most likely to be nouns. word need tagging (Maosong Sun et al., 2000). Result from this blind assignment scheme can be considered as the baseline or the worse case sce- 3 Algorithm Design nario. Even in general domain, it is observed that nouns are in the majority of Chinese words with As pointed out earlier, TPoS tagging is different more than 50% among all different PoS tags (Hui from the general PoS tagging tasks. In this paper, it Wang, 2006). is assumed that a terminology extraction algorithm has already obtained the PoS tags of individual The second scheme is called head-word-driven words. For example, in the segmented and tagged assignment scheme. Theoretically, it will take the sentence “计算机/n 图形/n 学/v 中/f ,/w 物体/n tag of the head word of one term as the tag of the 常常/d 用/v 多边形/a 网格/n 来/f 表示/v 。/w”(In whole term. But here it simply takes the tag of the last word in a term. This is based on the assump- computer graphics, objects are usually represented tion that each term has a headword which in most as polygonal meshes.), the term “多边形网格” cases is the last word in a term (Hui Wang, 2006). (polygonal meshes) has been segmented into two One additional experiment has been done to verify individual words and tagged as “ 多边形/a” this assumption. A manually annotated Chinese (polygonal /a) and “ 网格/n” (meshes /n). The shallow Treebank in general domain is used for the terminology extraction algorithm would identify statistic work (Ruifeng Xu et al., 2005). There are these two words “多边形/a” and “网格/n” as a 9 different structures of Chinese phrases, (Yunfang single term in a specific domain. The proposed et al., 2003), but only 3 of them do not have algorithm is to determine the PoS of this single their head words in the tail, which are about 6.56% term “多边形网格” (polygonal meshes), thus the from all phrases. Following the examples earlier, algorithm is referred to as TPoS tagging. It can be seen that the general purpose PoS tagging and term PoS tagging assign tags at different granularity. In 1 Copyright © Institute of Computing Technology, Chinese Academy of Sciences

19 The 6th Workshop on Asian Languae Resources, 2008

the term “多边形/a 网格/n” (polygonal /a meshes ment scheme. In first part of this experiment, all /n) will be assigned the tag “/n” because the last the 3,343 terms in TermList1 are tagged as nouns. word is labeled “/n”. The result shows that the precision of the blind assignment scheme is between 78.79% and 84.77%. There are a lot of semanteme tags at the end of a The reason for the range is that there are about 200 term. For example, “/ng” presents single character terms in TermList1 which can be considered either postfix of a noun. But it would be improper if a as nouns, gerunds, or even verbs without reference 决 term is tagged as “/ng”. For example, the term “ to context. For example, the term “局域网远程访 策器 ” (decision-making machine) contains two 问” (“remote access of local area network” or “re- segments as listed with two components “决策/n” mote access to local area network”) and the term and “器/ng”. It is obvious that “决策器/ng” is in- “极化” (polarization or polarize), can be consid- appropriate. Thus the head-word-driven assign- ered either as nouns if they are regarded as courses ment scheme also includes some rules to correct of events or as verbs if they refer to the actions for this kind of problems. As will be discussed in the completing certain work. The specific type is de- experiment, the current result of TPoS tagging is pendent on the context which is not provided with- based on 2 simple induction rules applied in this out the use of a corpus. However, the experiment algorithm. result does show that in a specific domain, there is a much higher percentage of terms that are nouns 4 Experiments and Discussions than other tags in general (Hui Wang, 2006). As to The domain corpus used in this work contains TermList3, the precision of blind assignment is 16 papers selected from different Chinese IT jour- between 65.57% and 70.77% (19 mixed ones). nals between 1998 and 2000 with over 1,500,000 TermList2 is the result of a terminology extraction numbers of characters. They cover topics in IT, algorithm and there are non-term items in the ex- such as electronics, software engineering, telecom, traction result, so the blind assignment scheme is and wireless communication. The same corpus is not applied on TermList2. The blue colored bars used by the terminology extraction algorithm de- (lighter color) in Figure 1 shows the result of veloped in (Luning Ji et al., 2007). In the domain TermList1 and TermList3 using the blind assign- of IT, two TermLists are used for the experiment. ment scheme which gives the two worst result TermList1 is a manually collected and verified compared to our proposed approach to be dis- term list from the selected corpus containing a total cussed in Section 4.2 of 3,343 terms. TermList1 is also referred to as the 4.2 Experiments on the Head-Word-Driven standard answer set to the corpus for evaluation Assignment Scheme purposes. TermList2 is produced by running the terminology extraction algorithm in (Van Rees, R, The experiment in this section was designed to 2003). TermList2 contains 2,660 items out of validate the proposed head-word-driven assign- which 929 of them are verified as terminology and ment scheme. The same experiment is conducted 1,731 items are not considered terminology ac- on the three term lists respectively, as shown in cording to the standard answer above. Figure 1 in purple color (darker color). The preci- To verify the validity of the proposed method to sion for assigning TPoS tags to TermList1 is different domains, a term list containing 366 legal 93.45%. By taking the result from a terminology terms obtained from Google searching results for extraction algorithm without regards to its potential “ 法律术语大全”(complete dictionary of legal error propagations, the precision of the head-word- terms) is selected for comparison, which is named driven assignment scheme for TermList2 is TermList3. 94.32%. For TermList3, the precision of PoS tag assignment is 90.71%. By comparing to the blind 4.1 Experiment on the Blind Assignment assignment scheme, this algorithm has reasonably Scheme good performance for all three term list with The first experiment is designed to examine the precision of over 90%. It also gives 8.7% and proportion of nouns in TermList1 and TermList3, 19.9% improvement for TermList1 and TermList3, to validate of the assumption of the blind assign- respectively, as compared to the blind assignment

20 The 6th Workshop on Asian Languae Resources, 2008

scheme, a reasonably good improvement without a bunal) and “数据/n 集/vg” (data set) will be tagged heavy cost. However, there are some abnormali- as “知识产权庭/ng” and “数据集/vg”, respec- ties in these results. Supposedly, TermList1 is a tively, which are obviously wrong. So, the simple hand verified term list in the IT domain and thus its rule of “tagging terms with “/ng” and “/vg” to “/n” result should have less noise and thus should per- is applied. The performance of TPoS tag assign- form better than TermList2, which is not the case ment after applying these two fine tuning induction as shown in Figure 1. rules are shown in Table 1 below. Figure 1 Performance of the Two Assignment Table 1 Influence of Induction Rules on Different Schemes on the Three Term Lists Term Lists 100.00% Precision Improve- Term Precision after add- 80.00% ment Lists of tagging ing induc- blind assignment Percentage 60.00% tion rule head-driven TermList1 93.45% 97.03% 3.83%

Precision 40.00% assignment 20.00% TermList2 94.32% 95.41% 1.16%

0.00% TermList3 90.71% 93.99% 3.62% T1 T2 T3 Term Lists It is obvious that with the use of fine tuning us- ing induction rules, the results are much better. In By further analyzing the error result, for exam- fact the result for TermList1 reached 97.03% ple for TermList1, among these 3,343 terms, about which is quite close to PoS tagging of general do- 219 were given improper tags, such as the term “图 main data. The abnormality also disappeared as the 形学” (Graphics). In this example, two individual performance of TermList1 has the best result. The words, “图形/n” and “学/v”, form a term. So the improvement to TermList2 (1.16%) is not as obvi- output was “图形学/v” for taking the tag of the last ous as that for TermList1 and TermList3, which segment. It was a wrong tag because the whole are 3.83% and 3.62%, respectively. This, however, term was a noun. In fact, the error is caused by the is reasonable as TermList2 is produced directly general word PoS tagging algorithm because with- from a terminology extraction algorithm using a out context, the most likely tagging of “学”, a se- corpus, thus, the results are noisier. manteme, is a verb. This kind of errors in se- Further analysis is then conducted on the result manteme tagging appeared in the results of all of TermList2 to examine the influence of non-term three term lists with 169 from TermList1, 29 from items to this term list. The non-term items are TermList2 and 12 from TermList3, respectively. items that are general words or items cannot be This was a kind of errors which can be corrected considered as terminology according to the stan- by applying some simple induction rules. For ex- dard answer sheet. For example, neither of the ample, for all semantemes with multiple tags (in- terms “问题” (problem) and “模式训练是” (pat- cluding noun as in the example), the rule can be tern training is) were considered as terms because “tagging terms with noun suffixes as nouns”. For the former was a general word, and the latter example, terms “劳改/n 场/q” (reform-through- should be considered as a fragment rather than a labor camp) and “计算机/n 图形/n 学/v” (com- word. In fact, in 2,660 items extracted by the algo- puter graphics) were given different tags using the rithm as terminology, only 929 of them are indeed head-word-driven assignment scheme. They were terminology (34.92%), and rest of them do not assigned as: “劳改场/q” and “计算机图形学/v” qualify as domain specific terms. The result of this which can be corrected by this rule. Another kind analysis is listed in Table 2. of mistake is related to the suffix tags such as “/ng” (noun suffix) and “/vg”(verb suffix). For examples, “知识/n 产权/n 庭/ng” (intellectual property tri-

21 The 6th Workshop on Asian Languae Resources, 2008

Table 2 Data Distribution Analysis on TermList2 posed method in this paper is comparable to these Without Induction Induction Rules general PoS tagging algorithms in magnitude. Of Rules Applied course, one main reason of this fact is the differ-

correct correct ence in its objectives. The proposed method is for precision precision terms terms the PoS tagging of domain specific terms which Terms have much less ambiguity than tagging of general 879 94.62% 898 96.66% (929) text. Domain specific terms are more likely to be Non-terms 1,630 94.17% 1,640 94.74% nouns and there are some rules in the word- (1,731) formation patterns while general PoS tagging algo- Total 2,509 94.32% 2,538 95.41% rithms usually need training process in which large (2,660) manually labeled corpora would be involved. Ex- Results show that 31 and 50 from the 929 cor- periment results also show that this simple method rect terms were assigned improper PoS tags using can be applied to data in different domains. the proposed algorithm with and without the induc- tions rules, respectively. That is, the precisions for 5 Conclusion and Future Work correct data are comparable to that of TermList1 In this paper, a simple but effective method for (93.45% and 97.03%, respectively). For the non- assigning PoS tags to domain specific terms was terms, 91 items and 101 items from 1,731 items presented. This is a preliminary classification work were assigned improper tags with and without the on terms. It needs no training process and not even induction rules, respectively. Even though the pre- context information. Yet it obtains a relatively cisions for terms and non-terms without using the good result. The method itself is not domain de- induction rules are quite the same (94.62% vs. pendent, thus it is applicable to different domains. 94.17%), the improvement for the non-terms using Results show that in certain applications, a simple the induction rules are much less impressive than method may be more effective under similar cir- that for the terms. This is the reason for the rela- cumstances. The algorithm can still be investigated tively less impressed performance of induction over the use of more induction rules. Some context rules for TermList2. It is interesting to know that, information, statistics of word/tag usage can also even though the performance of the terminology be explored. extraction algorithm is quite poor with precision of only around 35% (929 out of 2,666 terms), it does Acknowledgments not affect too much on the performance of the This project is partially supported by CERG TPoS proposed in this paper. This is mainly be- grants (PolyU 5190/04E and PolyU 5225/05E) and cause the items extracted are still legitimate words, B-Q941 (Acquisition of New Domain Specific compounds, or phrases which are not necessarily Concepts and Ontology Update). domain specific. The proposed algorithm in this paper use mini- References mum resources. They need no training process and Yun Li, Qiangjun Wang. 2001. Automatic Term Extrac- even no context information. But the performance tion in the Field of Information Technology. In the of the proposed algorithm is still quite good and proceedings of The Conference of 20th Anniversary can be directly used as a preparation work for do- for Chinese Information Processing Society of China. main ontology construction because of its presion Yan He, Zhifang Sui, Huiming Duan, and Shiwen . of over 95%. Other PoS tagging algorithms reach 2006. Term Mining Combining Term Component good performance in processing general words. Bank. In Computer Engineering and Applications. For example, a k-nearest-neighbors strategy to Vol.42 No.33,4--7. identify possible PoS tags for Chinese words can Feng Zhang, Xiaozhong Fan, and Yun Xu. 2006. Chi- reach 90.25% for general word PoS tagging nese Term Extraction Based on PAT Tree. Journal of (Maosong Sun et al., 2000). Another method based Beijing Institute of Technology. Vol. 15, No. 2. on SVM method on English corpus can reach 96.9% in PoS tagging known and unknown words Van Rees, R. 2003. Clarity in the Usage of the Terms (Nakagawa T, 2001). These results show that pro- Ontology, Taxonomy and Classification. CIB73.

22 The 6th Workshop on Asian Languae Resources, 2008

Mikheev, A. 1997. Automatic Rule Induction. for Un- shop affiliated with 41th ACL, 184--187. Sapporo known Word Guessing. In Computational Lingusitics Japan. Vol. 23(3), ACL. Hui Wang. Last checked: 2007-08-04. Statistical studies Toutanova, Kristina, Dan Klein, Christopher Manning, on Chinese vocabulary ( 汉语词汇统计研究). and Yoram Singer. 2003. Feature-Rich Part-of- http://www.huayuqiao.org/articles/wanghui/wanghui Speech Tagging with a Cyclic Dependency Network. 06.doc. The date of publication is unknown from the In proceedings of HLT-NAACL. online source. Huihsin Tseng, Daniel Jurafsky, and Christopher Man- Ruifeng Xu, Qin Lu, Yin Li and Wanyin Li. 2005. The ning. 2005.Morphological Features Help POS Tag- Design and Construction of the PolyU Shallow Tree- ging of Unknown Words across Language Varieties. bank. International Journal of Computational Lin- In proceedings of the Fourth SIGHAN Workshop on guistics and Chinese Language Processing, V.10 N.3. Chinese Language Processing. Yunfang Wu, Baobao Chang and Weidong Zhan. 2003. Samuelsson, Christer. 1993. Morphological Tagging Building Chinese-English Bilingual Phrase Database. Based Entirely on Bayesian Inference. In proceedings Page 41-45, Vol. 4. of NCCL 9. Brants, Thorsten. 2000. TnT: A Statistical Part-of- Speech Tagger. In proceedings of ANLP 6. G. Orphanos, and D. Christodoulakis. 1999. POS Dis- ambiguation and Unknown Word Guessing with De- cision Trees. In proceedings of EACL’99, 134--141. H Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In proceedings of International Conference on New Methods in Language Processing. Maosong Sun, Dayang Shen, and Changning Huang. 1997. CSeg& Tag1.0: a practical word segmenter and POS tagger for Chinese texts. In proceedings of the fifth conference on applied natural language processing. Ying Liu. 2002. Analysing Chinese with Rule-based Method Combined with Statistic-based Method. In Computer Engineering and Applications, Vol.7. Mikheev, A. 1996. Unsupervised Learning of Word- Category Guessing Rules. In proceedings of ACL-96. Nakagawa T, Kudoh T, and Matsumoto Y. 2001. Un- known Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. In proceedings of NLPPRS 6, 325--331. Maosong Sun, Zhengping Zuo, and B K, TSOU. 2000. Part-of-Speech Identification for Unknown Chinese Words Based on K-Nearest-Neighbors Strategy. In Chinese Journal of Computers. Vol.23 No.2: 166-- 170. Luning Ji, Mantai Sum, Qin Lu, Wenjie Li, Yirong Chen. 2007. Chinese Terminology Extraction using Window-based Contextual Information. In proceed- ings of CICLING. Huaping Zhang et al. 2003. HHMM-based Chinese Lexical Analyzer ICTCLAS. Second SIGHAN work-

23 The 6th Workshop on Asian Languae Resources, 2008

24 The 6th Workshop on Asian Language Resources, 2008

Technical Terminology in Asian Languages: Different Approaches to Adopting Engineering Terms

Makiko Matsuda Tomoe Takahashi Nagaoka University of Technology Nagaoka University of Technology 1603-1 Kamitomioka, Nagaoka, Japan 1603-1 Kamitomioka, Nagaoka, Japan [email protected] [email protected] Hiroki Goto Yoshikazu Hayase University of Toyama Toyama National College of Maritime Technology Gofuku, Toyama, Japan 1-2 Ebie-Neriya, Imizu , Japan [email protected] [email protected] Robin Lee Nagano Yoshiki Mikami University of Miskolc Nagaoka University of Technology Miskolc-Egyetemváros, Hungary 1603-1 Kamitomioka, Nagaoka, Japan [email protected] [email protected]

Abstract 1 Introduction Terminology development in education, 1.1 Terminology in Development Context science and technology is a key to formu- For developing countries there is a strong necessity lating a knowledge society. The authors are developing a multilingual engineering to establish an appropriate set of terminology to terminology dictionary consisting of more enable local people to learn science and technology in an efficient manner throughout their educational than ten thousand engineering terms each programs using their mother tongues as the instruc- for ten Asian languages and English. The tional medium. Instruction in the mother tongue is dictionary is primarily designed to support foreign students who are studying engi- beneficial for students to acquire a basic concept in neering subjects at Japanese higher educa- a subject (UNESCO, 2003). International communities are fully aware of this tional institutions in the Japanese language. necessity. The World Summit on the Information Analysis of the lexical terms could provide Society (WSIS) addressed this issue, saying that useful knowledge for language resource terminology development in education, science and creators. There are two adoption ap- proaches, “phonetic adoption” (translitera- culture is a key to developing knowledge societies. tion of borrowed terms) and “semantic 1.2 Two Approaches to Adopting Terms adoption” (where the meaning is expressed using native words). The proportion of the Unlike the situation with basic vocabulary, scien- two options found in the terminology set of tific or technical terms are usually not “home- each country (or language) shows a sub- grown”, but are often imported from the outside stantial difference and seems to reflecti world. And when a scientific or technical term is language policies of the country and the in- imported from one language (the source language) fluence of foreign languages on the host to another language (the host language), typically language. This paper presents preliminary two approaches are found. results of our investigation on this question The first approach is to simply borrow a word based on a comparative study of three lan- by creating a phonetically equivalent word in the guages: Japanese, Vietnamese and Thai. host language. For example, the equivalent of the

25 The 6th Workshop on Asian Language Resources, 2008

English word "science" in Malay is "sains", pro- pare a meaningful number of vocabulary items and nounced as /sains/, and "computer" in Japanese is to train teachers in their use. written as "コンピュータ" in the Japanese kata- kana1 syllabary and is pronounced /konpju:/ . We Table 1. Pros and Cons of Phonetic and Semantic call this phonetic adoption in this paper. It is quite Approaches to Term Adoption similar to transliteration, but not exactly the same Advantages Disadvantages due to the difference in phonetic structure between Phonetic (1) easy to find (1) disturbs the purity the source and host languages. adoption connectivity to the of host language The second approach is to create a new word by original word (2) not easy to guess combining a semantically relevant word-root or (2) easy to create what it means Semantic (1) does not disturb (1) not easy to find word in the host language. For example, "science" h adoption the purity of host connectivity to the in Thai is "วิทยาศาสตร", pronounced /wítt ayasà:t/. language original word This word is derived from the word "vidya" (2) relatively easy (2) not easy to create (knowledge) in Sanskrit, a source from which the to guess what it Thai language has imported many words through- means, or at least to out its history. "Science" in Japanese is "科学", what it relates pronounced /kagaku/. This word was coined more 1.4 Third Approach: Use of foreign language than one hundred years ago as a combination of two Chinese characters, "科" and "学", meaning "a In addition to this dilemma, a completely different section, or a branch of something" and "learning", approach can be taken when the necessity is very respectively. As shown in these two cases, it often urgent. This is to design and to implement the happens that classical languages like Sanskrit or whole educational program by means of an estab- Chinese, rather than the host language itself, pro- lished foreign language. We call this approach the vide root-words in this creation process just as source language approach. In the real world to- Latin and Greek roots are used in scientific and day, this third approach is widely adopted in many technical terms in English. We call this approach developing countries, to various extents. The semantic adoption in this paper. choice of foreign language depends on subject do- mains and the economic and political condition of 1.3 Pros and Cons of the Two Approaches the country as well as on historical ties. When we look back on the historical evolution of 1.5 Objectives of the study scientific terms dating from ancient civilization to modern times, both types of adoption are found. In summary, there are three basic options. How- Both approaches to adoption have advantages and ever, little research has been carried out on how a disadvantages in terms of ease of creation and ease language community can effectively formulate of understanding, as well as other issues. The pros terminology focused on Asian languages. “Guide- and cons of semantic adoption, summarized in Ta- lines for Terminology Policy” (Infoterm 2005) ble 1, are in principle the opposites of those of talks about various issues relating to this subject, phonetic adoption. but no mention is found relating to the question of But the choice of adoption approach is not so adoption strategy. The authors believe that an simple. The phonetic adoption option is easy to analysis of the existing terminology can provide implement, but would have the danger of disturb- practical and applicable lessons to terminology ing language purity, which also has a high priority development practitioners and policy makers. This in many nations. When the semantic adoption op- paper presents preliminary results of our investiga- tion is chosen, language purity is maintained but tion on this question, based on a comparative study the adoption cost is high and it takes time to pre- of three countries: Japanese, Vietnamese and Thai.

2 Dictionary 1 In Japanese, four scripts are used to write Japanese, e.g. In this paper, the main corpus is extracted from a kanji (i.e., Chinese characters), , and multilingual engineering dictionary based on Babel the Latin alphabet. Katakana script is conventionally used to represent foreign phonetically translated words. (Hayase and Kawata 2002). Babel is a web tech-

26 The 6th Workshop on Asian Language Resources, 2008

nology dictionary for foreign students in Japan2. It adoptions into 8 languages: Vietnamese, Chinese, is composed of the 11 academic domains given in Korean, Filipino, Malay, Sinhalese, Myanmar and Table 2 and three languages, English, Japanese and Mongolian. The data are coded in Unicode. At pre- Thai. sent Vietnamese, Chinese, Korean and Sinhalese language have been translated and other languages Table 2. The Number of Words in the Japanese will be finished soon. The dictionary will be made Lexicon by Subject Domain in April, 2008. For this paper, we have selected Words Words Subject Domain three languages for analysis – Vietnamese, Thai (TYPE) TOKEN and Japanese – because these languages have rela- architecture 1,314 1,798 tively rich vocabularies. chemistry 717 958 3 Comparison of Adoption Approaches civil engineering 766 1,263 telecommunications 650 1,410 The approach to terminology adoption is examined computer science 941 1,538 in this section from various angles. control engineering 1,081 1,545 3.1 Comparison by Country electronics 681 1,200 In order to investigate the portfolio of two adoption maritime science 961 1,403 approaches, we surveyed the entire lexicon for all mathematics 952 1,284 mechanical engineering subjects. In addition, we mechanical engineering 1,117 1,186 took a random sampling in our dictionary and physics 587 945 chose approximately 25 words from two major others* 1,872 - subject fields: architecture and computer science. The portfolios of three languages are shown in Ta- Total of all words 11,639 14,530 ble 4. Compounds such as hybrids of native and *there is some overlap with other domains loaned word-components were separated from se- Table 3. A Sample of the Lexicon mantic adoption. Vietnam- English Japanese Chinese Thai ese Table 4. Comparison of Terms in 3 Domains Origin discrim- วิธีแบงแยก Subject # of Adoption Rate 判別式 判别式 Biệt thức of inant ระหวางกัน Domain terms Approach words JP VN TH factor Định lý ษฎีบทตัวปร 因数定理 因数定理 theorem thừa số ะกอบ Semantic 0.68 0.89 0.78 circular Số đo การวัดเชิงว mechanical 弧度法 弧度法 1,186 English 0.32 0.09 0.21 measure cung tròn งกลม engineering phonetic others 0.02 0.01 radical 累乗根 根 Căn ราก root semantic 0.43 0.87 0.78 limit Giá trị computer 極限値 极限值 คาจํากัด 23 English 0.57 0.13 0.21 value giới hạn science phonetic diver- others 発散 发散 Phát tán การลูออก gence semantic 0.69 0.93 0.76 natural Đối số tự ลอการิทึมธ 自然対数 自然对数 logarithm nhiên รรมชาต ิ architecture 29 English 0.31 0.03 0.24 phonetic point of จุดเปลี่ยนค others 0.03 変曲点 拐点 Điểm uốn inflection วามเวา

Table 4 shows that the Japanese language has To this primary source, we added some terms in the highest percentage of phonetically translated mathematics and physics and made a multilingual words from English. Vietnamese has the lowest database such as that shown in Table 3 by adding percentage of phonetic adopted words from Eng- lish, but includes terms from French in mechanical 2 http://www.toyama-cmt.ac.jp/%7Ehayase/Project/Babel/ engineering plus a few from Russian. Thai shows a

27 The 6th Workshop on Asian Language Resources, 2008

roughly similar balance between semantic adoption 3.3 Mathematical Terms at Different Grade and English-origin terms in all three domains. Level The high number of Thai and Vietnamese origin We have collected about 1,300 terms in mathemat- words in computer science is rather remarkable, as ics for our multilingual engineering dictionary. it indicates a consistent effort to adapt the Thai Following the Japanese course of study3 for grades language to every emerging technology. Note that 1-12, we chose 57 terms from our dictionary and over half of the computer science terms for Japa- compared those terms with their counterparts in nese are of English origin. Vietnamese draws on Thai and Vietnamese. For university-level mathe- different languages to varying degrees; it is notable matical terms, following the syllabus of the 1st year that the languages of origin vary by subject field. of the Faculty of Mathematics in one Japanese 4 3.2 Comparison by Subject Domain: A Case of University, we chose 11 terms . Then we also sur- Japanese Terms veyed those terms in the same way. Table 6 exam- ines the terms used in various grade levels. As we saw in Table 4, Japanese technical terms are often phonetically translated, but less so in archi- Table 6. Comparison of Mathematical Terms by tecture than in the other two fields. To gain a better Grade Level idea of the distribution of phonetic adoption, we Rate surveyed the entire Japanese lexicon for all subject Grade # of Origin of domains. Table 5 gives the rate of katakana usage, Level terms words JP VN TH that is, the rate of words written in the katakana Semantic 0.91 0.90 0.64 Univer- script used for transcribing foreign words. 11 (Sino) (0.91) (0.64) (0.00) sity Phonetic 0.09 0.09 0.36 Table 5. Rate of katakana Usage Semantic 1.00 1.00 0.88 High Subject Domain # of terms Katakana Rate 17 (Sino) (1.00) (0.47) (0.00) School architecture 1,798 147 0.081 Phonetic 0.00 0.00 0.12 chemistry 958 234 0.244 Junior Semantic 1.00 1.00 1.00 high 18 (Sino) (1.00) (0.61) (0.00) civil engineering 1,263 156 0.123 school Phonetic 0.00 0.00 0.00 telecommunications 1,410 422 0.299 Semantic 1.00 1.00 1.00 computer science 1,538 650 0.422 Elemen- tary 22 (Sino) (1.00) (0.50) (0.00) control engineering 1,545 497 0.321 school Phonetic 0.00 0.00 0.00 electronics 1,200 250 0.208 maritime science 1,403 51 0.036 These data indicate that all languages use either mathematics 1,284 50 0.038 original or semantically adopted from elementary mechanical engineering 1,186 385 0.324 level to junior-high school level. For Japanese al- physics 945 120 0.126 most all words are Sino-Japanese; in oral instruc- tion native words may be used in Japan but those The domains of computer science, control engi- are not used as keywords. As for high school, some neering and mechanical engineering include over words phonetically translated from English are 30% of katakana words in their terms. In contrast, used in Thai, while at university level all languages the percentage of katakana terms in architecture, use English-origin words. As for the rate of Sino- mathematics and maritime science domains is less Vietnamese words in mathematics, it is considera- than 10%, indicating that these domains are highly bly higher than other categories. This suggests that semantically adopted in their terms, or have little the likelihood that Sino-origin words can be shared need for imported vocabulary. We thus see that academic domain affects the adoption of phoneti- 3 cally translated terms. The Ministry of Education, Culture, Sports, Science and Technology, Japan (MEXT) determines the national curriculum. 4 We took terms from the 1st year syllabus of the Faculty of Mathematical Science in Doshisya University.

28 The 6th Workshop on Asian Language Resources, 2008

effectively in other countries using Chinese charac- While data are limited to just four terms, general ters is generally high in the field of mathematics. trends can be identified. For Japanese pages, while there were many hits for English, the hits for Japa- 3.4 Web Presence of English Terms and their nese terms were at least three times higher. Inter- Adopted Equivalents estingly, for operand, the one term with two We also examined the web presence of pairs of equivalents, the katakana (i.e., phonetically terms, an English word and its equivalent in the adopted) term was used far more frequently than host language. This is an attempt to find how the semantically adopted kanji term. In the Viet- widely the source language approach – direct use namese pages, while far fewer total hits were made, of foreign language – is used in technological top- the Vietnamese terms were overwhelmingly more ics. Four words from different subject domains frequent than English terms. For Thai pages the were chosen, and a search was carried out for the opposite was found, with the exception of auto- term (as an exact phrase) using the Google search matic lathe, a tool that may require marketing in engine (Table 7). the local language. These results suggest that the major language of technology for Thailand is Eng- Table 7. Web Presence of Technical Terms lish, for Vietnam it is the native language, and for Search Japan there is English use but Japanese dominates. Search term(s) # of hits* through In order to find if there is a distinction between EN ionization 121,000 use in technological fields and in general use, we JP 電離 419,000 made a comparison using three non-technical terms EN fusion welding 387 (Table 8).

JP 融接 12,900 Table 8. Web Presence of Non-technical Terms Japanese EN automatic lathe 258 Search pages Search term(s) # of hits* through JP 自動旋盤 56,600 EN water 2,200,000 EN operand 32,500 JP 水 24,200,000 JP オペランド (katakana) 138,000 Japanese EN novel 2,242,000 JP 演算数 (kanji) 10,700 pages JP 小説 9,870,000 EN ionization 89 EN zoo 2,580,000 VN Điện ly 13,900 JP 動物園 2,650,000 EN fusion welding 6 EN water 295,000 Viet- VN Hàn nóng chảy 66 namese VN nước 4,310,000 pages EN automatic lathe 2 EN novel 68,400 VN Máy tiện tự động 48 Vietnamese VN tiểu thuyết 2,130,000 EN operand 111 pages EN zoo 41,500 VN Số tính toán 810 VN vườn bách thú** 615,000 EN ionization 10,600 VN sở thú*** 1,750,000 TH การแตกประจุ 48 EN water 1,850,000 EN fusion welding 692 TH น้ํา 2,610,000 Thai TH การเชื่อมแบบหลอมละลาย 6 EN novel 269,000 pages EN automatic lathe 39 Thai pages TH วรรณกรรม 1,830,000 TH เครื่องกลึงอัตโนมัติ 195 EN zoo 272,000 EN operand 568 TH สวนสัตว  620,000 TH ตัวถูกดําเนินการ 24 *Accessed on Sept. 10, 2007 *Accessed on Sept. 18, 2007 **Northern dialect, ***Southern dialect

29 The 6th Workshop on Asian Language Resources, 2008

Again, we find that Japanese pages use the “[e]very nationality has the right to use its own Japanese word far more frequently (although zoo is language and system of writing, to preserve its na- nearly equal – perhaps due to the exotic appeal of a tional identity, and to promote its fine customs, foreign word, and one understood by almost every habits, traditions and culture.” Therefore they sel- Japanese) but that English words have a sizable dom use phonetically adopted words as such. One presence. For Vietnamese pages, the number of example of its limited use is “ôm” from the English hits is substantially higher than that for the techni- “ohm.” Another example is proper nouns, when cal terms, but the same trend exists: the native lan- foreign words are used in newspapers This politi- guage dominates. Finally, for Thai pages, the hits cal reason is the primary impetus for the semantic for Thai words far exceed those for English words adoption of terms. in all three cases, although English words are often Secondly, as in other communist nations, Eng- used. lish is not taught as an obligatory subject in Viet- A comparison of results for Tables 7 and 8 con- nam. This educational policy also brings about a firms that the role of languages differs between low rate of phonetically adopted words from Eng- daily use and situations that require technical ter- lish, because many people cannot grasp the mean- minology, even in a country that has adopted the ing. third approach of using a foreign language for sci- Thirdly, we need to consider that the Vietnam- ence and technology. ese language has been historically influenced by several foreign languages, as we can induce from 4 Social and Academic Factors Table 4’s results. Vietnam was under the influence of China for approximately 1,000 years. A number 4.1 High Rate of Semantic Adoption in Asian of Chinese terms were absorbed into Vietnamese Countries as Sino-Vietnamese words. During the French co- As can be imagined, it is not easy for Asian people lonial period, the Vietnamese language added to understand technical terms in other Asian lan- French words for the manufacture of specific guages, because semantic adoption is popular products such as automobiles and bicycles. Lastly, among Asian languages. The diversity of scripts in at the end of the 20th century Russian technical Asia is another reason of this difficulty. In Euro- terms were introduced by people who had studied pean countries, term-sharing by phonetic adoption in or technical advisers who had sent by the Soviet is easier because almost all languages were derived Union or in Russia from the same origin and are sharing the same script. Therefore, in order to achieve mutual com- 4.2.2 Japan munication using technical terms in Asian coun- As is the case with Vietnam, Chinese characters tries, we need to make special efforts to go over and words were imported from China so long ago the variation and harmonize terms each other. that the characters and words are regarded as part But we need to discuss why those languages pre- of the native Japanese language. In the end of the fer semantic adoption. Various social, political, 19th century, a number of Japanese-kango (words and historical factors influence language use. This that used Chinese characters but were created by applies also to the adoption of technical terminol- Japanese people) were invented as adoptions of ogy. Here, we discuss influences of these factors western technical terms, and many of these terms on the approaches used to adopt technical terms, were re-imported to China. including the influence of technical domains and After World WarⅡ, Japan was occupied by the academic discipline. Allied Forces led by the United States and English 4.2 Language Policies and Historical Back- became a compulsory subject from junior high ground school. English-origin words became easy to un- derstand for many Japanese. This historical change 4.2.1 Vietnam caused a sharp rise in percentage of phonetic adop- tions from English among Japanese technical terms In Vietnam, adoption is strongly promoted under (see Table 5) (Hashimoto 2007). However, English the strict language policy. The Political Resume of is not usually used as an instructional medium in the Socialist Republic of Vietnam states that

30 The 6th Workshop on Asian Language Resources, 2008

Japanese institutions of higher education. Teachers disciplines with academic societies established and students know many English loan terms but relatively early may have developed a great deal of they use these loan terms not as English but as their terminology during the Meiji era, and may Japanese, and since they are pronounced according still have a preference for Japanese-kango over to Japanese phonetic rules, they may not even be katakana. Newer disciplines, on the other hand, intelligible to English speakers. may have found it easier to use the imported terms in a relatively direct way. 4.2.2 Thai Thai has different features from the other two Table 9. Academic Societies in Japan countries in that it is less influenced by Chinese. Discipline Major Academic Society in Japan Est. But the Thai language has been influenced by an- architecture Architectural Institute of Japan 1886 other foreign language, Sanskrit, the classical In- civil Japan Society of Civil Engineers 1914 dian language. We did not examine the rate of San- engineering Chemistry and Chemical Industry skrit origin words in the Thai language in this chemistry 1878 study, but it seems likely that a high rate of terms of Japan Information Processing Society of phonetically adopted from Sanskrit will be found 1960 in our engineering dictionary. Although Thai has tele- Japan native-language equivalents at the lexical level, communica- The Institute of Electronics, Infor- tions, mation, and Communication Engi- 1987 English is daily used as an instructional medium in computer neers higher education. Students in Thailand start to science The Japan Society of Information 1983 learn English at 10 years old or earlier. Thus, even and Communication Research though they have native terms they prefer to use control The Institute of Systems, Control 1957 terms phonetically adopted from English or to use engineering and Information Engineers English. The Institute of Electrical Engi- electronics 1888 4.3 Other Factors neers of Japan maritime The Japan Society of Naval Archi- 1898 4.3.1 When the subject was introduced? science tects and Ocean Engineers In the Meiji era (1868-1912) in Japan, experts mathematics Mathematical Society of Japan 1877 mechanical The Japan Society of Mechanical translated technical terms from Western languages 1897 engineering Engineers semantically, into Japanese-kango. Since World War II, however, it has been the mainstream to physics The Physical Society of Japan 1877 transliterate into katakana. Therefore a decisive 4.3.2. Is it indigenous? factor in adoption approach is when the subject was introduced into Japan. Table 9 shows the Each country has specific characteristics of the foundation year for major academic societies and region, such as regional climate, folk, culture, life- institutes in engineering fields, which roughly in- style and other indigenous factors. Technical terms dicates the time of introduction of the subject. in these fields are therefore strongly connected As shown in Table 5, computer science is the with regional characteristics. Before unfamiliar field that uses technical terms in katakana most terms or new technologies were brought over to the frequently, while the sampled vocabulary from host country, experts in those fields had already telecommunications also includes nearly 30% of created and used their own technical terms in their katakana words. These disciplines are relatively native language, and also they were already aware newly developed in Japan; the societies or insti- of the concept of a thing indicated by an unfamiliar tutes related to these fields were established in the term in alien language. All they had to do was to latter half of the 20th century. make a one-to-one correspondence and correctly The oldest societies, the Mathematical Society translate a foreign term into the term that was al- of Japan and The Physical Society of Japan, were ready in use in the mother tongue. established in 1877; our study showed that only 3.8% of mathematical terms and 12.6% of terms in physics are phonetically adopted terms. Therefore,

31 The 6th Workshop on Asian Language Resources, 2008

5 Conclusion References This preliminary study looks at the different ap- Christian Galinski. 1995. Terminology and Standardiza- proaches to adopting technical terms in three Asian tion under a machine adoption perspective. MT languages. The three languages investigated prefer Summit V Proceedings. different methods: Vietnamese tends to adopt Douglas Skuce, Ingrid Meyer. 1990. Concept Analysis words into its language semantically, through and Terminology: A Knowledge-Based Approach to adoption, while Japanese adopts them phonetically, Documentation, International Conference on Compu- through transliteration. Thai, to a large extent, has tational Linguistics archive Proceedings of the 13th chosen a third approach, that of using a source lan- conference on Computational linguistics, Volume 1. guage (English). Giovanni Battista Varile, Antonio Zampolli. 1997. Sur- In Japanese and in Vietnamese, many terms sub- vey of the State of the Art in Human Language Tech- ject to semantic adoption were translated into Chi- nology. Cambridge University Press. nese characters. This suggests that Sino-words may Hashimoto Waka. 2007. A diachronic transition of pho- be an effective way of communicating concepts netically translated words in Japanese: based on the among certain language users in certain disciplines, column of newspaper. Language, 36-6, Taishukan such as mathematics. Publishers, Japan. (橋本和佳 「外来語の通時的 The effect of grade level was also investigated, 推移-新聞社説を素材として-」『言語』36-6, and it was found that native language terms were 大修館書店) used in mathematics almost exclusively until high Hayase Yoshikazu, Shigeo Kawata. 2002. BABEL: A school or university level. multilingual dictionary and hypertext formation sys- Subject domain was found to have an effect on tem for engineers. Transactions of JSCES, Paper the adoption approach and was often influenced by No.20020023.(早勢欣和, 川田重夫「BABEL: エ whether the domain had a long tradition, and when ンジニア支援のための多言語辞書とハイパーテ established a discipline was (as shown by the キスト生成」) foundation of academic societies). A domain effect was seen in Vietnamese, where Russian or French- Infoterm. 2005. UNESCO Guidelines for Terminology origin terms appeared in different domains. Policies. Fomulating and implementing terminology While quite limited in scope, this study has re- policy in language communities, Paris. vealed clear trends that deserve further investiga- Kageura Kyo. 2006. The status of borrowed items in tion. Japanese terminology. Studies in Japanese Language Vol2, No.4 (影浦峡 「日本語専門語彙の構成 6 Future Research における外来語語基の位置づけ」『日本語の研 究』2006) We plan to investigate more fully the rate of pho- netically translated words and approach to termi- UNESCO. 2003. Education in a multilingual world. nology adoption. We expect that Chinese and Ko- Paris: UNESCO, (ED-2003/WS/2) . rean will give us further evidence of widespread http://unesdoc.unesco.org/images/0012/001297/1297 use of Sino-words and Mongolian will provide 28e.pdf data for phonetic adoption from Russian. Malay and Filipino are likely to be highly influenced by English, while we will find the influence of San- skrit in Myanmar and Sinhalese.

Acknowledgements The study was made possible by the sponsorship of the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) through the Asian Language Resource Network project. The authors would like to express sincere gratitude to all lexicon contributors for their help.

32 The 6th Workshop on Asian Languae Resources, 2008

Selection of XML tag set for Myanmar National Corpus

Wunna Ko Ko Thin Zar Phyo AWZAR Co. Myanmar Unicode and NLP Research Mayangone Township, Yangon, Center Myanmar Myanmar Info-Tech, Hlaing Campus, [email protected] Yangon, Myanmar [email protected]

Myanmar Language Commission and a number Abstract of scholars had been collected a number of corpora for their specific uses [Htay et al., 2006]. But there In this paper, the authors mainly describe is no national corpus collection, both in digital and about the selections of XML tag set for non-digital format, until now. Myanmar National Corpus (MNC). MNC Since there are a number of languages used in will be a sentence level annotated corpus. Myanmar, the national level corpus to be built will The validity of XML tag set has been include all languages and scripts used in Myanmar. tested by manually tagging the sample data. It has been named as Myanmar National Corpus or MNC, in short form. Keywords: Corpus, XML, Myanmar, During the discussion for the selection of Myanmar Languages format for the corpus, XML (eXtensible Markup 1 Introduction Language), a subset of SGML (Standard Generalized Markup Language), format has been Myanmar (formerly known as Burma) is one of the chosen since XML format can be a long usable and South-East Asian countries. There are 135 ethnic possible to keep the original format of the text groups living in Myanmar. These ethnic groups [Burnard. 1996]. The range of software available speak more than one language and use different for XML is increasing day by day. Certainly more scripts to present their respective languages. There and more NLP related tools and resources are are a total of 109 languages spoken by the people produced in it. This in turn makes the necessity of living in Myanmar [Ethnologue, 2005]. selection of XML tag set to start building of MNC. There are seven major languages, according to MNC will include not only written text but also the speaking population in Myanmar. They are spoken texts. The part of written text will include Kachin, Kayin/Karen, Chin, Mon, Burmese, regional and national newspapers and periodicals, Rakhine and Shan [Ko Ko & Mikami, 2005]. journals and interests, academic books, fictions, Among them, Burmese is the official language and memoranda, essays, etc. The part of spoken text spoken by about 69% of the population as their will include scripted formal and informal mother tongue [Ministry of Immigration and conversations, movies, etc. Population, 1995]. During the selection of XML tag sets, the Corpus is a large and structured set of texts. sample for all the data which will be included in They are used to do statistical analysis, checking building of MNC, has been learnt. occurrences or validating linguistic rules on a specific universe.1 2 Myanmar National Corpus In Myanmar, there are a plenty of text for most Myanmar is a country of using 109 different of the languages, especially Burmese and major languages and a number of different scripts languages, since stone inscription. [Ethnologue, 2005]. In order to do language processing for these languages and scripts, it

1 http://en.wikipedia.org/wiki/Text_corpus becomes a necessity to build a corpus with

33 The 6th Workshop on Asian Languae Resources, 2008

languages and scripts used in Myanmar; at least great benefit about XML is that the document itself with major languages and scripts, which will describes the structure of data. 2 include almost all areas of documents. Three characteristics of XML distinguish from Among the different scripts used in Myanmar, other markup languages:3 the popular scripts include Burmese script (a • its emphasis on descriptive rather than Brahmi based script), Latin scripts. Building of procedural markup; MNC will be helpful for development of Natural Language Processing (NLP) tools (such as • its notion of documents as instances of a grammar rules, spelling checking, etc) and also for document type and linguistic research on these languages and scripts. Moreover, since Burmese script is written without • its independence of any hardware or necessarily pausing between words with spaces, software system. the corpus to be built is hoped to be useful for Since MNC is to be built in XML based format, developing tools for automatic word segmentation. the selection process for tag set of XML become an important process. The XML tagged corpus data 2.1 XML based corpus should also keep the original format of the data. XML is universal format for structured documents In order to select XML tag set for MNC, the and data, and can provide highly standardized sample data for the corpus has to be collected. The representation frameworks for NLP (Jin-Dong format of the sample corpus data has been studied KIM et al. 2001); especially, the ones with for the selection of the XML tag set in appropriate annotated corpus based approaches, by providing with the data format. them with the knowledge representation frameworks for morphological, syntactic, 2.2 Structure of a data file at MNC semantics and/or pragmatics information structure. The structure of a data file at MNC will include Important features are: two main parts: information of the corpus file and • XML is extensible and it does not consist the corpus data. of a fixed set of tags. The first part, the header part of a corpus file, describes the information of a corpus file. The • XML documents must be well-formed information of the corpus file includes the header according to a defined syntax. which will provide sensible use of the corpus • XML document can be formally validated information in machine readable form. In this part, against a schema of some kind. the information such as language usage and the description of the corpus file will be included. • XML is more interested in the meaning of The second part, the document part, of a corpus data than its presentation. file will include the source description of the The XML documents must have exactly one corpus data and the corpus data, the written or top-level element or root element. All other spoken part of the text, itself. The information of elements must be nested within it. Elements must the corpus data such as bibliographic information, be properly nested [Young, 2001]. That is, if an authorship, and publisher information will be element starts within another element, it must also included in this section. Moreover, the corpus data end within that same element. itself will also be included in this section. Each element must have both a start-tag and an The hierarchically structure of a corpus file at end-tag. The element type name in a start-tag must MNC will be as shown in figure 1. exactly match the name in the corresponding end- tag and element name are case sensitive. Moreover, the advantages of XML for NLP includes ontology extraction into XML based structured languages using XML Schema. The

2 http://www.tei-c.org/P5/Guidelines/index.html 3 http://www.w3.org/TR/xml/

34 The 6th Workshop on Asian Languae Resources, 2008

Myanmar National Corpus Data File

Header Document

Language File Source Written or Usage Description Description Spoken Texts

Figure 1. Hierarchically structure of a data file at MNC

6 3 Selection of necessary XML tag set for the text encoding and Interchange . TEI encoding scheme consists of a number of rules After studying original formats and features of with which the document has to adhere in order texts, to be used in corpus, and the structure of to be accepted as a TEI document. the corpus file has been determined, the This header part contains language usage of selection procedure for XML tag set has been the data file and the file started. description which includes machine British National Corpus (BNC)4, American readable information of the data file. 5 National Corpus (ANC) had been referenced - for selection of XML tag set. - The selection of XML tag set is based on the + nature of the structure of a data file. The main + tag for the data file will be named as which is the abbreviation of Myanmar National + Corpus. A data file contains two main parts, the Figure 3. Element and Child tags of MNC header part and the document part. - The language usage part contains such + information as language name , + script information 4 Lou Burnard. 2000. Reference Guide for the British National Corpus (World Edition). Oxford University Computing Services, Oxford. 5 Nancy Ide and Keith Suderman. 2003. The American National Corpus, first Release. Vassar College, 6 TEI Consortium. 2001, 2002 and 2004 Text Encoding Poughkeepsie, USA Initiative. In The XML Version of the TEI Guidelines.

35 The 6th Workshop on Asian Languae Resources, 2008

bibliographic information, such as title, name of author, publisher, etc., of the original data. + + + - - Figure 4. 2nd level Child tags in language - Usage part of MNC The file description part contains such information as title information of the corpus file - , edition information and publication information about the corpus file . The detail information will be tagged using more specific lower level child tags under the previously described tags. - + - + - nd + Figure 7. 2 level Child tags for source + description part of MNC + The second part, the original data part or will contain the whole + original data. The original format information such as heading , sub- heading , paragraph number Figure 5. 2nd level Child tags in file , sentence number description part of MNC will be saved in this part. 3.2 Document Part + The XML tag for the document part of the - corpus data file is named as which is + the short form of Myanmar Document. It - contains two sub parts: the source description of - the data and the original data itself which in turn can be divided into two + types; written text and the spoken text + . + - + Figure 8. 2nd level Child tags for original data + part of MNC

Since MNC is going to be annotated in Figure 6. Element and Child tags of MNC sentence level, each sentence will be annotated and numbered. The first part, the source description part of the data , will contain the

36 The 6th Workshop on Asian Languae Resources, 2008

3.3 Sample MNC data file + The Myanmar National Corpus is a major - resource for linguistic research, as well as + computational linguistics research, lexicography, - corpus linguistic research and a resource for the - development of Myanmar Language teaching material because we expect the corpus to be - continually expanded in the future. - A sample MNC data is use the Universal Declaration of Human Rights (UDHR) texts in Burmese and Karen, which is one of the major + languages in Myanmar, has been used to sample tagging with the selected XML tag set. The following figure is show for the sample MNC. Figure 9. Down to the sentence level Child tags of MNC

- - Myanmar 10646 utf-8 Unicode 5.0 - - Myanmar National Corpus - Corpus built by Myanmar NLP Team - First TEI-conformant version -

Myanmar Info-Tech, Yangon, Myanmar
Availability limited to Myanmar NLP Team - 07/06/2007 Myanmar NLP Team MNC101

37 The 6th Workshop on Asian Languae Resources, 2008

- - - အြပည်ြပည်ဆိ�င်ရာလူ ့အခွင့်အ�ရး��ကညာစာတမ်း (meaning: Universal Declaration of Human Rights) -

- - အြပည်ြပည်ဆိင်ရာလူ ့အခွင့်အရးကညာစာတမ်း (meaning: Universal Declaration of Human Rights) + - စကားချီး (meaning: Preamble) - - လူခပ်သိမ်း၏မျိုးရိးဂဏ်သိက္ခာနှင့်တကွလူတိင်းအညီအမခစားခွင့်ရှိသည့်အခွင့်အရးများကိ အသိအမှတ်ြပုြခင်းသည်လူခပ်သိမ်း၏လွတ်လပ်မ၊တရားမတမ၊ငိမ်းချမ်းမတိ ့၏အြခခအတ်ြမစ် ြဖစ်သာကာင့်လည်းကာင်း၊ ……. (meaning: Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world,…….) + - အပိဒ် ၁ (meaning: paragraph 1) - လူတိင်းသည်တူညီလွတ်လပ်သာဂဏ်သိက္ခာြဖင့်လည်းကာင်း၊တူညီလွတ်လပ်သာအခွင့်အရး များြဖင့်လည်းကာင်း၊မွးဖွားလာသူများြဖစ်သည်။ (meaning: All human beings are born free and equal in dignity and rights.)

38 The 6th Workshop on Asian Languae Resources, 2008

ထိသူတိ ့၌ပိင်းြခားဝဖန်တတ်သာဉာဏ်နှင့်ကျင့်ဝတ်သိတတ်သာစိတ်တိ ့ရှိက၍ထိသူတိ ့သည် အချင်းချင်းမတ္တာထား၍ဆက်ဆကျင့်သးသင့်၏။ (meaning: They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.) - အပိဒ် ၂ (meaning: paragraph 2) + + - အပိဒ် ၃ (meaning: paragraph 2) - လူတိင်း၌အသက်ရှင်ရန်လွတ်လပ်မခွင့်နှင့်လြခုစိတ်ချခွင့်ရှိသည်။ (meaning: Everyone has the right to life, liberty and security of person.) + + Figure 10. Sample MNC Corpus file (Burmese UDHR text in MNC XML format)

next step is to develop an algorithm for 4 Conclusion and Future work automatic tagging the data. In this paper, the authors have clearly described Acknowledgement about the selection of XML tag set for building of MNC. Since the word level segmentation for This study was performed with the support of Burmese script is not yet available, the corpus the Government of the Union of Myanmar data will be annotated only up to the sentence through Myanmar Natural Language level in order to be in the same format for all Implementation Committee. Thanks and Myanmar languages and scripts. gratitude towards the members of Myanmar In order to check whether the selected the Language Commission for providing necessary XML tag set will be enough and useful for information to write this paper. tagging the corpus data, the sample corpus data has been collected by manually tagging the data References which includes newspapers and periodicals, Ethnologue. 2005 Languages of the World, 15th Universal Declaration of Human Rights Edition, Dallas, Tex.: SIL International. Online (UDHR), novels and essays. version: http://www.ethnologue.com/. Edited by Since the manual tagging to the sample Raymond G. Gordon, Jr. corpus data proves that the selected XML tag set is enough to cover a variety of data sources, the

39 The 6th Workshop on Asian Languae Resources, 2008

Hla Hla Htay, G. Bharadwaja Kumar and Kavi N. Murthy. 2006. Constructing English-Myanmar Parallel Corpora. The Fourth International Conference on Computer Application 2006 (ICCA 2006) Conference Program. Jin-Dong KIM, Tomoko OHTA, Yuka TATEISI, Hideki MIMA and Jun’ichi TSUJII. 2001. XML- based Linguistic Annotation of Corpus . In the Proceedings of the first NLP and XML Workshop held at NLPRS 2001. pp. 47--53. Lou Burnard. 1996. Using SGML for Linguistic Analysis: the case of the BNC. ACM Vol 1 Issue 2 (Spring 1999) MIT Press ISSN: 1099-6621. pp. 31-51. Michaek J. Young. 2001. Step by Step XML.Prentice Hall of India Private Limited Press. ISBN-81-203- 1804-B Ministry of Immigration and Population. 1995. Myanmar Population Changes and Fertility Survey 1991. Immigration and Population Department Wunna Ko Ko, Yoshiki Mikami. 2005 Languages of Myanmar in Cyberspace, In Proceedings of TALN & RECITAL 2005 (NLP for Under-Resourced Languages Workshop), Dourdan, FRANCE, 2005 June, pp. 269-278.

40 The 6th Workshop on Asian Languae Resources, 2008

Myanmar Word Segmentation using Syllable level Longest Matching

Hla Hla Htay, Kavi Narayana Murthy Department of Computer and Information Sciences University of Hyderabad, India hla hla [email protected], [email protected]

Abstract Key Words:- Myanmar, Syllable, Words, Seg- mentation, Syllabification, Dictionary

In Myanmar language, sentences are 1 Introduction clearly delimited by a unique sentence boundary marker but are written without Myanmar (Burmese) is a member of the Burmese- necessarily pausing between words with Lolo group of the Sino-Tibetan language spoken by spaces. It is therefore non-trivial to seg- about 21 Million people in Myanmar (Burma). It ment sentences into words. Word tokeniz- is a tonal language, that is to say, the meaning of a ing plays a vital role in most Natural Lan- syllable or word changes with the tone. It has been guage Processing applications. We observe classified by linguists as a mono-syllabic or isolating that word boundaries generally align with language with agglutinative features. According to syllable boundaries. Working directly with history, Myanmar script has originated from Brahmi characters does not help. It is therefore script which flourished in India from about 500 B.C. useful to syllabify texts first. Syllabification to over 300 A.D (MLC, 2002). The script is syllabic is also a non-trivial task in Myanmar. We in nature, and written from left to right. have collected 4550 syllables from avail- Myanmar script is composed of 33 consonants, able sources . We have evaluated our syl- 11 basic vowels, 11 consonant combination sym- lable inventory on 2,728 sentences spread bols and extension vowels, vowel symbols, devow- over 258 pages and observed a coverage of elizing consonants, diacritic marks, specified sym- 99.96%. In the second part, we build word bols and punctuation marks(MLC, 2002),(Thu and lists from available sources such as dic- Urano, 2006). Myanmar script represents sequences tionaries, through the application of mor- of syllables where each syllable is constructed from phological rules, and by generating syllable consonants, consonant combination symbols (i.e. n-grams as possible words and manually Medials), vowel symbols related to relevant conso- checking. We have thus built list of 800,000 nants and diacritic marks indicating tone level. words including inflected forms. We have Myanmar has mainly 9 parts of speech: noun, tested our algorithm on a 5000 sentence pronoun, verb, adjective, adverb, particle , conjunc- test data set containing a total of (35049 tion, post-positional marker and interjection (MLC, words) and manually checked for evaluat- 2005), (Judson, 1842). ing the performance. The program recog- In Myanmar script, sentences are clearly delim- nized 34943 words of which 34633 words ited by a sentence boundary marker but words are were correct, thus giving us a Recall of not always delimited by spaces. Although there is 98.81%, a Precision of 99.11% and a F- a general tendency to insert spaces between phrases, Measure is 98.95%. inserting spaces is more of a convenience rather than

41 The 6th Workshop on Asian Languae Resources, 2008

a rule. Spaces may sometimes be inserted between ¥ CVVC [maun] (term of address for young men) words and even between a root word and the associ- ated post-position. In fact in the past spaces were ¥ CGVVC [mjaun] ditch rarely used. Segmenting sentences into words is therefore a challenging task. Words in the Myanmar language can be divided Word boundaries generally align with syllable into simple words, compound words and complex boundaries and syllabification is therefore a useful words (Tint, 2004),(MLC, 2005),(Judson, 1842). strategy. In this paper we describe our attempts on Some examples of compound words and loan words syllabification and segmenting Myanmar sentences are given below. into words. After a brief discussion of the corpus collection and pre-processing phases, we describe ¥ Compound Words

our approaches to syllabification and tokenization   ÜÝ – head [u:] Ç + pack [htou ] = hat [ou into words.  htou ] ÇÜÝ

Computational and quantitative studies in Myan- ÅÙÒ language [] ÔÑ+ look,see [kji.] + mar are relatively new. Lexical resources available   [tai ] building ÛÙ = library [sa kji. dai ]

are scanty. Development of electronic dictionaries ÔÑÅÙÒ ÛÙ

and other lexical resources will facilitate Natural ¼¸ – sell [yaun:] ²Ñ + buy [we] = trading Language Processing tasks such as Spell Checking, [yaun : we] Machine Translation, Automatic Text summariza- ²Ñ¼¸ tion, Information Extraction, Automatic Text Cate- ¥ Loan Words gorization, Information Retrieval and so on (Murthy,

2006). – ÙÝ×ÄÛÑ [kun pju ta] computer 

Over the last few years, we have developed mono- – ÕÝÙÑÖÛ [hsa ko ti] sub-committee lingual text corpora totalling to about 2,141,496 ø

– ׸² [che ] cherry sentences and English-Myanmar parallel corpora amounting to about 80,000 sentences and sentence 3 Corpus Collection and Preprocessing fragments, aligned at sentence and word levels. We have also collected word lists from these corpora Development of lexical resources is a very tedious and also from available dictionaries. Currently our and time consuming task and purely manual ap- word list includes about 800,000 words including in- proaches are too slow. We have downloaded Myan- flected forms. mar texts from various web sites including news sites including official newspapers, on-line maga- 2 Myanmar Words zines, trial e-books (over 300 full books) as well as Myanmar words are sequences of syllables. The syl- free and trial texts from on-line book stores includ- lable structure of Burmese is C(G)V((V)C), which ing a variety of genres, types and styles - modern and is to say the onset consists of a consonant option- ancient, prose and poetry, and example sentences ally followed by a glide, and the rhyme consists of from dictionaries. As of now, our corpus includes a monophthong alone, a monophthong with a con- 2,141,496 sentences. sonant, or a diphthong with a consonant 1. Some The downloaded corpora need to be cleaned up to representative words are: remove hypertext markup and we need to extract text if in pdf format. We have developed the necessary ¥ CV [mei] girl scripts in Perl for this. Also, different sites use dif-  ferent font formats and character encoding standards ¥ CVC [me ] crave are not yet widely followed. We have mapped these ¥ CGV [mjei] earth various formats into the standard WinInnwa font for-  mat. We have stored the cleaned up texts in ASCII ¥ CGVC [mje ]eye format and these pre-processed corpora are seen to 1http://en.wikipedia.org/wiki/Burmese language be reasonably clean.

42 The 6th Workshop on Asian Languae Resources, 2008

4 Collecting Word Lists Nominative personal pronouns

ÙÊÖ  I ÙÊÛÑ[kjun do], [kja ma.], [nga],

 ø 

Ù×Ñ ÙÊ Ý Ù×ÃÝ[kjou ], [kja no], [kjanou ], Electronic dictionaries can be updated much more ø ø Ù×Ö[kja ma.] easily than published printed dictionaries, which Possessive pronounsø and adjectives

 ÙÊÛÑ my ÙÊ Ý[kjou i.], [kjun do i.],

need more time, cost and man power to bring out  Ù×Ñ ÙÊÖ[kja ma. i.], [kja nou i.],

a fresh edition. Word lists and dictionaries in elec- ø ø 

 ²ÂÍ Ù×ÃݲÂÍ Ù×Ö[kja ma. i.], [nga i.], [kjou i.],

ø  ÙÊÛѲÂÍ tronic form are of great value in computational lin- ÙÊ Ý²ÂÍ[kjou je.], [kjun do je.],

 Ù×ѲÂÍ ÙÊÖ²ÂÍ[kja ma. je.], [kja nou je.],

guistics and NLP. Here we describe our efforts in ø ø  ²ÂÍ Ù×Ö²ÂÍ[kja ma. je.], [nga je.],

developing a large word list for Myanmar. ø  ÙÊÛÑ Ù×ÃݲÂÍ[kjou je.], [kjun do.],

Ù×Ñ[kja no.]

4.1 Independent Words Indefinite pronounsø and adjectives Ø×ÃÍÓÑ some Ø×ÃÍ[a chou.], [a chou. tho:],

As we keep analyzing texts, we can identify some ø ø Û×ÃÍÓÑ Û×ÃÍ[ta chou.], [a chou. tho:], ø ø words that can appear independently without com- Û×ÃÍ×ÃÍ[ta chou.ta chou.], ø ø Û×ÃÍÛÚ[ta chou.ta lei] bining with other words or suffixes. We build a list ø ø of such valid words and we keep adding new valid Table 1: Stop-words of English Vs Myanmar words as we progress through our segmentation pro- cess, gradually developing larger and larger lists of valid words. We have also collected from sources always necessarily lead to correct segmentation of such as Myanmar Orthography(MLC, 2003), CD sentences into words. Both under and over segmen- versions of English-English-Myanmar (Student’s tation are possible. When stop-words are too short, Dictionary)(stu, 2000) and English-Myanmar Dic- over segmentation can occur. Under segmentation tionary (EMd, ) and Myanmar-French Dictionary can occur when no stop-words occur between words. (damma sami, 2004). Currently our word list in- Examples of segmentation can be seen in Table 2. cludes 800,000 words. We have observed that over segmentation is more frequent than under segmentation.

4.2 Stop Word Removal

¼¼× Ù×ÄÀ²ÓÞÓÐØÙÓÒ

¼¼× Ù×ÄÀ² ØÙ Stop words include prepositions/post-positions, con- [waing: win: chi: kyu: khan ] [a nay khak] junctions, particles, inflections etc. which ap- received compliments abashed

Vpp Vpast ÕÝÓÒ

pear as suffixes added to other words. They Ù×ÑØÝÕ²ÑÙ ÓÒØÅÙÖÞÙÖÁÙÔÙ

ØÅÙÖÞÙÖÁ ÔÙÕÝ form closed classes and hence can be listed. Pre- Ù×ÑØÝÕ²ÑÙ  [kyaung: aop hsa ya kyi:] [a kyan: phak mhu] [sak sop] liminary studies therefore suggested that Myan- The headmaster violence abhors mar words can be recognized by eliminating these Nsubj Nobj Vpresent stop words. Hopple (Hopple, 2003) also notices Table 2: Removing stop-words for segmentation that particles ending phrases can be removed to recognize words in a sentence. We have col- lected stop words by analyzing official newspa- 4.3 Syllable N-grams pers, Myanmar grammar text books and CD ver- Myanmar language uses a syllabic sions of English-English-Myanmar (Student’s Dic- unlike English and many other western languages tionary)(stu, 2000), English-Myanmar Dictionary which use an alphabetic writing system. Interest- (EMd, ) and The Khit Thit English-Myanmar dictio- ingly, almost every syllable has a meaning in Myan- nary (Saya, 2000). We have also looked at stop word mar language. This can also be seen from the work lists in English (www.syger.com, ) and mapped them of Hopple (Hopple, 2003). to equivalent stop words in Myanmar. See Table 1. Myanmar Natural Language Processing Group As of now, our stop words list contains 1216 en- has listed 1894 syllables that can appear in Myan- tries. Stop words can be prefixes of other stop words mar texts (Htut, 2001). We have observed that there leading to ambiguities. However, usually the longest are more syllables in use, especially in foreign words matching stop word is the right choice. including Pali and Sanskrit words which are widely Identifying and removing stop words does not used in Myanmar. We have collected other pos-

43 The 6th Workshop on Asian Languae Resources, 2008

sible syllables from the Myanmar-English dictio- of Myanmar texts. We have used “-type=word” op- nary(MLC, 2002). Texts collected from the Internet tion treating syllables as words. We had to mod- show lack of standard typing sequences. There are ify this program a bit since Myanmar uses zero (as several possible typing sequences and correspond- “(0) ” letter) and the other special characters ( ing internal representations for a given syllable. We “,”, “<”, “>”, “.”, “&”, “[”,“]” etc.) which were be- include all of these possible variants in our list. Now ing ignored in the original Text::Ngrams software. we have over 4550 syllables. We collect all possible words which is composed of n-grams of syllables up to 5-grams. Table 1 Bigram Trigram 4-gram shows some words which are collected through n- bisyllables 3-syllables 4-syllables

gram analysis. Almost all monograms are meaning-

ÞØÖ Â ËÔ ËÔÙÑÙÑ lantern with a big sound whole-heartedly ful words. Many bi-grams are also valid words and [hpan ein] [boun: ga ne:] [hni  hni  ka ga]

ø as we move towards longer n-grams, we generally

ÞÓÑ ÔÍÂ ÜÐÜÐÙÂÙÂ glassware effortlessly outstanding get less and less number of valid words. See Table [hpan tha:] [swei. ga ne:] [htu: htu: ke: ke:] 3. Further, frequency of occurrence of these n-grams

ø

ÜÑ Ö×ÑÖ×ÑÔÑÔÑ ÙÔÑ is a useful clue. See Table 4. bank of lake fuming with rage many,much [kan saun:] [htaun: ga ne:] [mja: mja: sa: za:] By analyzing the morphological structure of ø words we will be able to analyze inflected and de- Table 3: Examples of Collected N-grams rived word forms. A set of morphemes and mor- phological forms have been collected from (MLC, 2005) and (Judson, 1842) . See Table 5. For exam- No. of No of words Example syllables ple, the four-syllable word in Table 3 is an adverb

1 4550 ÙÑ Good (Adj) “ ÜÐÜÐÙÂÙ” [htu: htu: ke: ke:] outstanding derived [kaun:] from the verb “ÜÐÙ”. See Table 3.

2 59964 ÚÝÝÑ Butterfly, Soul (N) Statistical construction of machine readable dic-  [lei pja] tionaries has many advantages. New words which 3 170762 ÝÛÝ Ù Window (N)  appear from time to time such as Internet, names [b a din: bau ] ø 4 274775 ÝÒÛ!ÜÛÙ of medicines, can also be detected. Compounds Domestic Product (N)  [pji dwin: htou koun] words also can be seen. Common names such

5 199682 ÚÉÝÔÔÜÖØ   as names of persons, cities, committees etc. can [hlja si ht a min: ou:] Rice Cooker(N)ø be also mined. Once sufficient data is available,

6 99762 ÓÐÑÝÃÕ²ÑÖ Nurse(female) (N) statistical analysis can be carried and techniques [thu na bju. hs a ja ma.] ø such as mutual information and maximum entropy

7 41499 ²%Ë Ó!ÑÅÙÝÛÑÓÒ become friend (V) can be used to hypothesize possible words. [jin: hni: thwa: kya. pei to. thi]

8 14149 ÝÒÜÑÔÖÖÑ%ÀÛÑ Union of Myanmar (N) [pji daun zu. mj a ma nain gan to ] ø 4.4 Words from Dictionaries

9 4986 ÓÀ¸À(ÑÛزØÖÔÖ×Ñ Natural Resources (N)  [than jan za ta. a jin: a mji ] Collecting words using the above three mentioned ø ø 10 1876 ÖÙÖÚÙÖÙÖÞÔÓÒ methods has still not covered all the valid words be agitated or shaken(V)   [ chei ma kain . le ma kain mi. hpji thi] in our corpus. We have got only 150,000 words. ø ø Words collected from n-grams needs exhaustive hu- Table 4: Syllable Structure of Words man effort to pick the valid words. We have therefore collected words from two on-line dictio- We have developed scripts in Perl to syllabify naries - the English-English-Myanmar (Student’s words using our list of syllables as a base and then Dictionary) (stu, 2000), English-Myanmar Dictio- generate n-gram statistics using Text::Ngrams which nary (EMd, ) and from two e-books - French- is developed by Vlado Keselj (Keselj, 2006). This Myanmar(damma sami, 2004), and Myanmar Or- program is quite fast and it took only a few min- thography (MLC, 2003). Fortunately, these texts utes on a desktop PC in order to process 3.5M bytes can be transformed into winninnwa font. We have

44 The 6th Workshop on Asian Languae Resources, 2008

A B C D E

basic unit (Verb)= (Noun)= (Negative)= (Noun)=

Ø Ö "Ð ÖÁ

1 syllable A+ ÓÒ +A +A+ A+

ÙÑ ÙÑÓÒ ØÙÑ ÖÙÑ"Ð ÙÑÖÁ [kaun:] [kaun: thi] [a kaun:] [ma. kaun: bu:] [kaun: mhu.]

good (Adj) is good øgood Not good good deeds

Õ ÕÓÒ ØÕ ÖÕ"Ð ÕÖÁ [hso:] [hso: thi] [a hso:] [ma. hso: bu:] [hso: mhu.]

bad (Adj) is bad øbad Not bad Bad Deeds

²Ñ ²ÑÓÒ Ø²Ñ Ö²Ñ"Ð ²ÑÖÁ [jaun:] [jaun: thi] [a jaun:] [ma. jaun: bu:] [jaun: mhu.]

sell(Verb) sell ø sale not sell sale

² ²ÓÒ Ø² Ö²"Ð ²ÖÁ [jei:] [jei: thi] [a jei:] [ma. jei: bu:] [jei: mhu.]

write(Verb) write writingø do not write writing

%ÝÑ %ÝÑÓÒ Ø%ÝÑ Ö%ÝÑ"Ð %ÝÑÖÁ [pjo:] [pjo: thi] [a pjo:] [ma. pjo: bu:] [pjo: mhu.] talk,speak(Verb) talk,speak talk,speechø not talk,speak talking

Table 5: Example patterns of Myanmar Morphological Analysis written Perl scripts to convert to the standard font. 5 Syllabification and Word Segmentation Myanmar Spelling Bible lists only lemma (root Since dictionaries and other lexical resources are not words). We have suffixed some frequently used mor- yet widely available in electronic form for Myanmar phological forms to these root words. language, we have collected 4550 possible syllables There are lots of valid words which are not de- including those used in Pali and foreign words such

scribed in published dictionaries. The entries of ÛÙ ÛÚ as ÑÜ ), considering different typing se- words in the Myanmar-English dictionary which is quences and corresponding internal representations, produced by the Department of the Myanmar Lan- and from the 800,000 strong Myanmar word-list we guage Commission are mainly words of the com- have built. With the help of these stored syllables mon Myanmar vocabulary. Most of the compound and word lists, we have carried out syllabification words have been omitted in the dictionary (MLC, and word segmentation as described below. Many 2002). This can be seen in the preface and guide researchers have used longest string matching (An- to the dictionary of the Myanmar-English dictio- gell et al., 1983),(Ari et al., 2001) and we follow the nary produced by Department of the Myanmar Lan- same approach. guage Commission, Ministry of Education. 4- The first step in building a word hypothesizer is syllables words like “ ÜÐÜÐÕÕ ”[htu: htu: syllabification of the input text by looking up sylla- zan: zan:] (strange), “ ÜÐÜÐÙÂÙÂ ” [htu: htu: ke: ble lists. In the second step, we exploit lists of words ke:](outstanding) and “ ÜÐÜÐ Ñ Ñ ” [htu: htu: gja: (viewed as n-grams at syllable level) for word seg- gja:](different)(see Table 3) are not listed in dictio- mentation from left to right. nary although we usually use those words in every day life. 5.1 Syllabification With all this, we have been able to collect a total As an initial attempt we use longest string matching of about 800,000 words. As we have collected words alone for Myanmar text syllabification. Examples from various sources and techniques, we believe we are shown in Table 7. have fairly good data for further work.

Pseudo code Here we go from left-to-right in a Ù

On screen ÅÙ  greedy manner:

Ù  Ù In ascii MuD: udk sub syllabification{ BuD: ukd Load the set of syllables from syllable-file Load the sentences to be processed from sentence-file Table 6: Syllables with different typing sequences Store all syllables of length j in Nj where j =10..1 for-each sentence do length ← length of the sentence

45 The 6th Workshop on Asian Languae Resources, 2008

pos ← 0 for j =10..1 do while (length > 0) do for-each word in Nj do for j =10..1 do if string-match sentence(pos, pos + j) with word for-each syllable in Nj do word found. Mark word if string-match sentence(pos, pos + j) with syllable pos ← pos + j Syllable found. Mark syllable length ← length − j pos ← pos + j End if length ← length − j End for End if End for End for End while End for Print tokenized string End while End for Print syllabified string End for }

We have evaluated our syllables list on a collec- 6 Evaluation and Observations

tion of 11 short novels entitled “Orchestra” ÓÀ¹ We have segmented 5000 sentences including a to-

ÔÀÛ¼[than zoun ti: wain:], written by “Nikoye” tal of (35049 words) with our programs and manu- (Ye, 1997) which includes 2,728 sentences spread ally checked for evaluating the performance. These over 259 pages including a total of 70,384 sylla- sentences are from part of the English-Myanmar bles. These texts were syllabified using the longest parallel corpus being developed by us (Htay et al., matching algorithm over our syllable list and we 2006). The program recognized 34943 words of observed that only 0.04% of the actual syllables which 34633 words were correct, thus giving us a were not detected. The Table 6 shows that differ- Recall of 98.81% and a Precision of 99.11%. The ent typing sequences of syllables were also detected. F-Measure is 98.95%. The algorithm suffers in ac-

Here are some examples of failure: Ö [rkdCf;]and curacy in two ways:

ÖÚ[rkdvf;] which are seldom used in text. The typ- ing sequence is also wrong. Failures are generally Out-of-vocabulary Words: Segmentation error traced to can occur when the words are not listed in dictionary. No lexicon contains every possible ¥ differing combinations of writing sequences word of a language. There always exist ¥ loan words borrowed from foreign languages out-of-vocabulary words such as new derived words, new compounds words, morphological ¥ rarely used syllables not listed in our list variations of existing words and technical words (Park, 2002). In order to check the effect 5.2 Word Segmentation of out-of-vocabulary words, we took a new We have carried out tokenization with longest sylla- set of 1000 sentences (7343 words). We have ble word matching using our 800,000 strong stored checked manually and noticed 329 new words, word list. This word list has been built from avail- that is about 4% of the words are not found in able sources such as dictionaries, through the ap- our list, giving us a coverage of about 96%. plication of morphological rules, and by generating syllables n-grams and manually checking. An exam- Limitations of left-to-right processing: ple sentence and its segmentation is given in Table Segmentation errors can also occur due to 8. the limitations of the left-to-right processing. Load the set of words from word-file See the example 1 in Table 9. The algorithm for-each word do suffers most in recognizing the sentences i ← syllabification(word); i =10 1 Store all words of syllable length i in N where i .. which have the word He ÓÐ [thu] followed by a End for

negative verb starting with the particle Ö[ma.]. Load the sentences to be processed from sentence-file The program wrongly segments she as he. Our for-each sentence do length ←syllabification(sentence); text collection obtains from various sources #length of the sentence in terms of syllables and the word “she” is used as ÓÐÖ [thu ma.] in pos ← 0 while (length > 0) do modern novels and Internet text. Therefore, our

46

The 6th Workshop on Asian Languae Resources, 2008 ÙÑÞ ÓÑÙ²ØÛ %ËØÚÝÓÚÝÝÑÂÓÒ

aumfzDaomuf&if;tefwDESihftvyovyajymaecJhonf

ÙÑ Þ ÓÑÙ ² Ø Û %Ë Ø Ú Ý Ó Ú Ý ÝÑ  Â ÓÒ aumf zD aomuf &if; tef wD ESihf t v y o v y ajym ae cJh onf  [ko] [hpi] [thau ] [jin:] [an] [ti] [hnin.] [a] [la] [pa.] [tha.] [la] [pa.] [pjo] [nei] [khe.] [thi] Having his coffee,ø he chit-chat with the lady.

Table 7: Example syllabification

Ù×ÑØÝÕ²Ñ&Ù'ÓÒØÅÙÖÞÙÖÁÙÔÙÕÝÓÒ

Ù×ÑØÝÕ²Ñ&Ù' ÓÒ ØÅÙÖÞÙÖÁ Ù ÔÙÕÝÓÒ [kyaung: aop hsa ya kyi:] [thi] [a kyan: phak mhu] [ko] [sak sop thi] The headmaster violence abhors Nsubj Particle Nobj Particle Vpresent

Table 8: A sentence being segmented into words

word list contains she ÓÐÖ. This problem can be lected monolingual text corpora totalling to about solved by standardization. Myanmar Language 2,141,496 sentences and English-Myanmar parallel Commission (MLC, 1993) has advised that the corpora amounting to about 80,000 sentences and

words “she” and “he” should be written only sentence fragments, aligned at sentence and word ÓÐÖ as ÓÐ and the word representing a feminine levels. We have also built a list of 1216 stop words, pronoun should not be used. For example 2 in 4550 syllables and 800,000 words from a variety

Table 9, the text ØÑÝÓÒ can be segmented of sources including our own corpora. We have

into two ways. 1) ØÑÝÓÒ [a: pei: thi] which used fairly simple and intuitive methods not requir-

means “encourage” and 2) ØÑ [particle for ing deep linguistic insights or sophisticated statisti-

indicating dative case] and ÝÓÒ give [pei: cal inference. With this initial work, we now plan thi]. Because of greedy search from left to to apply a variety of machine learning techniques. right, our algorithm will always segment as We hope this work will help to accelerate work in

ØÑÝÓÒ no matter what the context is. Myanmar language and larger lexical resources will be developed soon. In order to solve these problems, we are plan to use machine learning techniques which 1) can also detect real words dynamically (Park, 2002) while we References are segmenting the words and 2) correct the greedy Richard C. Angell, George W. Freurd, and Peter Willett. cut from left to right using frequencies of the words 1983. Automatic spelling correction using a trigram from the training samples. similarity measure. Information Processing & Man- agement Although our work presented here is for Myan- , 19(4):255Ð261. mar, we believe that the basic ideas can be applied Pirkola Ari, Heikki Keskustalo, Erkka Leppnen, Antti- to any script which is primarily syllabic in nature. Pekka Knsl, and Kalervo Jrvelin. 2001. Targeted s- gram matching: a novel n-gram matching technique 7 Conclusions for cross- and monolingual word form variants. Infor- mation Research, 7(2):235Ð237, january. Since words are not uniformly delimited by spaces U damma sami. 2004. Myanmar-French Dictionary. in Myanmar script, segmenting sentences into words English-myanmar dictionary. Ministry of Education, is an important task for Myanmar NLP. In this pa- Union of Myanmar,CD version. per we have described the need and possible tech- niques for segmentation in Myanmar script. In par- Paulette Hopple. 2003. The structure of nominalization in Burmese,Ph.D thesis. May. ticular, we have used a combination of stored lists, suffix removal, morphological analysis and syllable Hla Hla Htay, G. Bharadwaja Kumar, and Kavi Narayana Murthy. 2006. Building english-myanmar parallel level n-grams to hypothesize valid words with about corpora. In Fourth International Conference on Com- 99% accuracy. Necessary scripts have been writ- puter Applications, pages 231Ð238, Yangon, Myan- ten in Perl. Over the last few years, we have col- mar, Feb.

47 The 6th Workshop on Asian Languae Resources, 2008

Example 1: )Ñ%ÝÖÁÛÓÐÖÝ ¼Â"Ð

)Ñ%ÝÖÁ Û ÓÐÖ Ý ¼Â"Ð [damja. hmu.] [twin] [thu ma.] [pa wun khe. bu;] ørobbery in she did not involve

N Particle Nsubj Vpastneg ÛѼÙÓÐÛÔÝ ØÑÝÓÒ»

Example 2: ÖÖÖÚ×ÓÑ

ÖÖ ÖÚ×ÓÑ ÛѼ Ù ÓÐÛÔÝ  ØÑÝÓÒ [mi. mi.] [ma. lou chin tho:] [ta wun] [gou] [thu daba:] [a: pei: thi] I,myself don’t want duty,responsibility othersø encourage Nsubj Vneg Nobj1 Particle Nobj2 V

Table 9: Analysis of Over-Segmentation

Zaw Htut. 2001. All possible myanmar syllables, Ye Kyaw Thu and Yoshiyori Urano. 2006. Text entry for September. myanmar language sms: Proposal of 3 possible input methods, simulation and analysis. In Fourth Interna- Adoniram Judson. 1842. Grammatical Notices of the tional Conference on Computer Applications, Yangon, Buremse Langauge. Maulmain: American Baptist Myanmar, Feb. Mission Press. U Tun Tint. 2004. Features of myanmar language. May. Vlado Keselj. 2006. Text ::ngrams. www.syger.com. http://www.syger.com/jsc/docs/ http://search.cpan.org/ vlado/ Text-Ngrams-1.8/, stopwords/english.htm. November. Ko Ye. 1997. Orchestra. The two cats, June. MLC. 1993. Myanmar Words Commonly Misspelled and Misused. Department of the Myanmar Language Commission,Ministry of Education, Union of Myan- mar.

MLC. 2002. Myanmar-English Dictionary. Department of the Myanmar Language Commission, Ministry of Education, Union of Myanmar.

MLC. 2003. Myanmar Orthography. Department of the Myanmar Language Commission,Ministry of Educa- tion, Union of Myanmar, June.

MLC. 2005. Myanmar Grammer. Department of the Myanmar Language Commission, Ministry of Educa- tion,Union of Myanmar, June.

Kavi Narayana Murthy. 2006. Natural Language Pro- cessing - an Information Access Perspective. Ess Ess Publications, New Delhi, India.

Youngja Park. 2002. Identification of probable real words : an entropy-based approach. In ACL-02 Work- shop on Unsupervised Lexical Acquisition, pages 1Ð8, Morristown, NJ, USA. Association for Computational Linguistics.

U Soe Saya. 2000. The Khit Thit English-English- Myanmar Dictionary with Pronunciation. Yangon, Myanmar, Apr.

2000. Student’s english-english/myanmar dictionary. Ministry of Commerce and Myanmar Inforithm Ltd, Union of Myanmar, CD version, Version 1, April.

48 The 6th Workshop on Asian Languae Resources, 2008

The Link Structure of Language Communities and its Implication for Language-specific Crawling

Rizza Camus Caminero Yoshiki Mikami Language Observatory Language Observatory Nagaoka University of Technology Nagaoka University of Technology Nagaoka, Niigata, Japan Nagaoka, Niigata, Japan [email protected] [email protected]

large amount of information that can be shared and Abstract disseminated to people with Internet access. Au- thors and creators of a web page come from differ- Since its inception, the World Wide Web ent backgrounds, different cultures, and different (WWW) has exponentially grown to shelter languages. Thus, a web page is a resource of multi- billions of monolingual and multilingual lingual content, and a fertile source for socio- web pages that can be navigated through linguistic analysis. hyperlinks. Its structural properties provide useful information in presenting the socio- 1.1 Web Crawling linguistic properties of the web. In this study, about 26 million web pages under While search-engines are important means of ac- the South East Asian country code top- cessing the web for most users, automated systems level-domains (ccTLDs) are analyzed, sev- of retrieving information from the web have been eral language communities are identified, developed. These systems are called web crawlers, and the graph structure of these communi- where a software agent is given a list of pages to ties are analyzed. The distance between visit. As the crawler visits these pages, it follows language communities are calculated by a their outgoing links and adds these to the list of distance metrics based on the number of pages to visit. Each page is visited recursively ac- outgoing links between web pages. Inter- cording to some sets of policies (e.g. the type of mediary languages are identified by graph pages to retrieve, direction, the maximum depth analysis. By creating a language subgraph, from the URL (uniform resource locator), etc.). the size and diameter of its strongly- The result of this crawl is a vast amount of data, connected components are derived, as these most of which may be irrelevant to certain indi- values are useful parameters for language- viduals. Thus, a focused-crawling approach was specific crawling. Performing a link struc- implemented by some systems to limit the search ture analysis of the web pages can be a use- to only a subset of the web. ful tool for socio-linguistic and technical Focused crawlers rely on classifiers to work ef- research purposes. fectively. Language-specific crawlers, for example, need a very good language identification module 1 Introduction that properly identifies the language of the web pages. General crawlers can be extended to include The World Wide Web contains interlinked docu- focused-crawling capabilities by incorporating the ments, called web pages, navigable by hyperlinks. classifiers. Another important requirement to effi- Since its creation, it has grown to contain billions ciently crawl the desired domain is the list of initial of web pages and several billion hyperlinks. These web pages, called seed URLs. Each of these URLs web pages are created by millions of people from will be enqueued into a list. The agent visits each all parts of the world. Each web page contains a URL in the list. Since the crawler recursively visits

49 The 6th Workshop on Asian Languae Resources, 2008

the outgoing links of each URL, it is possible that A language community in the web is the group the seed URL is an outgoing link of another seed of web pages written in a language. Major lan- URL, or an outgoing link of an outgoing link of guage communities discovered in each country another seed URL. Listing these URLs as seed indicates the dominant language of the country’s URLs will just waste the crawler’s time in visiting web space. How one language community is re- them, when they have already been visited. If sev- lated to another language community can be shown eral URLs can be reached from just one URL, the by analyzing the hyperlinks between them. Thus, a seed URL list size will decrease and the crawler language graph can be created with the language will be more efficient. However, the maximum communities as nodes, and the links between the distance between these web pages must also be language communities as edges. considered, since the crawler can have a policy to stop the crawling after reaching a certain depth. 2 Previous Studies 1.2 The Web as Graph One of the earliest web survey in Asia (Ciolek, 1998) presented statistical data of the Asian web A graph consists of a set of nodes, denoted by V space by using the Altavista WWW search engine and a set of edges, denoted by E. Each edge is a in gathering its data. In 2001, he wrote a paper pre- pair of nodes (u,v) representing a connection be- senting the trends in the volume of hyperlinks con- tween u and v. A path between two nodes is a se- necting websites in 10 Eash Asian countries. quence of edges that is passed through from one Several studies have also been done regarding node to reach the other node. the representation of the web as a graph. Kumar et In a directed graph, each edge is an ordered pair al. (2000) showed that a graph can be induced by of nodes. A path from u to v does not imply a path the hyperlinks between pages. Measures on the from v to u. The distance between any two nodes is connected component sizes and diameter were pre- the number of edges in a shortest path connecting sented to show the high-level structure of the web. them. The diameter of the graph is the maximum Broder et al. (2000) did experiments on the web on distance between any two nodes. a larger scale and showed the web’s macroscopic In an undirected graph, each edge is an unor- structure consisting of the SCC, IN, OUT, and dered pair of nodes. There is an edge between u TENDRILS. Balakrishnan and Deo (2006) ob- and v if there is a link between u and v, regardless served that the number of edges grow superlinearly of which node is the source of the link. with the number of nodes, showing the degree dis- A strongly-connected component (SCC) of a di- tributions and diameter. Petricek et al. (2006) used rected graph is a set of nodes such that for any pair web graph structural metrics to measure properties of nodes u and v in the set, there is a path from u to such as the distance between two random pages v. The strongly-connected components of a graph and interconnectedness of e-government websites. consist of disjoint sets of nodes. A subgraph is a Bharat et al. (2001) studied the macro-structure of graph whose nodes and edges are subsets of an- the web, showing the linkage between web sites by other graph. creating the “hostgraph”, with the nodes represent- With the interlinked nature of the web, it can be ing the hosts and the edges as the count of hyper- represented as a graph, with the web pages as the links between pages on the corresponding hosts. nodes and the edges as the hyperlinks between the Chakrabarti et al. (1999) proposed a new ap- pages. proach to topic-specific web resource discovery by 1.3 Languages creating a focused crawler that selectively retrieves web pages relevant to a pre-defined set of topics. Languages are expressions of individuals and Stamatakis et al. (2003) created CROSSMARC, a groups of individuals, essential to all form of focused web crawler that collects domain-specific communication. It is a fundamental medium of web pages. Deligenti et al. (2000) presented a fo- expressing one’s self whether in spoken or written cused crawling algorithm that builds a model for form. Ehtnologue (2005) lists 6,912 known living the context within which relevant pages to a topic languages in the world. Only a small portion of occur on the web. Pingali et al. (2006) created an these languages can be found in the web today. Indian search engine with a language identification

50 The 6th Workshop on Asian Languae Resources, 2008

module that returns a language only if the number the Language Observatory Project (LOP) 1 . The of words in a web page are above a given threshold crawl was started on July 5, 2006, running for 14 value. The web pages were transliterated first into days. 107,168,733 web pages were collected from UTF-8 encoding. Tamura et al. (2007) presented a 43 ccTLDs in Asia. Each page contains several simulation study of language specific crawling and information such as the character set and outgoing proposed a method for selectively collecting web links. For this study, the URL of a web page and pages written in a specific language by doing a the URL of its outgoing links were used. Although linguistic graph analysis of real web data, and then there are many web pages of Asia that can be transforming them into variation of link selection found in generic domains, they were not included strategies. in this survey to limit the volume of the crawl data. Despite several studies on web graph and lan- guage-specific crawling, no study has been done 4.2 Language Identification showing the “language graph”. Herring et al. A language identification module (LIM) developed (2007) showed a study on language networks of a for LOP based on the n-gram method (Suzuki et al., selected web community, LiveJournal, but not on 2002) was used to identify on what language a the web as a whole. page is written in. This method first creates byte sequences for each training text of a language. It 3 Scope and Objectives of the Study then checks the byte sequences of the web pages This research was conducted on the 10 ccTLDs of that match the byte sequences of the training texts. the South East Asian countries. This paper, how- The language having the highest matching rate is ever, will only show the results for the Indonesian considered as the language of the web page. The domain (.id). language identification module used the parallel This research aims to show the socio-linguistic corpus of the Universal Declaration of Human properties of the language communities in each Rights (UDHR), translated into several languages. country at the macroscopic level. The web page After crawling, LIM was executed to identify the distribution for each language community in a languages of each downloaded web page. The given ccTLD and its most frequently linked to lan- identification result was stored in a LIM result file guages are shown. The distance is also computed that contains the URL, the language, and matching and the language graph is illustrated. rate, among others. In this study, the issues This research also aims to show the graph prop- regarding the accuracy of LIM will not be erties of some Filipino language communities and discussed. its implication for crawling. Graph properties like 4.3 Web Page Analysis the SCC size and the diameter will be presented to show these characteristics in a subset of the web. For this study, the web pages of the 10 South East Finally, this research demonstrates the useful- Asian ccTLDs were selected for analysis. There ness of graph analysis approaches. were 26,196,823 web pages downloaded under these ccTLDs. The web pages for each country 4 Methodology were grouped by languages. The list of languages was narrowed down to 20 based on the number of This study was conducted by performing a series pages, arranged from highest to lowest. of steps from the collection of data to the presenta- The link structure can be analyzed by traversing tion of the results through images. the outgoing links of each web page. For each web page, its outgoing links are retrieved. For each out- 4.1 UbiCrawler going link, the LIM result file is checked for its UbiCrawler (Boldi et al., 2004) was used to language. The number of outgoing links in each download the web pages under the Asian ccTLDs. language is counted. If the URL of the outgoing These pages were downloaded primarily for the link is not on the file, it wasn’t downloaded. There- purpose of assessing the usage level of each fore, the language of the ougoing link is unidenti- language in cyberspace, one of the objectives of fied. This is usually the case of outgoing links un-

1 http://www.language-observatory.org

51 The 6th Workshop on Asian Languae Resources, 2008

der the generic TLDs (e.g., .com, .org, .gov, etc.) where αis an adjusting parameter introduced to and non-Asian ccTLDs. avoid division-by-zero, which may happen when Rij =0, i.e. no links between two languages. We set 4.4 Language Graph α=0.0001. Thus, the maximum distance between There is a link between two languages if there is at languages becomes 99. β is another adjusting pa- least one outgoing link from a web page in one rameter to make Dij as a distance metrics, and we language to the other language. The language set β=1. Assumption behind this definition is based graph is created through contraction procedure, on commonly-observed rules in our world. It is where all edges linking the same language page are widely observed that interaction between two ob- contracted. jects is proportional to the inverse square of dis- tance between two objects. The number of web 4.5 Language Adjacency Matrix links between two language communities is con- Based on the number of web pages in a language sidered as a kind of interaction. Languages with no and the number of outgoing links from one lan- links between them have a distance of 99, and the guage to another, the language adjacency table N distance of a language to itself is 0. for each country is created. The row and column Based on this distance metrics, the macroscopic headers are the same – the top 20 languages based language graph is created. A distance limit of 15 is on the number of web pages. The value Nij is the used to clearly show which languages are closely- number of outgoing links from language i to lan- related by their link structure. guage j. 4.7 Intermediary Languages The language adjacency matrix P contains the ratio of the number of outgoing links and the total Considering the direction of the outgoing links, the number of outgoing links as can be found in the possibility that language i will link to language j language adjacency table. Each cell value, Pij is the may be lesser than the possibility that language i probability that a web page in a language i has an will link to language j by passing through an in- outgoing link to language j. termediary language k, such that PikPkj > Pij. The intermediary language is identified, as this would ij = ij / ∑ NNP ik k mean that there are better ways to reach another A link from language i to language j is not nec- language from one language. essarily accompanied by a link from language j to i. 4.8 Graph Analysis using JGraphT Even if there is a link, the number of outgoing links is not equal. To show the relationship be- JGraphT is a free Java graph library that provides tween two languages based on the link structure, graph objects and algorithms. The library provides the language distance is computed. classes that calculates and returns the strongly- connected components subgraphs. To compute the 4.6 Distance between Languages distance between nodes, several graph searching The distance between two languages measures algorithms are available, one of which is the Dijik- their level of connectedness. It is the relationship stra algorithm that computes for the shortest path. between the number of outgoing links from lan- A utility to export the graph into a format readable guage i to language j and vice versa. The distance by most graph visualization tools is also available. is computed as the ratio of the number of outgoing A graph file, written in the DOT language (a plain links between two languages and the total number text graph description language) was created, con- of outgoing links of the two languages. taining the nodes and edges of language pages. The distance between language i and language j, From this, the strongly-connected components and its properties (i.e. size, diameter) were determined. Dij is, ij (R1/D ij ) −+= βα where Rij is the lan- guage link ratio is defined as, 4.9 GraphVis: Visualization Tool   GraphViz is open-source graph visualization ()ijij += ji / ik + ∑∑ NNNNR jk  for (i≠j) software that takes descriptions of graphs in a  k k  simple text language and makes diagrams in Rij = 1 for (i=j)

52 The 6th Workshop on Asian Languae Resources, 2008

several formats, including images. The neato Javanese is the most popular language in the layout, which makes “spring model” layouts, was web space of Indonesia, and constitutes 28% of the used to visualize the distance between languages. total number of web pages. It is followed by Eng- However, the calculated distance cannot be drawn lish and Indonesian. Although Indonesian is an exactly, and this visualization is only two- official language of the country, it is ranked third. dimensional. So, the images are distorted and do English, a major business language of Indonesia, not illustrate the exact distance, only an has the second largest number of web pages in the approximation. domain. But, Indonesian occupies the second rank in the number of outgoing links. 5 Results Languages Linked to (# of This section shows some results on the Indonesian No. Language outgoing links) Javanese Javanese (22,641,844) domain. 1 (jav) Indonesian (3,761,205) English English (11,036,726) 5.1 Link Structure 2 (eng) Javanese (1,443,054) Indonesia is a country with one of the biggest lan- Indonesian Indonesian (12,530,636) 3 guage diversity in the world. According to Eth- (ind) Javanese (4,168,734) 2 Thai Thai (4,367,895) nologue , 742 languages are spoken in the country. 4 But, the LIM results show that only five of these (tha) English (1,775,132) Malay Malay (1,944,430) 5 languages are listed in the top 10 languages in the (mly) Indonesian (1,516,778) country, i.e., Javanese, Indonesian, Malay, Sun- Sundanese Javanese (2,004,316) 6 danese, and Madurese. (sun) Sundanese (1,641,623) Luxemburg Luxemburg (1,128,342) 7 # of Outgo- (ltz) Javanese (393,524) No. Language # of Pages Occitan Occitan (119,734) ing Links 8 Javanese 797,300 33,411,032 (lnc) English (83,380) 1 Madurese Madurese (387,585) (jav) (28.01%) (27.20%) 9 English 743,457 16,645,014 (mad) Javanese (93,837) 2 Tatar Javanese (1,802,601) (eng) (26.11%) (13.55%) 10 Indonesian 516,528 20,783,793 (tat) Thai (397,988) 3 (ind) (18.14%) (16.92%) Thai 218,453 8,952,101 Table 2. Language Link for Indonesian Domain 4 (tha) (7.67%) (7.29%) Malay 197,535 4,990,402 5 (mly) (3.47%) (4.06%) The table above only shows the top 10 lan- Sundanese 98,835 5,349,194 6 guages. Among these, 8 languages are most fre- (sun) (3.47%) (4.35%) quently linked to the same language. The two other Luxemburg3 43,376 2,307,602 7 (ltz) (1.52%) (1.88%) languages are Sundanese and Tatar, both mostly Occitan4 27,663 351,318 linked to Javanese. 8 (lnc) (0.97%) (0.29%) The language graph below shows the languages Madurese 22,121 777,903 9 as the nodes and the edges representing the dis- (mad) (0.78%) (0.63%) tance between languages. In the figure above, the Tatar 20,709 3,334,651 10 (tat) (0.73%) (2.71%) six languages of Indonesia are found to be closely- 160,917 25,930,885 connected to each other. Others (5.65%) (21.11%) Total 2,846,894 122,833,898

Table 1. LIM result for Indonesian Domain

2 Gordon, Raymond G., Jr. (ed.), 2005. Ethnologue: Lan- guages of the World, Fifteenth edition. Dallas, Tex.: SIL In- ternational. 3 Luxembourgish 4 Occitan Languidocien

53 The 6th Workshop on Asian Languae Resources, 2008

intermediary language. However, it is likely that several pages were misidentified by LIM. The sec- ond, Javanese is not surprising since it is a major language of Indonesia. The table below shows se- lected language pairs, where one of its intermedi- ary languages is Javanese.

5 Language i Language j Pij Pik*Pkj Tatar Indonesian 0.01753 0.06085 Tatar Luxemburg 0.00210 0.01193 Samoan Balinese 0.00000 0.00052

Table 4. Selected Languages of Indonesia in which it’s Intermediary Language is Javanese

5.3 Graph Properties This section discusses the size distribution of the SCCs and the diameter of the Filipino language community in Indonesia.

SCC size The SCCs of a graph are those sets of nodes such that for every node, there is a path to all the other Figure 1. Language Graph of Indonesia nodes. The size of each SCC refers to the number of nodes it contains. Distribution of the sizes of 5.2 Intermediary Languages SCCs gives a good understanding of the graph structure of the web, and has important implica- For languages with only a few outgoing links be- tions for crawling. If most components have large tween them, there exists a language that acts as its sizes, only a few nodes are needed as seed URLs (a intermediary, that makes access between the two list of starting pages for a crawler) to be able to languages more convenient if passed through. reach all the other nodes. If all nodes are members of a single SCC, one URL is enough to crawl all Intermediary No. Language Frequency Percentage pages. 1 English 89 17.73% SCC 6 2 Javanese 43 8.57% 1 2 4 16 19 20 26 45 T 3 Tatar 42 8.37% size 4 Thai 40 7.97% # of 9 1 3 1 1 1 1 1 18 5 Madurese 38 7.57% SCCs 6 Indonesian 34 6.77% # of 7 Latin 34 6.77% 9 2 12 16 19 20 26 45 149 8 Sundanese 29 5.78% nodes 9 Malay 28 5.58% 10 Luxemburg 24 4.78% Table 5. SCC size distribution of the Filipino Others (top11-20) 122 24.30% language community in Indonesia Total 502 100.00%

Table 3. Intermediary Languages of Indonesia SCC diameter The maximum distance between any two nodes is The above table shows the number of language the diameter. For each node size, the diameter is pairs having the given language as its intermediary language. English has the highest frequency as an 5 k = Javanese 6 Total

54 The 6th Workshop on Asian Languae Resources, 2008

calculated and plotted in the chart. Their corre- perlinks, information can be readily accesses by sponding SCC graph is also shown. jumping from one page to another via the hyper- links.

4 For each country domain, the web pages written in the same language form a language community. 3 The link structure between language communities 2 shows how connected a language community is diameter 1 with another language community. It can be as- sumed that the close links between two language 0 0 5 10 15 20 25 30 35 40 45 50 communities on the web imply the existence of SCC size multilingual speakers of the two languages. Oth-

erwise linked pages will not be visited. In this con- Figure 2. Diameter distribution of the Filipino text, the language graph analysis demonstrated in language community in Indonesia this study gives an effective tool to understand the linguistic scenes of the country. If the same analy- For the Filipino language subgraph of Indonesia, sis is performed for the secondary level domain the component with the largest node size also has data, further insight into the socio-linguistic status the largest diameter. However, the largest diameter of each language can be drawn. Secondary domain size is only 3, which is a very small number. Most corresponds to different social area of language of the components have a diameter of 4. activities, such as “ac” or “edu” for academic and education arena, “go” or “gov” for government or 6 Implications public arena, and “co” or “com” for commercial business and occupational arena. Although this For the Filipino language community, many SCCs study does not extend its scope to the secondary can be found in the Philippines, since Filipino is level domain analysis, the effectiveness of the ap- one of its national languages. However, for some proach was demonstrated. countries, there are not many SCCs. For example, Another implication drawn from this study is Indonesia only has 18 SCCs, half of which consist that the language graph analysis can identify in- only of one node. However, the largest component termediary languages in the multilingual communi- size is 45, and there are 4 more large components. ties. In the real world, some languages are acting By picking out just one node from each SCC with as a medium of communications among the differ- a large size and using it as a seed URL, many web ent language speakers. In most cases, such lingua pages can already be downloaded. Add to it the franca are international languages such as English, depth parameter of 3, which is the largest diameter, French, Arabic, etc. But it’s difficult to identify these web pages can be downloaded within a short which language is acting as such in detail. But on period of time. the web link structure among languages, the lan- The choice of seed URL and the crawly depth guage graph can give us a clue to identify this. As are useful parameters for crawling. The analysis is shown in this paper, there are a number of lan- done for each language community to get these guages acting as intermediary between two lan- parameters for language-specific crawling pur- guages having only a few hyperlinks between them. poses. These parameters are different for each lan- Although the result of this category is doubtful guage community. This paper shows only shows because of misidentification of language, some the case of the Filipino language community of cases show the expected result. Indonesia as a sample illustration of the diameter The second objective of the study is to give a metric. microscopic level structure of the web communi- ties for much more practical and technical reasons, 7 Conclusion such as how to design more effective crawling strategy, and how to prepare starting URLs with The vastness and multilingual content of the web minimal efforts. The key issue in this context is to makes it a rich source of culturally-diversified in- reveal the connectedness of the web. To show the formation. Since web pages are connected by hy- connectedness of language communities, several

55 The 6th Workshop on Asian Languae Resources, 2008

graph theory metrics, the size and numbers of Andrew Tomkins, and Janet Wiener. 2000. Graph strongly-connected components and the diameters structure in the web. In Proceedingsofthe9thWorld are calculated and visual presentations of language Wide Web Conference, pages 309-320, Amsterdam, communities are also given. This information can Netherlands. aid in defining parameters used for crawling, par- Chakrabarti, Soumen, Martin van den Berg, and By- ticularly language-specific crawling. ronDom. 1999. Focused Crawling: a new approach As a summary, the link structure analysis of lan- to topic-specific Web resource discovery. In Pro guage graphs can be a useful tool for various spec- ceedings of the 8thInternational World Wide Web trums of socio-linguistic and technical research Conference, pages 1623-1640, Toronto, Canada. purposes. Herring, Susan C., John C. Paolillo, Irene Ramos-Vielba, Inna Kouper, Elijah Wright, Sharon Stoerger, Lois 8 Limitations and Future Work Ann Scheidt, and Benjamin Clark. 2007. Language Networks on LiveJournal. In Proceedingsofthe40th The results of this research are highly dependent Annual Hawaii International Conference on System on the language identification module (LIM). With Sciences, Hawaii. a more improved LIM, more accurate results can Kumar, Ravi, Prabhakar Raghavan, Sridhar Rajagopalan, be presented. Currently, there is an ongoing ex- D. Sivakumar, Andrew S. Tompkins, and Eli Upfal. periment that uses a new LIM. 2000. The Web as a graph. In Proceedings of the This analysis will also be done to the secondary- NineteenthACMSIGMODSIGACTSIGARTSympo level-domains to show the language distribution siumonPrinciplesofDatabaseSystems, pages 1-10, for different social areas. Dallas, Texas, United States. Future work also includes the creation of a lan- Petricek, Vaclav, Tobias Escher, Ingemar J. Cox, and guage-specific crawler that will incorporate the Helen Margetts. 2006. The web structure of e- results derived from the analysis of the SCC size government: developing a methodology for quantita- and diameter of the language subgraphs. tive evaluation. In Proceedings of the 15th Interna tionalWorldWideWebConference, pages 669-678, Acknowledgment Edinburgh, Scotland. The study was made possible by the financial sup- Pingali, Prasad, Jagadeesh Jagarlamudi, and Vasudeva port of the Japan Science and Technology Agency Varma. 2006. WebKhoj: Indian language IR from Multiple Character Encodings. In Proceedingsofthe (JST) under the RISTEX program and the Asian th Language Resource Network Project. We also 15 International World Wide Web Conference, pages 801-809, Edinburgh, Scotland. thank UNESCO for giving official support to the project since its inception. Stamatakis, Konstantinos, Vangelis Karkaletsis, Geor- gios Paliouras, James Horlock, Claire Grover, James References R. Curran, and Shipra Dingare. 2003. Domain- Specific Web Site Identification: The CROSSMARC Balakrishnan, Hemant and Narsingh Deo. 2006. Evolu- Focused Web Crawler. In Proceedingsofthe2ndIn tion in Web graphs. Proceedings of the 37thSouth ternational Workshop on Web Document Analysis eastern International Conference on Combinatorics, (WDA2003), pages 75–78. Edinburgh, UK. GraphTheory,andComputing. Boca Raton, FL. Suzuki, Izumi, Yoshiki Mikami, Ario Ohsato, and - Bharat Krishna, Bay-Wei Chang, Monika Henzinger, shihide Chubachi. 2002. A language and character set and Matthias Ruhl. 2001. Who links to whom: min- determination method based on N-gram statistics. ing linkage between Web sites. In Proceedingsofthe ACM Transactions on Asian Language Information 2001IEEEInternationalConferenceonDataMining, Processing, 1(3): 269-278. pages 51-58, San Jose, California. Tamura, Takayuki, Kulwadee Somboonviwat, and Ma- Boldi, Paolo, Bruno Codenotti, Massimo Santini, and saru Kitsuregawa. 2007. A method for language- Sebastiano Vigna. 2004. UbiCrawler: A scalable specific Web crawling and its evaluation. Systems fully distributed web crawler. Software: Practice & andComputersinJapan, 38(2):10-20. Experience,34(8):711726. Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prab- hakar Raghavan, Sridhar Rajagopalan, Raymie Stata,

56 The 6th Workshop on Asian Languae Resources, 2008

A Multilingual Multimedia Indian Sign Language Dictionary Tool

Tirthankar Sambit Sandeep Synny Diwakar Anupam Basu Dasgupta Shukla Kumar NIT, Suratkal IIT, Kharagpur IIT, Kharagpur NIT, Rourkela NIT, Allahabad. sunny.diwaka Anupam- tirtha@iitkgp sks.at.nit mnnit.sand rnitk@gmail. bas@gmail. .ernet.in [email protected] eep@gmail. com com m com

Bangladesh, and border regions of Pakistan Abstract (Zeshan et al., 2004). Different dialects of ISL with broad lexical variation are found in different This paper presents a cross platform multi- parts of the Indian subcontinent. However, the lingual multimedia Indian Sign Language grammatical structure is same for all dialects (ISL) dictionary building tool. ISL is a lin- (Zeshan, 2003). guistically under-investigated language The All India Federation of the Deaf estimates with no source of well documented elec- around 4 million deaf people and more than 10 tronic data. Research on ISL linguistics million hard of hearing people in India (Zeshan et also gets hindered due to a lack of ISL al, 2004). Studies revealed that, one out of every knowledge and the unavailability of any five deaf people in the world are from India. More educational tools. Our system can be used than 1 million deaf adults and around 0.5 million to associate signs corresponding to a given deaf children uses Indian Sign Language as a text. The current system also facilitates the mode of communication (Zeshan et al, 2004). phonological annotation of Indian signs in However, an UNESCO report (1980) found that the form of HamNoSys structure. The gen- only 5% of the deaf get any education in India. erated HamNoSys string can be given as The reason behind such a low literacy rate can be input to an avatar module to produce an due to the following reasons: a) Till the early 20th animated sign representation. century, deafness in India, is considered as a pun- ishment for sins and signing is strictly discouraged

1 Introduction (Zeshan et. al, 2004). b) Until the late 1970’s, it A sign language is a visual-gesture language that has been believed that, there were no such lan- uses hand, arm, body, and face to convey thoughts guage called ISL. c) Lack of research in ISL lin- and meanings. It is a language that is commonly guistics. d) Unavailability of well documented and developed in deaf communities, which includes annotated ISL lexicon. e) Unavailability of any deaf people, their friends and families as well as ISL learning tool. f) Difficulties in getting sign people who are hard of hearing. Despite common language interpreters. misconceptions, sign languages are complete natu- Linguistic studies on ISL were started ral languages, with their own syntax and grammar. around 1978 and it has been found that ISL is a However, sign languages are not universal. As is complete natural language, instigated in India, the case in spoken language, every country has got having its own morphology, phonology, syntax, its own sign language with high degree of gram- and grammar (Vasishta et. al, 1978; Zeshan et.al, matical variations. 2004). The research on ISL linguistics and phono- The sign language used in India is com- logical studies get hindered due to lack of linguis- monly known as Indian Sign Language (hence- tically annotated and well documented ISL data. A forth called ISL). However, it has been argued that dictionary of around 1000 signs in four different possibly the same SL is used in Nepal, Sri Lanka, regional varieties was released (Vasishta et.al,

57 The 6th Workshop on Asian Languae Resources, 2008

1978). However, these signs are based on graphi- orientation, locations, movements, and non- cal icons which are not only difficult to understand manual expressions. but also lack phonological features like move- • The phonological features are expressed in terms ments and non-manual expressions. of HamNoSys (Prillwitz et. al, 1989). As it has been specified above, ISL is not only • Facilitate search options like word-sign and used by the deaf people but also by the hearing search by HamNoSys. parents of the deaf children, the hearing children • The generated lexicon is exported in XML file of deaf adults and hearing deaf educators (Zeshan format and the sign is stored in the form of digi- et al, 2004). Therefore the need to build a system tal videos. that can associate signs to the words of spoken • The video segments are captured using webcams language, and which can further be used to learn connected with the system. It is possible to at- ISL, is significant. Further associating signs of tach multiple webcams to the system to capture 1 2 different SL (like ASL , BSL and ISL) to a word video segments from multiple angles. This fea- will help the user to learn foreign SLs simultane- ture enables a user to better understand some of ously. the complex sign language attributes. Several works have been done on building multimedia-based foreign SL dictionaries as dis- The organization of the paper is as follows: Sec- cussed in (Buttussi et. al., 2007). However no such tion 2 gives a brief introduction to ISL phonology. system is currently available for ISL. moreover, Section 3 presents related works on ISL Diction- most of the current systems suffer from the follow- ary. Section 4 presents the overall system architec- ing limitations: ture of the SL-dictionary tool. Section 5 and 6 pre- • Most of the systems are native language specific sents a brief discussion related HamNoSys repre- and hence, cannot be used for ISL. sentation, and the HamNoSys editor. Section 7 • Most of the systems provide a word-sign search presents conclusion and future work. but very few systems provide a sign-word or sign-sign search. 2 ISL Phonology • Very few systems are cross platform. • Systems lack sophisticated phonological infor- Indian Sign Language (ISL) is a visual-spatial lan- mation like hand-shape, orientations, move- guage which provides linguistic information using ments, and non-manual signs. hands, arms, face, and head/body postures. The In order to overcome the above mentioned crisis, signer often uses the 3D space around his body to and based on the limitations of the current sys- describe an event (Zeshan, 2003). Unlike spoken tems, our objective is to: languages where the communication medium is • Build a cross platform multilingual multimedia dependent on sound, in sign language, the com- SL-Dictionary tool which can be used to create a munication medium depends upon the visual large SL lexicon. channel. In spoken language, a word is composed of phonemes. Two words can be distinguished by • This tool can be used to associate signs to the at least one phoneme. In SL, a sign is composed of words, phrases, or sentences of a spoken lan- cheremes3 and similarly two signs can differ by at guage text. least one chereme (Stokoe, 1978). A sign is a se- • The sign associated with each word is composed quential or parallel construction of its manual and of its related part-of-speech and semantic senses. non-manual cheremes. A manual chereme can be • The input text (word, phrase, or a sentence) may defined by several parameters like: be in any language (like English or Hindi) and • Hand shape. the associated sign can be in any standard sign • Hand location language (ASL or ISL). • Orientation. • This tool can also be used to associate complex SL phonological features like hand shape, palm • Movements (straight, circular or curved)

3 The term chereme (originally proposed by William 1 ASL: American Sign Language. Stokoe (Stokoe, 1978)) in Greek means “hand”. It is 2 BSL: British Sign Language. equivalent to the phonemes of spoken languages.

58 The 6th Workshop on Asian Languae Resources, 2008

Non-manual cheremes are defined by: Flag • Facial expressions. Long • Eye gaze and Head/body posture (Zeshan, 2003). However, there exist some signs which may con- tain only manual or non-manual components. For example the sign “Yes” is signed by vertical head Fig.3 : Two handed sign "long"(both the hands are nod and it has no manual component. moving) and “Flag” (only the dominant right hand ISL signs can be generally classified into three is moving) classes: One handed, two handed, and non-manual signs. Fig. 1 shows the overall Indian sign hierar- 3 Related works on ISL dictionary chy. Linguistic studies on ISL are in their infancy as compared to other natural languages like English, Hindi, or Bengali and also to other SLs. Linguistic work on ISL began during late 1970’s. Before that, the existence of ISL was not acknowledged. In 1977 a survey was conducted (see Vasistha et. al., 1998 for documentation) and it was revealed that ISL is a complete natural language instigated at the Indian subcontinent. Vasistha collected signs from four major states of India (Delhi, Mumbai, Kolkata, and Bangalore) and released four diction- Fig. 1: ISL Type Hierarchy aries of ISL regional varieties. The Ramkrishna Mission vidyalaya, Coimbatore has published an- One handed signs: the one handed signs are repre- other ISL dictionary in 2001. However, all these sented by a single dominating hand. One handed dictionaries are based on iconic representations of signs can be either static or movement related. signs. As a result some of the important phono- Each of the static and movement signs is further logical information like, movements and non- classified into manual and non-manual signs. Fig. manual expression gets lost. No other work of its 2 shows examples of one handed static signs with kind has so far been reported (Zeshan, 2004). non-manual and manual components. Several works have been done in building Ear Headache ASL and BSL dictionary tools. Some of the sys- tems are briefly discussed below: • (Wilcox et. al, 1994) developed a multimedia ASL dictionary tool, which prerecorded digital video frames. Fig. 2: One Handed static manual sign (Ear) and • (Geitz et.al, 1996) developed a VRML based non-manual sign (Headache). ASL finger spelled system, which ran on inter- net. Two hand signs: As in the case of one hand signs, • Sign Smith (VCOM3D, 2004) is a 3D illustrated similar classification can be applied to two handed dictionary of ASL. It is also used as educational signs. However, two handed signs with move- software as well as an authoring tool to create ments can be further distinguished as: ASL content. Type0: Signs where both hands are active (see Fig • (Buttussi et. al, 2007) proposes an Italian Sign 3). Language dictionary tool. This tool uses H- Type1: Signs where one hand (dominant) is more animator to generate signing avatar. This tool active compared to the other hand (non-dominant) provides multiple search functionality like word- as shown in Fig 3. sign, sign-word, and sign-sign search. This tool

also facilitates association of one or more SL for a given input word.

59 The 6th Workshop on Asian Languae Resources, 2008

4 SL-Dictionary The primary objective of the SL-dictionary tool is to provide an easy to use GUI to create a multilin- gual multimedia SL dictionary by which a user can associate signs as well as the parameters defining a sign, corresponding to a given text. The overall architecture of the system is shown in Fig. 4. The system has been divided into two modules: a) Ex- pert module and b) User Module. The expert module has got three main units: a) Input Text Processing Unit b) Visual Data Capture Unit (VDCU) c) Sign Storage Unit and d) HamNoSys Editor. Input Text Processing Unit: In this unit a SL ex- pert chooses the input spoken language (like, Eng- lish, or Hindi) and the target sign language (like, ISL, or ASL) and then enters a text. The input to the system may be word, phrase, or sentences. If Fig. 4: System Architecture of ISL-Dictionary the text is a word the system generates all possible meanings, with the help of WordNet4, along with the part of speech (POS)5 of that particular word. In order to get the exact part-of-speech of a word, the SL expert has to enter an example sentence corresponding to that word. This sentence is given as an input to the POS-tagger to get the correct POS of the word. A word may have multiple senses as returned by WordNet. The user can se- lect one or more senses from the list. Visual Data Capture Unit: Sign corresponding to a word sense is signed by the user which is captured Fig.5: capturing video signs from multiple angle by the Visual Data Capture Unit (VDCU). The VDCU is connected through multiple webcams, placed at different angular positions with respect to the signer. As a result different articulation points of a signs are getting stored with in the da- tabase. This will enable the SL learner to under- stand a particular sign easily. Fig.5 shows how a sign from multiple angles is getting captured. Storage Unit: The input text along with its anno- tated information, the digital video sign, and the phonological parameters defining the sign are stored with in a database which is further exported into an XML formatted file (see Fig. 6). The pho- nological parameters are expressed in the form of HamNoSys (discussed in section 5). Fig.6: The ISL-dictionary XML Format

Searching: The search engine of the current sys- tem takes a spoken language text as input parses 4 wordnet.princeton.edu/ 5 the XML formatted dictionary and sequentially We have used the Stanford Part-of-Speech tagger (nlp.stanford.edu/software/tagger.shtml) searches the dictionary. If a match is found, then

60 The 6th Workshop on Asian Languae Resources, 2008

the sign corresponding to the lexical entry is being manual signs are not present, as the sign displayed. “WOMAN” in ISL does not have these expres- sions. Fig.9 shows the ISL representation of 5 Sign language notation systems “WOMAN”.

As it has been mentioned above, Sign language does not have any written form. Hence, In order to  \  H f vÇÀ define a sign we need some notation system. There

are a number of phonological notation systems for Location Handshape the representation of SL as discussed in (Smith Palm et.al, 2003). One of the popular among them is Extended Finger orientation Stokoe notation (Stokoe, 2003; Smith et.al, 2003). Fig. 8: HamNoSys representation of “WOMAN” Stokoe defines a sign by three parameters: a) Hand-shape or designator (dez) b) location or place of articulation with respect to the body (tab) and c) movements or signation (sig). HamNoSys (Prillwitz et. al, 1989) is a phonetic transcription system, based on Stokoe notation, used to transcribe signing gestures. It is a Fig.9: Sign of “WOMAN” syntactic representation of a sign to facilitate com- puter processing. HamNoSys extends the tradi- tional Stokoe based notation system by further 6 HamNoSys Editor expanding sign representation by some more pa- Transcribing a sign by HamNoSys is not a trivial rameters. These parameters can be defined as: task. A user who is transcribing a sign should be • Dominant hand’s shape. an expert in both HamNoSys as well as ISL. • Location of the dominant and the non-dominant Moreover he has to remember all the HamNoSys hand with respect to the body. symbols and their corresponding meanings in or- • Extended finger orientation of both dominant der to define a sign. In India it is very difficult to and non-dominant hand. find such a person. Hence our main goal behind • Palm orientation of both hands. building a HamNoSys editor is that, it can be used • Movements (straight, circular, or curved) by an ISL expert with little or no knowledge in • Non-manual signs. HamNoSys. The tool should provide an easy to Fig. 7 shows examples of different HamNoSys use GUI that can be used to transcribe symbols and their descriptions. phonological information of a sign. The HamNoSys editor provides a set of graphical images (most of the images are collected from www.sign-lang.uni-amburg.de/Projekte/ HamNoSys) for most of the phonological parame- ters of a sign, like, Hand-shape, orientation, loca- tion and movements. Based on the parameters, an ISL expert can choose a set of images and the sys- tem will automatically generate the corresponding HamNoSys of the sign. This HamNoSys string can be given as an input to a signing avatar module to generate animated sign representation. A signing avatar is a virtual human char- acter that performs sign language. However, this Fig.7: HamNoSys symbols and there descriptions character needs a set of instructions which will Fig. 8 shows an example where HamNoSys repre- guide its movement. These instructions can be sentation of the word “WOMAN” is explained. provided in the form of HamNoSys (Marshall and Here, the parameters like movement and non- Sáfár, 2001).

61 The 6th Workshop on Asian Languae Resources, 2008

Fig.10: HamNoSys Parameters Fig, 11: Twelve basic hand-shape classes

Fig.13: GUI to choose various hand locations near the human face

Fig.12: GUI to express finger and palm orientations

62 The 6th Workshop on Asian Languae Resources, 2008

Fig 14: GUI showing various straight movement parameters

Fig.10 shows the five basic parameters of Ham- “BITTER”, in ISL the representation is shown in NoSys. For each of these parameters there exist Fig.15. It can be observed that it is very difficult to interfaces through which a SL expert can choose represent the facial expressions like eyebrow by the desired parameters to define a sign. For exam- HamNoSys. Currently we have a collection of ple, the right hand side of Fig.11 shows the twelve around 979 sign icons (published by Vasistha et. al basic hand-shape classes. Each of these base hand- 1998), which we are trying to transcribe in Ham- shapes may contain several derived hand-shapes as NoSys. Out of these, 16% of the signs contain defined in HamNoSys (version 4.0). If a particular non-manual features which we are unable to repre- hand-shape is selected, then the HamNoSys sym- sent in HamNoSys. bol corresponding to the hand-shape gets stored in the XML database (see Fig.6). Similarly, separate interfaces have been provided to identify palm orientation (see Fig,12), hand location (see Fig.13), movements (see Fig.14), and non-manual Fig.15: ISL representation of signs. "BITTER" Due to its symbolic structure, HamNoSys is fairly easy to write, and understand. However, there are some drawbacks on this notation system 7 Conclusion and Future works that make it difficult to be used universally for all sign languages (Smith and Edmondson, 2004). For The paper presents an approach towards building a example, HamNoSys uses some fixed set of sym- multimedia SL dictionary tool. This tool can be bols to define a sign however it is possible that a used to prepare a well documented ISL dictionary. particular sign in any sign language may not be The system is intended to take any Indian lan- defined by 'the fixed set of symbols. For example guage text as input and can store signs in any SL. HamNoSys does not have well defined symbols Currently the system takes English, Hindi and for non-manual expressions. Consider the sign Bengali texts as input and can store signs in ISL only. The system also provides an easy to use GUI

63 The 6th Workshop on Asian Languae Resources, 2008

to include phonological information of a sign in Stokoe W. C., 1960. Sign language structure: an out- the form of HamNoSys string. The generated line of the visual communication systems of the HamNoSys string can then be used as an input to American deaf. 2nd edition, 1978. Silver Spring, the signing avatar module to produce animated MD: Linstok Press. sign output. VCOM3D,2004. Sign smith products. In the next phase of our work we will im- http://www.vcom3d.com. prove the system so that it can associate signs in Wilcox, S., Scheibman, J., Wood, D., Cokely, D., and any other SL (like, ASL and BSL). Further, stokoe, w. c. 1994. Multimedia dictionary of Ameri- WordNet as well as POS Tagger corresponding to can Sign Language. In Assets ’94: Proceedings of Hindi and Bengali languages should also be inte- the first annual ACM conference on Assistive tech- grated with the system. Also, support has to be nologies, ACM Press, New York, NY, USA, 9–16. built so that system can perform sign-to-word and Vasishta M., Woodward J., DeSantis S. 1998, “An In- sign to sign search. We will also perform proper troduction to Indian Sign Language”, All India Fed- evaluation of the HamNoSys editor in order to eration of the Deaf (Third Edition). understand its utility to the SL user. Zeshan U., 2003,”Indo-Pakistani Sign Language Grammar: A Typological Outline”, Sign Language References Studies - Volume 3, Number 2, , pp. 157-212 Buttussi F., Chittaro L., Coppo M. 2007. Using Web3D Zeshan U., Madan M. Vasishta, Sethna M. 2004, “im- technologies for visualization and search of signs in plementation of indian sign language in educational an international sign language dictionary. Proceed- settings”- Volume 15, Number 2, Asia Pacific Dis- ings of the twelfth international conference on 3D ability Rehabilitation Journal, , pp. 15-35 web technology. Perugia, Italy Pages: 61 – 70 Year of Publication: 2007 ISBN:978-1-59593-652-3 Geitz, S., Hanson, T., Maher, S. 1996. Computer gener- ated 3-dimensional models of manual alphabet hand- shapes for the World Wide Web. In Assets ’96: Proceedings of the second annual ACM confer- ence on Assistive technologies, ACM Press, New York, NY, USA, 27–31. Marshall I. and Sáfár É. 2001.Extraction of semantic representations from syntactic SMU link grammar linkages.. In G. Angelova, editor, Proceedings of Recent Advances in Natural Lanugage Processing, pp: 154-159, Tzigov Chark, Bulgaria, September. Prillwitz P., Regina Leven, Heiko Zienert, Thomas Hamke, and Jan Henning. 1989. HamNoSys Version 2.0: Hamburg Notation System for Sign Languages: An Introductory Guide, volume 5 of International Studies on Sign Language and Communication of the Deaf. Signum Press, Hamburg, Germany, Smith G., Angus. 1999. English to American Sign Lan- guage machine translation of weather reports. Pro- ceedings of the Second High Desert Student Confer- ence in Linguistics. Smith,K.C., Edmondson, W. 2004. The Development of a Computational Notation for Synthesis of Sign and Gesture, GW03(312-323). Speers, A. 1995. SL-Corpus: A computer tool for sign language corpora., Georgetown University.

64 The 6th Workshop on Asian Languae Resources, 2008

A Discourse Resource for Turkish: Annotating Discourse Connectives in the METU Corpus

Deniz Zeyrek Bonnie Webber Department of Foreign Language School of Informatics Education University of Edinburgh Middle East Technical University Edinburgh, Scotland Ankara, Turkey [email protected] [email protected]

and their arguments. The 2-million word METU Abstract Turkish Corpus (MTC) is an electronic resource of 520 samples of continuous text from 291 different This paper describes first steps towards sources written between 1990-2000. It includes extending the METU Turkish Corpus multiple genres, such as novels, short stories, from a sentence-level language resource to newspaper columns, biographies, memoirs, etc. a discourse-level resource by annotating annotated topographically, i.e., for paragraph its discourse connectives and their boundaries, author, publication date, and the arguments. The project is based on the source of the text. A small part of the MTC, called same principles as the Penn Discourse the METU-Sabancı TreeBank (5600 sentences) has TreeBank (http://www.seas.upenn.edu/~pdtb) been annotated with morphological features and and is supported by TUBITAK, The dependency relationships (e.g., modifier-of, Scientific and Technological Research subject-of, object-of, etc.). The result is a set of Council of Turkey. We first present the dependency trees. The MTC as a whole provides a goals of the project and the METU large-scale resource on Turkish discourse and is Turkish corpus. We then describe how we being used in research on Turkish. To date, there decided what to take as explicit discourse have been 81 requests for permission to use the connectives and the range of syntactic MTC and 31 requests to use the TreeBank sub- classes they come from. With corpus. Most of the users are linguists, computer or representative examples of each class, we cognitive scientists working on Turkish, or examine explicit connectives, their linear graduate students of similar disciplines. Some ordering, and types of syntactic units that users have expressed a desire for the MTC to be can serve as their arguments. We then extended by annotations at the discourse level, touch upon connectives with respect to which provides further impetus for the present free word order in Turkish and project. punctuation, as well as the important issue The result of annotating discourse connectives of how much material is needed to specify will be a clearly defined level of discourse an argument. We close with a brief structure on the MTC. Annotation of text from the discussion of current plans. multiple genres present in the MTC will allow us 1 Introduction to compare the distribution of connectives and their arguments across genres. The annotation will The goal of the project is to extend the METU help researchers understand Turkish discourse by Turkish Corpus (Say et al, 2002) from a sentence- enabling them to give concise, clear descriptions of level language resource to a discourse-level the issues concerning discourse structure and resource by annotating its discourse connectives, semantics, and support a rigorous empirical

65 The 6th Workshop on Asian Languae Resources, 2008

characterization of where and how the free word- (a) Coordinating conjunctions such as single order in a language like Turkish is sensitive to lexical items çünkü ‘because’, ama ‘but,’ features of the surrounding discourse. It can thus ve ‘and’, and the particle dA. (N.B., dA serve as a major resource for natural language can also function as a subordinator.) processing, language technology and pedagogy. (b) Paired coordinating conjunctions such as hem .. hem ‘both and,’ ne .. ne ‘neither 2 Overview of Turkish Discourse nor’ which link two clauses, with one Connectives element of the pair associated with each clause in the discourse relation. From a semantic perspective, a discourse (c) Simplex subordinators (also termed as connective is a predicate that takes as its converbs), i.e., suffixes forming non-finite arguments, abstract objects (propositions, facts, adverbial clauses, e.g. –(y)kAn, ‘while’, - events, descriptions, situations, and eventualities). (y)ArAk ‘by means of’. The primary linguistic unit in which abstract (d) Complex subordinators, i.e., connectives objects (AOs) are realized in Turkish is the clause, which have two parts, usually a either tensed or untensed. Discourse connectives postposition (rağmen ‘despite’, için ‘for’, themselves may be realized explicitly or implicitly. gibi ‘as well as’) and an accompanying An explicit connective is realized in the form of a suffix on the (non-finite) verb of the lexical item or a group of lexical items, while an subordinate clause.1 implicit connective can be inferred from adjacent (e) Anaphoric connectives such as ne var ki text spans that realise AOs and whose AOs are ‘however’, üstelik ‘what is more’, ayrıca taken to be related. To constrain the amount of text ‘apart from this’, ilk olarak ‘firstly’, etc. selected for arguments, a minimality principle can In the PDTB, non-finite clauses have not been be imposed, limiting arguments to the minimum annotated as arguments. However, since all non- amount of information needed to complete the finite clauses are marked with a suffix in Turkish interpretation of the discourse relation. The project (see sections 4.1 and 4.2 below) and encode a will initially focus on annotating explicit relation between AOs, we would have missed an connectives, integrating implicit ones at a later important property of the language if we had not stage. identified them as discourse connectives (cf. One of the most challenging issues so far has Prasad et al., 2008). been determining the set of explicit discourse All the discourse connectives above have connectives in Turkish (i.e., the various linguistic exactly two arguments. So as in English, while elements that can be interpreted as predicates on verbs in Turkish can vary in the number of AO arguments) and the syntactic classes they are arguments they take, Turkish discourse identified with. In the Penn Discourse TreeBank connectives take two and only two arguments. (PDTB), the explicit discourse connectives were These can conveniently be called ARG1 and taken to comprise (1) coordinating conjunctions, ARG2. It remains an open question whether there (2) subordinating conjunctions, and (3) discourse is any language in which discourse connectives adverbials (Forbes-Riley et al, 2006). But take more than two arguments. coordinating and subordinating conjunctions are In the following, we give representative not classes in Turkish per se. Moreover, most of examples of each of the above five classes of the existing grammars of Turkish describe clausal discourse connectives and discuss the assignment adjuncts and adverbs in semantic (e.g., temporal, of the argument labels, linear order of arguments additive, resultative, etc.) rather than syntactic and types of arguments. By convention, we label terms. We therefore made a rough classification first and determined the broad syntactic classes by 1 considering the morpho-syntactic properties shared Postpositions correspond to prepositions in English, though there are many fewer of them. They form a subordinate clause by elements of the initial classification. by nominalizing their complements and marking them with As a result of this process, we have come to the dative, ablative, or the possessive case. In the examples identify explicit discourse connectives in Turkish given in this paper, suffixes are shown in upper-case letters. with three grammatical types, forming five classes: Case suffixes are underlined in addition to being presented in upper-case letters.

66 The 6th Workshop on Asian Languae Resources, 2008

the argument containing (or with an affinity for) ‘If we had a child I would keep myself busy the connective as ARG2 (presented in boldface) with her/him but God did not predestine it.’ and the other argument as ARG1 (presented in 3.2 Paired coordinating conjunctions italics). Discourse connectives are underlined. This annotation convention is used in the English Paired coordinating conjunctions are composed of translations as well. Except for examples (12), two lexical items, with the second often a duplicate (13), (19), (20), all examples have been taken from of the first element. These lexical items express a the MTC. single discourse relation, such as disjunction as in example (4). The order of arguments is ARG1- 3 Coordinating conjunctions ARG2 and the position of the conjunctions is clause-initial. 3.1 Simple coordinating conjunctions Coordinating conjunctions are like English and (4) Birilerinin ya işi vardır, aceleyle yürürler, ya combine two clauses of the same syntactic type, koşarlar. e.g., two main clauses. They are typically ‘Some people are either busy and walk hurriedly, sentence-medial and show an affinity with the or they run.’ second clause (evidenced in part through 4 Subordinators punctuation and their ability to move to the end of the second clause). Whether a coordinating 4.1 Simplex subordinators conjunction links clauses within a single sentence When a subordinate clause is reduced in Turkish, it or clauses across adjacent sentences (cf. Section 6), loses its tense, aspect and mood properties. In this it shows an affinity with the second clause. Thus way, it becomes a nominal or adverbial clause ARG2 of these conjunctions is the second clause associated with the matrix verb. The relationship of and ARG1 is the first clause. an adverbial clause with the AO expressed by the

matrix verb and its arguments is conveyed by a (1) Yapılarını kerpiçten yapıyorlar, ama sonra taşı kullanmayı öğreniyorlar. Mimarlık açısından çok small set of suffixes corresponding to English önemli, çünkü bu yapı malzemesini başka bir ‘while’, ‘when’, ‘by means of’, ‘as if’, or temporal malzemeyle beraber kullanmayı, ilk defa ‘since’, added to the non-finite verb of the reduced burada görüyoruz. clause. This pair of non-finite verb and suffix, we ‘They constructed their buildings first from mud- call a ‘converb’. The normal order of the bricks but then they learnt to use the stone. arguments of a converb is ARG2-ARG1, where the Architecturally, this is very important because we converb appears as the last element of ARG2. The see the use of this construction material with following example illustrates –(y)ArAk ‘by means another one at this site for the first time .’ of’ and its arguments:

The particle dA can serve a discourse (5) Kafiye Hanım beni kucakladı, yanağını yanağıma connective function with an additive (Example 2) sürterek iyi yolculuklar diledi. or adversative sense (Example 3). In contrast with ‘Kafiye hugged me and by rubbing her cheek coordinating conjunctions, the order of arguments against mine, she wished me a good trip.’ to dA is normally ARG2-ARG1, thus exhibiting a similarity with subordinators (see below). 4.2 Complex subordinators However, since dA combines two clauses of the Complex subordinators constitute a larger set than same syntactic type, we take it to be a simple the set of simplex subordinators. Here, a lexical coordinating conjunction. item, usually a postposition, must appear with a nominalizing suffix and, if required, a case suffix (2) Konuşmayı unuttum diyorum da gülüyorlar as well. If the verb of the clause does not have a bana. subject, it is nominalized with –mAk (the infinitive ‘I said I’ve forgotton to talk and they laughed at suffix). If it has a subject, it is nominalized with - me.’ (3) Belki bir çocuğumuz olsa onunla oyalanırdım DIK (past) or –mA (non-past) and carries the da Allah kısmet etmedi. possessive marker agreeing with the subject of the

67 The 6th Workshop on Asian Languae Resources, 2008

verb. The normal order of the arguments of a marker agreeing with the subject of the subordinate complex subordinator is the same as with clause where necessary). The suffix suffices to converbs, i.e., ARG2-ARG1. The nominalizer, the introduce a discourse relation on its own, even possessive and the case suffix (if any) appear without the postposition eğer: attached to the non-finite verb of ARG2 in that order. The connective appears as the last element (10) Salman Rushdi öldürülürSE İslam dini bundan of ARG2. bir onur kazanacak? Some postpositions have multiple senses, ‘If Salman Rushdi was to be killed, would the depending on the type of nominalizer attached to Islam religion be honoured?’ the non-finite verb. For example, the postposition (11) Eğer sigarayı bırakmak için mükemmel zamanı bekliyorSAnız asla sigarayı için means causal ‘since’ with –DIK (Example 6), bırakamazsınız. and ‘so as to’ with –mA or –mAk (Example 7). In ‘If you are waiting for the best time to stop these examples, the lexical part of the complex smoking, you can never stop smoking’ subordinator is underlined, and the suffixes on the non-finite verb of ARG2 rendered in small caps. 5 Anaphoric connectives

(6) Herkes çoktan pazara çıkTIĞI için kentin o dar, The fifth type of explicit discourse connectives are eğri büğrü arka sokaklarını boşalmış ve sessiz anaphoric connectives. Anaphoric connectives are bulurduk. distinguished from clausal adverbs like çoğunlukla ‘Since everyone has gone to the bazaar long ‘usually’, mutlaka ‘definitely, maalesef time ago, we would find the narrow and curved ‘regrettably’, which are interpreted only with back streets of the town empty and quiet.’ respect to their matrix sentence. In contrast, (7) [Turhan Baytop] Paris Eczacılık Fakültesi anaphoric connectives also require an AO from a Farmakognozi kürsüsünde görgü ve bilgisini sentence or group of sentences adjacent (Example arttırMAK için çalışmıştır. ‘Turhan Baytop worked at Paris Pharmacology 12) or non-adjacent (Example 13) to the sentence Faculty so as to increase his experience and containing the connective. Another important knowledge,’ property of anaphoric connectives is that they can access the inferences in the prior discourse Since postpositions also have a non-discourse (Webber et al 2003). This material is neither role in which they signal a verb’s arguments and/or accessible by other types of discourse connectives adjuncts, we will only annotate postpositions as nor clausal connectives. For example, in example discourse connectives when they have clausal (14), the anaphoric connective yoksa ‘or else, elements as arguments. Given that a clausal otherwise’ accesses the inference that the element always has a nominalizing suffix, the organizations have not united and hence did not distinction will be straightforward. For example, in introduce political strategies unique to Turkey. (8) için takes an NP complement (marked with the possessive case) and will not be annotated, while (12) Ali hiç spor yapmaz. Sonuç olarak çok istediği in (9) rağmen ‘despite’ comes with a nominalizer halde kilo veremiyor. ‘Ali never exercises. Consequently , he can’t lose and the dative suffix, and it will be annotated: weight although he wants to very much.’ (13) Zeynep önceleri Bodrum’da oturdu. Krediyle (8) Bunun için paraya ihtiyacımız var. deniz kenarında bir ev aldı. Evi dayadı, döşedi, ‘We need money for this.’ bahçeye yasemin ekti. Ne var ki banka kredisini (9) Çok iyi bir biçimde yayılmış olMASINA ödeyemediğinden evi satmak zorunda kaldı. rağmen Celtis (çitlenbik) poleninin yokluğu ‘Zeynep first lived in Mersin. She bought a house dikkate değerdir. by the sea on credit. She furnished it fully and ‘Despite not dispersing well, the absence of the planted jasmine in the garden. However , she had Celtis [tree] polen is worthy of attention.’ to sell the house because she couldn’t pay back the credit.’ In general, both parts of a complex subordinator must be realized in the discourse. An exception is ‘if’ eğer and its accompanying suffix –sE (and the

68 The 6th Workshop on Asian Languae Resources, 2008

(14) Bu örgütlerin birleşerek Türkiye’ etkilemesi ve geri çevirirdi, pek anlam veremezdim. Parayı Türkiye’ye özgü politikaları gündeme getirmesi severdi çünkü. lazım. Yoksa Tony Blair şöyle yaptı şimdi biz ‘Some customers would come with gold coloured de şimdi böyle yapacağızla olmaz. fabrics and yellow taffeta weaves but he would ‘These organizations must unite, have an impact reject them saying his hands were full, which I on Turkey and introduce political strategies could not give any meaning to. Because he loved unique to Turkey. Or else talking about what money.’ Tony Blair did and hoping to do what he did is outright wrong.’ In contrast, the position of a subordinator (both simplex and complex) in its ARG2 clause is fixed: 6 Ordering flexibility of explicit discourse it must appear at the end of the clause, as shown in connectives and their arguments example (19). However, the clause is free in the In Turkish, the linear ordering of coordinating sentence and may be moved to the right of the conjunctions and subordinators and the clauses in sentence, as in example (20). It is a matter of which they occur shows some flexibility as to empirical research to find out whether different where in the clause they appear or as to the genres vary more in how clauses are ordered and ordering of the clauses. For example, coordinating what motivates preposing of ARG1. conjunctions may appear at the beginning of their (19) Ayşe konuşurken ben dinlemiyordum. ARG2, i.e. S-initially. This was shown earlier in ‘I was not listening whiley A şe was talking.’ Example (1). The sentences below illustrate ama (20) Ben dinlemiyordum Ayşe konuşurken. ‘but’ and çünkü ‘because’ used at this position. ‘I was not listening whiley A şe was talking.’

(15) Hatem Ağa’nın malına kimse yanaşamaz, 7 Issues and plans dokunamazdı. Ama Osman gitmiş, Hatem Ağa’nın çiftliğini yakmıştı. As mentioned above, we also plan to annotate ‘No one could approach and touch Agha implicit connectives between adjacent sentences or Hatem’s property. But Osman had burnt Agha clauses whose relation is not explicitly marked Hatem’s ranch.’ with a discourse connective. This we will do at a (16) Söz özgürlüğünün belli yasalar, belli ilkeler later stage, after explicit connectives have been çerçevesinde kalmak zorunda olduğunu annotated, following the procedure used in biliyoruz. Çünkü , bütün özgürlükler gibi, belli annotating implicit connectives in the PDTB sınırlar aşılınca, başkalarına zarar vermek, başkalarının özgürlüklerini zedelemek söz (PDTB-Group, 2006). Preliminary analysis has konusu oluyor. shown that punctuation serves as a useful hint in ‘We know that freedom of speech should remain inserting a coordinating conjunctions such as ‘and’ within the limits of certain laws and principles. or an anaphoric connective such as ‘then’ or Because , like all the other freedoms, when ‘consequently’ between the multiple adjacent main certain constraints are violated, one may clauses that can occur in a Turkish sentence harm others’ freedom.’ separated by a comma. Example (21) illustrates these cases. But coordinating conjunctions may also appear at the end of their ARG2 and so will appear S-finally (21) Yürüyor, Imp = THEN oturuyor, resim yapmaya in sentences with ARG1-ARG2 order. Below, we çalışıyor ama yapamıyor, tabela yazmaya illustrate two cases of ama ‘but’ and çünkü çalışıyor ama yazamıyor, Imp= CONSEQUENTLY ‘because’. sıkılıp sokağa çıkıyor, Imp=AND bisikletine atladığı gibi pedallara basıyor. (17) Kazıyabildiğini sildi, biriktirdi mendilinin içine. ‘He walks around, then sits down and tries to Çaba isteyen zor bir işti bu yaptığı ama. draw, but he can’t. He tries to inscribe words on ‘He wiped the area he had scraped and saved all the wooden plaque, but again he can’t. he could scrape in his rag. But what he was Consequently he gets bored, goes out, and hops doing was a difficult job, requiring effort.’ on his bike and pedals.’ (18) Kimi müşteriler dore rengi kumaşlarla, sarı taftalarla gelirdi de, elim dolu yapamam, diye

69 The 6th Workshop on Asian Languae Resources, 2008

A second important issue that will have to be Katherine Forbes-Riley, Bonnie Webber and Aravind tackled in the project is determining how much Joshi (2006). Computing Discourse Semantics: The material is needed to specify the argument of a Predicate-Argument Semantics of Discourse discourse connective. Annotation will be on text Connectives in D-LTAG. Journal of Semantics 23, spans, rather than on syntactic structure. This pp. 55—106. reflects two facts: First, there is only a small Aslı Göksel and Celia Kerslake (2005). Turkish: A amount of syntactically treebanked data in the Comprehensive Grammar. London and New York: MTC, and secondly, as has been discovered for Routledge. English, one can not assume that discourse units Kornfilt, Jacklin (1997). Turkish. London and New map directly to syntactic units (Dinesh et al, 2005). York: Routledge. Preliminary analysis also shows that discourse PDTB-Group (2006). The Penn Discourse TreeBank units may not coincide with a clause in its entirety. 1.0 Annotation Manual. Technical Report IRCS 06- For example, in examples (9) and (16), one can 01, University of Pennsylvania. take ARG1 to cover only the nominal complement of the matrix verb: The rest of the clause is not Rashmi Prasad, Samar Husain, Dipti Sharma and necessary to the discourse relation. The ways in Aravind Joshi (2008). Towards an Annotated Corpus of Discourse Relations in Hindi. The Third which the arguments of a discourse connective International Joint Conference on Natural Language may diverge from syntactic units must be Processing, January 7-12, 2008. characterized for Turkish as is being done for English (Dinesh et al, 2005). Bilge Say, Deniz Zeyrek, Kemal Oflazer and Umut A third issue we will investigate is whether Özge (2002). Development of a Corpus and a TreeBank for Present-day Written Turkish. different senses of a subordinator may be identified Proceedings of the Eleventh International simply from the type of nominalizing suffix Conference of Turkish Linguistics, Eastern required on the subordinate verb. For example, we Mediterranean University, Cyprus, August 2002. have noted in examples (6) and (7) that the two senses of the postposition için (namely, ‘since Bonnie Webber, Aravind Joshi, Matthew Stone and Alistair Knott (2003). Anaphora and Discourse (causal)’ and ‘in order to’) are disambiguated by Structure. Computational Linguistics 29 (4) 547-588. the nominalizing suffixes. The extent to which morphology aids sense disambiguation is an Appendix: A preliminary list of explicit empirical issue that will be further addressed in the discourse connectives found in the MTC belonging project. to five syntactic classes and their English equivalents Acknowledgement Simple coordinating English equivalent We would like to thank Sumru Özsoy, Aslı Göksel conjunctions and Cem Bozşahin for their comments on an ama but earlier version of this paper. The first author also fakat but thanks the Caledonian Research Foundation and çünkü because the Royal Society of Edinburgh for awarding her dA and, but with the European Visiting Research Fellowship, which made this research possible. All remaining halbuki despite errors are ours. oysa despite önce before References sonra after ve and Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Rashmi veya or Prasad, Aravind Joshi and Bonnie Webber (2005). ya da or Attribution and the (Non-)Alignment of Syntactic and Discourse Arguments of Connectives. veyahut or Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky. Ann Arbor, Michigan. June 2005.

70 The 6th Workshop on Asian Languae Resources, 2008

Paired coordinating English equivalent conjunctions hem .. hem both and ya .. ya either or gerek .. gerek(se) either or

Simplex subordinators English equivalent (Converbs) -ArAk by means of -Ip and -(y)kEn while, whereas -(y)AlI since -(I)ncA when

Complex subordinators English equivalent -Ir gibi as if, as though -eğer (y)sE if -dIğI zaman when -dIğI kadar as much as -dIğI gibi as well as -dAn sonra after -dAn önce before -dAn dolayı due to -(y)sE dA even though -(y)Incaya kadar/dek until -(y)AlI beri since (temporal) -(n)A rağmen/karşılık despite, although -(n)A göre since (causal)

Anaphoric connectives English equivalent aksi halde if not, otherwise aksine on the contrary bu nedenle for this reason buna rağmen/karşılık despite this bundan başka besides this bunun yerine instead of this dahası moreover, in addition ilk olarak firstly, first of all örneğin for example mesela for example sonuç olarak consequently üstelik what is more yoksa otherwise ardından afterwards

71 The 6th Workshop on Asian Languae Resources, 2008

72 The 6th Workshop on Asian Languae Resources, 2008

Towards an Annotated Corpus of Discourse Relations in Hindi

Rashmi Prasad*, Samar Husain†, Dipti Mishra Sharma† and Aravind Joshi*

explicitly realized in the text. For example, in (1), Abstract the causal relation between ‘the federal government suspending US savings bonds sales’ We describe our initial efforts towards and ‘Congress not lifting the ceiling on developing a large-scale corpus of Hindi government debt’ is expressed with the explicit texts annotated with discourse relations. connective ‘because’.2 The two arguments of each Adopting the lexically grounded approach connective are also annotated, and the annotations of the Penn Discourse Treebank (PDTB), of both connectives and their arguments are we present a preliminary analysis of recorded in terms of their text span offsets.3 discourse connectives in a small corpus.

We describe how discourse connectives are (1) The federal government suspended sales of U.S. represented in the sentence-level savings bonds because Congress hasn’t lifted the dependency annotation in Hindi, and ceiling on government debt. discuss how the discourse annotation can

enrich this level for research and One of the questions that arises is how the applications. The ultimate goal of our work PDTB style annotation can be carried over to is to build a Hindi Discourse Relation Bank languages other than English. It may prove to be a along the lines of the PDTB. Our work will challenge cross-linguistically, as the guidelines and also contribute to the cross-linguistic methodology appropriate for English may not understanding of discourse connectives. apply as well or directly to other languages, especially when they differ greatly in syntax and 1 Introduction morphology. To date, cross-linguistic An increasing interest in human language investigations of connectives in this direction have technologies such as textual summarization, been carried out for Chinese (Xue, 2005) and question answering, natural language generation Turkish (Deniz and Webber, 2008). This paper has recently led to the development of several explores discourse relation annotation in Hindi, a discourse annotation projects aimed at creating language with rich morphology and free word large scale resources for natural language order. We describe our study of “explicit processing. One of these projects is the Penn connectives” in a small corpus of Hindi texts, Discourse Treebank (PDTB Group, 2006),1whose discussing them from two perspectives. First, we goal is to annotate the discourse relations holding consider the type and distribution of Hindi between eventualities described in a text, for connectives, proposing to annotate a wider range example causal and contrastive relations. The PDTB is unique in using a lexically grounded 2 The PDTB also annotates implicit discourse relations, but approach for annotation: discourse relations are only locally, between adjacent sentences. Annotation here anchored in lexical items (called “explicit consists of providing connectives (called “implicit discourse connectives”) to express the inferred relation. Implicit discourse connectives”) whenever they are connectives are beyond the scope of this paper, but will be taken up in future work. 3 The PDTB also records the senses of the connectives, and * University of Pennsylvania, Philadelphia, PA, USA, each connective and its arguments are also marked for their {rjprasad,joshi}@seas.upenn.edu attribution. Sense annotation and attribution annotation are not † Language Technologies Research Centre, IIIT, Hyderabad, discussed in this paper. We will, of course, pursue these India, [email protected], [email protected] aspects in our future work concerning the building of a Hindi 1 http://www.seas.upenn.edu/˜pdtb Discourse Relation Bank.

73 The 6th Workshop on Asian Languae Resources, 2008

of connectives than the PDTB. Second, we language. We did try to create a list from our own consider how the connectives are represented in knowledge of the language and grammar, and also the Hindi sentence-level dependency annotation, in by translating the list of English connectives in the particular discussing how the discourse annotation PDTB. However, when we started looking at real can enrich the sentence-level structures. We also data, this list proved to be incomplete. For briefly discuss issues involved in aligning the example, we discovered that the form of the discourse and sentence-level annotations. complementizer ‘ik’ also functions as a temporal Section 2 provides a brief description of Hindi subordinator, as in (3). word order and morphology. In Section 3, we present our study of the explicit connectives (3) [ vah baalaTI ko gaMdo panaI sao ApnaI caaOklaoT identified in our texts, discussing them in light of [he bucket of dirty water from his chocolates the PDTB. Section 4 describes how connectives inakalanao hI vaalaa qaa] ik {]sakI mammaI nao are represented in the sentence-level dependency taking-out just doing was] that {his mother ERG annotation in Hindi. Finally, Section 5 concludes ]sao raok idyaa } with a summary and future work. him stop did} “He was just going to take out the chocolates from 2 Brief Overview of Hindi Syntax and the dirty water in the bucket when his mother stopped him.” Morphology Hindi is a free word order language with SOV as The method of collecting connectives will the default order. This can be seen in (2), where therefore necessarily involve “discovery during (2a) shows the constituents in the default order, annotation”. However, we wanted to get some and the remaining examples show some of the initial ideas about what kinds of connectives were word order variants of (2a). likely to occur in real text, and to this end, we looked at 9 short stories with approximately 8000 (2) a. malaya nao samaIr kao iktaba dI . words. Our goal here is to develop an initial set of malay ERG sameer DAT book gave guidelines for annotation, which will be done on 4 “Malay gave the book to Sameer” (S-IO-DO-V) the same corpus on which the sentence-level b. malaya nao iktaba samaIr kao dI. (S-DO-IO-V) dependency annotation is being carried out (see c. samaIr kao malaya nao iktaba dI. (IO-S-DO-V) Section 4). Table 1 provides the full set of d. samaIr kao iktaba malaya nao dI. (IO-DO-S-V) connectives we found in our texts, grouped by e. iktaba malaya nao samaIr kao dI. (DO-S-IO-V) syntactic type. The first four columns give the f. iktaba samaIr kao malaya nao dI. (DO-IO-S-V) syntactic grouping, the Hindi connective expressions, the English gloss, and the English Hindi also has a rich case marking system, equivalent expressions, respectively. The last although case marking is not obligatory. For column gives the number of occurrences we found example, in (2), while the subject and indirect of each expression. In the rest of this section, we object are explicitly for the ergative (ERG) and describe the function and distribution of discourse dative (DAT) cases, the direct object is unmarked connectives in Hindi based on our texts. In the for the accusative. discussion, we have noted our points of departure from the PDTB where applicable, both with 3 Discourse Connectives in Hindi respect to the types of relations being annotated as Given the lexically grounded approach adopted for well as with respect to terminology. For argument discourse annotation, the first question that arises naming, we use the PDTB convention: the clause is how to identify discourse connectives in Hindi. with which the connective is syntactically Unlike the case of the English connectives in the associated is called Arg2 and the other clause is PDTB, there are no resources that alone or together called Arg1. Two special conventions are followed provide an exhaustive list of connectives in the for paired connectives, which we describe below. In all Hindi examples in this paper, Arg1 is enclosed in square brackets and Arg2 is in braces. 4 S=Subject; IO=Indirect Object; DO=Direct Object; V=Verb; ERG=Ergative; DAT=Dative

74 The 6th Workshop on Asian Languae Resources, 2008

Connective Type Hindi Gloss English Num Sub. Conj. @yaaoMik why-that because 2 (@yaaoM)ik..[salaIe (why)-that..this-for because 3 (Agar|yadI)..tba|tao (if)..then if..(then) 15 (jaba).. tba|tao (when)..then when 50 jaba tk.. tba tk (ko ilae) when till..then till (of for) until 2 jaOsao hI..(tao) as just..(then) as soon as 5 [tnaa|eosaa..kI so|such..that so that 12 taik so-that so that 1

ik that when 5

Sentential Relatives ijasasao which-with because of which 5 jaao which because of which 1 ijasako karNa which-of reason because of which 1 Subordinator pr upon upon 9 r(-k |-ko|krko) (do) after |while 111 samaya time while 1 hue happening while 28 ko baad of later after 3 sao with due to 1 ko phlao of before before 1 ko ilae of for in order to 4 maoM in while 1 ko karNa of reason because of 3 Coord. Conj. laoikna|pr|prntu but but 51 AaOr|tqaa and and 117 yaa or or 2 yaaoM tao..pr such TOP..but but 2 naa kovala..balaik not only..but not only..but 1 Adverbial tba then then 2 baad maoM later in later 5 ifr then then 4 [saIilae this-for that is why 7 nahIM tao not then otherwise 5 tBaI tao then-only TOP that is why 1 saao so so 10 vahI|yahI nahIM that|this-only not not only that 1 TOTAL 472 Table 1: A Partial List of Discourse Connectives in Hindi. Parentheses are used for optional elements; “|” is used for alternating elements; TOP = topic marker.

3.1 Types of Discourse Connectives DAT give would], why-that {he-EMPH all 3.1.1 Subordinating Conjunctions QartI kI sampda ka svaamaI } earth of wealth of lord is} Finite adverbial subordinate clauses are “I would give all this wealth to the king, because he introduced by independent lexical items called alone is the lord of this whole world’s wealth.” “subordinating conjunctions”, such as @yaaoMik (‘because’), as in (4), and they typically occur as As the first group in Table 1 shows, right or left attached to the main clause. subordinating conjunctions in Hindi often come paired, with one element in the main clause and (4) [maOM [sa saBaI Qana kao rajya ko baadSaah the other in the subordinate clause (Ex.5). One [I this all wealth ACC kingdom of king of these elements can also be implicit (Ex.6), kao do dota], @yaaoMik {vahI samast

75 The 6th Workshop on Asian Languae Resources, 2008

and in our texts, this was most often the light putting] perhaps this-for [appropriate not subordinate clause element. samaJato qao] ik {tola ka Apvyaya haogaa}. Consider did] that {oil of waste be-FUT}. (5) @yaaoMik {yah tumharI ja,maIna pr imalaa hO}, [sailae “. . . but he did not consider it appropriate to light the because {this your land on found has}, this-for lamp repeatedly or light another lamp, perhaps [[sa Qana pr tumhara AiQakar hO] because it would be a waste of oil.” [this treasure on your right is] 3.1.2 Sentential Relative Pronouns “Because this was found on your land, you have the right to this treasure.” Since discourse relations are defined as holding between eventualities, we have also identified (6) [ ]saka vaSa calata] tao {vah ]sao Gar sao relations that are expressed syntactically as [her power walk] then {she it home from relative pronouns in sentential relative clauses, baahr inakala dotI} which modify the main clause verb denoting an out take would} eventuality, rather than some entity denoting “Had it been in her power, she would have banished noun phrase. For example, in (9), a it from the house.” result/purpose relation is conveyed between ‘the

When both elements of the paired connective are man’s rushing home’ and ‘the bird being taken explicit, their text spans must be selected care of’, and we believe that this relation discontinuously. The main clause argument is between the eventualities should be captured called Arg1 and the subordinate clause despite it’s syntactic realization as the relative argument, Arg2. pronoun ‘ijasasao’ (‘because of which/so that’). (10) Subordinating conjunctions, whether single or gives an example of a modified relative paired, can occur in non-initial positions in their pronoun. clause. However, this word order variability is not completely unconstrained. First, not all (9) [saara kama caaoD,kr vah ]sa baImaar icaiD,yaa conjunctions display this freedom. For example, [all work leaving he that sick bird kao ]zakr dbaa Gar kI Aaor Baagaa], while ‘jaba’ (‘when’) can be clause-medial (Ex. ACC picking-up fast home of direction ran], 7), ‘@yaaoMik’ (‘because’) cannot. Second, when the ijasasao {]saka sahI [laaja ikyaa jaa sako} main clause precedes the subordinate clause, the from-which {her proper care do go able} main clause element, if explicit, cannot appear “Leaving all his work, he picked up the bird and ran clause-initially at all. Consider the causal ‘@yaaoMik.. home very fast, so that the bird could be given proper [salaIe’ (Ex.5), which represents the subordinate- care.” main clause order. In the reverse order, the explicit main clause ‘[salaIe’ (Ex.8) appears (10) [ }^MTaoM ko hr baar kdma rKnao pr clause medially. Placing this element in clause- [camels of every time step keeping upon

initial position is not possible. icaiD,yaao M ko isar Aapsa maoM tqaa }^MT kI birds of head each-other in and camels of gardna sao Tkra rho qao] ijasako karNa (7) {lakD,haro kI p%naI kao} jaba {yah neck with hit-against be had] of-which reason {woodcutter of wife DAT} when {this {]na pixayaaoM kI drdBarI caIKoM inakla maalaUma pD,a ik [sa icaiDyaa ko karNa, {those birds of painful screams come-out knowledge put that this bird of reason rhI qaIM}. kama CaoD,kr Gar Aa gayaa hO} tao [vah be had} work leaving home come went is} then [she “With each step of the camels, the birds heads were ]sa pr barsa pD,I]. hitting against each other as well as with the camels’ him on anger-rain put] necks because of which the birds were screaming “When the woodcutter’s wife found out that he had painfully.” left his work and come home to care for the bird, she raged at him.” 3.1.3 Subordinators

(8) [ . . . pr icaraga kI ba%tI ]sakanaa yaa daohrI In contrast to the subordinating conjunctions, [. . .but lamp of light light or another elements introducing non-finite subordinate ba%tI lagaanaa] Saayad [sailae []icat nahIM clauses are called “subordinators”. Unlike

76 The 6th Workshop on Asian Languae Resources, 2008

English, where certain non-finite subordinate marker would not be regarded as a connective, clauses, called “free adjuncts”, appear without while in the latter case, it would. Second, the any overt marking so that their relationship with clauses marked by these connectives often seem the main clause is unspecified, Hindi non-finite to be semantically weak. This is especially true subordinate clauses almost always appear with of verbal participles, which are nonfinite verb overt marking. However, also unlike English, appearing in a modifying relation with another where the same elements may introduce both finite verb. Whereas in some cases (Ex.12-13) finite and non-finite clauses (cf. After leaving, the two verbs are perceived as each projecting she caught the bus vs. After she left, she caught “two distinct events” between which some the bus), different sets of elements are used in discourse relation can be said to exist, in other Hindi. In fact, as can be seen in the subordinator cases (Ex.14), the two verbs seem to project two group in Table 1, the non-finite clause markers distinct actions but as part of a “single complex are either postpositions (Ex.11), particles event” (Verma, 1993). These judgments can be following verbal participles (Ex.12), or suffixes very subtle, however, and our final decision on marking serial verbs (Ex.13). whether to annotate such constructions will be made after some initial annotation and (11) { mammaI ko manaa krnao} ko karNa [ramaU evaluation. {mummy of warning doing} of reason [Ramu qaaoD,I qaaoD,I caaOklaoT baD,o AnaMd ko saaqa (14) { doKto hI doKto saba baOla Baagato } little little chocolate big pleasure of with {looking EMPH looking all buffalos running} Ka rha qaa]. hue [gaaoSaalaa phu^Mca gae] eat being be] happening [shed reach did] “Because of his mother’s warning, Ramu was eating “Within seconds all the buffalos came running to the bits of chocolate with a lot of pleasure.” shed.”

(12) . . . AaOr {Kolato} hue [yah BaUla jaata hO The naming convention for the arguments of . . . and {playing} happening [this forget go is subordinators is the same as for the ik yaid ]saka ima~ BaI Apnao iKlaaOnao kao subordinating conjunctions: the clause that if his friends also their toys to ]sao haqa nahIM lagaanao dota, tao ]sao associated with the subordinator is called Arg2 him hand not touching did, then he while its matrix clause is called Arg1. iktnaa baura lagata] Unlike subordinating conjunctions, how-much bad feel] subordinators do not come paired and they can “. . . and while playing, he forgets that if his friends only appear clause-finally. Clause order, while too didn’t let him touch their toys, then how bad he not fixed, is restricted in that the nonfinite would feel.” subordinate clause can appear either before the main clause or embedded in it, but never after (13) { ApnaI p%naI sao yah sauna}kr [lakD,hara the main clause. {self wife from this listen}-do [woodcutter bahut duKI huAa] 3.1.4 Coordinating Conjunctions much sad became] “Upon hearing this from his wife, the woodcutter Coordinating conjunctions in Hindi are found in became very sad.” both inter-sentential (Ex.15) and intra-sentential (Ex.16) contexts, they always appear as While subordinators constitute a frequently- independent elements, and they almost always 5 used way to mark discourse relations, their appear clause-initially. For these connectives, annotation raises at least two difficult problems, both of which have implications for the 5 While the contrastive connectives ‘pr’, ‘prntU’ appear only reliability of annotation. The first is that these clause-initially, it seems possible for the contrastive ‘laoikna’ markers are used for marking both argument to appear clause-medially, suggesting that these two types clauses and adjunct clauses, so that annotators may correspond to the English ‘but’ and ‘however’, would be required to make difficult decisions for respectively. However, we did not find any examples of distinguishing them: in the former case, the clause-medial ‘laoikna’ in our texts, and this behavior will have to be verified with further annotation.

77 The 6th Workshop on Asian Languae Resources, 2008

the first clause is called Arg1 and the second, Addtional annotation will provide further Arg2. evidence in this regard.

(15) [ jaba vah laaOTta tao gaa-gaakr ]saka mana 4 Hindi Sentence-level Annotation [when he return then sing-singing his mind andDiscourse Connectives KuSa kr dotI]. laoikna {]sakI p%naI kao vah happy do gave]. But {his wife DAT the The sentence-level annotation task in Hindi is icaiD,yaa fUTI Aa^MK nahIM sauhatI qaI}. an ongoing effort which aims to come up with a bird torn eye not bear did} dependency annotated treebank for the NLP/CL “Upon his return, she would make him happy by community working on Indian languages. singing. But his wife could not tolerate the bird even Presently a million word Hindi corpus is being a little bit.” manually annotated (Begum et al., 2008). The dependency annotation is being done on top of (16) [ ] { tBaI drvaaja,a Kulaa AaOr maalaikna Aa the corpus which has already been marked for [then-only door opened] and {wife come POS tag and chunk information. The scheme has ga[- }. went} 28 tags which capture various dependency “Just then the door opened and the wife came in.” relations. These relations are largely inspired by the Paninian grammatical framework. Given We also recognize paired coordinating below are some relations, reflecting the conjunctions, such as ‘naa kovala..balaik’ (See Table argument structure of the verb. 1). The argument naming convention for these is the same as for the single conjunctions. a) kta- (agent) (k1) b) kma- (theme) (k2) 3.1.5 Discourse Adverbials c) krNa (instrument) (k3) Discourse adverbials in Hindi modify their clau- d) samp`dana sampradaan (recipient) (k4) ses as independent elements, and some of these e) Apadana (source) (k5) are free to appear in non-initial positions in the f) AiQakrNa (location) (k7) clause. Example (17) gives an example of the consequence adverb, ‘saao’. The Arg2 of discourse Figure 1 shows how Examples (2a-f) are adverbials is the clause they modify, whereas represented in the framework. Note that agent Arg1 is the other argument. and theme are rough translations for ’kta-’ and ’kma-’ respectively. Unlike thematic roles, these (17) [icaiD,yaa jabaana kT jaanao AaOr maalaikna ko eosao relations are not purely semantic, and are [bird tongue cut going and wife of this motivated not only through verbal semantics but vyavahar sao Dr ga[- qaI]. saao {vah iksaI also through vibhaktis (postpositions) and TAM behavior with fear go had]. So {she some (Tense, aspect and modality) markers (Bharati et trh ]D,kr calaI ga[-}. al., 1995). The relations are therefore syntactico- manner flying walk went}. semantic, and unlike thematic roles there is a “The bird was scared due to her tongue being cut and greater binding between these relations and the because of the wife’s behavior. So she somehow flew syntactic cues. away.”

As with the PDTB, one of our goals with the dIdIdI Hindi discourse annotation is to explore the k1 k4 k2 structural distance of Arg1 from the discourse adverbial. If the Arg1 clause is found to be snon- mmmaaalayalayalaya samaIrsamaIrsamaIr iiikkktabatabataba adjacent to the connective and the Arg2 clause, it may suggest that adverbials in Hindi behave Figure 1: Dependency Diagram for Example (2) anaphorically. In the texts we looked at, we did Some discourse relations that we have identified not find any instances of non-adjacent Arg1s. are already clearly represented in the sentence- level annotation. But for those that aren’t, the

78 The 6th Workshop on Asian Languae Resources, 2008

discourse level annotations will enrich the be able to provide the appropriate argument sentence-level. In the rest of this section, we structure and semantics for these paired discuss the representation of the different types connectives. of connectives at the sentence level, and discuss how the discourse annotation will add to the [sailae[sailae[sailae information present in the dependency ccof structures. AiQakar hOAiQakar hO k2 k1 rh Subordinating Conjunctions Subordinating conjunctions are lexically represented in the QanaQanaQana tututumharamharamhara @yaaoMik ccof dependency tree, taking the subordinating clause iimalaamalaa hOimalaa hO as their dependents while themselves attaching k1 k7p to the main verb (the root of the tree). Figure 2 yahyahyah ja,maInaja,maInaja,maIna shows the dependency tree for Example (4) r6 containing the subordinating conjunction ’@yaaoMik’. tumharItumharItumharI Note that the edge between the connective and Figure 3: Dependency Tree for Paired the main verb gives us the causal relation Subordinating Conjunction in Example (5) between the two clauses, the relation label being ’rh’ (relation hetu ’cause’). Thus, the discourse Subordinators As mentioned earlier, Hindi level can be taken to be completely represented nonfinite subordinate clauses almost always at the sentence-level. appear with overt marking. But unlike the subordinating conjunctions, subordinators are not lexically represented in the dependency do dotado dota trees. Figure 4 gives the dependency k1 k2 k4 rh representation for Example (11) containing a postposition subordinator ’ko karNa’, which relates maOMmaOMmaOM QanaQanaQana baabaabaadddSaahSaahSaah @yaaoMik r6 ccof the main and subordinate clauses causally. As the figure shows, while the causal relation label hE rajyarajyarajya k1 k1s (‘rh’) appears on the edge between the main and subordinate verbs, the subordinator itself is not vahIvahIvahI svaamaIsvaamaIsvaamaI lexically represented as the mediator of this r6 relation. The lexically grounded annotation at sampdasampdasampda the discourse level will thus provide the textual r6 anchors of such relations, enriching the QartIQartIQartI dependency representation. Furthermore, while many of the subordinators in Table 1 are fully Figure 2: Dependency Tree for Subordinating specified in the dependency trees for the Conjunction in Example (4) semantic relation they denote (e.g., ‘pr’ and ‘maoM’ marked as the ‘k7t’ (location in time) relation, Paired Subordinating Conjunctions Unlike and ‘ko karNa’ and ‘sao’ marked as the ‘rh’ Example (4), however, the analysis for the (cause/reason) relation), others, like the particle paired connective in Example (5), given in ‘hue’ are underspecified for their semantics, being Figure 3, is insufficient. Despite the lexical marked only as ‘vmod’ (verbal modifier). The representation of the connective in the tree, the discourse-level annotation will thus be the correct interpretation of the paired conjunction source for the semantics of these subordinators. and the clauses which it relates is only possible at the discourse level. In particular, the Coordinating Conjunctions Coordinating dependencies don’t show that ’@yaaoMik’ and ’[salaIe’ conjunctions at the sentence level anchor the are two parts of the same connective, expressing root of the dependency tree. Figure 5 shows the a single relation and taking the same two arguments. Thus, the discourse annotation will

79 The 6th Workshop on Asian Languae Resources, 2008

dependency representation of Example (16) wide range of connectives, analyzing their types containing a coordinating conjunction. and distributions, and discussing some of the issues involved in the annotation. We also Ka rha qaarha qaa described the representation of the connectives rh vmod in the sentence-level dependency annotation k1 k2 being carried out independently for Hindi, and manaamanaamanaa krnaokrnaokrnao ramaramarama caaOcaaOcaaOkkklaoTlaoTlaoT AanaMdAanaMdAanaMd discussed how the discourse annotations can k1 enrich the information provided at the sentence level. While we focused on explicit connectives mammaImammaImammaI in this paper, future work will investigate the Figure 4: Dependency Tree for Subordinator in annotation of implicit connectives, the semantic Example (11) classification of connectives, and the attribution of connectives and their arguments. AaOrAaOrAaOr ccof ccof References Rafiya Begum, Samar Husain, Arun Dhwaj, Dipti KulaaKulaaKulaa AaAaAa ga[ga[ga[--- - Misra Sharma, Lakshmi Bai, and Rajeev Sangal. k7t k1 k1 2008. Dependency annotation scheme for Indian languages. In Proceedings of IJCNLP-2008. tBaItBaItBaI drvaaja,adrvaaja,adrvaaja,a maalaiknamaalaiknamaalaikna Hyderabad, India. Figure 5: Dependency Tree for Coordinating Akshar Bharati, Vineet Chaitanya, and Rajeev Conjunction in Example (16) Sangal. 1995. Natural Language Processing: A Paninian Perspective. Prentice Hall of India. While the sentence-level dependency analysis http://ltrc.iiit.ac.in/downloads/nlpbook/nlppanini.p here is similar to the one we get at the discourse df. level, the semantics of these conjunctions are again underspecified, being all marked as ‘ccof’, Manindra K. Verma (ed.). 1993. Complex Predicates in South Asian Languages. New Delhi: Manohar. and can be obtained from the discourse level. The PDTB-Group. 2006. The Penn Discourse Discourse Adverbials Like subordinating TreeBank 1.0 Annotation Manual. Technical conjunctions, discourse adverbials are Report IRCS-06-01, IRCS, University of represented lexically in the dependency tree. Pennsylvania. They are attached to the verb of their clause as Bonnie Webber, Aravind Joshi, Matthew Stone, and its child node and their denoted semantic Alistair Knott. 2003. Anaphora and discourse relation is specified clearly. This can be seen structure. Computational Linguistics, 29(4):545– with the temporal adverb ‘tBaI’ (‘then-only’) and 587. its semantic label ‘k7t’ in Figure 5. At the same Nianwen Xue. 2005. Annotating Discourse time, since the Arg1 discourse argument of Connectives in the Chinese Treebank. In adverbials is most often in the prior context, the Proceedings of the ACL Workshop on Frontiers in discourse annotation will enrich the semantics of Corpus Annotation II: Pie in the Sky. Ann Arbor, these connectives by providing the Arg1 Michigan. argument. Deniz Zeyrek and Bonnie Webber. 2008. A Discourse Resource for Turkish: Annotating 5 Summary and Future Work Discourse Connectives in the METU Corpus. In Proceedings of IJCNLP-2008. Hyderabad, India. In this paper, we have described our study of discourse connectives in a small corpus of Hindi texts in an effort towards developing an annotated corpus of discourse relations in Hindi. Adopting the lexically grounded approach of the Penn Discourse Treebank, we have identified a

80 The 6th Workshop on Asian Languae Resources, 2008

A Semantic Study on Yami Ontology in Traditional Songs

Yin-Sheng Tai D. Victoria Rau Meng-Chien Yang Providence University, Providence University, Providence University, Taiwan Taiwan Taiwan

[email protected] [email protected] [email protected]

Abstract “Raods” (traditional songs) play an important role in culture because they subsume features such as The purpose of this study was to provide an archaism, metaphors, puns and polite contradiction. example of how to build a Yami ontology from In addition, Yami traditonal songs reflect Yami traditional songs by employing Protégé, an values, such as love and honoring hard work (e.g. open-source tool for editing and managing fishing or farm work), and cultural events, such as ontologies developed by Stanford University. completion of hard work and special festivals. Following Conceptual Blending Theory Thus, we would like to build an ontology using (Fauconnier and Turner, 1998), we found that Yami people use the conceptual metaphor of Yami traditional songs, adapting the taxonomies in “fishing” in traditional songs when praising the WordNet and SUMO, which can serve as a point of host’s diligence in a ceremony celebrating the departure for further mapping of other Yami completion of a workhouse. The process of ontologies. In this paper, we will report our building ontologies is explored and illustrated. preliminary results, giving one example at this The proposed construction of an ontology for early stage of the research project on constructing Yami traditional songs can serve as a Yami ontology, led by the second and third fundamental template, using the corpus authors. available online from the Yami documentation website (http://yamiproject.cs.pu.edu.tw/yami) to build ontologies for other domains. 2 Literature Review 2.1 Conceptual Metaphor Theory 1 Introduction The original Conceptual Metaphor Theory was Yami is an endangered Austronesian language, proposed by Lakoff and Johnson (1980). They spoken on Orchid Island (Lanyu), 46 kilometers identify metaphor as a transfer between the source southeast of the main island of Taiwan. For the domain and the target domain. This has become purposes of language documentation and known as the “two-domain theory” of metaphor. preservation, an on-line Yami dictionary1 has been 2.2 Conceptual Blending Theory developed to facilitate language learning. Although Conceptual Blending Theory (Fauconnier and each lexical entry contains basic meanings of Turner, 1998; 2002) is a framework for interpreting words (in both English and Chinese), cognitive linguistic phenomena such as analogy, pronunciations, and roots and affixes, no metaphor, etc. According to Conceptual Blending information on lexical semantics, such as Theory, the input structures, generic structures, and synonyms, hyponyms, or metaphors is available. If blend structures in the network are mental spaces. information on lexical relationships could be In Figure 1, the frame structure recruited to the incorporated into the Yami dictionary, this online mental space is represented as a rectangle either tool would be even more useful for Yami language outside or iconically inside the circle. learners. In the present study, we focused on the metaphors in Yami lyrics. Knight (2005) considers that, from a Yami native speaker’s point of view,

1 Available from the following IP address: http://yamiproject.cs.pu.edu.tw/elearn/search.php Figure 1. Conceptual Blending Theory

81 The 6th Workshop on Asian Languae Resources, 2008

2.3 Protégé “Now, the old man reached the harbor.” In this study, we are building a Yami ontology 2 ji na minatokod Jicamongan ta based on traditional songs using Protégé2, which NEG already reached PLN because “ not only provides a rich set of basic knowledge- He didn’t paddle to Jicamongan.” modeling structures and a way to enter data, but 3 kalagarawan am paneneneban o ... can also be customized to create new domains in fingerling_place TOP shallow_sea NOM ... “ knowledge models. He moved in the shallow sea where only fingerling As demonstrated in previous studies (e.g., fish live.” Dodds, 2005; Lin 2006), Protégé has been 4 to na rana avavangi sia ta AUX 3.S.GEN already row a boat there because successfully applied to construction of ontology in “ specific domains. Therefore, the present study used He could only row his boat there” Protégé to construct an ontology of Yami 5 ji na rana voaz o kakaod NEG 3.S.GEN already row NOM paddle traditional songs. “Because he had already lost strength to row.” 3. Methodology Seven metaphorical tokens related to “fishing” 3.1 Data Collection were detected from the lyric. They are marked in The main data resource came from Dong’s bold case. In Line 1, minangyid “reached the monograph (1994) “In Praise of Taro,” which harbor” was identified as metaphorical because contains a total of 250 songs. This study is based although it literally describes going back home on one song which contains many fishing after finishing one’s fishing, its intended meaning metaphors. Seven metaphorical tokens were is “to rest and hold a ceremony celebrating the extracted from this song. completion of a workhouse.” Using the Conceptual 3.2 Data Analysis Metaphor Theory (1980), we compared the First of all, the question of the use of metaphor cognitive activities in the lyrics and the mental in Yami was analyzed by Lakoff & Johnson’s space (Table 1). The entities, quality, and functions Conceptual Metaphor Theory (1980) and in the domain of fishing were analyzed. Fauconnier and Turner’s Conceptual Blending Theory (1998). Secondly, two taxonomic tenors Table 1. Reached the harbor vs. Holding a were identified using WordNet (Fellbaum, 1999) ceremony celebrating the completion of a and Yami Texts with Reference Grammar and workhouse Dictionary (Rau & Dong, 2006). Based on these taxonomic tenors, Yami words were classified into “Verbs” and “Nouns”.

4. Results and Discussion 4.1 Conceptual Blending In the following discussion, we begin with an analysis of a traditional Yami song celebrating the completion of a workhouse with a harvest of taro. It mostly praises the host’s achivement and hard work. After praising the host’s achivement, the guests take all the host’s taros and cover the roof of the workhouse with them. Finally, the guests sing songs with the host in turn. Table 1 shows that Yami people prefer to use The lyric is illustrated as follows: “fishing” as a metaphor for the intended meanings 1 oya rana minangyid siapen rarakeh of building a house or farming. We further this already reached_the_harbor grandfather old employed the Conceptual Blending Theory (1998) to interpret the “harbor” example (see Figure 2). 2 Protégé is a free, open source ontology editor and knowledge-base framework. The IP address of Protégé is: http://protege.stanford.edu/.

82 The 6th Workshop on Asian Languae Resources, 2008

elements projected from Input space II. As a result, “the host of the completion ceremony of a workhouse” is the “fisherman” of the “workhouse ceremony;” the relation between “the host” and the “workhouse ceremony” is that of “finishing a time- consuming job.” Additionally, “taros” in the yard are compared with “fish” at the harbor, which await to be “shared with friends and relatives.” 4.2 Taxonomy Yami verbs subsume dynamic verbs and stative verbs (Rau and Dong, 2006). Based on the notions from WordNet, we further divided verbs into Bodily Function and Care Verbs, Change Verbs, Communication Verbs, Competition Verbs, Consumption Verbs, Contact Verbs, Cognition Figure 2. “Reached the harbor” vs. “Holding a Verbs, Creation Verbs, Motion Verbs, Emotion or ceremony celebrating the completion of a Psych Verbs, Stative Verbs, Perception Verbs, workhouse” Possession Verbs, Social Interaction Verbs, and Weather Verbs. Since Yami does not possess a The concept of “reached the harbor” is distinctive adjective word class, both descriptive categorized into Input space I, and “holding a and relational adjectives are in this study classified ceremony celebrating the completion of a under stative verbs in WordNet. Descriptive workhouse” is categorized into Input space II. Both adjectives subsume antonymy, gradation, Inputs have certain similarities as well as distinct markedness, polysemy and selectional preferences, features. From Input Space I, “fisherman,” “boat,” reference-modifying adjectives, color adjectives, “fish,” “harbor,” and “to share his achievements quantifiers, and participial adjectives. Relational and honor with his friends and relatives” are adjectives include two domains, “pertaining” or respectively mapped into “host/builder,” “new “relating to” (Fellbaum 1999: 63). The coding of workhouse,” “taro,” “yard,” and “to share his adjectives in this file is different from that of achievements and honor with his friends and descriptive adjectives. Rather than being part of a relatives,” in Input Space II. The Inputs might cluster, each synset is entered individually, so that share some cross-mapping properties, which can be the interface will present the adjective with its listed in the Generic space. The structure from the related noun and information about the sense of the two input mental spaces is projected into the Blend noun. space. Essentially, which elements from Input For the aspect of Yami Nouns, we categorized space II should be selected and projected onto the the nouns into 10 basic noun categories following blend space are determined by the contents of the WordNet (Fellbaum, 1999: 30), including entity, lyric. Thus, in the blend space, all elements remain abstraction, psycho-feature, natural phenomena, separate from their corresponding counterparts, but activity, event, group, location, possession, and the relations among the features in Input space I state. determines the relations between corresponding 4.3 Example from the Yami Lyric counterparts. That is to say, the running structure in The following example illustrates how we the blend space partially projected from Input extracted the metaphorical words of “fishing” from space I determines the existing relations among the the lyric and classfied them into their domains. elements in the blend. In Input space I, a Firstly, minangyid “reached the harbor,” avavangi “fisherman” needs a “boat” to fish in the ocean, so “row, sail something” and voaz “row something,” the relation between “fisherman” and “boat” is a were classified as Motion Verbs. Secondly, kind of “earning a living.” In addition, what a Jicamongan “a place name of deep water,” “fisherman” works for is “fish.” After the kalagarawan “a place where fingerling fish swim” fisherman finishes his work, he has to go back to and paneneneban “shallow place,” were the harbor. Such relations also operate among those categorized under the main section Location and

83 The 6th Workshop on Asian Languae Resources, 2008

the sub-section Sea. Finally, kakaod “ paddle” was meanings of each word, the ontology builder can classified in the main section Entity and the sub- check the ontology with the DL Query Tab in section Artifact, as shown in Figure 3 and Figure 4. Protégé. In summary, the connection between the two items minangyid, “reached the harbor” and “to hold a new workhouse celebration,” shown in shadow in Figure 4, is solely based on the structure of the inputs, since each of them are from a different domain of verbs. This structure is in harmony with Yami custom and contextual structure.

5 Conclusion This paper has provided an example of how to Figure 3. An example of four Yami nouns in the construct an ontology of Yami using traditional ontology of Yami lyrics songs. We hope this approach will serve as a fundamental template for further mappings with more texts to produce other Yami ontologies.

References Dodds, D. 2005. Qualitative geospatial processing, ontology and spatial metaphor. presented at the GML and Geo-Spatial Web Figure 4. An example of three Yami verbs in the Services. ontology of Yami lyrics Dong, M. N. 1995. In Praise of Taro. Dao-Xiang 4.4 Mapping metaphorical words Publisher. To classify Yami metaphor, we made links Fauconnier, G. & Turner,M. 1998. Conceptual (equivalent classes) between the literal meaning Integration Networks. Cognitive Science, 22 and the metaphorical meaning using Protégé. Since (2), 133-187. Input space I usually works as the source domain Fellbaum,C. 1999. WordNet:an Electronic and Input space II as the target domain, we found Lexical Database. Cambridge,Mass.: MIT that, at least in this case, this is a one-sided network Press. of metaphor mapping. We thus employed the Knight, P., & Lu Y. H.. 2005. Music heritage of property “mappings” to define the relationship the oral traditions by meykaryag of the Tao between “reached the harbor” and “to hold a new tribe. Paper presented at the 2005 workhouse celebration.” Moreover, in order to set International Forum of Ethnomusicology in up the restrictions for searching words, we added Taiwan: Interpretation and Evolutionof the property “hasConceptOf”. We then named the Musical Sound. Taipei: Soochow University. class expression of “reached the harbor” with the Lakoff, G. & Johnson, M. 1980. Metaphors We following restriction: Live By. Chicago: University of Chicago has_concept_of some3 (harbor and4 fish and Press. fisherman and boat and share) Lin, J.H. 2006. Using formal concept analysis Similarly, we also provided the phrase “to hold to construct the computer virus a new workhouse celebration” with the following characteristics domain ontology. MA thesis, restriction: National Yunlin University of Science & has_concept_of some (host and builder and Technology, Taiwan. workhouse and share and taro and yard) Rau, D.Victoria & Dong, M. N. 2006 Yami Texts Finally, for the sake of correctly linking with Reference Grammar and Vocabulary. 3 This refers to the existential quantifier ( ) in OWL Language and Linguistics, Academia Sinica, syntax, which can be read as at least one, or some. Taipei, Monograph A-10. 4 An intersection class is described by combining two or more classes using the AND operator ( ).

84 The 6th Workshop on Asian Languae Resources, 2008

ASSESSMENT AND DEVELOPMENT OF POS TAG SET FOR TELUGU

Dr.Rama Sree R.J Dr.Uma Maheswara Rao G Dr. Madhu Murthy K.V Rashtriya Sanskrit Central University S.V.U.College of Vidyapeetha, Hyderabad Engineering Tirupati [email protected] Tirupati [email protected] [email protected]

ABSTRACT 1995). A number of tag sets have been evolved for a number of languages. These tag sets not In this paper, we first had a overall study of only differ with each other from language to existing POS tag sets for European and Indian language, but vary within the language itself. languages. Till now, most of the research The reasons for the variation of tags in the tag done on POS tagging is for English. We sets are as follows. As taggers give additional observed that even though the research on information like grammatical features such as POS tagging for English is done exhaustively, number, gender, person, case markers for noun part-of-speech annotation in various research inflections; tense markers for verbal applications is incomparable which is inflections, the number of tags used by variously due to the variations in tag set different systems varies depending on the definitions. We understand that the morpho- information encoded in the tag. However the syntactic features of the language and the tag set design plays a vital role when data is degree of desire to represent the granularity of tagged according to it and hence it affects the these morpho-syntactic features, domain etc., development of NLP application tools within decide the tags in the tag set. We then and across that language. Language examined how POS tagset design has to be independent representation of a tag set help to handled for Indian languages, taking Telugu find out the hidden information like context, language into consideration. structure, syntactic and semantic aspect of the word. It also gives an overview of language modeling features. 1. Introduction 2. Desirable Features of a Tag Set Annotation is the process of adding some additional information (grammatical features Unfortunately, there does not seem to be much like word category, case indicator, other literature on standard tag set design. There is a morph features) about the word to each word need to have standard tag set labels for the of the text. This additional information is words to encode the same linguistic called a tag. The set of all these tags is called information across the languages. The tag set a tag set. When words are considered in labels of a given language should satisfy the isolation, they can have one or more number following characteristics. of tags for each word. But when these words (1) The words carrying same syntactic, are used in a certain context, the tags categorical information should be grouped representing morphological and syntactic under the same tag. For example, all feature reduce to one tag. The information to adjectives should be tagged as JJ. be captured as a tag is an application specific issue (Anne,1997, David, 1994 and David,

85 The 6th Workshop on Asian Languae Resources, 2008

(2) The words which have same syntax and shown in the above table). In British come under different categories should corpus, there may be a possibility of the clearly be distinguished depending on the presence of the test corpus where more categorical sense in which it is used in the number of words are borrowed from other given context. For example, the word languages into English. book can be tagged both as noun (NN) (iii) Desire for precision: The reason for and Verb (VB). more number of tags in a tag set is to (3) The tag set should also help us to classify precisely capture all linguistic criteria and predict the sense, category of the which describe morpho-syntactic features unknown and foreign words. For example, in detail. However, there should be a consider the sentence, “Give it to xyxxy”, balance between theoretical and actual POS tagger should be in a position to distribution of these syntactic features. predict xyxxy (or any non-sensical string) 4. Tag Sets for Indian Languages could be a noun. The two POS tag sets developed for Hindi 3. Sources of Variations among POS Tag (revised on Nov 15, 2003) and Telugu by IIIT, Sets for English Hyderabad and CALTS, Hyderabad In order to identify the reasons for tag set respectively are examined and the following variations for English, the tag sets viz., the points are observed. Penn Treebank (Mitchel, 1993) tag set(PT), Telugu POS tag set contains more number of UCREL CLAWS7 tag set (UCREL_C7), the POS tag labels. This difference is due to the International Corpus of English (ICE) tag set reason that Telugu is more inflective than (Greenbaum, 1992) and the Brown Corpus Hindi. In Hindi nouns are non-inflectional. (BC) tag set (Green, 1997) for English are Karaka roles are not encoded in Hindi noun examined; the POS tag labels are extracted for word forms as in Telugu. Similarly main some important morpho-syntactic features verbal roots appear as non-inflective in Hindi. and studied to demonstrate the present study. The verbs co-occur with tense, aspect and After a careful study, the following points modality as separate words whereas aspect were observed with regard to the differences and modality are packed into a single verbal in POS tag sets. inflection word in Telugu. For example, (i) Desire to capture more semantic content: consider the following sentences. BC, ICE, URCEL tag set are making English: Ram killed Ravana. more subtle distinctions within one Hindi : RAm ne mArA Ravana ko. category than PT. For example, POS tags for adjectives- PT is not making any clear Telugu: RAmudu caMpAdu RAvanunni. distinction for adjectives other than JJ, JJS, JJR, whereas other tag sets are maintaining fine granularity. Such For convenience, the word order is maintained differences can be observed for several as it is in all the three languages. In case of morpho-syntactic features. English language, position gives the roles played by Rama (subject) and Ravana (object). (ii) Corpus Coverage: Depending on the In case of Hindi, case markers ne and ko exist, syntactic distribution of the test corpus but they do not inflect Ram and Ravana. But under consideration, there may be Telugu noun inflections give the information variations. For example, BC tag set made of case markers also. Hence there are a wide provision for foreign words (not differences in the tag labels of Hindi and

86 The 6th Workshop on Asian Languae Resources, 2008

Telugu language tag sets. In order to capture 5. Improvement of Telugu Tag Set these syntactic (more over they are also In addition to the above mentioned tags, some semantic) information, Telugu has more new tags are introduced to capture and number of POS tags (nn1,nn2,nn3, provide finer discrimination of the semantic nn4,nn5,nn6,nn7) in the place of a single tag content of some of the linguistic expressions a (nn) of Hindi. corpus of 12,000 words. They are explained The POS tags of Telugu are described below briefly in the succeeding paragraphs. in detail. (a) Verbal finite negative : Some words like (i) Nouns (nAma vAcakAlu- nn) :These tags kAxu ( ), lexu ( ) are verbal finites but capture the nouns and their roles played in the �ాదు ల�దు sentence. The different tags in the subclass are they give the negative meaning of the verbal nn1,nn2,nn3,nn4,nn5, nn6 and nn7. action. If they are tagged simply as vf, it is Depending on the vibhakti, the nouns get the undestood that some action has taken place. number label to main class, i.e., nn based on But these words are used in negative sense. the karaka relations. The tag nni stands for In order to capture this feature, we have noun oblique form indicating that the noun is labelled them as vng. in a position to get attached with the (b) Verbal nouns wih vibhakti: Verbal succeeding noun inflection. nouns behave in the same way as nouns do, in (ii) Locative affixes (swAna vAcakAlu – nl) forming their inflections with vibhaktis like ( :Here some locative prepositions combined AdataM- ఆడటం), Adatanni-(ఆడట��న), with the six vibhaktis are listed as nl1 (pEna- Adatamcewa ( etc. At present they ఆడటం �ేత ) , nl4 (pEki- ), nl5 (pEnuMdi- ), ��ౖన) ��ౖ�� ��ౖనుం�� are labelled as nn1, nn2, nn3 etc. depening on nl6 (pEni- ) etc. the affix. In doing so, the semantic content of ��ౖ� verb is lost. This would lead to difficulties in (iii) Prepositions (Vibhakti – pp):So me t i me s disambiguating words at the semantic level. prepositions can occur independently. For Hence the introduction of POS tags like vnn1, example, varaku ( . Hence all vibhaktis vnn2 etc is proposed. వరక�) are labelled as pp1,pp2 etc. (c) Words expressing doubts: There are linguistic expressions that express the (iv) Pronouns (sarva nAmAlu - pr) :Like doubtfulness as explained below. nouns, all pronouns form inflections with vibhaktis. Accordingly they are named as pr1, Doubtfulness of : pr2 etc. (i) Verbal finites:Words like uMxo ( ) ఉం�ో (v) Adjectives (Visheshana) : Special type of vunnavo ( etc., express the adjectives like Verbal adjectives ( kriya ఉనన్�) visheshana) as vjj, Nominal adejctives doubtfullness of the occurrence of action. (saMjna viseshana) as jj and noninfinitive To capture this semantic discrimination, verbal adjectives (sahAyaka asamapaka kriya) POS tag vfw is introduced. Previously they as ajj etc. are labelled as vf. (vi) Other syntactic categories : The tags (ii) Nouns: Words which express the for other syntactic categories like quantifiers doubtfulness a noun participation in the as qf, negative meanings as ng etc., are given. action like rAmudo ( ), axo ( �ామ��ో అ�ో)

87 The 6th Workshop on Asian Languae Resources, 2008

etc. Instead of labelling them nn1, they are References: labelled them with the tag nnw. Anne Schiller, Simone Teufel, Christine The above mentioned improvements made to Thielen. 1995. Guidelines für das the existing POS tag sets and the advantages Tagging deutscher Textcorpus mit thereof are as follows. STTS. Universitäten Stuttgart und (i) A finer discrimination is made. For Tübingen. example consider vfw. In the absence of David Elworthy. 1994. Automatic error this tag, the verbal inflections which end detection in part of speech tagging. In with lexu ( ) could be tagged as vf. Proceedings of the International ల�దు Conference on New Methods in Language Due to this, the verbal inflections which Processing, Manchester. are completed can be clearly distinguished David Elworthy. 1995. Tagset Design and from theose verbal inflections where Inflected Languages. In Proceedings of action has not been completed. the ACL SIGDAT Workshop, Dublin. (ii) vnn tag captures more information that Mitchell P Marcus, Beatrice Santorini, the noun present in the verbal inflection is Mary Ann Marcinkiewicz. 1993. just a simple common noun. In the Building a Large Annotated Corpus of absence of this tag, words erroneously English: The Penn Treebank. labelled as nn to which it does not really Co mp utational Linguistics. Volume 19, belong. So these tags accurately capture Number 2, pp. 313--330 (Special Issue on the information present in the words. Using Large Corpus). Greenbaum S. 1992. The ICE tag set 6. Conclusion manual. University College London. It is strongly felt that all Indian languages Green B, Rubun G. 1971. Automated should have the same tag set so that the Grammatical Tagging of English. In annotated corpus in corresponding languages Department of Linguistics, Brown may be useful in cross lingual NLP University. applications, reducing much load on language to language transfer engines. This point can be well explained by taking analogy of existing script representation for Indian Languages. The ISCII and Unicode representations for all Indian languages can be viewed appropriately in the languages we like, just by setting their language code. There is no one-to-one alphabet mapping in the scripts of Indian Languages. For example, the short e,o ( ఏ,ఒ) are present in Telugu, while they are not available in Hindi, Sanskrit etc. Similarly alphabet variations between Telugu and Tamil exist. Even then, all these issues are taken care of, in the process of language to language script conversion. Similarly POS variations across Indian Languages also should be taken care of.

88 The 6th Workshop on Asian Languae Resources, 2008

Designing a Common POS-Tagset Framework for Indian Languages

Sankaran Baskaran, Microsoft Research India. Bangalore. [email protected] Kalika Bali, Microsoft Research India. Bangalore. [email protected] Tanmoy Bhattacharya, Delhi University, Delhi. [email protected] Pushpak Bhattacharyya, IIT-Bombay, Mumbai. [email protected] Girish Nath Jha, Jawaharlal Nehru University, Delhi. [email protected] Rajendran S, Tamil University, Thanjavur. [email protected] Saravanan K, Microsoft Research India, Bangalore. [email protected] Sobha L, AU-KBC Research Centre, Chennai. [email protected] Subbarao K V. Delhi. [email protected]

more detailed morphosyntactic features of these Abstract languages. However, most of the tagsets for ILs are lan- Research in Parts-of-Speech (POS) tagset guage specific and cannot be used for tagging data design for European and East Asian lan- in other language. This disparity in tagsets hinders guages started with a mere listing of impor- interoperability and reusability of annotated corpo- tant morphosyntactic features in one lan- ra. This further affects NLP research in resource guage and has matured in later years to- poor ILs where non-availability of data, especially wards hierarchical tagsets, decomposable tagged data, remains a critical issue for researchers. tags, common framework for multiple lan- Moreover, these tagsets capture the morphosyntac- guages (EAGLES) etc. Several tagsets tic features only at a shallow level and miss out the have been developed in these languages richer information that is characteristic of these along with large amount of annotated data languages. for furthering research. Indian Languages The work presented in this paper focuses on de- (ILs) present a contrasting picture with signing a common tagset framework for Indian very little research in tagset design issues. languages using the EAGLES guidelines as a mod- We present our work in designing a com- el. Though Indian languages belong to (mainly) mon POS-tagset framework for ILs, which four distinct families, the two largest being Indo- is the result of in-depth analysis of eight Aryan and Dravidian, as languages that have been languages from two major families, viz. in contact for a long period of time, they share sig- Indo-Aryan and Dravidian. Our framework nificant similarities in morphology and syntax. follows hierarchical tagset layout similar to This makes it desirable to design a common tagset the EAGLES guidelines, but with signifi- framework that can exploit this similarity to facili- cant changes as needed for the ILs. tate the mapping of different tagsets to each other. This would not only allow corpora tagged with

1 Introduction different tagsets for the same language to be reused A POS tagset design should take into consideration but also achieve cross-linguistic compatibility be- all possible morphosyntactic categories that can tween different language corpora. Most important- occur in a particular language or group of languag- ly, it will ensure that common categories of differ- es (Hardie, 2004). Some effort has been made in ent languages are annotated in the same way. the past, including the EAGLES guidelines for In the next section we will discuss the impor- morphosyntactic annotation (Leech and Wilson, tance of a common standard vis-à-vis the currently 1996) to define guidelines for a common tagset available tagsets for Indian languages. Section 3 across multiple languages with an aim to capture will provide the details of the design principles

89 The 6th Workshop on Asian Languae Resources, 2008

behind the framework presented in this paper. Ex- guage particular categories without disregarding amples of tag categories in the common framework the shared traits of the Indian languages. will be presented in Section 4. Section 5 will dis- cuss the current status of the paper and future steps 3 Design Principles envisaged. Whilst several large projects have been concerned 2 Common Standard for POS Tagsets with tagset development very few have touched upon the design principles behind them. Leech Some of the earlier POS tagsets were designed (1997), Cloeren (1999) and Hardie (2004) are for English (Greene and Rubin, 1981; Garside, some important examples presenting universal 1987; Santorini, 1990) in the broader context of principles for tagset design. automatic parsing of English text. These tagsets In this section we restrict the discussion to the popular even today, though designed for the same principles behind our tagset framework. Important- language differ significantly from each other mak- ly, we diverge from some of the universal prin- ing the corpora tagged by one incompatible with ciples but broadly follow them in a consistent way. the other. Moreover, as these are highly language Tagset structure: Flat tagsets just list down the specific tagsets they cannot be reused for any other categories applicable for a particular language language without substantial changes this requires without any provision for modularity or feature standardization of POS tagsets (Hardie 2004). reusability. Hierarchical tagsets on the other hand Leech and Wilson (1999) put forth a strong argu- are structured relative to one another and offer a ment for the need to standardize POS tagset for well-defined mechanism for creating a common reusability of annotated corpora and interopera- tagset framework for multiple languages while bility across corpora in different languages. providing flexibility for customization according to EAGLES guidelines (Leech and Wilson 1996) the language and/ or application. were a result of such an initiative to create stan- Decomposability in a tagset allows different fea- dards that are common across languages that share tures to be encoded in a tag by separate sub-stings. morphosyntactic features. Decomposable tags help in better corpus analysis Several POS tagsets have been designed by a (Leech 1997) by allowing to search with an un- number of research groups working on Indian derspecified search string. Languages though very few are available publicly In our present framework, we have adopted the (IIIT-tagset, Tamil tagset). However, as each of hierarchical layout as well as decomposable tags these tagsets have been motivated by specific re- for designing the tagset. The framework will have search agenda, they differ considerably in terms of three levels in the hierarchy with categories, types morphosyntactic categories and features, tag defi- (subcategories) and features occupying the top, nitions, level of granularity, annotation guidelines medium and the bottom layers. etc. Moreover, some of the tagsets (Tamil tagset) What to encode? One thumb rule for the POS are language specific and do not scale across other tagging is to consider only the aspects of morpho- Indian languages. This has led to a situation where syntax for annotation and not that of syntax, se- despite strong commonalities between the lan- mantics or discourse. We follow this throughout guages addressed resources cannot be shared due and focus only on the morphosyntactic aspects of to incompatibility of tasgets. This is detrimental to the ILs for encoding in the framework. the development of language technology for Indian Morphology and Granularity: Indian languag- languages which already suffer from a lack of ade- es have complex morphology with varying degree quate resources in terms of data and tools. of richness. Some of the languages such as those of In this paper, we present a common framework the Dravidian family also display agglutination as for all Indian languages where an attempt is made an important characteristic. This entails that mor- to treat equivalent morphosyntactic phenomena phological analysis is a desirable pre-process for consistently across all languages. The hierarchical the POS tagging to achieve better results in auto- design, discussed in detail in the next section, also matic tagging. We encode all possible morphosyn- allows for a systematic method to annotate lan- tactic features in our framework assuming the exis-

90 The 6th Workshop on Asian Languae Resources, 2008

tence of morphological analysers and leave the II. Recommended attributes or values are recog- choice of granularity to users. nised to be important sub-categories and fea- As pointed out by Leech (1997) some of the tures common to a majority of languages. linguistically desirable distinctions may not be III. Special extensions1 feasible computationally. Therefore, we ignore a. Generic attributes or values certain features that may not be computationally b. Language-specific attributes or values are feasible at POS tagging level. the attributes that are relevant only for few lan- Multi-words: We treat the constituents of Mul- guages and do not apply to most languages. ti-word expressions (MWEs) like Indian Space All the tags were discussed and debated in detail Research Organization as individual words and tag by a group of linguists and computer scien- them separately rather than giving a single tag to tists/NLP experts for eight Indian languages viz. the entire word sequence. This is done because: Bengali, Hindi, Kannada, Malayalam, Marathi, Firstly, this is in accordance with the standard Sanskrit, Tamil and Telugu. practice followed in earlier tagsets. Secondly, Now, because of space constraints we present grouping MWEs into a single unit should ideally only the partial tagset framework. This is just to be handled in chunking. illustrate the nature of the framework and the com- Form vs. function: We try to adopt a balance plete version as well as the rationale for different between form and function in a systematic and categories/features in the framework can be found consistent way through deep analysis. Based on in Baskaran et al. (2007).2 our analysis we propose to consider the form in In the top level the following 12 categories are normal circumstances and the function for words identified as universal categories for all ILs and that are derived from other words. More details on hence these are obligatory for any tagset. this will be provided in the framework document (Baskaran et al 2007) 1. [N] Nouns 7. [PP] Postpositions Theoretical neutrality: As Leech (1997) points 2. [V] Verbs 8. [DM] Demonstratives out the annotation scheme should be theoretically 3. [PR] Pronouns 9. [QT] Quantifiers neutral to make it clearly understandable to a larger 4. [JJ] Adjectives 10. [RP] Particles group and for wider applicability. 5. [RB] Adverbs 11. [PU] Punctuations Diverse Language families: As mentioned ear- 6. [PL] Participles 12. [RD] Residual3 lier, we consider eight languages coming from two major language families of India, viz. Indo-Aryan The partial tagset illustrated in Figure 1 high- and Dravidian. Despite the distinct characteristics lights entries in recommended and optional catego- of these two families, it is however striking to note ries for verbs and participles marked for three le- the typological parallels between them, especially vels.4 The features take the form of attribute-value in syntax. For example, both families follow SOV pairs with values in italics and in some cases (such pattern. Also, several Indo-Aryan languages such as case-markers for participles) not all the values as Marathi, Bangla etc. exhibit some agglutination, are fully listed in the figure. though not to the same extent of Dravidian. Given the strong commonalities between the two families 5 Current Status and Future Work we decided to use a single framework for them In the preceding sections we presented a common 4 POS Tagset Framework for Indian lan- framework being designed for POS tagsets for In- guages dian Languages. This hierarchical framework has The tagset framework is laid out at the following 1 four levels similar to EAGLES. We do not have many features defined under the special I. Obligatory attributes or values are generally extensions and this is mainly retained for any future needs. 2 Currently this is just the draft version and the final version universal for all languages and hence must be will be made available soon included in any morphosyntactic tagset. The 3 For words or segments in the text occurring outside the gam- major POS categories are included here. bit of grammatical categories like foreign words, symbols,etc. 4 These are not finalised as yet and there might be some changes in the final version of the framework.

91 The 6th Workshop on Asian Languae Resources, 2008

three levels to permit flexibility and interoperabili- Nouns Postpositions ty between languages. We are currently involved in a thorough review of the present framework by Verbs Demonstratives using it to design the tagset for specific Indian lan- guages. The issues that come up during this Pronouns Quantifiers process will help refine and consolidate the Level - 1 framework further. In the future, annotation guide- Adjectives Particles lines with some recommendations for handling ambiguous categories will also be defined. With Adverbs Punctuations the common framework in place, it is hoped that researchers working with Indian Languages would Participles Residual be able to not only reuse data annotated by each other but also share tools across projects and lan- guages. Type Type References Finite General Auxiliary Adjectival Baskaran S. et al. 2007. Framework for a Common Infinitive Verbal Parts-of-Speech Tagset for Indic Languages. (Draft) Non-finite Nominal http://research.microsoft.com/~baskaran/POSTagset/ Nominal Gender

Gender As in verbs Cloeren, J. 1999. Tagsets. In Syntactic Wordclass Tagging, ed. Hans van Halteren, Dordrecht.: Kluwer Academic. Masculine Number Feminine Singular Hardie, A . 2004. The Computational Analysis of Morpho- Neuter Plural syntactic Categories in Urdu. PhD thesis submitted to Number Dual Lancaster University. Singular Case Greene, B.B. and Rubin, G.M. 1981. Automatic grammati- Plural/Hon. Direct cal tagging of English. Providence, R.I.: Department of Dual Oblique Linguistics, Brown University Honourific Case-markers Person Ergative Garside, R. 1987 The CLAWS word-tagging system. In First Accusative The Computational Analysis of English, ed. Garside, Second etc. Leech and Sampson, London: Longman. Third Tense Leech, G and Wilson, A. 1996. Recommendations for the Tense As in verbs Morphosyntactic Annotation of Corpora. EAGLES Re- Past Negative port EAG-TCWG-MAC/R. Present Leech, G. 1997. Grammatical Tagging. In Corpus Annota- Future tion: Linguistic Information from Computer Text Cor- Negative Level - 2 pora, ed: Garside, Leech and McEnery, London: Long- man Leech, G and Wilson, A. 1999. Standards for Tag-sets. In Syntactic Wordclass Tagging, ed. Hans van Halteren, Aspect Dordrecht: Kluwer Academic. Perfect Imperfect Santorini, B. 1990. Part-of-speech tagging guidelines for Progressive the Penn Treebank Project. Technical report MS-CIS- 90-47, Department of Computer and Information Mood Science, University of Pennsylvania Declarative Level - 3 Subjunctative/ IIIT-tagset. A Parts-of-Speech tagset for Indian languages. Hortative http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines. Conditional pdf Imperative Tamil tagset. AU-KBC Parts-of-Speech tagset for Tamil. Presumptive http://nrcfosshelpline.in/smedia/images/downloads/Tam il_Tagset-opensource.odt Fig-1. Tagset framework - partial representation

92 The 6th Workshop on Asian Languae Resources, 2008

Resources Report on Languages of Indonesia

Hammam Riza IPTEKNET Agency for the Assessment and Application of Technology (BPPT) Jakarta, Indonesia [email protected]

for 69.91% of the total population – Javanese Abstract (75,200,000 speakers), Sundanese (27,000,000), Malay (20,000,000), Madurese (13,694,000), Mi- In this paper, we report a survey of lan- nangkabau (6,500,000), Batak (5,150,000), Bugis- guage resources in Indonesia, primarily of nese (4,000,000), Balinese (3,800,000), Acehnese indigenous languages. We look at the offi- (3,000,000), Sasak (2,100,000), Makasarese cial Indonesian language (Bahasa Indone- (1,600,000), Lampungese (1,500,000), and Rejang sia) and 726 regional languages of Indone- (1,000,000). (Lauder, 2004: 3-4). Of these 13 lan- sia (Bahasa Nusantara) and list all the guages, only 7 languages have presence on the available LRs that we can gathered. This Internet (Riza 2006). paper suggests that the smaller regional The remaining 713 languages have a total popu- languages may remain relatively unstudied, lation of only 41.4 million speakers, and the major- and unknown, but they are still worthy of ity of these have very small numbers of speakers. our attention. Various LRs of these endan- For example, 386 languages are spoken by 5,000 gered languages are being built and col- or less; 233 have 1,000 speakers or less; 169 lan- lected by regional language centers for guages have 500 speakers or less; and 52 have 100 study and its preservation. We will also or less (Gordon, 2005). These languages are facing briefly report its presence on the Internet. various degrees of language endangerment (Crys- tal, 2000).

There is evidence from census data over three 1 Introduction decades that the growth in the numbers of speakers of Indonesian is reducing the numbers of speakers It is not hard to get a picture of just how linguisti- of the indigenous languages (Lauder, 2005). Con- cally diverse Indonesia is. There are 726 languages cerns that this kind of growth would give Indone- in the country; making it the world’s second most sian the potential to replace the regional languages diverse, after Papua New Guinea which has 823 were aired as early as the 1980s. (Poedjosoedarmo, local languages (Martí et al., 2005:48). 1981; Alisjahbana, 1984). The languages of Indonesia are part of a com- plex linguistic situation that is generally seen as comprised of three categories: Indonesian lan- 2 Language Resources guage, the regional indigenous languages, and for- Many language centers in Indonesia have em- eign languages. (Alwi and Sugono, 2000). barked in various research and development in The indigenous languages of Indonesia - also re- creation of language resources (LRs). Unfortu- ferred to as vernaculars or provincial languages, nately, this development mainly only focused on collectively called as Bahasa Nusantara - exhibits creating LRs for the official language Bahasa In- great variation in numbers of speakers. Thirteen of donesia. In the followings, we describe the present them have a million or more speakers, accounting

93 The 6th Workshop on Asian Languae Resources, 2008

and ongoing LRs research projects with emphasis tions; find the best means to disseminate the speci- on the indigenous languages. fications, tools and applications and encourage an open standard-based approach to the creation and 2.1 Indonesian Electronic Dictionary System interchange of LRs. It also demonstrate how MLI (KEBI) can be applied to Asian Language Resource (ALR) Our laboratory has worked to enlarge and improve through making the results of collaborative en- the quality of Indonesian electronic dictionaries. deavors available throughout the members of the Starting 1987, it took us at least 4 years to develop group and wider associations; provide training, all necessary components for CICC-MMTS, result- awareness and educational events and share with ing with many first-ever Indonesian language re- each other their work on related issues. sources, primarily electronic dictionaries and 2.5 Speech Corpus grammar rules for language analysis and genera- tion. This extension have resulted in a collection of In a mission to improve the quality of automatic 500,000 word entries and more than 2 million speech recognition (ASR), a collaboration of derivational and inflected words. As part of this Telkom RDC and ATR-Japan has constructed research, we built an online access to the dictionar- speakers’ corpus (40 speakers, 2000 sentences) ies (http://nlp.inn.bppt.go.id/kebi) enabling users to which is expected to improve the accuracy of ASR add new words and definition. KBBI electronic to 90% level. dictionary is scheduled to be launched in 2008, during 100 years celebration of the official Bahasa 2.6 Other Corpus Indonesia. Other monolingual corpus is found online. The major news articles corpora on the web is Tem- 2.2 BPPT-ANTARA Corpus pointerakif.com (56,471 articles). Kompas corpus This parallel corpus was developed as extension to (71,109 articles) can be found at Indonesia National Corpus Initiative (INCI) which http://ilps.science.uva.nl/Resources/BI. was earlier created to support the development of a hybrid stochastic-symbolic system BIAS-II. Cur- rently, a pure statistical MT system based on Phar- References aoh is developed by BPPT and National News Alisjahbana, S. T. 1984. The problem of minority lan- Agency (ANTARA) using 500K sentences pair, guages in the overall linguistic problems of our time. expected to have better accuracy and robustness In Linguistic Minorities and Literacy: Language Pol- and could enhance the quality of translation (cur- icy Issues in Developing Countries, ed. F. Coulmas. rent BLEU score 0.72). Berlin: Mouton. 2.3 Regional Languages Mapping (National Alwi, Hasan, and Sugono, Dendy. 2000. From National Language Center) Language Politics to National Language Policy. Pro- cedings of the Seminar on Language Politics, Jakarta For the past 15 years, the Indonesia National Lan- guage Center have been collecting information re- Crystal, David. 2000. Language Death. Cambridge: Cambridge University Press. garding all indigenous languages. By the end of this year, this project will be completed and all re- Lauder, Multamia RMT. 2005. Language Treasures in sult and findings will be open to public. Indonesia. In Words and Worlds : World Languages Review, eds. Fèlix Martí et al., 95-97. Clevedon 2.4 Dictionaries of Bahasa Nusantara, Indo- [England] ; Buffalo [N.Y.]: Multilingual Matters. nesian Linguistics Association (MLI) Martí, Fèlix, et.al. eds. 2005. Words and Worlds : World Masyarakat Linguistik Indonesia (MLI) is a group Languages Review. vol. 52. Bilingual Education and of institutions, organizations and corporation, Bilingualism. Clevedon [England] ; Buffalo [N.Y.]: working together on mutually defined goals and Multilingual Matters. projects that seek to provide a specification of LRs Riza, H, et. al. 2006. Indonesian Languages Diversity of all languages of Indonesia. MLI also help mem- on the Internet, Internet Governance Forum (IGF), bers to use the specification for tools and applica- Athens.

94 The 6th Workshop on Asian Languae Resources, 2008

Confirmed Language Resource for Answering How Type Questions Developed by Using Mails Posted to a Mailing List Ryo Nishimura Yasuhiko Watanabe Yoshihiro Okada Ryukoku University, Seta, Otsu, Shiga, 520-2194, Japan r [email protected] [email protected] Abstract In this paper, we report a Japanese language re- source for answering how-type questions. It was developed it by using mails posted to a mailing list. We show a QA system based on this language resource. 1 Introduction In this paper, we report a Japanese language resource for answering how type questions. It was developed by using mails posted to a mailing list and it was given the four types of descriptions: (1) mail type, (2) key Figure 1: The overview of the language resource de- sentence, (3) semantic label, and (4) credibility label. velopment Credibility is a center problem of knowledge acquiti- sion from natural language documents because the doc- uments, including mails posted to mailing lists, often mails were submitted by questioners for reporting contain incorrect information. We describe how to de- the credibility of the solutions which they had re- velop this language resource in section 2, and show a ceived. As a result, solutions described in answer QA system based on it in section 3. mails can be confirmed by using questioner’s re- ply mails. 2 Language resource development • Mails posted to mailing lists generally have key There are mailing lists to which question and answer sentences: These key sentences can be extracted mails are posted frequently. For example, to Vine by using surface clues (Watanabe 05). The sets of Users ML, considerable number of question mails and questions and solutions can be acquired by using their answer mails are posted by participants who are key sentences in question and answer mails. Also, 1 interested in Vine Linux . We intended to use these the solutions are confirmed by using key sen- mails for developing a language resource because we tences in questioner’s reply mails. Furthermore, have the following advantages. key sentences in question mails and their neigh- • It is easy to collect question and answer mails in boring sentences often contain information about various domains: The sets of question and answer conditions, symptoms, and purpose. These kinds mails are necessary to answer how-type questions. of information are useful in specifying user’s un- Many informative mails posted to mailing lists are clear questions. disclosed in the Internet and can be retrieved by Figure 1 shows the overview of the language re- using full text search engines, such as Namazu source development. First, by using reference rela- (Namazu). However, users want a more conve- tions and sender’s email address, mails are classified nient retrieval system than existing systems. into four types: (1) question (Q) mail, (2) direct an- • There are many mails which report the credibility swer (DA) mail, (3) questioner’s reply (QR) mail, and of their previous mails: Answer mails often con- (4) others. DA mails are direct answers to the original tain incorrect solutions. On the other hand, many questions. Solutions are generally described in the DA 1Vine Linux is a linux distribution with a customized Japanese mails. QR mails are questioners’ answers to the DA environment. mails. In the QR mails, questioners often report the

95 The 6th Workshop on Asian Languae Resources, 2008

credibility of the solutions described in the DA mails. Sentences in the Q, DA, and QR mails are transformed into dependency trees by using JUMAN(JMN 05) and KNP(KNP 05). Second, key sentences are extracted from the Q, DA, and QR mails by using (1) nouns used in the mail sub- jects, (2) quotation frequency, (3) clue expressions, and (4) sentence location (Watanabe 05). To evaluate this Figure 2: System overview method, we selected 100 examples of question mails and their DA and QR mails in Vine Users ML. The ac- curacy of the key sentence extraction from the Q, DA, and QR mails were 80%, 88%, and 76%, respectively. We associated (1) the key sentences and the neighbor- ing sentences in the Q mails and (2) the key sentences in the DA mails. We used them as knowledge for an- Figure 3: A set of a question and the answers with a swering how-type questions. 73% of them were coher- positive label retrieved by our system ent explanations. Third, expressions including information about con- dition, symptom, and purpose are extracted from the DA mails, 4272 QR mails, and 24711 others). 8782 key sentences in the Q mails and their neighboring sen- key sentences and their 7330 previous and 8614 next tences by using clue expressions. The results are used sentences were extracted from the Q mails. These sen- for specifying unclear questions. For example, unclear tences were associated with 13081 key sentences ex- question “oto ga denai (I cannot get any sounds)” is tracted from the DA mails and used as knowledge for specified by “saisho kara (symptom: from the begin- answering how-type questions. 3173 key sentences ning) ?” and “kernel no version (condition: which were extracted from the QR mails and the credibility kernel version) ?”, both of which were extracted from labels (2148 positive and 1025 negative) were given to the Q mails through this semantic analysis. The accu- 3127 key sentences in the DA mails. racy of this analysis was 74%. The QA processor transforms user’s question into a dependency structure by using JUMAN(JMN 05) and Finally, positive and negative expressions to the so- KNP(KNP 05). Then, it retrieves similar questions and lutions described in the DA mails are extracted from their solutions by calculating the similarity scores be- the key sentences in the QR mails. The results of this tween user’s question and key sentencecs in the ques- analysis on QR mails are used for giving credibility la- tion mails. It also retrieves expressions including in- bels to the solutions described in the DA mails. The formation about conditions, symptoms, and purpose accuracy of this analysis was 76%. which seem to be useful in specifying user’s questions. The user interface enables a user to access to the sys- 3 QA system based on the language resource tem via a WWW browser by using CGI-based HTML Figure 2 shows the overview of our system based on forms. It puts the answer threads in order of similarity the language resource. A user can ask a question to the score. system in natural language. Then, the system retrieves References similar questions and their solutions, and it shows the credibility of these solutions by using their credibility Namazu: a Full-Text Search Engine, http://www.namazu.org/ labels. Figure 3 shows an example where our system Watanabe, Nishimura, and Okada: Confirmed Knowledge gave an answer to user’s question, “IP shutoku dek- Acquisition Using Mails Posted to a Mailing List, IJC- inai (I cannot get an IP address)”; “positive 1” means NLP 2005, pp.131-142, (2005). that this answer thread has one solution that was posi- Kurohashi and Kawahara: JUMAN Manual version 5.1 (in tively confirmed by its QR mail. Japanese), Kyoto University, (2005). The language resource consists of the mails posted Kurohashi and Kawahara: KNP Manual version 2.0 (in to Vine Users ML (50846 mails: 8782 Q mails, 13081 Japanese), Kyoto University, (2005).

96 The 6th Workshop on Asian Languae Resources, 2008

Corpus Building for Mongolian Language

Purev Jaimai Odbayar Chimeddorj Center for Research on Language Processing, Center for Research on Language Processing, National University of Mongolia, Mongolia National University of Mongolia, Mongolia [email protected] [email protected]

Abstract

This paper presents an ongoing research aimed to build the first corpus, 5 million words, for Mongolian language by focus- Figure 1. Current and future states of building a ing on annotating and tagging corpus texts Mongolian corpus. according to TEI XML (McQueen, 2004)

format. Also, a tool, MCBuilder, which And, we manually build the corpus until collect- provides support for flexibly and manually ing and annotating 1 million words and tagging annotating and manipulating the corpus 100 thousand words of them for semi- texts with XML structure, is presented. automatically building the corpus in the future. 1 Introduction 2 Corpus Building Design Mongolian researchers quite recently have begun We are building the corpus as sub-corpora, which to be involved in the research area of Natural Lan- are a raw corpus, a cleaned corpus, a tagged corpus guage Processing. All necessary linguistic re- and a parsed corpus, separately for various kinds of sources, which are required for Mongolian lan- studying and use on Mongolian language (Figure guage processing, have to be built from scratch, 2). and then they should be shared in public research for the rapid development of Mongolian language processing. This ongoing research aims to build a tagged and parsed 5 million words corpus for Mongolian by developing a spell-checker, tagger, sentence- parser and others (see Figure 1 and 2). Also, we needed to develop a tagset for the corpus because there was not any tagset for Mongolian and the traditional words categories are not appropriate to it. Thus, we designed a high level tagset, which consists of 20 tags, and are further classifying them. Currently, we have collected and populated 500 thousand words, 50 thousand of which have been Figure 2. Schema of building a Mongolian manually tagged, into the corpus (see Figure 1). corpus. At first, we are collecting the editorial articles of Unen newspaper (Unen publish), which is one of

97 The 6th Workshop on Asian Languae Resources, 2008

the best written newspapers in Mongolia, by using MCBuilder has three main windows that are (1) OCR application. We will also collect laws, school manipulating and organizing the corpus, (2) anno- book, and literary text (see Figure 3). tating sample texts and (3) manipulating tagset as shown in Figure 5. 3 Conclusion Mongolian language has hardly studied by com- puter, and its traditional rules such as inflectional, derivational, part of speech, sentence constituents, etc are extremely difficult to computerize. Our re- Figure 3. Text sizes included in the corpus. search works in the last few years showed it (Purev, 2006). Therefore, we are revising them by The corpus annotation follows TEI XML stan- creating a corpus for computer processing. dard. According to the work scope, the annotating The proposals of this ongoing research are the part is divided into two parts that are structural an- first Mongolian 5 million words corpus, and tools notation such as paragraphs, sentences, and so on, that are spell-checker, tagger and parser. and POS tagging. Currently, we have done followings: The structure of the text annotation is presented in Figure 4. • Defined the corpus design, XML structure of the corpus text, and the high level tagset • Collected and annotated 500 thousand words text • Tagged 50 thousand words • Released the first version of a Mongolian WORD corpus building tool called MCBuilder • First versions of Syllable-parser and Morph-analyzer for Mongolian Figure 4. XML Structure of corpus text. We are planning to complete the corpus in the For annotating two parts, once a manual corpus next two years. builder, called MCBuilder, were planned to de- velop, we have developed the first version and 4 Acknowledgement used to annotating 500 thousand word texts and Here described work was carried out by support of tagging 50 of them (see Figure 5). PAN Localization Project (PANL10n).

References PANL10n: PANLocalization Project. National Univer- sity of Computer and Emerging Sciences, Pakistan. Purev J. 2006. Corpus for Mongolian Language, Re- search Project, Mongolia. Purev J. and Odbayar Ch.. 2006. Towards Constructing the Corpus of Mongolian Language, Proceeding of ICEIC. Sperberg-McQueen, C. M. and Burnard, L.. 2004. Text Encoding Initiative. The XML version of the TEI Guidelines, Website. Figure 5. Screenshot of the corpus organizer and its main view. Unen press. 1984-1989. Editorial Articles. Mongolia

98 The 6th Workshop on Asian Languae Resources, 2008

Resources for Urdu Language Processing

Sarmad Hussain Center for Research in Urdu Language Processing National University of Computer and Emerging Sciences B Block, Faisal Town, Lahore, Pakistan sarmad.hussain@.edu.pk

and Urdu, creating a parallel corpus across these Abstract languages. In addition, the corpus also has 512,000 words of Spoken Urdu, from BBC Radio. Urdu is spoken by more than 100 million Moreover, the corpus also contains 1,640,000 speakers. This paper summarizes the cor- words of Urdu text. These Urdu corpus resources pus and lexical resources being developed are also annotated with a large morpho-syntactic for Urdu by the CRULP, in Pakistan. tag-set (Hardie 2003). Center for Research in Urdu Language Process-

1 Introduction ing (CRULP) at National University of Computer Urdu is the national language of Pakistan and one and Emerging Sciences in Pakistan has also been of the state languages of India and has more than developing corpora and associated tools for Urdu. 60 million first language speakers and more than A recent project collected a raw corpus of 19 mil- 100 million total speakers in more than 20 coun- lion words of Urdu text mostly from Jang News, tries (Gordon 2005) . Urdu is written in Nastalique reduced to 18 million words after cleaning. The writing style based on Perso-Arabic script. This corpus collection has been based on LC-STAR II 1 paper focuses on the Urdu resources being devel- guidelines . The domain-wise figures are given in oped, which can be used for research in computa- Table 1. Further details of the corpus and associ- tional linguistics. ated information are discussed by Ijaz et al. (2007).

2 Urdu Text Encoding Table 1: Distribution of Urdu Corpus Cleaned Corpus Urdu computing started early, in 1980s, creating Domains multiple encodings, as a standard encoding scheme Total Distinct Words Words was missing at that time. With the advent of Uni- C1. Sports/Games 1529066 15354 code in early 1990s, some online publications have C2. News 8425990 36009 switched to Unicode, but much of the publication C3. Finance 1123787 13349 still continues to follow the ad hoc encodings C4. Culture/Entertainment 3667688 34221 (Hussain et al. 2006). Two main on-line sources C5. Consumer Information 1929732 24722 of Urdu text in Unicode are Jang News C6. Personal communica- 1632353 23409 (www.Jang.net/Urdu) and BBC Urdu service tions (www.BBC.co.uk/Urdu) and are thus good sources Total 18308616 50365 of corpus. Encoding conversion may be required if data is acquired from other sources. Agreement between CRULP and Jang News al- lows internal use. However, due to distribution 3 Corpora restrictions in this agreement, the corpus has not been made publicly available. The distribution EMILLE Project, initiated by Lancaster Univer- rights are still being negotiated with Jang News. sity is one of the first initiatives to make Urdu cor- The tag set developed by Hardie (2003) is based pus available for research and development of lan- on morpho-syntactic analysis. A (much reduced) guage processing (McEnery et al. 2000). The pro- syntactic tag set has also been developed by ject has released 200,000 words of English text translated into Bengali, Gujarati, Hindi, Punjabi 1 See www.lc-star.org/docs/LC-STAR_D1.1_v1.3.doc

99 The 6th Workshop on Asian Languae Resources, 2008

CRULP (on the lines of PENN Treebank tagset), developed by Urdu Dictionary Board of available at its website www.CRULP.org. A cor- Government of Pakistan. See www.crulp.org/oud pus of 100,000 words manually tagged on this tag for details. set has also been developed based on text from CRULP has also developed a corpus based lexi- Jang online news service. This CRULP POS con of 50,000 words with frequency data and an- Tagged Jang News Corpus is available through the notation specifications defined by LC-STAR II center. project (at http://www.lc-star.org/docs/LC- Recently another corpus of about 40,000 words STAR_D1.1_v1.3.doc). Details of the lexicon an- annotated with Named Entity tags was also made notation scheme are given by Ijaz et al. (2007). available for Workshop on NER for South and There are also additional tools available through South East Asian Languages organized at IJCNLP CRULP, and documented at its website, including 2008. The annotated corpus was donated by normalization, collations, spell checking, POS tag- CRULP and IIIT Hyderabad and is available at ging and word segmentation applications. http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5. Tag set contains 12 tags. Details of these tags are 5 Conclusions discussed at the link http://ltrc.iiit.ac.in/ner-ssea- This paper lists some core linguistic resources of 08/index.cgi?topic=3. The CRULP portion of the Urdu, available through CRULP and other sources. data is also available at CRULP website, and is a However, the paper identifies licensing constraints, subset of the CRULP POS Tagged Jang News a challenge for open distribution, which needs to Corpus. be addressed. In earlier work at CRULP, a 230 spelling errors corpus has also been developed based on typo- References graphical errors in Newspapers and student term papers. See Naseem et al. (2007) for details. Gordon, Raymond G., Jr. (ed.). (2005). Ethnologue: A corpus of Urdu Names has also been devel- Languages of the World, Fifteenth edition. Dallas, oped by CRULP, based on the collective telephone Tex.: SIL International. Online ver- directories of Pakistan Telecommunications Cor- sion: http://www.ethnologue.com/. poration Limited (PTCL) from across all major Hardie, A. (2003). Developing a tag-set for automated cities of Pakistan. A name list has also been ex- part-of-speech tagging in Urdu. In Archer, D, Ray- tracted from the corpus for all person names, ad- son, P, Wilson, A, and McEnery, T (eds.) Proceed- dresses and cities of Pakistan. ings of the Corpus Linguistics 2003 conference. UCREL Technical Papers Volume 16. Department of 4 Lexica Linguistics, Lancaster University, UK. Lexica are as critical for development of language Ijaz, M. and Hussain, S. (2007). Corpus Based Urdu computing as corpora. One of the most Lexicon Development. In the Proceedings of Con- comprehensive lexica available for Urdu was ference on Language Technology ‘07, University of Peshawar, Peshawar, Pakistan. recently released by CRULP (available through CRULP website). The online version, called Naseem, T. and Hussain, S. (2007). Spelling Error Online Urdu Dictionary (OUD) contains 120,000 Trends in Urdu. In the Proceedings of Conference entries, with 80,000 words annotated with on Language Technology ‘07, University of Pesha- significant information. The data of OUD is XML war, Peshawar, Pakistan. tagged, as per the annotation schema discussed by McEnery, A., Baker, J., Gaizauskas, R. & Cunningham, Rahman (2005; pp. 15), which contains about 20 H. (2000). EMILLE: towards a corpus of South etymological, phonetic, morphological, syntactic, Asian languages, British Computing Society Machine semantic and other parameters of information Translation Specialist Group, London, UK. about a word. The dictionary also gives translation Rahman, S. (2005). Lexical Content and Design Case of 12000 words in English and work is under way Study. Presented at From Localization to Language to enable runtime user-defined queries on the Processing, Second Regional Training of PAN Local- available XML tags. The contents of this lexicon ization Project. Online presentation version: are based on the 21 volume Urdu Lughat http://panl10n.net/Presentations/Cambodia/Shafiq/Le xicalContent&Design.pdf.

100 The 6th Workshop on Asian Languae Resources, 2008

Balanced Corpus of Contemporary Written Japanese

Kikuo Maekawa Dept. Language Research, National Institute for Japanese Language 10-2, Midori-cho, Tachikawa-, Tokyo 190-8561 JAPAN [email protected]

guistic variations using the results of internet Abstract crawling, because the information about the genre of texts and/or the writers are usually missing. Construction of 100 million words bal- Moreover, the amount of retrieved texts can often anced corpus of contemporary written be too large to be classified by hand. Japanese is underway at the National Insti- There is also a problem of skewed distribution tute for Japanese Language. The unique of texts caused by copyright protection. Copyright- property of the corpus consists in that the protected materials, especially literary works, will majority of its sample texts are selected not usually show up in the Internet. randomly from well-defined statistical To solve these problems in Japanese corpus lin- populations covering wide range of written guistics, National Institute for Japanese Language texts. (NIJL, hereafter) has launched a corpus compila- tion project in the spring of 2006, aiming at public 1 Introduction release of Japan’s first 100 million words balanced A serious problem in corpus-based analysis of the corpus in the year of 2011. The corpus is named Japanese language is the lack of reliable balanced the Balanced Corpus of Contemporary Written corpus. Most of the corpus-based studies of Japanese, or BCCWJ. contemporary Japanese are based upon the PUBLICATION LIBRARY analyses of text archive of newspaper articles, (PRODUCTION) (CIRCULATION) archive of copyright-expired literary work (Aozora SUB-CORPUS SUB-CORPUS Bunko), or crawling of the Internet text. Books, Magazines, and Books Putting aside the problems of the copyright- Newspapers, 35 million 30 million words. expired literary works, which are definitely too old words. 2001-2005 1986-2005 to be the material for the study of the contempo- SPECIAL PURPOSE SUB-CORPUS rary Japanese, lack of balanced corpus imposes Whitepaper, Diet minute, Web text, Textbooks, etc. 35 two mutually related problems on linguistic studies. million words. 1975-2005 Most of newspaper articles are written by newspa- per writers who are very much aware of the estab- lished writing style (and orthography) of newspa- Figure 1. The three components of the BCCWJ. per articles. Accordingly, it is the genre of text where variations of all sorts are suppressed to the 2 Design of the BCCWJ minimum level. As shown in the figure 1, the BCCWJ consists of 3 On the other hand, the results of internet crawl- component sub-corpora, viz., ‘publication’, ’li- ing using search engines like Google are very brary’, and ‘special purpose’ sub-corpora. much likely to include texts covering wide range of texts. It is also expected that considerable amount 2.1 Publication sub-corpus of linguistic variations are to be observed. The upper left-hand sub-corpus of the figure 1 is It is, however, very difficult, if not impossible, called ‘publication’ sub-corpus. This is also called to conduct analyses of style difference and/or lin-

101 The 6th Workshop on Asian Languae Resources, 2008

‘production’ sub-corpus. As the name suggests, Secondly, the library sub-corpus covers the pe- this sub-corpus represents the production, as op- riod of time 1986-2005 (1986 was the year when posed to the reception aspect of contemporary the ISBN book classification system was adopted written Japanese. The sub-corpus consists of sam- by most of major publishing companies), while the ples extracted randomly from the statistical popula- period of time covered by the publication sub- tion covering the whole body of books, magazines, corpus is 2001-2005. and newspapers published during 2001-2005. The population was constructed using the 2.3 Special-purpose sub-corpus sources that are publicly available; J-BISC (Japan The last sub-corpus is called ‘special purpose’ or Biblio Disc) and Periodicals in Print in Japan ‘out-of-population’ sub-corpus. This is the aggre- were used as the sources for books and magazines gate of various special purpose mini corpora, and, respectively. The data for newspapers was avail- unlike the former two sub-corpora, some of the able from the association of newspaper companies mini corpora are not sampled using the technique (Nihon Shinbun Kyoukai). The total number of of random sampling (because the populations can characters involved in the population was esti- not be defined). mated and samples for the BCCWJ were drawn in The mini corpora currently include texts of gov- the way that each character in the population had ernmental white papers, Internet text (Yahoo! Ja- the same chance of being sampled. pan’s bulletin board Chiebukuro), minutes of the It is to be noted at this point that the composition national diet, school textbooks, and best-selling ratios among text genres (i.e. the ratio among sam- books of the past 30 years. Laws and academic ples of books, magazines, and newspapers) were papers will also be included. determined on the basis of publicly available data Most of these mini corpora contain about 5 mil- mentioned above. This makes crucial difference lion words, and will be utilized in the language from the designs of corpora like the Brown Corpus policy oriented activities of the NIJL, including the and the BNC, where the composition ratios of vari- revision of the National list of Chinese Characters ous genres were determined subjectively by spe- for Daily Usage (Jouyou kanji hyou). cialists of the English language without making reference to any objective data. 3 Funds and the current status The total size of the sub-corpus is supposed to be about 34.7 million words; and, 74.1, 16.1, and The compilation of the BCCWJ is supported by the 9.8% of the sub-corpus are to be devoted to the budget of the NIJL and the MEXT (ministry of samples of books, magazines, and newspapers re- education) Grant-in-Aid for Scientific Research spectively. Priority Area Program “Japanese Corpus” (2006- 2010) [1]. 2.2 Library sub-corpus As of September 2007, texts containing about 30 million words have been sampled and stored in the The second sub-corpus is called ‘library’ or ‘circu- NIJL server, and, the full-text query of the 10 mil- lation’ sub-corpus. The sampling population for lion words texts that are copyright cleared are pub- this sub-corpus was the whole books registered in licly available on the web [2]. at least 13 public libraries in the Tokyo Metropolis. Upon its completion, the BCCWJ will be the The population thus defined contains about world’s first balanced corpus that is designed and 335,000 books. According to our estimation, more compiled based upon rigid statistical sampling. than 48 billion characters are included in this popu- This will open up a new possibility in Japanese lation, which is nearly the same amount as the linguistics, and the design of language corpora in population of the book part of the publication sub- general. corpus. There are two important differences between the publication and library sub-corpora. Firstly, the References texts of the library sub-corpora represent those [1] http://www.tokuteicorpus.jp/ books that were accepted by a certain number of [2] http://www.kotonoha.gr.jp/demo/ readers, while the texts in the publication sub- corpora have no guarantee of the sort.

102 The 6th Workshop on Asian Languae Resources, 2008

A Basic framework to Build a Test Collection for the Vietnamese Text Catergorization

Viet Hoang-Anh, Thu Dinh-Thi-Phuong, Thang Huynh-Quyet Hanoi University of Technology, Vietnam [email protected], [email protected] organized as follows. Section 2 proposes our Abstract research on the requirements, models and techniques to build a Vietnamese test collection for The aim of this paper is to present a basic researches and experiments on Vietnamese text framework to build a test collection for a categorization. Section 3 presents our results with Vietnamese text categorization. The pre- BKTexts test collection. Lastly, the focus of our sented content includes our evaluations of ongoing research will be presented in section. some popular text categorization test col- lections, our researches on the require- 2 Model of building a test collection for ments, the proposed model and the tech- the Vietnamese text categorization niques to build the BKTexts - test collec- tion for a Vietnamese text categorization. Until now, there has not been a Vietnamese stan- The XML specification of both text and dard test collection for Vietnamese text categoriza- metadata of Vietnamese documents in the tion. Vietnamese documents used in previous stud- BKTexts also is presented. Our BKTexts ies of Vietnamese researchers are gathered by test collection is built with the XML speci- themselves and were not thoroughly checked. fication and currently has more than 17100 Moreover, all over the world, there have been a lot Vietnamese text documents collected from of test collections in many different languages, es- e-newspapers. pecially in English such as the Reuters-21578, the RCV1 and the 20NewsGroup1. Therefore, we in- 1 Introduction tend to build a Vietnamese standard test collection

Figure1. The system architecture to build a Vietnamese test collection for text categorization Natural Language Processing (NLP) for such for the Vietnamese text categorization. We defined popular languages as English, French, etc. has been a framework for building Vietnamese test collec- well studied with many achievements. In contrast, tions as follows. Basic requirements for a Viet- NLP for unpopular languages, such as Vietnamese, namese test collection text categorization has only been researched recently. It means that Our model to build a Vietnamese test collection expecting international scientists to care about our for text categorization is accomplished in four problems is not feasible in the near future. In this stages: collecting, auto coding, manual editing, and paper, we present our research results on that field, testing (Figure 1). especially on Vietnamese test collections for Vietnamese text categorization. This paper will be 1 Available from http://kdd.ics.uci.edu/

103 The 6th Workshop on Asian Languae Resources, 2008

From available resources, we gather Vietnam- and to objectively compare results with published ese documents for the test collection in accordance studies. with the scope and the structure of categories. Re- searchers usually use documents collected from e- newspapers because these documents are pre- processed and less ambiguous. Then an auto sys- tem tags documents in the XML (or SGML) for- matting specification. After being coded, documents are manual edited by editors. The editors would assign the categories they felt applicable. They also edit specification tags of formatted documents in order to completely and more precisely describe attributes of docu- ments. Lastly, to assess the accuracy of the test collection, we use some famous categorization al- gorithms such as SVM, k-NN, etc. Performing the test and correction several times, we will gradually obtain a finer and more precise test collection. The process ends when errors are below a permitted threshold. 3 The BKTexts test collection for Viet- namese text categorization With the model mentioned above, we are con- structing the BKTexts test collection for the first version. We collected about 17100 documents for the BKTexts from two e-newspapers http://www.vnexpress.net and http://www.vnn.vn. Fig.2. The XML specification of the BKTexts Categories are organized in a hierarchical structure of 10 main classifications and 37 sub-classes. Documents are marked up with XML tags and References given unique ID numbers. The XML specification David D. Lewis, Reuters-21578 Text Categorization of a document in the BKTexts test collection is Test Collection, www.daviddlewis.com, 1997. described in Figure 2. Building a successful Viet- namese test collection for text categorization has a David D. Lewis, Yiming Yang, Tony G.Rose, Fan Li, significant meaning. It will be a useful material for “RCV1: A new Benchmark Collection for Text Cate- gorization Research”, in: Journal of Machine Learn- any study on text categorization and Vietnamese ing Research 5, pp.361-397, 2004. processing in the future because it reduces a lot of manual work and time, as well as increases the ac- Huynh Quyet Thang, Dinh Thi Phuong Thu. Vietnam- curacy of experimental results. ese text categorization based on unsupervised learn- ing method and applying extended evaluating formu- 4 Conclusion and future work las for calculating the document similarity. Proceed- ings of The Second Vietnam National Symposium on We have presented our research results on defining ICT.RDA, Hanoi 24-25/9/2004, pp. 251-261 (in Viet- requirements, the model and techniques to build a namese) Vietnamese test collection for researches and ex- Dinh Thi Phưong Thu, Hoang Vinh Son, Huynh Quyet periments on Vietnamese text categorization. Cur- Thang. Proposed modifications of the CYK algo- rently, we continue building the BKTexts on a lar- rithm for the Vietnamese parsing. Journal of Com- ger scale for publishing widely in the near future. puter Science and Cybernetics, Volume 21, No. 4, This test collection enables researchers to test ideas 2005, pp. 323-336 (in Vietnamese)

104 The 6th Workshop on Asian Languae Resources, 2008

Enhanced Tools for Online Collaborative Language Resource Development

Virach Sornlertlamvanich Hitoshi Isahara Thatsanee Charoenporn National Institute of Information Suphanut Thayaboon and Communications Technology Chumpol Mokarat 3-5 Hikaridai, Seika-cho, soraku-gaun, Thai Computational Linguistics Lab. Kyoto, Japan 619-0289 NICT Asia Research Center, [email protected] Pathumthani, Thailand {virach, thatsanee, suphanut, chumpol}@tcllab.org

The rest of this paper is organized as follows: Abstract Section 2 describes the collaborative interface for revising the result of synset translation. Section 3 This paper reports our recent work of tool describes the tool for annotating Thai syntactic development for language resource con- dependency tree corpus. And, Section 4 concludes struction. To make a revision of Asian our work. WordNet which is automatically generated by using the existing English translation 2 Collaborative Tools for Asian WordNet dictionary, we propose an online collabora- Construction tive tool which can organize multiple trans- lations. To support the work of syntactic There are some efforts in developing Wordnets for dependency tree annotation, we develop an some of Asian languages, e.g. Chinese, Japanese, editing suite which integrates the utilities Korean, and Hindi. The number of languages that for word segmentation, POS tagging and have been successfully developed their Wordnets dependency tree into a sequence of editing. is still limited to some active research in the area. However, the extensive development of Wordnet 1 Introduction in other languages is of the efforts to support the NLP research and implementation. It is not only to Though WordNet was already used as a starting facilitate the implementation of NLP applications resource for developing many language WordNets, for the language, but also provide an inter-linkage the constructions of the WordNet for languages among the Wordnets for different languages to de- can be varied according to the availability of the velop multi-lingual applications. language resources. Some were developed from We adopt the proposed criteria for automatic scratch, and some were developed from the combi- synset assignment for Asian languages which has nation of various existing lexical resources. limited language resources. Based on the result This paper presents an online collaborative tool from the above synset assignment algorithm, we particularly to facilitate the construction of the introduce KUI (Knowledge Unifying Initiator), Asian WordNet which is automatically generated (Sornlertlamvanich et al., 2007) to establish an by using the existing resources having only Eng- online collaborative work in refining the WorNets. lish equivalents and the lexical synonyms. KUI is a community software tool which allows In addition, to support the work of syntactic de- registered members including language experts to pendency tree annotation, we develop an editing revise and vote for the synset assignment. The sys- suite which integrates the utilities for word seg- tem manages the synset assignment according to mentation, POS tagging and dependency tree. The the preferred score obtained from the revision tool is organized in 4 steps, namely, sentence se- process. As a result, the community-based Word- lection, word segmentation, POS tagging, and syn- Nets will be accomplished and exported into the tactic dependency tree annotation. original form of WordNet database. Via the synset

105 The 6th Workshop on Asian Languae Resources, 2008

ID assigned in the WordNet, the system can gener- ate a cross language WordNet. Through this effort, a translated version of Asian WordNet can be es- tablished. Table 1 shows a record of WordNet displayed for translation in KUI interface. English entry to- gether with its part-of-speech, synset, and gloss are provided if exists. The members will examine the assigned lexical entry and decide whether to vote for it or propose a new translation. Car [Options] POS : NOUN Synset : auto, automobile, machine, motorcar Figure 2. Syntactic Dependency Tree Annotation Gloss : a motor vehicle with four wheels; usually propelled by an internal combustion engine; 3.1 POS Annotation Table 1. A record for a synset The result from the automatic word segmentation and POS tagging program is generated with alter- native POSs for revision. A dropdown list of POSs is provided for annotator to correct the POS. Since word segmentation is processed together with POS tagging, the annotator is also provided a GUI to merge or to divide the proposed word unit.

3.2 Syntactic Dependency Tree Annotation The result from POS annotation in Section 3.1 is passed to define the syntactic dependency between Figure 1. KUI interface (www.tcllab.org/kui) words. The dependency is assigned to form a Figure 1 illustrates the translation page of KUI phrase and a sentence respectively. The final out- In the working area, the login member can partici- put will be marked in the XML manner and shown pate in proposing a new translation or voting for as a tree for confirmation. the preferred translation to revise the synset as- signment. Statistics of the progress as well as many useful functions such as item search, chat, and list of online participants are also provided to under- stand the progress of work and to work online with 4 Conclusion other members. Our current work on the web-based collaborative 3 Tool for Constructing a Syntactic De- tool for Asian WordNet construction and tool for pendency Tree Annotated Corpus Syntactic Dependency Tree Annotation are devel- oped as an open platform for online contribution. The tool is organized in 4 steps, namely, sentence A user-friendly interface and self-organizing utili- selection, word segmentation, POS tagging, and ties are intentionally prepared to support the online syntactic dependency tree annotation, shown in collaborative work. Figure 2. Sentence segmentation is yet another crucial issue for the Thai language. We, however, References Virach Sornlertlamvanich, Thatsanee Charoenporn, will not discuss about the issue in this work. The Kergit Robkop, and Hitoshi Isahara. 2007. Collabo- input is already a list of sentences provided for an- rative Platform for Multilingual Resource Develop- notator to select. ment and Intercultural Communication, IWIC2007, Springer, LNCS4568:91-102.

106 The 6th Workshop on Asian Languae Resources, 2008

Japanese Effort Toward Sharing Text and Speech Corpora

Shucihi Itahashi*+ Koiti Hasida+ *National Institute of Informatics +National Institute of Advanced 2-1-1 Hitotsubashi, Chiyoda-, Industrial Science and Technology Tokyo, Japan 101-8430 1-1-1 Umezono, Tsukuba, Ibaraki, [email protected] Japan 305-8568 [email protected]

for creating future value in information media, Abstract particularly speech media. NII is promoting this consortium together with GSK (Itahashi and Oh- This report introduces the activities of the suga, 2006). two organizations related to collection and distribution of text and speech cor- 2.1 SRC objective and activities pora in Japan. One is the Language Re- The SRC aims at the collection, distribution, in- source Association (GSK) and the other vestigation, research, and standardization of elec- is NII-Speech Resources Consortium tronic data and software tools that are necessary (NII-SRC). for the development of science, education, and industry concerning speech. The consortium will 1 Introduction contribute to the development of the information society through these activities. Although the need for shared speech and text The SRC investigates the existing speech re- data has long been acknowledged, its realization sources, catalogues them, and shows them on its has been slow to develop in Japan. The Language homepage in order to promote the research and Resource Association (GSK) was established in development of speech information processing. It 1999 in order to share and distribute text and also requests research institutions to offer their speech corpora. It was renovated as an NPO in speech resources to the SRC. Based on these ac- 2003 with an emphasis on text corpora. The Na- tivities, it urges the distribution, promotion, and tional Institute of Informatics (NII) has decided publicity of speech resources. The consortium to initiate the Speech Resources Consortium will conduct the additional production and distri- (SRC) in 2006, with the goal of creating future bution of the speech resources that are frequently value in information media, particularly speech requested. It also conducts investigations and re- media. search on speech resources. Users will be able to obtain the speech resources or data they need and 2 NII-Speech Resources Consortium use them through simple processes offered by the SRC. SRC distributes 23 speech corpora. The National Institute of Informatics (NII) was founded in Tokyo, Japan in April 2000 as an in- 2.2 Organization ter-university research institute organized to con- The SRC is made up of a chairperson, a re- duct comprehensive research on informatics and searcher, an adviser, a secretary, and a Speech to develop an advanced infrastructure for dis- Corpora Promotion Committee. The committee seminating scientific information. As a part of works to promote the development of the SRC. promoting the missions, NII decided in 2006 to Around 15 members have been invited to join the initiate the Speech Resources Consortium (SRC) committee from the fields of speech processing,

107 The 6th Workshop on Asian Languae Resources, 2008 linguistics, acoustics, speech and language corpus through the homepage. It mediates the user’s re- creation, and speech and language resource pro- quests to the producer or provider of the corpora. vision. The committee meets a few times a year. (3) Fee-based distribution: Making speech cor- pora usually requires some money, including roy- 3 Language Resource Association (GSK) alties. Some corpora cost users a certain amount of money to obtain, although they are not so ex- The GSK was renovated as an NPO in 2003 and pensive. is qualified as a corporate body and can mediate between the producer and users of a language 4 Present Issues corpora (Hasida and Tanaka, 2006). So far, we have established the GSK and NII- 3.1 GSK objective and activities SRC for the text and speech corpora, respectively, while the LDC and ELRA distributes both the The GSK aims at almost the same objectives as speech and text corpora. those of the NII-SRC mentioned in the preceding The GSK is supported by a project until the section, including both the text and speech end of this fiscal year, but it will loose its finan- corpora, but the text corpora are the main concern cial support at that point. The GSK has a rela- of GSK at present. tively small amount of corpora to be distributed, Prof. Y. Mikami of Nagaoka University of so it is still too early to stand on its own feet. The Technology proposed a project on the “Construc- NII-SRC activities will be supported by a project tion of Networks for Asian Linguistic Informa- for a few more years, but it can not gain profit tion Technology Resources” together with GSK. from distributing the corpora created by the re- This proposal was adopted as a three-year project searchers outside of NII. We will search for better starting from fiscal 2005. It is supported by the ways by which both the NII-SRC and GSK will Science and Technology Promotion & Coordina- be able to act as corpora agents like the LDC or tion Fund, and Yen15 million ($125,000) is ELRA. There are some more organizations re- available for GSK. The mission of the project is lated to language resources such as ATR, NICT, to create a network of qualified Asian partners to and NIJL. We need much more collaboration and specify and support the development of high pri- coordination among these. ority language resources for Asian languages. As a part of this project, the GSK organized the 5 Conclusion “Asian Language Resources Workshop 2007” in March. Twenty one people from 13 countries This report explained the activities of the GSK participated in the workshop. and the NII–Speech Resources Consortium (NII- The GSK activities are almost the same as SRC). GSK and NII-SRC will facilitate the dis- those of the NII-SRC. The GSK is made up of a tribution, promotion, and publicity of the lan- president, two vice presidents, 11 board members, guage resources, and in so doing, will contribute 25 steering committee members, a secretary, and to the information society in Japan and in Asia. two office clerks. GSK supplies two text corpora For further information, please refer to the fol- and a speech corpus; it plans to distribute seven lowing URLs. more text corpora soon. http://research.nii.ac.jp/src/eng/index.html http://www.gsk.or.jp/index_e.html 3.2 Corpora distribution system We have three systems of corpora distribution to References be conducted by the NII-SRC and GSK. (1) No- S. Itahashi, T. Ohsuga. 2006. Introduction of NII- fee distribution: As a rule, the cost of handling Speech resources Consortium, Proc. Oriental the corpora is to fall on the user, although the cor- COCOSDA-2006, Penang, Malaysia: 38-43. pus itself is free of charge. (2) Agency: The pro- ducers of the corpora entrust the SRC/GSK with K. Hasida, H. Tanaka. 2006. Nonprofit Organiza- receiving requests from the users. The SRC/GSK tion “Language Resource Association (GSK)”, advertises the corpora to speech researchers (in Japanese) Japanese Linguistics, 20:107-110.

108 Author Index

Bali, Kalika ...... 89 L, Sobha ...... 89 Bandyopadhyay, Sivaji ...... 1 Lee Nagano, Robin ...... 25 Baskaran, Sankaran ...... 89 Li, Wenjie ...... 17 Basu, Anupam ...... 57 Lu, Qin ...... 17 Bhattacharya, Tanmoy ...... 89 Bhattacharyya, Pushpak ...... 89 Maekawa, Kikuo ...... 101 Matsuda, Makiko ...... 25 Caminero, Rizza ...... 49 Mikami, Yoshiki ...... 25, 49 Charoenporn, Thatsanee ...... 105 Mitra, Pabitra ...... 9 Chimeddorj, Odbayar ...... 97 Mokarat, Chumpol ...... 105 Cui, Gaoying ...... 17 Murthy,Kavi Narayana ...... 41

Dasgupta, Tirthankar ...... 57 Nishimura, Ryo ...... 95 Dinh-Thi-Phuong, Thu ...... 103 Diwakar, Synny ...... 57 Okada, Yoshihiro ...... 95

Ekbal, Asif ...... 1 Phyo, Thin Zar ...... 33 Prasad, Rashmi ...... 73 G, Uma Maheswara Rao ...... 85 Goto, Hiroki ...... 25 R.J., Rama Sree ...... 85 Rau, D. Victoria ...... 81 Hasida, Koitiˆ ...... 107 Riza, Hammam ...... 93 Hayase, Yoshikazu ...... 25 Hoang-Anh, Vie ...... 103 S, Rajendran ...... 89 Htay, Hla Hla ...... 41 Saha, Sujan Kumar ...... 9 Husain, Samar ...... 73 Sarkar, Sudeshna ...... 9 Hussain, Sarmad ...... 99 Sharma, Dipti ...... 73 Huynh-Quyet, Thang ...... 103 Shukla, Sambit ...... 57 Sornlertlamvanich, Virach ...... 105 Isahara, Hitoshi ...... 105 Itahashi, Shuichi ...... 107 Tai, Yin-Sheng ...... 81 Takahashi, Tomoe ...... 25 Jaimai, Purev ...... 97 Thayaboon, Suphanut ...... 105 Jha, Girish Nath ...... 89 Joshi, Aravind ...... 73 Watanabe, Yasuhiko ...... 95 Webber, Bonnie ...... 65 K V, Subbarao ...... 89 K, Saravanan ...... 89 Yang, Meng-Chien ...... 81 K.V., Madhu Murthy ...... 85 Ko Ko, Wunna ...... 33 Zeyrek, Deniz ...... 65 Kumar, Sandeep ...... 57

109